MX3 benchmark performance CPU consumption

Hi folks - I’ve been benchmarking MX3 on an i.MX 8MP processor with a 1-lane Gen 3 PCIe link (4-chip MX3). Here are the results. What’s surprising is how much CPU is consumed while benchmarking YOLOv8 Nano object detection with random inputs.

Profiling what in acclBench consumes the most CPU, it appears to be mostly data transfers and PCIe ioctls for reading and writing tensors. This consumes about 70% of each of the four A53 cores (~280% total). Is this expected? Why is there such a bottleneck?

yucca-317[xx yy]:~# acclBench -d YOLO_v8_nano_640_640_3_onnx.dfp -f 10000
*************************************************
*      Evaluate dfp performance using MX3       *
*************************************************

Number of chips the dfp is compiled for = 4

     Model    Stream            FPS
-----------------------------------
         0         0           61.5

    Device       Temp (C)
-------------------------
         0           56.4

Average FPS per stream : 61.5
Average FPS for DFP    : 61.5


 Bench for 1 Model(s) Done

Can you help us identify the bottlenecks and suggest ways to overcome them? It could just be the benchmarker, but it’s still surprising. I used the object detection model from your examples as-is.

Vai

Hi vaigen,

Regarding i.MX8M Usage

What you’re seeing is indeed reasonable on the i.MX8MP platform. The MX3 itself is not the bottleneck; the limiting factors are the modest A53 host cores, along with RAM and PCIe Gen3 x1 bandwidth constraints. In more detail:

  • Moving large feature maps between system DRAM and the PCIe device is expensive on the i.MX8MP. Unlike on stronger hosts (e.g., x86 or ARM Cortex-A76), the transfers don’t scale efficiently and end up consuming a significant share of the A53 cores’ cycles.

  • In addition to the data movement itself, the path relies on many small ioctls (feature-map reads and writes). Each incurs context-switch overhead, which adds up quickly on modest A53 cores.

  • PCIe Gen3 x1 signals at 8 GT/s, which after 128b/130b encoding works out to roughly 985 MB/s per direction, but in practice you never see the full number. Protocol overhead and software layers all chip away at the usable bandwidth, so the practical throughput on i.MX8MP ends up quite a bit lower.
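To make the link-budget point concrete, here is a back-of-envelope calculation. The tensor size assumes a 640x640x3 float32 input; the actual on-wire format acclBench uses may well be smaller (e.g., uint8), so treat this as an upper-bound sketch:

```python
# Back-of-envelope PCIe Gen3 x1 bandwidth vs. per-frame transfer volume.
# Assumption: float32 input tensors (uint8 would be 4x smaller).

GT_PER_S = 8e9        # Gen3 raw signalling rate, transfers/s per lane
ENCODING = 128 / 130  # 128b/130b line encoding overhead
LANES = 1

raw_bytes_per_s = GT_PER_S * ENCODING * LANES / 8
print(f"Theoretical Gen3 x1: {raw_bytes_per_s / 1e6:.0f} MB/s")  # ~985 MB/s

frame_bytes = 640 * 640 * 3 * 4   # 640x640x3 at float32
fps = 61.5                        # measured benchmark FPS
print(f"Per input frame: {frame_bytes / 1e6:.1f} MB")
print(f"Input stream at {fps} FPS: {frame_bytes * fps / 1e6:.0f} MB/s")
```

At ~4.9 MB per float32 frame, the input stream alone would occupy roughly a third of the theoretical link, before counting output feature maps, protocol overhead, or the ioctl path.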

Next steps to try:

  • Run a reduced-resolution model (e.g., 640 → 480 or 320) to lower the transfer and processing overhead.

  • Cap the frame rate at 30 or 60 FPS using the --max_fps parameter in acclBench. This better reflects real-time deployment targets and helps contain CPU load.
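On the first suggestion, the savings from dropping the input resolution are easy to quantify, since transfer volume scales with the square of the side length (a simple sketch, not a measurement):

```python
# Relative input-tensor size when reducing the model's input resolution.
# Data volume scales with the square of the side length.
base = 640
for side in (480, 320):
    ratio = (side / base) ** 2
    print(f"{side}x{side}: {ratio:.0%} of the 640x640 transfer volume")
```

So a 480-input model moves a bit over half the data per frame, and a 320-input model a quarter, which directly reduces both PCIe traffic and the per-transfer CPU work.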

In terms of ARM platforms, we’ve observed better performance with boards using RK3588 CPUs (e.g., OrangePi, Rock 5B) and Raspberry Pi 5 compared to i.MX8M SOMs.

I haven’t looked at your benchmark, but for my application, decoding an MP4, resizing the image, converting to RGB, converting to float, and then submitting to the MX3 is a significant overhead.

I think an optimised scale / RGB-convert / float-convert combined into one function could provide a lot of benefit.
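To illustrate the fusion idea, here is a minimal NumPy sketch that resizes first (so the later conversions only touch the small output), then flips channels and scales in one chained expression. This is not MemryX’s API; a production version would use an optimised library (OpenCV, NEON intrinsics, or the SoC’s GPU/ISP) rather than NumPy indexing:

```python
import numpy as np

def preprocess_fused(frame_bgr: np.ndarray, size: int = 640) -> np.ndarray:
    """Nearest-neighbour resize + BGR->RGB + float32 scale to [0, 1].

    `frame_bgr` is an HxWx3 uint8 array (e.g., a decoded video frame).
    Resizing first means the float conversion and channel flip only
    touch size x size pixels instead of the full-resolution frame.
    """
    h, w, _ = frame_bgr.shape
    # Index maps for a nearest-neighbour resize to size x size
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    # Fancy-index the resize, flip channels (BGR->RGB), then scale
    return frame_bgr[rows[:, None], cols, ::-1].astype(np.float32) / 255.0

# Example: fake 720p camera frame
frame = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
out = preprocess_fused(frame)
print(out.shape, out.dtype)  # (640, 640, 3) float32
```

The ordering matters more than the exact implementation: doing resize last would force the colour and float conversions to process every full-resolution pixel.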