SDK 2.1 question: placement signals for multi-device / multi-DFP setups

Hello all,

I’ve been reading through the SDK 2.1 docs and a few related threads, and I had a question about placement in multi-device or multi-DFP setups.

When compile-time capabilities and runtime deployment constraints don’t line up perfectly, what do you usually rely on to make placement decisions? Is that mainly busy factor and device/group selection, or do DFP residency/co-mapping and host-side bottlenecks factor in just as much?

I’m mostly trying to understand what MemryX considers the supported placement boundary under SDK 2.1.

Thanks.

Hi @Jerem6 , happy to explain some details here. Let’s first separate some scenarios, and apologies if this post is overly long :sweat_smile:

Multiple Models per DFP

In an approach sometimes called “co-compiling” or “co-mapping”, the NeuralCompiler can map multiple models into one DFP. With this method, the models run simultaneously, in parallel on the chip (as opposed to some other methods we’ll discuss next).

The advantages of mapping multiple models per DFP are performance and simplicity: since there is only one DFP, no “context switching” between DFPs has to be done, and the same DFP can just run continuously.

However, the disadvantage of co-compiling is MX3 resource capacity: the combination of models has to fit within the MX3’s memory and compute-core resources. So, for example, two 25M-parameter models won’t be able to co-map into one DFP, because the (4-chip) MX3 M.2 has a maximum of 42M parameters. In this case, you will need to use multiple DFPs.
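
To make that arithmetic concrete, here’s a tiny helper that checks candidate models against the 42M-parameter budget mentioned above. This is purely illustrative, not an SDK API, and it only checks the weight budget; real co-mapping also depends on compute-core allocation, so passing this check doesn’t guarantee the compile succeeds:

```python
# Illustrative capacity check for co-mapping candidates (not an SDK API).
# 42M parameters is the 4-chip MX3 M.2 limit from the post above.
MX3_M2_PARAM_LIMIT = 42_000_000

def fits_in_one_dfp(param_counts):
    """Return True if the combined models could fit one DFP by parameter count.

    Note: this only checks weight memory; compute-core resources must also
    fit, so a True here is necessary but not sufficient for co-mapping.
    """
    return sum(param_counts) <= MX3_M2_PARAM_LIMIT

print(fits_in_one_dfp([25_000_000, 25_000_000]))  # two 25M models -> False
print(fits_in_one_dfp([25_000_000, 10_000_000]))  # 35M total -> True
```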


Multiple DFPs

To handle multiple DFPs, and/or multiple processes trying to access the same hardware, we introduced mxa-manager, which sits between the user application and the MX3s. Applications are now “clients” that send/recv feature maps to/from the mxa-manager “server”, using a socket file for communication.

To use multiple DFPs with mxa-manager, either create multiple MxAccl objects in a single program, or run separate programs each with its own DFP. mxa-manager will use the SchedulerOptions and ClientOptions from clients to decide when to run each DFP.

SchedulerOptions control how long each DFP runs. The current implementation of mxa-manager can be considered a form of “cooperative multitasking”: the running DFP has control of the MX3 until it voluntarily yields back to mxa-manager’s scheduler.

The SchedulerOptions that control this behavior are:

  • frame_limit: counts the frames processed; once this DFP (all models within it) has processed at least this many frames, it yields back to the scheduler.
  • time_limit: if this many milliseconds pass without new inputs, yield back to the scheduler. This can be used in combination with frame_limit; the timeout overrides the frame counter if it is hit first.
  • ifmap_queue_size: incoming frames for this DFP, shared among all client apps, are put into a queue. When the DFP is not currently running, mxa-manager still accepts inputs from clients and puts them into this queue; when the DFP runs again, the queue is drained. So it’s always a good idea to have frame_limit >= ifmap_queue_size.
  • ofmap_queue_size: clients may be slow to post-process outputs, yet we want the DFP to run through its ifmap queue without waiting, so the next DFP can be scheduled. Thus, MX3 outputs are dumped into the ofmap queue first, and clients pull from it as fast as they can. Rule of thumb: this should also be >= ifmap_queue_size.
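
To build intuition for how frame_limit and ifmap_queue_size interact, here’s a toy model of one scheduling slice. This is not mxa-manager code, just a sketch of the queue-draining behavior described above:

```python
from collections import deque

# Toy model of one cooperative scheduling slice (illustrative only,
# not actual mxa-manager internals).
def run_slice(ifmap_queue, frame_limit):
    """Drain up to frame_limit frames during one slice, then yield
    back to the scheduler (cooperative multitasking)."""
    processed = 0
    while ifmap_queue and processed < frame_limit:
        ifmap_queue.popleft()
        processed += 1
    return processed

# Suppose ifmap_queue_size=8 frames piled up while another DFP ran.
queue = deque(range(8))
# With frame_limit < ifmap_queue_size, one slice can't drain the backlog,
# which is why the post recommends frame_limit >= ifmap_queue_size:
print(run_slice(queue, frame_limit=4), len(queue))  # 4 processed, 4 left over
```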

There are also ClientOptions, which currently cover only “frame smoothing”. This option sets a minimum time between outputs for the client, which increases average latency but delivers a smooth, constant FPS to the client. Client apps could implement this themselves, but the option is there for convenience.
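
A minimal sketch of what that pacing looks like on the client side (this is an illustration of the idea, not the actual ClientOptions implementation):

```python
import time

def smoothed_outputs(outputs, min_interval_s):
    """Yield outputs no faster than one per min_interval_s seconds,
    trading a little average latency for a steady output rate --
    the same trade-off the "frame smoothing" option makes."""
    last = 0.0
    for out in outputs:
        wait = min_interval_s - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)  # hold the result until the interval has passed
        last = time.monotonic()
        yield out

# Even if the MX3 returns a burst of results at once, the consumer
# sees them paced out at roughly 30 FPS:
for frame in smoothed_outputs(range(3), min_interval_s=1 / 30):
    pass
```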

mxa-manager schedules DFPs to run by creating a queue of “tasks”: each DFP is a task, and each MX3 device has a queue of tasks to process. A task pops from the front of the queue, runs, then yields (based on SchedulerOptions) and is pushed to the back of the queue again.
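
That rotation can be sketched in a few lines (illustrative only; the real scheduler lives inside mxa-manager, and the task names here are made up):

```python
from collections import deque

# Sketch of a single device's task rotation: pop the front task,
# let it run until it yields, then push it to the back.
device_queue = deque(["DFP_A", "DFP_B", "DFP_C"])

order = []
for _ in range(6):                  # six scheduling slices
    task = device_queue.popleft()   # front of the queue runs next
    order.append(task)              # ...task runs until it yields...
    device_queue.append(task)       # then goes to the back again

print(order)  # ['DFP_A', 'DFP_B', 'DFP_C', 'DFP_A', 'DFP_B', 'DFP_C']
```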


Multi-Device

To use multiple MX3 devices, just pass a list of device numbers you want to use to the device_ids parameter in the Python/C++ API constructors.

Under the hood, this has different effects in Shared (mxa-manager) and Local (direct hardware access) modes:

In Local mode, which only supports a single DFP, the DFP is downloaded to each MX3 device, and frames are sent to each in round-robin order.
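
A toy illustration of that round-robin ordering (the dispatch actually happens inside the runtime, not in user code; the device IDs are just examples):

```python
from itertools import cycle

# Local mode with two devices: frames alternate between them in
# round-robin order. This only reproduces the ordering for intuition.
device_ids = [0, 1]
dispatch = cycle(device_ids)

assignments = [(frame, next(dispatch)) for frame in range(4)]
print(assignments)  # [(0, 0), (1, 1), (2, 0), (3, 1)]
```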

In Shared mode, mxa-manager creates duplicate “tasks” for the scheduler, each restricted to one device. For example, DFP A with device_ids=0,1,2,3 will create 4 tasks for the scheduler, with each task able to run on only one of devices 0, 1, 2, and 3 respectively.

Note: Only the device_ids option from the first client app to download the DFP will be used. If another app requests the same DFP with a different device_ids setting, that setting is ignored and the original set of devices continues to be used.


Multi-DFP + Multi-Device

You can also use multiple DFPs, each with multiple devices given to device_ids.

For example, DFP A could be set to run on devices 0 and 1, while DFP B is set to run on device 2.

You can then submit DFP C to run on devices 1 and 2. Now device 0’s task queue will always be DFP A, while device 1 will alternate between A and C, and device 2 will alternate between B and C.
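
Working through that example programmatically (illustrative only), each (DFP, device) pair becomes one scheduler task, and the per-device task lists come out as:

```python
# The device_ids assignments from the example above.
dfp_devices = {
    "A": [0, 1],
    "B": [2],
    "C": [1, 2],
}

# In Shared mode, each (DFP, device) pair becomes one scheduler task;
# group them by device to see each device's task queue.
device_tasks = {}
for dfp, devices in dfp_devices.items():
    for dev in devices:
        device_tasks.setdefault(dev, []).append(dfp)

print(device_tasks)  # {0: ['A'], 1: ['A', 'C'], 2: ['B', 'C']}
```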


Now for the real topic: what are some best practices for deciding how to distribute the workload?

1. If your models can co-map into a single DFP, this is usually the best option.

DFP swapping incurs overheads that can hurt FPS and latency. In addition, choosing the best SchedulerOptions for a set of DFPs can involve a lot of trial and error.

2. If you want to prioritize one model over another, DFP swapping may make more sense

For example, let’s say a YOLO model gets 400 FPS when it is compiled alone, but when compiled together with a UNet, the YOLO’s FPS drops to 100. This may happen because the NeuralCompiler statically allocates compute and memory resources. But if the UNet doesn’t run often in your application, while the YOLO is very important to keep at max FPS, you may consider using DFP swapping. Set a small time_limit value for the UNet model, and a large frame_limit for the YOLO model: this way you’ll mostly run YOLO, and in the common case that there’s nothing for the UNet to do, the timeout will hit and you’ll swap right back to YOLO.
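
As a concrete sketch of that tuning: the field names below mirror the SchedulerOptions described earlier, but the values are illustrative starting points, not tuned recommendations, and the exact option spelling should be taken from your SDK version’s docs:

```python
# Hypothetical per-DFP scheduler settings for the YOLO-vs-UNet example.
yolo_options = {
    "frame_limit": 120,  # large: keep YOLO running for long slices
    "time_limit": 50,    # ms without input before YOLO yields
}
unet_options = {
    "frame_limit": 4,    # small: UNet yields quickly when it does run
    "time_limit": 5,     # ms: if no UNet work arrives, swap straight back
}
```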

3. If you have multiple devices + DFPs, consider restricting their placement

For example, if you have 2 devices + 2 DFPs, and they’re both running at maximum FPS, it would be best to statically assign DFP A to device 0 and DFP B to device 1, thus avoiding any context switching. Assigning A+B to devices 0+1 instead would be slower.

4. If you have a lot of client apps for a single DFP, just let mxa-manager handle distribution

For a reasonable* number of MX3 devices, it’s usually best to assign all devices to the DFP and let the runtime handle load distribution.

*If you start having 8+ devices in a single system, you should consider grouping them and statically assigning different clients/streams to each group.


More General Performance Tips

1. Check your full application pipeline for bottlenecks before adding more devices

For example, we find that decoding h264/hevc streams can often bottleneck the host CPU before a single MX3 device is maxed out.

A good way to tell if the bottleneck lies in the application or in the MX3 is to compare your end-to-end FPS with your acclBench/mx_bench FPS. The DFP benchmark tools give you an upper limit on the FPS, so if your application is already falling short of this number, then additional MX3 devices won’t help.

Once your end-to-end FPS is close to the DFP-only benchmark FPS, then you can consider adding more MX3 devices.
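
That comparison can be captured in a one-line heuristic. The 90% margin below is an illustrative threshold, not an official figure:

```python
# Rule-of-thumb check from the tip above: if the application is well
# below the DFP-only benchmark FPS, the host pipeline (decode, pre/post-
# processing) is the bottleneck and more MX3 devices won't help.
def more_devices_likely_to_help(app_fps, bench_fps, margin=0.9):
    """Heuristic: only consider adding devices once end-to-end FPS is
    within `margin` of the acclBench/mx_bench number."""
    return app_fps >= margin * bench_fps

print(more_devices_likely_to_help(app_fps=120, bench_fps=400))  # False: host-bound
print(more_devices_likely_to_help(app_fps=380, bench_fps=400))  # True
```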

2. If you only have 1 DFP and 1 application, use Local mode

Local mode removes the overhead of sending/receiving data through mxa-manager. On powerful host systems, there’s usually no FPS difference between Shared and Local, but on weaker systems like ARM boards there can be a noticeable benefit to Local.

3. Consider your model architecture

Here’s where referring to the Model Explorer can be useful. For example, YOLOv8 architectures have substantially higher FPS than YOLO11 on MX3, with only minor accuracy differences (v8m mAP=50.2, v11m mAP=51.5).

Be sure to check Model Explorer with each new SDK though, because for example YOLO11 & YOLO26 have significant FPS boosts in SDK 2.2.


I hope this has been a helpful read. Let me know if you have any follow-up questions or would like to know more about something.

Thanks!

Thanks, this is very helpful — honestly this was one of those cases where the long answer was the useful answer :face_exhaling::sweat_smile:

The way you broke it down made the boundary much clearer to me. Especially the distinction between co-mapping first, then mxa-manager/scheduler behavior when multiple DFPs are involved.

One quick follow-up: in Shared mode, is busy factor mainly something to watch from the application side, while actual placement/switching is still driven mostly by mxa-manager scheduling, SchedulerOptions, and device_ids?

Just want to make sure I’m reading that boundary correctly.