Planning a governed multi-node edge setup around MX3
I’m using a bit of hardware downtime to document the direction I’m heading next.
The goal is to move beyond “run inference and hope” toward a governed edge runtime, where execution is explicitly controlled, outcomes are verifiable, and enough traceability is preserved to reason about mismatches or failure states after the fact.
This is still early and not production-ready. The immediate focus is defining the next proof boundary without overcommitting to the wrong topology.
The hardware direction I’m exploring is a small multi-node edge fabric:
MX3 as the primary accelerator path
eventual scaling to multiple MX3 modules for device-aware scheduling and failover testing
dual-port QSFP+/40GbE links between nodes for higher-throughput data movement
a three-node layout where execution, verification, and state can be separated
What I’m trying to better understand is how others are approaching MX3 beyond single-host / single-device configurations.
Specifically interested in practical experience around:
running multiple MX3 modules within a single system vs distributing across nodes
PCIe layout and lane allocation strategies
cooling and enclosure considerations at higher densities
host-to-host movement of frames or inference artifacts
maintaining observability and reproducibility in accelerator-backed workflows
I’m intentionally avoiding overbuilding until I have a clearer understanding of the constraints. The goal is to make the next hardware step reinforce a clean execution boundary, not just scale capacity.
If anyone has hands-on experience with multi-MX3 or multi-node edge setups, I’d appreciate any lessons learned.
I think how you distribute the work will depend on how the workload is split between the MX3 (inference) and the CPU (video decode, pre/post-processing, etc.) in your application.
For example, if you’re working with H.264/HEVC streams from cameras, frame decoding can be quite taxing on the CPU. In this case, your options would be either:
Stick to a few compute nodes with many streams each, and use a bigger CPU or a hardware decoder (GPU?) so that the connected MX3 modules are not underutilized (in a multi-MX3 setup).
Use many smaller nodes with 1 MX3 each, and distribute streams across them. For example, inputs first arrive at an “orchestrator node” that decides which worker node to assign the task to.
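To make the second option concrete, here is a minimal sketch of an orchestrator that assigns each incoming stream to the worker with the most free capacity. Everything in it is a hypothetical illustration: the `Worker` and `Orchestrator` classes, the node names, and the `max_streams` limits are assumptions, and a real deployment would set the limits from measured decode and inference capacity.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """One worker node with a single accelerator (hypothetical model)."""
    name: str
    max_streams: int  # assumed per-node capacity; set this from your own benchmarks
    streams: list = field(default_factory=list)

    @property
    def free_slots(self) -> int:
        return self.max_streams - len(self.streams)


class Orchestrator:
    """Assigns each incoming stream to the worker with the most free capacity."""

    def __init__(self, workers):
        self.workers = list(workers)

    def assign(self, stream_id: str) -> str:
        candidates = [w for w in self.workers if w.free_slots > 0]
        if not candidates:
            raise RuntimeError("no worker has free capacity")
        # max() returns the first maximal element, so ties go to the earlier worker.
        target = max(candidates, key=lambda w: w.free_slots)
        target.streams.append(stream_id)
        return target.name


# Example: two single-accelerator nodes with different assumed capacities.
workers = [Worker("node-a", 2), Worker("node-b", 3)]
orch = Orchestrator(workers)
placement = {s: orch.assign(s) for s in ["cam0", "cam1", "cam2", "cam3", "cam4"]}
```

The policy here is deliberately simple. Swapping the `key` function is enough to try other policies, such as weighting by measured latency or pinning certain streams to certain nodes.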
If you can share a few details on the end use case, we’d be happy to offer more specific recommendations. Key considerations are:
The type of original input data, e.g., real-time cameras vs. offline/batch data
Whether you’re using one DFP on all nodes, or if there’s a mixture
If there are multiple DFPs, how is it decided where inputs need to go?
Approximate magnitude of performance needed for the system, e.g. hundreds of FPS vs. thousands.
DFP-only benchmarks can be used here to estimate the number of MX3s needed.
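As a rough illustration of that estimate, the arithmetic is a ceiling division of the target rate by the usable per-device rate. The function name, the `utilization` derating factor, and the example numbers below are all assumptions for illustration; they are not measured MX3 figures, and the per-device rate should come from your own DFP-only benchmark.

```python
import math


def estimate_device_count(target_fps: float, fps_per_device: float,
                          utilization: float = 0.7) -> int:
    """Estimate how many accelerators are needed to sustain target_fps.

    fps_per_device: throughput from a DFP-only benchmark (no decode, no pre/post).
    utilization:    assumed fraction of that benchmark rate the full pipeline
                    actually sustains once host-side work is included.
    """
    if fps_per_device <= 0:
        raise ValueError("fps_per_device must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return math.ceil(target_fps / (fps_per_device * utilization))


# Placeholder numbers only: 2000 FPS target, 450 FPS per device in isolation.
devices_needed = estimate_device_count(2000, 450, utilization=0.7)
```

The derating factor matters most when the host is the bottleneck: if decode or preprocessing caps the pipeline well below the DFP-only rate, adding modules will not help until the host side is fixed.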
Thanks, this is helpful, and the framing makes sense.
My current prototype is closer to the second model: a small orchestrator node assigning governed work to accelerator-backed worker lanes.
Right now the workload is not raw camera ingest. I’m focused on proving the execution boundary first: permit-issued dispatch, MX3 execution, a verification/falsification lane, and capturing structured runtime events with sealed receipts. I’m validating that path before adding more workload complexity.
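One way to picture sealed receipts, as a sketch and not a description of the actual prototype: each runtime event is serialized canonically, then sealed with an HMAC that also covers the previous receipt's seal, so edits or removals in the middle of the chain become detectable. The event fields, lane names, and the hard-coded key are hypothetical, and the accelerator output is a stand-in byte string.

```python
import hashlib
import hmac
import json

SECRET = b"demo-key-not-for-production"  # hypothetical key; load from secure storage in practice


def seal(event: dict, prev_seal: str, key: bytes = SECRET) -> dict:
    """Seal one event, chaining it to the previous receipt's seal."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    mac = hmac.new(key, (prev_seal + payload).encode(), hashlib.sha256).hexdigest()
    return {"event": event, "prev": prev_seal, "seal": mac}


def build_chain(events, key: bytes = SECRET) -> list:
    """Seal a sequence of events in order, starting from an empty previous seal."""
    chain, prev = [], ""
    for event in events:
        receipt = seal(event, prev, key)
        chain.append(receipt)
        prev = receipt["seal"]
    return chain


def verify_chain(receipts, key: bytes = SECRET) -> bool:
    """Recompute every seal and check that each receipt links to the one before it."""
    prev = ""
    for r in receipts:
        expected = seal(r["event"], prev, key)["seal"]
        if r["prev"] != prev or not hmac.compare_digest(expected, r["seal"]):
            return False
        prev = r["seal"]
    return True


# Hypothetical event stream for one governed job.
fake_output = b"stand-in for accelerator output"
chain = build_chain([
    {"step": "permit_issued", "job": "job-1", "lane": "mx3-0"},
    {"step": "dispatched", "job": "job-1", "lane": "mx3-0"},
    {"step": "executed", "job": "job-1",
     "output_sha256": hashlib.sha256(fake_output).hexdigest()},
    {"step": "verified", "job": "job-1", "match": True},
])
chain_ok = verify_chain(chain)
```

A shared-secret HMAC only proves integrity to parties holding the key. If the verification node must not be able to forge receipts, an asymmetric signature would be the more appropriate choice.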
Once that is stable, the next input class will likely be framed video or artifact streams, where decode placement and host constraints will start to matter.
For now, I’m assuming a single DFP across the MX3 lane while I prove routing behavior, failure handling, and observability. Multi-DFP support can come later once the control surface is stable.
The next things I’m trying to measure are:
MX3 throughput and latency under a fixed DFP
host-side preprocessing cost
host-to-host artifact movement
how many lanes can be kept saturated without losing reproducibility or traceability
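For the throughput and latency measurements, a minimal harness could look like the sketch below. It times a caller-supplied function per item and reports throughput plus median and 99th-percentile latency. `fake_inference` is a stand-in, not an MX3 API call; the real accelerator call under a fixed DFP would replace it, and the same harness can wrap a preprocessing function to measure host-side cost separately.

```python
import statistics
import time


def measure(run_once, inputs):
    """Time run_once over inputs; return throughput and latency percentiles."""
    inputs = list(inputs)
    if not inputs:
        raise ValueError("inputs must not be empty")
    latencies = []
    start = time.perf_counter()
    for item in inputs:
        t0 = time.perf_counter()
        run_once(item)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    # Nearest-rank style index for the 99th percentile, clamped to the last sample.
    p99_index = min(len(latencies) - 1, int(0.99 * len(latencies)))
    return {
        "count": len(latencies),
        "throughput_per_s": len(latencies) / elapsed,
        "p50_ms": statistics.median(latencies) * 1000,
        "p99_ms": latencies[p99_index] * 1000,
    }


def fake_inference(frame):
    # Stand-in workload; replace with the real accelerator call.
    return sum(frame)


report = measure(fake_inference, [[1, 2, 3]] * 200)
```

This version is synchronous, so it measures per-item latency with one item in flight. Saturating a lane usually needs several items in flight at once, in which case throughput should be measured over the whole batch and latency tracked per item by ID.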
Your orchestrator/worker framing lines up closely with the topology I’m working toward.