It looks like this model is about 95% supported, but there's a transformer block at the end with a graph pattern not yet detected by the Compiler. We're continuing to add support over time for more models like these vision transformers and YOLOv11, and RTMPose can likely be added in a future SDK release.
In the meantime, that end2end.onnx file in the download you linked above can be compiled by cropping at the point right before the transformer block. This would be the command:
mx_nc -vv -m end2end.onnx --outputs "810"
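For reference, if you need to find a crop tensor name like "810" for a different model, one option is to dump the graph's node outputs and look for the tensor feeding the first unsupported block. This is just a rough sketch using the standard onnx Python package; Netron is an easier visual alternative:

import onnx

# Walk the ONNX graph and print each node's op type and output tensor names;
# the crop point is the output tensor right before the first transformer block
model = onnx.load('end2end.onnx')
for node in model.graph.node:
    print(node.op_type, node.output)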
Be sure to use the set_postprocessing_model function in your application, passing it the cropped post.onnx file that the compiler outputs.
In this case, what would be the correct way to implement the inference pipeline using the cropped model (e.g., with --outputs "810")?
Since the model is cut before the transformer, I assume the initial CNN takes a fixed-size image as input, and then the output tensor (e.g., 810) must be passed to some post-processing logic.
Should I call set_postprocessing_model() with the compiled cropped model and handle the rest manually?
If possible, could you share an example or general reference for implementing such a pipeline (e.g., how to use DFP and post-processing functions together)?
Once the model is cropped, you'll get a .dfp plus a post.onnx model. To use the post model, call set_postprocessing_model() when initializing the accelerator object; the runtime will internally create an onnxruntime session to run the post model on the CPU.
The output returned to the user will then be the final model output, not the 810 tensor.
from memryx import AsyncAccl

# Path to the DFP produced by the compiler from the cropped model
dfp = 'yolov8m-pose.dfp'

# Opens the chip and downloads the DFP, but doesn't start it yet
accl = AsyncAccl(dfp)
# Adds the cropped post model to the pipeline (runs on the CPU via onnxruntime)
accl.set_postprocessing_model('yolov8m-pose_post.onnx', model_idx=0)
# Connect the input and output callback functions, which provide new
# input frames and receive final output frames, respectively
accl.connect_input(my_input_callback_function)
accl.connect_output(my_output_callback_function)
# Block until the input callback signals the end of the stream
accl.wait()
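For completeness, here's a rough sketch of what the two callbacks referenced above might look like. The names, the 640x640 input size, the normalization, and the OpenCV capture are just assumptions for illustration; match the preprocessing to your model's actual input:

import cv2
import numpy as np

cap = cv2.VideoCapture(0)

def my_input_callback_function():
    # Return the next preprocessed frame, or None to end the stream
    ok, frame = cap.read()
    if not ok:
        return None
    frame = cv2.resize(frame, (640, 640))  # assumed input size; check your model
    return frame.astype(np.float32) / 255.0

def my_output_callback_function(*outputs):
    # With the post model attached, this receives the final tensors,
    # e.g. a single [8400, 56] array for yolov8m-pose
    print(outputs[0].shape)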
Without the set_postprocessing_model call, the tensors returned to the output callback function would be intermediate values at the model crop points (like 810 for RTMPose). I think it's around 6 tensors of varying shapes for yolov8m-pose, for example.
But with the connected post model, the returned tensor is just [8400, 56], which is the final output of the model.
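And if you'd rather handle the post step manually instead of attaching it, a minimal sketch with onnxruntime (input names are discovered from the session rather than hard-coded, and it assumes the accelerator's output order matches the post model's input order):

import onnxruntime as ort

sess = ort.InferenceSession('yolov8m-pose_post.onnx')

def my_output_callback_function(*accl_outputs):
    # Feed the intermediate crop tensors into the post model on the CPU;
    # this mirrors what set_postprocessing_model() does internally.
    # Note: assumes accelerator output order matches the post model's inputs.
    feeds = {inp.name: out for inp, out in zip(sess.get_inputs(), accl_outputs)}
    final_outputs = sess.run(None, feeds)
    print(final_outputs[0].shape)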
Hope this helps clear things up. Let me know if you'd like more examples.