"Hello, MobileNet!" example hangs forever

Hello,

I was able to assemble the heatsink on the MX3 module (thanks for the great video).

I was able to install the drivers (thanks for the great instructions).
Here is what “lspci -k” reports:

(mx) ~/memryx$ lspci -k
	2e:00.0 Processing accelerators: Device 1fe9:0100
	Subsystem: Device 1fe9:0000
	Flags: bus master, fast devsel, latency 0, IRQ 46, NUMA node 0
	Memory at a0000000 (32-bit, non-prefetchable) [size=256M]
	Memory at b0000000 (32-bit, non-prefetchable) [size=1M]
	Expansion ROM at <ignored> [disabled]
	Capabilities: <access denied>
	Kernel driver in use: memx_pcie_ai_chip
	Kernel modules: memx_cascade_plus_pcie
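
One note on the above: "Capabilities" shows as <access denied> because I ran lspci as a regular user. If the full capability list (PCIe link speed/width, etc.) would be useful for debugging, I can re-run it with sudo, e.g.:

(mx) ~/memryx$ sudo lspci -vv -s 2e:00.0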

I was able to install the MemryX SDK (thanks for the great instructions).

I am able to successfully run the “Hello, MXA!” example:

(mx) ~/memryx$ mx_bench --hello
	Hello from MXA!
	Group: 0
	Number of chips: 4
	Interface: PCIe 3.0

When I run the “Hello, MobileNet!” example, however, it hangs forever:

(mx) ~/memryx$ python3 -c "import tensorflow as tf; tf.keras.applications.MobileNet().save('mobilenet.h5');"
(mx) ~/memryx$ mx_nc -v -m mobilenet.h5

2025-01-20 12:47:01.901202: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-01-20 12:47:01.912572: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-20 12:47:01.926487: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-20 12:47:01.930667: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-20 12:47:01.940631: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-20 12:47:02.727591: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2025-01-20 12:47:04.193382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2021] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1588 MB memory:  -> device: 0, name: NVIDIA T400, pci bus id: 0000:21:00.0, compute capability: 7.5
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1737395225.449573    8130 gpu_backend_lib.cc:593] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-12.3
  /usr/local/cuda
  /home/albertabeef/mx/lib/python3.10/site-packages/tensorflow/python/platform/../../../nvidia/cuda_nvcc
  /home/albertabeef/mx/lib/python3.10/site-packages/tensorflow/python/platform/../../../../nvidia/cuda_nvcc
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
W0000 00:00:1737395225.725834    8130 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395225.727424    8132 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395225.730188    8129 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395225.731763    8134 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395225.734495    8131 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395225.737649    8133 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395225.739870    8128 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395225.741305    8127 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395225.751963    8130 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395225.753423    8132 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395225.756344    8129 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395225.757803    8134 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395225.759228    8131 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet/mobilenet_1_0_224_tf.h5
17225924/17225924 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
WARNING:absl:You are saving your model as an HDF5 file via `model.save()` or `keras.saving.save_model(model)`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')` or `keras.saving.save_model(model, 'my_model.keras')`. 

╭─────────────────┬─────┬─────┬────────╮
│                 │     │     │        │
│                 │           ├────    │
│     │     │     ╞══       ══╡        │
│     │     │     │           ├────    │
│     │     │     │     │     │        │
╰─────┴─────┴─────┴─────┴─────┴────────╯

╔══════════════════════════════════════╗
║            Neural Compiler           ║
║  Copyright (c) 2019-2024 MemryX Inc. ║
╚══════════════════════════════════════╝

════════════════════════════════════════
Anonymously share diagnostic data to support optimizing performance & enabling debug support (Y/N)?
Y
Selected: Yes
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1737395250.360234    8192 gpu_backend_lib.cc:593] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-12.3
  /usr/local/cuda
  /home/albertabeef/mx/lib/python3.10/site-packages/tensorflow/python/platform/../../../nvidia/cuda_nvcc
  /home/albertabeef/mx/lib/python3.10/site-packages/tensorflow/python/platform/../../../../nvidia/cuda_nvcc
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
W0000 00:00:1737395250.368492    8188 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395250.369851    8185 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395250.371372    8191 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395250.372728    8186 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395250.374126    8187 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395250.375482    8192 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395250.376827    8189 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395250.378235    8190 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395250.386866    8188 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395250.390006    8185 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395250.391396    8186 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395250.392748    8191 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
W0000 00:00:1737395250.394102    8187 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.
Converting Model: (Done)                                         
Optimizing Graph: (Done)                                         
Cores optimization: (Done)                                         
Flow optimization: (Done)                                         
. . . . . . . . . . . . . . . . . . . . 
Ports mapping: (Done)
MPU 0 input port 0: {'model_index': 0, 'layer_name': 'input_layer', 'shape': [224, 224, 1, 3]}
MPU 3 output port 0: {'model_index': 0, 'layer_name': 'predictions', 'shape': [1, 1, 1, 1000]}
────────────────────────────────────────
Assembling DFP: (Done)                                         
════════════════════════════════════════

(mx) ~/memryx$ mx_bench -v -d mobilenet.dfp -f 1000

╭─────────────────┬─────┬─────┬────────╮
│                 │     │     │        │
│                 │           ├────    │
│     │     │     ╞══       ══╡        │
│     │     │     │           ├────    │
│     │     │     │     │     │        │
╰─────┴─────┴─────┴─────┴─────┴────────╯

╔══════════════════════════════════════╗
║               Benchmark              ║
║  Copyright (c) 2019-2024 MemryX Inc. ║
╚══════════════════════════════════════╝
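
For what it's worth, the Benchmark banner prints and then there is no further output at all; the process just sits there. The next time it hangs I plan to check the kernel log for anything from the PCIe driver, with something like:

(mx) ~/memryx$ sudo dmesg | grep -i memx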

Any idea what could be causing this?

Cheers!

Mario (AlbertaBeef)

I have another data point …

I went through the same installation on a Raspberry Pi 5, and the “Hello, MobileNet!” example worked on that platform.

So it seems that my issue is specific to my HP Z4 G4 workstation.
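
The only other comparison I can think of is the driver itself. If it helps, I can report the kernel module details from both machines, for example:

(mx) ~/memryx$ modinfo memx_cascade_plus_pcie

(The module name is taken from the lspci output in my first post.)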

I really would like to get the MX3 working on my workstation as well.

Thanks in advance for any help,

Mario (AlbertaBeef)