
3D ESDF computation hangs on DGX Spark (GB10, CUDA 13.0) #122

Description

@ekuama-wim

Environment

| Component | Value |
|---|---|
| Hardware | NVIDIA DGX Spark |
| GPU | NVIDIA GB10 (Blackwell architecture, NVIDIA RTX) |
| CPU | 4 cores, aarch64 (ARM) |
| Memory | 119 GB unified (shared CPU/GPU) |
| OS | Ubuntu 24.04.3 LTS (Noble Numbat) |
| Kernel | 6.14.0-1015-nvidia (PREEMPT_DYNAMIC) |
| Driver | 580.126.09 |
| CUDA | 13.0 (toolkit V13.0.88) |
| nvblox | 4.2.0 (ros-jazzy-nvblox-ros 4.2.0-0noble.20260218225704237) |
| Isaac ROS | 4.2.0 |
| Isaac Sim | 5.1 |
| ROS 2 | Jazzy |
| Architecture | arm64 (not Jetson/L4T; full Ubuntu desktop) |

Note: This is a desktop arm64 Blackwell system, not a Jetson/L4T platform. No tegra or JetPack packages are installed.
GPU uses ATS (Address Translation Services) addressing mode with unified memory.

Problem

The 3D ESDF computation in nvblox hangs indefinitely on GB10 (DGX Spark). The TSDF map builds correctly — GPU hash resizing and
depth integration proceed normally — but any attempt to compute the ESDF causes the node to freeze with no error output.

This happens when cuMotion calls the get_esdf_and_gradient service with update_esdf: true.

Steps to Reproduce

  1. Launch Isaac Manipulator pick-and-place workflow with Isaac Sim on DGX Spark
  2. nvblox runs in component_container_isolated (see note below about same-container crash)
  3. cuMotion requests ESDF via get_esdf_and_gradient service call
  4. nvblox receives the request and never responds

Observed Behavior

nvblox logs show the request is received, then nothing — the node becomes completely unresponsive:

[nvblox_node] Received request for ESDF with:
[nvblox_node] update_esdf: 1
[nvblox_node] use_aabb: 1
[nvblox_node] visualize_esdf: 1
[nvblox_node] aabb_min_m: [-1.50, -1.00, -0.20],
[nvblox_node] aabb_size_m: [2.00, 2.00, 1.00]

cuMotion times out after 60 seconds. There is no error message, no crash, and no GPU error in dmesg; the ESDF
computation simply never returns.
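
To make the "no GPU error in dmesg" check repeatable, a small filter like the one below can scan captured kernel log text for NVIDIA Xid events, which the driver logs with an `NVRM: Xid` prefix. This is only a sketch: `find_xid_events` is a name made up for illustration, and the sample log line is illustrative, not output from this machine.

```python
import re

# NVIDIA driver errors show up in the kernel log as "NVRM: Xid (<device>): <code>, ...".
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")


def find_xid_events(kernel_log_text):
    """Return a list of (xid_code, device, full_line) for every Xid line found."""
    events = []
    for line in kernel_log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((int(match.group("code")), match.group("device"), line.strip()))
    return events


# Illustrative input only (hypothetical lines, not taken from the affected system).
sample_log = (
    "[  100.000000] NVRM: Xid (PCI:0000:01:00): 31, pid=1234, name=example\n"
    "[  101.000000] usb 1-1: new high-speed USB device\n"
)
for code, device, line in find_xid_events(sample_log):
    print(f"Xid {code} on {device}: {line}")
```

Feeding it the text of `dmesg` captured right after the hang should return an empty list if the driver really reported nothing, which matches what was observed here.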

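What Works

TSDF mapping: the map builds correctly, and GPU hash resizing and depth integration proceed normally. Only the ESDF computation hangs.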

What I Tried (all failed)

| Approach | Result |
|---|---|
| CUDA_LAUNCH_BLOCKING=1 env variable | Still hangs |
| cuda_stream_type: 0 (default CUDA stream) | Still hangs |
| Both CUDA_LAUNCH_BLOCKING=1 + cuda_stream_type: 0 together | Still hangs |
| esdf_mode: "2d" | FATAL: "The ESDF service is only intended for mapping with 3D ESDFs" |
| update_esdf_rate_hz: 2.0 (periodic instead of on-demand) | nvblox freezes; confirms kernel itself hangs |
| voxel_size: 0.03 (larger voxels, smaller map) | Still hangs |
| Loading nvblox in same container (component_container_mt) | Segfault (exit code -11) during NITROS init |

nvblox Configuration

/**:
  ros__parameters:
    cuda_stream_type: 2
    voxel_size: 0.01
    input_qos: "SENSOR_DATA"
    global_frame: "base_link"
    esdf_mode: "3d"
    tick_period_ms: 10
    integrate_color_rate_hz: 5.0
    update_esdf_rate_hz: 0.0  # on-demand via service
    update_mesh_rate_hz: 0.0
    publish_layer_rate_hz: 0.0
    decay_tsdf_rate_hz: 0.0
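
For a sense of scale, the AABB from the logged request (aabb_size_m: [2.00, 2.00, 1.00]) can be converted into dense voxel counts for the two voxel sizes tried (0.01 and 0.03). This is a rough estimate, not a measurement. It assumes voxel blocks of 8 voxels per side (`voxels_per_block_side` is a parameter name chosen for this sketch) and a box aligned to the block grid. Since the map is sparse, the real number of allocated blocks inside the box should be lower.

```python
import math


def aabb_voxel_stats(aabb_size_m, voxel_size_m, voxels_per_block_side=8):
    """Return (voxels_per_axis, total_voxels, total_blocks) for a dense axis-aligned box."""
    # Round before ceil so float noise in the division cannot add a spurious extra voxel.
    voxels = [math.ceil(round(side / voxel_size_m, 6)) for side in aabb_size_m]
    # Assumed block layout: cubic blocks of voxels_per_block_side voxels per axis.
    blocks = [math.ceil(v / voxels_per_block_side) for v in voxels]
    return voxels, math.prod(voxels), math.prod(blocks)


for voxel_size in (0.01, 0.03):
    voxels, total_voxels, total_blocks = aabb_voxel_stats([2.00, 2.00, 1.00], voxel_size)
    print(f"voxel_size={voxel_size}: {voxels} -> {total_voxels:,} voxels, {total_blocks:,} blocks")
```

Under these assumptions the box holds at most 4,000,000 voxels at 0.01 m and 152,626 at 0.03 m. The hang persisting after a roughly 26x reduction in workload suggests it is not a matter of the computation merely being slow.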

GPU State at Time of Hang

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0  On |                  N/A |
| N/A   58C    P0             18W /  N/A  | Not Supported          |      3%      Default |
+-----------------------------------------+------------------------+----------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|   0   N/A  N/A         1820758    C+G   /isaac-sim/kit/kit                     5735MiB |
|   0   N/A  N/A         1850379      C   ...onents/component_container_mt       5335MiB |
|   0   N/A  N/A         1850380      C   /usr/bin/python3                        196MiB |
|   0   N/A  N/A         1850381      C   /usr/bin/python3                       3029MiB |
|   0   N/A  N/A         1850382      C   /usr/bin/python3                       4024MiB |
|   0   N/A  N/A         1850383      C   /usr/bin/python3                        198MiB |
|   0   N/A  N/A         1850384      C   .../component_container_isolated        973MiB |
|   0   N/A  N/A         1850391      G   /opt/ros/jazzy/lib/rviz2/rviz2          130MiB |
+-----------------------------------------------------------------------------------------+

Total GPU memory in use: ~21 GB out of the unified memory pool (128 GB nominal, 119 GB reported by the system), so memory is not the issue.
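
The total can be checked by summing the per-process figures from the nvidia-smi process table above. The snippet only re-adds numbers copied from that table; the truncated process names are kept exactly as nvidia-smi printed them.

```python
# (PID, process name as printed by nvidia-smi, GPU memory in MiB), copied from the table above.
processes = [
    (1820758, "/isaac-sim/kit/kit", 5735),
    (1850379, "...onents/component_container_mt", 5335),
    (1850380, "/usr/bin/python3", 196),
    (1850381, "/usr/bin/python3", 3029),
    (1850382, "/usr/bin/python3", 4024),
    (1850383, "/usr/bin/python3", 198),
    (1850384, ".../component_container_isolated", 973),
    (1850391, "/opt/ros/jazzy/lib/rviz2/rviz2", 130),
]

total_mib = sum(mib for _, _, mib in processes)
total_gib = total_mib / 1024            # binary gigabytes
total_gb = total_mib * 1024**2 / 1e9    # decimal gigabytes

print(f"{total_mib} MiB = {total_gib:.1f} GiB = {total_gb:.1f} GB")
```

This gives 19620 MiB, about 19.2 GiB or 20.6 GB, which is where the ~21 GB figure comes from. The nvblox container (component_container_isolated) accounts for 973 MiB of it.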

System Info

$ uname -a
Linux spark-ef98 6.14.0-1015-nvidia #15-Ubuntu SMP PREEMPT_DYNAMIC aarch64 GNU/Linux

$ nvcc --version
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0

$ nvidia-smi -q (excerpt)
Product Name:          NVIDIA GB10
Product Architecture:  Blackwell
Addressing Mode:       ATS

Additional Notes
- Not a Jetson/L4T system: DGX Spark runs full Ubuntu 24.04 on arm64, not JetPack/L4T. nvblox may not have been tested on this
platform configuration (arm64 desktop Blackwell with ATS addressing and unified memory).
- Same-container crash: Loading nvblox into component_container_mt (alongside NITROS nodes) causes a segfault (exit code -11)
during NITROS initialization on GB10. Using component_container_isolated avoids the crash but requires SENSOR_DATA QoS for
depth topics published via the NITROS DDS bridge.
- Periodic ESDF also hangs: Setting update_esdf_rate_hz: 2.0 (automatic computation every 0.5 s, not on-demand) also freezes the
nvblox node, which indicates that the ESDF computation itself hangs regardless of how it is triggered.
- The Isaac Manipulator pick-and-place workflow works on other NVIDIA platforms per NVIDIA documentation.

Expected Behavior

ESDF should compute and return results within a reasonable timeout, or nvblox should fail gracefully with an error message
indicating platform incompatibility rather than hanging indefinitely.
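
As an illustration of the "fail with an error instead of hanging" behavior, here is a generic caller-side pattern that bounds a blocking call and raises a descriptive error on timeout. It is a minimal sketch, not nvblox or cuMotion code: `call_with_deadline` and `fake_esdf_request` are hypothetical names, and the stand-in function only sleeps to imitate a call that takes too long.

```python
import threading
import time


def call_with_deadline(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) in a daemon thread and raise TimeoutError if it overruns."""
    outcome = {}

    def worker():
        try:
            outcome["value"] = fn(*args, **kwargs)
        except BaseException as exc:  # forward any failure to the caller
            outcome["error"] = exc

    # Daemon thread: a call that never returns cannot keep the process alive at exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        raise TimeoutError(f"call did not return within {timeout_s} s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


def fake_esdf_request(duration_s):
    """Hypothetical stand-in for a blocking ESDF request."""
    time.sleep(duration_s)
    return "esdf ready"


print(call_with_deadline(fake_esdf_request, 1.0, 0.01))
try:
    call_with_deadline(fake_esdf_request, 0.05, 0.5)
except TimeoutError as exc:
    print(f"ESDF request failed: {exc}")
```

Note that this only reports the problem; the stuck worker keeps running in the background, so a wedged GPU call would still need the node to be restarted.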
