An illegal memory access was encountered during visualization of multi-GPU training on Trellis2 #164

@XiaokunSun

I'm running the following commands to train the Shape DiT and the Tex DiT on a single machine with multiple GPUs. Both hit an illegal memory access error when the trainer visualizes training samples.

Command:
python train.py \
    --config configs/gen/slat_flow_imgshape2tex_dit_1_3B_512_bf16.json \
    --output_dir results/slat_flow_imgshape2tex_dit_1_3B_512_bf16_10000 \
    --data_dir "{\"Toys4k\": {\"base\": \"datasets/Toys4k\", \"shape_latent\": \"datasets/Toys4k/shape_latents/shape_enc_next_dc_f16c32_fp16_512\", \"pbr_latent\": \"datasets/Toys4k/pbr_latents/tex_enc_next_dc_f16c32_fp16_512\", \"render_cond\": \"datasets/Toys4k/renders_cond\"}}" \
    --auto_retry 0 \
    --num_nodes 1

or

python train.py \
    --config configs/gen/slat_flow_img2shape_dit_1_3B_512_bf16.json \
    --output_dir results/slat_flow_img2shape_dit_1_3B_512_bf16_10000 \
    --data_dir "{\"Toys4k\": {\"base\": \"datasets/Toys4k\", \"shape_latent\": \"datasets/Toys4k/shape_latents/shape_enc_next_dc_f16c32_fp16_512\", \"render_cond\": \"datasets/Toys4k/renders_cond\"}}" \
    --auto_retry 0 \
    --num_nodes 1
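
The hand-escaped JSON in `--data_dir` is easy to mistype. As a minimal sketch, assuming `train.py` parses that argument as JSON (which the commands above suggest), the same string can be generated with `json.dumps` and quoted for a POSIX shell with `shlex.join`. The helper name `build_data_dir_arg` is hypothetical and not part of the repository.

```python
import json
import shlex


def build_data_dir_arg(datasets: dict) -> str:
    """Serialize the dataset mapping into the JSON string given to --data_dir."""
    return json.dumps(datasets)


# Paths copied from the first command above.
toys4k = {
    "Toys4k": {
        "base": "datasets/Toys4k",
        "shape_latent": "datasets/Toys4k/shape_latents/shape_enc_next_dc_f16c32_fp16_512",
        "pbr_latent": "datasets/Toys4k/pbr_latents/tex_enc_next_dc_f16c32_fp16_512",
        "render_cond": "datasets/Toys4k/renders_cond",
    }
}

argv = [
    "python", "train.py",
    "--config", "configs/gen/slat_flow_imgshape2tex_dit_1_3B_512_bf16.json",
    "--output_dir", "results/slat_flow_imgshape2tex_dit_1_3B_512_bf16_10000",
    "--data_dir", build_data_dir_arg(toys4k),
    "--auto_retry", "0",
    "--num_nodes", "1",
]

# shlex.join quotes every argument, so the result can be pasted into a shell.
command_line = shlex.join(argv)
print(command_line)
```

Passing `argv` directly to `subprocess.run` would avoid shell quoting altogether.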

Error log:

Step: 1000/10000 (10.00%) | Elapsed: 1.23 h           | Speed: 810.12 steps/h     | ETA: 11.11 h             

Sampling 64 images...[rank1]:[I525 09:54:48.207301910 RasterImpl.cpp:195] Internal buffers grown to 80 MB
[rank1]:[I525 09:54:49.718524310 RasterImpl.cpp:195] Internal buffers grown to 120 MB
[rank1]:[I525 09:54:50.213231270 RasterImpl.cpp:195] Internal buffers grown to 130 MB
[rank1]:[I525 09:54:51.713233719 RasterImpl.cpp:195] Internal buffers grown to 170 MB
[rank0]:[I525 09:55:07.092763051 RasterImpl.cpp:195] Internal buffers grown to 30 MB
[rank0]:[I525 09:55:07.172713316 RasterImpl.cpp:195] Internal buffers grown to 80 MB
[rank0]:[I525 09:55:07.298698145 RasterImpl.cpp:195] Internal buffers grown to 220 MB
[rank0]:[I525 09:55:08.578346297 RasterImpl.cpp:195] Internal buffers grown to 690 MB
[rank0]:[W525 09:55:08.962935921 CUDAGuardImpl.h:119] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f4dd776c1b6 in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4dd7715a76 in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4dd7ba3918 in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x20d8e (0x7f4dd7b69d8e in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x22507 (0x7f4dd7b6b507 in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x2270f (0x7f4dd7b6b70f in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x6417b2 (0x7f4dcf4d07b2 in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f30f (0x7f4dd774d30f in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f4dd774633b in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f4dd77464e9 in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: c10d::Reducer::~Reducer() + 0x3d8 (0x7f4dc010cdc8 in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #11: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f4dcfcd7532 in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7f4dcf39acb8 in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0xe51ed1 (0x7f4dcfce0ed1 in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #14: <unknown function> + 0x516907 (0x7f4dcf3a5907 in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x5174d1 (0x7f4dcf3a64d1 in /root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x128356 (0x55faed4d1356 in /root/miniconda3/envs/trellis2/bin/python)
frame #17: <unknown function> + 0x14cb10 (0x55faed4f5b10 in /root/miniconda3/envs/trellis2/bin/python)
frame #18: <unknown function> + 0x128377 (0x55faed4d1377 in /root/miniconda3/envs/trellis2/bin/python)
frame #19: <unknown function> + 0x128356 (0x55faed4d1356 in /root/miniconda3/envs/trellis2/bin/python)
frame #20: <unknown function> + 0x14cb10 (0x55faed4f5b10 in /root/miniconda3/envs/trellis2/bin/python)
frame #21: <unknown function> + 0x134677 (0x55faed4dd677 in /root/miniconda3/envs/trellis2/bin/python)
frame #22: <unknown function> + 0x146356 (0x55faed4ef356 in /root/miniconda3/envs/trellis2/bin/python)
frame #23: <unknown function> + 0x1463c5 (0x55faed4ef3c5 in /root/miniconda3/envs/trellis2/bin/python)
frame #24: <unknown function> + 0x14b603 (0x55faed4f4603 in /root/miniconda3/envs/trellis2/bin/python)
frame #25: <unknown function> + 0x120912 (0x55faed4c9912 in /root/miniconda3/envs/trellis2/bin/python)
frame #26: <unknown function> + 0x19c4b0 (0x55faed5454b0 in /root/miniconda3/envs/trellis2/bin/python)
frame #27: <unknown function> + 0x20b5b1 (0x55faed5b45b1 in /root/miniconda3/envs/trellis2/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x46cf (0x55faed4d901f in /root/miniconda3/envs/trellis2/bin/python)
frame #29: _PyFunction_Vectorcall + 0x6c (0x55faed4e4e7c in /root/miniconda3/envs/trellis2/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x6fb (0x55faed4d504b in /root/miniconda3/envs/trellis2/bin/python)
frame #31: _PyFunction_Vectorcall + 0x6c (0x55faed4e4e7c in /root/miniconda3/envs/trellis2/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x313 (0x55faed4d4c63 in /root/miniconda3/envs/trellis2/bin/python)
frame #33: _PyFunction_Vectorcall + 0x6c (0x55faed4e4e7c in /root/miniconda3/envs/trellis2/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x130f (0x55faed4d5c5f in /root/miniconda3/envs/trellis2/bin/python)
frame #35: <unknown function> + 0x1d268c (0x55faed57b68c in /root/miniconda3/envs/trellis2/bin/python)
frame #36: PyEval_EvalCode + 0x85 (0x55faed57b5d5 in /root/miniconda3/envs/trellis2/bin/python)
frame #37: <unknown function> + 0x2042aa (0x55faed5ad2aa in /root/miniconda3/envs/trellis2/bin/python)
frame #38: <unknown function> + 0x1febf3 (0x55faed5a7bf3 in /root/miniconda3/envs/trellis2/bin/python)
frame #39: PyRun_StringFlags + 0x7d (0x55faed5a001d in /root/miniconda3/envs/trellis2/bin/python)
frame #40: PyRun_SimpleStringFlags + 0x3c (0x55faed59febc in /root/miniconda3/envs/trellis2/bin/python)
frame #41: Py_RunMain + 0x23a (0x55faed59f0aa in /root/miniconda3/envs/trellis2/bin/python)
frame #42: Py_BytesMain + 0x37 (0x55faed56dec7 in /root/miniconda3/envs/trellis2/bin/python)
frame #43: <unknown function> + 0x2a1ca (0x7f4dd87d01ca in /lib/x86_64-linux-gnu/libc.so.6)
frame #44: __libc_start_main + 0x8b (0x7f4dd87d028b in /lib/x86_64-linux-gnu/libc.so.6)
frame #45: _start + 0x2e (0x55faed56ddde in /root/miniconda3/envs/trellis2/bin/python)

W0525 09:55:10.731000 659079 /mnt/project/user/xiaokun.sun/Software/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 659458 via signal SIGTERM
Traceback (most recent call last):
  File "/mnt/project/user/xiaokun.sun/Project/TRELLIS.2/train.py", line 143, in <module>
    mp.spawn(main, args=(cfg,), nprocs=cfg.num_gpus, join=True)
  File "/root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 215, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/mnt/project/user/xiaokun.sun/Project/TRELLIS.2/train.py", line 94, in main
    trainer.run()
  File "/mnt/project/user/xiaokun.sun/Project/TRELLIS.2/trellis2/trainers/basic.py", line 860, in run
    self.snapshot()
  File "/root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/project/user/xiaokun.sun/Project/TRELLIS.2/trellis2/trainers/basic.py", line 543, in snapshot
    vis = self.visualize_sample(samples[key]['value'])
  File "/root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/project/user/xiaokun.sun/Project/TRELLIS.2/trellis2/trainers/basic.py", line 481, in visualize_sample
    return self.dataset.visualize_sample(sample)
  File "/root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/project/user/xiaokun.sun/Project/TRELLIS.2/trellis2/datasets/structured_latent_shape.py", line 46, in visualize_sample
    res = renderer.render(representation, ext, intr)
  File "/mnt/project/user/xiaokun.sun/Project/TRELLIS.2/trellis2/renderers/mesh_renderer.py", line 139, in render
    rast, rast_db = dr.rasterize(
  File "/root/miniconda3/envs/trellis2/lib/python3.10/site-packages/nvdiffrast/torch/ops.py", line 135, in rasterize
    return _rasterize_func.apply(glctx, pos, tri, resolution, ranges, grad_db, -1)
  File "/root/miniconda3/envs/trellis2/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/root/miniconda3/envs/trellis2/lib/python3.10/site-packages/nvdiffrast/torch/ops.py", line 78, in forward
    out, out_db = _nvdiffrast_c.rasterize_fwd_cuda(raster_ctx.cpp_wrapper, pos, tri, resolution, ranges, peeling_idx)
RuntimeError: Cuda error: 700[cudaStreamSynchronize(stream);]
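
The PyTorch message in the log notes that CUDA errors can be reported asynchronously and suggests `CUDA_LAUNCH_BLOCKING=1`. A minimal sketch of rerunning with that variable set is shown below. The helper `debug_env` is hypothetical, and it assumes the worker processes started by `mp.spawn` inherit the parent's environment, which is the default behavior.

```python
import os


def debug_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # Suggested by the error message: makes kernel launches synchronous so the
    # Python traceback points at the call that actually failed.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Hypothetical usage (not executed here); train_argv is the argument list
# for train.py:
#   import subprocess
#   subprocess.run(train_argv, env=debug_env(), check=True)
```

Synchronous launches slow training noticeably, so this is only useful for reproducing the crash at the first snapshot step.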

Additional information: single-GPU training runs normally; the error occurs only with multiple GPUs. Multi-GPU training of the first stage (SS DiT) also runs normally. The problem appears only when training the Shape DiT and the Tex DiT with multiple GPUs. Can anyone help? Thank you!
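
Since the failure shows up only with multiple GPUs and inside `dr.rasterize`, one thing that may be worth ruling out is a device mismatch between the rasterizer inputs and the GPU assigned to the rank. This is a guess, not a confirmed cause. The sketch below is a hypothetical helper (`find_device_mismatches` does not exist in TRELLIS.2 or nvdiffrast) that compares the `.device` of each input against the expected one. It relies only on `str(t.device)`, so plain stand-in objects are used here in place of real tensors.

```python
from types import SimpleNamespace


def find_device_mismatches(expected_device, **tensors):
    """Return {name: device} for every input that is not on expected_device."""
    expected = str(expected_device)
    return {
        name: str(t.device)
        for name, t in tensors.items()
        if str(t.device) != expected
    }


# Stand-ins for tensors; only the .device attribute is read.
pos = SimpleNamespace(device="cuda:1")
tri = SimpleNamespace(device="cuda:0")

print(find_device_mismatches("cuda:1", pos=pos, tri=tri))  # {'tri': 'cuda:0'}
```

In a real run the expected value would need an explicit index (for example `cuda:1`), because the comparison is a plain string match and a bare `cuda` would be flagged as a mismatch.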
