Facing issues when trying to reproduce the same run with 4xH200s #27

@harisarang

Description

I followed the installation instructions in the repository's README.md
and attempted to run the SDPO generalization experiment. However, the
training process fails during model initialization.

Environment

  • Python version: 3.12
  • CUDA version: 12.8
  • PyTorch version: 2.8.0
  • GPU: 2 × NVIDIA H200
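
The list above omits the flash-attn and transformers versions, which matter for the traceback below. A minimal sketch for collecting them without importing the packages (so it still works when `flash_attn` fails to load); the helper name `collect_versions` and the distribution names passed to it are assumptions for illustration, not part of the repository:

```python
import platform
from importlib import metadata


def collect_versions(packages=("torch", "flash-attn", "transformers")):
    """Return installed versions by reading package metadata only.

    Nothing is imported, so a broken compiled extension cannot crash this.
    The distribution names in `packages` are assumed, adjust as needed.
    """
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Pasting this output alongside the environment list should make it easier to tell whether the installed flash-attn build matches PyTorch 2.8.0.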

Steps to Reproduce

  1. Follow the installation instructions from README.md.
  2. Run the experiment script:
       bash experiments/generalization/run_sdpo_all.sh
  3. The process fails during model initialization.

Error Logs

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(TaskRunner pid=37717) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.actor_rollout_ref_init_model() (pid=38309, ip=172.17.0.3, actor_id=d6503e2665e2d40fc6795be001000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7114d32adb20>)
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(TaskRunner pid=37717)     return self.__get_result()
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(TaskRunner pid=37717)     raise self._exception
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/workspace/verl/single_controller/ray/base.py", line 844, in func
(TaskRunner pid=37717)     return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/workspace/verl/single_controller/base/decorator.py", line 462, in inner
(TaskRunner pid=37717)     return func(*args, **kwargs)
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/workspace/verl/utils/transferqueue_utils.py", line 314, in dummy_inner
(TaskRunner pid=37717)     output = func(*args, **kwargs)
(TaskRunner pid=37717)              ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/workspace/verl/workers/fsdp_workers.py", line 812, in init_model
(TaskRunner pid=37717)     ) = self._build_model_optimizer(
(TaskRunner pid=37717)         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/workspace/verl/workers/fsdp_workers.py", line 400, in _build_model_optimizer
(TaskRunner pid=37717)     actor_module = actor_module_class.from_pretrained(
(TaskRunner pid=37717)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 604, in from_pretrained
(TaskRunner pid=37717)     return model_class.from_pretrained(
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 288, in _wrapper
(TaskRunner pid=37717)     return func(*args, **kwargs)
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 5103, in from_pretrained
(TaskRunner pid=37717)     model = cls(config, *model_args, **model_kwargs)
(TaskRunner pid=37717)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/models/qwen3/modeling_qwen3.py", line 435, in __init__
(TaskRunner pid=37717)     super().__init__(config)
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2197, in __init__
(TaskRunner pid=37717)     self.config._attn_implementation_internal = self._check_and_adjust_attn_implementation(
(TaskRunner pid=37717)                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2812, in _check_and_adjust_attn_implementation
(TaskRunner pid=37717)     lazy_import_flash_attention(applicable_attn_implementation)
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 136, in lazy_import_flash_attention
(TaskRunner pid=37717)     _flash_fn, _flash_varlen_fn, _pad_fn, _unpad_fn = _lazy_imports(implementation)
(TaskRunner pid=37717)                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 83, in _lazy_imports
(TaskRunner pid=37717)     from flash_attn import flash_attn_func, flash_attn_varlen_func
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/flash_attn/__init__.py", line 3, in <module>
(TaskRunner pid=37717)     from flash_attn.flash_attn_interface import (
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 15, in <module>
(TaskRunner pid=37717)     import flash_attn_2_cuda as flash_attn_gpu
(TaskRunner pid=37717) ImportError: /venv/main/lib/python3.12/site-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_jb
(WorkerDict pid=38309) /workspace/verl/utils/tokenizer.py:107: UserWarning: Failed to create processor: Unsupported processor type: Qwen2TokenizerFast. This may affect multimodal processing [repeated 3x across cluster]
(WorkerDict pid=38309)   warnings.warn(f"Failed to create processor: {e}. This may affect multimodal processing", stacklevel=1) [repeated 3x across cluster]
(WorkerDict pid=38309) `torch_dtype` is deprecated! Use `dtype` instead! [repeated 3x across cluster]
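
The final ImportError in the trace is an undefined-symbol failure while loading `flash_attn_2_cuda`. The missing symbol demangles to `c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool)`, which belongs to PyTorch's own libraries. That usually suggests the flash-attn binary was built against a different PyTorch build than the installed one, though this is an assumption and not confirmed here. A minimal sketch that reproduces the import outside Ray and labels the failure; the function names and the labels are hypothetical:

```python
import importlib


def classify_import_error(message):
    """Map an ImportError message to a coarse, hypothetical label."""
    if "undefined symbol" in message:
        # Compiled extension expects a symbol the loaded libraries lack,
        # commonly a build/runtime version mismatch.
        return "abi-mismatch"
    if "No module named" in message:
        return "missing"
    return "other"


def diagnose_import(module_name="flash_attn"):
    """Try to import a module and report 'ok' or a failure label."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        return classify_import_error(str(exc))
    return "ok"


if __name__ == "__main__":
    print(diagnose_import("flash_attn"))
```

If this prints `abi-mismatch` in the same virtual environment, the failure is independent of verl and Ray, and the likely next step is installing a flash-attn build that matches the installed PyTorch and CUDA versions.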

Any guidance would be appreciated. Thanks!
