A simple env for training on H100 nodes #34

@hrlics

Hi @jonhue, nice work and thanks for open-sourcing the code!

I also ran into package conflicts on H100 when following the instructions. Judging from #30, verl:vllm017.latest seems to work for people. However, I still hit the assertion error below, possibly due to a mismatch between the SDPO infra and vllm017:

AssertionError: local_world_size (2) must be less than or equal to the number of visible devices (1).

In my case, simply using docker pull verlai/verl:vllm012.latest makes training work on H100. Hope this helps folks using H100s :)
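
For anyone who wants to sanity-check their setup before launching, here is a minimal sketch of a pre-flight check that mirrors the comparison in the assertion message. It assumes the number of visible devices is determined by the CUDA_VISIBLE_DEVICES environment variable, which may not be exactly how vLLM counts devices internally. The helper names visible_gpu_count and check_local_world_size are hypothetical and not part of verl, vLLM, or SDPO.

```python
import os


def visible_gpu_count(env=None):
    """Count GPUs exposed through CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, i.e. no restriction is applied.
    Assumption: the variable holds a comma-separated list of device ids.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return len([d for d in raw.split(",") if d.strip()])


def check_local_world_size(local_world_size, env=None):
    """Return True if local_world_size fits within the visible devices."""
    visible = visible_gpu_count(env)
    if visible is None:
        # Nothing is masked, so this check cannot rule anything out.
        return True
    return local_world_size <= visible


if __name__ == "__main__":
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    print("local_world_size=2 fits:", check_local_world_size(2))
```

If this reports that a local world size of 2 does not fit, the process only sees one GPU, which matches the situation described by the assertion message above.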
