Cherry-picking upstream commit 478dfe0 to fix checkpoint writing in NGC 25.03 onwards #72

Open
lukasgd wants to merge 1 commit into swiss-ai:main-run-70b-v1 from lukasgd:main-run-70b-v1-25.03

lukasgd commented May 7, 2025

This integrates minimal changes from

ADLR/megatron-lm!2796 - Extend FileSystemWriterAsync for DCP stream extensions

as suggested in

https://swissai-initiative.slack.com/archives/C07TW7B58A3/p1746632802582019?thread_ts=1746598424.180579&cid=C07TW7B58A3

and is tested with ap3-70b on 16 GH200 nodes:

  • writing checkpoints now works with NGC 25.03
  • these checkpoints can also be resumed successfully, both in an NGC 25.03 container and in an older one (24.12)
  • additionally, checkpoint writing with older images (24.12) is unaffected
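
For anyone wanting to reproduce the cherry-pick named in the title, a minimal sketch follows. It assumes the commit 478dfe0 is reachable from a git remote called `upstream` (a hypothetical remote name, not stated in this PR), and the helper names `cherry_pick_commands` and `run` are illustrative only. Since this PR integrates only minimal changes, the actual commit may have needed manual adjustment beyond a plain cherry-pick.

```python
import subprocess
from typing import List


def cherry_pick_commands(commit: str, remote: str = "upstream") -> List[List[str]]:
    """Build the git commands for fetching a remote and cherry-picking one commit.

    `remote` is an assumed remote name; adjust it to your local setup.
    """
    return [
        ["git", "fetch", remote],
        # -x appends "(cherry picked from commit ...)" to the commit message,
        # which keeps the link to the upstream commit visible in history.
        ["git", "cherry-pick", "-x", commit],
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run: only prints the commands, touches nothing.
run(cherry_pick_commands("478dfe0"))
```

Calling `run(..., dry_run=False)` inside a checkout of the target branch would execute the commands; conflicts, if any, still have to be resolved by hand.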

…onwards

ADLR/megatron-lm!2796 - Extend FileSystemWriterAsync for DCP stream extensions