NGC25.05 + Fix Xielu #87

Open
TJ-Solergibert wants to merge 57 commits into main-run-70b-v1 from ngc25.05-70-8-b

@TJ-Solergibert TJ-Solergibert commented Jul 31, 2025

This branch has been shown to work for both the 70B and 8B models. Recall that we started training with 25.01, so we had to patch parts of the distributed checkpointing code to make it compatible with the updated PyTorch in the container. The branch also hardcodes a modelopt import that raised an error with the newer version when resuming from a checkpoint, and it includes the xielu fix by @AleHD.

Use with: /iopsstor/scratch/cscs/dealmeih/images/ngc-transformers+25.05.sqsh

cc @C-TC @AleHD @dhia680 @martinjaggi @ischlag

dhia680 and others added 30 commits February 26, 2025 00:27
Co-authored-by: Antoni-Joan Solergibert <74564958+TJ-Solergibert@users.noreply.github.com>
Co-authored-by: Antoni-Joan Solergibert <74564958+TJ-Solergibert@users.noreply.github.com>
Co-authored-by: Antoni-Joan Solergibert <74564958+TJ-Solergibert@users.noreply.github.com>
If state["exp_avg_fast"] is set to None, the checkpointing is crashing.
added lm_head in converter when tied and also tempdir logic
Co-authored-by: Alex Hägele <alexander.hagele@epfl.ch>
ischlag and others added 27 commits March 17, 2025 15:21
OP block implementation
conversions: [torch_dist --> torch] & [core --> HF]
fix rope scaling during conversion
Fix core forward pass (rope) during logits-test

6 participants