Training Instability after 10k Steps #428
Unanswered
kevinlu1248
asked this question in
Q&A
Replies: 0 comments
I consistently see some training instability around the 10k-step mark. I'm training an EAGLE-3 draft model for a Qwen2-based target model on 8xH100. Any ideas why?
Flags:
torchrun \
    --standalone \
    --nproc_per_node 8 \
    $ROOT_DIR/scripts/train_eagle3.py \
    --target-model-path $MODEL_PATH \
    --draft-model-config $ROOT_DIR/configs/qwen2.5-7b-eagle3.json \
    --train-data-path /cache/$TRAIN_PATH \
    --eval-data-path /cache/$EVAL_PATH \
    --build-dataset-num-proc $BUILD_DATASET_NUM_PROC \
    --output-dir /cache/$OUTPUT_DIR \
    --num-epochs 3 \
    --batch-size 1 \
    --learning-rate 1e-4 \
    --eval-interval 200 \
    --save-interval 200 \
    --max-length 16384 \
    --chat-template prompt-completion \
    --cache-dir $ROOT_DIR/cache \
    --embedding-key model.embed_tokens.weight \
    --tp-size 1 \
    --ttt-length 12 \
    --target-model-backend sglang \
    --report-to wandb \
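In case it helps narrow this down, here is a rough sketch of the arithmetic for checking whether step 10k lines up with an epoch boundary, and which saved checkpoints bracket it. It assumes the 8 processes run data-parallel (consistent with `--tp-size 1`), that there is no gradient accumulation, and that `--batch-size` is per process. None of that is confirmed above. The function names and the 80,000-sample dataset size are hypothetical placeholders, not values from my run.

```python
import math


def steps_per_epoch(num_samples, per_device_batch=1, world_size=8, grad_accum=1):
    """Optimizer steps in one epoch, assuming pure data parallelism."""
    global_batch = per_device_batch * world_size * grad_accum
    return math.ceil(num_samples / global_batch)


def epochs_completed(step, num_samples, **kwargs):
    """Fractional number of epochs finished after `step` optimizer steps."""
    return step / steps_per_epoch(num_samples, **kwargs)


def bracketing_checkpoints(step, save_interval=200):
    """Last checkpoint at or before `step`, and the first one after it."""
    before = (step // save_interval) * save_interval
    return before, before + save_interval


# Hypothetical dataset size, for illustration only.
num_samples = 80_000
print(steps_per_epoch(num_samples))
print(epochs_completed(10_000, num_samples))
print(bracketing_checkpoints(10_000))
```

If the real dataset size puts an epoch boundary near step 10k, the instability may coincide with the data being revisited. With `--save-interval 200`, the checkpoints on either side of the spike can be compared directly.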