Hi, thanks for releasing SpecForge and the Qwen3-8B EAGLE3 training example.
I ran a learning-rate ablation based on the current examples/run_qwen3_8b_eagle3_online.sh recipe and observed that the default --learning-rate 1e-4 may be too aggressive for the Qwen3-8B EAGLE3 setting, at least during early training.
This is not a full convergence study. Due to compute and disk limits, I only trained for about one epoch plus a small part of the second epoch. The main signal I want to report is the logged grad_norm: with the official 1e-4 LR, the run shows many very large gradient-norm outliers around the middle of the first epoch, while smaller learning rates are much more stable under the same setup.
Setup
Base recipe:
- Script: examples/run_qwen3_8b_eagle3_online.sh
- Target model: Qwen3-8B
- Draft config: configs/qwen3-8b-eagle3.json
- Dataset: ShareGPT-style training data, following the official example's sharegpt_train.jsonl setup
- num_epochs=10
- batch_size=1
- max_length=4096
- chat_template=qwen
- target_model_backend=sglang
- warmup_ratio=0.015
- max_grad_norm=0.5
- ttt_length=7
- tp_size=1
The only intended ablation variable was LR:
- 1e-4
- 5e-5
- 2e-5
- 1e-5
One implementation detail: I used --attention-backend fa for the draft model because the default flex_attention path hit a Triton resource-limit compile error on our RTX A6000 machine. All LR runs used the same FA2 backend, so this should not explain the difference between LR settings.
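For readers interpreting the grad_norm numbers against max_grad_norm=0.5, here is a minimal sketch of how norm clipping rescales a gradient. It assumes the logged grad_norm is the pre-clip total norm, which is what torch.nn.utils.clip_grad_norm_ returns; whether SpecForge logs exactly that value is an assumption, not something verified here. The helper name clip_coefficient is hypothetical.

```python
def clip_coefficient(total_norm: float, max_norm: float = 0.5, eps: float = 1e-6) -> float:
    """Scale factor applied to all gradients by norm clipping.

    Same formula as torch.nn.utils.clip_grad_norm_:
    min(1, max_norm / (total_norm + eps)).
    """
    return min(1.0, max_norm / (total_norm + eps))


# Norms taken from the 1e-4 row of the grad_norm stats table in this report.
# ASSUMPTION: these are pre-clip norms.
for label, norm in [("median", 10.79), ("p99", 144.71), ("max", 5900.89)]:
    coef = clip_coefficient(norm)
    print(f"{label}: grad_norm={norm:.2f} -> scale={coef:.2e}, "
          f"norm after clipping={norm * coef:.3f}")
```

Under that assumption, almost every step is clipped to a norm of about 0.5, so the outliers would mainly indicate unstable gradient directions and scales before clipping rather than unbounded update sizes.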
Training length
The ShareGPT-style training file contained 68,623 rows, so one epoch is about 68.6k steps with batch_size=1.
The runs reached about global_step=75,999 before stopping because the disk filled while saving a checkpoint. The last complete checkpoint was global_step=75,000; the last checkpoint before the end of the first epoch was epoch_0_step_68000.
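The step arithmetic above can be sketched as follows. The row count, step numbers, and hyperparameters come from this report; the warmup calculation assumes warmup_steps = warmup_ratio * total planned steps, which is a common convention but has not been checked against the SpecForge scheduler.

```python
ROWS_PER_EPOCH = 68_623  # rows in the training file; one step per row at batch_size=1
NUM_EPOCHS = 10
WARMUP_RATIO = 0.015


def locate(global_step: int) -> tuple[int, int]:
    """Return (zero-based epoch index, step offset inside that epoch)."""
    return divmod(global_step, ROWS_PER_EPOCH)


stop_epoch, stop_offset = locate(75_999)   # where the runs stopped
ckpt_epoch, ckpt_offset = locate(75_000)   # last complete checkpoint
steps_short_of_epoch_end = ROWS_PER_EPOCH - 68_000  # epoch_0_step_68000

# ASSUMPTION: warmup length is a fraction of the total planned steps.
total_planned_steps = ROWS_PER_EPOCH * NUM_EPOCHS
warmup_steps = int(WARMUP_RATIO * total_planned_steps)

print(f"stopped in epoch {stop_epoch}, {stop_offset} steps in")
print(f"last checkpoint in epoch {ckpt_epoch}, {ckpt_offset} steps in")
print(f"epoch_0_step_68000 is {steps_short_of_epoch_end} steps before the epoch boundary")
print(f"assumed warmup length: {warmup_steps} of {total_planned_steps} planned steps")
```

If the warmup assumption holds, warmup ends well before the middle of the first epoch, so the spike cluster would fall after the LR has already reached its peak value.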
Main observation: logged grad_norm outliers
The 1e-4 run produced many large logged grad_norm outliers. Smaller learning rates were much more stable.
Stats up to global_step=75,000:
| LR | median grad_norm | p95 | p99 | max | count grad_norm > 100 | count grad_norm > 200 |
|---|---|---|---|---|---|---|
| 1e-4 | 10.79 | 53.74 | 144.71 | 5900.89 | 1316 | 434 |
| 5e-5 | 9.10 | 26.26 | 36.33 | 218.68 | 6 | 1 |
| 2e-5 | 14.76 | 33.37 | 43.92 | 443.87 | 13 | 2 |
| 1e-5 | 17.72 | 38.33 | 49.50 | 407.52 | 18 | 2 |
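For anyone who wants to reproduce this kind of summary from their own logs, here is a minimal sketch of the statistics in the table. It assumes the logged grad_norm values have already been parsed into a list of floats; the helper names and the linear-interpolation percentile are illustrative choices, not necessarily the exact method used for the numbers above, and the sample data is synthetic.

```python
import math
import statistics


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(grad_norms):
    """Summary statistics matching the columns of the table above."""
    return {
        "median": statistics.median(grad_norms),
        "p95": percentile(grad_norms, 95),
        "p99": percentile(grad_norms, 99),
        "max": max(grad_norms),
        "count_gt_100": sum(g > 100 for g in grad_norms),
        "count_gt_200": sum(g > 200 for g in grad_norms),
    }


# Synthetic stand-in for a parsed training log: mostly calm, a few spikes.
sample = [10.0] * 96 + [120.0, 150.0, 250.0, 900.0]
for name, value in summarize(sample).items():
    print(f"{name}: {value}")
```

The median and upper percentiles describe the typical step, while the two threshold counts capture how often the run leaves that regime, which is where the 1e-4 run differs most.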
Visually, the 1e-4 run has a large cluster of grad-norm spikes around the middle of the first epoch, while the lower-LR runs remain much smoother.
Training metrics near the last complete checkpoint
At global_step=75,000, all three lower-LR runs also had lower training loss and higher training accuracy than the 1e-4 run:
| LR | loss | acc | grad_norm |
|---|---|---|---|
| 1e-4 | 1.85 | 0.51 | 11.22 |
| 5e-5 | 1.40 | 0.60 | 12.96 |
| 2e-5 | 1.43 | 0.60 | 19.66 |
| 1e-5 | 1.64 | 0.56 | 23.69 |
Among the LRs I tried, 5e-5 looked best overall. 2e-5 was close in training loss/acc but slightly less favorable, and 1e-5 appeared slower. I do not claim 5e-5 is globally optimal, only that it looked clearly better than 1e-4 in this run.
Small GSM8K decode-style check
I also ran a small EAGLE3 teacher-forced / greedy-rollout check on the first 16 GSM8K test examples, formatted with the Qwen chat template.
For this check I used the last checkpoint before the end of the first epoch, epoch_0_step_68000. I could only evaluate 1e-4 and 5e-5 because I had deleted the 2e-5 and 1e-5 checkpoints during disk cleanup.
| LR | teacher-forced accept len | greedy-rollout accept len | teacher-forced mean acc | greedy-rollout mean acc |
|---|---|---|---|---|
| 1e-4 | 0.9367 | 0.8919 | 0.3589 | 0.2064 |
| 5e-5 | 1.6861 | 1.5612 | 0.5545 | 0.3122 |
Greedy-rollout per-position accuracy:
| LR | greedy-rollout per-position acc |
|---|---|
| 1e-4 | [0.5706, 0.3225, 0.1946, 0.1300, 0.0864, 0.0706, 0.0701] |
| 5e-5 | [0.7310, 0.4846, 0.3388, 0.2365, 0.1664, 0.1292, 0.0992] |
This small decode-style check is consistent with the training curves: 5e-5 looked substantially better than 1e-4 at approximately the same training stage.
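As a quick consistency check on the two tables above, the reported greedy-rollout mean acc matches the unweighted mean of the seven per-position accuracies (one per draft position, ttt_length=7). The sketch below only re-derives numbers already listed in this report; the variable names are illustrative.

```python
from statistics import fmean

# Values copied from the GSM8K tables in this report.
per_position_acc = {
    "1e-4": [0.5706, 0.3225, 0.1946, 0.1300, 0.0864, 0.0706, 0.0701],
    "5e-5": [0.7310, 0.4846, 0.3388, 0.2365, 0.1664, 0.1292, 0.0992],
}
reported_mean_acc = {"1e-4": 0.2064, "5e-5": 0.3122}
reported_accept_len = {"1e-4": 0.8919, "5e-5": 1.5612}

for lr, accs in per_position_acc.items():
    mean_acc = fmean(accs)
    print(f"LR {lr}: mean of per-position acc = {mean_acc:.4f} "
          f"(reported {reported_mean_acc[lr]:.4f})")

# Relative gain in greedy-rollout accept length from lowering the LR.
ratio = reported_accept_len["5e-5"] / reported_accept_len["1e-4"]
print(f"greedy-rollout accept len ratio (5e-5 / 1e-4): {ratio:.2f}")
```

The gap is present at every draft position, not only at the first one, which suggests the 5e-5 advantage is not limited to the easiest prediction step.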
Question / suggestion
Is 1e-4 the intended LR for Qwen3-8B EAGLE3, or could it be too high for this example?
Would you consider lowering the default LR in examples/run_qwen3_8b_eagle3_online.sh, or documenting that users may want to try a smaller LR such as 5e-5 for Qwen3-8B?
Again, this is only an early-training ablation and not a full best-LR sweep. I am mainly reporting that 1e-4 shows unusually large grad-norm outliers under a setup close to the official example, while smaller LRs look more stable and 5e-5 also gives better early decode-style metrics.