Qwen3-8B EAGLE3 example: learning_rate=1e-4 shows large grad-norm outliers in early training #577

Description

@heiheiha798

Hi, thanks for releasing SpecForge and the Qwen3-8B EAGLE3 training example.

  • I have searched related issues but did not find the same LR / grad_norm issue.
  • I have checked the current example script and this still applies to the latest version I tested.
  • I am providing the environment, reproduction setup, logs/metrics, and a small decode-style check.
  • This is a bug / training-recipe stability report rather than a general usage question.
  • I am using English for this issue.

I ran a learning-rate ablation based on the current examples/run_qwen3_8b_eagle3_online.sh recipe and observed that the default --learning-rate 1e-4 may be too aggressive for the Qwen3-8B EAGLE3 setting, at least during early training.

This is not a full convergence study. Due to compute and disk limits, I only trained for about one epoch plus a small part of the second. The main signal I want to report is the logged grad_norm: with the example's default LR of 1e-4, the run shows many very large gradient-norm outliers around the middle of the first epoch, while smaller learning rates are much more stable under the same setup.

Setup

Base recipe:

  • Script: examples/run_qwen3_8b_eagle3_online.sh
  • Target model: Qwen3-8B
  • Draft config: configs/qwen3-8b-eagle3.json
  • Dataset: ShareGPT-style training data, following the official example's sharegpt_train.jsonl setup
  • num_epochs=10
  • batch_size=1
  • max_length=4096
  • chat_template=qwen
  • target_model_backend=sglang
  • warmup_ratio=0.015
  • max_grad_norm=0.5
  • ttt_length=7
  • tp_size=1
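
For readers interpreting the logged grad_norm values against max_grad_norm=0.5, here is a minimal sketch of norm-based clipping. It assumes the trainer clips with the same rule as `torch.nn.utils.clip_grad_norm_` (scale by `max_norm / (total_norm + 1e-6)`, capped at 1) and that the logged value is the pre-clip total norm; neither is confirmed for SpecForge here. The function name and the sample norms are illustrative only.

```python
def clip_coefficient(total_norm: float, max_norm: float, eps: float = 1e-6) -> float:
    """Factor that norm-based clipping multiplies every gradient by.

    ASSUMPTION: mirrors the rule used by torch.nn.utils.clip_grad_norm_;
    whether SpecForge logs the pre-clip norm is not verified here.
    """
    return min(1.0, max_norm / (total_norm + eps))


max_grad_norm = 0.5  # from the setup list above

# Illustrative pre-clip norms, not measured values.
for total_norm in (0.4, 10.0, 100.0):
    coef = clip_coefficient(total_norm, max_grad_norm)
    print(f"total_norm={total_norm:7.2f} -> scale={coef:.6f}")
```

Under this rule, any step whose total norm exceeds 0.5 is rescaled, so a logged norm of 10 would correspond to gradients being scaled by roughly 0.05 before the optimizer step.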

The only intended ablation variable was LR:

  • 1e-4
  • 5e-5
  • 2e-5
  • 1e-5

One implementation detail: I used --attention-backend fa for the draft model because the default flex_attention path hit a Triton resource-limit compile error on our RTX A6000 machine. All LR runs used the same FA2 backend, so the backend change should not explain the difference between LR settings.

Training length

The ShareGPT-style training file contained 68,623 rows, so one epoch is about 68.6k steps with batch_size=1.

The runs reached about global_step=75,999 before stopping because the disk filled while saving a checkpoint. The last complete checkpoint was global_step=75,000; the last checkpoint before the end of the first epoch was epoch_0_step_68000.
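
The step counts above can be reproduced with a short calculation. This is a sketch under two assumptions that are not stated in the issue: a single data-parallel rank (`world_size = 1`), and `warmup_ratio` being applied to the total number of planned steps. The variable names are illustrative.

```python
rows = 68_623          # rows in the training file (from the issue)
batch_size = 1         # from the setup list
num_epochs = 10        # from the setup list
warmup_ratio = 0.015   # from the setup list
world_size = 1         # ASSUMPTION: one data-parallel rank

steps_per_epoch = rows // (batch_size * world_size)
total_steps = steps_per_epoch * num_epochs
# ASSUMPTION: warmup length is a fraction of the total planned steps.
warmup_steps = int(total_steps * warmup_ratio)

last_complete_ckpt = 75_000  # from the issue
steps_into_epoch_2 = last_complete_ckpt - steps_per_epoch

print(f"steps per epoch:      {steps_per_epoch}")
print(f"planned total steps:  {total_steps}")
print(f"warmup steps:         {warmup_steps}")
print(f"steps into 2nd epoch: {steps_into_epoch_2} "
      f"({steps_into_epoch_2 / steps_per_epoch:.1%} of an epoch)")
```

Under these assumptions, the last complete checkpoint sits 6,377 steps (roughly 9% of an epoch) into the second epoch, and warmup would end near step 10,293.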

Main observation: logged grad_norm outliers

The 1e-4 run produced many large logged grad_norm outliers. Smaller learning rates were much more stable.

Stats up to global_step=75,000:

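| LR | median grad_norm | p95 | p99 | max | count grad_norm > 100 | count grad_norm > 200 |
|---|---|---|---|---|---|---|
| 1e-4 | 10.79 | 53.74 | 144.71 | 5900.89 | 1316 | 434 |
| 5e-5 | 9.10 | 26.26 | 36.33 | 218.68 | 6 | 1 |
| 2e-5 | 14.76 | 33.37 | 43.92 | 443.87 | 13 | 2 |
| 1e-5 | 17.72 | 38.33 | 49.50 | 407.52 | 18 | 2 |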
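
For anyone who wants to compute the same kind of summary from their own logs, here is a minimal sketch. It assumes the grad_norm values have already been extracted into a plain list of floats; the function names and the linear-interpolation percentile are my choices for illustration, not necessarily how the numbers above were produced.

```python
import math
import statistics


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("percentile() needs at least one value")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def summarize_grad_norms(values, thresholds=(100.0, 200.0)):
    """Median, p95, p99, max, and outlier counts for a list of grad norms."""
    vals = sorted(values)
    summary = {
        "median": statistics.median(vals),
        "p95": percentile(vals, 95),
        "p99": percentile(vals, 99),
        "max": vals[-1],
    }
    for t in thresholds:
        # Count steps whose grad norm is strictly above the threshold.
        summary[f"count>{t:g}"] = sum(v > t for v in vals)
    return summary


# Toy input, not real training data.
print(summarize_grad_norms([1.0, 2.0, 3.0, 4.0, 250.0]))
```

Reporting the outlier counts alongside the percentiles matters here because a handful of extreme steps barely moves the median but dominates the max.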

Visually, the 1e-4 run has a large cluster of grad-norm spikes around the middle of the first epoch, while the lower-LR runs remain much smoother.

Image

Training metrics near the last complete checkpoint

At global_step=75,000, the lower-LR runs also had better training metrics:

| LR | loss | acc | grad_norm |
|---|---|---|---|
| 1e-4 | 1.85 | 0.51 | 11.22 |
| 5e-5 | 1.40 | 0.60 | 12.96 |
| 2e-5 | 1.43 | 0.60 | 19.66 |
| 1e-5 | 1.64 | 0.56 | 23.69 |

Among the LRs I tried, 5e-5 looked best overall. 2e-5 was close in training loss/acc but slightly less favorable, and 1e-5 appeared slower. I do not claim 5e-5 is globally optimal, only that it looked clearly better than 1e-4 in this run.

Image Image

Small GSM8K decode-style check

I also ran a small EAGLE3 teacher-forced / greedy-rollout check on the first 16 GSM8K test examples, formatted with the Qwen chat template.

For this check I used the last checkpoint before the end of the first epoch, epoch_0_step_68000. I could only evaluate 1e-4 and 5e-5 because I had deleted the 2e-5 and 1e-5 checkpoints during disk cleanup.

| LR | teacher-forced accept len | greedy-rollout accept len | teacher-forced mean acc | greedy-rollout mean acc |
|---|---|---|---|---|
| 1e-4 | 0.9367 | 0.8919 | 0.3589 | 0.2064 |
| 5e-5 | 1.6861 | 1.5612 | 0.5545 | 0.3122 |

Greedy-rollout per-position accuracy:

| LR | greedy-rollout per-position acc |
|---|---|
| 1e-4 | [0.5706, 0.3225, 0.1946, 0.1300, 0.0864, 0.0706, 0.0701] |
| 5e-5 | [0.7310, 0.4846, 0.3388, 0.2365, 0.1664, 0.1292, 0.0992] |
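
To make the metrics above easier to reproduce, here is a minimal sketch of how per-position accuracy and an accept length could be computed from draft and target token IDs. The definitions are assumptions for illustration (accept length as the number of leading draft tokens that match the target), and the helper names are hypothetical; the evaluation script used for this issue may differ. The last part only checks arithmetic on the numbers reported above.

```python
def accept_length(draft_tokens, target_tokens):
    """Number of leading draft tokens that match the target.

    ASSUMPTION: prefix-match definition, chosen for illustration.
    """
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n


def per_position_accuracy(drafts, targets, depth):
    """Fraction of samples whose draft token matches the target at each position."""
    hits = [0] * depth
    for draft, target in zip(drafts, targets):
        for k in range(depth):
            hits[k] += int(draft[k] == target[k])
    return [h / len(drafts) for h in hits]


# Per-position accuracies exactly as reported in the issue.
reported = {
    "1e-4": [0.5706, 0.3225, 0.1946, 0.1300, 0.0864, 0.0706, 0.0701],
    "5e-5": [0.7310, 0.4846, 0.3388, 0.2365, 0.1664, 0.1292, 0.0992],
}
for lr, accs in reported.items():
    print(lr, round(sum(accs) / len(accs), 4))
```

The mean of each reported per-position list comes out to 0.2064 for 1e-4 and 0.3122 for 5e-5, which matches the greedy-rollout mean acc column in the table above.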

This small decode-style check is consistent with the training curves: 5e-5 looked substantially better than 1e-4 at approximately the same training stage.

Question / suggestion

Is 1e-4 the intended LR for Qwen3-8B EAGLE3, or could it be too high for this example?

Would you consider lowering the default LR in examples/run_qwen3_8b_eagle3_online.sh, or documenting that users may want to try a smaller LR such as 5e-5 for Qwen3-8B?

Again, this is only an early-training ablation and not a full best-LR sweep. I am mainly reporting that 1e-4 shows unusually large grad-norm outliers under a setup close to the official example, while smaller LRs look more stable and 5e-5 also gives better early decode-style metrics.
