Qwen3-8B EAGLE3 example: learning_rate=1e-4 shows large grad-norm outliers in early training

Hi, thanks for releasing SpecForge and the Qwen3-8B EAGLE3 training example.

- [x] I have searched related issues but did not find the same LR / grad_norm issue.
- [x] I have checked the current example script and this still applies to the latest version I tested.
- [x] I am providing the environment, reproduction setup, logs/metrics, and a small decode-style check.
- [x] This is a bug / training-recipe stability report rather than a general usage question.
- [x] I am using English for this issue.

I ran a learning-rate ablation based on the current `examples/run_qwen3_8b_eagle3_online.sh` recipe and observed that the default `--learning-rate 1e-4` may be too aggressive for the Qwen3-8B EAGLE3 setting, at least during early training.

This is not a full convergence study. Due to compute and disk limits, I only trained for about one epoch plus a small part of the second. The main signal I want to report is the logged `grad_norm`: with the official `1e-4` LR, the run shows many very large gradient-norm outliers around the middle of the first epoch, while smaller learning rates are much more stable under the same setup.

## Setup

Base recipe:

- Script: `examples/run_qwen3_8b_eagle3_online.sh`
- Target model: Qwen3-8B
- Draft config: `configs/qwen3-8b-eagle3.json`
- Dataset: ShareGPT-style training data, following the official example's `sharegpt_train.jsonl` setup
- `num_epochs=10`
- `batch_size=1`
- `max_length=4096`
- `chat_template=qwen`
- `target_model_backend=sglang`
- `warmup_ratio=0.015`
- `max_grad_norm=0.5`
- `ttt_length=7`
- `tp_size=1`

The only intended ablation variable was LR:

- `1e-4`
- `5e-5`
- `2e-5`
- `1e-5`

One implementation detail: I used `--attention-backend fa` for the draft model because the default `flex_attention` path hit a Triton resource-limit compile error on our RTX A6000 machine. All LR runs used the same FA2 backend, so this should not explain the difference between LR settings.

## Training length

The ShareGPT-style training file contained 68,623 rows, so one epoch is about 68.6k steps with `batch_size=1`.

The runs reached about `global_step=75,999` before stopping because the disk filled while saving a checkpoint. The last complete checkpoint was `global_step=75,000`; the last checkpoint before the end of the first epoch was `epoch_0_step_68000`.
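
For reference, the step counts above translate into epochs as follows. This is only a back-of-the-envelope sketch: it assumes one row per optimizer step (no gradient accumulation, a single data-parallel rank), which is my reading of `batch_size=1` rather than something I verified in the trainer. The variable names are illustrative.

```python
# Numbers taken from the text above.
rows_per_epoch = 68_623      # rows in the ShareGPT-style training file
batch_size = 1               # assumption: one row per optimizer step
last_checkpoint_step = 75_000
last_logged_step = 75_999

# With batch_size=1, steps per epoch equals the row count.
steps_per_epoch = rows_per_epoch // batch_size

# How far past the first epoch the last complete checkpoint is.
steps_into_second_epoch = last_checkpoint_step - steps_per_epoch
epochs_at_checkpoint = last_checkpoint_step / steps_per_epoch
epochs_at_last_log = last_logged_step / steps_per_epoch

print(f"steps per epoch:          {steps_per_epoch}")
print(f"steps into second epoch:  {steps_into_second_epoch}")
print(f"epochs at last checkpoint: {epochs_at_checkpoint:.3f}")
print(f"epochs at last logged step: {epochs_at_last_log:.3f}")
```

So the last complete checkpoint sits roughly 6.4k steps into the second epoch.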

## Main observation: logged grad_norm outliers

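The `1e-4` run produced many large logged `grad_norm` outliers: 1316 steps above 100, versus at most 18 for any of the smaller learning rates. The smaller LRs had a far lighter tail, even though the median `grad_norm` at `2e-5` and `1e-5` was somewhat higher than at `1e-4`.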

Stats up to `global_step=75,000`:

| LR | median grad_norm | p95 | p99 | max | count grad_norm > 100 | count grad_norm > 200 |
|---|---:|---:|---:|---:|---:|---:|
| `1e-4` | 10.79 | 53.74 | 144.71 | 5900.89 | 1316 | 434 |
| `5e-5` | 9.10 | 26.26 | 36.33 | 218.68 | 6 | 1 |
| `2e-5` | 14.76 | 33.37 | 43.92 | 443.87 | 13 | 2 |
| `1e-5` | 17.72 | 38.33 | 49.50 | 407.52 | 18 | 2 |

Visually, the `1e-4` run has a large cluster of grad-norm spikes around the middle of the first epoch, while the lower-LR runs remain much smoother.
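
For anyone who wants to reproduce the summary columns on their own logs, a minimal sketch of the statistics is below. It assumes the per-step `grad_norm` values have already been extracted into a plain list of floats; the function names and the linear-interpolation percentile are illustrative choices, not necessarily the exact method behind the table, so values may differ slightly in the tail.

```python
import math
import statistics


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of pre-sorted values."""
    if not sorted_vals:
        raise ValueError("need at least one value")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def summarize_grad_norms(grad_norms):
    """Return the same columns as the table: median, p95, p99, max, tail counts."""
    vals = sorted(grad_norms)
    return {
        "median": statistics.median(vals),
        "p95": percentile(vals, 95),
        "p99": percentile(vals, 99),
        "max": vals[-1],
        "count_gt_100": sum(v > 100 for v in vals),
        "count_gt_200": sum(v > 200 for v in vals),
    }


# Synthetic values for illustration only, not the logged data from these runs.
example = [8.0, 9.5, 10.0, 11.2, 12.0, 150.0, 250.0]
summary = summarize_grad_norms(example)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The tail counts are the most informative columns here, since the medians of the four runs are close while the number of steps above 100 differs by about two orders of magnitude.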

<img width="1800" height="990" alt="Image" src="https://github.com/user-attachments/assets/92993bd0-af05-4d65-b224-8466ce22a5f7" />

## Training metrics near the last complete checkpoint

At `global_step=75,000`, the lower-LR runs also had better training metrics:

| LR | loss | acc | grad_norm |
|---|---:|---:|---:|
| `1e-4` | 1.85 | 0.51 | 11.22 |
| `5e-5` | 1.40 | 0.60 | 12.96 |
| `2e-5` | 1.43 | 0.60 | 19.66 |
| `1e-5` | 1.64 | 0.56 | 23.69 |

Among the LRs I tried, `5e-5` looked best overall. `2e-5` was close in training loss/acc but slightly less favorable, and `1e-5` appeared slower. I do not claim `5e-5` is globally optimal, only that it looked clearly better than `1e-4` in this run.

<img width="1800" height="990" alt="Image" src="https://github.com/user-attachments/assets/c5dafff7-3a7b-4eef-8db2-5d0805b741fe" />

<img width="1800" height="990" alt="Image" src="https://github.com/user-attachments/assets/8de39a3f-88de-4dc9-9dd9-4cb872a3f8db" />

## Small GSM8K decode-style check

I also ran a small EAGLE3 teacher-forced / greedy-rollout check on the first 16 GSM8K test examples, formatted with the Qwen chat template.

For this check I used the last checkpoint before the end of the first epoch, `epoch_0_step_68000`. I could only evaluate `1e-4` and `5e-5` because I had deleted the `2e-5` and `1e-5` checkpoints during disk cleanup.

| LR | teacher-forced accept len | greedy-rollout accept len | teacher-forced mean acc | greedy-rollout mean acc |
|---|---:|---:|---:|---:|
| `1e-4` | 0.9367 | 0.8919 | 0.3589 | 0.2064 |
| `5e-5` | 1.6861 | 1.5612 | 0.5545 | 0.3122 |

Greedy-rollout per-position accuracy:

| LR | greedy-rollout per-position acc |
|---|---|
| `1e-4` | `[0.5706, 0.3225, 0.1946, 0.1300, 0.0864, 0.0706, 0.0701]` |
| `5e-5` | `[0.7310, 0.4846, 0.3388, 0.2365, 0.1664, 0.1292, 0.0992]` |

This small decode-style check is consistent with the training curves: `5e-5` looked substantially better than `1e-4` at approximately the same training stage.
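
As a quick consistency check on the numbers above, the reported greedy-rollout mean acc matches the unweighted mean of the seven per-position values (seven presumably corresponding to `ttt_length=7`). The sketch below only re-does that arithmetic on the values from the tables; the helper name is illustrative and it does not reproduce how accept length was computed.

```python
# Per-position greedy-rollout accuracies, copied from the table above.
per_position_acc = {
    "1e-4": [0.5706, 0.3225, 0.1946, 0.1300, 0.0864, 0.0706, 0.0701],
    "5e-5": [0.7310, 0.4846, 0.3388, 0.2365, 0.1664, 0.1292, 0.0992],
}

# Greedy-rollout mean acc as reported in the summary table.
reported_mean_acc = {"1e-4": 0.2064, "5e-5": 0.3122}


def mean_acc(values):
    """Unweighted mean over draft positions."""
    return sum(values) / len(values)


for lr, accs in per_position_acc.items():
    recomputed = mean_acc(accs)
    delta = abs(recomputed - reported_mean_acc[lr])
    print(f"LR {lr}: recomputed={recomputed:.4f} "
          f"reported={reported_mean_acc[lr]:.4f} |diff|={delta:.5f}")
```

Both LRs agree with the reported mean to four decimal places, so the two tables are internally consistent.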

## Question / suggestion

Is `1e-4` the intended LR for Qwen3-8B EAGLE3, or could it be too high for this example?

Would you consider lowering the default LR in `examples/run_qwen3_8b_eagle3_online.sh`, or documenting that users may want to try a smaller LR such as `5e-5` for Qwen3-8B?

Again, this is only an early-training ablation and not a full best-LR sweep. I am mainly reporting that `1e-4` shows unusually large grad-norm outliers under a setup close to the official example, while smaller LRs look more stable and `5e-5` also gives better early decode-style metrics.

