A systematic two-stage benchmark comparing optimizers for LLM pre-training at scale.
Stage 1 (Broad Screening) sweeps 24+ optimizers on C4 under the LLaMA-3 architecture at four scales (60M, 130M, 350M, and 1B) and compares them by final C4 validation perplexity.
Stage 2 (High-Quality Generalization) transfers the stronger Stage-1 optimizers to FineWeb-Edu with 32k sequences, at 340M and 1B, across four architectures: Transformer++, GLA, DeltaNet, and Gated DeltaNet.
A strict controlled-variable protocol is used: only optimizer hyperparameters (lr, betas, eps, method-specific knobs) are tuned per optimizer; all architectural, data, and schedule settings are held fixed.
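The controlled-variable protocol can be pictured as a run grid in which the fixed settings are copied unchanged into every run and only optimizer hyperparameters vary. The sketch below is illustrative only: `FixedSettings`, `SEARCH_SPACE`, and `build_runs` are hypothetical names, and the search values are placeholders rather than the grids actually swept in the benchmark. The fixed values are taken from the 350M Stage 1 launch command.

```python
from dataclasses import dataclass, asdict
from itertools import product


@dataclass(frozen=True)
class FixedSettings:
    """Settings shared by every optimizer (350M Stage 1 values)."""
    model_config: str = "configs/llama_350m.json"
    num_training_steps: int = 60_000
    warmup_steps: int = 6_000
    weight_decay: float = 0.0
    dtype: str = "bfloat16"


# Hypothetical search spaces: placeholder values, not the benchmark's grids.
SEARCH_SPACE = {
    "adamw": {"lr": [1e-3, 3e-3], "beta2": [0.95, 0.999]},
    "muon": {"lr": [3e-3, 6e-3], "beta2": [0.95]},
}


def build_runs(fixed, search_space):
    """Expand each optimizer's grid; fixed settings are copied verbatim."""
    runs = []
    for name, grid in search_space.items():
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            run = asdict(fixed)
            run["optimizer"] = name
            run.update(zip(keys, values))
            runs.append(run)
    return runs


runs = build_runs(FixedSettings(), SEARCH_SPACE)
print(len(runs), "runs; fixed keys identical across all of them")
```

Freezing the dataclass makes accidental per-optimizer changes to the shared settings fail loudly instead of silently skewing the comparison.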
scaling-for-scaling/
├── Stage1-C4-Llama3/
│   ├── torchrun_main.py          # main training entry point
│   ├── opt/                      # 24+ optimizer implementations
│   ├── utils/                    # dataloader, modeling, training utilities
│   ├── configs/                  # LLaMA-3 model configs (60M to 1B)
│   └── scripts/                  # best-hyperparameter launch scripts
│       ├── run_350m_best_opts.sh
│       └── run_1b_best_opts.sh
│
└── Stage2-FWE/
    ├── trainer/flame/            # FLA-based training framework
    │   └── scripts/
    │       ├── run_340m.sh       # best-hyperparameter launch script (340M, 4 archs)
    │       └── run_1b.sh         # best-hyperparameter launch script (1B, 4 archs)
    ├── opentome/                 # optimizer integrations and model implementations
    └── evaluations/              # downstream evaluation framework
Stage 1: AdamW, AdaBelief, Adafactor, Adam8bit, Adam-mini, AdamP, Adan, CAME, Conda, GaLore, LAMB, Lion, MARS-AdamW, MARS-Lion, MARS-Shampoo, Muon, NAdam, Prodigy, RAdam, RMNP, Shampoo, SOAP, Sophia, APOLLO.
Stage 2 (carried forward): AdamW, AdamP, Adan, Lion, MARS-AdamW, MARS-Lion, MARS-Shampoo, Muon, RMNP, SOAP, APOLLO, Conda.
cd Stage1-C4-Llama3
pip install -r requirements.txt
Key dependencies: PyTorch ≥ 2.1, Transformers, Datasets, bitsandbytes (optional, for 8-bit optimizers).
To use a HuggingFace mirror (e.g. in mainland China):
export HF_ENDPOINT=https://hf-mirror.com
cd Stage2-FWE
conda env create -f fla_environment.yml
conda activate fla
pip install -e .
Requires Flash Linear Attention (FLA) for the GLA, DeltaNet, and Gated DeltaNet architectures.
cd Stage1-C4-Llama3
# 350M (60k steps, 4 GPUs)
bash scripts/run_350m_best_opts.sh muon
bash scripts/run_350m_best_opts.sh apollo
# 1B (100k steps, 8 GPUs)
bash scripts/run_1b_best_opts.sh muon
bash scripts/run_1b_best_opts.sh apollo
Or launch manually:
torchrun --standalone --nproc_per_node 4 torchrun_main.py \
--model_config configs/llama_350m.json \
--optimizer muon --lr 6e-3 --beta1 0.9 --beta2 0.95 --eps 1e-8 \
--batch_size 128 --total_batch_size 512 \
--num_training_steps 60000 --warmup_steps 6000 \
--weight_decay 0.0 --dtype bfloat16
For Stage 2, edit the three path variables at the top of the launch scripts (DATASET_PATH, TOKENIZER_PATH, VAL_DATA_DIR), then run:
cd Stage2-FWE/trainer/flame
# 340M (30720 steps, 8 GPUs)
bash scripts/run_340m.sh gla muon
bash scripts/run_340m.sh transformer apollo
# 1B (30720 steps, 8 GPUs)
bash scripts/run_1b.sh gla muon
bash scripts/run_1b.sh transformer apollo
Available architectures: transformer, gla, deltanet, gated_deltanet
Available optimizers: adamw, adamp, adan, lion, mars_adamw, mars_lion, mars_shampoo, muon, rmnp, soap, apollo, conda
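To cover the full Stage 2 grid rather than launching runs one at a time, the architecture and optimizer names above can be expanded into launch commands. This is a minimal sketch: `stage2_commands` is a hypothetical helper, not part of the repository, and it only builds the argument lists in the `bash scripts/run_340m.sh <arch> <optimizer>` form shown above without executing anything.

```python
from itertools import product

ARCHITECTURES = ["transformer", "gla", "deltanet", "gated_deltanet"]
OPTIMIZERS = [
    "adamw", "adamp", "adan", "lion", "mars_adamw", "mars_lion",
    "mars_shampoo", "muon", "rmnp", "soap", "apollo", "conda",
]


def stage2_commands(script="scripts/run_340m.sh"):
    """Return one argv list per (architecture, optimizer) pair."""
    return [
        ["bash", script, arch, opt]
        for arch, opt in product(ARCHITECTURES, OPTIMIZERS)
    ]


commands = stage2_commands()
print(len(commands), "runs")
for cmd in commands[:3]:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from inside `Stage2-FWE/trainer/flame`; pass `script="scripts/run_1b.sh"` for the 1B grid.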
| Stage | Dataset | Architecture | Scales | Seq Len | Steps |
|---|---|---|---|---|---|
| 1 | C4 | LLaMA-3 | 60M, 130M, 350M, 1B | 256 | 10k / 20k / 60k / 100k |
| 2 | FineWeb-Edu | Transformer++, GLA, DeltaNet, Gated DeltaNet | 340M, 1B | 32k | ~30k |
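As a rough check on the Stage 1 token budget, the numbers in the manual 350M launch command can be combined with the sequence length in the table. The sketch assumes that `--batch_size` is the per-GPU batch and that `--total_batch_size` counts sequences per optimizer step; the README does not state this, so treat the result as an estimate under those assumptions. The function names are hypothetical.

```python
def grad_accum_steps(total_batch_size, per_gpu_batch, n_gpus):
    """Accumulation steps needed to reach the global batch (assumed semantics)."""
    per_pass = per_gpu_batch * n_gpus
    if total_batch_size % per_pass != 0:
        raise ValueError("total batch must be a multiple of per-pass batch")
    return total_batch_size // per_pass


def tokens_per_step(total_batch_size, seq_len):
    """Tokens consumed by one optimizer step."""
    return total_batch_size * seq_len


# Values from the 350M manual launch command and the Stage 1 table row.
accum = grad_accum_steps(total_batch_size=512, per_gpu_batch=128, n_gpus=4)
step_tokens = tokens_per_step(total_batch_size=512, seq_len=256)
total_tokens = step_tokens * 60_000

print(accum)         # 1
print(step_tokens)   # 131072
print(total_tokens)  # 7864320000
```

Under these assumptions the 350M run sees about 7.9B tokens with no gradient accumulation on 4 GPUs.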
Cosine LR schedule, linear warmup (10% of steps), weight decay 0.0, gradient clipping 1.0.
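The schedule described above can be sketched as a pure function of the step index. This is a generic linear-warmup plus cosine-decay formula, not the repository's scheduler: `lr_at` is a hypothetical name, and the `min_lr_ratio` floor (zero by default here) is an assumption, since the README does not say what the learning rate decays to.

```python
import math


def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr_ratio=0.0):
    """Linear warmup to peak_lr, then cosine decay to peak_lr * min_lr_ratio."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


# 350M Stage 1 values: 60k steps, 10% warmup, Muon peak lr 6e-3.
for step in (0, 3_000, 6_000, 33_000, 60_000):
    print(step, lr_at(step, total_steps=60_000, warmup_steps=6_000, peak_lr=6e-3))
```

Gradient clipping at 1.0 is applied separately, for example with `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before each optimizer step.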
- Stage 1 training framework is built on APOLLO, which is based on GaLore and Q-GaLore.
- Stage 2 training framework is built on OpenToMe, which integrates flame from the flash-linear-attention project.
@article{scaling-for-scaling,
title = {Scaling for Scaling: A Benchmark of Optimizers for LLM Pre-training},
year = {2026},
}