This repository contains an end-to-end LLM experimentation stack spanning tokenizer training, pretraining utilities, data curation, systems profiling, scaling analysis, and post-training experiments. The code is organized as independent, function-driven modules to support repeatable experiments and comparable outputs.
tokenizer/— tokenizer training and language-model pretraining scripts.data_pipeline/— corpus processing, filtering, and quality/safety analysis tools.systems/— distributed and precision-focused systems experiments.scaling/— scaling-law and IsoFLOPs analyses.posttraining/— post-training, preference optimization, and alignment workflows.results/— saved experiment outputs.figures/— generated plots and visual summaries.docs/— technical notes and runbooks.
- Tokenizer module: byte-level BPE training, tokenizer evaluation, and LM training/ablation scripts.
- Data pipeline module: HTML extraction, language identification, PII masking, quality filtering, toxicity/NSFW classification, and deduplication helpers.
- Systems module: communication benchmarks, profiling workflows, and DDP efficiency studies.
- Scaling module: compute/data scaling estimates and visualization scripts.
- Post-training module: SFT and policy optimization experiments with leaderboard-oriented evaluation flows.
- BF16 autocast improved throughput over FP32 in systems runs with measured speedups around 1.22x, 1.36x, and 1.45x.
- Naive DDP profiling runs showed communication occupying about 62% of per-step time.
- Sequence-length stress tests ran to 8192 tokens and failed at 16384, consistent with attention-memory scaling limits.
- In one post-training budget sweep,
ppo_epoch=2outperformedppo_epoch=3on validation accuracy with a corresponding entropy/exploration trade-off.
- Use
uvfor environment and command execution:uv run <command>
- Typical test entrypoint:
uv run pytest
- Experiment outputs are preserved under
results/, with related figures infigures/and supporting notes indocs/.
- This repository contains archived experiment outputs that may reference historical run paths.
- Hardware-specific performance observations depend on the exact GPU, driver, and runtime stack used during execution.
- Reported metrics reflect tracked runs in this repository and should be interpreted within that experimental scope.