From Tokenizer to Post-Training: LLM Research Engineering Stack

Overview

This repository contains an end-to-end LLM experimentation stack spanning tokenizer training, pretraining utilities, data curation, systems profiling, scaling analysis, and post-training experiments. The code is organized as independent, function-driven modules to support repeatable experiments and comparable outputs.

Repository layout

tokenizer/ — tokenizer training and language-model pretraining scripts.
data_pipeline/ — corpus processing, filtering, and quality/safety analysis tools.
systems/ — distributed and precision-focused systems experiments.
scaling/ — scaling-law and IsoFLOPs analyses.
posttraining/ — post-training, preference optimization, and alignment workflows.
results/ — saved experiment outputs.
figures/ — generated plots and visual summaries.
docs/ — technical notes and runbooks.

Main components

Tokenizer module: byte-level BPE training, tokenizer evaluation, and LM training/ablation scripts.
Data pipeline module: HTML extraction, language identification, PII masking, quality filtering, toxicity/NSFW classification, and deduplication helpers.
Systems module: communication benchmarks, profiling workflows, and DDP efficiency studies.
Scaling module: compute/data scaling estimates and visualization scripts.
Post-training module: SFT and policy optimization experiments with leaderboard-oriented evaluation flows.
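
As an illustration of what the tokenizer module's byte-level BPE training involves, the following is a minimal sketch and not the repository's implementation. The function names (train_byte_bpe, merge_pair, decode) are hypothetical, and the sketch assumes no regex pre-tokenization and no special tokens, both of which practical tokenizers usually add.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_byte_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`.

    Returns (merges, ids): `merges` maps (left_id, right_id) -> new_id in
    the order the merges were learned, and `ids` is the encoded training text.
    """
    ids = list(text.encode("utf-8"))  # base vocabulary: the 256 byte values
    merges = {}
    next_id = 256
    for _ in range(num_merges):
        pair_counts = Counter(zip(ids, ids[1:]))
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < 2:  # nothing left that repeats, so stop early
            break
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges, ids


def decode(ids, merges):
    """Map token ids back to text by expanding merges into byte strings."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # dicts keep insertion order
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


sample = "low lower lowest low low"
merges, ids = train_byte_bpe(sample, num_merges=10)
print(len(sample.encode("utf-8")), "bytes ->", len(ids), "tokens")
```

Because every merge only concatenates two existing byte sequences, decoding the training text always round-trips exactly, which makes a convenient sanity check when evaluating a tokenizer.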

Key findings

BF16 autocast improved throughput over FP32 in systems runs with measured speedups around 1.22x, 1.36x, and 1.45x.
Naive DDP profiling runs showed communication occupying about 62% of per-step time.
Sequence-length stress tests completed at up to 8192 tokens and failed at 16384, consistent with attention memory growing quadratically with sequence length.
In one post-training budget sweep, ppo_epoch=2 outperformed ppo_epoch=3 on validation accuracy, with a corresponding entropy/exploration trade-off.
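
The sequence-length finding can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes attention scores are materialized as a full (batch, heads, seq_len, seq_len) tensor, which may not match the kernels used in the tracked runs, and the batch size, head count, and element size are illustrative values rather than the repository's configuration.

```python
def attention_score_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """Bytes needed for one layer's full attention score matrix.

    Assumes the (batch, heads, seq_len, seq_len) tensor is materialized;
    bytes_per_elem=2 corresponds to BF16 or FP16 storage.
    """
    return batch * heads * seq_len * seq_len * bytes_per_elem


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 1024**3


# Hypothetical configuration: batch=1, heads=16, 2-byte elements.
for seq_len in (8192, 16384):
    size = attention_score_bytes(batch=1, heads=16, seq_len=seq_len)
    print(f"seq_len={seq_len}: {to_gib(size):.1f} GiB per layer")
```

Under these assumptions the score matrix takes 2.0 GiB per layer at 8192 tokens and 8.0 GiB at 16384. Doubling the sequence length quadruples this term, before counting softmax outputs or activations saved for the backward pass.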

Reproducibility

Use uv for environment and command execution:
- uv run <command>
Typical test entrypoint:
- uv run pytest
Experiment outputs are preserved under results/, with related figures in figures/ and supporting notes in docs/.

Limitations

This repository contains archived experiment outputs that may reference historical run paths.
Hardware-specific performance observations depend on the exact GPU, driver, and runtime stack used during execution.
Reported metrics reflect tracked runs in this repository and should be interpreted within that experimental scope.
