This repository was archived by the owner on Apr 26, 2026. It is now read-only.

VictorHuu/tokenizer-to-post-training


From Tokenizer to Post-Training: LLM Research Engineering Stack

Overview

This repository contains an end-to-end LLM experimentation stack spanning tokenizer training, pretraining utilities, data curation, systems profiling, scaling analysis, and post-training experiments. The code is organized as independent, function-driven modules to support repeatable experiments and comparable outputs.

Repository layout

  • tokenizer/ — tokenizer training and language-model pretraining scripts.
  • data_pipeline/ — corpus processing, filtering, and quality/safety analysis tools.
  • systems/ — distributed and precision-focused systems experiments.
  • scaling/ — scaling-law and IsoFLOPs analyses.
  • posttraining/ — post-training, preference optimization, and alignment workflows.
  • results/ — saved experiment outputs.
  • figures/ — generated plots and visual summaries.
  • docs/ — technical notes and runbooks.

Main components

  • Tokenizer module: byte-level BPE training, tokenizer evaluation, and LM training/ablation scripts.
  • Data pipeline module: HTML extraction, language identification, PII masking, quality filtering, toxicity/NSFW classification, and deduplication helpers.
  • Systems module: communication benchmarks, profiling workflows, and DDP efficiency studies.
  • Scaling module: compute/data scaling estimates and visualization scripts.
  • Post-training module: SFT and policy optimization experiments with leaderboard-oriented evaluation flows.
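As an illustration of the byte-level BPE training named in the tokenizer component, the following is a minimal sketch and not the repository's implementation. The function names train_byte_bpe, merge_pair, and decode are hypothetical, and the sketch assumes no pre-tokenization and no special tokens, which a production tokenizer would normally add.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_byte_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`.

    Hypothetical helper: ids 0..255 are the raw bytes, learned merges get
    ids starting at 256. Returns (merges, encoded_ids).
    """
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, in the order learned
    next_id = 256
    for _ in range(num_merges):
        pair_counts = Counter(zip(ids, ids[1:]))
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats, so merging gains nothing
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges, ids


def decode(ids, merges):
    """Invert the encoding by expanding each merged id back into bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # dict keeps merge order
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


if __name__ == "__main__":
    sample = "low lower lowest"
    learned, encoded = train_byte_bpe(sample, num_merges=10)
    print(len(sample.encode("utf-8")), "bytes ->", len(encoded), "tokens")
    print(decode(encoded, learned) == sample)
```

Working on raw bytes keeps the base vocabulary fixed at 256 entries, so any input string can be encoded without an unknown-token fallback.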

Key findings

  • In systems runs, BF16 autocast improved throughput over FP32, with measured speedups of about 1.22x, 1.36x, and 1.45x.
  • Naive DDP profiling runs showed communication occupying about 62% of per-step time.
  • Sequence-length stress tests succeeded up to 8192 tokens and failed at 16384 tokens, consistent with attention-memory scaling limits.
  • In one post-training budget sweep, ppo_epoch=2 outperformed ppo_epoch=3 on validation accuracy with a corresponding entropy/exploration trade-off.
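Two of the findings above lend themselves to back-of-the-envelope arithmetic. The sketch below is illustrative only. It assumes naive attention that materializes the full score matrix, 2-byte elements, and a hypothetical head count of 16, none of which are stated in this README. It also treats the 62% communication share as a fixed fraction of step time. All function names are hypothetical.

```python
def attention_score_bytes(seq_len, num_heads, batch_size=1, bytes_per_elem=2):
    """Bytes needed for one layer's full (seq_len x seq_len) attention scores.

    Assumes the score matrix is materialized for every head, which is the
    naive case; fused or tiled attention kernels avoid this allocation.
    """
    return batch_size * num_heads * seq_len * seq_len * bytes_per_elem


def speedup_if_comm_removed(comm_fraction):
    """Upper bound on step speedup if communication time dropped to zero."""
    return 1.0 / (1.0 - comm_fraction)


def speedup_if_comm_overlapped(comm_fraction):
    """Upper bound on step speedup if communication fully overlapped compute.

    With perfect overlap the step takes as long as the slower of the two.
    """
    return 1.0 / max(comm_fraction, 1.0 - comm_fraction)


if __name__ == "__main__":
    heads = 16  # hypothetical value, not taken from the repository
    for seq_len in (8192, 16384):
        gib = attention_score_bytes(seq_len, heads) / 2**30
        print(f"seq_len={seq_len}: {gib:.1f} GiB of scores per layer")
    print(f"no communication: {speedup_if_comm_removed(0.62):.2f}x")
    print(f"perfect overlap:  {speedup_if_comm_overlapped(0.62):.2f}x")
```

Under these assumptions, doubling the sequence length from 8192 to 16384 quadruples the score-matrix memory, which fits the observed failure point. The two speedup bounds bracket what overlapping or reducing communication could recover in the naive DDP setup.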

Reproducibility

  • Use uv for environment and command execution:
    • uv run <command>
  • Typical test entrypoint:
    • uv run pytest
  • Experiment outputs are preserved under results/, with related figures in figures/ and supporting notes in docs/.

Limitations

  • This repository contains archived experiment outputs that may reference historical run paths.
  • Hardware-specific performance observations depend on the exact GPU, driver, and runtime stack used during execution.
  • Reported metrics reflect tracked runs in this repository and should be interpreted within that experimental scope.
