Skip to content

nicomele1/nicolas-lm

Repository files navigation

NicolasLM

nicolas-lm is a small research-oriented language modeling lab. It focuses on character-level autoregressive models and on the tools needed to train, evaluate, and analyze them in a controlled setting.

The current codebase includes:

  • a character tokenizer and corpus utilities;
  • a bigram language model;
  • a decoder-only Transformer;
  • a LLaMA-style decoder with RMSNorm, RoPE, and SwiGLU;
  • corpus statistics and evaluation scripts;
  • experiment helpers for corpus building, training, sampling, and result summarization.

Installation

The project targets Python 3.10+.

python -m pip install -r requirements.txt
python -m pip install -e .

Project layout

src/nicolasm/        Core library code
scripts/             Training, evaluation, and analysis entry points
tests/               Unit tests for tokenizer, models, and metrics
configs/             Model and experiment configuration files
data/                Local corpora, tokenizers, and PDF references
experiments/         Run artifacts, plots, and reports
docs/                Project documentation and reading notes

Typical workflow

Build corpora and tokenizers:

PYTHONPATH=src python scripts/build_corpora.py
PYTHONPATH=src python scripts/train.py

Train or evaluate a model:

PYTHONPATH=src python scripts/train.py
PYTHONPATH=src python scripts/evaluate.py

Analyze a corpus or summarize results:

PYTHONPATH=src python scripts/analyze_corpus.py
PYTHONPATH=src python scripts/run_effective_tokens.py
PYTHONPATH=src python scripts/summarize_results.py

Run the tests:

pytest

Models

The repository currently centers on three language-model families:

  • BigramLanguageModel: a first-order categorical model over tokens.
  • TinyTransformerLanguageModel: a causal Transformer decoder.
  • LLaMAStyleLanguageModel: a Transformer variant with LLaMA-style design choices.

Each model is implemented in src/nicolasm/models/ and built from reusable modules in src/nicolasm/modules/.

About

nicolas-lm is a small research-oriented language modeling lab. It focuses on character-level autoregressive models and on the tools needed to train, evaluate, and analyze them in a controlled setting.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors