Skip to content

maracatu-labs/maracatu

Maracatu

Brazilian LLMs, trained from scratch, in Portuguese, by Brazilians.

Open source project for pretraining language models in Brazilian Portuguese, with open weights under Apache 2.0 and a focus on national AI sovereignty.

maracatu.org · Hugging Face · Contributing · Code of Conduct · Security

Released models

Model Parameters Val Perplexity Corpus Hugging Face Ollama
Maracatu-20M 17M 23.81 Wikipedia PT (~550M tok) maracatu-labs/maracatu-20m whereisanzi/maracatu-20m
Maracatu-80M 87.8M 21.34 Wiki + Gutenberg + CulturaX-PT (~1.6B tok) maracatu-labs/maracatu-80m whereisanzi/maracatu-80m

See MODEL_CARD.md for technical details.

Architecture

Decoder-only transformer, Llama-style, with modern components:

  • RMSNorm · RoPE · SwiGLU · no bias in nn.Linear · weight tying
  • State dict aligned with Hugging Face's LlamaForCausalLM — loads via transformers with no conversion script
  • SentencePiece BPE 16k tokenizer trained on PT-BR
  • Framework: PyTorch

Corpus

Only sources with licenses compatible with Apache 2.0:

  • Wikipedia PT — CC BY-SA (979k articles, ~550M BPE tokens)
  • Project Gutenberg — public domain (Machado de Assis, José de Alencar, etc.)
  • CulturaX-PT — subset filtered for PT-BR

Details in data/README.md. Preparation pipelines in scripts/.

Quickstart

Requires Python 3.11+ and PyTorch 2.2+. For GPU training, see docs/kaggle.md or docs/runpod.md.

git clone git@github.com:maracatu-labs/maracatu.git
cd maracatu

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Prepare corpus

python scripts/clean_corpus.py

Downloads Wikipedia PT (via datasets) to ~/.cache/huggingface/, cleans it, and writes to data/processed/corpus.txt.

Train tokenizer

python tokenizer/train_tokenizer.py

Train model

python -m maracatu.train --config configs/maracatu_20m.yaml
python -m maracatu.train --config configs/maracatu_80m.yaml --device cuda

Generate text

python -m maracatu.sample --checkpoint checkpoints/latest.pt --prompt "O Brasil é"

Experiments

Chronological log of training runs, metrics, and analyses in docs/experiments/.

Evaluation

Benchmarks on Brazilian exams (ENEM, ASSIN), via lm-evaluation-harness:

bash scripts/eval/run_benchmarks.sh

Custom tasks in scripts/eval/tasks/.

Structure

maracatu/
├── src/maracatu/    # Model, training, generation
├── tokenizer/       # SentencePiece tokenizer training
├── scripts/         # Corpus preparation, eval, deploy
├── configs/         # Hyperparameters (YAML)
├── data/            # Corpus (gitignored, see data/README.md)
├── checkpoints/     # Weights (gitignored)
├── docs/            # Technical docs, experiments, deploy
├── notebooks/       # Exploration
└── MODEL_CARD.md

Publishing

Publishing pipelines for Hugging Face, Ollama, and Kaggle in scripts/publish_all.sh and scripts/export_*.{py,sh}. Operational details in docs/publishing.md.

Contributing

Every contribution is welcome — code, corpus improvements, new benchmarks, bug reports. Read CONTRIBUTING.md for the PR workflow and CODE_OF_CONDUCT.md for what we expect from the community environment.

Found a vulnerability? See SECURITY.md before opening a public issue.

License

Code and weights under Apache License 2.0.

Acknowledgments

  • Andrej Karpathy for nanoGPT — indispensable pedagogical foundation
  • Brazilian AI community (Maritaca, WideLabs, LNCC, USP, Unicamp)
  • Tucano for the public baseline reference in PT-BR
  • Brazilian Artificial Intelligence Plan (PBIA)

About

Open-weight Brazilian Portuguese LLMs trained from scratch — Apache 2.0

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors