A GPT-style decoder-only Transformer trained entirely from scratch for Python code autocompletion.
This project now uses a true causal GPT decoder architecture with KV-cache support, enabling faster incremental generation and more stable autoregressive behavior.
The entire system was built and trained locally on CPU, with no external LLM APIs.
- Replaced `TransformerEncoder` with true GPT-style decoder blocks
- Implemented custom causal self-attention
- Added KV-cache for incremental decoding
- Resume-safe training (Ctrl+C supported)
- Dual inference modes (Autocomplete / Creative)
- Cleaned and curated training dataset pipeline
This is now a proper autoregressive language model architecture.
- Raw repositories collected
- Hardened cleaning removes:
  - tests
  - build files
  - compiled artifacts
  - duplicate files
- Curated alignment patterns added (balanced, non-repetitive)
- Train / validation split
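The cleaning rules above could be sketched as a single filter function. This is a minimal illustration and not the contents of `tools/hardened_clean.py`: the function name `keep_file`, the directory names, the build-file names, and the suffix list are all assumptions chosen for the example.

```python
import hashlib
from pathlib import PurePosixPath

# Hypothetical rule sets; the real cleaner's rules are not shown in this README.
SKIP_DIRS = {"tests", "test", "build", "dist", "__pycache__"}
BUILD_FILES = {"setup.py", "conftest.py"}
COMPILED_SUFFIXES = {".pyc", ".pyo", ".so"}


def keep_file(path, content, seen_hashes):
    """Return True if a file should stay in the training corpus."""
    p = PurePosixPath(path)
    if p.suffix in COMPILED_SUFFIXES:            # compiled artifacts
        return False
    if p.suffix != ".py":                        # keep Python sources only
        return False
    if any(part in SKIP_DIRS for part in p.parts[:-1]):
        return False                             # test or build directories
    if p.name in BUILD_FILES or p.name.startswith("test_"):
        return False                             # build files and test modules
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen_hashes:                    # exact duplicate of an earlier file
        return False
    seen_hashes.add(digest)
    return True
```

Hashing file contents only catches byte-identical duplicates; near-duplicate detection would need a different technique.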
- Custom BPE tokenizer (vocab size: 8000)
- Trained on processed corpus
- Python-aware tokenization
Decoder-only GPT-style architecture:
| Component | Value |
|---|---|
| Layers | 8 |
| Attention Heads | 8 |
| Embedding Size | 512 |
| Context Length | 256 |
| Parameters | ~33.5M |
| Attention | Causal Self-Attention |
| KV Cache | Supported |
Total parameters: 33,551,168
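As a sanity check, the total can be reproduced from the table values under one plausible layout. The layout below is an assumption, not a description of `model/ai.py`: learned positional embeddings, a 4x MLP expansion, biases on every linear layer, two LayerNorms per block plus a final one, and an untied output head with bias. The function name `gpt_param_count` is hypothetical.

```python
def gpt_param_count(vocab, ctx, d_model, n_layers, mlp_ratio=4):
    """Count parameters for an assumed GPT-style layout (see lead-in)."""
    tok_emb = vocab * d_model                      # token embedding table
    pos_emb = ctx * d_model                        # learned positional embeddings
    attn = 4 * (d_model * d_model + d_model)       # Q, K, V and output projections, with biases
    hidden = mlp_ratio * d_model
    mlp = (d_model * hidden + hidden) + (hidden * d_model + d_model)
    norms = 2 * (2 * d_model)                      # two LayerNorms per block (weight + bias)
    block = attn + mlp + norms
    final_norm = 2 * d_model
    lm_head = d_model * vocab + vocab              # untied output projection with bias
    return tok_emb + pos_emb + n_layers * block + final_norm + lm_head


print(gpt_param_count(vocab=8000, ctx=256, d_model=512, n_layers=8))
```

With these assumptions the count comes to 33,551,168. The number of attention heads does not appear in the formula because splitting the embedding across heads does not change the projection sizes.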
Generation now uses incremental decoding:
- First forward pass processes full prompt
- Subsequent tokens reuse stored key/value tensors
- No full-sequence recomputation
Result:
- Faster inference
- Lower latency
- True GPT-style decoding behavior
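The idea behind the cache can be shown with a toy single-head attention in plain Python. This is a sketch under simplifying assumptions, not the project's implementation: queries, keys, and values are all taken to be the input vector itself (identity projections), and the names `attend`, `full_recompute`, and `KVCache` are hypothetical.

```python
import math


def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)                              # subtract max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def full_recompute(xs):
    """Causal attention for every position, recomputed from scratch."""
    return [attend(xs[t], xs[:t + 1], xs[:t + 1]) for t in range(len(xs))]


class KVCache:
    """Stores past keys and values so each new token attends in one step."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        # Toy assumption: key and value are the input vector itself.
        self.keys.append(x)
        self.values.append(x)
        return attend(x, self.keys, self.values)
```

Feeding tokens one at a time through `KVCache.step` yields the same outputs as `full_recompute`, but each step only computes attention for the newest position. A position attends only to itself and earlier positions, which is what makes the cached keys and values reusable.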
- Optimizer: AdamW
- Loss: Cross-Entropy
- Gradient Accumulation Supported
- Resume-safe checkpointing
Training can be interrupted safely:
Press Ctrl+C to stop. Restarting resumes automatically from the last checkpoint.
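The interrupt-and-resume pattern can be sketched with the standard library alone. This is an illustration of the general technique, not the code in `training/train.py`: the real checkpoint presumably holds model and optimizer state in a `.pth` file, whereas this sketch stores only a step counter as JSON. The `interrupt_at` parameter exists purely to simulate Ctrl+C, and all function names are assumptions.

```python
import json
import os


def save_checkpoint(path, state):
    # Write to a temporary file, then rename, so an interrupt during the
    # write cannot leave a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}                      # fresh run


def train(path, total_steps, interrupt_at=None):
    state = load_checkpoint(path)           # resume if a checkpoint exists
    try:
        while state["step"] < total_steps:
            if state["step"] == interrupt_at:
                raise KeyboardInterrupt     # simulated Ctrl+C for the demo
            state["step"] += 1              # stand-in for one optimizer step
    except KeyboardInterrupt:
        save_checkpoint(path, state)        # persist progress before exiting
        return state
    save_checkpoint(path, state)
    return state
```

Calling `train` again with the same path picks up from the saved step rather than starting over.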
- ~5.1M training tokens
- ~0.5M validation tokens
- Balanced curated alignment
| Metric | Value |
|---|---|
| Epochs | 2 |
| Final Validation Loss | 2.84 |
| Perplexity | 17.20 |
For a 33M-parameter model trained on CPU, this indicates stable training.
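Perplexity is the exponential of the mean cross-entropy loss, assuming the loss is measured in nats (natural logarithm). A short check of the table values:

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean cross-entropy loss in nats to perplexity."""
    return math.exp(mean_cross_entropy)


print(f"{perplexity(2.84):.2f}")  # prints 17.12
```

The rounded loss of 2.84 gives about 17.12, while the reported 17.20 corresponds to a loss near 2.845, so the perplexity was presumably computed from the unrounded loss.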
Input:

    def dfs(graph, node, visited):

Output:

        stack = [start]
        while stack:
            node = stack.pop()
            if node not in visited:
                visited.add(node)
                stack.extend(graph.get(node, []))

Input:

    class Stack:
        def push(self, item):

Output:

            self._items.append(item)

Input:

    def binary_search(arr, target):

Output:

        lo, hi = 0, len(arr)
        while lo < hi:
            mid = (lo + hi) // 2
            if arr[mid] == target:
                return mid
            if arr[mid] < target:
                lo = mid + 1
            else:
                hi = mid
        return -1

Autocomplete mode:
- temperature = 0.2
- top_k = 10
- Near-deterministic
- Code-focused

Creative mode:
- temperature = 0.8
- top_k = 50
- More diverse
- Useful for code generation
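The effect of `temperature` and `top_k` can be seen in a minimal sampler. This is a generic sketch of temperature-scaled top-k sampling, not the code in `inference/run_model.py`: the function name `sample_top_k` and the `MODES` dictionary are assumptions, with the preset values taken from the lists above.

```python
import math
import random

# Preset values as listed for the two modes.
MODES = {
    "autocomplete": {"temperature": 0.2, "top_k": 10},
    "creative": {"temperature": 0.8, "top_k": 50},
}


def sample_top_k(logits, temperature, top_k, rng=random):
    """Sample a token id from the top_k highest logits after temperature scaling."""
    # Keep only the indices of the top_k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Lower temperature sharpens the distribution toward the best token.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]   # unnormalized softmax
    return rng.choices(top, weights=weights, k=1)[0]
```

A low temperature with a small `top_k` concentrates probability on the few most likely tokens, which suits completion of existing code. A higher temperature with a larger `top_k` spreads probability over more candidates.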

Run with:

    python inference/run_model.py \
        -c model/checkpoints/latest_checkpoint.pth \
        -p "def dfs(graph, node, visited):" \
        --mode autocomplete

You can build a simple CLI wrapper:

    python codellm.py autocomplete "def binary_search(arr, target):"
    python codellm.py creative "Write a Python LRU cache implementation"

Project layout:

    AutoComplete-LLm/
    │
    ├── model/
    │   ├── ai.py
    │   └── checkpoints/
    │
    ├── tokenizer/
    │   ├── tokenizer.json
    │   └── train_tokenizer.py
    │
    ├── training/
    │   ├── dataset.py
    │   └── train.py
    │
    ├── inference/
    │   └── run_model.py
    │
    ├── tools/
    │   ├── hardened_clean.py
    │   ├── build_train_file.py
    │   ├── evaluate_model.py
    │   └── generate_alignment_pack_v3.py
    │
    ├── data/
    │   ├── raw/
    │   ├── cleaned/
    │   └── processed/
    │
    └── README.md
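The `codellm.py` wrapper mentioned earlier is not part of the repository tree, so here is one way it might look. This is a hypothetical sketch: it simply forwards to `inference/run_model.py` using the `-c`, `-p`, and `--mode` flags shown in the run command, and it assumes that `--mode creative` is accepted alongside `--mode autocomplete`.

```python
import argparse
import subprocess
import sys

CHECKPOINT = "model/checkpoints/latest_checkpoint.pth"


def build_command(mode, prompt, checkpoint=CHECKPOINT):
    """Build the run_model.py invocation for the given mode and prompt."""
    return [
        sys.executable, "inference/run_model.py",
        "-c", checkpoint,
        "-p", prompt,
        "--mode", mode,   # assumes run_model.py accepts both mode names
    ]


def main(argv=None):
    parser = argparse.ArgumentParser(prog="codellm.py")
    parser.add_argument("mode", choices=["autocomplete", "creative"])
    parser.add_argument("prompt")
    args = parser.parse_args(argv)
    # Run the inference script and pass its exit code through.
    return subprocess.call(build_command(args.mode, args.prompt))
```

To use it as a script, save it as `codellm.py` in the repository root and call `sys.exit(main())` under an `if __name__ == "__main__":` guard.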
- Full LLM lifecycle from scratch
- Decoder-only Transformer implementation
- Custom causal attention
- KV-cache integration
- Dataset curation & alignment engineering
- CPU-only training of a 33M-parameter model
- Practical engineering for small-scale LLM systems
MIT License
This model is:
- Stable
- Usable for Python autocomplete
- Structurally aligned
- KV-cache enabled
- Trained with resume-safe checkpointing
Further scaling would require:
- Larger dataset (20–50M tokens)
- GPU acceleration
- 60–120M parameter scale
But at current scale, this is a functional local Python code LLM.