CODExGAMERZ/Code-AutoComplete-LLM

🧠 Python Code Autocomplete LLM (From Scratch)

A GPT-style decoder-only Transformer trained entirely from scratch for Python code autocompletion.

This project now uses a true causal GPT decoder architecture with KV-cache support, enabling faster incremental generation and more stable autoregressive behavior.

The entire system was built and trained locally on CPU, with no external LLM APIs.


🚀 What's New (Decoder Upgrade)

✅ Replaced TransformerEncoder with true GPT-style decoder blocks
✅ Implemented custom Causal Self-Attention
✅ Added KV-cache for incremental decoding
✅ Resume-safe training (Ctrl+C supported)
✅ Dual inference modes (Autocomplete / Creative)
✅ Cleaned & curated training dataset pipeline

This is now a proper autoregressive language model architecture.


🧩 System Overview

1️⃣ Data Pipeline

  • Raw repositories collected

  • Hardened cleaning removes:

    • tests
    • build files
    • compiled artifacts
    • duplicate files
  • Curated alignment patterns added (balanced, non-repetitive)

  • Train / validation split
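
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration only: the skip rules, the `keep_file` and `clean_and_split` names, the 10% validation fraction, and the file-level split are all assumptions, and the project's actual rules live in `tools/hardened_clean.py`.

```python
import hashlib
import random
from pathlib import PurePosixPath

# Hypothetical skip rules, not the project's real ones.
SKIP_DIRS = {"tests", "test", "build", "dist", "__pycache__"}
SKIP_NAMES = {"setup.py", "conftest.py"}


def keep_file(path):
    """Return True if a path looks like ordinary Python source worth training on."""
    p = PurePosixPath(path)
    if p.suffix != ".py":  # also drops compiled artifacts such as .pyc
        return False
    if p.name in SKIP_NAMES or p.name.startswith("test_"):
        return False
    return not any(part in SKIP_DIRS for part in p.parts[:-1])


def clean_and_split(files, val_fraction=0.1, seed=0):
    """Filter paths, drop exact-duplicate contents, then split by file.

    `files` maps a path to its text content.
    """
    seen = set()
    unique = []
    for path in sorted(files):
        if not keep_file(path):
            continue
        digest = hashlib.sha256(files[path].encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content as a file that was already kept
        seen.add(digest)
        unique.append(path)
    random.Random(seed).shuffle(unique)
    n_val = max(1, int(len(unique) * val_fraction)) if len(unique) > 1 else 0
    return unique[n_val:], unique[:n_val]


files = {
    "pkg/a.py": "x = 1\n",
    "pkg/b.py": "x = 1\n",  # duplicate of pkg/a.py
    "pkg/c.py": "y = 2\n",
    "tests/test_a.py": "assert True\n",
    "build/lib/a.py": "x = 3\n",
    "pkg/__pycache__/a.cpython-311.pyc": "",
    "setup.py": "",
}
train_files, val_files = clean_and_split(files)
```

Splitting by file rather than by token keeps a single source file from appearing on both sides of the split.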

2️⃣ Tokenizer

  • Custom BPE tokenizer (vocab size: 8000)
  • Trained on processed corpus
  • Python-aware tokenization
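
For readers new to BPE, the core merge loop can be sketched in a few lines. This is a toy version under stated assumptions: it splits on whitespace (which discards indentation, so it is not "Python-aware"), learns only a handful of merges instead of an 8000-entry vocabulary, and the `train_bpe`, `merge_pair`, and `encode` names are illustrative rather than taken from `tokenizer/train_tokenizer.py`.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, always picking the most frequent adjacent pair."""
    words = Counter(tuple(w) for w in text.split())  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode(word, merges):
    """Apply the learned merges, in the order they were learned, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe("def def def return return", num_merges=2)
print(merges)                  # [('e', 'f'), ('d', 'ef')]
print(encode("def", merges))   # ['def']
print(encode("deaf", merges))  # ['d', 'e', 'a', 'f']
```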

3️⃣ Model Architecture

Decoder-only GPT-style architecture:

| Component | Value |
| --- | --- |
| Layers | 8 |
| Attention Heads | 8 |
| Embedding Size | 512 |
| Context Length | 256 |
| Parameters | ~33.5M |
| Attention | Causal Self-Attention |
| KV Cache | ✅ Supported |

Total Parameters: ~33,551,168
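
As a sanity check, the total can be reproduced from the listed hyperparameters. The sketch below assumes a conventional GPT layout that the README does not spell out: learned positional embeddings, a 4x MLP expansion, biases on every linear layer, two LayerNorms per block plus a final one, and an untied output head with bias. Under those assumptions the count comes out to the stated figure; `gpt_param_count` is an illustrative helper, not a function from `model/ai.py`.

```python
def gpt_param_count(vocab=8000, d_model=512, n_layers=8, context=256, mlp_ratio=4):
    """Count parameters for a standard GPT-style stack (see assumptions above)."""
    tok_emb = vocab * d_model
    pos_emb = context * d_model                    # learned positions (assumed)
    attn = 3 * d_model * d_model + 3 * d_model     # Q, K, V projections with bias
    attn += d_model * d_model + d_model            # attention output projection
    hidden = mlp_ratio * d_model
    mlp = (d_model * hidden + hidden) + (hidden * d_model + d_model)
    norms = 2 * (2 * d_model)                      # two LayerNorms per block
    block = attn + mlp + norms
    final_norm = 2 * d_model
    lm_head = d_model * vocab + vocab              # untied head with bias (assumed)
    return tok_emb + pos_emb + n_layers * block + final_norm + lm_head


print(gpt_param_count())  # 33551168
```

The number of attention heads does not appear in the formula because splitting the projections into heads does not change their total size.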


⚡ KV-Cache Support

Generation now uses incremental decoding:

  • First forward pass processes full prompt
  • Subsequent tokens reuse stored key/value tensors
  • No full-sequence recomputation

Result:

  • Faster inference
  • Lower latency
  • True GPT-style decoding behavior
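
The idea can be illustrated with a single attention head in plain Python. This is a sketch, not the project's implementation: it uses one head, no batching, and feeds tokens one at a time, and the `KVCache` class and helper names are hypothetical. It shows that caching keys and values gives the same outputs as recomputing causal attention over the full sequence.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product softmax attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def full_causal(xs, Wq, Wk, Wv):
    """Reference path: recompute all keys/values; position t sees positions 0..t."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return [attend(matvec(Wq, x), keys[: t + 1], values[: t + 1])
            for t, x in enumerate(xs)]


class KVCache:
    """Keeps keys/values of past tokens so each step only projects the new token."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, Wq, Wk, Wv):
        self.keys.append(matvec(Wk, x))
        self.values.append(matvec(Wv, x))
        return attend(matvec(Wq, x), self.keys, self.values)


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d = 4
Wq, Wk, Wv = rand_matrix(rng, d, d), rand_matrix(rng, d, d), rand_matrix(rng, d, d)
xs = rand_matrix(rng, 6, d)  # six token embeddings

cache = KVCache()
incremental = [cache.step(x, Wq, Wk, Wv) for x in xs]
reference = full_causal(xs, Wq, Wk, Wv)
max_diff = max(abs(a - b)
               for ref_row, inc_row in zip(reference, incremental)
               for a, b in zip(ref_row, inc_row))
```

With the cache, each new token costs one set of projections plus one attention over the stored keys, instead of re-projecting the whole prefix.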

🎯 Training Setup

  • Optimizer: AdamW
  • Loss: Cross-Entropy
  • Gradient Accumulation Supported
  • Resume-safe checkpointing

Training can be interrupted safely:

Ctrl+C

Restarting resumes automatically from the last checkpoint.
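
One common way to get this behavior is to catch `KeyboardInterrupt`, write the checkpoint atomically, and reload it on startup. The toy loop below sketches that pattern with a JSON file and a single number standing in for model and optimizer state; the function names, the save interval, and the `interrupt_at` test hook are assumptions, not code from `training/train.py`.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so an interrupted save leaves the old file intact."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weight": 0.0}


def train(path, total_steps, interrupt_at=None, save_every=10):
    """Toy training loop; `weight` stands in for model and optimizer state."""
    state = load_checkpoint(path)
    try:
        while state["step"] < total_steps:
            if interrupt_at is not None and state["step"] == interrupt_at:
                raise KeyboardInterrupt  # simulates the user pressing Ctrl+C
            state["weight"] += 0.5  # stand-in for one optimizer step
            state["step"] += 1
            if state["step"] % save_every == 0:
                save_checkpoint(path, state)
    except KeyboardInterrupt:
        pass  # fall through to the final save below
    save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as tmpdir:
    ckpt = os.path.join(tmpdir, "latest_checkpoint.json")
    first = train(ckpt, total_steps=25, interrupt_at=13)  # stops early at step 13
    resumed = train(ckpt, total_steps=25)                 # picks up from step 13
```

In a PyTorch training script the saved state would typically include `model.state_dict()`, `optimizer.state_dict()`, and the step counter, written with `torch.save`.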


📊 Performance Benchmarks

Dataset

  • ~5.1M training tokens
  • ~0.5M validation tokens
  • Balanced curated alignment

Final Metrics (Decoder Architecture)

| Metric | Value |
| --- | --- |
| Epochs | 2 |
| Final Validation Loss | 2.84 |
| Perplexity | 17.20 |

For a 33M-parameter model trained on CPU for two epochs, these values indicate stable training.
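
Perplexity is the exponential of the mean per-token cross-entropy, assuming the loss is measured in nats (the default for PyTorch's cross-entropy loss). The short check below shows how the two reported numbers relate; the small gap is consistent with the loss having been rounded to two decimals before being reported.

```python
import math

val_loss = 2.84                      # rounded validation loss from the table
print(round(math.exp(val_loss), 2))  # 17.12

implied_loss = math.log(17.20)       # loss that would give the reported perplexity
print(round(implied_loss, 3))        # 2.845
```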


✨ Example Outputs

DFS

Input:

def dfs(graph, node, visited):

Output:

stack = [start]
while stack:
    node = stack.pop()
    if node not in visited:
        visited.add(node)
        stack.extend(graph.get(node, []))

Stack

Input:

class Stack:
    def push(self, item):

Output:

self._items.append(item)

Binary Search

Input:

def binary_search(arr, target):

Output:

lo, hi = 0, len(arr)
while lo < hi:
    mid = (lo + hi) // 2
    if arr[mid] == target:
        return mid
    if arr[mid] < target:
        lo = mid + 1
    else:
        hi = mid
return -1

🧠 Inference Modes

Autocomplete Mode (default)

  • temperature = 0.2
  • top_k = 10
  • Near-deterministic (low-temperature sampling)
  • Code-focused

Creative Mode

  • temperature = 0.8
  • top_k = 50
  • More diverse
  • Useful for code generation
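
The two presets differ only in how the next-token distribution is sharpened and truncated. The sketch below shows temperature scaling followed by top-k sampling over a toy logit vector; `MODES` mirrors the values listed above, while `sample_next` and the toy logits are illustrative and not taken from `inference/run_model.py`.

```python
import math
import random

MODES = {
    "autocomplete": {"temperature": 0.2, "top_k": 10},
    "creative": {"temperature": 0.8, "top_k": 50},
}


def sample_next(logits, temperature, top_k, rng):
    """Keep the top_k highest logits, divide by temperature, then sample one token id."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax weights
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, 0.0, -1.0]  # toy vocabulary of five tokens

shares = {}
for name, cfg in MODES.items():
    picks = [sample_next(logits, rng=rng, **cfg) for _ in range(1000)]
    shares[name] = picks.count(0) / 1000  # how often the top token is chosen
print(shares)
```

At temperature 0.2 the highest-scoring token is chosen almost every time, which is why autocomplete mode behaves nearly deterministically even though it still samples.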

Run with:

python inference/run_model.py \
  -c model/checkpoints/latest_checkpoint.pth \
  -p "def dfs(graph, node, visited):" \
  --mode autocomplete

🖥️ CLI Usage (Simple Wrapper)

You can build a simple CLI wrapper around the inference script and call it like this:

python codellm.py autocomplete "def binary_search(arr, target):"
python codellm.py creative "Write a Python LRU cache implementation"
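
A wrapper of this kind could be as small as the sketch below, which forwards to the inference command shown earlier. It is a hypothetical `codellm.py`: the script and checkpoint paths are copied from the run command in this README, and it assumes `run_model.py` accepts `--mode creative` in the same way as `--mode autocomplete`.

```python
import argparse
import subprocess
import sys

# Paths copied from the run command in this README; adjust to your checkout.
RUN_SCRIPT = "inference/run_model.py"
CHECKPOINT = "model/checkpoints/latest_checkpoint.pth"


def build_command(mode, prompt):
    """Translate a wrapper call into the underlying run_model.py invocation."""
    return [sys.executable, RUN_SCRIPT, "-c", CHECKPOINT, "-p", prompt, "--mode", mode]


def main(argv=None):
    """Parse `<mode> <prompt>` and run the inference script, returning its exit code."""
    parser = argparse.ArgumentParser(prog="codellm.py")
    parser.add_argument("mode", choices=["autocomplete", "creative"])
    parser.add_argument("prompt")
    args = parser.parse_args(argv)
    return subprocess.call(build_command(args.mode, args.prompt))

# To use this as a script, end the file with the usual entry point:
#   if __name__ == "__main__":
#       sys.exit(main())
```

Passing the command as a list avoids shell quoting problems with prompts that contain parentheses or colons.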

📂 Project Structure

AutoComplete-LLm/
│
├── model/
│   ├── ai.py
│   └── checkpoints/
│
├── tokenizer/
│   ├── tokenizer.json
│   └── train_tokenizer.py
│
├── training/
│   ├── dataset.py
│   └── train.py
│
├── inference/
│   └── run_model.py
│
├── tools/
│   ├── hardened_clean.py
│   ├── build_train_file.py
│   ├── evaluate_model.py
│   └── generate_alignment_pack_v3.py
│
├── data/
│   ├── raw/
│   ├── cleaned/
│   └── processed/
│
└── README.md

🎓 What This Project Demonstrates

  • Full LLM lifecycle from scratch
  • Decoder-only Transformer implementation
  • Custom causal attention
  • KV-cache integration
  • Dataset curation & alignment engineering
  • CPU-only training of a 33M-parameter model
  • Practical engineering for small-scale LLM systems

📜 License

MIT License


🚀 Status

This model is:

  • Stable
  • Usable for Python autocomplete
  • Structurally aligned
  • KV-cache enabled
  • Trained with resume-safe checkpointing

Further scaling would require:

  • Larger dataset (20–50M tokens)
  • GPU acceleration
  • 60–120M parameter scale

But at current scale, this is a functional local Python code LLM.

About

A decoder-only GPT Transformer built from scratch in PyTorch for Python autocomplete, featuring custom tokenizers and KV-cache.
