Skip to content

cmndcntrlcyber/code-trainer-offsec-pipeline

Repository files navigation

Code-Trainer V6

Final Dataset on Hugging Face

Overview

This project implements a 6-phase pipeline to build and deploy a multimodal code generation model on an RTX 5060 Ti 16GB (Blackwell). It produces a fine-tuned Qwen2.5-Coder-14B model capable of generating code from VS Code screenshot images.

Core Pipeline Phases

  • Phase 1: Data Collection — Scrape GitHub repositories, filter code files, and capture Monaco Editor screenshots with syntax highlighting.
  • Phase 2: Preprocessing — Convert captures to HuggingFace datasets in chat format and upload to Hub.
  • Phase 3: Vision Model — Train Swin-B + MLP projector + Qwen2.5-Coder-1.5B with LoRA on local GPU.
  • Phase 4: Qwen Fine-tuning — Qwen2.5-Coder-14B LoRA sweep + full training on HF Skills A100.
  • Phase 5: Deployment — LoRA merge, GGUF quantization, llama.cpp/Ollama serving.
  • Phase 6: Inference — vLLM + Qwen-Agent + MCP tool integration.

Built With

  • Python (>=3.12) — Core language.
  • Playwright — Headless Chromium for Monaco Editor screenshot capture.
  • PyTorch / Transformers / PEFT / TRL — Model training stack.

For full dependencies, see pyproject.toml.

Project Structure

root/
├── data/                      # Data directory for inputs and outputs
│   ├── inputs/                # Directory for storing cloned repositories
│   ├── outputs/               # Directory for assessment/training results
│   ├── offensive-security/    # Offensive security specialized data
│   └── sample-data/           # Sample repositories for testing
│
├── src/                       # Source code
│   ├── config/                # Configuration files
│   │   ├── settings.py        # YAML loader with env var substitution
│   │   ├── v6_config.yaml     # Central config for all 6 phases
│   │   └── offsec_config.yaml # Offensive security collection config
│   │
│   ├── phase1_data_collection/  # Phase 1: Data collection pipeline [COMPLETE]
│   │   ├── scrapers/          # GitHub scraping and repo cataloging
│   │   ├── capture/           # Screenshot capture via Monaco/Playwright
│   │   └── scripts/           # Pipeline orchestration scripts
│   │
│   ├── phase2_preprocessing/  # Phase 2: HF dataset build + Hub upload [COMPLETE — local]
│   │   ├── converters/        # Chat format, image encoding
│   │   ├── validation/        # Quality filtering, statistics
│   │   └── scripts/           # build_dataset.py, upload_to_hub.py, compute_statistics.py
│   │
│   ├── phase3_vision_model/   # Phase 3: Swin-B + MLP + Qwen-1.5B LoRA [READY — awaiting Hub]
│   │   ├── architecture/      # Model components (encoder, projector, decoder)
│   │   ├── training/          # Trainer, dataset loader, callbacks
│   │   ├── evaluation/        # Metrics, evaluator
│   │   └── scripts/           # train.py, evaluate.py, export.py
│   │
│   ├── phase4_qwen_finetuning/ # Phase 4: Qwen-14B cloud fine-tune [READY — awaiting Hub]
│   │   ├── hf_skills/         # HF Skills API client + sweep orchestrator
│   │   ├── configs/           # Sweep configs, training args
│   │   └── scripts/           # launch_validation_sweep.py, monitor_jobs.py
│   │
│   └── phase5_deployment/     # Phase 5: LoRA merge → GGUF → llama.cpp [READY — awaiting Phase 4]
│       ├── gguf/              # Converter + uploader
│       ├── inference/         # llama.cpp server wrapper
│       └── scripts/           # convert_to_gguf.py, benchmark.py
│
├── tests/                     # Unit tests
├── docs/                      # Documentation and planning
├── pyproject.toml             # Project dependencies (uv-managed)
├── Makefile                   # Common commands
└── CLAUDE.md                  # Claude Code guidance

Setup

uv sync                      # Install dependencies
playwright install chromium   # Required for screenshot capture

Usage

# --- Phase 1: Data Collection ---

# Run full collection pipeline (scrape + capture)
python -m src.phase1_data_collection.scripts.run_collection --config src/config/v6_config.yaml

# Run scraping only (skip screenshot capture)
python -m src.phase1_data_collection.scripts.run_collection --config src/config/v6_config.yaml --skip-capture

# Run offensive security collection
python -m src.phase1_data_collection.scripts.run_offsec_collection --config src/config/offsec_config.yaml

# Validate captured samples
python -m src.phase1_data_collection.scripts.validate_samples --config src/config/v6_config.yaml

# --- Phase 2: Preprocessing ---

# Build HuggingFace dataset from captures
python -m src.phase2_preprocessing.scripts.build_dataset --config src/config/v6_config.yaml

# Upload dataset to HF Hub (required to unblock Phases 3–4)
python -m src.phase2_preprocessing.scripts.upload_to_hub --config src/config/v6_config.yaml

# Compute dataset statistics
python -m src.phase2_preprocessing.scripts.compute_statistics --config src/config/v6_config.yaml

# --- Phase 3: Vision Model Training (RTX 5060 Ti) ---

python -m src.phase3_vision_model.scripts.train --config src/config/v6_config.yaml

# --- Phase 4: Qwen-14B Fine-tuning (HF Skills Cloud) ---

python -m src.phase4_qwen_finetuning.scripts.launch_validation_sweep --config src/config/v6_config.yaml
python -m src.phase4_qwen_finetuning.scripts.monitor_jobs --config src/config/v6_config.yaml

# --- Phase 5: GGUF Deployment ---

python -m src.phase5_deployment.scripts.convert_to_gguf --config src/config/v6_config.yaml

# --- Tests ---
uv run pytest tests/

Experiment Tracking

Training runs are logged to Weights & Biases. The launcher scripts pass a WANDB_API_KEY secret through to HF Jobs and fall back to WANDB_MODE=offline when no key is supplied — local dry-runs and air-gapped reproductions stay silent, while production runs publish to:

To re-enable W&B for an eval-only retroactive run (e.g. to surface metrics for a job that originally ran offline):

WANDB_API_KEY=<key> WANDB_MODE=online \
  python -m src.phase4_qwen_finetuning.scripts.launch_eval \
  --adapter cmndcntrlcyber/qwen14b-code-trainer-v6-aggressive \
  --val-limit 500 --wait

Ready Tensor Submission

This project is being prepared for the Ready Tensor LLMED Module 1 capstone — see docs/ReadyTensor Submission/publication.md for the technical publication, and docs/model_cards/ for the canonical Hugging Face Hub model cards. The catastrophic-forgetting benchmark (GSM8K via lm-evaluation-harness) is launched with:

python -m src.phase4_qwen_finetuning.scripts.launch_benchmark \
  --adapter cmndcntrlcyber/qwen14b-code-trainer-v6-aggressive --wait
python -m src.phase4_qwen_finetuning.scripts.launch_benchmark \
  --adapter cmndcntrlcyber/qwen14b-code-trainer-v6-aggressive --baseline --wait

About

A FT-Pipeline that produces a fine-tuned Qwen2.5-Coder-14B model capable of generating code from VS Code screenshot images.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages