Skip to content

Add self-training pipeline, demos, smoke tests, benchmarks, and production-grade tooling#18

Merged
MASSIVEMAGNETICS merged 4 commits into
mainfrom
copilot/add-training-pipeline-and-demos
Apr 2, 2026
Merged

Add self-training pipeline, demos, smoke tests, benchmarks, and production-grade tooling#18
MASSIVEMAGNETICS merged 4 commits into
mainfrom
copilot/add-training-pipeline-and-demos

Conversation

Copilot AI commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

Victor LLM lacked a standardized dataset convention, CLI entrypoint, runnable demos, automated tests, and benchmarking infrastructure. This PR brings the project to production-grade by adding all of the above.

Dataset Convention

  • datasets/<name>/{train,valid,test}.jsonl layout with optional dataset.yaml metadata
  • Built-in datasets/example_dataset/ (20/10/10 records) for smoke tests and demos
  • Supports classification (text+label), language-model (text), and instruction-tuning (instruction+response) schemas

victor CLI (victor_cli/ + victor_cli_entry.py)

Five subcommands backed by the existing DataBlobGodmodeKit/AutoTrainer stack:

victor prepare  --dataset datasets/example_dataset
victor train    --dataset datasets/example_dataset --epochs 5 [--config cfg.yaml]
victor eval     --dataset datasets/example_dataset --checkpoint runs/my_run --split test
victor predict  --prompt "Tell me about neural networks" [--checkpoint ...]
victor benchmark --prompts 20 --max-tokens 128

Registered as victor console script in pyproject.toml.

Demos (demos/)

  • demo_inference.py – tokenizer load + text generation, zero deps beyond repo
  • demo_finetune.py – 2-epoch classification run on example dataset
  • demo_e2e.py – full prepare → train → eval → predict → benchmark pipeline

Smoke Tests (tests/test_smoke.py)

45 pytest tests covering imports, CLI argument parsing, dataset validation, 1-epoch training, inference, and benchmark result shape. Completes in < 1 s.

Benchmarks (benchmarks/harness.py)

Standalone harness measuring per-prompt latency, throughput (tokens/s), and RSS memory delta. Three modes: inference, training, compare. Results stored as timestamped JSON under benchmarks/results/.

CI (.github/workflows/smoke-tests.yml)

Runs on every push/PR across Python 3.10 and 3.11: smoke tests, legacy 149-test toolkit suite, inference demo, and benchmark harness.

Packaging & DX

  • pyproject.toml with setuptools.build_meta, optional [torch]/[dev] extras
  • Makefile targets: install, test, smoke, lint, format, benchmark, demo, clean
  • .gitignore extended for runs/, victor_tokenizers/, benchmarks/results/*.json
  • README expanded with full Production-Grade Usage section
Original prompt

You are working in the GitHub repository MASSIVEMAGNETICS/victor_llm (repo ID: 995841880). The user wants the project to become production-grade by adding:

  1. Self-training from a dataset folder
  • Add a standard datasets/ (or data/) folder convention and a clear on-disk layout (e.g., datasets/<dataset_name>/{train,valid,test}.jsonl or similar).
  • Implement a training pipeline that can:
    • Discover datasets in the dataset folder.
    • Validate dataset schema.
    • Tokenize / preprocess.
    • Train (fine-tune) the model.
    • Save checkpoints and final artifacts in artifacts/ or runs/.
    • Support configuration via CLI flags and/or a config file (YAML/TOML/JSON).
  • Provide a single CLI entrypoint (e.g., victor train ...) with subcommands for prepare, train, eval, predict.
  • Add sensible defaults and clear error messages.
  1. Demos
  • Add a demos/ directory with runnable examples:
    • Minimal inference demo.
    • Fine-tuning demo using the dataset folder.
    • End-to-end demo (prepare -> train -> eval -> inference).
  • Ensure demos are documented in the README and can be run locally.
  1. Smoke tests
  • Add a tests/ suite (prefer pytest if Python; choose idiomatic framework for the repo language).
  • Add fast smoke tests that run in CI:
    • Import/package test.
    • CLI help / command wiring.
    • Tiny dataset training (few samples, 1 epoch) that completes quickly.
    • Inference on a known prompt returns non-empty output.
  • Add GitHub Actions workflow(s) to run smoke tests on push/PR.
  1. Benchmarks
  • Add a benchmarks/ directory with:
    • A simple benchmarking harness (latency, throughput, memory) for inference.
    • Optional training benchmark for a tiny run.
    • A way to compare runs and store results (e.g., JSON/CSV) under benchmarks/results/.
  • Provide a CLI command (e.g., victor benchmark ...) and documentation.
  1. Production-grade improvements
  • Improve packaging and developer experience:
    • Add or improve pyproject.toml/package.json/build scripts as appropriate.
    • Add pre-commit hooks or lint/format tooling.
    • Add structured logging.
    • Add type checking if applicable.
    • Ensure configuration, secrets, and local paths are handled correctly.
  • Update README with:
    • Install instructions.
    • Quickstart.
    • Training/evaluation/inference usage.
    • Dataset format docs.
    • Benchmark usage.

Implementation requirements:

  • Keep changes consistent with existing project structure and language.
  • Add inline docs and docstrings where helpful.
  • Ensure make test or equivalent works (add a Makefile if beneficial).
  • Ensure CI passes.

Deliverable:

  • Open a pull request implementing the above changes with clear commit(s) and a comprehensive PR description including how to run training, demos, tests, and benchmarks.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: ADD SELF TRAINING FROM A DATASET FOLDER, ADD DEMOS, SMOKE TESTS AND BENCHMARKS , PRODUCTION GRADE

Assistant: I’m currently waiting on your approval in the GitHub dialog to proceed with creating the pull request and implementing the dataset self-training pipeline, demos, smoke tests, and benchmarks in MASSIVEMAGNETICS/victor_llm.

This pull request was created from Copilot chat.

Copilot AI and others added 2 commits April 2, 2026 13:19
… improvements

- datasets/ folder with example_dataset (train/valid/test.jsonl + dataset.yaml)
- victor_cli/ package: main.py, dataset.py, training.py, evaluation.py, inference.py, benchmark.py
- victor_cli_entry.py top-level CLI script (victor prepare/train/eval/predict/benchmark)
- demos/: demo_inference.py, demo_finetune.py, demo_e2e.py + README
- tests/: conftest.py + test_smoke.py (45 smoke tests, all passing)
- benchmarks/: harness.py + results/.gitkeep + README
- .github/workflows/smoke-tests.yml CI workflow
- pyproject.toml packaging with [project.scripts] victor entrypoint
- Makefile with install/test/smoke/lint/benchmark/demo/clean targets
- .gitignore updated for runs/, victor_tokenizers/, benchmarks/results/*.json
- README.md updated with Production-Grade Usage section

Agent-Logs-Url: https://github.com/MASSIVEMAGNETICS/victor_llm/sessions/4b0425ed-857c-47cb-bda7-757fafb34ae0

Co-authored-by: MASSIVEMAGNETICS <209589629+MASSIVEMAGNETICS@users.noreply.github.com>
…B for macOS, demo loaded check, CI permissions

Agent-Logs-Url: https://github.com/MASSIVEMAGNETICS/victor_llm/sessions/4b0425ed-857c-47cb-bda7-757fafb34ae0

Co-authored-by: MASSIVEMAGNETICS <209589629+MASSIVEMAGNETICS@users.noreply.github.com>
Copilot AI changed the title [WIP] Add training pipeline for self-training and demos Add self-training pipeline, demos, smoke tests, benchmarks, and production-grade tooling Apr 2, 2026
Copilot AI requested a review from MASSIVEMAGNETICS April 2, 2026 13:26
@MASSIVEMAGNETICS MASSIVEMAGNETICS marked this pull request as ready for review April 2, 2026 18:20
Copilot AI review requested due to automatic review settings April 2, 2026 18:20

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f29fcc25f1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


- name: Run legacy toolkit tests
run: |
python -m unittest test_godmode_toolkit -v 2>&1 | tail -5

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve unittest exit status in smoke workflow

The Run legacy toolkit tests step pipes python -m unittest ... into tail, so the step exits with tail's status instead of the test runner's status. In GitHub Actions' default shell settings, this allows failing toolkit tests to appear green, which undermines CI gating for regressions in test_godmode_toolkit.

Useful? React with 👍 / 👎.


- name: Run smoke tests
run: |
python -m pytest tests/test_smoke.py -v --tb=short --timeout=120

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Install pytest-timeout before passing --timeout

The smoke test command uses --timeout=120, but the dependency install step does not install pytest-timeout. On a clean runner this makes pytest fail with an unrecognized argument error before running tests, so the smoke suite is not actually executable from this workflow definition.

Useful? React with 👍 / 👎.

Comment thread victor_cli/training.py
Comment on lines +110 to +111
checkpoint_dir = output_dir / "checkpoints"
checkpoint_dir.mkdir(parents=True, exist_ok=True)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Write checkpoints into the reported run artifact path

Training saves checkpoints under <output_dir>/checkpoints/..., while the command reports artifacts as <output_dir>/<run_id> and docs/examples pass run directories to victor eval. Because run_eval searches for epoch_*.json inside the provided checkpoint directory, using the reported run path will fail with "No epoch checkpoint files found", breaking the documented train→eval flow.

Useful? React with 👍 / 👎.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a “production-grade” workflow around the existing DataBlobGodmodeKit/AutoTrainer stack by adding a standardized dataset layout, a victor CLI, runnable demos, smoke tests, benchmarks, and CI wiring.

Changes:

  • Added victor CLI package (victor_cli/) with subcommands for prepare/train/eval/predict/benchmark.
  • Introduced dataset convention + bundled datasets/example_dataset and documentation.
  • Added demos, smoke tests, benchmark harness, and GitHub Actions workflow.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
victor_cli/training.py Training pipeline wrapper over DataBlobGodmodeKit/AutoTrainer with config-file support and summary writing.
victor_cli/main.py CLI argument parsing and command wiring for prepare/train/eval/predict/benchmark.
victor_cli/inference.py Lightweight CPU-only inference/generation path for demos/tests.
victor_cli/evaluation.py Simple evaluation that reads checkpoint metadata and prints metrics.
victor_cli/dataset.py Dataset directory + JSONL parsing validation helpers.
victor_cli/benchmark.py CLI-accessible inference benchmark (latency/throughput/memory) with JSON output.
victor_cli/__init__.py Package marker for the CLI module.
victor_cli_entry.py Repo-root entrypoint for running the CLI directly.
tests/test_smoke.py Pytest smoke coverage for imports, CLI wiring, dataset prepare, tiny train, inference, benchmarks.
tests/conftest.py Ensures repo root is importable during tests.
README.md Adds “Production-Grade Usage” docs for install, datasets, CLI usage, demos, tests, Makefile.
pyproject.toml Introduces packaging metadata, console-script entrypoint, pytest/ruff/mypy config.
Makefile Developer targets for install/test/smoke/lint/format/benchmark/demo/clean.
demos/README.md Documentation for running the demo scripts.
demos/demo_inference.py Minimal inference demo using tokenizer + simple generation.
demos/demo_finetune.py Small fine-tuning demo against the example dataset.
demos/demo_e2e.py End-to-end prepare → train → eval → predict → benchmark demo pipeline.
datasets/README.md Documents on-disk dataset layout and record schemas.
datasets/example_dataset/train.jsonl Bundled training split for demos/tests.
datasets/example_dataset/valid.jsonl Bundled validation split for demos/tests.
datasets/example_dataset/test.jsonl Bundled test split for demos/tests.
datasets/example_dataset/dataset.yaml Example dataset metadata.
benchmarks/results/.gitkeep Keeps benchmark results directory tracked.
benchmarks/README.md Benchmark harness docs and usage.
benchmarks/harness.py Standalone benchmark harness (inference/training/compare) with JSON output.
.gitignore Ignores generated runs, tokenizers, benchmark result JSON, build artifacts, tool caches.
.github/workflows/smoke-tests.yml CI workflow to run smoke tests, legacy toolkit tests, inference demo, benchmark harness.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread victor_cli/training.py
Comment on lines +67 to +68
DataBlobGodmodeKit, and delegates to AutoTrainer. Saves artifacts to
output_dir/<run_id>/.

Copilot AI Apr 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring says artifacts are saved to output_dir/<run_id>/, but the code configures Godmode/AutoTrainer checkpoints under output_dir/checkpoints/... and writes the summary under output_dir/<run_id>/. This splits artifacts across two different roots and can confuse users/scripts. Consider consolidating everything under a single per-run directory (e.g., output_dir/<run_id>/checkpoints/...) and passing that to GodmodeConfig.checkpoint_dir.

Suggested change
DataBlobGodmodeKit, and delegates to AutoTrainer. Saves artifacts to
output_dir/<run_id>/.
DataBlobGodmodeKit, and delegates to AutoTrainer. Checkpoints are saved
under output_dir/checkpoints/ while run summaries and metadata are saved
under output_dir/<run_id>/.

Copilot uses AI. Check for mistakes.
Comment thread datasets/README.md

## Example Dataset

`datasets/example_dataset/` is a tiny built-in demo set (10 records each split)

Copilot AI Apr 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example dataset description says “10 records each split”, but example_dataset/train.jsonl contains 20 records (valid/test are 10). Update this line to match the actual shipped dataset sizes so users and tests are not misled.

Suggested change
`datasets/example_dataset/` is a tiny built-in demo set (10 records each split)
`datasets/example_dataset/` is a tiny built-in demo set (20 train / 10 validation / 10 test records)

Copilot uses AI. Check for mistakes.
Comment thread Makefile
$(PIP) install pytest pytest-cov pytest-timeout ruff pyyaml

test: smoke
$(PYTHON) -m unittest test_godmode_toolkit -v 2>&1 | tail -5

Copilot AI Apr 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Piping the unittest run through tail -5 masks test failures because the pipeline’s exit code will typically be from tail, not python -m unittest. This can cause CI/local make test to report success when tests fail. Use set -o pipefail (in shells that support it) or avoid piping and instead rely on the test runner output/verbosity controls.

Suggested change
$(PYTHON) -m unittest test_godmode_toolkit -v 2>&1 | tail -5
$(PYTHON) -m unittest test_godmode_toolkit -v

Copilot uses AI. Check for mistakes.
python -m pytest tests/test_smoke.py -v --tb=short --timeout=120

- name: Run legacy toolkit tests
run: |

Copilot AI Apr 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python -m unittest ... | tail -5 will usually return 0 even when tests fail, because tail succeeds. This makes the CI job unreliable. Remove the pipe or enable pipefail so test failures correctly fail the workflow.

Suggested change
run: |
run: |
set -o pipefail

Copilot uses AI. Check for mistakes.
python -m pip install --upgrade pip
pip install pytest pytest-cov pyyaml
# Install lightweight subset of requirements (skip heavy GPU libs).
pip install numpy tqdm || true

Copilot AI Apr 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pip install numpy tqdm || true will hide install failures and lead to confusing downstream import/runtime errors. If these dependencies are required for the smoke tests, let the step fail; if they’re optional, gate the tests that need them accordingly.

Suggested change
pip install numpy tqdm || true
pip install numpy tqdm

Copilot uses AI. Check for mistakes.
Comment thread victor_cli/dataset.py
Comment on lines +53 to +57
def prepare_dataset(dataset_dir: Path, verbose: bool = False) -> int:
"""
Validate the layout of a dataset directory.

Returns 0 on success, 1 if critical errors are found.

Copilot AI Apr 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prepare_dataset() currently checks only file presence + JSONL parseability; it does not validate the record schema against the declared task (e.g., required text/label fields for classification). This contradicts the PR description’s “validate dataset schema” and can let invalid datasets pass preparation and fail later. Add schema validation (ideally driven by dataset.yaml) and return non-zero on violations.

Copilot uses AI. Check for mistakes.
Comment thread victor_cli/main.py
prompts.extend(args.prompt)
if args.prompts_file:
pf = Path(args.prompts_file).expanduser()
for line in pf.read_text(encoding="utf-8").splitlines():

Copilot AI Apr 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--prompts-file is read without handling missing/permission errors; read_text() will raise and crash the CLI with a traceback. Catch FileNotFoundError/OSError and return a non-zero code with a clear message.

Suggested change
for line in pf.read_text(encoding="utf-8").splitlines():
try:
contents = pf.read_text(encoding="utf-8")
except (FileNotFoundError, OSError) as exc:
logger.error("Failed to read prompts file '%s': %s", pf, exc)
return 1
for line in contents.splitlines():

Copilot uses AI. Check for mistakes.
Comment thread victor_cli/benchmark.py
Comment on lines +51 to +55
usage = resource.getrusage(resource.RUSAGE_SELF)
# Linux reports ru_maxrss in kB; macOS reports in bytes.
if sys.platform == "darwin":
return usage.ru_maxrss / (1024 * 1024)
return usage.ru_maxrss / 1024

Copilot AI Apr 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Unix, resource.getrusage(...).ru_maxrss is peak RSS (max so far), not current RSS. Using it for mem_before/mem_after makes memory_delta_mb misleading. Prefer psutil.Process().memory_info().rss when available, or rename the metric/output to reflect that this is peak RSS.

Copilot uses AI. Check for mistakes.
Comment thread victor_cli/benchmark.py
Comment on lines +113 to +115
data = json.loads(default_tok.read_text(encoding="utf-8"))
vocabulary = data.get("vocabulary", {})
reverse_vocabulary = {str(k): v for k, v in data.get("reverse_vocabulary", {}).items()}

Copilot AI Apr 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default tokenizer loading isn’t protected by try/except. If victor_tokenizers/nlp_tokenizer.json exists but is malformed, the benchmark will raise and crash. Mirror the defensive loading used in victor_cli.inference.run_predict() (warn and continue with empty vocab).

Suggested change
data = json.loads(default_tok.read_text(encoding="utf-8"))
vocabulary = data.get("vocabulary", {})
reverse_vocabulary = {str(k): v for k, v in data.get("reverse_vocabulary", {}).items()}
try:
data = json.loads(default_tok.read_text(encoding="utf-8"))
vocabulary = data.get("vocabulary", {})
reverse_vocabulary = {str(k): v for k, v in data.get("reverse_vocabulary", {}).items()}
logger.info("Loaded default vocabulary (%d tokens) from %s", len(vocabulary), default_tok)
except Exception as exc:
logger.warning("Could not load default tokenizer from %s: %s", default_tok, exc)

Copilot uses AI. Check for mistakes.
Comment thread .github/workflows/smoke-tests.yml Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@MASSIVEMAGNETICS MASSIVEMAGNETICS merged commit fc69179 into main Apr 2, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants