Add self-training pipeline, demos, smoke tests, benchmarks, and production-grade tooling by Copilot · Pull Request #18 · MASSIVEMAGNETICS/victor_llm

Copilot · 2026-04-02T13:01:02Z

Victor LLM lacked a standardized dataset convention, CLI entrypoint, runnable demos, automated tests, and benchmarking infrastructure. This PR brings the project to production-grade by adding all of the above.

Dataset Convention

datasets/<name>/{train,valid,test}.jsonl layout with optional dataset.yaml metadata
Built-in datasets/example_dataset/ (20/10/10 records) for smoke tests and demos
Supports classification (text+label), language-model (text), and instruction-tuning (instruction+response) schemas

`victor` CLI (`victor_cli/` + `victor_cli_entry.py`)

Five subcommands backed by the existing DataBlobGodmodeKit/AutoTrainer stack:

victor prepare  --dataset datasets/example_dataset
victor train    --dataset datasets/example_dataset --epochs 5 [--config cfg.yaml]
victor eval     --dataset datasets/example_dataset --checkpoint runs/my_run --split test
victor predict  --prompt "Tell me about neural networks" [--checkpoint ...]
victor benchmark --prompts 20 --max-tokens 128

Registered as victor console script in pyproject.toml.

Demos (`demos/`)

demo_inference.py – tokenizer load + text generation, zero deps beyond repo
demo_finetune.py – 2-epoch classification run on example dataset
demo_e2e.py – full prepare → train → eval → predict → benchmark pipeline

Smoke Tests (`tests/test_smoke.py`)

45 pytest tests covering imports, CLI argument parsing, dataset validation, 1-epoch training, inference, and benchmark result shape. Completes in < 1 s.

Benchmarks (`benchmarks/harness.py`)

Standalone harness measuring per-prompt latency, throughput (tokens/s), and RSS memory delta. Three modes: inference, training, compare. Results stored as timestamped JSON under benchmarks/results/.

CI (`.github/workflows/smoke-tests.yml`)

Runs on every push/PR across Python 3.10 and 3.11: smoke tests, legacy 149-test toolkit suite, inference demo, and benchmark harness.

Packaging & DX

pyproject.toml with setuptools.build_meta, optional [torch]/[dev] extras
Makefile targets: install, test, smoke, lint, format, benchmark, demo, clean
.gitignore extended for runs/, victor_tokenizers/, benchmarks/results/*.json
README expanded with full Production-Grade Usage section

Original prompt

You are working in the GitHub repository MASSIVEMAGNETICS/victor_llm (repo ID: 995841880). The user wants the project to become production-grade by adding:

Self-training from a dataset folder

Add a standard datasets/ (or data/) folder convention and a clear on-disk layout (e.g., datasets/<dataset_name>/{train,valid,test}.jsonl or similar).

Implement a training pipeline that can:

Discover datasets in the dataset folder.

Validate dataset schema.

Tokenize / preprocess.

Train (fine-tune) the model.

Save checkpoints and final artifacts in artifacts/ or runs/.

Support configuration via CLI flags and/or a config file (YAML/TOML/JSON).

Provide a single CLI entrypoint (e.g., victor train ...) with subcommands for prepare, train, eval, predict.

Add sensible defaults and clear error messages.

Demos

Add a demos/ directory with runnable examples:

Minimal inference demo.

Fine-tuning demo using the dataset folder.

End-to-end demo (prepare -> train -> eval -> inference).

Ensure demos are documented in the README and can be run locally.

Smoke tests

Add a tests/ suite (prefer pytest if Python; choose idiomatic framework for the repo language).

Add fast smoke tests that run in CI:

Import/package test.

CLI help / command wiring.

Tiny dataset training (few samples, 1 epoch) that completes quickly.

Inference on a known prompt returns non-empty output.

Add GitHub Actions workflow(s) to run smoke tests on push/PR.

Benchmarks

Add a benchmarks/ directory with:

A simple benchmarking harness (latency, throughput, memory) for inference.

Optional training benchmark for a tiny run.

A way to compare runs and store results (e.g., JSON/CSV) under benchmarks/results/.

Provide a CLI command (e.g., victor benchmark ...) and documentation.

Production-grade improvements

Improve packaging and developer experience:

Add or improve pyproject.toml/package.json/build scripts as appropriate.

Add pre-commit hooks or lint/format tooling.

Add structured logging.

Add type checking if applicable.

Ensure configuration, secrets, and local paths are handled correctly.

Update README with:

Install instructions.

Quickstart.

Training/evaluation/inference usage.

Dataset format docs.

Benchmark usage.

Implementation requirements:

Keep changes consistent with existing project structure and language.

Add inline docs and docstrings where helpful.

Ensure make test or equivalent works (add a Makefile if beneficial).

Ensure CI passes.

Deliverable:

Open a pull request implementing the above changes with clear commit(s) and a comprehensive PR description including how to run training, demos, tests, and benchmarks.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: ADD SELF TRAINING FROM A DATASET FOLDER, ADD DEMOS, SMOKE TESTS AND BENCHMARKS , PRODUCTION GRADE

Assistant: I’m currently waiting on your approval in the GitHub dialog to proceed with creating the pull request and implementing the dataset self-training pipeline, demos, smoke tests, and benchmarks in MASSIVEMAGNETICS/victor_llm.

This pull request was created from Copilot chat.

… improvements - datasets/ folder with example_dataset (train/valid/test.jsonl + dataset.yaml) - victor_cli/ package: main.py, dataset.py, training.py, evaluation.py, inference.py, benchmark.py - victor_cli_entry.py top-level CLI script (victor prepare/train/eval/predict/benchmark) - demos/: demo_inference.py, demo_finetune.py, demo_e2e.py + README - tests/: conftest.py + test_smoke.py (45 smoke tests, all passing) - benchmarks/: harness.py + results/.gitkeep + README - .github/workflows/smoke-tests.yml CI workflow - pyproject.toml packaging with [project.scripts] victor entrypoint - Makefile with install/test/smoke/lint/benchmark/demo/clean targets - .gitignore updated for runs/, victor_tokenizers/, benchmarks/results/*.json - README.md updated with Production-Grade Usage section Agent-Logs-Url: https://github.com/MASSIVEMAGNETICS/victor_llm/sessions/4b0425ed-857c-47cb-bda7-757fafb34ae0 Co-authored-by: MASSIVEMAGNETICS <209589629+MASSIVEMAGNETICS@users.noreply.github.com>

…B for macOS, demo loaded check, CI permissions Agent-Logs-Url: https://github.com/MASSIVEMAGNETICS/victor_llm/sessions/4b0425ed-857c-47cb-bda7-757fafb34ae0 Co-authored-by: MASSIVEMAGNETICS <209589629+MASSIVEMAGNETICS@users.noreply.github.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f29fcc25f1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-02T18:23:25Z

+
+      - name: Run legacy toolkit tests
+        run: |
+          python -m unittest test_godmode_toolkit -v 2>&1 | tail -5


Preserve unittest exit status in smoke workflow

The Run legacy toolkit tests step pipes python -m unittest ... into tail, so the step exits with tail's status instead of the test runner's status. In GitHub Actions' default shell settings, this allows failing toolkit tests to appear green, which undermines CI gating for regressions in test_godmode_toolkit.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-02T18:23:25Z

+
+      - name: Run smoke tests
+        run: |
+          python -m pytest tests/test_smoke.py -v --tb=short --timeout=120


Install pytest-timeout before passing --timeout

The smoke test command uses --timeout=120, but the dependency install step does not install pytest-timeout. On a clean runner this makes pytest fail with an unrecognized argument error before running tests, so the smoke suite is not actually executable from this workflow definition.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-02T18:23:25Z

+    checkpoint_dir = output_dir / "checkpoints"
+    checkpoint_dir.mkdir(parents=True, exist_ok=True)


Write checkpoints into the reported run artifact path

Training saves checkpoints under <output_dir>/checkpoints/..., while the command reports artifacts as <output_dir>/<run_id> and docs/examples pass run directories to victor eval. Because run_eval searches for epoch_*.json inside the provided checkpoint directory, using the reported run path will fail with "No epoch checkpoint files found", breaking the documented train→eval flow.

Useful? React with 👍 / 👎.

Copilot

Pull request overview

This PR introduces a “production-grade” workflow around the existing DataBlobGodmodeKit/AutoTrainer stack by adding a standardized dataset layout, a victor CLI, runnable demos, smoke tests, benchmarks, and CI wiring.

Changes:

Added victor CLI package (victor_cli/) with subcommands for prepare/train/eval/predict/benchmark.
Introduced dataset convention + bundled datasets/example_dataset and documentation.
Added demos, smoke tests, benchmark harness, and GitHub Actions workflow.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 16 comments.

Show a summary per file

File	Description
`victor_cli/training.py`	Training pipeline wrapper over DataBlobGodmodeKit/AutoTrainer with config-file support and summary writing.
`victor_cli/main.py`	CLI argument parsing and command wiring for prepare/train/eval/predict/benchmark.
`victor_cli/inference.py`	Lightweight CPU-only inference/generation path for demos/tests.
`victor_cli/evaluation.py`	Simple evaluation that reads checkpoint metadata and prints metrics.
`victor_cli/dataset.py`	Dataset directory + JSONL parsing validation helpers.
`victor_cli/benchmark.py`	CLI-accessible inference benchmark (latency/throughput/memory) with JSON output.
`victor_cli/__init__.py`	Package marker for the CLI module.
`victor_cli_entry.py`	Repo-root entrypoint for running the CLI directly.
`tests/test_smoke.py`	Pytest smoke coverage for imports, CLI wiring, dataset prepare, tiny train, inference, benchmarks.
`tests/conftest.py`	Ensures repo root is importable during tests.
`README.md`	Adds “Production-Grade Usage” docs for install, datasets, CLI usage, demos, tests, Makefile.
`pyproject.toml`	Introduces packaging metadata, console-script entrypoint, pytest/ruff/mypy config.
`Makefile`	Developer targets for install/test/smoke/lint/format/benchmark/demo/clean.
`demos/README.md`	Documentation for running the demo scripts.
`demos/demo_inference.py`	Minimal inference demo using tokenizer + simple generation.
`demos/demo_finetune.py`	Small fine-tuning demo against the example dataset.
`demos/demo_e2e.py`	End-to-end prepare → train → eval → predict → benchmark demo pipeline.
`datasets/README.md`	Documents on-disk dataset layout and record schemas.
`datasets/example_dataset/train.jsonl`	Bundled training split for demos/tests.
`datasets/example_dataset/valid.jsonl`	Bundled validation split for demos/tests.
`datasets/example_dataset/test.jsonl`	Bundled test split for demos/tests.
`datasets/example_dataset/dataset.yaml`	Example dataset metadata.
`benchmarks/results/.gitkeep`	Keeps benchmark results directory tracked.
`benchmarks/README.md`	Benchmark harness docs and usage.
`benchmarks/harness.py`	Standalone benchmark harness (inference/training/compare) with JSON output.
`.gitignore`	Ignores generated runs, tokenizers, benchmark result JSON, build artifacts, tool caches.
`.github/workflows/smoke-tests.yml`	CI workflow to run smoke tests, legacy toolkit tests, inference demo, benchmark harness.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-02T18:27:01Z

+    DataBlobGodmodeKit, and delegates to AutoTrainer.  Saves artifacts to
+    output_dir/<run_id>/.


The docstring says artifacts are saved to output_dir/<run_id>/, but the code configures Godmode/AutoTrainer checkpoints under output_dir/checkpoints/... and writes the summary under output_dir/<run_id>/. This splits artifacts across two different roots and can confuse users/scripts. Consider consolidating everything under a single per-run directory (e.g., output_dir/<run_id>/checkpoints/...) and passing that to GodmodeConfig.checkpoint_dir.

Suggested change

DataBlobGodmodeKit, and delegates to AutoTrainer. Saves artifacts to

output_dir/<run_id>/.

DataBlobGodmodeKit, and delegates to AutoTrainer. Checkpoints are saved

under output_dir/checkpoints/ while run summaries and metadata are saved

under output_dir/<run_id>/.

Copilot · 2026-04-02T18:27:02Z

+
+## Example Dataset
+
+`datasets/example_dataset/` is a tiny built-in demo set (10 records each split)


The example dataset description says “10 records each split”, but example_dataset/train.jsonl contains 20 records (valid/test are 10). Update this line to match the actual shipped dataset sizes so users and tests are not misled.

Suggested change

`datasets/example_dataset/` is a tiny built-in demo set (10 records each split)

`datasets/example_dataset/` is a tiny built-in demo set (20 train / 10 validation / 10 test records)

Copilot · 2026-04-02T18:27:02Z

+	$(PIP) install pytest pytest-cov pytest-timeout ruff pyyaml
+
+test: smoke
+	$(PYTHON) -m unittest test_godmode_toolkit -v 2>&1 | tail -5


Piping the unittest run through tail -5 masks test failures because the pipeline’s exit code will typically be from tail, not python -m unittest. This can cause CI/local make test to report success when tests fail. Use set -o pipefail (in shells that support it) or avoid piping and instead rely on the test runner output/verbosity controls.

Suggested change

$(PYTHON) -m unittest test_godmode_toolkit -v 2>&1 | tail -5

$(PYTHON) -m unittest test_godmode_toolkit -v

Copilot · 2026-04-02T18:27:03Z

+          python -m pytest tests/test_smoke.py -v --tb=short --timeout=120
+
+      - name: Run legacy toolkit tests
+        run: |


python -m unittest ... | tail -5 will usually return 0 even when tests fail, because tail succeeds. This makes the CI job unreliable. Remove the pipe or enable pipefail so test failures correctly fail the workflow.

Suggested change

run: |

run: |

set -o pipefail

Copilot · 2026-04-02T18:27:03Z

+          python -m pip install --upgrade pip
+          pip install pytest pytest-cov pyyaml
+          # Install lightweight subset of requirements (skip heavy GPU libs).
+          pip install numpy tqdm || true


pip install numpy tqdm || true will hide install failures and lead to confusing downstream import/runtime errors. If these dependencies are required for the smoke tests, let the step fail; if they’re optional, gate the tests that need them accordingly.

Suggested change

pip install numpy tqdm || true

pip install numpy tqdm

Copilot · 2026-04-02T18:27:05Z

+def prepare_dataset(dataset_dir: Path, verbose: bool = False) -> int:
+    """
+    Validate the layout of a dataset directory.
+
+    Returns 0 on success, 1 if critical errors are found.


prepare_dataset() currently checks only file presence + JSONL parseability; it does not validate the record schema against the declared task (e.g., required text/label fields for classification). This contradicts the PR description’s “validate dataset schema” and can let invalid datasets pass preparation and fail later. Add schema validation (ideally driven by dataset.yaml) and return non-zero on violations.

Copilot · 2026-04-02T18:27:05Z

+        prompts.extend(args.prompt)
+    if args.prompts_file:
+        pf = Path(args.prompts_file).expanduser()
+        for line in pf.read_text(encoding="utf-8").splitlines():


--prompts-file is read without handling missing/permission errors; read_text() will raise and crash the CLI with a traceback. Catch FileNotFoundError/OSError and return a non-zero code with a clear message.

Suggested change

for line in pf.read_text(encoding="utf-8").splitlines():

try:

contents = pf.read_text(encoding="utf-8")

except (FileNotFoundError, OSError) as exc:

logger.error("Failed to read prompts file '%s': %s", pf, exc)

return 1

for line in contents.splitlines():

Copilot · 2026-04-02T18:27:05Z

+        usage = resource.getrusage(resource.RUSAGE_SELF)
+        # Linux reports ru_maxrss in kB; macOS reports in bytes.
+        if sys.platform == "darwin":
+            return usage.ru_maxrss / (1024 * 1024)
+        return usage.ru_maxrss / 1024


On Unix, resource.getrusage(...).ru_maxrss is peak RSS (max so far), not current RSS. Using it for mem_before/mem_after makes memory_delta_mb misleading. Prefer psutil.Process().memory_info().rss when available, or rename the metric/output to reflect that this is peak RSS.

Copilot · 2026-04-02T18:27:05Z

+            data = json.loads(default_tok.read_text(encoding="utf-8"))
+            vocabulary = data.get("vocabulary", {})
+            reverse_vocabulary = {str(k): v for k, v in data.get("reverse_vocabulary", {}).items()}


Default tokenizer loading isn’t protected by try/except. If victor_tokenizers/nlp_tokenizer.json exists but is malformed, the benchmark will raise and crash. Mirror the defensive loading used in victor_cli.inference.run_predict() (warn and continue with empty vocab).

Suggested change

data = json.loads(default_tok.read_text(encoding="utf-8"))

vocabulary = data.get("vocabulary", {})

reverse_vocabulary = {str(k): v for k, v in data.get("reverse_vocabulary", {}).items()}

try:

data = json.loads(default_tok.read_text(encoding="utf-8"))

vocabulary = data.get("vocabulary", {})

reverse_vocabulary = {str(k): v for k, v in data.get("reverse_vocabulary", {}).items()}

logger.info("Loaded default vocabulary (%d tokens) from %s", len(vocabulary), default_tok)

except Exception as exc:

logger.warning("Could not load default tokenizer from %s: %s", default_tok, exc)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Initial plan

3b84411

Copilot AI assigned Copilot and MASSIVEMAGNETICS Apr 2, 2026

Copilot started work on behalf of MASSIVEMAGNETICS April 2, 2026 13:01 View session

Copilot AI and others added 2 commits April 2, 2026 13:19

Copilot AI changed the title ~~[WIP] Add training pipeline for self-training and demos~~ Add self-training pipeline, demos, smoke tests, benchmarks, and production-grade tooling Apr 2, 2026

Copilot finished work on behalf of MASSIVEMAGNETICS April 2, 2026 13:26

Copilot AI requested a review from MASSIVEMAGNETICS April 2, 2026 13:26

MASSIVEMAGNETICS approved these changes Apr 2, 2026

View reviewed changes

MASSIVEMAGNETICS marked this pull request as ready for review April 2, 2026 18:20

Copilot AI review requested due to automatic review settings April 2, 2026 18:20

Copilot started reviewing on behalf of MASSIVEMAGNETICS April 2, 2026 18:20 View session

chatgpt-codex-connector Bot reviewed Apr 2, 2026

View reviewed changes

Copilot AI reviewed Apr 2, 2026

View reviewed changes

Update .github/workflows/smoke-tests.yml

05a0025

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

MASSIVEMAGNETICS merged commit fc69179 into main Apr 2, 2026
4 checks passed

		checkpoint_dir = output_dir / "checkpoints"
		checkpoint_dir.mkdir(parents=True, exist_ok=True)

		DataBlobGodmodeKit, and delegates to AutoTrainer. Saves artifacts to
		output_dir/<run_id>/.

-    DataBlobGodmodeKit, and delegates to AutoTrainer.  Saves artifacts to
-    output_dir/<run_id>/.
+    DataBlobGodmodeKit, and delegates to AutoTrainer. Checkpoints are saved
+    under output_dir/checkpoints/ while run summaries and metadata are saved
+    under output_dir/<run_id>/.


		## Example Dataset

		`datasets/example_dataset/` is a tiny built-in demo set (10 records each split)

	$(PYTHON) -m unittest test_godmode_toolkit -v 2>&1 \| tail -5
	$(PYTHON) -m unittest test_godmode_toolkit -v

-        for line in pf.read_text(encoding="utf-8").splitlines():
+        try:
+            contents = pf.read_text(encoding="utf-8")
+        except (FileNotFoundError, OSError) as exc:
+            logger.error("Failed to read prompts file '%s': %s", pf, exc)
+            return 1
+        for line in contents.splitlines():

Conversation

Copilot AI commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Dataset Convention

victor CLI (victor_cli/ + victor_cli_entry.py)

Demos (demos/)

Smoke Tests (tests/test_smoke.py)

Benchmarks (benchmarks/harness.py)

CI (.github/workflows/smoke-tests.yml)

Packaging & DX

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Copilot AI commented Apr 2, 2026 •

edited

Loading

`victor` CLI (`victor_cli/` + `victor_cli_entry.py`)

Demos (`demos/`)

Smoke Tests (`tests/test_smoke.py`)

Benchmarks (`benchmarks/harness.py`)

CI (`.github/workflows/smoke-tests.yml`)