fix(task-148): Toyota Way 500-line refactor + FALSIFY-CORPUS-004 + QLoRA + GPU training backend #1003

Closed
noahgift wants to merge 2 commits into main from fix/task-148-toyota-way-500-line-bundle

Conversation

@noahgift
Contributor

Summary

Toyota Way file-size refactor (PMAT-689 / task #148) bundled with FALSIFY-CORPUS-004 pre-flight gate, QLoRA distillation contract (#137), and GPU training backend Phase 2 (#132).

Test plan

  • cargo fmt --all -- --check
  • cargo test -p apr-cli --features training --lib → 5307 passed, 12 ignored
  • cargo clippy -p apr-cli --features training --lib -- -D warnings
  • cargo clippy -p aprender-train --lib -- -D warnings
  • cargo test -p aprender-train --lib cpu_stepfn_exhaustion → 2 passed (PMAT-688 CPU peer)
  • CI green on push

🤖 Generated with Claude Code

noahgift and others added 2 commits April 22, 2026 07:49
Addresses 2026-04-22 outage where all 16 intel-clean-room runners went
offline because / on intel hit 100% (3.5T/3.6T). Runner diag logs
couldn't be written, so GitHub marked runners offline.

Two layers of defence (see the sketch after this commit message):
- pre-job hook: aggressive target/ prune when disk >= 85%
- nightly timer: prune target/ older than 7 days

Scripts are runner-host-agnostic — install path and deployment recipe in
scripts/runner-infra/README.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
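
A minimal Rust sketch of the pre-job threshold check described in this commit message. It is an illustration only: the real runner-infra scripts are not reproduced here, and the `df` invocation, parsing, and prune placeholder are all assumptions.

```rust
use std::process::Command;

/// Root filesystem usage as a percentage, parsed from `df --output=pcent /`
/// (a GNU coreutils flag). Returns None if df is missing or output changes.
fn root_disk_usage_percent() -> Option<u8> {
    let out = Command::new("df").args(["--output=pcent", "/"]).output().ok()?;
    let text = String::from_utf8(out.stdout).ok()?;
    // Output looks like "Use%\n 97%": skip the header row, strip the '%'.
    text.lines().nth(1)?.trim().trim_end_matches('%').parse().ok()
}

fn main() {
    // Same threshold as the pre-job hook: prune target/ when disk >= 85%.
    const PRUNE_THRESHOLD: u8 = 85;
    match root_disk_usage_percent() {
        Some(pct) if pct >= PRUNE_THRESHOLD => {
            eprintln!("disk at {pct}% >= {PRUNE_THRESHOLD}% -- pruning target/");
            // Placeholder: the real hook aggressively removes cargo target/
            // directories under the runner workspace here.
        }
        Some(pct) => eprintln!("disk at {pct}% -- below threshold, no prune"),
        None => eprintln!("could not determine disk usage; skipping prune"),
    }
}
```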
…or + FALSIFY-CORPUS-004 + QLoRA contract + GPU training backend

Toyota Way (PMAT-689): split 5 files over 500-line cap via include!() pattern (sketch after this list)
- distill.rs 1984→468 (4-way split: types/config_and_execute/train_and_write/text_generate)
- extended_commands.rs →497 (4 sibling sub-enum files: forensics/lints/runs/training)
- dispatch_analysis.rs →453 (+ dispatch_helpers.rs + dispatch_profiling.rs)
- lib_dispatch_coverage.rs 773→158 (3 sibling test files: analysis/profiling/train)
- pull.rs →374 (+ pull_sharded.rs)
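
As background on the split technique named above, a minimal sketch of the include!() pattern with hypothetical file names (the fragment's contents appear in the leading comment because include!() inherently spans two files). The fragment holds plain items with no `mod` wrapper, so the parent compiles exactly as if it were still one file while each file on disk stays under the 500-line cap.

```rust
// Fragment: src/distill_types.rs (hypothetical name). Plain items only,
// no `mod` wrapper, so they splice directly into the including file:
//
//     pub struct DistillConfig { pub temperature: f32 }
//     pub fn default_config() -> DistillConfig {
//         DistillConfig { temperature: 2.0 }
//     }

// Parent: src/distill.rs. include!() pastes the fragment's tokens here at
// compile time; the path is resolved relative to this file.
include!("distill_types.rs");

fn main() {
    let cfg = default_config();
    println!("temperature = {}", cfg.temperature);
}
```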

FALSIFY-CORPUS-004 pre-flight gate (#142/#144/#145/#146/#147):
- contracts/pretraining-corpus-v1.yaml v2.0.0 (INV-TRAIN-010/011)
- ShardBatchIter::count_tokens static counter
- cycling_iter.rs: BoxedShardIter + optional cycling (iterator sketch after this list)
- pretrain_preflight.rs + pretrain_report.rs module split
- --allow-shard-cycle CLI flag wired
- pretrain_tests.rs unit tests covering epoch/budget/cycle paths
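
A hedged sketch of the optional-cycling idea behind cycling_iter.rs and the --allow-shard-cycle flag; the alias, shard type, and constructor name below are assumptions, not the real signatures.

```rust
/// Illustrative alias; the real BoxedShardIter may differ in item type
/// and trait bounds.
type BoxedShardIter = Box<dyn Iterator<Item = Vec<u32>>>;

/// Hypothetical constructor mirroring --allow-shard-cycle: with cycling,
/// shards are re-read when the corpus is smaller than the token budget;
/// without it, a single pass lets the pre-flight gate fail fast on corpus
/// exhaustion instead of silently repeating data.
fn shard_iter(shards: Vec<Vec<u32>>, allow_cycle: bool) -> BoxedShardIter {
    if allow_cycle {
        Box::new(shards.into_iter().cycle())
    } else {
        Box::new(shards.into_iter())
    }
}

fn main() {
    let cycled: Vec<_> = shard_iter(vec![vec![1, 2], vec![3]], true)
        .take(5)
        .collect();
    println!("{cycled:?}"); // [[1, 2], [3], [1, 2], [3], [1, 2]]
}
```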

QLoRA distillation contract (#137; LoRA-math sketch after this list):
- contracts/entrenar/qlora-distillation-v1.yaml v1.1.0 PROPOSED
- distill/{preflight,driver,apr_writer}.rs wiring 5 INV-DISTILL invariants
- 14 harness tests at PARTIAL_ALGORITHM_LEVEL
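
For readers new to the technique, a minimal sketch of the LoRA update that QLoRA trains on top of a frozen quantized base. This illustrates the math only, not the aprender-train API; all names and dimensions are invented, and the base weight is shown dequantized to f32 for clarity.

```rust
/// Row-major matrix-vector product: m is rows x cols.
fn matvec(m: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    (0..rows)
        .map(|i| (0..cols).map(|j| m[i * cols + j] * x[j]).sum())
        .collect()
}

/// LoRA forward pass: y = W*x + (alpha/r) * B*(A*x).
/// W (d_out x d_in) is frozen (4-bit quantized in QLoRA); only A (r x d_in)
/// and B (d_out x r) receive gradients during distillation.
fn lora_forward(
    w: &[f32], a: &[f32], b: &[f32], x: &[f32],
    d_out: usize, d_in: usize, r: usize, alpha: f32,
) -> Vec<f32> {
    let base = matvec(w, d_out, d_in, x);
    let ax = matvec(a, r, d_in, x);
    let bax = matvec(b, d_out, r, &ax);
    base.iter()
        .zip(bax)
        .map(|(y, delta)| y + (alpha / r as f32) * delta)
        .collect()
}

fn main() {
    // d_out = 2, d_in = 2, r = 1: tiny dims just to exercise the math.
    let w: [f32; 4] = [1.0, 0.0, 0.0, 1.0]; // frozen base (identity)
    let a: [f32; 2] = [1.0, 1.0]; // trainable A (1 x 2)
    let b: [f32; 2] = [0.5, -0.5]; // trainable B (2 x 1)
    let x: [f32; 2] = [2.0, 3.0];
    println!("{:?}", lora_forward(&w, &a, &b, &x, 2, 2, 1, 1.0)); // [4.5, 0.5]
}
```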

GPU training backend Phase 2 (#132):
- pretrain_real_cuda.rs CUDA dispatch wiring (dispatch sketch after this list)
- evidence/gpu-training-backend/ Phase 2 scaffold
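
A hedged sketch of what CPU/CUDA backend dispatch wiring can look like; every name below is invented, and the real pretrain_real_cuda.rs routing is certainly more involved.

```rust
// Invented types for illustration; not the apr-cli/aprender-train API.
enum TrainBackend {
    Cpu,
    #[cfg(feature = "cuda")]
    Cuda { device: usize },
}

fn select_backend(want_gpu: bool) -> TrainBackend {
    #[cfg(feature = "cuda")]
    if want_gpu {
        return TrainBackend::Cuda { device: 0 };
    }
    let _ = want_gpu; // without the cuda feature, always fall back to CPU
    TrainBackend::Cpu
}

fn train_step(backend: &TrainBackend, params: &mut [f32], grads: &[f32], lr: f32) {
    match backend {
        TrainBackend::Cpu => {
            // Reference SGD step: the CPU peer path that the CUDA kernels
            // must match (cf. the cpu_stepfn_exhaustion tests above).
            for (p, g) in params.iter_mut().zip(grads) {
                *p -= lr * g;
            }
        }
        #[cfg(feature = "cuda")]
        TrainBackend::Cuda { device } => {
            let _ = device; // Phase 2: launch the equivalent kernel here
            unimplemented!("CUDA path");
        }
    }
}

fn main() {
    let backend = select_backend(false);
    let mut params = vec![1.0_f32, 2.0];
    train_step(&backend, &mut params, &[0.5, 0.5], 0.1);
    println!("{params:?}"); // [0.95, 1.95]
}
```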

MODEL-2 spec updates:
- ship-two-models-spec.md v2.24.0 (INV-TRAIN-011 + corpus v2.0)
- roadmap.yaml phase tracking for tasks #142/#144/#145/#146/#147

Verification:
- cargo test -p apr-cli --features training --lib → 5307 passed
- cargo fmt --all -- --check → clean
- cargo clippy -p apr-cli --features training --lib -- -D warnings → clean
- cargo clippy -p aprender-train --lib -- -D warnings → clean
- All changed files ≤500 lines (pmat work complete invariant GREEN)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@codeinputmachine

Hey @noahgift, Code Input detected that this PR has a merge conflict. The conflict can be resolved with a semantic merge driver, and Code Input can do that automatically: https://codeinput.com/r/3rcKnuOwggR. Let me know if you need more help with this conflict or with how Code Input works.

@noahgift
Contributor Author

Triaged by autonomous sweep (2026-05-11): this PR is a 95-file / 128-commit bundle that pre-dates the current §17.5 cascade and the §50.4 polymorphic-preflight cascade now on main. Auto-arming was skipped because:

  1. The blast radius (95 files across multiple crates) makes a clean cherry-pick onto current main impractical.
  2. Several of the changes (FALSIFY-CORPUS-004, QLoRA, GPU training backend) are likely either superseded by or in conflict with PRs that landed during the §50.4 cascade and the SHIP-007 §22 fix cascade (M91-M103). The §50.4 cascade PRs in question:
     - feat(apr-cli): wire apr pretrain --init <model.apr> — §49 step 4 (#1471)
     - contract(apr-pretrain-arch-polymorphic-v1): v1.0.0 PROPOSED — §50.4 step 5a (#1473)
     - fix(aprender-train): qwen2_0_5b tie_word_embeddings true — §50.4 step 5b + DEFECT FIX (#1474)
     - feat(aprender-train): build_transformer_config polymorphic dispatch — §50.4 step 5c (#1475)
     - feat(apr-cli): polymorphic preflight_tokenizer_vocab_matches_target — §50.4 step 5d (#1476)
     - test(aprender-train): GQA-7:1 forward-pass smoke test — §50.4 step 5e (#1478)
     - feat(aprender-train): validate_pretrain_init_arch_compatible — §50.4 step 5f.1 (#1479)
     - feat(aprender-train): load_init_tensors_from_apr — §50.4 step 5f.2 (#1481)
     - contract(apr-pretrain-arch-polymorphic-v1): v1.0.0 → v1.1.0 PARTIAL_ALGORITHM_LEVEL (#1482)
     - feat(aprender-train): populate_trainer_from_init_tensors — §50.4 step 5f.3 (#1483)
     - spec(ship-two-models): v2.96.0 → v2.97.0 — §52 cascade ALGORITHM-COMPLETE + 5f.4 wireup gap (#1486)
     - feat(apr-cli + aprender-train): apr pretrain --init wireup — §50.4 step 5f.4 (#1494)

Recommended next step: split this into the still-relevant subset (probably the GPU training backend changes that aren't already on main) as one or two focused PRs, and close this one as superseded. Leaving as-is for human review.

@noahgift
Contributor Author

Triaged after rebase attempt (2026-05-13). Rebase against current main produces 3 structural compile errors:

  1. crate::models::llama_370m::Llama370MConfig — module removed during MODEL-2 architecture-coupling cleanup (§50 multi-PR cascade)
  2. config::Normalization — removed from tokenizer::config exports
  3. BPETokenizer::preprocess — method removed

The PR's 125-commit Toyota Way refactor pre-dates 22+ commits to main during this session. The refactor scope and code touched (95 files including tokenizer/bpe.rs, models/, distill/) overlap heavily with those commits. Re-authoring against current main is more tractable than a rebase. Closing as superseded; please re-open with a fresh PR focused on the still-applicable subset.

noahgift closed this May 13, 2026