feat(test-maturity): no-mock OAuth Codex audit per domain (plan + imp…#24

Merged: usetheodev merged 2 commits into develop from worktree-mellow-wiggling-karp on May 15, 2026
Conversation

@usetheodev

Owner

…l + reviews)

Implements the test-maturity-audit-no-mock plan v1.1: a reproducible audit of test maturity across 20 domains (CLAUDE.md "Domain Status") scoring 6 dimensions (pyramid, failure_coverage, determinism, invariants, realism×2, boundary) with LOC-weighted global score. Validation runs real OAuth Codex E2E scenarios; zero mocks in the audit code (ADR D2 enforced by preflight-no-mock.sh).
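The scoring described above can be sketched roughly as follows. This is an illustration only, not the code under scripts/test-maturity/: it assumes each dimension is scored in [0, 1], that "realism×2" means realism carries twice the weight of the other dimensions, and that the global score is the LOC-weighted mean of per-domain scores. The names `DIMENSION_WEIGHTS`, `DomainResult`, `domain_score`, and `global_score` are hypothetical.

```python
from dataclasses import dataclass

# Assumed weights: the PR text only says "realism×2"; everything else is
# taken to have weight 1. These values are illustrative, not from the repo.
DIMENSION_WEIGHTS = {
    "pyramid": 1.0,
    "failure_coverage": 1.0,
    "determinism": 1.0,
    "invariants": 1.0,
    "realism": 2.0,
    "boundary": 1.0,
}


@dataclass(frozen=True)
class DomainResult:
    """Hypothetical per-domain record: lines of code plus dimension scores."""
    name: str
    loc: int
    scores: dict[str, float]  # dimension name -> score in [0, 1] (assumed range)


def domain_score(scores: dict[str, float]) -> float:
    """Weighted mean of the six dimension scores for one domain."""
    total_weight = sum(DIMENSION_WEIGHTS.values())
    weighted = sum(w * scores[dim] for dim, w in DIMENSION_WEIGHTS.items())
    return weighted / total_weight


def global_score(domains: list[DomainResult]) -> float:
    """LOC-weighted mean of the per-domain scores."""
    total_loc = sum(d.loc for d in domains)
    if total_loc == 0:
        raise ValueError("cannot weight by LOC: total LOC is zero")
    return sum(d.loc * domain_score(d.scores) for d in domains) / total_loc
```

Weighting by LOC means a large domain with weak tests pulls the global score down more than a small one does, which is presumably the intent of a LOC-weighted aggregate.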

Entry points (opt-in, costs USD per run, NOT in make audit):

  • make check-test-maturity — full audit with OAuth Codex E2E (~$0.30, ~1min)
  • make check-test-maturity-skip-e2e — structural-only (no cost)
  • make check-test-maturity-report — print latest report

Plan: 13 tasks across 5 phases. 8 ADRs (D1-D8). 12 edge cases pinned by tests. Halt-on-anomaly via D8a (auth re-validation per scenario), D8b (trajectory mtime filter), D8c (RunCompleted gate — halt if cost unknown).
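A minimal sketch of how two of the halt-on-anomaly gates might look, under stated assumptions. The PR names D8b (trajectory mtime filter) and D8c (halt if cost unknown) but does not show their implementation; the `AuditHalt` exception, the `fresh_trajectories` helper, the `Budget` class, and the idea of a USD cap (suggested by the "$0.286/$2.00" figure) are all hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Optional


class AuditHalt(Exception):
    """Raised to stop the audit when a run's results cannot be trusted."""


def fresh_trajectories(paths: list[str], started_at: float) -> list[str]:
    """D8b-style filter (assumed semantics): keep only trajectory files
    modified at or after the scenario start, so stale files from an
    earlier run are never attributed to the current scenario."""
    return [p for p in paths if os.path.getmtime(p) >= started_at]


@dataclass
class Budget:
    """Running USD total with a hard cap (the cap is an assumption)."""
    limit_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: Optional[float]) -> None:
        # D8c-style gate (assumed semantics): a run that reports no cost
        # halts the audit instead of being silently counted as free.
        if cost_usd is None:
            raise AuditHalt("run completed without a known cost")
        self.spent_usd += cost_usd
        if self.spent_usd > self.limit_usd:
            raise AuditHalt(
                f"budget exceeded: ${self.spent_usd:.3f} > ${self.limit_usd:.2f}"
            )
```

Raising an exception rather than logging a warning keeps the failure mode loud: an audit that costs money per run should stop at the first sign that its accounting or inputs are unreliable.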

Validation evidence (this session):

  • 66/66 Python tests green
  • 1 real OAuth Codex audit run: 9 scenarios, $0.286/$2.00, 1 PASS + 8 diagnostic FAILs (the 8 are genuine signals — agent didn't dispatch expected tools for short prompts)
  • Cross-validation: APPROVED WITH RESERVATIONS (13/13 tasks, 0 BLOCKER/CRITICAL)
  • Dogfood QA: SHIPPABLE WITH CAVEATS, health 85/100, 0 CRITICAL attributable (4 pre-existing bwrap host-kernel failures unattributable)

Scope: Rust workspace only. apps/theo-ui (TS) and apps/theo-benchmark (Python) have their own suites and are out of scope (declared in report methodology).

Files (38, +4953 LOC):

  • scripts/test-maturity/ (28 files: 12 source Python/bash, 4 data yaml, 1 jinja2 template, 11 test_*.py)
  • .claude/knowledge-base/plans/test-maturity-audit-no-mock-plan.md (v1.1)
  • 4 reviews under .claude/knowledge-base/reviews/ (test-maturity report, cross-validation, edge-case, dogfood)
  • Makefile (+3 targets, NOT in make audit)
  • CLAUDE.md (+1 line in Extended gates)
  • index.md + log.md updated

@usetheodev

Owner Author

@copilot resolve the merge conflicts in this pull request

Co-authored-by: usetheodev <241186089+usetheodev@users.noreply.github.com>

Copilot AI commented May 15, 2026

Contributor

@copilot resolve the merge conflicts in this pull request

Done in commit bf6c847: I merged develop into this branch and resolved all merge conflicts in CLAUDE.md, .claude/knowledge-base/index.md, and .claude/knowledge-base/log.md while keeping both sides’ updates.

@usetheodev usetheodev merged commit ce42406 into develop May 15, 2026
4 of 8 checks passed