Add Steps 13-15: hooks, testing, procedure encoding #1
New content covering the full Agent Red setup (S207-S212):

Step 13 — Defense-in-Depth Hook Architecture:
- Three-layer enforcement model (PreToolUse → pre-commit → SessionStart)
- Credential scanner, destructive command gate, assertion ratchet, test deletion guard, architectural guards, TSX spec gate
- Custom agents (code-reviewer, security-analyzer)

Step 14 — Automated Testing Strategy:
- 12 suite taxonomy (unit through property-based, 9,152 tests)
- 4 test generation patterns (spec-driven, assertion, auto-generated, regression)
- Live-only testing principle (GOV-10) and two-container test host
- Known testing gaps documented honestly

Step 15 — Encoding Procedures to Reduce Orchestration Load:
- The orchestration problem (step omission, order violation, drift, partial execution)
- "Prompts encode intent. Hooks encode enforcement. Skills encode execution."
- 10-skill catalog as operational surface
- Measured 10x reduction in GOV-12 violations

Also: 6 new lessons learned (41-46), updated metrics (212 sessions, 9,152 tests, 10 skills), new evolution timeline entries, expanded Quick Start Checklist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the documentation of the Membase pattern by integrating three new foundational steps: a robust, layered hook architecture for defense-in-depth, a comprehensive strategy for automated testing, and a principle for encoding repeatable procedures to improve AI orchestration reliability. These additions formalize critical quality assurance and governance mechanisms, ensuring more consistent and secure AI-driven development by shifting from prose-based instructions to executable enforcement and structured workflows.
Code Review
This pull request adds extensive documentation for Steps 13-15, covering defense-in-depth hooks, automated testing strategies, and procedure encoding. It also updates various metrics and lessons learned across the documentation. My review focuses on ensuring consistency and clarity in the newly added content. I've found a few inconsistencies in metrics (like test counts and skill counts) across different sections and files, a typo in a table header, and a broken markdown table. Addressing these points will improve the overall quality and readability of the documentation.
| Operational procedures | 14 |
| Governance principles | 20 (GOV-01 through GOV-18 + 2 architectural) |
| Test plan phases | 18 active (incl. fuzzing + property phases) |

| Suite | Est. Tests | Speed | What It Catches | Generation Method |
|-------|-----------|-------|----------------|-------------------|
| **Unit** | 950 | ~2 min | Logic errors, type mismatches, pure function bugs | Claude-generated from specs |
| **Core** (multi-tenant) | 3,700 | ~5 min | Tenant isolation, API routing, auth, middleware | Claude-generated from specs |
| **Integration** | 270 | ~3 min | Component interaction, database queries, cache behavior | Claude-generated from specs + WIs |
| **Agents** | 300 | ~2 min | MCP agent dispatch, tool execution, guardrails | Claude-generated from SPEC-1706..1712 |
| **Security** | 150 | ~2 min | Auth bypass, injection, tenant leaks, OWASP patterns | Claude-generated from security specs |
| **Regression** | 47 | ~1 min | Previously-fixed bugs | Auto-created when WI is resolved |
| **Widget** | 60 | ~1 min | Embed behavior, postMessage, launcher, resize | Claude-generated from widget specs |
| **Ops** | 80 | ~1 min | Deployment checks, config validation, health probes | Claude-generated from procedures |
| **E2E Live** | 1,100 | ~15 min | Real deployment: Playwright against staging/production | Claude-generated from user stories |
| **Load** | variable | ~10 min | Throughput, latency under load, rate limit behavior | Locust scenarios from capacity specs |
| **Fuzzing** | 307 ops | ~10 min | API contract violations, edge cases, unexpected inputs | Schemathesis auto-generated from OpenAPI |
| **Property-based** | 46 | ~3 min | Algebraic invariants, roundtrip properties | Hypothesis strategies from data models |
The estimated test counts in this table seem to be inconsistent with the total number of tests mentioned elsewhere. The sum of Est. Tests here is 6,703 (or 7,010 if including fuzzing ops), but other parts of the documentation (like README.md) mention a total of 9,152 automated tests. To avoid confusion, could you please update these numbers to be consistent with the total?
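The arithmetic behind this comment can be reproduced with a short script. This is only a sketch: the counts are copied from the taxonomy table as quoted in this review, `Load` is skipped because it is listed as "variable", and the variable names are illustrative rather than anything defined in the repository.

```python
# Estimated test counts copied from the Testing Taxonomy table quoted above.
# "Load" is listed as "variable" and is left out of the sum.
SUITE_COUNTS = {
    "Unit": 950,
    "Core": 3700,
    "Integration": 270,
    "Agents": 300,
    "Security": 150,
    "Regression": 47,
    "Widget": 60,
    "Ops": 80,
    "E2E Live": 1100,
    "Property-based": 46,
}
FUZZING_OPS = 307        # the table counts fuzzing in operations, not tests
DOCUMENTED_TOTAL = 9152  # total quoted in the PR description and README

table_total = sum(SUITE_COUNTS.values())
total_with_fuzzing = table_total + FUZZING_OPS
unaccounted = DOCUMENTED_TOTAL - table_total

print(f"table total:        {table_total}")
print(f"with fuzzing ops:   {total_with_fuzzing}")
print(f"gap to documented:  {unaccounted}")
```

The gap between the table and the documented total is what the table's estimates (or the headline number) would need to absorb to be consistent.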
```
Batch 1: core-a (2,400 tests, parallel) -> 30s cooldown
Batch 2: core-b (680 tests, parallel) -> 30s cooldown
Batch 3: agents-chat (600 tests, parallel) -> 30s cooldown
Batch 4: integrations (400 tests, parallel) -> 30s cooldown
Batch 5: sequential (120 tests, serial)
```
The test counts in this example for thermal-safe testing are inconsistent with the Testing Taxonomy table in the same document. For example:
- `core-a` (2,400) + `core-b` (680) = 3,080 tests, but the taxonomy lists 3,700 for `Core`.
- `agents-chat` is 600 tests here, but `Agents` in the taxonomy is 300.
- `integrations` is 400 tests here, but `Integration` in the taxonomy is 270.
Please align these numbers to ensure consistency throughout the document.
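A small check along these lines could flag the same mismatches automatically. It is a sketch under one stated assumption: the mapping from batch names to taxonomy suites (`BATCH_TO_SUITE`) is inferred from the names in this comment and is not defined anywhere in the document.

```python
# Batch sizes from the thermal-safe testing example.
BATCHES = {
    "core-a": 2400,
    "core-b": 680,
    "agents-chat": 600,
    "integrations": 400,
    "sequential": 120,
}
# Counts from the Testing Taxonomy table for the suites being compared.
TAXONOMY = {"Core": 3700, "Agents": 300, "Integration": 270}
# Assumed mapping (hypothetical): which taxonomy suite each batch belongs to.
BATCH_TO_SUITE = {
    "core-a": "Core",
    "core-b": "Core",
    "agents-chat": "Agents",
    "integrations": "Integration",
}


def batch_totals_by_suite(batches, mapping):
    """Sum batch sizes per taxonomy suite, ignoring unmapped batches."""
    totals = {}
    for batch, count in batches.items():
        suite = mapping.get(batch)
        if suite is not None:
            totals[suite] = totals.get(suite, 0) + count
    return totals


def mismatches(batches, mapping, taxonomy):
    """Return {suite: (batch_total, taxonomy_count)} where the two differ."""
    totals = batch_totals_by_suite(batches, mapping)
    return {
        suite: (totals.get(suite, 0), expected)
        for suite, expected in taxonomy.items()
        if totals.get(suite, 0) != expected
    }


for suite, (got, expected) in mismatches(BATCHES, BATCH_TO_SUITE, TAXONOMY).items():
    print(f"{suite}: batches sum to {got}, taxonomy says {expected}")
```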
- [ ] Create git pre-commit hooks: assertion ratchet, test deletion guard, architectural guards, TSX gate, credential scan (Step 13)
- [ ] Generate assertion baseline JSON for the ratchet hook (Step 13)
- [ ] Create custom review agents (code-reviewer, security-analyzer) in `.claude/agents/` (Step 13)
- [ ] Expand test taxonomy beyond unit/integration/e2e: add security, regression, fuzzing, property-based, load suites (Step 14)
This checklist item for expanding the test taxonomy could be clearer. It lists security, regression, fuzzing, property-based, load suites, but this is an incomplete list compared to the 12-suite taxonomy defined in Step 14. It's also slightly confusing because regression and fuzzing are already part of the Phase Taxonomy in Step 10.
To improve clarity, I suggest either listing all test suites that should be added or rephrasing to better guide the user on how to expand from a basic setup to the full 12-suite taxonomy.
| S211 | Claude silently weakens tests and removes architectural patterns | **Quality guardrails** — 5 PreToolUse/pre-commit hooks (assertion ratchet, test deletion guard, architecture guard, TSX spec gate, credential scan). Three-layer defense model. |
| S212 | Production and staging share test infrastructure; uncontrolled regressions | **Environment isolation + production verification** — Separate test hosts per environment, skip-as-pass classification, SPEC-0058 enforcement (24 files cleaned), widget storefront presence testing |

### Current Database (as of Session 206)
The database is used exclusively by Claude and contains only what Claude needs to remember. The human observes through a lightweight read-only UI (sort, filter, search, tree-view, change history) that deliberately excludes write operations. When the human spots a discrepancy, they tell Claude, and Claude creates a corrected version.

The current database is ~40 MB with 2,052 specifications, 10,847 test artifacts, 1 test plan (18 active phases), ~1,600 work items, 14 operational procedures, 176 documents, 520 testable elements, ~2,040 specs with machine-verifiable assertions (99.5% coverage), 8 KB-aware Claude Code skills, and multi-agent coordination via prime-bridge — all accumulated across 206 sessions with zero data loss.

The current database is ~40 MB with 2,052 specifications, 10,847 test artifacts, 1 test plan (18 active phases), ~1,600 work items, 14 operational procedures, 176 documents, 520 testable elements, ~2,040 specs with machine-verifiable assertions (99.5% coverage), 8 KB-aware Claude Code skills, and multi-agent coordination via prime-bridge — all accumulated across 212 sessions with zero data loss.
This summary paragraph contains several outdated metrics that are inconsistent with the "Current Database" table updated in this same pull request. Specifically:
- Test artifacts: 10,847 here vs. 10,912 in the table.
- Knowledge documents: 176 here vs. 154 in the table.
- Claude Code skills: 8 here vs. 10 in the table and elsewhere.
Please update this paragraph to reflect the latest metrics for consistency.
The current database is ~40 MB with 2,052 specifications, 10,847 test artifacts, 1 test plan (18 active phases), ~1,600 work items, 14 operational procedures, 176 documents, 520 testable elements, ~2,040 specs with machine-verifiable assertions (99.5% coverage), 8 KB-aware Claude Code skills, and multi-agent coordination via prime-bridge — all accumulated across 212 sessions with zero data loss.

The current database is ~40 MB with 2,052 specifications, 10,912 test artifacts, 1 test plan (18 active phases), ~1,600 work items, 14 operational procedures, 154 documents, 520 testable elements, ~2,040 specs with machine-verifiable assertions (99.5% coverage), 10 KB-aware Claude Code skills, and multi-agent coordination via prime-bridge — all accumulated across 212 sessions with zero data loss.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9b9cd56e35
Update summary paragraph to match revised database metrics
The final “current database” paragraph was edited for the 212-session count but still reports old values (10,847 test artifacts, 176 documents, 8 skills) that now conflict with the updated metrics table in the same section (10,912, 154, 10). This introduces contradictory source-of-truth data in one README and can mislead readers who copy numbers from the summary text instead of the table.
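One way to catch this kind of drift before review is to extract the numbers from the summary sentence and compare them with the table. The sketch below assumes the table values quoted in this thread (10,912 test artifacts, 154 documents, 10 skills) and uses an abbreviated copy of the summary paragraph. The function names are illustrative only.

```python
import re

# Values the updated metrics table reports, per the review comments in this thread.
TABLE_METRICS = {
    "test artifacts": 10912,
    "documents": 154,
    "KB-aware Claude Code skills": 10,
}

# Abbreviated copy of the summary paragraph as it stands in the PR.
SUMMARY = (
    "The current database is ~40 MB with 2,052 specifications, "
    "10,847 test artifacts, 14 operational procedures, 176 documents, "
    "8 KB-aware Claude Code skills, all accumulated across 212 sessions."
)


def extract_metric(text: str, label: str):
    """Return the integer that directly precedes `label` in `text`, or None."""
    match = re.search(r"(\d[\d,]*)\s+" + re.escape(label), text)
    return int(match.group(1).replace(",", "")) if match else None


def stale_metrics(text: str, table: dict) -> dict:
    """Return {label: (value_in_text, value_in_table)} for every disagreement."""
    stale = {}
    for label, expected in table.items():
        found = extract_metric(text, label)
        if found != expected:
            stale[label] = (found, expected)
    return stale


for label, (found, expected) in stale_metrics(SUMMARY, TABLE_METRICS).items():
    print(f"{label}: summary says {found}, table says {expected}")
```

Keeping the table as the single source of truth and checking prose against it avoids deciding which of two conflicting numbers is current.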