big_boy_benchmarking is the official benchmarking and calibration repository
for state_collapser.
Current public beta component:
Big Boy Calibration / Smoke
Future component:
Benchmarking
This beta is source-first and public-inspection oriented. It contains working environment surfaces, evaluation machinery, artifact contracts, human-readable readouts, and smoke-scale calibration evidence. It does not yet claim broad benchmark victory, statistical significance, or general tower superiority.
git clone <repo-url>
cd big_boy_benchmarking
uv sync --group dev
uv run pytest
uv run python -m big_boy_benchmarking.cli --helpCurrent reports assume state_collapser v0.7.2 or newer compatible
pointwise liftability semantics.
| Environment | Status | Public docs | What it is for |
|---|---|---|---|
| Counterpoint Symbolic v001 | Active calibration/smoke environment | environment docs | Symbolic hidden-graph and contraction-schema workbench used to develop BBB artifact, readout, liftability, and tower-training machinery. |
| PlateSupport 5x5 Default v001 | Active robotics-like calibration/smoke environment | environment docs | Constrained plate-support control surface with meaningful invalid-action behavior and a completed standard gauntlet. |
| Report | Status | Link | Bounded conclusion |
|---|---|---|---|
| Counterpoint first serious learning | Complete structural-limit diagnostic | README | Harness, direct baselines, artifact pipeline, and readout path work; early non-empty tower arms expose collapse/lift limitations. |
| Counterpoint noisy-rate contraction diagnostics | Complete structural diagnostic | README | Edge-global noisy-rate schemas can avoid immediate full collapse on the widened fixture and produce inspectable candidate towers. |
| Counterpoint noisy-rate full-tower training diagnostic | Complete tower-only training-health diagnostic | README | Selected non-collapsed candidates can be rebuilt and trained under pointwise liftability semantics; this is not a direct comparison. |
| Counterpoint second serious schema comparison | Complete bounded comparison surface | README | Matched Schema 0 versus Schema 1 comparison works; current evidence is narrow and calibration-scale. |
| Counterpoint threshold frontier probe | Complete next-measure probe | README | Threshold sweeps expose a small Schema 1 margin pattern, not broad dominance. |
| Counterpoint small paired replicate probe | Complete next-measure probe | README | Seed-paired machinery works and records a weak positive Schema 1 margin pattern; not statistical significance. |
| PlateSupport standard gauntlet | Complete correction gauntlet with bounded positive smoke signal | README | The selected iterated tower candidate beat the direct baseline on the calibrated binary-success target and showed a coherent action-filtering signal. |
| PlateSupport direct-star cul-de-sac control | Complete diagnostic control | README | Abdul Malik's cul-de-sac concern is tested by adding one-step guarded direct controls beside the selected tower candidate. |
| PlateSupport tower-star guarded lift comparison | Complete diagnostic control with inconclusive smoke result | README | Direct-star and tower-star controls are both implemented; the first smoke run is tied on the primary target and does not resolve a tower advantage. |
The full evaluation index is in docs/evaluations/README.md.
Supported by the checked-in readouts:
- BBB can build repo-resident artifacts, summaries, badges, and human-readable
reports for nontrivial
state_collapserdownstream environments. - Counterpoint established the artifact/readout/tower-control workflow and exposed real integration issues, including pointwise liftability.
- PlateSupport now provides the clearest calibration/smoke signal: in the standard gauntlet correction run, the selected tower arm hit the calibrated target more often than direct, had much better mean reward, and made zero invalid concrete moves while direct made many invalid moves.
- Follow-up PlateSupport control diagnostics now separate the Abdul Malik cul-de-sac concern from the original positive smoke result. The first tower-star guarded-lift smoke run is correctly bounded as inconclusive, not as a new positive tower claim.
- The current evidence is strong enough to justify further benchmark design.
Not supported yet:
- general tower superiority;
- final robotics benchmark claims;
- statistical significance across large budgets;
- tensor-enabled or GPU performance claims;
- PyPI release stability;
- broad claims beyond the exact checked-in environments, budgets, and readouts.
Human-readable reports and compact summaries live in git. Large raw run trees
and event-level artifacts have been externalized to a local release-asset
bundle for the v0.1.0-beta.1 public beta.
Bundle metadata is tracked in docs/design/beta_public_release/release_asset_manifests:
- asset name:
big_boy_calibration_smoke_v0.1.0-beta.1_artifacts.tar.zst; - checksum:
b0fd6be1d30abaad25d5a02a308a44d6f52e3ac409c99f735150d408b94d4090; - raw artifact file count: 4,207.
When inspecting a report, start with its checked-in readout_source.json.
Durable human readouts are generated with the explicit protocol-file command:
execute docs/prime_directive/artifact_table_to_readable_document_protocol.md at docs/evaluations/<environment>/<evaluation>/readout_source.json
BBB uses a three-step workflow:
- Design and build an environment.
- Design and build evaluations or gauntlets for that environment.
- Convert machine-readable artifact tables into human-readable repo reports.
Key protocol folders:
- docs/prime_directive: operational protocols for Codex/engineer collaboration.
- docs/design: open-lab design history, blueprints, workplans, and implementation logs.
- docs/environments: environment descriptions and readiness docs.
- docs/evaluations: human-readable evaluation reports.
- docs/engineer_continuity: continuity reports and handoff notes.
uv sync --group dev
uv run pytest
uv run python -m big_boy_benchmarking.cli validate-contracts
uv run python scripts/release_hygiene.py --repo-root .The future installed command name bbb is reserved. The stable beta entry
point is currently:
uv run python -m big_boy_benchmarking.cliThis branch is preparing the initial public beta:
v0.1.0-beta.1
Release notes, governance files, CI, and artifact-bundle manifests are part of the beta-readiness work. Tagging, publishing, uploading release assets, making the GitHub repository public, and publishing to PyPI are separate release actions and are not implied by this README.