TYLERSFOSTER/big_boy_benchmarking

BBB

Big Boy Benchmarking

big_boy_benchmarking is the official benchmarking and calibration repository for state_collapser.

Current public beta component:

Big Boy Calibration / Smoke

Future component:

Benchmarking

This beta is source-first and public-inspection oriented. It contains working environment surfaces, evaluation machinery, artifact contracts, human-readable readouts, and smoke-scale calibration evidence. It does not yet claim broad benchmark victory, statistical significance, or general tower superiority.

Install From Source

git clone <repo-url>
cd big_boy_benchmarking
uv sync --group dev
uv run pytest
uv run python -m big_boy_benchmarking.cli --help

Current reports assume state_collapser v0.7.2, or a newer version with compatible pointwise liftability semantics.
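
A minimal sketch of checking the installed version against that floor. It assumes the distribution is registered under the name `state_collapser` (not confirmed by this README), and the helpers `release_tuple` and `meets_minimum` are hypothetical names, not part of BBB. It compares only the numeric release components, so a newer version may still differ in liftability semantics.

```python
from importlib import metadata

MINIMUM = (0, 7, 2)  # floor stated in this README: state_collapser v0.7.2


def release_tuple(version: str) -> tuple[int, ...]:
    """Return the leading numeric components, e.g. '0.7.2' -> (0, 7, 2).

    Pre-release suffixes such as 'rc1' are ignored, which is a simplification.
    """
    parts = []
    for piece in version.lstrip("v").split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if len(digits) != len(piece):
            break  # stop at the first non-numeric suffix
    return tuple(parts)


def meets_minimum(version: str, minimum: tuple[int, ...] = MINIMUM) -> bool:
    """True when the numeric release is at least the stated floor."""
    return release_tuple(version) >= minimum


if __name__ == "__main__":
    try:
        # Assumed distribution name; adjust if the package is published differently.
        installed = metadata.version("state_collapser")
    except metadata.PackageNotFoundError:
        print("state_collapser is not installed in this environment")
    else:
        verdict = "ok" if meets_minimum(installed) else "older than 0.7.2"
        print(installed, verdict)
```

Run it inside the project environment, for example with `uv run python`, so that it inspects the same packages the reports were built against.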

What Is Here Now

Environments

| Environment | Status | Public docs | What it is for |
| --- | --- | --- | --- |
| Counterpoint Symbolic v001 | Active calibration/smoke environment | environment docs | Symbolic hidden-graph and contraction-schema workbench used to develop BBB artifact, readout, liftability, and tower-training machinery. |
| PlateSupport 5x5 Default v001 | Active robotics-like calibration/smoke environment | environment docs | Constrained plate-support control surface with meaningful invalid-action behavior and a completed standard gauntlet. |

Main Human-Readable Reports

| Report | Status | Link | Bounded conclusion |
| --- | --- | --- | --- |
| Counterpoint first serious learning | Complete structural-limit diagnostic | README | Harness, direct baselines, artifact pipeline, and readout path work; early non-empty tower arms expose collapse/lift limitations. |
| Counterpoint noisy-rate contraction diagnostics | Complete structural diagnostic | README | Edge-global noisy-rate schemas can avoid immediate full collapse on the widened fixture and produce inspectable candidate towers. |
| Counterpoint noisy-rate full-tower training diagnostic | Complete tower-only training-health diagnostic | README | Selected non-collapsed candidates can be rebuilt and trained under pointwise liftability semantics; this is not a direct comparison. |
| Counterpoint second serious schema comparison | Complete bounded comparison surface | README | Matched Schema 0 versus Schema 1 comparison works; current evidence is narrow and calibration-scale. |
| Counterpoint threshold frontier probe | Complete next-measure probe | README | Threshold sweeps expose a small Schema 1 margin pattern, not broad dominance. |
| Counterpoint small paired replicate probe | Complete next-measure probe | README | Seed-paired machinery works and records a weak positive Schema 1 margin pattern; not statistical significance. |
| PlateSupport standard gauntlet | Complete correction gauntlet with bounded positive smoke signal | README | The selected iterated tower candidate beat the direct baseline on the calibrated binary-success target and showed a coherent action-filtering signal. |
| PlateSupport direct-star cul-de-sac control | Complete diagnostic control | README | Abdul Malik's cul-de-sac concern is tested by adding one-step guarded direct controls beside the selected tower candidate. |
| PlateSupport tower-star guarded lift comparison | Complete diagnostic control with inconclusive smoke result | README | Direct-star and tower-star controls are both implemented; the first smoke run is tied on the primary target and does not resolve a tower advantage. |

The full evaluation index is in docs/evaluations/README.md.

Current Conclusions

Supported by the checked-in readouts:

  • BBB can build repo-resident artifacts, summaries, badges, and human-readable reports for nontrivial state_collapser downstream environments.
  • Counterpoint established the artifact/readout/tower-control workflow and exposed real integration issues, including pointwise liftability.
  • PlateSupport now provides the clearest calibration/smoke signal: in the standard gauntlet correction run, the selected tower arm hit the calibrated target more often than the direct baseline, had much better mean reward, and made zero invalid concrete moves, while the direct baseline made many.
  • Follow-up PlateSupport control diagnostics now separate the Abdul Malik cul-de-sac concern from the original positive smoke result. The first tower-star guarded-lift smoke run is correctly bounded as inconclusive, not as a new positive tower claim.
  • The current evidence is strong enough to justify further benchmark design.

Not supported yet:

  • general tower superiority;
  • final robotics benchmark claims;
  • statistical significance across large budgets;
  • tensor-enabled or GPU performance claims;
  • PyPI release stability;
  • broad claims beyond the exact checked-in environments, budgets, and readouts.

Artifact Policy

Human-readable reports and compact summaries live in git. Large raw run trees and event-level artifacts have been externalized to a local release-asset bundle for the v0.1.0-beta.1 public beta.

Bundle metadata is tracked in docs/design/beta_public_release/release_asset_manifests:

  • asset name: big_boy_calibration_smoke_v0.1.0-beta.1_artifacts.tar.zst;
  • checksum: b0fd6be1d30abaad25d5a02a308a44d6f52e3ac409c99f735150d408b94d4090;
  • raw artifact file count: 4,207.
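
A downloaded bundle can be checked against the listed checksum before unpacking. The sketch below assumes the checksum is SHA-256. Its 64 hexadecimal characters are consistent with that, but this README does not name the algorithm. It also assumes the bundle sits in the current directory, and `sha256_of` and `matches` are hypothetical helper names.

```python
import hashlib
from pathlib import Path

# Values copied from the bundle metadata listed above.
ASSET = "big_boy_calibration_smoke_v0.1.0-beta.1_artifacts.tar.zst"
EXPECTED = "b0fd6be1d30abaad25d5a02a308a44d6f52e3ac409c99f735150d408b94d4090"


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large bundles are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected: str = EXPECTED) -> bool:
    """Compare the file digest with the expected hex string, ignoring case."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    bundle = Path(ASSET)  # assumed location: current working directory
    if not bundle.exists():
        print(f"bundle not found: {bundle}")
    elif matches(bundle):
        print("checksum matches the manifest value")
    else:
        print("checksum mismatch; do not trust this bundle")
```

If the digests differ even for a freshly downloaded bundle, the SHA-256 assumption may be wrong; the manifests under docs/design/beta_public_release/release_asset_manifests are the authority on which algorithm was used.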

When inspecting a report, start with its checked-in readout_source.json. Durable human readouts are generated with the explicit protocol-file command:

execute docs/prime_directive/artifact_table_to_readable_document_protocol.md at docs/evaluations/<environment>/<evaluation>/readout_source.json
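
To see which evaluations have a checked-in source file, the path pattern above can be walked directly. This is a small sketch that assumes only the `docs/evaluations/<environment>/<evaluation>/readout_source.json` layout; the JSON schema is not described in this README, so the code reports top-level keys and nothing more. `find_readout_sources` and `top_level_keys` are hypothetical helper names.

```python
import json
from pathlib import Path


def find_readout_sources(repo_root):
    """Return every readout_source.json exactly two levels below docs/evaluations."""
    base = Path(repo_root) / "docs" / "evaluations"
    return sorted(base.glob("*/*/readout_source.json"))


def top_level_keys(path):
    """List top-level keys of a JSON object; return [] for non-object documents."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return sorted(data) if isinstance(data, dict) else []


if __name__ == "__main__":
    for source in find_readout_sources("."):
        environment = source.parent.parent.name
        evaluation = source.parent.name
        print(environment, evaluation, top_level_keys(source))
```

Running it from the repository root prints one line per evaluation, which is a quick way to pick a `readout_source.json` to pass to the protocol command.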

Workflow

BBB uses a three-step workflow:

  1. Design and build an environment.
  2. Design and build evaluations or gauntlets for that environment.
  3. Convert machine-readable artifact tables into human-readable repo reports.

Key protocol folders:

Development Commands

uv sync --group dev
uv run pytest
uv run python -m big_boy_benchmarking.cli validate-contracts
uv run python scripts/release_hygiene.py --repo-root .

The future installed command name bbb is reserved. The stable beta entry point is currently:

uv run python -m big_boy_benchmarking.cli

Release Status

This branch is preparing the initial public beta:

v0.1.0-beta.1

Release notes, governance files, CI, and artifact-bundle manifests are part of the beta-readiness work. Tagging, publishing, uploading release assets, making the GitHub repository public, and publishing to PyPI are separate release actions and are not implied by this README.
