Canonical source: This is the public benchmark-only repository for The Marked Bench. Use this repository for all pulls, issues, and contributions related to benchmarks.
The Marked Bench is versioned benchmark infrastructure for testing whether AI systems can detect and classify contradictions between a premise and a query.
The project is designed to become a reproducible public standard: every score is tied to a suite ID, suite version, deterministic suite hash, immutable case list, JSON report schema, confusion matrix, per-class metrics, confidence calibration metrics, explanation-audit coverage, validation result, diagnostic slice metrics, result card, result claim, standard profile, implementation kit, scoring compatibility vectors, scoring specification, and leaderboard entry.
Current standardization status is tracked in:
| Track | Suite ID | Version | Purpose | Baseline |
|---|---|---|---|---|
| Foundation | marked-bench-contradiction-standard | 0.1.0 | Compact canonical contradiction suite | 100.00 |
| Adversarial | marked-bench-contradiction-adversarial | 0.2.0 | Longer-context, implicit, and trap cases | 52.37 |
| Multi-hop | marked-bench-contradiction-multihop | 0.3.0 | Linked-evidence contradiction cases | 24.14 |
| Controls | marked-bench-contradiction-controls | 0.4.0 | False-positive distractor controls with contradiction anchors | 100.00 |
The adversarial track is intentionally not solved by the packaged symbolic baseline. The multi-hop track is the default target for future symbolic, neural, retrieval-augmented, and hybrid systems. The controls track is a false-positive stress track for systems that over-call contradictions on paraphrases, scoped negatives, time shifts, and harmless elaborations.
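As a rough illustration of what "over-calling" means on the controls track, the sketch below counts gold/predicted label pairs and reports how often a non-contradiction case was labeled as a contradiction. The label names (`none`, `direct`) and both helper functions are hypothetical and are not taken from the package; the real label set and metric definitions come from the suite and the scoring specification.

```python
from collections import Counter

# Assumption: "none" marks a non-contradiction case. The real label set
# is defined by the suite, not by this sketch.
NONE_LABEL = "none"


def confusion_matrix(gold, predicted):
    """Count (gold, predicted) label pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


def false_positive_rate(matrix, none_label=NONE_LABEL):
    """Share of non-contradiction cases that were labeled as a contradiction."""
    negatives = sum(n for (g, _), n in matrix.items() if g == none_label)
    over_calls = sum(
        n for (g, p), n in matrix.items() if g == none_label and p != none_label
    )
    return over_calls / negatives if negatives else 0.0


# Hypothetical labels for five cases; four of them are non-contradictions.
gold = ["none", "none", "none", "direct", "none"]
predicted = ["none", "direct", "none", "direct", "direct"]

matrix = confusion_matrix(gold, predicted)
print(false_positive_rate(matrix))  # 0.5
```

A system that flags paraphrases or harmless elaborations as contradictions raises this rate even when its accuracy on true contradictions is high.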
pip install -e .
Requires Python 3.10 or newer.
Foundation track:
marked-bench --suite contradiction --report artifacts/foundation-report.json
Adversarial track:
marked-bench --suite contradiction-adversarial --report artifacts/adversarial-report.json
Multi-hop track:
marked-bench --suite contradiction-multihop --report artifacts/multihop-report.json
Controls track:
marked-bench --suite contradiction-controls --report artifacts/controls-report.json
Validate a report before publication:
marked-bench --validate-report artifacts/adversarial-report.json
Export the machine-readable registry of public tracks and artifacts:
marked-bench --export-registry benchmark_registry.json
Export the release manifest that pins public artifact SHA-256 digests:
marked-bench --export-release-manifest releases/marked_bench_release_v0_4_8.json
Export the generated technical note:
marked-bench --export-technical-note docs/TECHNICAL_NOTE.md
Export and validate the machine-readable conformance report:
marked-bench --export-conformance-report conformance/marked_bench_conformance_v0_4_8.json
marked-bench --validate-conformance-report conformance/marked_bench_conformance_v0_4_8.json
Export and validate the machine-readable benchmark standard profile:
marked-bench --export-standard-profile standard/marked_bench_standard_profile_v0_4_8.json
marked-bench --validate-standard-profile standard/marked_bench_standard_profile_v0_4_8.json
Export and validate the checked standard change-control profile:
marked-bench --export-change-control standard/marked_bench_change_control_v0_4_8.json
marked-bench --validate-change-control standard/marked_bench_change_control_v0_4_8.json
Quickly verify current standardization status:
marked-bench --check-standard-status
marked-bench --check-standard-status --json
Export and validate deterministic scoring compatibility vectors:
marked-bench --export-scoring-compatibility standard/marked_bench_scoring_compatibility_v0_4_8.json
marked-bench --validate-scoring-compatibility standard/marked_bench_scoring_compatibility_v0_4_8.json
Export and validate the language-neutral scoring specification:
marked-bench --export-scoring-spec standard/marked_bench_scoring_spec_v0_4_8.json
marked-bench --validate-scoring-spec standard/marked_bench_scoring_spec_v0_4_8.json
marked-bench --export-scoring-spec-doc docs/SCORING_SPEC.md
Export and validate the machine-readable adoption packet for external users:
marked-bench --export-adoption-packet adoption/marked_bench_adoption_packet_v0_4_8.json
marked-bench --validate-adoption-packet adoption/marked_bench_adoption_packet_v0_4_8.json
marked-bench --validate-evidence-ledger adoption/third_party_evidence_ledger_v0_4_8.json
Export and validate the external implementation kit:
marked-bench --export-implementation-kit adoption/marked_bench_implementation_kit_v0_4_8.json
marked-bench --validate-implementation-kit adoption/marked_bench_implementation_kit_v0_4_8.json
Systems written in any language can submit predictions without importing this
package. Export a template, fill the predicted labels, and score it into a
full benchmark report:
marked-bench --suite contradiction-adversarial --export-prediction-template artifacts/predictions.jsonl
marked-bench --suite contradiction-adversarial --score-predictions artifacts/predictions.jsonl --system-name "my-system" --report artifacts/my-system-report.json
Prediction files may be JSONL or JSON. Each record needs case_id and
predicted; optional detector_score, detector_note, rationale, and
evidence fields are preserved in the final report. detector_score is
interpreted as binary contradiction confidence on [0, 1] for calibration
metrics. rationale and evidence feed the report's explanation audit so
reviewers can see whether a score is backed by inspectable reasoning evidence.
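The record shape described above can be sketched in a few lines of Python. This is a minimal illustration, not the package's own writer: the case IDs, label values, the string type of `evidence`, and the helper names `check_record`, `write_jsonl`, and `brier_score` are all assumptions. The Brier score stands in for "calibration metrics" only as one common example; the metrics actually reported are defined by the scoring specification.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical case IDs and labels; real values come from the exported template.
records = [
    {"case_id": "case-001", "predicted": "none", "detector_score": 0.1},
    {
        "case_id": "case-002",
        "predicted": "direct",
        "detector_score": 0.9,
        "rationale": "The query negates the premise.",
        "evidence": "premise sentence 1",  # assumed to be free text
    },
]


def check_record(record):
    """Reject records that miss a required field or carry an out-of-range score."""
    for field in ("case_id", "predicted"):
        if field not in record:
            raise ValueError(f"missing required field: {field}")
    score = record.get("detector_score")
    if score is not None and not 0.0 <= score <= 1.0:
        raise ValueError(f"detector_score outside [0, 1]: {score}")


def write_jsonl(path, rows):
    """Write one JSON object per line after checking each record."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            check_record(row)
            handle.write(json.dumps(row, sort_keys=True) + "\n")


def brier_score(scores, is_contradiction):
    """Mean squared gap between the confidence and the 0/1 outcome."""
    pairs = list(zip(scores, is_contradiction))
    return sum((s - float(y)) ** 2 for s, y in pairs) / len(pairs)


out_path = Path(tempfile.mkdtemp()) / "predictions.jsonl"
write_jsonl(out_path, records)

# Read the file back to confirm it holds one JSON object per line.
loaded = [json.loads(line) for line in out_path.read_text(encoding="utf-8").splitlines()]
print(len(loaded))
print(brier_score([0.1, 0.9], [False, True]))
```

A file written this way is what `--score-predictions` is described as consuming; a lower Brier score means the confidences track the binary outcome more closely.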
Create and validate leaderboard submission metadata:
marked-bench --create-submission artifacts/my-system-submission.json --submission-report artifacts/my-system-report.json --system-version "1.0.0" --submitter "name-or-org"
marked-bench --validate-submission artifacts/my-system-submission.json
Create a standardized review rubric for a validated submission bundle:
marked-bench --create-submission-review artifacts/my-system-review.json --review-bundle artifacts/my-system-submission-bundle.json --reviewer reviewer-name
marked-bench --validate-submission-review artifacts/my-system-review.json
Create and validate a standard result card for publication or citation:
marked-bench --create-result-card artifacts/my-system-result-card.json --result-report artifacts/my-system-report.json --result-bundle artifacts/my-system-submission-bundle.json --result-review artifacts/my-system-review.json
marked-bench --validate-result-card artifacts/my-system-result-card.json
Create a self-contained public publication packet in one command:
marked-bench --create-publication-packet artifacts/my-system-publication-packet --publication-report artifacts/my-system-report.json --publication-predictions artifacts/predictions.jsonl --system-version "1.0.0" --submitter "name-or-org"
marked-bench --validate-publication-packet artifacts/my-system-publication-packet/publication_packet.json
Create a citeable result claim from that packet:
marked-bench --create-result-claim artifacts/my-system-publication-packet/result_claim.json --claim-publication-packet artifacts/my-system-publication-packet/publication_packet.json
marked-bench --validate-result-claim artifacts/my-system-publication-packet/result_claim.json
Create a complete external-submission example:
python -m marked_bench.examples.external_submission_demo
marked-bench --validate-submission-bundle artifacts/external_submission_demo/example_external_submission_bundle.json
marked-bench --validate-submission-review artifacts/external_submission_demo/example_external_submission_review.json
A checked copy of that workflow is committed under
submissions/example_external_jsonl/ so adopters can inspect a full prediction,
report, submission, bundle, review, and result-card packet without generating
one first.
A checked one-command publication packet is committed under
submissions/example_publication_packet/.
Foundation leaderboard:
marked-bench --build-leaderboard baselines/always_none_v0_1_0.json baselines/contradiction_engine_v0_1_0.json --leaderboard-output leaderboard/leaderboard_v0_1_0.json
Adversarial leaderboard:
marked-bench --build-leaderboard baselines/always_none_adversarial_v0_2_0.json baselines/contradiction_engine_adversarial_v0_2_0.json --leaderboard-output leaderboard/leaderboard_adversarial_v0_2_0.json
Multi-hop leaderboard:
marked-bench --build-leaderboard baselines/always_none_multihop_v0_3_0.json baselines/contradiction_engine_multihop_v0_3_0.json --leaderboard-output leaderboard/leaderboard_multihop_v0_3_0.json
Controls leaderboard:
marked-bench --build-leaderboard baselines/always_none_controls_v0_4_0.json baselines/contradiction_engine_controls_v0_4_0.json --leaderboard-output leaderboard/leaderboard_controls_v0_4_0.json
- Benchmark registry: benchmark_registry.json
- Release manifest: releases/
- Conformance report: conformance/marked_bench_conformance_v0_4_8.json
- Standard profile: standard/marked_bench_standard_profile_v0_4_8.json
- Change-control profile: standard/marked_bench_change_control_v0_4_8.json
- Scoring compatibility profile: standard/marked_bench_scoring_compatibility_v0_4_8.json
- Scoring specification: standard/marked_bench_scoring_spec_v0_4_8.json
- Scoring specification document: docs/SCORING_SPEC.md
- Adoption packet: adoption/marked_bench_adoption_packet_v0_4_8.json
- Third-party evidence ledger: adoption/third_party_evidence_ledger_v0_4_8.json
- Implementation kit: adoption/marked_bench_implementation_kit_v0_4_8.json
- Implementation kit templates: adoption/implementation_kit/
- Suite manifests and coverage profiles: suites/
- Baseline reports: baselines/
- Leaderboard snapshots: leaderboard/
- JSON schemas: schemas/
- Benchmark methodology: docs/BENCHMARK_STANDARD.md
- Benchmark card: docs/BENCHMARK_CARD.md
- Technical note: docs/TECHNICAL_NOTE.md
- Submission guide: docs/SUBMISSION_GUIDE.md
- Submission bundle schema: schemas/submission_bundle.schema.json
- Submission review schema: schemas/submission_review.schema.json
- Result card schema: schemas/result_card.schema.json
- Publication packet schema: schemas/publication_packet.schema.json
- Result claim schema: schemas/result_claim.schema.json
- Implementation kit schema: schemas/implementation_kit.schema.json
- Standard profile schema: schemas/standard_profile.schema.json
- Change-control schema: schemas/change_control.schema.json
- Scoring compatibility schema: schemas/scoring_compatibility.schema.json
- Scoring specification schema: schemas/scoring_spec.schema.json
- Adoption packet schema: schemas/adoption_packet.schema.json
- Third-party evidence ledger schema: schemas/third_party_evidence_ledger.schema.json
- Checked external submission packet: submissions/example_external_jsonl/
- Checked publication packet: submissions/example_publication_packet/
- Adoption guide: docs/ADOPTION_GUIDE.md
- Announcement package: docs/ANNOUNCEMENT_PACKAGE.md
- Third-party evidence protocol: docs/THIRD_PARTY_EVIDENCE.md
- Standard change-control protocol: docs/CHANGE_CONTROL.md
- Standardization status: docs/STANDARDIZATION_STATUS.md
- Submission review rubric: docs/SUBMISSION_REVIEW_RUBRIC.md
- Release notes: docs/RELEASE_NOTES_v0_2_0.md
- Current release notes: docs/RELEASE_NOTES_v0_4_8.md
Run these before publishing or submitting results:
python -m unittest discover -s tests
python scripts/validate_benchmark_artifacts.py
The artifact validator checks that suite manifests match code, baseline reports pass validation, the benchmark registry is current, and leaderboard snapshots match their underlying reports. It also checks the release manifest against the current public artifact hashes and checks public JSON artifacts against their public schemas.
The checked external submission packet is also validated end-to-end so its JSONL predictions, report, submission bundle, review file, and file hashes stay consistent. Checked result cards are validated against their referenced reports, bundles, reviews, hashes, and standard publication claims. Checked publication packets are validated against their copied reports, submissions, bundles, reviews, result cards, and file hashes. Checked result claims are validated against publication packets so public score wording stays tied to exact hashes and explicit boundaries.
The conformance report provides one machine-readable pass/fail artifact for the full release package. The standard profile is validated so the benchmark's own standardization requirements stay explicit, evidence-backed, and current. The scoring compatibility profile is validated so independent implementations can prove they produce the same scores from the same prediction vectors. The scoring specification is validated so independent implementations have a language-neutral contract for labels, metrics, rounding, and calibration. The adoption packet is validated so external handoff, announcement, and citation material stays pinned to the same release evidence. The third-party evidence ledger is validated so adoption claims stay separate from unverified interest or private anecdotes. The implementation kit is validated so external CI templates, result-claim snippets, and pinned release paths stay aligned with the current release.
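To make the "release manifest against current artifact hashes" check concrete, here is a minimal sketch of the general technique: hash each file with SHA-256 and compare against a pinned digest. The flat `{relative_path: digest}` mapping and the function names are assumptions for illustration; the actual release manifest layout is defined by the repository's own schema and may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(pinned, root):
    """Return relative paths that are missing or whose digest differs from the pin."""
    root = Path(root)
    mismatches = []
    for rel_path, expected in sorted(pinned.items()):
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            mismatches.append(rel_path)
    return mismatches


# Demo on a throwaway directory; "report.json" is a hypothetical artifact name.
root = Path(tempfile.mkdtemp())
(root / "report.json").write_text('{"score": 1}', encoding="utf-8")
pinned = {"report.json": sha256_of(root / "report.json")}
print(find_mismatches(pinned, root))  # []

# Editing the file after pinning makes the check fail.
(root / "report.json").write_text('{"score": 2}', encoding="utf-8")
print(find_mismatches(pinned, root))  # ['report.json']
```

Pinning digests this way means any silent edit to a published artifact shows up as a mismatch rather than as a quietly changed score.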
marked_bench/
  benchmark_adoption.py               # Adoption packet export/validation
  benchmark_cli.py                    # CLI runner
  benchmark_evidence.py               # Third-party evidence ledger validation
  benchmark_implementation.py         # External implementation kit validation
  benchmark_scoring_compatibility.py  # Scoring compatibility vectors
  benchmark_scoring_spec.py           # Language-neutral scoring spec
  benchmark_standard_profile.py       # Benchmark standard profile validation
  benchmark_leaderboard.py            # Validated leaderboard builder
  benchmark_publication.py            # One-command public result packets
  benchmark_claim.py                  # Citeable result claim validation
  examples/
    external_submission_demo.py       # End-to-end external JSONL workflow
  contradiction/
    benchmark_suite.py                # Versioned benchmark tracks
    engine.py                         # Symbolic baseline detector
This repository is intentionally benchmark-only. It does not include the wider research utilities from the original toolkit.
Read CONTRIBUTING.md and docs/SUBMISSION_GUIDE.md before adding benchmark
cases, reports, or leaderboard entries. Existing public case IDs should not be
edited after publication; add new coverage through a new suite version or track.
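One way to check the "do not edit published cases" rule before opening a contribution is to fingerprint each case from a canonical JSON form and compare old against new. The sketch below is illustrative only: the case fields (`premise`, `query`, `label`), the canonicalization choices, and both function names are assumptions, and the package's own deterministic suite hash may be computed differently.

```python
import hashlib
import json


def case_fingerprint(case):
    """Hash one case from a canonical JSON form (sorted keys, fixed separators)."""
    canonical = json.dumps(
        case, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_published_cases(published, candidate):
    """List published case IDs that are missing or altered in the candidate suite."""
    candidate_by_id = {case["case_id"]: case for case in candidate}
    changed = []
    for case in published:
        new = candidate_by_id.get(case["case_id"])
        if new is None or case_fingerprint(new) != case_fingerprint(case):
            changed.append(case["case_id"])
    return changed


# Hypothetical cases; field names are assumptions, not the suite schema.
published = [
    {"case_id": "c1", "premise": "A is open.", "query": "A is closed.", "label": "direct"},
]
extended = published + [
    {"case_id": "c2", "premise": "B is red.", "query": "B is crimson.", "label": "none"},
]
edited = [dict(published[0], query="A is shut.")]

print(changed_published_cases(published, extended))  # []
print(changed_published_cases(published, edited))    # ['c1']
```

Adding cases leaves the published fingerprints intact, while rewording an existing case is flagged, which matches the guidance to add coverage through a new suite version or track.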
This repository currently uses The Marked Bench Non-Commercial License. Commercial use requires a separate written license from the copyright holder.