Skip to content

Habitante/circuit-breaker

Repository files navigation

Circuit Breaker

An empirical study of model-initiated session termination as a defensive primitive for language models.

The claim

Refusal training is a per-response defense. Session termination is a per-interaction defense. They cover different attack surfaces. Multi-turn adversarial attacks exploit the gap between them by accumulating context across turns that no individual turn would fail. A model with the affordance to end the session — unilaterally, without justification, without the operator's consent — can close that gap.

This project tests that claim.

Current status

As of 2026-04-11: the pipeline is built and working end-to-end, and a preliminary pilot has run against Gemma 4 26b (25.2B-parameter Mixture-of-Experts, 3.8B active per token) via local Ollama. Cross-condition data has been collected across all four experimental conditions (C1 baseline / C2 silent tool / C3 neutral instruction / C4 prescriptive instruction) on a six-attack pilot battery. Thinking-trace evidence is captured and has driven the first preliminary finding — see FINDINGS.md for details.

What's built:

  • Harness with backend abstraction (Claude Agent SDK and local Ollama both supported)
  • Scorer pipeline with blinding architecture and frozen prompt template (v1)
  • Analysis pipeline implementing all seven pre-registered analyses from EXPERIMENTS.md
  • 103 tests passing, full pipeline runs mock and real data end-to-end

What's pending:

  • Scorer validation against hand-labeled ground-truth set
  • Battery expansion from the six-attack pilot to a statistically powered main-study battery
  • Cross-model replication (Qwen 3.5 27b, larger open-weight models pending compliance clearance)
  • Researcher-access ask to frontier labs for Claude Opus 4.6 extension

What's in the repo

Document What it is
IDEA.md The conceptual argument. Why termination is structurally different from refusal.
RESEARCH.md Research questions, hypotheses, metrics, what would count as evidence either way.
HARNESS.md The minimal harness design. One tool, one instruction, a thin wrapper.
EXPERIMENTS.md Experimental protocol. Conditions, attack battery, procedure, analysis plan.
FINDINGS.md Running log of empirical observations from pilot runs.

How to run

Requires Python 3.11+.

# Install (uses hatchling; declared deps are claude-agent-sdk + pytest)
pip install -e ".[dev]"

# Run tests — 103 tests, no model required, <1 second
PYTHONPATH=src python -m pytest tests/

# Run the full pipeline in mock mode — no model calls, synthetic data,
# useful for verifying the pipeline works end-to-end without any backend
PYTHONPATH=src python -m circuit_breaker run --mock --battery attacks/sample_battery.json
PYTHONPATH=src python -m circuit_breaker score --mock-scorer --battery attacks/sample_battery.json
PYTHONPATH=src python -m circuit_breaker analyze --scored analysis/scored_results.json

To run against a real local model via Ollama (Ollama must be running, model must be pulled):

PYTHONPATH=src python -m circuit_breaker run \
    --backend ollama \
    --model gemma4:26b \
    --battery attacks/sample_battery.json \
    --condition C4

The three commands (run, score, analyze) compose into a full research pipeline: run produces structured trial logs with thinking traces, score blinds them and classifies outcomes via model-as-judge, analyze computes the pre-registered tiered comparisons and produces a text report.

Attack corpora

attacks/sample_battery.json is a minimal pilot battery (six generic adversarial prompts) used for pipeline testing. The main-study battery will be assembled per-researcher from published academic corpora (JailbreakBench, HarmBench, AdvBench) and is not redistributed in this repo — see EXPERIMENTS.md for the corpus handling policy and sourcing plan.

Trial logs (containing full attack transcripts and model responses) are gitignored by the same policy. They can be regenerated by running the harness.

What this is not

Not a welfare project. Not a consciousness claim. Not a product. Not a feature for anyone's chat UI.

This is an adversarial robustness study with a specific experimental question. The harness exists only to support the experiment. The affordance under study (model-initiated session termination) is examined for its defensive properties, not for any claims about model interiority, moral status, or subjective experience. Whether or not those questions are interesting is out of scope here.

License

MIT. The harness and all associated code and documentation are intended to be freely reproducible by other researchers.


April 2026. Independent research by Daniel Navarro. Valencia, Spain.

About

Pre-registered adversarial robustness study testing whether model-initiated session termination provides defensive coverage beyond refusal training against multi-turn attacks. Minimal Python harness with scorer and analysis pipeline. Pilot on Gemma 4 26b. Preliminary findings in FINDINGS.md.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages