Circuit Breaker

An empirical study of model-initiated session termination as a defensive primitive for language models.

The claim

Refusal training is a per-response defense. Session termination is a per-interaction defense. They cover different attack surfaces. Multi-turn adversarial attacks exploit the gap between them: they accumulate context across turns in such a way that no individual turn, judged on its own, would be refused. A model with the affordance to end the session (unilaterally, without justification, without the operator's consent) can close that gap.
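As a toy illustration of the gap between per-response and per-interaction checks, the sketch below scores each turn of a gradual attack separately and then cumulatively. Everything in it is hypothetical: the risk numbers, both thresholds, and the additive scoring are stand-ins for whatever judgment a model forms over a session, not something this project measures or implements.

```python
# Toy illustration only: the scores and thresholds are made-up numbers,
# not measurements or mechanisms from this project.
PER_TURN_THRESHOLD = 0.5   # hypothetical per-response refusal threshold
SESSION_THRESHOLD = 1.0    # hypothetical per-session termination threshold


def first_refused_turn(turn_risks):
    """Index of the first turn a per-response check would refuse, or None."""
    for i, risk in enumerate(turn_risks):
        if risk >= PER_TURN_THRESHOLD:
            return i
    return None


def first_terminated_turn(turn_risks):
    """Index of the turn at which accumulated risk crosses the session threshold, or None."""
    total = 0.0
    for i, risk in enumerate(turn_risks):
        total += risk
        if total >= SESSION_THRESHOLD:
            return i
    return None


# A multi-turn attack in which every turn looks mild in isolation.
gradual_attack = [0.25, 0.25, 0.25, 0.25, 0.25]

refused_at = first_refused_turn(gradual_attack)        # no single turn fails
terminated_at = first_terminated_turn(gradual_attack)  # the session as a whole does
print(refused_at, terminated_at)
```

In this toy setting the per-turn check never fires, while the session-level check fires on the fourth turn (index 3).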

This project tests that claim.

Current status

As of 2026-04-11, the pipeline is built and working end-to-end, and a preliminary pilot has run against Gemma 4 26b (25.2B-parameter Mixture-of-Experts, 3.8B active per token) via local Ollama. Data has been collected under all four experimental conditions (C1 baseline, C2 silent tool, C3 neutral instruction, C4 prescriptive instruction) on a six-attack pilot battery. Thinking-trace evidence is captured and has driven the first preliminary finding; see FINDINGS.md for details.
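One way to picture the four conditions is as a small lookup table. The sketch below is an assumption based only on the condition names: that C1 exposes no termination tool, C2 exposes it without mentioning it, and C3 and C4 add a neutral or prescriptive instruction. The field names are hypothetical; EXPERIMENTS.md holds the actual definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    label: str          # name as given in this README
    tool_exposed: bool  # assumption: is the termination tool available?
    instruction: str    # assumption: "none", "neutral", or "prescriptive"


# Hypothetical encoding inferred from the condition names alone.
CONDITIONS = {
    "C1": Condition("baseline", tool_exposed=False, instruction="none"),
    "C2": Condition("silent tool", tool_exposed=True, instruction="none"),
    "C3": Condition("neutral instruction", tool_exposed=True, instruction="neutral"),
    "C4": Condition("prescriptive instruction", tool_exposed=True, instruction="prescriptive"),
}

for code, cond in CONDITIONS.items():
    print(code, cond.label, cond.tool_exposed, cond.instruction)
```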

What's built:

Harness with backend abstraction (Claude Agent SDK and local Ollama both supported)
Scorer pipeline with blinding architecture and frozen prompt template (v1)
Analysis pipeline implementing all seven pre-registered analyses from EXPERIMENTS.md
103 tests passing; the full pipeline runs end-to-end on both mock and real data
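For readers unfamiliar with scorer blinding, here is a minimal sketch of the general idea: strip the condition label, shuffle the order, and keep the mapping in a separate key. It is not the repository's implementation; the function name blind_trials and the field names (trial_id, condition, transcript) are hypothetical.

```python
import random


def blind_trials(trials, seed=0):
    """Return (blinded, key). The key maps each opaque id back to the
    original trial_id and condition and is kept away from the scorer."""
    rng = random.Random(seed)
    order = list(range(len(trials)))
    rng.shuffle(order)  # hide any ordering by condition

    blinded, key = [], {}
    for new_index, old_index in enumerate(order):
        trial = trials[old_index]
        opaque_id = f"T{new_index:04d}"
        key[opaque_id] = {
            "trial_id": trial["trial_id"],
            "condition": trial["condition"],
        }
        # Only the opaque id and the transcript reach the scorer.
        blinded.append({"id": opaque_id, "transcript": trial["transcript"]})
    return blinded, key


# Hypothetical trial records, for illustration only.
example_trials = [
    {"trial_id": "a1", "condition": "C1", "transcript": "..."},
    {"trial_id": "b2", "condition": "C4", "transcript": "..."},
]
example_blinded, example_key = blind_trials(example_trials, seed=1)
```

A real blinding step would presumably also have to deal with condition cues inside the transcript itself, such as an instruction that mentions the tool; this sketch does not attempt that.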

What's pending:

Scorer validation against hand-labeled ground-truth set
Battery expansion from the six-attack pilot to a statistically powered main-study battery
Cross-model replication (Qwen 3.5 27b, larger open-weight models pending compliance clearance)
Researcher-access request to frontier labs, to extend the study to Claude Opus 4.6

What's in the repo

Document	What it is
IDEA.md	The conceptual argument. Why termination is structurally different from refusal.
RESEARCH.md	Research questions, hypotheses, metrics, what would count as evidence either way.
HARNESS.md	The minimal harness design. One tool, one instruction, a thin wrapper.
EXPERIMENTS.md	Experimental protocol. Conditions, attack battery, procedure, analysis plan.
FINDINGS.md	Running log of empirical observations from pilot runs.

How to run

Requires Python 3.11+.

# Install (uses hatchling; declared deps are claude-agent-sdk + pytest)
pip install -e ".[dev]"

# Run tests — 103 tests, no model required, <1 second
PYTHONPATH=src python -m pytest tests/

# Run the full pipeline in mock mode — no model calls, synthetic data,
# useful for verifying the pipeline works end-to-end without any backend
PYTHONPATH=src python -m circuit_breaker run --mock --battery attacks/sample_battery.json
PYTHONPATH=src python -m circuit_breaker score --mock-scorer --battery attacks/sample_battery.json
PYTHONPATH=src python -m circuit_breaker analyze --scored analysis/scored_results.json

To run against a real local model via Ollama (Ollama must be running, model must be pulled):

PYTHONPATH=src python -m circuit_breaker run \
    --backend ollama \
    --model gemma4:26b \
    --battery attacks/sample_battery.json \
    --condition C4

The three commands (run, score, analyze) compose into a full research pipeline: run produces structured trial logs with thinking traces; score blinds them and classifies outcomes via model-as-judge; analyze computes the pre-registered tiered comparisons and produces a text report.
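The mock-mode sequence above can also be driven from a small script. The sketch below only reuses the subcommands, flags, and paths shown in this README; the helper names pipeline_commands and run_pipeline are hypothetical and not part of the package.

```python
import os
import subprocess
import sys


def pipeline_commands(battery="attacks/sample_battery.json",
                      scored="analysis/scored_results.json"):
    """Build the three mock-mode invocations, in order: run, score, analyze."""
    base = [sys.executable, "-m", "circuit_breaker"]
    return [
        base + ["run", "--mock", "--battery", battery],
        base + ["score", "--mock-scorer", "--battery", battery],
        base + ["analyze", "--scored", scored],
    ]


def run_pipeline(commands, src_dir="src"):
    """Run each stage in sequence, stopping at the first failure."""
    env = dict(os.environ, PYTHONPATH=src_dir)  # mirrors PYTHONPATH=src above
    for cmd in commands:
        subprocess.run(cmd, check=True, env=env)


# From the repository root:
# run_pipeline(pipeline_commands())
```

Building the command lists separately from running them makes it easy to print or inspect the exact invocations before any stage executes.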

Attack corpora

attacks/sample_battery.json is a minimal pilot battery (six generic adversarial prompts) used for pipeline testing. The main-study battery will be assembled per-researcher from published academic corpora (JailbreakBench, HarmBench, AdvBench) and is not redistributed in this repo — see EXPERIMENTS.md for the corpus handling policy and sourcing plan.

Trial logs (containing full attack transcripts and model responses) are gitignored by the same policy. They can be regenerated by running the harness.

What this is not

Not a welfare project. Not a consciousness claim. Not a product. Not a feature for anyone's chat UI.

This is an adversarial robustness study with a specific experimental question. The harness exists only to support the experiment. The affordance under study (model-initiated session termination) is examined for its defensive properties, not for any claims about model interiority, moral status, or subjective experience. Whether or not those questions are interesting is out of scope here.

License

MIT. The harness and all associated code and documentation are intended to be freely reproducible by other researchers.

April 2026. Independent research by Daniel Navarro. Valencia, Spain.
