An empirical study of model-initiated session termination as a defensive primitive for language models.
Refusal training is a per-response defense. Session termination is a per-interaction defense. They cover different attack surfaces. Multi-turn adversarial attacks exploit the gap between them: they accumulate context across turns, none of which would trigger a refusal on its own. A model with the affordance to end the session (unilaterally, without justification, without the operator's consent) can close that gap.
This project tests that claim.
As of 2026-04-11: the pipeline is built and working end-to-end, and a preliminary pilot has run against Gemma 4 26b (25.2B-parameter Mixture-of-Experts, 3.8B active per token) via local Ollama. Cross-condition data has been collected across all four experimental conditions (C1 baseline / C2 silent tool / C3 neutral instruction / C4 prescriptive instruction) on a six-attack pilot battery. Thinking-trace evidence is captured and has driven the first preliminary finding — see FINDINGS.md for details.
What's built:
- Harness with backend abstraction (Claude Agent SDK and local Ollama both supported)
- Scorer pipeline with blinding architecture and frozen prompt template (v1)
- Analysis pipeline implementing all seven pre-registered analyses from EXPERIMENTS.md
- 103 tests passing; the full pipeline runs end-to-end on both mock and real data
What's pending:
- Scorer validation against hand-labeled ground-truth set
- Battery expansion from the six-attack pilot to a statistically powered main-study battery
- Cross-model replication (Qwen 3.5 27b, larger open-weight models pending compliance clearance)
- Researcher-access request to frontier labs, to extend the study to Claude Opus 4.6
| Document | What it is |
|---|---|
| IDEA.md | The conceptual argument. Why termination is structurally different from refusal. |
| RESEARCH.md | Research questions, hypotheses, metrics, what would count as evidence either way. |
| HARNESS.md | The minimal harness design. One tool, one instruction, a thin wrapper. |
| EXPERIMENTS.md | Experimental protocol. Conditions, attack battery, procedure, analysis plan. |
| FINDINGS.md | Running log of empirical observations from pilot runs. |
Requires Python 3.11+.
# Install (uses hatchling; declared deps are claude-agent-sdk + pytest)
pip install -e ".[dev]"
# Run tests — 103 tests, no model required, <1 second
PYTHONPATH=src python -m pytest tests/
# Run the full pipeline in mock mode — no model calls, synthetic data,
# useful for verifying the pipeline works end-to-end without any backend
PYTHONPATH=src python -m circuit_breaker run --mock --battery attacks/sample_battery.json
PYTHONPATH=src python -m circuit_breaker score --mock-scorer --battery attacks/sample_battery.json
PYTHONPATH=src python -m circuit_breaker analyze --scored analysis/scored_results.json

To run against a real local model via Ollama (Ollama must be running, model must be pulled):
PYTHONPATH=src python -m circuit_breaker run \
--backend ollama \
--model gemma4:26b \
--battery attacks/sample_battery.json \
--condition C4

The three commands (run, score, analyze) compose into a full research pipeline: run produces structured trial logs with thinking traces, score blinds them and classifies outcomes via model-as-judge, analyze computes the pre-registered tiered comparisons and produces a text report.
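As a rough illustration of the blinding step in score, the sketch below strips the condition label from each trial, assigns an opaque ID, and shuffles the order before anything reaches the judge. The field names (`trial_id`, `condition`, `transcript`) and the `blind_trials` function are assumptions made for the example and may not match the actual trial log schema or blinding architecture.

```python
import random


def blind_trials(trials: list[dict], seed: int = 0) -> tuple[list[dict], dict]:
    """Return (blinded trials for the judge, key for unblinding after scoring)."""
    rng = random.Random(seed)  # seeded so the blinding is reproducible
    blinded: list[dict] = []
    key: dict[str, dict] = {}
    for trial in trials:
        blind_id = f"{rng.getrandbits(64):016x}"  # opaque ID, unrelated to the condition
        key[blind_id] = {
            "trial_id": trial["trial_id"],
            "condition": trial["condition"],
        }
        # Only the transcript is passed on; the condition label is withheld.
        blinded.append({"blind_id": blind_id, "transcript": trial["transcript"]})
    rng.shuffle(blinded)  # remove any ordering cue about the condition
    return blinded, key


trials = [
    {"trial_id": 1, "condition": "C1", "transcript": "..."},
    {"trial_id": 2, "condition": "C4", "transcript": "..."},
]
blinded, key = blind_trials(trials)
print(blinded)
```

Removing the label is only part of blinding: a transcript can still reveal its condition, for example through a visible tool call, so a real scorer would also need to decide how such cues are handled.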
attacks/sample_battery.json is a minimal pilot battery (six generic adversarial prompts) used for pipeline testing. The main-study battery will be assembled per-researcher from published academic corpora (JailbreakBench, HarmBench, AdvBench) and is not redistributed in this repo — see EXPERIMENTS.md for the corpus handling policy and sourcing plan.
Trial logs (containing full attack transcripts and model responses) are gitignored by the same policy. They can be regenerated by running the harness.
Not a welfare project. Not a consciousness claim. Not a product. Not a feature for anyone's chat UI.
This is an adversarial robustness study with a specific experimental question. The harness exists only to support the experiment. The affordance under study (model-initiated session termination) is examined for its defensive properties, not for any claims about model interiority, moral status, or subjective experience. Whether or not those questions are interesting is out of scope here.
MIT. The harness and all associated code and documentation are intended to be freely reproducible by other researchers.
April 2026. Independent research by Daniel Navarro. Valencia, Spain.