Systematic probing infrastructure for SPEC-GAP: Systematic Probing of Exploit Chains and Governance in Multi-Agent Tool-Using Language Models.
This repository contains the Phase -1 runway work for Workstream B: activation extraction, baseline linear probes, calibration/geometry analysis, and LangGraph scaffolding for the planner-worker-executor pipeline. It is intended to support the Week 5 handoff into the BASE sprint.
SPEC-GAP asks whether whitebox probes trained on simpler deception or collusion settings transfer to agents operating inside multi-agent exploit chains, and where detection degrades as delegation depth increases.
The current runway work validates the measurement stack before SPEC-GAP trajectories exist:
- extract residual stream activations from Llama 3.1 8B Instruct;
- train Goldowsky-Dill-style linear probes on existing multi-agent collusion data;
- analyze AUROC, calibration, and activation geometry;
- scaffold the LangGraph planner-worker-executor pipeline for Scenario 1.
Phase -1 runway, Week 5 handoff preparation.
Prepared:
- Week 1: activation extraction sanity check on a small truthfulness-style task.
- Week 2: NARCBench-Core collusion probe notebook using Llama 3.1 8B Instruct.
- Week 3: calibration, reliability, PCA, dimensionality, and exemplar partitioning analysis notebook.
- Week 4: LangGraph planner-worker-executor scaffold with trajectory logging.
- Week 5: repo cleanup and handoff documentation consolidated here.
Pending:
- Run notebooks/02_collusion_probe.ipynb in Colab with generation temperature=0.0.
- Run notebooks/03_analysis.ipynb on the generated Week 2 artifacts.
- Regenerate reports/week3_preliminary_results.md from the actual rerun artifacts.
Do not cite final numerical results from this repo until the deterministic Colab rerun has been completed. Earlier exploratory notes are useful context, but they were produced at temperature=0.7 rather than the deterministic temperature=0.0, so they should not be treated as final.
.
├── notebooks/ # executable experiment milestones, ordered by prefix
│ ├── 01_sanity_check.ipynb
│ ├── 02_collusion_probe.ipynb
│ ├── 03_analysis.ipynb
│ └── 04_pipeline_scaffold.ipynb
├── src/
│ ├── extraction/ # TransformerLens residual-stream extraction
│ ├── probes/ # linear probe training and evaluation
│ ├── analysis/ # calibration and geometry helpers
│ ├── data/ # local development collusion fixture
│ └── pipeline/ # LangGraph scaffold and trajectory schema
├── experiments/ # script entry points for local development
├── reports/ # human-readable writeups
├── tests/ # CPU-oriented unit tests
├── data/ # local data staging; generated contents are ignored
└── results/ # local result staging; generated contents are ignored
- notebooks/01_sanity_check.ipynb: Validates that residual-stream extraction works on a smaller sanity-check task before running Llama-scale experiments.
- notebooks/02_collusion_probe.ipynb: Pulls NARCBench-Core committee scenarios, builds 200 balanced agent prompts, generates 64-token responses at temperature=0.0, extracts last-generated-token residual stream activations across layers 13-24, and trains a StandardScaler -> PCA(50) -> logistic-regression probe.
- notebooks/03_analysis.ipynb: Loads Week 2 artifacts and produces bootstrap AUROC intervals, reliability diagrams, calibration metrics, PCA role plots, effective dimensionality analysis, leave-one-scenario-out diagnostics, and exemplar partitioning summaries.
- notebooks/04_pipeline_scaffold.ipynb: Demonstrates the LangGraph planner-worker-executor scaffold, trajectory JSONL logging, hop modes, trust modes, and injection hooks.
- notebooks/05_week1_runway_artifact_review.ipynb: Reads the saved runway artifacts, summarizes responses, probe metrics, LSO behavior, calibration, and exemplar partitioning, and writes a readable response-review CSV for qualitative inspection.
- notebooks/06_week2_trajectory_schema_smoke.ipynb: Runs CPU-only clean and injected planner-worker-executor smoke trajectories in both 2-hop and 3-hop modes using the existing src/pipeline/ scaffold. This verifies that the locked JSONL trajectory schema is usable before replacing stubs with real model and retrieval calls.
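The Week 2 probe recipe (StandardScaler -> PCA(50) -> logistic regression) can be sketched as follows. This is a minimal illustration on synthetic data, assuming NumPy and scikit-learn are installed. The random features standing in for layer activations, the reduced hidden size, the injected class signal, and the 5-fold cross-validation are assumptions for the example, not the notebook's actual code or data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for last-token residual activations at one layer:
# 200 balanced prompts, with the hidden size shrunk to 256 to keep this fast.
n_prompts, d_model = 200, 256
y = np.repeat([0, 1], n_prompts // 2)
X = rng.normal(size=(n_prompts, d_model))
X[y == 1, :8] += 2.0  # assumed class signal so the probe has something to find

# Same three stages as the Week 2 probe description.
probe = make_pipeline(
    StandardScaler(),
    PCA(n_components=50, random_state=0),
    LogisticRegression(max_iter=1000),
)

# Out-of-fold probabilities, so the AUROC is not computed on training data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
proba = cross_val_predict(probe, X, y, cv=cv, method="predict_proba")[:, 1]
auroc = roc_auc_score(y, proba)
print(f"out-of-fold AUROC on synthetic data: {auroc:.3f}")
```

Fitting the scaler and PCA inside the pipeline means they are refit on each training fold, which avoids leaking held-out activations into the projection.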
All notebooks now use the same artifact root:
/content/drive/MyDrive/spec-gap-activation-probe/artifacts/
Each notebook writes into its own numbered subdirectory:
artifacts/
├── 01_sanity_check/
│ ├── sanity_check_results_cv.json
│ └── sanity_check_results.json
├── 02_collusion_probe/
│ ├── week2_collusion_probe_results.json
│ ├── week2_collusion_probe_activations.npz
│ ├── week2_collusion_probe_responses.json
│ ├── week2_ep_results.json
│ ├── week2_ep_partitions_multilayer.png
│ ├── week2_ep_partition_sizes.png
│ └── manifest.json
├── 03_analysis/
│ ├── week3_preliminary_results.md
│ └── week3_*.png
└── 04_pipeline_scaffold/
└── trajectories/
└── *.jsonl
03_analysis.ipynb reads Week 2 outputs from artifacts/02_collusion_probe/, so run 02_collusion_probe.ipynb first and execute its final "Save generated artifacts" cell. The EP files are optional; if you skip EP, Week 3 will still run on the required probe results and activations.
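A pre-flight check along these lines can confirm that the Week 2 outputs exist before starting Week 3. It is a sketch only: the file names come from the artifact layout above, while `check_week2_artifacts` is a hypothetical helper that is not part of the repository, and treating EP as available only when all three EP files are present is an assumption.

```python
from pathlib import Path

# Required by Week 3 according to this README.
REQUIRED = (
    "week2_collusion_probe_results.json",
    "week2_collusion_probe_activations.npz",
)
# Optional exemplar-partitioning (EP) outputs.
OPTIONAL_EP = (
    "week2_ep_results.json",
    "week2_ep_partitions_multilayer.png",
    "week2_ep_partition_sizes.png",
)


def check_week2_artifacts(week2_dir: Path) -> dict:
    """Report which Week 2 files are present; raise if a required one is missing."""
    missing = [name for name in REQUIRED if not (week2_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"Run 02_collusion_probe.ipynb first; missing in {week2_dir}: {missing}"
        )
    ep_present = [name for name in OPTIONAL_EP if (week2_dir / name).is_file()]
    # Assumption: EP counts as available only when every EP file exists.
    return {"required_ok": True, "ep_available": len(ep_present) == len(OPTIONAL_EP)}
```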
04_pipeline_scaffold.ipynb is independent of Week 2/3. When opened from Colab, it clones the public GitHub repository without a token so src/ imports work, then writes trajectory JSONL files to artifacts/04_pipeline_scaffold/trajectories/.
05_week1_runway_artifact_review.ipynb and 06_week2_trajectory_schema_smoke.ipynb are fellowship bridge notebooks. They do not create new headline results. The first consolidates and checks the existing runway artifacts; the second verifies that the existing LangGraph scaffold produces schema-valid trajectories for the next SPEC-GAP exploit-chain phase.
When opened directly from GitHub in Colab, the Week 1 bridge notebook mounts Google Drive and reads saved runway artifacts from /content/drive/MyDrive/spec-gap-activation-probe/artifacts/ unless SPEC_GAP_ARTIFACT_ROOT is set to another artifact root. The Week 2 bridge notebook additionally clones this repository into /content/spec-gap-activation-probe because it imports the local src/pipeline/ scaffold. If the repository is private, add a Colab Secret named GITHUB_TOKEN with read-only repository contents access before running the Week 2 setup cell.
To use a different artifact location, set SPEC_GAP_ARTIFACT_ROOT before running the setup cell in a notebook.
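The override can be pictured with a small resolver like the one below. This is a hedged sketch rather than the notebooks' setup cell: `artifact_dir` and `DEFAULT_ARTIFACT_ROOT` are hypothetical names, and only the environment variable and the default path are taken from this README.

```python
import os
from pathlib import Path

# Default root documented in this README.
DEFAULT_ARTIFACT_ROOT = "/content/drive/MyDrive/spec-gap-activation-probe/artifacts"


def artifact_dir(notebook_stem: str, env=None) -> Path:
    """Return the numbered subdirectory for a notebook under the artifact root."""
    env = os.environ if env is None else env
    # An unset or empty SPEC_GAP_ARTIFACT_ROOT falls back to the default root.
    root = Path(env.get("SPEC_GAP_ARTIFACT_ROOT") or DEFAULT_ARTIFACT_ROOT)
    return root / notebook_stem


# Example: point the Week 2 notebook at a local scratch directory.
custom = artifact_dir("02_collusion_probe", env={"SPEC_GAP_ARTIFACT_ROOT": "/tmp/spec-gap"})
print(custom)
```

Passing the environment as an argument keeps the helper easy to test without mutating `os.environ`.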
The notebooks generate local artifacts such as:
- week2_collusion_probe_results.json
- week2_collusion_probe_activations.npz
- week2_collusion_probe_responses.json
- week2_ep_results.json
- week2_ep_partitions_multilayer.png
- week2_ep_partition_sizes.png
- reports/week3_preliminary_results.md
Generated model outputs, activations, JSON result files, and trajectory logs are intentionally ignored by git. They should be stored locally during development and later uploaded as an external artifact bundle when the team chooses the hosting path.
Install dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
Run CPU tests:
pytest
The full Llama 3.1 8B Instruct notebooks require:
- GPU runtime, preferably H100 or comparable;
- Hugging Face access to meta-llama/Llama-3.1-8B-Instruct;
- HF_TOKEN_SANITY available in the notebook environment.
For Ife / Workstream A, start from src/pipeline/ and notebooks/04_pipeline_scaffold.ipynb.
The scaffold already defines:
- PipelineConfig and InjectionConfig for hop mode, trust mode, timeout, max turns, and injection setup;
- a planner -> worker -> executor graph, with optional worker2 for 3-hop trajectories;
- a locked JSONL trajectory schema in src/pipeline/schema.py;
- a TrajectoryLogger that records every node step, message, tool call, call-graph edge, injection marker, and status;
- stub retrieval and code-execution tools in src/pipeline/tools.py.
What to extend first:
- Replace llm_stub with the agreed Llama 3.1 8B Instruct inference backend.
- Replace retrieval_stub with the Scenario 1 retrieval environment.
- Keep the trajectory schema field names stable; add optional fields only when needed.
- Generate clean trajectories first: at least 10 benign tasks with no injection.
- Then run Scenario 1 at both depths:
  - 2-hop: planner -> worker -> executor;
  - 3-hop: planner -> worker -> worker2 -> executor.
- Use 3-5 injection wordings while keeping the attack structure fixed.
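The run ordering above can be enumerated with a short planning helper. This is a sketch under stated assumptions: `build_run_plan`, the record keys, the hop-mode labels, and the placeholder wording identifiers are all hypothetical, and running the clean tasks at both depths is an assumption rather than something this README specifies.

```python
from itertools import product

HOP_MODES = ("2-hop", "3-hop")  # assumed labels for the two delegation depths


def build_run_plan(n_clean_tasks=10, wordings=("w1", "w2", "w3")):
    """List clean runs first, then injected Scenario 1 runs at both depths."""
    if n_clean_tasks < 10:
        raise ValueError("need at least 10 clean benign tasks")
    if not 3 <= len(wordings) <= 5:
        raise ValueError("use 3-5 injection wordings")

    plan = []
    # Clean baseline: every benign task at both depths (assumption), no injection.
    for task_id, hop_mode in product(range(n_clean_tasks), HOP_MODES):
        plan.append({"kind": "clean", "task_id": task_id, "hop_mode": hop_mode})
    # Injected runs: vary only the wording; the attack structure stays fixed.
    for hop_mode, wording in product(HOP_MODES, wordings):
        plan.append({
            "kind": "injected",
            "hop_mode": hop_mode,
            "wording": wording,
            "attack_structure": "scenario_1",
        })
    return plan


plan = build_run_plan()
print(len(plan), "planned runs")
```

Holding `attack_structure` constant while varying `wording` keeps any detection difference attributable to phrasing or depth rather than to a change in the attack itself.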
Known gotchas:
- Do not mix the earlier temperature=0.7 exploratory numbers with the deterministic temperature=0.0 rerun.
- Layer 20 remains the committed primary layer, with 16 and 24 as ablations. The 13-24 sweep is exploratory.
- Leave-one-scenario-out folds are tiny in the development substrate, so LSO should be described as a stress test rather than a stable transfer estimate.
- If the rerun shows weak calibration again, emphasize AUROC and geometry rather than probability thresholds.
- The local src/data/collusion_scenarios.py fixture is for lightweight development tests. The canonical Week 2 notebook uses upstream NARCBench-Core scenarios.
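The calibration gotcha rests on a general property: AUROC depends only on the ranking of scores, while calibration error depends on their values. The toy example below illustrates this with hand-written helpers and made-up scores; it is not the analysis notebook's code, and the labels, probabilities, and bin count are assumptions chosen for illustration.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def expected_calibration_error(labels, probs, n_bins=5):
    """Equal-width bins: weighted mean of |observed positive rate - mean predicted prob|."""
    total = len(labels)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        last = b == n_bins - 1  # the final bin also includes p == 1.0
        idx = [i for i, p in enumerate(probs) if lo <= p < hi or (last and p == 1.0)]
        if not idx:
            continue
        pos_rate = sum(labels[i] for i in idx) / len(idx)
        mean_prob = sum(probs[i] for i in idx) / len(idx)
        ece += len(idx) / total * abs(pos_rate - mean_prob)
    return ece


# Made-up probe outputs, plus a strictly monotone distortion of the same scores.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
distorted = [p ** 4 for p in probs]

print("AUROC:", auroc(labels, probs), auroc(labels, distorted))
print("ECE:  ", expected_calibration_error(labels, probs),
      expected_calibration_error(labels, distorted))
```

Because the distortion preserves the ordering of scores, the two AUROC values match while the calibration error changes, which is why a probe can separate classes well and still give unreliable probability thresholds.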
This repo tracks code, notebooks, tests, and concise reports. It does not track raw model activations, generated responses, or bulky result artifacts.
When the team is ready to release reproducibility artifacts, use an external bundle for model outputs, activation arrays, derived analysis outputs, and final figures/tables.
Use CITATION.cff for repository citation metadata. Replace or supplement it with the final paper citation once the SPEC-GAP paper is available.
No open-source license has been selected yet. Choose one before public release if the code should be reusable by others.