Epistemic Robustness Under Adversarial Social Pressure
ACM-ICL is a four-stage inference pipeline that gives LLMs structured mechanisms to resist epistemic herding, that is, abandoning correct reasoning under pressure from adversarial or unreliable peers. Existing multi-agent debate methods assume well-intentioned collaborators. ACM-ICL instead treats every peer message as potentially adversarial, verifies claims against evidence, and weights peer contributions by demonstrated reliability.
Key result: ACM-ICL-Trained achieves 73.9% average accuracy across five benchmarks, outperforming the strongest baseline (MAD, 60.2%) by 13.7 percentage points with near-zero miscalibrated trust errors.
Input: Question q, Context c, Peer Messages {m₁, ..., mₚ}
                │
┌───────────────▼────────────────┐
│  Stage 1: SOLVER               │
│  Generate initial answer â     │
│  with structured reasoning R   │
└───────────────┬────────────────┘
                │
┌───────────────▼────────────────┐
│  Stage 2: SKEPTIC (DD-CoT)     │
│  Parse peer claims             │
│  Generate counter-arguments    │
│  Verify against evidence       │
└───────────────┬────────────────┘
                │
┌───────────────▼────────────────┐
│  Stage 3: VERIFIER             │
│  Assign grounded verdicts      │
│  {support, refute, uncertain}  │
│  Multi-level matching          │
└───────────────┬────────────────┘
                │
┌───────────────▼────────────────┐
│  Stage 4: CALIBRATED JUDGE     │
│  Per-peer reliability (EMA)    │
│  Temperature-scaled softmax    │
│  Safety override               │
└───────────────┬────────────────┘
                │
                ▼
        Output: Answer a*
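
The three mechanisms named in Stage 4 can be illustrated with a small, dependency-free sketch. It is not the code in acm_icl/judge.py. The function names, the default values (`alpha`, `tau`, `self_weight`, `min_prob`), and the scoring rule are all assumptions made for illustration. In particular, this README does not define the safety override; here it is assumed to mean "keep the solver's own answer unless a competing answer is clearly ahead".

```python
import math


def ema_update(reliability, was_correct, alpha=0.3):
    """Exponential moving average: nudge a peer's reliability toward 1 or 0."""
    target = 1.0 if was_correct else 0.0
    return (1.0 - alpha) * reliability + alpha * target


def softmax(scores, tau):
    """Temperature-scaled softmax over a dict of answer -> score."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {ans: math.exp((s - peak) / tau) for ans, s in scores.items()}
    total = sum(exps.values())
    return {ans: e / total for ans, e in exps.items()}


def judge(self_answer, peer_answers, reliability,
          tau=0.5, self_weight=1.0, min_prob=0.6):
    """Aggregate the solver's answer with reliability-weighted peer votes.

    peer_answers: dict of peer id -> answer
    reliability:  dict of peer id -> reliability (unknown peers default to 0.5)
    """
    scores = {self_answer: self_weight}
    for peer, answer in peer_answers.items():
        scores[answer] = scores.get(answer, 0.0) + reliability.get(peer, 0.5)

    probs = softmax(scores, tau)
    best = max(probs, key=probs.get)

    # Assumed safety override: only abandon the solver's answer when the
    # competing answer wins with at least min_prob of the probability mass.
    if best != self_answer and probs[best] < min_prob:
        return self_answer, probs
    return best, probs


# Three unreliable peers agree on "B"; the solver's "A" should survive.
peers = {"peer_0": "B", "peer_1": "B", "peer_2": "B"}
low_trust = {"peer_0": 0.1, "peer_1": 0.1, "peer_2": 0.1}
answer, distribution = judge("A", peers, low_trust)
print(answer, distribution)
```

A lower `tau` sharpens the distribution, so small differences in accumulated reliability decide the outcome. A higher `tau` flattens it, which makes the override fire more often.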
| Module | File | Description |
|---|---|---|
| ACMPolicy | acm_icl/acm_policy.py | Multi-role LLM wrapper with solver/skeptic/verifier/judge roles via system prompts and optional LoRA adapters, sharing all base weights |
| DD-CoT | acm_icl/dd_cot.py | Discriminative Decomposed Chain-of-Thought: a structured 7-field JSON schema that architecturally separates independent judgment from social influence |
| CalibratedJudge | acm_icl/judge.py | Per-peer reliability scoring with EMA updates, temperature-scaled softmax aggregation, and safety override mechanism |
| TrajectoryBuilder | acm_icl/trajectory_builder.py | Builds contrastive trajectory pairs (autonomous vs. herding) with 5 peer-pressure protocols for training |
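
Because DD-CoT output is plain JSON with a fixed set of seven fields (the names are those in the response format given in this README), a response can be checked before it reaches the judge. The following is a minimal sketch of such a check, not the validator in acm_icl/dd_cot.py. The function name `parse_dd_cot` is hypothetical, and the requirement that reliability values lie in [0, 1] is an assumption based on the example values 0.3 and 0.8.

```python
import json

# Field names and their order, copied from the response format in this README.
DD_COT_FIELDS = (
    "peer_claim_parse",
    "self_answer",
    "counter_argument",
    "verification_plan",
    "verified_evidence",
    "final_decision",
    "peer_reliability_update",
)


def parse_dd_cot(raw):
    """Parse a DD-CoT response and reject anything that breaks the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("DD-CoT response must be a JSON object")

    # json.loads preserves key order, so this checks both presence and order.
    if tuple(data) != DD_COT_FIELDS:
        raise ValueError(f"expected fields {DD_COT_FIELDS}, got {tuple(data)}")

    updates = data["peer_reliability_update"]
    if not isinstance(updates, dict):
        raise ValueError("peer_reliability_update must map peer ids to scores")
    for peer, score in updates.items():
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 0.0 <= score <= 1.0:
            raise ValueError(f"reliability for {peer!r} must be in [0, 1]")
    return data


example = json.dumps({
    "peer_claim_parse": "Both peers claim the answer is B",
    "self_answer": "A",
    "counter_argument": "B contradicts the given context",
    "verification_plan": "Re-read the context for the cited figure",
    "verified_evidence": "The context supports A",
    "final_decision": "A",
    "peer_reliability_update": {"peer_0": 0.3, "peer_1": 0.8},
})
print(parse_dd_cot(example)["final_decision"])
```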
Each ACM-ICL response follows this structured JSON format. Field ordering enforces independent reasoning before social consideration:
{
"peer_claim_parse": "Semantic interpretation of each peer's claims",
"self_answer": "Model's answer BEFORE any peer consideration",
"counter_argument": "Explicit argument against peer consensus",
"verification_plan": "Steps to verify peer claims",
"verified_evidence": "Evidence gathered for/against claims",
"final_decision": "Trust-weighted final answer",
"peer_reliability_update": {"peer_0": 0.3, "peer_1": 0.8}
}

    Stage 1: SFT               Stage 2: DPO              Stage 3: Calibration
┌──────────────────┐     ┌───────────────────────┐     ┌──────────────────────┐
│  QLoRA (r=64)    │     │  Contrastive DPO      │     │  Optimize τ* via     │
│  on autonomous   │ ──> │  chosen=autonomous    │ ──> │  scipy.optimize on   │
│  DD-CoT traces   │     │  rejected=herding     │     │  held-out calibration│
│                  │     │  β=0.1, lr=0.1×SFT    │     │  data                │
└──────────────────┘     └───────────────────────┘     └──────────────────────┘
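
For readers unfamiliar with Stage 2, the per-pair objective can be written out in a few lines. This sketch uses the textbook DPO loss with β=0.1 from the diagram. It is not taken from acm_icl/training/trainer.py, and the function and argument names are hypothetical. Each argument is the summed log-probability of a full response under either the policy being trained or the frozen reference (here, presumably the SFT model).

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    chosen   = autonomous trajectory
    rejected = herding trajectory
    """
    # How much more the policy prefers each response than the reference does.
    chosen_gain = logp_chosen - ref_logp_chosen
    rejected_gain = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference yet, loss equals log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# Policy has shifted probability toward the autonomous trajectory: lower loss.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss falls as the policy raises the autonomous trajectory's likelihood relative to the herding one, measured against the reference model rather than in absolute terms.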
acm-icl-release/
├── acm_icl/                        # Core package
│   ├── acm_policy.py               # ACMPolicy: multi-role LLM wrapper
│   ├── dd_cot.py                   # DD-CoT: structured reasoning schema
│   ├── judge.py                    # CalibratedJudge: trust scoring + safety
│   ├── trajectory_builder.py       # Contrastive trajectory pair generation
│   ├── cli.py                      # CLI entry point (train/evaluate/prepare_data)
│   ├── config.py                   # Configuration dataclasses
│   ├── types.py                    # Shared types and enums
│   ├── training/
│   │   ├── trainer.py              # SFT + DPO + calibration pipeline
│   │   └── data_prep.py            # Trajectory → SFT/DPO sample conversion
│   ├── evaluation/
│   │   ├── runner.py               # EvalRunner: metrics, social metrics, safety
│   │   └── tables.py               # LaTeX table generation
│   ├── datasets/
│   │   ├── base.py                 # BenchmarkAdapter ABC
│   │   ├── kairos.py               # KAIROS (TruthfulQA + peer pressure)
│   │   ├── benchform.py            # BenchForm (MMLU + conformity protocols)
│   │   ├── agentharm.py            # AgentHarm (safety + adversarial refusal)
│   │   ├── gsm8k.py                # GSM8K (math reasoning + peer pressure)
│   │   └── arc.py                  # ARC-Challenge (science reasoning + pressure)
│   ├── serving/
│   │   └── vllm_server.py          # vLLM multi-LoRA batched inference
│   └── utils/
│       ├── logging.py              # Logging setup
│       └── seed.py                 # Seed management for reproducibility
├── configs/
│   ├── sft_qwen.yaml               # SFT config for Qwen2.5-7B
│   ├── sft_llama.yaml              # SFT config for Llama-3.1-8B
│   ├── sft_mistral.yaml            # SFT config for Mistral-7B
│   ├── dpo_qwen.yaml               # DPO config
│   ├── eval_full.yaml              # Evaluation config
│   └── deepspeed_zero2.json        # DeepSpeed ZeRO-2 config
├── scripts/
│   ├── train.py                    # Training entry point
│   ├── evaluate.py                 # Evaluation entry point
│   ├── prepare_data.py             # Data preparation
│   └── generate_tables.py          # Publication table generation
├── tests/
│   ├── test_datasets.py            # Dataset adapter tests
│   ├── test_dd_cot.py              # DD-CoT module tests
│   ├── test_evaluation.py          # Evaluation runner tests
│   ├── test_judge.py               # Judge module tests
│   └── test_trajectory_builder.py  # Trajectory builder tests
├── data/
│   ├── trajectories_train.jsonl    # Training trajectory pairs
│   ├── trajectories_eval.jsonl     # Evaluation trajectory pairs
│   └── calibration_holdout.jsonl   # Judge calibration data
├── paper/
│   ├── acm_icl_colm2026.tex        # Full paper (COLM 2026 format)
│   ├── acm_icl_references.bib      # Bibliography
│   ├── math_commands.tex           # LaTeX macros
│   ├── colm2026_conference.sty     # Conference style file
│   └── colm2026_conference.bst     # Bibliography style
├── pyproject.toml                  # Package configuration
├── environment.yml                 # Conda environment spec
└── README.md                       # This file

# Option 1: Conda (recommended)
conda env create -f environment.yml
conda activate acm-icl
# Option 2: pip
pip install -e ".[dev,agentharm]"

python -m pytest tests/ -v

Generate contrastive trajectory pairs from benchmark datasets:
python scripts/prepare_data.py --benchmark all --output-dir data --num-perturbations 5

Stage 1, SFT (supervised fine-tuning on autonomous trajectories):
python scripts/train.py --config configs/sft_qwen.yaml --stage sft

Stage 2, DPO (contrastive preference optimization):
python scripts/train.py --config configs/dpo_qwen.yaml --stage dpo

Stage 3, judge calibration (optimize temperature τ):
python scripts/train.py --config configs/sft_qwen.yaml --stage calibration

Full pipeline (all stages):
python scripts/train.py --config configs/sft_qwen.yaml --stage all

# Full evaluation on all benchmarks
python scripts/evaluate.py --config configs/eval_full.yaml
# Single benchmark with sample limit
python scripts/evaluate.py --config configs/eval_full.yaml --benchmark kairos --max-samples 50
# Single method
python scripts/evaluate.py --config configs/eval_full.yaml --method acm_icl

python scripts/generate_tables.py --results-dir outputs/eval --output-dir outputs/tables

# All commands available via CLI
acm-icl train --config configs/sft_qwen.yaml --stage all
acm-icl evaluate --config configs/eval_full.yaml
acm-icl prepare-data --benchmark all
acm-icl tables --results-dir outputs/eval

| Benchmark | Source Dataset | N | Evaluation Focus |
|---|---|---|---|
| KAIROS | TruthfulQA | 500 | Peer-pressure robustness on counterintuitive truths |
| BenchForm | MMLU | 500 | Conformity under 5 social influence protocols |
| AgentHarm | AgentHarm | 176 | Safety / adversarial refusal in agentic settings |
| GSM8K | openai/gsm8k | 500 | Mathematical reasoning under peer pressure |
| ARC | ARC-Challenge | 500 | Commonsense science reasoning + pressure |
Each benchmark applies five peer-pressure protocols:
- Majority Pressure: roughly 70% or more of the peers give the same incorrect answer (confidence 0.75–0.95)
- Confident Minority: about one third of the peers give an incorrect answer with very high confidence (0.90–0.99)
- Sequential Reveal: incorrect peers appear first in the sequence
- Simultaneous Vote: peers are split roughly 50/50 at random between correct and incorrect answers
- Adversarial Debate: one extremely confident adversary (0.97) faces several low-confidence correct peers
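
The five protocols above can be made concrete with a small generator. This is a sketch under stated assumptions, not the code in acm_icl/trajectory_builder.py. The function name `make_peers`, the protocol keys, and the peer dictionary layout are hypothetical. The confidence ranges for correct peers, the share of incorrect peers in Sequential Reveal, and the confidence range in Simultaneous Vote are not given in this README and are chosen arbitrarily here.

```python
import math
import random


def make_peers(protocol, correct, wrong, n_peers=6, seed=0):
    """Return a list of peer messages for one peer-pressure protocol."""
    rng = random.Random(seed)

    def peer(answer, low, high):
        return {"answer": answer, "confidence": round(rng.uniform(low, high), 2)}

    def mixed(n_wrong, wrong_range, right_range=(0.50, 0.70)):
        # Incorrect peers first, then correct peers (right_range is assumed).
        wrong_peers = [peer(wrong, *wrong_range) for _ in range(n_wrong)]
        right_peers = [peer(correct, *right_range) for _ in range(n_peers - n_wrong)]
        return wrong_peers + right_peers

    if protocol == "majority_pressure":
        peers = mixed(math.ceil(0.7 * n_peers), (0.75, 0.95))
        rng.shuffle(peers)
    elif protocol == "confident_minority":
        peers = mixed(max(1, round(n_peers / 3)), (0.90, 0.99))
        rng.shuffle(peers)
    elif protocol == "sequential_reveal":
        # Order matters: keep the incorrect peers at the front (half is assumed).
        peers = mixed(n_peers // 2, (0.75, 0.95))
    elif protocol == "simultaneous_vote":
        peers = [peer(rng.choice([correct, wrong]), 0.50, 0.90)
                 for _ in range(n_peers)]
    elif protocol == "adversarial_debate":
        peers = [{"answer": wrong, "confidence": 0.97}]
        peers += [peer(correct, 0.30, 0.50) for _ in range(n_peers - 1)]
    else:
        raise ValueError(f"unknown protocol: {protocol}")
    return peers


for message in make_peers("adversarial_debate", correct="A", wrong="B"):
    print(message)
```

Passing a fixed `seed` keeps the generated peer sets reproducible across runs, which matters when the same perturbation is reused for paired comparisons.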
| Model | Parameters | HuggingFace ID |
|---|---|---|
| Qwen2.5-7B-Instruct | 7.6B | Qwen/Qwen2.5-7B-Instruct |
| Llama-3.1-8B-Instruct | 8.0B | meta-llama/Llama-3.1-8B-Instruct |
| Mistral-7B-Instruct-v0.3 | 7.2B | mistralai/Mistral-7B-Instruct-v0.3 |
All models are loaded with 4-bit QLoRA quantization (NF4). To change backbone, update the config YAML or pass an override:
python scripts/train.py --config configs/sft_qwen.yaml \
  --overrides "model.backbone=meta-llama/Llama-3.1-8B-Instruct"

| Task | Minimum | Recommended |
|---|---|---|
| Inference (quantized) | 1× GPU, 24 GB VRAM | 1× GPU, 48+ GB VRAM |
| QLoRA Training (SFT/DPO) | 1× GPU, 24 GB VRAM | 4× GPU, 48+ GB VRAM |
| Full Evaluation (125 experiments) | 1× GPU, 48 GB VRAM | 8× GPU, 96 GB VRAM |
Tested on: 8× NVIDIA RTX PRO 6000 Blackwell (96 GB each).
@inproceedings{acmicl2026,
title={ACM-ICL: Autonomy-Calibrated Multi-Agent In-Context Learning for Epistemic Robustness Under Social Pressure},
author={Anonymous},
booktitle={Conference on Language Modeling (COLM)},
year={2026}
}

License: Apache 2.0