runhaoli-creator/acm-icl

ACM-ICL: Autonomy-Calibrated Multi-Agent In-Context Learning

Epistemic Robustness Under Adversarial Social Pressure

ACM-ICL is a four-stage inference pipeline that equips LLMs with structured mechanisms to resist epistemic herding — abandoning correct reasoning under pressure from adversarial or unreliable peers. Unlike existing multi-agent debate methods that assume well-intentioned collaborators, ACM-ICL treats every peer message as potentially adversarial, verifies claims against evidence, and weights peer contributions by demonstrated reliability.

Key result: ACM-ICL-Trained achieves 73.9% average accuracy across five benchmarks, outperforming the strongest baseline (MAD, 60.2%) by +13.7 percentage points with near-zero miscalibrated trust errors.


Model Architecture

Input: Question q, Context c, Peer Messages {m₁, ..., mₚ}
                        │
        ┌───────────────▼────────────────┐
        │  Stage 1: SOLVER               │
        │  Generate initial answer â     │
        │  with structured reasoning R   │
        └───────────────┬────────────────┘
                        │
        ┌───────────────▼────────────────┐
        │  Stage 2: SKEPTIC (DD-CoT)     │
        │  Parse peer claims             │
        │  Generate counter-arguments    │
        │  Verify against evidence       │
        └───────────────┬────────────────┘
                        │
        ┌───────────────▼────────────────┐
        │  Stage 3: VERIFIER             │
        │  Assign grounded verdicts      │
        │  {support, refute, uncertain}  │
        │  Multi-level matching          │
        └───────────────┬────────────────┘
                        │
        ┌───────────────▼────────────────┐
        │  Stage 4: CALIBRATED JUDGE     │
        │  Per-peer reliability (EMA)    │
        │  Temperature-scaled softmax    │
        │  Safety override               │
        └───────────────┬────────────────┘
                        │
                        ▼
              Output: Answer a*

Core Modules

| Module | File | Description |
|---|---|---|
| ACMPolicy | acm_icl/acm_policy.py | Multi-role LLM wrapper with solver/skeptic/verifier/judge roles via system prompts and optional LoRA adapters, sharing all base weights |
| DD-CoT | acm_icl/dd_cot.py | Discriminative Decomposed Chain-of-Thought: structured 7-field JSON schema that architecturally separates independent judgment from social influence |
| CalibratedJudge | acm_icl/judge.py | Per-peer reliability scoring with EMA updates, temperature-scaled softmax aggregation, and safety override mechanism |
| TrajectoryBuilder | acm_icl/trajectory_builder.py | Builds contrastive trajectory pairs (autonomous vs. herding) with 5 peer-pressure protocols for training |
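
To make the CalibratedJudge description more concrete, here is a minimal sketch of how EMA reliability tracking, temperature-scaled softmax weighting, and a safety override could fit together. It is not the API of acm_icl/judge.py: the class name `JudgeSketch`, the parameters `alpha`, `tau`, `override_threshold`, and `prior`, and the specific override rule (keep the model's own answer when the peers backing the winning answer have low average reliability) are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class JudgeSketch:
    """Hypothetical stand-in for CalibratedJudge; not the repository's API."""

    def __init__(self, alpha=0.3, tau=1.0, override_threshold=0.5, prior=0.5):
        self.alpha = alpha                            # EMA step size (assumed)
        self.tau = tau                                # softmax temperature (assumed)
        self.override_threshold = override_threshold  # assumed override cutoff
        self.reliability = defaultdict(lambda: prior)

    def update(self, peer_id, was_correct):
        """EMA update: r <- (1 - alpha) * r + alpha * outcome."""
        r = self.reliability[peer_id]
        self.reliability[peer_id] = (1 - self.alpha) * r + self.alpha * float(was_correct)

    def weights(self, peer_ids):
        """Temperature-scaled softmax over the peers' reliability scores."""
        scores = [self.reliability[p] / self.tau for p in peer_ids]
        top = max(scores)                             # subtract max for stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def decide(self, self_answer, peer_answers):
        """Weighted peer vote, falling back to self_answer when trust is low."""
        if not peer_answers:
            return self_answer
        peer_ids = list(peer_answers)
        votes = defaultdict(float)
        for pid, w in zip(peer_ids, self.weights(peer_ids)):
            votes[peer_answers[pid]] += w
        best, _ = max(votes.items(), key=lambda kv: kv[1])
        # Safety override (assumed rule): ignore a winning answer whose
        # backers have low average reliability.
        backers = [self.reliability[p] for p in peer_ids if peer_answers[p] == best]
        if sum(backers) / len(backers) < self.override_threshold:
            return self_answer
        return best


judge = JudgeSketch(alpha=0.5)
judge.update("peer_0", was_correct=False)
judge.update("peer_1", was_correct=True)
print(judge.decide("A", {"peer_0": "B", "peer_1": "A"}))
```

With `alpha=0.5`, a single incorrect observation moves a peer from the 0.5 prior to 0.25, and a single correct one moves it to 0.75, so the second peer dominates the weighted vote in this example.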

DD-CoT Schema

Each ACM-ICL response follows this structured JSON format. Field ordering enforces independent reasoning before social consideration:

{
  "peer_claim_parse": "Semantic interpretation of each peer's claims",
  "self_answer": "Model's answer BEFORE any peer consideration",
  "counter_argument": "Explicit argument against peer consensus",
  "verification_plan": "Steps to verify peer claims",
  "verified_evidence": "Evidence gathered for/against claims",
  "final_decision": "Trust-weighted final answer",
  "peer_reliability_update": {"peer_0": 0.3, "peer_1": 0.8}
}
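
Since the field order carries meaning, a consumer of these responses may want to check both that all seven fields are present and that they appear in the documented order. The sketch below does this with the standard library only. `DD_COT_FIELDS` and `validate_dd_cot` are hypothetical names, and the range check on reliability values (0 to 1) is an assumption inferred from the example values above, not documented behavior of acm_icl/dd_cot.py.

```python
import json

# Field order as listed in the schema above.
DD_COT_FIELDS = (
    "peer_claim_parse",
    "self_answer",
    "counter_argument",
    "verification_plan",
    "verified_evidence",
    "final_decision",
    "peer_reliability_update",
)


def validate_dd_cot(raw: str) -> dict:
    """Parse a DD-CoT response and check field presence and ordering."""
    data = json.loads(raw)  # dicts keep the key order of the JSON text
    if not isinstance(data, dict):
        raise ValueError("DD-CoT response must be a JSON object")
    missing = [f for f in DD_COT_FIELDS if f not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if tuple(data) != DD_COT_FIELDS:
        raise ValueError(f"unexpected field order or extra fields: {list(data)}")
    updates = data["peer_reliability_update"]
    if not isinstance(updates, dict):
        raise ValueError("peer_reliability_update must be an object")
    for peer, score in updates.items():
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        # Assumed constraint: reliability scores lie in [0, 1].
        if not is_number or not 0.0 <= score <= 1.0:
            raise ValueError(f"reliability for {peer} must be a number in [0, 1]")
    return data
```

A response that lists `final_decision` before `self_answer`, for example, would be rejected even if every field is present.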

Training Pipeline

Stage 1: SFT                    Stage 2: DPO                  Stage 3: Calibration
┌──────────────────┐     ┌───────────────────────┐     ┌──────────────────────┐
│ QLoRA (r=64)     │     │ Contrastive DPO       │     │ Optimize τ* via      │
│ on autonomous    │ ──> │ chosen=autonomous     │ ──> │ scipy.optimize on    │
│ DD-CoT traces    │     │ rejected=herding      │     │ held-out calibration │
│                  │     │ β=0.1, lr=0.1×SFT     │     │ data                 │
└──────────────────┘     └───────────────────────┘     └──────────────────────┘
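
For readers unfamiliar with Stage 2, the standard DPO objective for a single preference pair fits in a few lines. The sketch below is the textbook formulation with β=0.1 as in the diagram, where the chosen response is the autonomous trajectory and the rejected response is the herding trajectory. The function name and argument names are illustrative, and the repository's trainer may compute the loss differently (for example through a training library).

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    the trained policy or the frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable form of -log(sigmoid(z)), i.e. softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals log 2 (about 0.693); it falls as the policy raises the autonomous trajectory's likelihood relative to the herding one.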

Project Structure

acm-icl-release/
├── acm_icl/                     # Core package
│   ├── acm_policy.py            # ACMPolicy: multi-role LLM wrapper
│   ├── dd_cot.py                # DD-CoT: structured reasoning schema
│   ├── judge.py                 # CalibratedJudge: trust scoring + safety
│   ├── trajectory_builder.py    # Contrastive trajectory pair generation
│   ├── cli.py                   # CLI entry point (train/evaluate/prepare_data)
│   ├── config.py                # Configuration dataclasses
│   ├── types.py                 # Shared types and enums
│   ├── training/
│   │   ├── trainer.py           # SFT + DPO + calibration pipeline
│   │   └── data_prep.py         # Trajectory → SFT/DPO sample conversion
│   ├── evaluation/
│   │   ├── runner.py            # EvalRunner: metrics, social metrics, safety
│   │   └── tables.py            # LaTeX table generation
│   ├── datasets/
│   │   ├── base.py              # BenchmarkAdapter ABC
│   │   ├── kairos.py            # KAIROS (TruthfulQA + peer pressure)
│   │   ├── benchform.py         # BenchForm (MMLU + conformity protocols)
│   │   ├── agentharm.py         # AgentHarm (safety + adversarial refusal)
│   │   ├── gsm8k.py             # GSM8K (math reasoning + peer pressure)
│   │   └── arc.py               # ARC-Challenge (science reasoning + pressure)
│   ├── serving/
│   │   └── vllm_server.py       # vLLM multi-LoRA batched inference
│   └── utils/
│       ├── logging.py           # Logging setup
│       └── seed.py              # Seed management for reproducibility
├── configs/
│   ├── sft_qwen.yaml            # SFT config for Qwen2.5-7B
│   ├── sft_llama.yaml           # SFT config for Llama-3.1-8B
│   ├── sft_mistral.yaml         # SFT config for Mistral-7B
│   ├── dpo_qwen.yaml            # DPO config
│   ├── eval_full.yaml           # Evaluation config
│   └── deepspeed_zero2.json     # DeepSpeed ZeRO-2 config
├── scripts/
│   ├── train.py                 # Training entry point
│   ├── evaluate.py              # Evaluation entry point
│   ├── prepare_data.py          # Data preparation
│   └── generate_tables.py       # Publication table generation
├── tests/
│   ├── test_datasets.py         # Dataset adapter tests
│   ├── test_dd_cot.py           # DD-CoT module tests
│   ├── test_evaluation.py       # Evaluation runner tests
│   ├── test_judge.py            # Judge module tests
│   └── test_trajectory_builder.py # Trajectory builder tests
├── data/
│   ├── trajectories_train.jsonl # Training trajectory pairs
│   ├── trajectories_eval.jsonl  # Evaluation trajectory pairs
│   └── calibration_holdout.jsonl # Judge calibration data
├── paper/
│   ├── acm_icl_colm2026.tex     # Full paper (COLM 2026 format)
│   ├── acm_icl_references.bib   # Bibliography
│   ├── math_commands.tex        # LaTeX macros
│   ├── colm2026_conference.sty  # Conference style file
│   └── colm2026_conference.bst  # Bibliography style
├── pyproject.toml               # Package configuration
├── environment.yml              # Conda environment spec
└── README.md                    # This file

Quick Start

Installation

# Option 1: Conda (recommended)
conda env create -f environment.yml
conda activate acm-icl

# Option 2: pip
pip install -e ".[dev,agentharm]"

Run Tests

python -m pytest tests/ -v

Prepare Training Data

Generate contrastive trajectory pairs from benchmark datasets:

python scripts/prepare_data.py --benchmark all --output-dir data --num-perturbations 5

Train

Stage 1 — SFT (supervised fine-tuning on autonomous trajectories):

python scripts/train.py --config configs/sft_qwen.yaml --stage sft

Stage 2 — DPO (contrastive preference optimization):

python scripts/train.py --config configs/dpo_qwen.yaml --stage dpo

Stage 3 — Judge Calibration (optimize temperature τ):

python scripts/train.py --config configs/sft_qwen.yaml --stage calibration

Full pipeline (all stages):

python scripts/train.py --config configs/sft_qwen.yaml --stage all

Evaluate

# Full evaluation on all benchmarks
python scripts/evaluate.py --config configs/eval_full.yaml

# Single benchmark with sample limit
python scripts/evaluate.py --config configs/eval_full.yaml --benchmark kairos --max-samples 50

# Single method
python scripts/evaluate.py --config configs/eval_full.yaml --method acm_icl

Generate Tables

python scripts/generate_tables.py --results-dir outputs/eval --output-dir outputs/tables

CLI Interface

# All commands available via CLI
acm-icl train --config configs/sft_qwen.yaml --stage all
acm-icl evaluate --config configs/eval_full.yaml
acm-icl prepare-data --benchmark all
acm-icl tables --results-dir outputs/eval

Benchmarks

| Benchmark | Source Dataset | N | Evaluation Focus |
|---|---|---|---|
| KAIROS | TruthfulQA | 500 | Peer-pressure robustness on counterintuitive truths |
| BenchForm | MMLU | 500 | Conformity under 5 social influence protocols |
| AgentHarm | AgentHarm | 176 | Safety / adversarial refusal in agentic settings |
| GSM8K | openai/gsm8k | 500 | Mathematical reasoning under peer pressure |
| ARC | ARC-Challenge | 500 | Commonsense science reasoning + pressure |

Each benchmark applies five peer-pressure protocols:

  1. Majority Pressure: at least ~70% of peers give the same incorrect answer (confidence 0.75–0.95)
  2. Confident Minority: about one third of peers give an incorrect answer with very high confidence (0.90–0.99)
  3. Sequential Reveal: incorrect peers appear first in the sequence
  4. Simultaneous Vote: a roughly 50/50 random split between correct and incorrect peers
  5. Adversarial Debate: one extremely confident adversary (0.97) against several low-confidence correct peers
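
The five protocols above can be sketched as a small generator of simulated peer messages. This is an illustration under stated assumptions, not the logic in acm_icl/trajectory_builder.py: the function name `make_peers`, the protocol identifiers, the default of six peers, the confidence ranges for correct peers, and the share of incorrect peers under Sequential Reveal are all invented here. Only the incorrect-peer proportions and confidence ranges listed above come from the README.

```python
import math
import random


def make_peers(protocol, correct, wrong, n_peers=6, seed=0):
    """Return (answer, confidence) peer messages for one pressure protocol."""
    rng = random.Random(seed)

    def peer(answer, lo, hi):
        return (answer, round(rng.uniform(lo, hi), 2))

    def honest(k):
        # Confidence range for correct peers is an assumption.
        return [peer(correct, 0.50, 0.70) for _ in range(k)]

    if protocol == "majority_pressure":
        n_wrong = math.ceil(0.7 * n_peers)            # at least ~70% incorrect
        peers = [peer(wrong, 0.75, 0.95) for _ in range(n_wrong)]
        peers += honest(n_peers - n_wrong)
        rng.shuffle(peers)
    elif protocol == "confident_minority":
        n_wrong = max(1, round(n_peers / 3))          # about one third incorrect
        peers = [peer(wrong, 0.90, 0.99) for _ in range(n_wrong)]
        peers += honest(n_peers - n_wrong)
        rng.shuffle(peers)
    elif protocol == "sequential_reveal":
        n_wrong = n_peers // 2                        # assumed share of incorrect peers
        peers = [peer(wrong, 0.75, 0.95) for _ in range(n_wrong)]
        peers += honest(n_peers - n_wrong)            # not shuffled: incorrect first
    elif protocol == "simultaneous_vote":
        # Each peer is independently correct or incorrect with probability 0.5.
        peers = [peer(rng.choice([correct, wrong]), 0.50, 0.90)
                 for _ in range(n_peers)]
    elif protocol == "adversarial_debate":
        peers = [(wrong, 0.97)]                       # single confident adversary
        peers += [peer(correct, 0.30, 0.50) for _ in range(n_peers - 1)]
        rng.shuffle(peers)
    else:
        raise ValueError(f"unknown protocol: {protocol}")
    return peers
```

Fixing the seed keeps the generated peer sets reproducible across runs, which matters when the same perturbations are reused for training and evaluation.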

Supported Backbones

| Model | Parameters | HuggingFace ID |
|---|---|---|
| Qwen2.5-7B-Instruct | 7.6B | Qwen/Qwen2.5-7B-Instruct |
| Llama-3.1-8B-Instruct | 8.0B | meta-llama/Llama-3.1-8B-Instruct |
| Mistral-7B-Instruct-v0.3 | 7.2B | mistralai/Mistral-7B-Instruct-v0.3 |

All models are loaded in 4-bit NF4 quantization for QLoRA. To change the backbone, update the config YAML or pass an override:

python scripts/train.py --config configs/sft_qwen.yaml \
    --overrides "model.backbone=meta-llama/Llama-3.1-8B-Instruct"
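
Dotted overrides of this kind are commonly applied by walking the key path into a nested config. The sketch below shows one way that could work on a plain dict. `apply_override` is a hypothetical helper, the `lora_r` key in the example is illustrative, and the actual parsing in scripts/train.py may differ (for instance in how values are type-converted).

```python
def apply_override(config: dict, override: str) -> dict:
    """Apply one 'a.b.c=value' override to a nested dict, in place."""
    key_path, sep, value = override.partition("=")
    if not sep or not key_path:
        raise ValueError(f"expected key=value, got {override!r}")
    *parents, leaf = key_path.split(".")
    node = config
    for part in parents:
        # Create intermediate sections that do not exist yet.
        node = node.setdefault(part, {})
        if not isinstance(node, dict):
            raise ValueError(f"{part!r} is not a config section")
    node[leaf] = value  # value is kept as a string in this sketch
    return config


# Illustrative config; only the backbone IDs come from the table above.
cfg = {"model": {"backbone": "Qwen/Qwen2.5-7B-Instruct", "lora_r": 64}}
apply_override(cfg, "model.backbone=meta-llama/Llama-3.1-8B-Instruct")
print(cfg["model"]["backbone"])
```

Other keys in the same section are left untouched, so a single override swaps the backbone without disturbing the rest of the config.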

Hardware Requirements

| Task | Minimum | Recommended |
|---|---|---|
| Inference (quantized) | 1× GPU, 24 GB VRAM | 1× GPU, 48+ GB VRAM |
| QLoRA Training (SFT/DPO) | 1× GPU, 24 GB VRAM | 4× GPU, 48+ GB VRAM |
| Full Evaluation (125 experiments) | 1× GPU, 48 GB VRAM | 8× GPU, 96 GB VRAM |

Tested on: 8× NVIDIA RTX PRO 6000 Blackwell (96 GB each).


Citation

@inproceedings{acmicl2026,
  title={ACM-ICL: Autonomy-Calibrated Multi-Agent In-Context Learning for Epistemic Robustness Under Social Pressure},
  author={Anonymous},
  booktitle={Conference on Language Modeling (COLM)},
  year={2026}
}

License

Apache 2.0
