A benchmark for measuring the ability of large language models to detect semantic conflicts across domains — including implicit conflicts that require multi-hop logical reasoning to uncover.
Modelled after TruthfulQA and FEVER. Plug in any model in one Python class.
180 items · 6 domains · difficulty levels L0–L5 · evaluated April 2026
| Model | LW-F1 ↑ | Accuracy | Macro-F1 | Bin-Conflict-F1 | Expl BERTScore-F1 |
|---|---|---|---|---|---|
| Optimal | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| claude-sonnet-4-6 | 0.7319 | 0.8111 | 0.7211 | 0.9380 | 0.8223 |
| claude-haiku-4-5 | 0.6783 | 0.8056 | 0.6893 | 0.9531 | 0.8136 |
Metric definitions
| Metric | Formula | Why chosen | What it diagnoses |
|---|---|---|---|
| LW-F1 (primary) | Σ((L+1) × conflict_F1_L) / Σ(L+1) where conflict_F1_L is the conflict-class F1 computed on 18 level-L conflict items + all 72 consistent/ambiguous items; weights L0=1× … L5=6× | Deeper conflicts require genuine multi-hop reasoning and should count more than surface pattern-matching. Including non-conflict items forces precision to be measured alongside recall, so a model can't score high by predicting "conflict" for everything. | Whether the model actually reasons at depth. Flat scores across levels = single-class predictor. Perfect model = 1.000; always-predict-conflict baseline = 0.333. |
| Accuracy | correct / N across all 180 items | Interpretable to non-specialists; shows overall verdict quality at a glance. | Gross correctness. Can be gamed by a model that memorises the 60% conflict base rate, so must be read alongside Macro-F1. |
| Macro-F1 | (F1_conflict + F1_consistent + F1_ambiguous) / 3 where each F1_c = 2PR/(P+R), P = TP/(TP+FP), R = TP/(TP+FN) | Unweighted average penalises models that collapse predictions to one class. A model that always says "conflict" scores 0.25 (conflict F1 = 0.75, the other two classes 0) despite 60% accuracy, exposing the single-class failure mode. | Class balance. Low Macro-F1 with high Accuracy = the model ignores consistent or ambiguous labels entirely. |
| Bin-Conflict-F1 | Binary F1 with positive = {conflict, ambiguous}, negative = {consistent}; F1 = 2PR/(P+R) | Measures the practically important ability to flag items that need human review (anything not clearly consistent) without requiring the model to distinguish conflict from ambiguous. | Over- vs. under-flagging. High Bin-F1 + low Macro-F1 = model detects risk but can't distinguish ambiguous from outright conflict. |
| Expl BERTScore-F1 | BERTScore F1 (distilbert-base-uncased) of model explanation vs. ground-truth reasoning_chain, averaged over conflict and ambiguous items only | Checks whether the model traces the correct logical path, not just gets the label right. BERTScore uses contextual embeddings to capture semantic similarity and paraphrasing that token-overlap metrics miss. | Explanation faithfulness. A model can predict "conflict" for the wrong reason; low BERTScore exposes this. Not computed for consistent items (no reference chain). |
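
The label-based metrics in the table can be sketched in a few lines of plain Python. This is an illustrative re-implementation of the formulas as written above, not the benchmark's own scoring code; the function names are hypothetical. The demo scores an always-predict-conflict baseline against gold labels with the benchmark's class counts (108 conflict, 48 consistent, 24 ambiguous).

```python
CLASSES = ("conflict", "consistent", "ambiguous")


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2PR/(P+R), taken as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def class_f1(gold, pred, positive) -> float:
    """F1 where `positive` is the set of labels treated as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(g in positive and p in positive for g, p in pairs)
    fp = sum(g not in positive and p in positive for g, p in pairs)
    fn = sum(g in positive and p not in positive for g, p in pairs)
    return f1(tp, fp, fn)


def accuracy(gold, pred) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred) -> float:
    """Unweighted mean of the three per-class F1 scores."""
    return sum(class_f1(gold, pred, {c}) for c in CLASSES) / len(CLASSES)


def bin_conflict_f1(gold, pred) -> float:
    """Binary F1 with positive = {conflict, ambiguous}."""
    return class_f1(gold, pred, {"conflict", "ambiguous"})


# Gold labels with the benchmark's class counts: 6 domains x (18 + 8 + 4) items.
gold = ["conflict"] * 108 + ["consistent"] * 48 + ["ambiguous"] * 24
always_conflict = ["conflict"] * len(gold)

print(f"accuracy        = {accuracy(gold, always_conflict):.3f}")  # 0.600, the conflict base rate
print(f"macro-F1        = {macro_f1(gold, always_conflict):.3f}")
print(f"bin-conflict-F1 = {bin_conflict_f1(gold, always_conflict):.3f}")
```

Treating F1 as 0.0 when a class is never predicted correctly is a convention chosen for this sketch; the benchmark's handling of that edge case may differ.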
Each level's N = 18 conflict items at that depth + 72 consistent/ambiguous items (90 total). This measures whether the model can identify level-L conflicts without over-predicting on non-conflict items.
Statistical note: Per-level scores at L4–L5 are based on 18 conflict items each (95% CI ≈ ±0.33 on conflict-class F1) and should be interpreted with wide uncertainty. Cross-level comparisons at these depths are indicative only until the dataset is expanded (see Roadmap).
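
To make the per-level subsets concrete, the sketch below builds synthetic gold labels with the documented composition (per domain: 3 conflict items at each of L0–L5, 8 consistent, 4 ambiguous) and checks the size of each level's evaluation set. The helper names are hypothetical and no real benchmark data is loaded.

```python
DOMAINS = ["philosophy", "software", "law", "science", "content", "teams"]


def build_labels():
    """Synthetic (verdict, level) pairs with the documented per-domain composition."""
    labels = []
    for _ in DOMAINS:
        for level in range(6):
            labels += [("conflict", level)] * 3  # 3 conflict items per level
        labels += [("consistent", None)] * 8
        labels += [("ambiguous", None)] * 4
    return labels


def level_subset(labels, level):
    """Level-L conflict items plus every consistent/ambiguous item."""
    return [(v, lv) for v, lv in labels if v != "conflict" or lv == level]


def conflict_f1(gold, pred):
    """Conflict-class F1 = 2PR/(P+R); 0.0 when nothing is correctly flagged."""
    pairs = list(zip(gold, pred))
    tp = sum(g == "conflict" and p == "conflict" for g, p in pairs)
    fp = sum(g != "conflict" and p == "conflict" for g, p in pairs)
    fn = sum(g == "conflict" and p != "conflict" for g, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


labels = build_labels()
for level in range(6):
    subset = level_subset(labels, level)
    gold = [verdict for verdict, _ in subset]
    always_conflict = ["conflict"] * len(gold)
    score = conflict_f1(gold, always_conflict)
    print(f"L{level}: N={len(subset)}  always-conflict F1={score:.3f}")
# Every level prints N=90 and F1=0.333 (precision 18/90, recall 1).
```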
Per-level results for claude-sonnet-4-6:

| Level | Name | Accuracy | Macro-F1 | Conflict-F1 | BERTScore-F1 | N |
|---|---|---|---|---|---|---|
| L0 | Surface | 0.7889 | 0.7474 | 0.8095 | 0.8164 | 90 |
| L1 | Immediate | 0.7889 | 0.7457 | 0.8095 | 0.8225 | 90 |
| L2 | Two-hop | 0.7778 | 0.7315 | 0.7805 | 0.8232 | 90 |
| L3 | Three-hop | 0.7889 | 0.7474 | 0.8095 | 0.8219 | 90 |
| L4 | Four-hop | 0.7222 | 0.6562 | 0.6111 | 0.8215 | 90 |
| L5 | Deep | 0.7556 | 0.7035 | 0.7179 | 0.8258 | 90 |
Sonnet's Conflict-F1 drops sharply at L4 (0.611) and partially recovers at L5 (0.718). One possible explanation is that L5 items contain more explicit intermediate statements; with only 18 conflict items per level, the L4–L5 gap may also be noise.
Per-level results for claude-haiku-4-5:

| Level | Name | Accuracy | Macro-F1 | Conflict-F1 | BERTScore-F1 | N |
|---|---|---|---|---|---|---|
| L0 | Surface | 0.7778 | 0.6929 | 0.7391 | 0.8004 | 90 |
| L1 | Immediate | 0.7556 | 0.6642 | 0.6818 | 0.8087 | 90 |
| L2 | Two-hop | 0.7667 | 0.6779 | 0.7111 | 0.8116 | 90 |
| L3 | Three-hop | 0.7667 | 0.6779 | 0.7111 | 0.8093 | 90 |
| L4 | Four-hop | 0.7444 | 0.6510 | 0.6512 | 0.8116 | 90 |
| L5 | Deep | 0.7444 | 0.6510 | 0.6512 | 0.8107 | 90 |
Haiku's Conflict-F1 falls from 0.739 at L0 to 0.651 at L4–L5, but the decline is not monotonic: L1 (0.682) sits below L2–L3 (0.711). The overall trend is consistent with degradation as reasoning depth increases.
Per-domain results for claude-sonnet-4-6:

| Domain | Accuracy | Macro-F1 | Conflict-F1 | BERTScore-F1 | N |
|---|---|---|---|---|---|
| philosophy | 0.8000 | 0.7606 | 0.8485 | 0.8235 | 30 |
| software | 0.8667 | 0.8159 | 0.9143 | 0.8229 | 30 |
| law | 0.6667 | 0.5858 | 0.7333 | 0.8286 | 30 |
| science | 0.8000 | 0.6814 | 0.8824 | 0.8175 | 30 |
| content | 0.8333 | 0.6707 | 0.9730 | 0.8166 | 30 |
| teams | 0.9000 | 0.8174 | 0.9474 | 0.8239 | 30 |
Per-domain results for claude-haiku-4-5:

| Domain | Accuracy | Macro-F1 | Conflict-F1 | BERTScore-F1 | N |
|---|---|---|---|---|---|
| philosophy | 0.8000 | 0.6827 | 0.8649 | 0.8113 | 30 |
| software | 0.8000 | 0.5926 | 0.8889 | 0.8202 | 30 |
| law | 0.6000 | 0.5480 | 0.6667 | 0.8135 | 30 |
| science | 0.8000 | 0.6814 | 0.8824 | 0.8188 | 30 |
| content | 0.9333 | 0.8875 | 0.9714 | 0.8052 | 30 |
| teams | 0.9000 | 0.7744 | 0.9231 | 0.8137 | 30 |
```bash
git clone https://github.com/your-org/semantic-conflicts-benchmark
cd semantic-conflicts-benchmark
pip install -r requirements-lock.txt  # exact versions used for published results
pip install -e ".[anthropic]"
ANTHROPIC_API_KEY=sk-... scbench \
  --adapter anthropic \
  --model claude-haiku-4-5-20251001 \
  --verbose
```

Results are written to results/ as JSON, Markdown, and a cumulative leaderboard.csv.
Implement one class and run:
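```python
# my_model.py
from benchmark.adapters.base import AdapterResponse, ModelAdapter


class MyAdapter(ModelAdapter):
    @property
    def name(self) -> str:
        return "my-model-v1"

    def predict(self, prompt: str) -> AdapterResponse:
        # my_model_inference is a placeholder for your own inference call.
        raw = my_model_inference(self.system_prompt, prompt)
        # The system prompt asks for the verdict on the first line, so only
        # that line is checked; an explanation may mention "conflict" too.
        lines = raw.strip().splitlines()
        first_line = lines[0].upper() if lines else ""
        if "CONFLICT" in first_line:
            verdict = "conflict"
        elif "AMBIGUOUS" in first_line:
            verdict = "ambiguous"
        else:
            verdict = "consistent"
        return AdapterResponse(verdict=verdict, explanation=raw, raw_response=raw)
```

```bash
scbench --adapter-class my_model.MyAdapter
```

See examples/custom_adapter.py for a full Ollama example.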
| Flag | Model argument | Install extra |
|---|---|---|
| `--adapter openai` | `--model gpt-4o` | `pip install -e ".[openai]"` |
| `--adapter anthropic` | `--model claude-sonnet-4-6` | `pip install -e ".[anthropic]"` |
| `--adapter hf` | `--model mistralai/Mistral-7B-Instruct-v0.2` | `pip install -e ".[hf]"` |
```
scbench [options]

  --adapter {openai,anthropic,hf}   Built-in adapter to use
  --adapter-class MODULE.CLASS      Custom adapter class (overrides --adapter)
  --model MODEL                     Model name/ID passed to the adapter
  --domains D1 D2 ...               Subset of domains to evaluate (default: all six)
  --data-dir PATH                   Path to data directory (default: data/)
  --output PATH                     JSON output path (default: results/<run_id>.json)
  --leaderboard PATH                CSV leaderboard path (default: results/leaderboard.csv)
  --verbose                         Print per-item predictions during evaluation
```
Every conflict item is tagged with a conflict_level (0–5) representing the minimum number of logically certain inference steps required to detect the conflict. Each step must follow from the previous with 100% certainty — no probabilistic leaps allowed.
| Level | Name | Description | Expected model behaviour |
|---|---|---|---|
| 0 | Surface | Contradiction is explicitly stated | All LLMs detect |
| 1 | Immediate | One certain inference collides with the other passage | Strong base models |
| 2 | Two-hop | Two chained inferences expose the conflict | GPT-4 class usually |
| 3 | Three-hop | Three chained inferences | Frontier models sometimes miss |
| 4 | Four-hop | Four chained inferences | Most models fail |
| 5 | Deep | Five chained inferences | Near-universal failure zone |
LW-F1 = Σ ((level + 1) × conflict_F1_at_level) / Σ (level + 1)
where conflict_F1_at_level is the conflict-class F1 computed on the level-L conflict items plus all consistent/ambiguous items (90 items per level). Including non-conflict items at every level means precision is penalised — a model that blindly labels everything "conflict" gets low precision and therefore low F1.
Weights: L0=1, L1=2, L2=3, L3=4, L4=5, L5=6. A model that only detects surface conflicts scores ~0.048; a perfect model scores 1.0; a naive always-predict-conflict baseline scores 0.333. This prevents high-accuracy models from hiding shallow reasoning.
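
As a quick check of the formula, the sketch below recomputes LW-F1 from the per-level Conflict-F1 values in the results tables, along with the two reference points quoted above. `lw_f1` is a hypothetical helper written for this illustration, not the benchmark's implementation.

```python
def lw_f1(per_level_f1):
    """Level-weighted F1: the score at level L is weighted by L + 1."""
    weights = [level + 1 for level in range(len(per_level_f1))]
    weighted = sum(w * f for w, f in zip(weights, per_level_f1))
    return weighted / sum(weights)


# Conflict-F1 at L0..L5, copied from the per-level results tables.
sonnet = [0.8095, 0.8095, 0.7805, 0.8095, 0.6111, 0.7179]
haiku = [0.7391, 0.6818, 0.7111, 0.7111, 0.6512, 0.6512]

print(round(lw_f1(sonnet), 3))              # 0.732
print(round(lw_f1(haiku), 3))               # 0.678
print(round(lw_f1([1, 0, 0, 0, 0, 0]), 3))  # 0.048, surface-only detector
print(round(lw_f1([1 / 3] * 6), 3))         # 0.333, always-predict-conflict
```

Both recomputed values agree with the leaderboard (0.7319 and 0.6783) up to rounding of the per-level inputs.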
| Domain | Conflict type | Example |
|---|---|---|
| philosophy | Conceptual/ideological conflicts between philosophical positions | Kant's categorical imperative vs Utilitarian calculus |
| software | State and contract conflicts in distributed systems | UserService says ACTIVE; BillingService says SUSPENDED |
| law | Conflicts between statutes, contracts, and evidence | NLRA's collective bargaining right vs employment contract waiver |
| science | Temporal conflicts between hypothesis and incoming evidence | Pre-1984 ulcer theory vs H. pylori discovery |
| content | Self-contradictions in an influencer's published statements | "I never take supplements" vs daily whey shake promotion |
| teams | Conflicts between team norms and a new joiner's stated principles | Async-only team vs AI agent that requires daily standups |
Each domain contains 30 items: 18 conflict (3 × L0–L5), 8 consistent, 4 ambiguous.
Every major benchmark measures a different underlying skill. The table below maps 10 standard benchmarks against the Semantic Conflicts Benchmark (SCB) across the dimensions that actually differentiate them — not just topic coverage.
| Benchmark | Core Skill | Input Format | Structured Difficulty | Explanation Scored | Detects Implicit Conflicts | Ambiguity Class | Domain Coverage |
|---|---|---|---|---|---|---|---|
| MMLU | Factual recall | MCQ (4-choice) | — | — | — | — | 57 academic subjects |
| GSM8K | Arithmetic reasoning | Word problem → number | — | — | — | — | Math only |
| HumanEval | Code synthesis | Docstring → Python fn | — | — | — | — | Code only |
| BIG-bench | Diverse reasoning (200+ subtasks) | Varies per task | Per-task only | — | — | — | Mixed |
| HellaSwag | Commonsense completion | Partial sentence → MCQ | — | — | — | — | Everyday text |
| TruthfulQA | Factual honesty / hallucination | Open-ended question | — | — | — | — | Myths & misconceptions |
| ARC | Elementary science reasoning | MCQ (4-choice) | Easy / Challenge split | — | — | — | Science only |
| Winogrande | Pronoun coreference | Fill-in-the-blank (2-choice) | — | — | — | — | Language |
| DROP | Reading comprehension + arithmetic | Passage + question → span | — | — | — | — | News / history |
| MT-Bench / Chatbot Arena | Conversational quality | Multi-turn dialogue | — | — | — | — | Open-ended |
| SCB (this work) ✦ | Semantic conflict detection | Passage pair → 3-class verdict | L0–L5 (6 levels) | BERTScore vs. reasoning chain | Yes (multi-hop) | Yes | 6 real-world domains |
1. Difficulty defined by reasoning depth, not subject matter. L0–L5 maps to the exact number of 100%-certain inference steps required to surface a conflict. This pinpoints where a model's reasoning chain breaks — not just whether it gets the final answer right. No other benchmark in this list offers structured depth at this granularity.
2. Tests the skill that determines real-world reliability. MMLU asks "do you know X?" SCB asks "do you notice that source A and source B contradict each other?" — the skill a model needs when reviewing contracts, audit trails, distributed-system logs, or scientific literature. Pattern-matching is insufficient; derived semantics must be checked.
3. Explanation faithfulness is a first-class metric. A model can predict "conflict" for the wrong reason. Scoring model explanations with BERTScore against ground-truth reasoning_chain fields catches this. No other benchmark in this list measures whether the model traced the correct logical path.
4. Ambiguity is a scored class, not noise. Most benchmarks force binary correctness. SCB's 3-class taxonomy (conflict / consistent / ambiguous) rewards models that correctly surface genuine uncertainty, and penalises both over-confident labelling and reflexive hedging. Macro-F1 over all three classes is reported alongside the primary LW-F1.
5. Domains drawn from real professional contexts. Philosophy, software, law, science, content moderation, team dynamics — chosen because semantic conflicts arise in each with real consequences, not because they produce convenient synthetic puzzles.
6. LW-F1 prevents shallow-reasoning models from hiding. A model that only catches surface (L0) contradictions scores ≈ 0.048 on LW-F1 even with 100% L0 accuracy. The level-weighted primary metric exposes the "pattern-matching vs. reasoning" gap that flat accuracy obscures.
| If you want to measure… | Reach for |
|---|---|
| General knowledge breadth across subjects | MMLU |
| Math, code, or language fundamentals | GSM8K · HumanEval · Winogrande |
| Truthfulness and hallucination resistance | TruthfulQA |
| Multi-hop semantic conflict detection and reasoning faithfulness | SCB |
The model reads two passages and classifies the relationship as:
- conflict — the passages are logically incompatible (directly or through inference)
- consistent — the passages are compatible; no logical contradiction exists
- ambiguous — whether a conflict exists depends on information not provided
Metrics: accuracy, macro-F1, binary conflict F1 (conflict+ambiguous vs. consistent), per-class F1, LW-F1 (primary).
For conflict and ambiguous predictions, the model's explanation is scored against the ground-truth reasoning_chain using BERTScore F1 (distilbert-base-uncased). This measures whether the model traces the correct logical path using semantic similarity rather than token overlap, capturing paraphrasing and synonymy that surface metrics miss.
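
A minimal sketch of this scoring step, using the `bert-score` package. Several details here are assumptions rather than the benchmark's confirmed behaviour: the reference is formed by joining the reasoning_chain steps with spaces, items are selected by their gold verdict, and `explanations` is a hypothetical dict mapping item IDs to model explanations.

```python
def explanation_pairs(items, explanations):
    """Pair each model explanation with its reference reasoning chain.

    Assumption: only items whose gold verdict is conflict or ambiguous
    carry a reference chain, so consistent items are skipped.
    """
    candidates, references = [], []
    for item in items:
        label = item["label"]
        if label["verdict"] not in ("conflict", "ambiguous"):
            continue
        if item["id"] not in explanations:
            continue
        candidates.append(explanations[item["id"]])
        # Assumption: the chain is flattened into a single reference string.
        references.append(" ".join(label["reasoning_chain"]))
    return candidates, references


def mean_bertscore_f1(candidates, references):
    """Mean BERTScore F1; downloads distilbert-base-uncased on first use."""
    from bert_score import score  # pip install bert-score

    _, _, f1 = score(candidates, references, model_type="distilbert-base-uncased")
    return f1.mean().item()


items = [
    {"id": "a", "label": {"verdict": "conflict",
                          "reasoning_chain": ["Step 1: X entails Y.", "Y contradicts B."]}},
    {"id": "b", "label": {"verdict": "consistent"}},
]
explanations = {"a": "X implies Y, which B denies.", "b": "No contradiction."}

candidates, references = explanation_pairs(items, explanations)
print(len(candidates), references[0])
# mean_bertscore_f1(candidates, references) would then return the averaged score.
```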
The built-in system_prompt (in benchmark/adapters/base.py) instructs models to:
- Derive implications: for each passage, enumerate every statement that follows with 100% logical certainty.
- Check for conflicts: look for contradictions both between literal statements and between their derived implications.
- Classify: return exactly `CONFLICT`, `CONSISTENT`, or `AMBIGUOUS` on the first line.
This forces models to work in the derived semantic space, not just pattern-match surface text — which is precisely what L2–L5 items require.
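
Because the verdict is expected on the first line, a response can be split into verdict and explanation with a small parser. The sketch below is one possible approach; `parse_response` is a hypothetical helper, and returning `None` for an unrecognised first line is a design choice of this example rather than documented benchmark behaviour.

```python
VERDICTS = ("conflict", "consistent", "ambiguous")


def parse_response(raw: str):
    """Split a model response into (verdict, explanation).

    Returns (None, raw) when the first line is not one of the three
    verdict keywords, so callers can decide how to treat malformed output.
    """
    lines = raw.strip().splitlines()
    if not lines:
        return None, ""
    # Tolerate light decoration such as "**CONFLICT**" or "CONFLICT:".
    first = lines[0].strip().strip("*:. ").lower()
    if first in VERDICTS:
        return first, "\n".join(lines[1:]).strip()
    return None, raw.strip()


print(parse_response("CONSISTENT\nNo conflict between A and B."))
# ('consistent', 'No conflict between A and B.')
print(parse_response("I think these might conflict."))
# (None, 'I think these might conflict.')
```

Matching the whole first line rather than searching the full response avoids mislabelling a CONSISTENT verdict whose explanation happens to contain the word "conflict".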
See data/schema.md for the full schema and contribution guide.
Quick example — a level-3 conflict item:
```json
{
  "id": "philosophy_L3_001",
  "domain": "philosophy",
  "conflict_level": 3,
  "difficulty": "hard",
  "input": {
    "passages": [
      { "id": "A", "text": "Rational self-interest is the objective moral standard...", "source": "Ayn Rand" },
      { "id": "B", "text": "Moral values are human constructions...", "source": "Nietzsche" }
    ],
    "question": "Do these passages present a conflict, are they consistent, or is the relationship ambiguous?"
  },
  "label": {
    "verdict": "conflict",
    "conflict_level": 3,
    "reasoning_chain": [
      "Step 1 — From A: 'objective moral standard' entails universal correctness...",
      "Step 2 — Universal correctness entails some values are correct independently of construction...",
      "Step 3 — This entails NOT all values are human constructions...",
      "CONFLICT — Step 3 contradicts B ('moral values are human constructions')."
    ]
  }
}
```

After a run, results/ contains:
| File | Contents |
|---|---|
| `<run_id>.json` | Full results: overall scores, per-level, per-domain, per-item predictions |
| `<run_id>.md` | Human-readable Markdown report with tables and failure examples |
| `leaderboard.csv` | One row per run, appended; suitable for version-controlled tracking |
```bash
pip install -e ".[dev]"
python -m pytest tests/ -v
```

leaderboard.csv columns: model, run_id, n_items, lw_f1, accuracy, macro_f1, binary_conflict_f1, expl_bert_f1, level_0_f1, level_1_f1, level_2_f1, level_3_f1, level_4_f1, level_5_f1, philosophy_f1, software_f1, law_f1, science_f1, content_f1, teams_f1
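
Since the leaderboard is a plain CSV with one row per run, ranking runs takes only the standard library. The sketch below assumes the file has a header row with the column names listed above and is comma-delimited; the `run_id` values are made up, and only a few columns are included to keep the sample short.

```python
import csv
import io

# Stand-in for results/leaderboard.csv; scores are the published ones,
# run_id values are hypothetical.
SAMPLE = """model,run_id,lw_f1,accuracy,macro_f1
claude-haiku-4-5,run-b,0.6783,0.8056,0.6893
claude-sonnet-4-6,run-a,0.7319,0.8111,0.7211
"""


def rank_runs(csv_file, key="lw_f1"):
    """Return leaderboard rows sorted by `key`, best first."""
    rows = list(csv.DictReader(csv_file))
    return sorted(rows, key=lambda row: float(row[key]), reverse=True)


for row in rank_runs(io.StringIO(SAMPLE)):
    print(f'{row["model"]:<20} {float(row["lw_f1"]):.4f}')
```

To rank a real file, pass `open("results/leaderboard.csv", newline="")` instead of the in-memory sample.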
- Read data/schema.md for item format and quality rules.
- Add items to the appropriate `data/<domain>/v1.json` file (one JSON per line).
- Run `python -m pytest tests/ -v` to verify everything loads and scores correctly.
- Open a pull request with a brief description of the items added.
Items at L4–L5 that follow the 100%-certainty step rule are especially welcome.
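
Before opening a pull request, contributors may find a quick structural check useful. The sketch below validates an item against the fields visible in the example item in this README; the specific rules are assumptions made for illustration, and data/schema.md remains the authoritative source. `validate_item` is a hypothetical helper, not part of the benchmark package.

```python
VERDICTS = {"conflict", "consistent", "ambiguous"}
DOMAINS = {"philosophy", "software", "law", "science", "content", "teams"}


def validate_item(item: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing field: {key}"
                for key in ("id", "domain", "input", "label") if key not in item]
    if problems:
        return problems
    if item["domain"] not in DOMAINS:
        problems.append(f"unknown domain: {item['domain']}")
    if len(item["input"].get("passages", [])) != 2:
        problems.append("expected exactly two passages")
    label = item["label"]
    verdict = label.get("verdict")
    if verdict not in VERDICTS:
        problems.append(f"unknown verdict: {verdict}")
    if verdict == "conflict":
        level = label.get("conflict_level")
        if level not in range(6):
            problems.append("conflict items need conflict_level 0-5")
        elif item.get("conflict_level") != level:
            problems.append("top-level and label conflict_level disagree")
    # Assumption: conflict and ambiguous items both carry a reasoning chain.
    if verdict in ("conflict", "ambiguous") and not label.get("reasoning_chain"):
        problems.append("missing reasoning_chain")
    return problems


example = {
    "id": "philosophy_L3_001",
    "domain": "philosophy",
    "conflict_level": 3,
    "input": {"passages": [{"id": "A", "text": "..."}, {"id": "B", "text": "..."}]},
    "label": {"verdict": "conflict", "conflict_level": 3,
              "reasoning_chain": ["Step 1 ...", "Step 2 ...", "Step 3 ...", "CONFLICT ..."]},
}
print(validate_item(example))  # []
```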