Semantic Conflicts Benchmark

A benchmark for measuring the ability of large language models to detect semantic conflicts across domains — including implicit conflicts that require multi-hop logical reasoning to uncover.

Modelled after TruthfulQA and FEVER. Plug in any model in one Python class.

Leaderboard

180 items · 6 domains · difficulty levels L0–L5 · evaluated April 2026

Overall scores

Model	LW-F1 ↑	Accuracy	Macro-F1	Bin-Conflict-F1	Expl BERTScore-F1
Optimal	1.0000	1.0000	1.0000	1.0000	1.0000
claude-sonnet-4-6	0.7319	0.8111	0.7211	0.9380	0.8223
claude-haiku-4-5	0.6783	0.8056	0.6893	0.9531	0.8136

Metric definitions

Metric	Formula	Why chosen	What it diagnoses
LW-F1 (primary)	`Σ((L+1) × conflict_F1_L) / Σ(L+1)` where `conflict_F1_L` is the conflict-class F1 computed on 18 level-L conflict items + all 72 consistent/ambiguous items; weights L0=1× … L5=6×	Deeper conflicts require genuine multi-hop reasoning and should count more than surface pattern-matching. Including non-conflict items forces precision to be measured alongside recall, so a model can't score high by predicting "conflict" for everything.	Whether the model actually reasons at depth. Flat scores across levels = single-class predictor. Perfect model = 1.000; always-predict-conflict baseline = 0.333.
Accuracy	`correct / N` across all 180 items	Interpretable to non-specialists; shows overall verdict quality at a glance.	Gross correctness. A model that always predicts "conflict" already reaches 0.600 by exploiting the 60% conflict base rate, so Accuracy must be read alongside Macro-F1.
Macro-F1	`(F1_conflict + F1_consistent + F1_ambiguous) / 3` where each `F1_c = 2PR/(P+R)`, `P = TP/(TP+FP)`, `R = TP/(TP+FN)`	Unweighted average penalises models that collapse predictions to one class. A model that always says "conflict" scores 0.250 (F1_conflict = 0.750, both other classes 0) despite 0.600 accuracy, exposing the single-class failure mode.	Class balance. Low Macro-F1 with high Accuracy = the model ignores consistent or ambiguous labels entirely.
Bin-Conflict-F1	Binary F1 with positive = `{conflict, ambiguous}`, negative = `{consistent}`; `F1 = 2PR/(P+R)`	Measures the practically important ability to flag items that need human review (anything not clearly consistent) without requiring the model to distinguish conflict from ambiguous.	Over- vs. under-flagging. High Bin-F1 + low Macro-F1 = model detects risk but can't distinguish ambiguous from outright conflict.
Expl BERTScore-F1	BERTScore F1 (distilbert-base-uncased) of model explanation vs. ground-truth `reasoning_chain`, averaged over conflict and ambiguous items only	Checks whether the model traces the correct logical path, not just gets the label right. BERTScore uses contextual embeddings to capture semantic similarity and paraphrasing that token-overlap metrics miss.	Explanation faithfulness. A model can predict "conflict" for the wrong reason; low BERTScore exposes this. Not computed for consistent items (no reference chain).
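
The verdict metrics in the table can be reproduced with standard-library Python. The sketch below is illustrative only: the helper names (`f1_for`, `macro_f1`, `binary_conflict_f1`) are hypothetical and not taken from the benchmark package, and it assumes the convention that an undefined F1 (a class that is never predicted or never present) counts as 0. It evaluates the always-predict-"conflict" baseline on the dataset's 108 / 48 / 24 label split.

```python
from typing import Sequence, Set

CLASSES = ("conflict", "consistent", "ambiguous")


def f1_for(gold: Sequence[str], pred: Sequence[str], positive: Set[str]) -> float:
    """F1 = 2PR / (P + R), treating every label in `positive` as the positive class."""
    tp = sum(g in positive and p in positive for g, p in zip(gold, pred))
    fp = sum(g not in positive and p in positive for g, p in zip(gold, pred))
    fn = sum(g in positive and p not in positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0  # assumed convention: undefined F1 counts as 0
    return 2 * precision * recall / (precision + recall)


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of the three per-class F1 scores."""
    return sum(f1_for(gold, pred, {c}) for c in CLASSES) / len(CLASSES)


def binary_conflict_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Binary F1 with positive = {conflict, ambiguous}, negative = {consistent}."""
    return f1_for(gold, pred, {"conflict", "ambiguous"})


# Label split across all 180 items: 6 domains x (18 conflict, 8 consistent, 4 ambiguous).
gold = ["conflict"] * 108 + ["consistent"] * 48 + ["ambiguous"] * 24
always_conflict = ["conflict"] * len(gold)

print(f"accuracy        = {accuracy(gold, always_conflict):.3f}")            # 0.600
print(f"macro-F1        = {macro_f1(gold, always_conflict):.3f}")            # 0.250
print(f"bin-conflict-F1 = {binary_conflict_f1(gold, always_conflict):.3f}")  # 0.846
```

The gap between 0.600 accuracy and 0.250 Macro-F1 is the single-class failure mode the table describes.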

By difficulty level

Each level's N = 18 conflict items at that depth + 72 consistent/ambiguous items (90 total). This measures whether the model can identify level-L conflicts without over-predicting on non-conflict items.

Statistical note: Each per-level score rests on only 18 conflict items (95% CI ≈ ±0.33 on conflict-class F1), so differences between levels, including the L4–L5 results, should be interpreted with wide uncertainty. Cross-level comparisons are indicative only until the dataset is expanded (see Roadmap).

claude-sonnet-4-6

Level	Name	Accuracy	Macro-F1	Conflict-F1	BERTScore-F1	N
L0	Surface	0.7889	0.7474	0.8095	0.8164	90
L1	Immediate	0.7889	0.7457	0.8095	0.8225	90
L2	Two-hop	0.7778	0.7315	0.7805	0.8232	90
L3	Three-hop	0.7889	0.7474	0.8095	0.8219	90
L4	Four-hop	0.7222	0.6562	0.6111	0.8215	90
L5	Deep	0.7556	0.7035	0.7179	0.8258	90

Sonnet's Conflict-F1 drops at L4 (0.611) and partially recovers at L5 (0.718). One possible explanation, not verified here, is that L5 items contain more explicit intermediate statements; with only 18 conflict items per level, the difference may also be noise.

claude-haiku-4-5

Level	Name	Accuracy	Macro-F1	Conflict-F1	BERTScore-F1	N
L0	Surface	0.7778	0.6929	0.7391	0.8004	90
L1	Immediate	0.7556	0.6642	0.6818	0.8087	90
L2	Two-hop	0.7667	0.6779	0.7111	0.8116	90
L3	Three-hop	0.7667	0.6779	0.7111	0.8093	90
L4	Four-hop	0.7444	0.6510	0.6512	0.8116	90
L5	Deep	0.7444	0.6510	0.6512	0.8107	90

Haiku's Conflict-F1 falls from 0.739 at L0 to 0.651 at L4–L5, but the decline is not monotonic: L1 (0.682) sits below L2–L3 (0.711). The overall trend is consistent with degradation as reasoning depth increases.

By domain

claude-sonnet-4-6

Domain	Accuracy	Macro-F1	Conflict-F1	BERTScore-F1	N
philosophy	0.8000	0.7606	0.8485	0.8235	30
software	0.8667	0.8159	0.9143	0.8229	30
law	0.6667	0.5858	0.7333	0.8286	30
science	0.8000	0.6814	0.8824	0.8175	30
content	0.8333	0.6707	0.9730	0.8166	30
teams	0.9000	0.8174	0.9474	0.8239	30

claude-haiku-4-5

Domain	Accuracy	Macro-F1	Conflict-F1	BERTScore-F1	N
philosophy	0.8000	0.6827	0.8649	0.8113	30
software	0.8000	0.5926	0.8889	0.8202	30
law	0.6000	0.5480	0.6667	0.8135	30
science	0.8000	0.6814	0.8824	0.8188	30
content	0.9333	0.8875	0.9714	0.8052	30
teams	0.9000	0.7744	0.9231	0.8137	30

Quickstart

git clone https://github.com/your-org/semantic-conflicts-benchmark
cd semantic-conflicts-benchmark
pip install -r requirements-lock.txt   # exact versions used for published results
pip install -e ".[anthropic]"

ANTHROPIC_API_KEY=sk-... scbench \
  --adapter anthropic \
  --model claude-haiku-4-5-20251001 \
  --verbose

Results are written to results/ as JSON, Markdown, and a cumulative leaderboard.csv.

Custom Model (30 seconds)

Implement one class and run:

# my_model.py
from benchmark.adapters.base import AdapterResponse, ModelAdapter

class MyAdapter(ModelAdapter):
    @property
    def name(self) -> str:
        return "my-model-v1"

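    def predict(self, prompt: str) -> AdapterResponse:
        # my_model_inference is a placeholder for your own inference call.
        raw = my_model_inference(self.system_prompt, prompt)
        # The system prompt asks for the verdict on the first line, so parse only
        # that line; the explanation may mention "conflict" in passing.
        lines = raw.strip().splitlines()
        first_line = lines[0].upper() if lines else ""
        if "CONFLICT" in first_line:
            verdict = "conflict"
        elif "AMBIGUOUS" in first_line:
            verdict = "ambiguous"
        else:
            verdict = "consistent"
        return AdapterResponse(verdict=verdict, explanation=raw, raw_response=raw)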

scbench --adapter-class my_model.MyAdapter

See examples/custom_adapter.py for a full Ollama example.

Built-in Adapters

Flag	Model argument	Install extra
`--adapter openai`	`--model gpt-4o`	`pip install -e ".[openai]"`
`--adapter anthropic`	`--model claude-sonnet-4-6`	`pip install -e ".[anthropic]"`
`--adapter hf`	`--model mistralai/Mistral-7B-Instruct-v0.2`	`pip install -e ".[hf]"`

CLI Reference

scbench [options]

  --adapter {openai,anthropic,hf}   Built-in adapter to use
  --adapter-class MODULE.CLASS      Custom adapter class (overrides --adapter)
  --model MODEL                     Model name/ID passed to the adapter
  --domains D1 D2 ...               Subset of domains to evaluate (default: all six)
  --data-dir PATH                   Path to data directory (default: data/)
  --output PATH                     JSON output path (default: results/<run_id>.json)
  --leaderboard PATH                CSV leaderboard path (default: results/leaderboard.csv)
  --verbose                         Print per-item predictions during evaluation

Difficulty Levels

Every conflict item is tagged with a conflict_level (0–5) representing the minimum number of logically certain inference steps required to detect the conflict. Each step must follow from the previous with 100% certainty — no probabilistic leaps allowed.

Level	Name	Description	Expected model behaviour
0	Surface	Contradiction is explicitly stated	All LLMs detect
1	Immediate	One certain inference collides with the other passage	Strong base models
2	Two-hop	Two chained inferences expose the conflict	GPT-4 class usually
3	Three-hop	Three chained inferences	Frontier models sometimes miss
4	Four-hop	Four chained inferences	Most models fail
5	Deep	Five chained inferences	Near-universal failure zone

Level-Weighted F1 (LW-F1) — primary score

LW-F1 = Σ ((level + 1) × conflict_F1_at_level) / Σ (level + 1)

where conflict_F1_at_level is the conflict-class F1 computed on the level-L conflict items plus all consistent/ambiguous items (90 items per level). Because the non-conflict items are included at every level, false positives lower precision: a model that blindly labels everything "conflict" has precision 18/90 = 0.2 at each level and therefore a low F1.

Weights: L0=1, L1=2, L2=3, L3=4, L4=5, L5=6. A model that only detects surface conflicts scores ~0.048; a perfect model scores 1.0; a naive always-predict-conflict baseline scores 0.333. This prevents high-accuracy models from hiding shallow reasoning.
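
As a worked check, the formula can be applied to the per-level Conflict-F1 values from the leaderboard tables and to the two reference points quoted above. This is a minimal sketch, not the benchmark's own scoring code; `lw_f1` is a hypothetical helper name.

```python
from typing import Mapping


def lw_f1(conflict_f1_by_level: Mapping[int, float]) -> float:
    """Level-weighted F1: level L contributes with weight L + 1."""
    total_weight = sum(level + 1 for level in conflict_f1_by_level)
    weighted_sum = sum((level + 1) * f1 for level, f1 in conflict_f1_by_level.items())
    return weighted_sum / total_weight


# Per-level Conflict-F1 copied from the "By difficulty level" tables.
sonnet = {0: 0.8095, 1: 0.8095, 2: 0.7805, 3: 0.8095, 4: 0.6111, 5: 0.7179}
haiku = {0: 0.7391, 1: 0.6818, 2: 0.7111, 3: 0.7111, 4: 0.6512, 5: 0.6512}

# Always-predict-conflict on one level subset (18 conflict + 72 non-conflict items):
# every conflict is found (recall 1.0), but only 18 of 90 predictions are right.
precision, recall = 18 / 90, 1.0
always_conflict_f1 = 2 * precision * recall / (precision + recall)
baseline = lw_f1({level: always_conflict_f1 for level in range(6)})

# Perfect at L0, zero at every deeper level.
surface_only = lw_f1({0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0})

print(f"sonnet       = {lw_f1(sonnet):.4f}")   # 0.7319
print(f"haiku        = {lw_f1(haiku):.4f}")    # 0.6783
print(f"baseline     = {baseline:.3f}")        # 0.333
print(f"surface only = {surface_only:.3f}")    # 0.048
```

The weights sum to 21, so L5 alone accounts for 6/21 (about 29%) of the score, while L0 accounts for 1/21.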

Domains

Domain	Conflict type	Example
philosophy	Conceptual/ideological conflicts between philosophical positions	Kant's categorical imperative vs Utilitarian calculus
software	State and contract conflicts in distributed systems	UserService says ACTIVE; BillingService says SUSPENDED
law	Conflicts between statutes, contracts, and evidence	NLRA's collective bargaining right vs employment contract waiver
science	Temporal conflicts between hypothesis and incoming evidence	Pre-1984 ulcer theory vs H. pylori discovery
content	Self-contradictions in an influencer's published statements	"I never take supplements" vs daily whey shake promotion
teams	Conflicts between team norms and a new joiner's stated principles	Async-only team vs AI agent that requires daily standups

Each domain contains 30 items: 18 conflict (3 × L0–L5), 8 consistent, 4 ambiguous.
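
The per-domain composition determines every count quoted elsewhere in this README (180 items, 90 items per level subset, 60% conflict base rate). The short calculation below simply derives those figures; the constant names are illustrative.

```python
DOMAINS = ("philosophy", "software", "law", "science", "content", "teams")
LEVELS = range(6)              # L0 through L5
CONFLICT_PER_LEVEL = 3         # conflict items per level, per domain
CONSISTENT_PER_DOMAIN = 8
AMBIGUOUS_PER_DOMAIN = 4

conflict_per_domain = CONFLICT_PER_LEVEL * len(LEVELS)                 # 18
items_per_domain = (
    conflict_per_domain + CONSISTENT_PER_DOMAIN + AMBIGUOUS_PER_DOMAIN
)                                                                      # 30
total_items = items_per_domain * len(DOMAINS)                          # 180
total_conflict = conflict_per_domain * len(DOMAINS)                    # 108
non_conflict = total_items - total_conflict                            # 72

# One per-level evaluation subset: that level's conflicts from all six
# domains, plus every consistent/ambiguous item in the dataset.
level_subset_size = CONFLICT_PER_LEVEL * len(DOMAINS) + non_conflict   # 18 + 72 = 90
conflict_base_rate = total_conflict / total_items                      # 0.6

print(total_items, level_subset_size, conflict_base_rate)
```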

How This Benchmark Compares

Every major benchmark measures a different underlying skill. The table below maps 10 standard benchmarks against the Semantic Conflicts Benchmark (SCB) across the dimensions that actually differentiate them — not just topic coverage.

Benchmark Skill Taxonomy

Benchmark	Core Skill	Input Format	Structured Difficulty	Explanation Scored	Detects Implicit Conflicts	Ambiguity Class	Domain Coverage
MMLU	Factual recall	MCQ (4-choice)	—	—	—	—	57 academic subjects
GSM8K	Arithmetic reasoning	Word problem → number	—	—	—	—	Math only
HumanEval	Code synthesis	Docstring → Python fn	—	—	—	—	Code only
BIG-bench	Diverse reasoning (200+ subtasks)	Varies per task	Per-task only	—	—	—	Mixed
HellaSwag	Commonsense completion	Partial sentence → MCQ	—	—	—	—	Everyday text
TruthfulQA	Factual honesty / hallucination	Open-ended question	—	—	—	—	Myths & misconceptions
ARC	Elementary science reasoning	MCQ (4-choice)	Easy / Challenge split	—	—	—	Science only
Winogrande	Pronoun coreference	Fill-in-the-blank (2-choice)	—	—	—	—	Language
DROP	Reading comprehension + arithmetic	Passage + question → span	—	—	—	—	News / history
MT-Bench / Chatbot Arena	Conversational quality	Multi-turn dialogue	—	—	—	—	Open-ended
SCB (this work) ✦	Semantic conflict detection	Passage pair → 3-class verdict	L0–L5 (6 levels)	BERTScore vs. reasoning chain	Yes (multi-hop)	Yes	6 real-world domains

Unique Value Propositions

1. Difficulty defined by reasoning depth, not subject matter. L0–L5 maps to the exact number of 100%-certain inference steps required to surface a conflict. This pinpoints where a model's reasoning chain breaks — not just whether it gets the final answer right. No other benchmark in this list offers structured depth at this granularity.

2. Tests the skill that determines real-world reliability. MMLU asks "do you know X?" SCB asks "do you notice that source A and source B contradict each other?" — the skill a model needs when reviewing contracts, audit trails, distributed-system logs, or scientific literature. Pattern-matching is insufficient; derived semantics must be checked.

3. Explanation faithfulness is a first-class metric. A model can predict "conflict" for the wrong reason. Scoring model explanations with BERTScore against ground-truth reasoning_chain fields catches this. No other benchmark in this list measures whether the model traced the correct logical path.

4. Ambiguity is a scored class, not noise. Most benchmarks force binary correctness. SCB's 3-class taxonomy (conflict / consistent / ambiguous) rewards models that correctly surface genuine uncertainty, and penalises both over-confident labelling and reflexive hedging. Macro-F1 over all three classes is reported alongside the primary LW-F1.

5. Domains drawn from real professional contexts. Philosophy, software, law, science, creator content, and team dynamics were chosen because semantic conflicts arise in each with real consequences, not because they produce convenient synthetic puzzles.

6. LW-F1 prevents shallow-reasoning models from hiding. A model that only catches surface (L0) contradictions scores ≈ 0.048 on LW-F1 even with 100% L0 accuracy. The level-weighted primary metric exposes the "pattern-matching vs. reasoning" gap that flat accuracy obscures.

When to reach for this benchmark

If you want to measure…	Reach for
General knowledge breadth across subjects	MMLU
Math, code, or language fundamentals	GSM8K · HumanEval · Winogrande
Truthfulness and hallucination resistance	TruthfulQA
Multi-hop semantic conflict detection and reasoning faithfulness	SCB

Tasks and Metrics

Task 1: Verdict Classification (3-class)

The model reads two passages and classifies the relationship as:

conflict — the passages are logically incompatible (directly or through inference)
consistent — the passages are compatible; no logical contradiction exists
ambiguous — whether a conflict exists depends on information not provided

Metrics: accuracy, macro-F1, binary conflict F1 (conflict+ambiguous vs. consistent), per-class F1, LW-F1 (primary).

Task 2: Explanation Quality

For items whose ground-truth verdict is conflict or ambiguous, the model's explanation is scored against the ground-truth reasoning_chain using BERTScore F1 (distilbert-base-uncased); consistent items have no reference chain and are skipped. This measures whether the model traces the correct logical path using semantic similarity rather than token overlap, capturing paraphrasing and synonymy that surface metrics miss.
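
A rough sketch of how such an explanation score could be computed is shown below. It relies on the third-party `bert-score` package and its `score` function; everything else is an assumption made for illustration, including the helper names, the choice to join the `reasoning_chain` steps with spaces to form the reference, and the selection of items by ground-truth verdict. The scorer is passed in as a callable so that the aggregation can be tried offline with a stand-in.

```python
from statistics import mean
from typing import Callable, List, Sequence

Scorer = Callable[[Sequence[str], Sequence[str]], List[float]]


def bertscore_f1(candidates: Sequence[str], references: Sequence[str]) -> List[float]:
    """Per-pair BERTScore F1 using the third-party `bert-score` package."""
    from bert_score import score  # lazy import; the model is downloaded on first use

    _, _, f1 = score(
        list(candidates), list(references), model_type="distilbert-base-uncased"
    )
    return f1.tolist()


def mean_explanation_f1(
    items: Sequence[dict], explanations: Sequence[str], scorer: Scorer = bertscore_f1
) -> float:
    """Average explanation F1 over items whose gold verdict is conflict or ambiguous."""
    candidates, references = [], []
    for item, explanation in zip(items, explanations):
        label = item["label"]
        if label["verdict"] in {"conflict", "ambiguous"}:
            candidates.append(explanation)
            # Assumption: the reference text is the chain's steps joined by spaces.
            references.append(" ".join(label["reasoning_chain"]))
    return mean(scorer(candidates, references)) if candidates else 0.0


def exact_match_scorer(candidates: Sequence[str], references: Sequence[str]) -> List[float]:
    """Offline stand-in for demonstration only: 1.0 on exact match, else 0.0."""
    return [1.0 if c == r else 0.0 for c, r in zip(candidates, references)]


items = [
    {"label": {"verdict": "conflict", "reasoning_chain": ["Step 1 ...", "CONFLICT ..."]}},
    {"label": {"verdict": "consistent"}},  # skipped: no reference chain
    {"label": {"verdict": "ambiguous", "reasoning_chain": ["Depends on missing context."]}},
]
explanations = ["Step 1 ... CONFLICT ...", "No contradiction.", "Unclear."]

demo = mean_explanation_f1(items, explanations, scorer=exact_match_scorer)
print(demo)  # 0.5: one exact match and one miss; the consistent item is ignored
```

Calling `mean_explanation_f1(items, explanations)` without the `scorer` argument would use BERTScore itself, which requires `bert-score` to be installed.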

System Prompt Strategy

The built-in system_prompt (in benchmark/adapters/base.py) instructs models to:

Derive implications: for each passage, enumerate every statement that follows with 100% logical certainty.
Check for conflicts: look for contradictions both between literal statements and between their derived implications.
Classify: return exactly CONFLICT, CONSISTENT, or AMBIGUOUS on the first line.

This forces models to work in the derived semantic space rather than pattern-match surface text, which is what L1–L5 items require.
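
Because the verdict is requested on the first line, an adapter can parse that line alone and treat the rest as the explanation. The helper below is a hypothetical sketch, not part of the benchmark package; the fallback to "consistent" for unparseable output mirrors the custom adapter example and is a design choice you may want to change.

```python
import re
from typing import Tuple

_VERDICT_RE = re.compile(r"\b(conflict|consistent|ambiguous)\b", re.IGNORECASE)


def parse_verdict(raw: str, fallback: str = "consistent") -> Tuple[str, str]:
    """Split a raw completion into (verdict, explanation).

    Only the first line is searched for the verdict, so an explanation that
    says "there is no conflict" cannot flip a CONSISTENT answer.
    """
    lines = raw.strip().splitlines()
    first_line = lines[0] if lines else ""
    match = _VERDICT_RE.search(first_line)
    verdict = match.group(1).lower() if match else fallback
    explanation = "\n".join(lines[1:]).strip()
    return verdict, explanation


print(parse_verdict("CONSISTENT\nThere is no conflict between A and B."))
# ('consistent', 'There is no conflict between A and B.')
```

Searching the whole response instead of the first line would misclassify the example above as a conflict.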

Dataset Format

See data/schema.md for the full schema and contribution guide.

Quick example — a level-3 conflict item:

{
  "id": "philosophy_L3_001",
  "domain": "philosophy",
  "conflict_level": 3,
  "difficulty": "hard",
  "input": {
    "passages": [
      { "id": "A", "text": "Rational self-interest is the objective moral standard...", "source": "Ayn Rand" },
      { "id": "B", "text": "Moral values are human constructions...", "source": "Nietzsche" }
    ],
    "question": "Do these passages present a conflict, are they consistent, or is the relationship ambiguous?"
  },
  "label": {
    "verdict": "conflict",
    "conflict_level": 3,
    "reasoning_chain": [
      "Step 1 — From A: 'objective moral standard' entails universal correctness...",
      "Step 2 — Universal correctness entails some values are correct independently of construction...",
      "Step 3 — This entails NOT all values are human constructions...",
      "CONFLICT — Step 3 contradicts B ('moral values are human constructions')."
    ]
  }
}
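
Before contributing items, it can help to sanity-check them against the structure shown above. The validator below is a hypothetical sketch: the authoritative rules live in data/schema.md, and the checks here (two passages, matching `conflict_level` fields, and one "Step" entry per inference level) are assumptions inferred from the example item alone.

```python
ALLOWED_VERDICTS = {"conflict", "consistent", "ambiguous"}


def validate_item(item: dict) -> list:
    """Return a list of problems; an empty list means the item looks well-formed."""
    problems = []
    label = item.get("label", {})
    verdict = label.get("verdict")
    if verdict not in ALLOWED_VERDICTS:
        problems.append(f"unknown verdict: {verdict!r}")
    if len(item.get("input", {}).get("passages", [])) != 2:
        problems.append("expected exactly two passages")
    if verdict == "conflict":
        level = label.get("conflict_level")
        if level not in range(6):
            problems.append(f"conflict_level out of range: {level!r}")
        if level != item.get("conflict_level"):
            problems.append("top-level and label conflict_level disagree")
        # Assumption: each inference step is one chain entry starting with "Step".
        steps = [s for s in label.get("reasoning_chain", []) if s.startswith("Step")]
        if len(steps) != level:
            problems.append(f"{len(steps)} inference steps for a level-{level} item")
    return problems


# Abbreviated version of the example item above.
item = {
    "id": "philosophy_L3_001",
    "domain": "philosophy",
    "conflict_level": 3,
    "input": {"passages": [{"id": "A", "text": "..."}, {"id": "B", "text": "..."}]},
    "label": {
        "verdict": "conflict",
        "conflict_level": 3,
        "reasoning_chain": ["Step 1 ...", "Step 2 ...", "Step 3 ...", "CONFLICT ..."],
    },
}

print(validate_item(item))  # []
```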

Output Files

After a run, results/ contains:

File	Contents
`<run_id>.json`	Full results: overall scores, per-level, per-domain, per-item predictions
`<run_id>.md`	Human-readable Markdown report with tables and failure examples
`leaderboard.csv`	One row per run, appended — suitable for version-controlled tracking

Running Tests

pip install -e ".[dev]"
python -m pytest tests/ -v

Leaderboard Columns

model, run_id, n_items, lw_f1, accuracy, macro_f1, binary_conflict_f1, expl_bert_f1, level_0_f1, level_1_f1, level_2_f1, level_3_f1, level_4_f1, level_5_f1, philosophy_f1, software_f1, law_f1, science_f1, content_f1, teams_f1
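
Since leaderboard.csv is appended to on every run, a small script can rank runs by any of these columns. The sketch below assumes the file starts with a header row using the column names listed above; `rank_runs` is a hypothetical helper, and the in-memory sample carries only a few columns, filled with the overall scores published in this README.

```python
import csv
import io
from typing import Dict, Iterable, List


def rank_runs(csv_file: Iterable[str], key: str = "lw_f1") -> List[Dict[str, str]]:
    """Read leaderboard rows and sort them by a numeric column, best first."""
    rows = list(csv.DictReader(csv_file))
    return sorted(rows, key=lambda row: float(row[key]), reverse=True)


# In-memory stand-in for results/leaderboard.csv (subset of columns).
sample = io.StringIO(
    "model,lw_f1,accuracy,macro_f1\n"
    "claude-haiku-4-5,0.6783,0.8056,0.6893\n"
    "claude-sonnet-4-6,0.7319,0.8111,0.7211\n"
)

ranked = rank_runs(sample)
for row in ranked:
    print(f"{row['model']:<20} LW-F1={row['lw_f1']}")
```

To rank a real file, open it with `open("results/leaderboard.csv", newline="")` and pass the file object to `rank_runs`.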

Contributing

Read data/schema.md for item format and quality rules.
Add items to the appropriate data/<domain>/v1.json file (one JSON object per line).
Run python -m pytest tests/ -v to verify everything loads and scores correctly.
Open a pull request with a brief description of the items added.

Items at L4–L5 that follow the 100%-certainty step rule are especially welcome.
