vivekkrishna/semantic-conflicts-benchmark

Semantic Conflicts Benchmark

A benchmark for measuring the ability of large language models to detect semantic conflicts across domains — including implicit conflicts that require multi-hop logical reasoning to uncover.

Modelled after TruthfulQA and FEVER. Plug in any model in one Python class.


Leaderboard

180 items · 6 domains · difficulty levels L0–L5 · evaluated April 2026

Overall scores

| Model | LW-F1 ↑ | Accuracy | Macro-F1 | Bin-Conflict-F1 | Expl BERTScore-F1 |
|---|---|---|---|---|---|
| Optimal | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| claude-sonnet-4-6 | 0.7319 | 0.8111 | 0.7211 | 0.9380 | 0.8223 |
| claude-haiku-4-5 | 0.6783 | 0.8056 | 0.6893 | 0.9531 | 0.8136 |

Metric definitions

| Metric | Formula | Why chosen | What it diagnoses |
|---|---|---|---|
| LW-F1 (primary) | Σ((L+1) × conflict_F1_L) / Σ(L+1), where conflict_F1_L is the conflict-class F1 computed on the 18 level-L conflict items + all 72 consistent/ambiguous items; weights L0=1× … L5=6× | Deeper conflicts require genuine multi-hop reasoning and should count more than surface pattern-matching. Including non-conflict items forces precision to be measured alongside recall, so a model can't score high by predicting "conflict" for everything. | Whether the model actually reasons at depth. Flat scores across levels = single-class predictor. Perfect model = 1.000; always-predict-conflict baseline = 0.333. |
| Accuracy | correct / N across all 180 items | Interpretable to non-specialists; shows overall verdict quality at a glance. | Gross correctness. Can be gamed by a model that memorises the 60 % conflict base rate, so it must be read alongside Macro-F1. |
| Macro-F1 | (F1_conflict + F1_consistent + F1_ambiguous) / 3, where each F1_c = 2PR/(P+R), P = TP/(TP+FP), R = TP/(TP+FN) | The unweighted average penalises models that collapse predictions to one class. A model that always says "conflict" scores 0.25 (F1_conflict = 0.75, the other two classes 0) despite 60 % accuracy, exposing the single-class failure mode. | Class balance. Low Macro-F1 with high Accuracy = the model ignores consistent or ambiguous labels entirely. |
| Bin-Conflict-F1 | Binary F1 with positive = {conflict, ambiguous}, negative = {consistent}; F1 = 2PR/(P+R) | Measures the practically important ability to flag items that need human review (anything not clearly consistent) without requiring the model to distinguish conflict from ambiguous. | Over- vs. under-flagging. High Bin-F1 + low Macro-F1 = the model detects risk but can't distinguish ambiguous from outright conflict. |
| Expl BERTScore-F1 | BERTScore F1 (distilbert-base-uncased) of the model explanation vs. the ground-truth reasoning_chain, averaged over conflict and ambiguous items only | Checks whether the model traces the correct logical path rather than just getting the label right. BERTScore uses contextual embeddings to capture semantic similarity and paraphrasing that token-overlap metrics miss. | Explanation faithfulness. A model can predict "conflict" for the wrong reason; low BERTScore exposes this. Not computed for consistent items (no reference chain). |

By difficulty level

Each level's N = 18 conflict items at that depth + 72 consistent/ambiguous items (90 total). This measures whether the model can identify level-L conflicts without over-predicting on non-conflict items.

Statistical note: Per-level scores at L4–L5 are based on 18 conflict items each (95% CI ≈ ±0.33 on conflict-class F1) and should be interpreted with wide uncertainty. Cross-level comparisons at these depths are indicative only until the dataset is expanded (see Roadmap).
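To make the per-level evaluation set concrete, here is a minimal sketch of how a level's conflict-class F1 could be computed from (gold verdict, conflict level, predicted verdict) triples. The `Item` tuple layout and the function name `conflict_f1_at_level` are illustrative assumptions, not the benchmark's internal API. The synthetic data reproduces the always-predict-conflict baseline.

```python
from typing import List, Optional, Tuple

# Hypothetical item layout: (gold_verdict, conflict_level or None, predicted_verdict)
Item = Tuple[str, Optional[int], str]


def conflict_f1_at_level(items: List[Item], level: int) -> float:
    """Conflict-class F1 on the level-`level` conflict items plus all non-conflict items."""
    # Keep every consistent/ambiguous item, but only conflicts at the requested level.
    subset = [it for it in items if it[0] != "conflict" or it[1] == level]
    tp = sum(1 for gold, _, pred in subset if gold == "conflict" and pred == "conflict")
    fp = sum(1 for gold, _, pred in subset if gold != "conflict" and pred == "conflict")
    fn = sum(1 for gold, _, pred in subset if gold == "conflict" and pred != "conflict")
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic run: 18 conflicts per level, 48 consistent, 24 ambiguous,
# with a degenerate model that answers "conflict" for everything.
items: List[Item] = [("conflict", lvl, "conflict") for lvl in range(6) for _ in range(18)]
items += [("consistent", None, "conflict")] * 48
items += [("ambiguous", None, "conflict")] * 24

print(round(conflict_f1_at_level(items, 4), 3))  # 0.333 (precision 18/90, recall 1.0)
```

Because the 72 non-conflict items appear in every level's subset, a false "conflict" on any of them lowers precision at all six levels at once.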

claude-sonnet-4-6

| Level | Name | Accuracy | Macro-F1 | Conflict-F1 | BERTScore-F1 | N |
|---|---|---|---|---|---|---|
| L0 | Surface | 0.7889 | 0.7474 | 0.8095 | 0.8164 | 90 |
| L1 | Immediate | 0.7889 | 0.7457 | 0.8095 | 0.8225 | 90 |
| L2 | Two-hop | 0.7778 | 0.7315 | 0.7805 | 0.8232 | 90 |
| L3 | Three-hop | 0.7889 | 0.7474 | 0.8095 | 0.8219 | 90 |
| L4 | Four-hop | 0.7222 | 0.6562 | 0.6111 | 0.8215 | 90 |
| L5 | Deep | 0.7556 | 0.7035 | 0.7179 | 0.8258 | 90 |

Sonnet's Conflict-F1 drops sharply at L4 (0.611) and partially recovers at L5 (0.718). This suggests it handles some five-hop chains better than four-hop ones, likely because L5 items contain more explicit intermediate statements.

claude-haiku-4-5

| Level | Name | Accuracy | Macro-F1 | Conflict-F1 | BERTScore-F1 | N |
|---|---|---|---|---|---|---|
| L0 | Surface | 0.7778 | 0.6929 | 0.7391 | 0.8004 | 90 |
| L1 | Immediate | 0.7556 | 0.6642 | 0.6818 | 0.8087 | 90 |
| L2 | Two-hop | 0.7667 | 0.6779 | 0.7111 | 0.8116 | 90 |
| L3 | Three-hop | 0.7667 | 0.6779 | 0.7111 | 0.8093 | 90 |
| L4 | Four-hop | 0.7444 | 0.6510 | 0.6512 | 0.8116 | 90 |
| L5 | Deep | 0.7444 | 0.6510 | 0.6512 | 0.8107 | 90 |

Haiku's Conflict-F1 falls from L0 (0.739) to L4–L5 (0.651), but not monotonically: L1 (0.682) sits below L2–L3 (0.711). The overall trend is consistent with degradation as reasoning depth increases.

By domain

claude-sonnet-4-6

| Domain | Accuracy | Macro-F1 | Conflict-F1 | BERTScore-F1 | N |
|---|---|---|---|---|---|
| philosophy | 0.8000 | 0.7606 | 0.8485 | 0.8235 | 30 |
| software | 0.8667 | 0.8159 | 0.9143 | 0.8229 | 30 |
| law | 0.6667 | 0.5858 | 0.7333 | 0.8286 | 30 |
| science | 0.8000 | 0.6814 | 0.8824 | 0.8175 | 30 |
| content | 0.8333 | 0.6707 | 0.9730 | 0.8166 | 30 |
| teams | 0.9000 | 0.8174 | 0.9474 | 0.8239 | 30 |

claude-haiku-4-5

| Domain | Accuracy | Macro-F1 | Conflict-F1 | BERTScore-F1 | N |
|---|---|---|---|---|---|
| philosophy | 0.8000 | 0.6827 | 0.8649 | 0.8113 | 30 |
| software | 0.8000 | 0.5926 | 0.8889 | 0.8202 | 30 |
| law | 0.6000 | 0.5480 | 0.6667 | 0.8135 | 30 |
| science | 0.8000 | 0.6814 | 0.8824 | 0.8188 | 30 |
| content | 0.9333 | 0.8875 | 0.9714 | 0.8052 | 30 |
| teams | 0.9000 | 0.7744 | 0.9231 | 0.8137 | 30 |

Quickstart

git clone https://github.com/vivekkrishna/semantic-conflicts-benchmark
cd semantic-conflicts-benchmark
pip install -r requirements-lock.txt   # exact versions used for published results
pip install -e ".[anthropic]"

ANTHROPIC_API_KEY=sk-... scbench \
  --adapter anthropic \
  --model claude-haiku-4-5-20251001 \
  --verbose

Results are written to results/ as JSON, Markdown, and a cumulative leaderboard.csv.


Custom Model (30 seconds)

Implement one class and run:

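# my_model.py
from benchmark.adapters.base import AdapterResponse, ModelAdapter


class MyAdapter(ModelAdapter):
    @property
    def name(self) -> str:
        return "my-model-v1"

    def predict(self, prompt: str) -> AdapterResponse:
        # my_model_inference is a placeholder for your own inference call.
        raw = my_model_inference(self.system_prompt, prompt)
        # The system prompt asks for the verdict on the first line, so match only
        # that line; a phrase like "no conflict" in the explanation would
        # otherwise be misread as a CONFLICT verdict.
        lines = raw.strip().splitlines()
        first_line = lines[0].upper() if lines else ""
        if "CONFLICT" in first_line:
            verdict = "conflict"
        elif "AMBIGUOUS" in first_line:
            verdict = "ambiguous"
        else:
            verdict = "consistent"
        return AdapterResponse(verdict=verdict, explanation=raw, raw_response=raw)

Then run:

scbench --adapter-class my_model.MyAdapter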

See examples/custom_adapter.py for a full Ollama example.


Built-in Adapters

| Flag | Model argument | Install extra |
|---|---|---|
| --adapter openai | --model gpt-4o | pip install -e ".[openai]" |
| --adapter anthropic | --model claude-sonnet-4-6 | pip install -e ".[anthropic]" |
| --adapter hf | --model mistralai/Mistral-7B-Instruct-v0.2 | pip install -e ".[hf]" |

CLI Reference

scbench [options]

  --adapter {openai,anthropic,hf}   Built-in adapter to use
  --adapter-class MODULE.CLASS      Custom adapter class (overrides --adapter)
  --model MODEL                     Model name/ID passed to the adapter
  --domains D1 D2 ...               Subset of domains to evaluate (default: all six)
  --data-dir PATH                   Path to data directory (default: data/)
  --output PATH                     JSON output path (default: results/<run_id>.json)
  --leaderboard PATH                CSV leaderboard path (default: results/leaderboard.csv)
  --verbose                         Print per-item predictions during evaluation

Difficulty Levels

Every conflict item is tagged with a conflict_level (0–5) representing the minimum number of logically certain inference steps required to detect the conflict. Each step must follow from the previous with 100% certainty — no probabilistic leaps allowed.

| Level | Name | Description | Expected model behaviour |
|---|---|---|---|
| 0 | Surface | Contradiction is explicitly stated | All LLMs detect |
| 1 | Immediate | One certain inference collides with the other passage | Strong base models |
| 2 | Two-hop | Two chained inferences expose the conflict | GPT-4 class usually |
| 3 | Three-hop | Three chained inferences | Frontier models sometimes miss |
| 4 | Four-hop | Four chained inferences | Most models fail |
| 5 | Deep | Five chained inferences | Near-universal failure zone |

Level-Weighted F1 (LW-F1) — primary score

LW-F1 = Σ ((level + 1) × conflict_F1_at_level) / Σ (level + 1)

where conflict_F1_at_level is the conflict-class F1 computed on the level-L conflict items plus all consistent/ambiguous items (90 items per level). Including non-conflict items at every level means precision is penalised — a model that blindly labels everything "conflict" gets low precision and therefore low F1.

Weights: L0=1, L1=2, L2=3, L3=4, L4=5, L5=6. A model that only detects surface conflicts scores ~0.048; a perfect model scores 1.0; a naive always-predict-conflict baseline scores 0.333. This prevents high-accuracy models from hiding shallow reasoning.
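As a quick check of the weighting, the sketch below recomputes LW-F1 from the per-level Conflict-F1 values published in the leaderboard tables. The function name `lw_f1` is an illustrative choice and not necessarily what the benchmark's own code uses.

```python
from typing import Sequence


def lw_f1(per_level_f1: Sequence[float]) -> float:
    """Level-weighted F1: level L (0-based index) gets weight L + 1."""
    weights = [level + 1 for level in range(len(per_level_f1))]
    return sum(w * f for w, f in zip(weights, per_level_f1)) / sum(weights)


# Per-level Conflict-F1 values copied from the "By difficulty level" tables.
sonnet = [0.8095, 0.8095, 0.7805, 0.8095, 0.6111, 0.7179]
haiku = [0.7391, 0.6818, 0.7111, 0.7111, 0.6512, 0.6512]

print(f"{lw_f1(sonnet):.4f}")             # 0.7319
print(f"{lw_f1(haiku):.4f}")              # 0.6783
print(f"{lw_f1([1, 0, 0, 0, 0, 0]):.3f}")  # 0.048 (surface-only detector)
print(f"{lw_f1([1 / 3] * 6):.3f}")         # 0.333 (always-predict-conflict)
```

The weights sum to 21, so L4 and L5 together carry 11/21 of the score, which is why a surface-only detector lands at 1/21.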


Domains

| Domain | Conflict type | Example |
|---|---|---|
| philosophy | Conceptual/ideological conflicts between philosophical positions | Kant's categorical imperative vs Utilitarian calculus |
| software | State and contract conflicts in distributed systems | UserService says ACTIVE; BillingService says SUSPENDED |
| law | Conflicts between statutes, contracts, and evidence | NLRA's collective bargaining right vs employment contract waiver |
| science | Temporal conflicts between hypothesis and incoming evidence | Pre-1984 ulcer theory vs H. pylori discovery |
| content | Self-contradictions in an influencer's published statements | "I never take supplements" vs daily whey shake promotion |
| teams | Conflicts between team norms and a new joiner's stated principles | Async-only team vs AI agent that requires daily standups |

Each domain contains 30 items: 18 conflict (3 × L0–L5), 8 consistent, 4 ambiguous.
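The per-domain composition determines the dataset-wide totals quoted elsewhere in this README. A small arithmetic sketch (the variable names are illustrative only):

```python
# Per-domain composition as stated above: 18 conflict (3 per level), 8 consistent, 4 ambiguous.
PER_DOMAIN = {"conflict": 18, "consistent": 8, "ambiguous": 4}
DOMAINS = ["philosophy", "software", "law", "science", "content", "teams"]
LEVELS = range(6)  # L0..L5

totals = {label: n * len(DOMAINS) for label, n in PER_DOMAIN.items()}
n_items = sum(totals.values())
conflicts_per_level = (PER_DOMAIN["conflict"] // len(LEVELS)) * len(DOMAINS)
non_conflict = totals["consistent"] + totals["ambiguous"]

print(totals)                          # {'conflict': 108, 'consistent': 48, 'ambiguous': 24}
print(n_items)                         # 180
print(totals["conflict"] / n_items)    # 0.6 (conflict base rate)
print(conflicts_per_level + non_conflict)  # 90 items in each per-level evaluation set
```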


How This Benchmark Compares

Every major benchmark measures a different underlying skill. The table below maps 10 standard benchmarks against the Semantic Conflicts Benchmark (SCB) across the dimensions that actually differentiate them — not just topic coverage.

Benchmark Skill Taxonomy

| Benchmark | Core Skill | Input Format | Structured Difficulty | Explanation Scored | Detects Implicit Conflicts | Ambiguity Class | Domain Coverage |
|---|---|---|---|---|---|---|---|
| MMLU | Factual recall | MCQ (4-choice) | | | | | 57 academic subjects |
| GSM8K | Arithmetic reasoning | Word problem → number | | | | | Math only |
| HumanEval | Code synthesis | Docstring → Python fn | | | | | Code only |
| BIG-bench | Diverse reasoning (200+ subtasks) | Varies per task | Per-task only | | | | Mixed |
| HellaSwag | Commonsense completion | Partial sentence → MCQ | | | | | Everyday text |
| TruthfulQA | Factual honesty / hallucination | Open-ended question | | | | | Myths & misconceptions |
| ARC | Elementary science reasoning | MCQ (4-choice) | Easy / Challenge split | | | | Science only |
| Winogrande | Pronoun coreference | Fill-in-the-blank (2-choice) | | | | | Language |
| DROP | Reading comprehension + arithmetic | Passage + question → span | | | | | News / history |
| MT-Bench / Chatbot Arena | Conversational quality | Multi-turn dialogue | | | | | Open-ended |
| SCB (this work) ✦ | Semantic conflict detection | Passage pair → 3-class verdict | L0–L5 (6 levels) | BERTScore vs. reasoning chain | Yes (multi-hop) | Yes | 6 real-world domains |

Unique Value Propositions

1. Difficulty defined by reasoning depth, not subject matter. L0–L5 maps to the exact number of 100%-certain inference steps required to surface a conflict. This pinpoints where a model's reasoning chain breaks — not just whether it gets the final answer right. No other benchmark in this list offers structured depth at this granularity.

2. Tests the skill that determines real-world reliability. MMLU asks "do you know X?" SCB asks "do you notice that source A and source B contradict each other?" — the skill a model needs when reviewing contracts, audit trails, distributed-system logs, or scientific literature. Pattern-matching is insufficient; derived semantics must be checked.

3. Explanation faithfulness is a first-class metric. A model can predict "conflict" for the wrong reason. Scoring model explanations with BERTScore against ground-truth reasoning_chain fields catches this. No other benchmark in this list measures whether the model traced the correct logical path.

4. Ambiguity is a scored class, not noise. Most benchmarks force binary correctness. SCB's 3-class taxonomy (conflict / consistent / ambiguous) rewards models that correctly surface genuine uncertainty, and penalises both over-confident labelling and reflexive hedging. Macro-F1 over all three classes is reported alongside the primary LW-F1.

5. Domains drawn from real professional contexts. Philosophy, software, law, science, published content, and team dynamics were chosen because semantic conflicts arise in each with real consequences, not because they produce convenient synthetic puzzles.

6. LW-F1 prevents shallow-reasoning models from hiding. A model that only catches surface (L0) contradictions scores ≈ 0.048 on LW-F1 even with 100% L0 accuracy. The level-weighted primary metric exposes the "pattern-matching vs. reasoning" gap that flat accuracy obscures.

When to reach for this benchmark

| If you want to measure… | Reach for |
|---|---|
| General knowledge breadth across subjects | MMLU |
| Math, code, or language fundamentals | GSM8K · HumanEval · Winogrande |
| Truthfulness and hallucination resistance | TruthfulQA |
| Multi-hop semantic conflict detection and reasoning faithfulness | SCB |

Tasks and Metrics

Task 1: Verdict Classification (3-class)

The model reads two passages and classifies the relationship as:

  • conflict — the passages are logically incompatible (directly or through inference)
  • consistent — the passages are compatible; no logical contradiction exists
  • ambiguous — whether a conflict exists depends on information not provided

Metrics: accuracy, macro-F1, binary conflict F1 (conflict+ambiguous vs. consistent), per-class F1, LW-F1 (primary).
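The sketch below shows one way the three-class Macro-F1 and the binary conflict F1 could be computed from parallel lists of gold and predicted verdicts. The helper names (`f1_for`, `macro_f1`, `bin_conflict_f1`) are assumptions for illustration, and the benchmark's own implementation may differ. The example scores a degenerate always-"conflict" predictor on the dataset's 108/48/24 label split.

```python
from typing import Sequence, Set

CLASSES = ("conflict", "consistent", "ambiguous")


def f1_for(gold: Sequence[str], pred: Sequence[str], positives: Set[str]) -> float:
    """F1 treating every label in `positives` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g in positives and p in positives)
    fp = sum(1 for g, p in zip(gold, pred) if g not in positives and p in positives)
    fn = sum(1 for g, p in zip(gold, pred) if g in positives and p not in positives)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of the three per-class F1 scores."""
    return sum(f1_for(gold, pred, {c}) for c in CLASSES) / len(CLASSES)


def bin_conflict_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Binary F1 with positive = {conflict, ambiguous}, negative = {consistent}."""
    return f1_for(gold, pred, {"conflict", "ambiguous"})


gold = ["conflict"] * 108 + ["consistent"] * 48 + ["ambiguous"] * 24
always_conflict = ["conflict"] * len(gold)

accuracy = sum(g == p for g, p in zip(gold, always_conflict)) / len(gold)
print(f"{accuracy:.3f}")                                # 0.600
print(f"{macro_f1(gold, always_conflict):.3f}")         # 0.250
print(f"{bin_conflict_f1(gold, always_conflict):.3f}")  # 0.846
```

The gap between 0.600 accuracy and 0.250 Macro-F1 is the single-class failure signature the metric is meant to expose; the binary score stays high because flagging everything never misses an item that needs review.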

Task 2: Explanation Quality

For conflict and ambiguous predictions, the model's explanation is scored against the ground-truth reasoning_chain using BERTScore F1 (distilbert-base-uncased). This measures whether the model traces the correct logical path using semantic similarity rather than token overlap, capturing paraphrasing and synonymy that surface metrics miss.
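A rough sketch of how explanation scoring could be wired up with the `bert-score` package follows. Several details are assumptions rather than the benchmark's documented behaviour: items are selected by gold verdict, the `reasoning_chain` steps are joined with spaces to form one reference string, and the helper names are hypothetical. The scoring call is imported lazily because it downloads model weights on first use.

```python
from typing import Dict, List, Tuple


def build_explanation_pairs(
    items: List[dict], explanations: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Pair each model explanation with its reference chain (conflict/ambiguous items only)."""
    candidates: List[str] = []
    references: List[str] = []
    for item in items:
        label = item["label"]
        if label["verdict"] not in ("conflict", "ambiguous"):
            continue  # consistent items have no reference chain
        candidates.append(explanations[item["id"]])
        # Assumption: the chain's steps are concatenated into a single reference text.
        references.append(" ".join(label["reasoning_chain"]))
    return candidates, references


def mean_bertscore_f1(candidates: List[str], references: List[str]) -> float:
    """Average BERTScore F1 using distilbert-base-uncased (requires `pip install bert-score`)."""
    from bert_score import score  # lazy import: heavy dependency, downloads weights

    _, _, f1 = score(candidates, references, model_type="distilbert-base-uncased")
    return f1.mean().item()


demo_items = [
    {"id": "a", "label": {"verdict": "conflict", "reasoning_chain": ["Step 1.", "Step 2."]}},
    {"id": "b", "label": {"verdict": "consistent"}},
]
demo_explanations = {"a": "A implies not-B.", "b": "No contradiction."}
cands, refs = build_explanation_pairs(demo_items, demo_explanations)
print(cands, refs)  # ['A implies not-B.'] ['Step 1. Step 2.']
```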


System Prompt Strategy

The built-in system_prompt (in benchmark/adapters/base.py) instructs models to:

  1. Derive implications: for each passage, enumerate every statement that follows with 100% logical certainty.
  2. Check for conflicts: look for contradictions both between literal statements and between their derived implications.
  3. Classify: return exactly CONFLICT, CONSISTENT, or AMBIGUOUS on the first line.

This forces models to work in the derived semantic space, not just pattern-match surface text — which is precisely what L2–L5 items require.
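Since the prompt asks for the verdict on the first line, an adapter can parse responses by inspecting only that line. The following is a minimal sketch under that assumption; `parse_response` is a hypothetical helper, and returning `None` for an unrecognised first line is an illustrative choice, not the benchmark's documented fallback.

```python
from typing import Optional, Tuple

VERDICTS = {"CONFLICT": "conflict", "CONSISTENT": "consistent", "AMBIGUOUS": "ambiguous"}


def parse_response(raw: str) -> Tuple[Optional[str], str]:
    """Split a model response into (verdict, explanation) using the first line only."""
    lines = raw.strip().splitlines()
    if not lines:
        return None, ""
    # Tolerate light decoration such as "**CONFLICT**" or "Conflict."
    first = lines[0].strip().strip("*.:` ").upper()
    verdict = VERDICTS.get(first)  # None if the first line is not a bare verdict
    explanation = "\n".join(lines[1:]).strip()
    return verdict, explanation


print(parse_response("CONSISTENT\nNo conflict exists between A and B."))
# ('consistent', 'No conflict exists between A and B.')
print(parse_response("**Conflict**\nStep 1: A entails X."))
# ('conflict', 'Step 1: A entails X.')
print(parse_response("I think these might disagree."))
# (None, '')
```

Matching the first line exactly, rather than searching the whole response, avoids misreading an explanation such as "no conflict exists" as a CONFLICT verdict.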


Dataset Format

See data/schema.md for the full schema and contribution guide.

Quick example — a level-3 conflict item:

{
  "id": "philosophy_L3_001",
  "domain": "philosophy",
  "conflict_level": 3,
  "difficulty": "hard",
  "input": {
    "passages": [
      { "id": "A", "text": "Rational self-interest is the objective moral standard...", "source": "Ayn Rand" },
      { "id": "B", "text": "Moral values are human constructions...", "source": "Nietzsche" }
    ],
    "question": "Do these passages present a conflict, are they consistent, or is the relationship ambiguous?"
  },
  "label": {
    "verdict": "conflict",
    "conflict_level": 3,
    "reasoning_chain": [
      "Step 1 — From A: 'objective moral standard' entails universal correctness...",
      "Step 2 — Universal correctness entails some values are correct independently of construction...",
      "Step 3 — This entails NOT all values are human constructions...",
      "CONFLICT — Step 3 contradicts B ('moral values are human constructions')."
    ]
  }
}
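Before contributing items, a lightweight structural check along these lines can catch obvious mistakes. The rules encoded here are inferred from the example above and from the task description (two passages, three verdicts, levels 0 to 5); the authoritative rules live in data/schema.md, and `validate_item` is a hypothetical helper rather than part of the package.

```python
from typing import List

ALLOWED_VERDICTS = {"conflict", "consistent", "ambiguous"}


def validate_item(item: dict) -> List[str]:
    """Return a list of problems found in one dataset item (empty list = looks fine)."""
    errors = [f"missing key: {key}" for key in ("id", "domain", "input", "label") if key not in item]
    if errors:
        return errors
    if len(item["input"].get("passages", [])) != 2:
        errors.append("expected exactly two passages")
    verdict = item["label"].get("verdict")
    if verdict not in ALLOWED_VERDICTS:
        errors.append(f"unknown verdict: {verdict!r}")
    if verdict == "conflict":
        level = item["label"].get("conflict_level")
        if not isinstance(level, int) or not 0 <= level <= 5:
            errors.append("conflict items need conflict_level in 0..5")
        elif level != item.get("conflict_level"):
            errors.append("top-level and label conflict_level disagree")
        if not item["label"].get("reasoning_chain"):
            errors.append("conflict items need a non-empty reasoning_chain")
    return errors


example = {
    "id": "philosophy_L3_001",
    "domain": "philosophy",
    "conflict_level": 3,
    "input": {"passages": [{"id": "A", "text": "..."}, {"id": "B", "text": "..."}]},
    "label": {"verdict": "conflict", "conflict_level": 3, "reasoning_chain": ["Step 1", "Step 2"]},
}
print(validate_item(example))  # []
```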

Output Files

After a run, results/ contains:

| File | Contents |
|---|---|
| `<run_id>.json` | Full results: overall scores, per-level, per-domain, per-item predictions |
| `<run_id>.md` | Human-readable Markdown report with tables and failure examples |
| leaderboard.csv | One row per run, appended; suitable for version-controlled tracking |

Running Tests

pip install -e ".[dev]"
python -m pytest tests/ -v

Leaderboard Columns

model, run_id, n_items, lw_f1, accuracy, macro_f1, binary_conflict_f1, expl_bert_f1, level_0_f1, level_1_f1, level_2_f1, level_3_f1, level_4_f1, level_5_f1, philosophy_f1, software_f1, law_f1, science_f1, content_f1, teams_f1
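Because the leaderboard is a plain CSV with the columns listed above, it can be ranked with the standard library alone. The sketch below parses an in-memory sample so that it runs without a results/ directory; the `run_id` values are made up, and only a subset of the columns is shown. For a real run, the text would come from results/leaderboard.csv.

```python
import csv
import io
from typing import List, Tuple


def rank_runs(csv_text: str, metric: str = "lw_f1") -> List[Tuple[str, float]]:
    """Return (model, metric) pairs sorted from best to worst."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row[metric]), reverse=True)
    return [(row["model"], float(row[metric])) for row in rows]


# Sample rows built from the published overall scores; run_id values are hypothetical.
sample = (
    "model,run_id,n_items,lw_f1,accuracy,macro_f1\n"
    "claude-haiku-4-5,run_b,180,0.6783,0.8056,0.6893\n"
    "claude-sonnet-4-6,run_a,180,0.7319,0.8111,0.7211\n"
)

print(rank_runs(sample))
# [('claude-sonnet-4-6', 0.7319), ('claude-haiku-4-5', 0.6783)]
```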


Contributing

  1. Read data/schema.md for item format and quality rules.
  2. Add items to the appropriate data/<domain>/v1.json file (one JSON object per line).
  3. Run python -m pytest tests/ -v to verify everything loads and scores correctly.
  4. Open a pull request with a brief description of the items added.

Items at L4–L5 that follow the 100%-certainty step rule are especially welcome.

About

Benchmarking the ability of large language models to detect semantic conflicts across domains, documents, and evolving knowledge bases.
