SIGTURK 2026 is a shared task focused on Turkish terminology detection and correction in scientific text. Given English–Turkish parallel sentences, participating systems must identify English technical terms, provide their correct Turkish equivalents, and produce fluent edited target sentences. This repository provides the official command-line evaluator and development data.
- Leaderboard
- Subtasks
- Directory Layout
- Quick Start
- JSON Schemas
- Design Notes
- Examples
- Troubleshooting
- Contact
This evaluator supports three subtasks on paired JSON files containing golden set annotations and system predictions:
- Term Detection (English word-level) — Precision / Recall / F1 (Macro + Micro)
- Term Correction (Exact match) — Accuracy (Macro + Micro)
- End-to-End (sentence quality) — Sentence-level BLEU and chrF (Macro only) computed with `sacrebleu`
The evaluator strictly aligns records using (paragraph_id, sentence_id) and fails fast on any mismatch, duplicates, or missing IDs.
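
The strict alignment described above can be sketched as follows. This is a minimal illustration, not the evaluator's actual code: the function name `pair_rows`, the use of `ValueError`, and the wording of the messages are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, int]  # (paragraph_id, sentence_id)


def pair_rows(gold: List[dict], pred: List[dict]) -> List[Tuple[dict, dict]]:
    """Pair gold and predicted rows 1:1 by ID, raising on any mismatch."""

    def index(rows: List[dict], name: str) -> Dict[Key, dict]:
        table: Dict[Key, dict] = {}
        for row in rows:
            if "paragraph_id" not in row or "sentence_id" not in row:
                raise ValueError(f"missing ID field in {name}: {row!r}")
            key = (row["paragraph_id"], row["sentence_id"])
            if key in table:
                raise ValueError(f"duplicate {key} in {name}")
            table[key] = row
        return table

    gold_by_key = index(gold, "golden_set")
    pred_by_key = index(pred, "predictions")

    # Both files must contain exactly the same set of IDs.
    missing = sorted(set(gold_by_key) - set(pred_by_key))
    extra = sorted(set(pred_by_key) - set(gold_by_key))
    if missing:
        raise ValueError(f"missing {missing} in predictions")
    if extra:
        raise ValueError(f"extra {extra} in predictions")

    # Sorted order (an assumption) keeps the output deterministic.
    return [(gold_by_key[k], pred_by_key[k]) for k in sorted(gold_by_key)]
```

Running a check like this on your own prediction file before submitting is a cheap way to catch ID mismatches early.
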
```
sigturk2026_sharedtask/
├─ dev_data/
│  ├─ subtask_1.json
│  ├─ subtask_2.json
│  ├─ subtask_3.json
│  └─ terimler_org_data.json
├─ evaluation/
│  ├─ requirements.txt
│  ├─ eval.py
│  ├─ subtask1_term_detection/
│  │  ├─ golden_set.json
│  │  └─ predictions.json
│  ├─ subtask2_term_correction/
│  │  ├─ golden_set.json
│  │  └─ predictions.json
│  └─ subtask3_end2end/
│     ├─ golden_set.json
│     └─ predictions.json
└─ README.md
```
```bash
pip install -r evaluation/requirements.txt
```

`requirements.txt` must contain:

```
sacrebleu==2.4.3
```

Pin the version to ensure reproducible scores across systems.
Use the same interface for all subtasks by switching --task and the JSON file paths.
```bash
# Term Detection (EN word-level)
python evaluation/eval.py \
  --task detection \
  --golden_set evaluation/subtask1_term_detection/golden_set.json \
  --predictions evaluation/subtask1_term_detection/predictions.json

# Term Correction (Exact Match)
python evaluation/eval.py \
  --task correction \
  --golden_set evaluation/subtask2_term_correction/golden_set.json \
  --predictions evaluation/subtask2_term_correction/predictions.json

# End-to-End (Sentence BLEU/chrF; sentence-level mean only)
python evaluation/eval.py \
  --task end2end \
  --golden_set evaluation/subtask3_end2end/golden_set.json \
  --predictions evaluation/subtask3_end2end/predictions.json
```

Each JSON file can be a single object or an array of objects (both are supported). All entries must include integer `paragraph_id` and `sentence_id`. The evaluator pairs rows by these IDs and exits with an error if there are extra or missing pairs.

| Key | Type | Required | Description |
|---|---|---|---|
| `paragraph_id` | int | ✅ | Paragraph identifier |
| `sentence_id` | int | ✅ | Sentence identifier within paragraph |

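
The input rules above (object or array, integer IDs required) can be checked with a few lines of Python. This is only a sketch: `to_rows` and `load_rows` are hypothetical helper names, and the evaluator's own loading code may behave differently in edge cases.

```python
import json
from typing import Any, List

ID_FIELDS = ("paragraph_id", "sentence_id")


def to_rows(payload: Any) -> List[dict]:
    """Accept a single object or an array of objects and validate the IDs."""
    if isinstance(payload, dict):
        rows = [payload]
    elif isinstance(payload, list) and all(isinstance(x, dict) for x in payload):
        rows = payload
    else:
        raise ValueError("expected a JSON object or an array of objects")

    for row in rows:
        for field in ID_FIELDS:
            value = row.get(field)
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                raise ValueError(f"{field} must be an integer in {row!r}")
    return rows


def load_rows(path: str) -> List[dict]:
    """Read a golden_set.json or predictions.json file as a list of rows."""
    with open(path, encoding="utf-8") as fh:
        return to_rows(json.load(fh))
```

For example, `load_rows("evaluation/subtask1_term_detection/predictions.json")` would return a list even when the file holds a single object.
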
Gold and predictions both use English spans over the source sentence.
golden_set.json

```json
{
  "paragraph_id": 3,
  "sentence_id": 2,
  "source_sentence": "We discuss p-branes, plane waves, ...",
  "term_pairs": [
    {"en_start": 3, "en_end": 10},
    {"en_start": 12, "en_end": 20}
  ]
}
```

predictions.json

```json
{
  "paragraph_id": 3,
  "sentence_id": 2,
  "term_pairs": [
    {"en_start": 3, "en_end": 10}
  ]
}
```

- Spans are half-open character ranges `[start, end)` over `source_sentence`.
- The evaluator normalizes spans (clamps to sentence length, fixes reversed indices, drops empty/invalid, deduplicates).
- Tokens are detected with a Unicode `\w+` regex; token labels (0/1) are derived by strict interval overlap with spans.
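
To make the span handling concrete, here is a small sketch of the normalization and token-labeling steps listed above. It assumes that "strict interval overlap" means the token and the span share at least one character; the helper names `normalize_spans` and `token_labels` are made up for this example and spans are passed as `(start, end)` tuples rather than `term_pairs` objects.

```python
import re
from typing import Iterable, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) character range

# str patterns are Unicode-aware by default in Python 3.
WORD_RE = re.compile(r"\w+")


def normalize_spans(spans: Iterable[Span], length: int) -> List[Span]:
    """Fix reversed indices, clamp to the sentence, drop empties, deduplicate."""
    cleaned = set()
    for start, end in spans:
        if start > end:
            start, end = end, start
        start = max(0, min(start, length))
        end = max(0, min(end, length))
        if start < end:
            cleaned.add((start, end))
    return sorted(cleaned)


def token_labels(sentence: str, spans: Iterable[Span]) -> List[int]:
    """Return one 0/1 label per \\w+ token: 1 if the token overlaps any span."""
    clean = normalize_spans(spans, len(sentence))
    labels = []
    for match in WORD_RE.finditer(sentence):
        tok_start, tok_end = match.span()
        hit = any(tok_start < end and start < tok_end for start, end in clean)
        labels.append(1 if hit else 0)
    return labels


sentence = "We discuss p-branes, plane waves"
# The span (11, 19) is chosen for illustration; it covers "p-branes".
print(token_labels(sentence, [(11, 19)]))  # [0, 0, 1, 1, 0, 0]
```

Note that `\w+` splits `p-branes` into two tokens (`p` and `branes`), so a single term span can switch on more than one token label.
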
Metrics printed
| Scope | Metrics |
|---|---|
| Macro (mean across items) | Precision / Recall / F1 |
| Micro (pooled over corpus) | TP / FP / TN / FN + Precision / Recall / F1 |
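
The difference between the two scopes is easiest to see in code. The sketch below computes both from per-sentence 0/1 token labels; the function names are hypothetical, and scoring an undefined ratio as 0.0 is an assumption rather than documented evaluator behavior.

```python
from typing import Dict, List, Sequence, Tuple

Labels = Sequence[int]


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, F1; undefined ratios are scored as 0.0 (assumption)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def confusion(gold: Labels, pred: Labels) -> Tuple[int, int, int, int]:
    """Count (TP, FP, TN, FN) over aligned token labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    return tp, fp, tn, fn


def macro_micro(items: List[Tuple[Labels, Labels]]) -> Dict[str, tuple]:
    """Macro averages per-item scores; micro pools the counts first."""
    per_item = []
    totals = [0, 0, 0, 0]
    for gold, pred in items:
        counts = confusion(gold, pred)
        per_item.append(prf(counts[0], counts[1], counts[3]))
        totals = [a + b for a, b in zip(totals, counts)]
    n = len(per_item)
    macro = tuple(sum(s[i] for s in per_item) / n for i in range(3)) if n else (0.0, 0.0, 0.0)
    micro = prf(totals[0], totals[1], totals[3])
    return {"macro": macro, "micro": micro, "counts": tuple(totals)}
```

Macro scores weight every sentence equally, while micro scores weight every token equally, so long sentences influence the micro numbers more.
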
golden_set.json

```json
{
  "paragraph_id": 3,
  "sentence_id": 2,
  "source_sentence": "We discuss p-branes, plane waves, ...",
  "term_pairs": [
    {"en": "p-branes", "en_start": 10, "en_end": 17, "correction": "p-zarları"}
  ]
}
```

predictions.json

```json
{
  "paragraph_id": 3,
  "sentence_id": 2,
  "term_pairs": [
    {"en": "p-branes", "en_start": 10, "en_end": 17, "correction": "p-zarları"}
  ]
}
```

- Alignment per term uses: (clamped `en_start`, `en_end`, normalized `en`).
- Scoring uses exact match on the `correction` string after normalization.
Metrics printed
| Scope | Metrics |
|---|---|
| Macro | Mean accuracy per item |
| Micro | Correct / Total + Micro Accuracy |
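
A rough sketch of how exact-match correction scoring could work is shown below. Several details are assumptions, not confirmed evaluator behavior: the denominator is the number of gold terms, a gold term with no aligned prediction counts as wrong, and items without gold terms are skipped in the macro mean. All function names are hypothetical. Python's `str.lower()` is also not Turkish-aware (`I` becomes `i`, not `ı`), which may or may not match what the evaluator does.

```python
import re
import unicodedata
from typing import List, Tuple


def normalize_text(text: str) -> str:
    """NFKC-normalize, collapse whitespace, and lowercase."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def term_key(term: dict, length: int) -> Tuple[int, int, str]:
    """Alignment key: clamped span plus the normalized English term."""
    start = max(0, min(term["en_start"], length))
    end = max(0, min(term["en_end"], length))
    return start, end, normalize_text(term["en"])


def score_item(gold_row: dict, pred_row: dict) -> Tuple[int, int]:
    """Return (correct, total) for one sentence, using gold terms as the total."""
    length = len(gold_row["source_sentence"])
    predicted = {
        term_key(t, length): normalize_text(t["correction"])
        for t in pred_row["term_pairs"]
    }
    correct = sum(
        1
        for t in gold_row["term_pairs"]
        if predicted.get(term_key(t, length)) == normalize_text(t["correction"])
    )
    return correct, len(gold_row["term_pairs"])


def accuracy(pairs: List[Tuple[dict, dict]]) -> Tuple[float, float]:
    """Return (macro, micro) accuracy over paired rows."""
    per_item, correct_total, total = [], 0, 0
    for gold_row, pred_row in pairs:
        correct, n = score_item(gold_row, pred_row)
        if n:
            per_item.append(correct / n)
        correct_total += correct
        total += n
    macro = sum(per_item) / len(per_item) if per_item else 0.0
    micro = correct_total / total if total else 0.0
    return macro, micro
```

Because of the normalization step, predictions that differ from the gold correction only in case or spacing would still count as exact matches under this sketch.
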
The end-to-end task compares only the edited target sentences and reports sentence-level means (no corpus-level scores).
golden_set.json

```json
{
  "paragraph_id": 3,
  "sentence_id": 2,
  "edited_target_sentence": "Düzlem dalgaları, p-zarları, ..."
}
```

predictions.json

```json
{
  "paragraph_id": 3,
  "sentence_id": 2,
  "edited_target_sentence": "Düzlem dalgaları, p-branları, ..."
}
```

Metrics printed

| Metric | Description |
|---|---|
| `Mean_BLEU` | Mean sentence BLEU (sacrebleu) over all paired items |
| `Mean_chrF` | Mean sentence chrF (sacrebleu) over all paired items |

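
The two means can be reproduced approximately with sacrebleu's sentence-level API. The sketch below uses `sacrebleu.sentence_bleu` and `sacrebleu.sentence_chrf` with their default settings, which is an assumption: the evaluator may pass different smoothing or tokenization options, so treat the numbers as indicative only. The helper names are hypothetical, and the scorers are passed in as plain functions so the averaging logic can be exercised without sacrebleu installed.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, Tuple

Scorer = Callable[[str, str], float]  # (hypothesis, reference) -> score


def sacrebleu_scorers() -> Dict[str, Scorer]:
    """Sentence-level BLEU and chrF with sacrebleu defaults (assumed settings)."""
    import sacrebleu  # imported lazily; requires `pip install sacrebleu==2.4.3`

    return {
        "Mean_BLEU": lambda hyp, ref: sacrebleu.sentence_bleu(hyp, [ref]).score,
        "Mean_chrF": lambda hyp, ref: sacrebleu.sentence_chrf(hyp, [ref]).score,
    }


def mean_sentence_scores(
    pairs: Iterable[Tuple[str, str]], scorers: Dict[str, Scorer]
) -> Dict[str, float]:
    """Average each scorer over (gold_sentence, predicted_sentence) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no paired sentences to score")
    return {
        name: mean(score(hyp, ref) for ref, hyp in pairs)
        for name, score in scorers.items()
    }
```

With sacrebleu installed, `mean_sentence_scores([(gold_sentence, predicted_sentence)], sacrebleu_scorers())` returns a dictionary with `Mean_BLEU` and `Mean_chrF` entries.
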
If either file lacks pairs after alignment or sentences are missing, the evaluator prints a descriptive message and exits.
- Strict pairing: `_pair_rows` enforces a 1:1 mapping by `(paragraph_id, sentence_id)` and errors on duplicates, extras, or missing entries.
- Normalization: text is NFKC-normalized, lowercased, and whitespace-collapsed; spans are clamped and deduped.
- Tokenization for detection uses `\w+` to get word spans and then converts span overlaps to token labels.
- Robust I/O: both arrays and single JSON objects are accepted via `as_list`.
Using the sample provided (one pair only):
```bash
python evaluation/eval.py \
  --task end2end \
  --golden_set evaluation/subtask3_end2end/golden_set.json \
  --predictions evaluation/subtask3_end2end/predictions.json
```

Expected output format:

```text
== End-to-End ==
Items=1
Mean_BLEU=...
Mean_chrF=...
```
| Error | Cause | Fix |
|---|---|---|
| `requires sacrebleu` | Package not installed | Run `pip install sacrebleu==2.4.3` |
| `duplicate … in golden_set/predictions` | Repeated `(paragraph_id, sentence_id)` pair | Ensure each pair is unique per file |
| `missing … in predictions` / `extra … in predictions` | Files have mismatched ID sets | Both files must contain the exact same set of `(paragraph_id, sentence_id)` |
| Zero items evaluated | Required fields absent | Check that `term_pairs` or `edited_target_sentence` are present for your task |

For questions regarding the shared task, please contact: sigturk2026.sharedtask@gmail.com