A massive, human-curated multiple-choice benchmark for Urdu language understanding — 26,431 native Urdu questions spanning Pakistani secondary and higher-secondary curricula (SSC-I through HSSC-II) across the humanities, social sciences, STEM, professional studies, and general knowledge.
This repository holds the build pipeline, evaluation harness, and source code.
The released dataset lives on the Hugging Face Hub at
MBZUAI/UrduMMLU:
```python
from datasets import load_dataset

ds = load_dataset("MBZUAI/UrduMMLU", split="test")
```

| | |
|---|---|
| Questions | 26,431 |
| Language | Urdu (ur) |
| Format | Single-answer multiple choice (4–5 options) |
| Levels | SSC-I, SSC-II, HSSC-I, HSSC-II |
| Domains | 5 (Humanities, Social Sciences, STEM, Profession, Other) — 26 subdomains |
| Sources | 9 native Urdu exam & practice repositories + provincial boards |
Unlike machine-translated MMLU variants, every item is sourced from native Urdu exam material, then cleaned, de-duplicated, schema-normalized, and human-verified.
The benchmark is produced by a 27-stage pipeline. Each stage is a small, idempotent
module under src/<stage>/ that reads data/<N>-<name>/ and writes the next stage —
so any single transform can be re-run without disturbing the rest. The numeric prefix
on each data/ directory is its step badge.
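The read-one-directory, write-the-next pattern described above can be sketched as follows. This is only an illustration under stated assumptions: `run_stage`, `collapse_spaces`, the `1-in`/`2-out` directory names, and the use of a single `mcqs.json` file per stage are hypothetical and not taken from the repository.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_stage(in_dir: Path, out_dir: Path, transform: Callable[[list], list]) -> Path:
    """Read records from in_dir, apply transform, and write them to out_dir."""
    # Assumption: each stage keeps its records in a single mcqs.json file.
    records = json.loads((in_dir / "mcqs.json").read_text(encoding="utf-8"))
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "mcqs.json"
    # Overwrite instead of appending, so a re-run yields identical output.
    out_path.write_text(
        json.dumps(transform(records), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return out_path


def collapse_spaces(records: list) -> list:
    """Hypothetical transform: collapse whitespace runs in each question."""
    return [{**r, "question": " ".join(r["question"].split())} for r in records]


# Demo in a throwaway directory; the stage names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "1-in", Path(tmp) / "2-out"
    src.mkdir()
    sample = [{"id": 1, "question": "  پاکستان   کا دارالحکومت؟ "}]
    (src / "mcqs.json").write_text(
        json.dumps(sample, ensure_ascii=False), encoding="utf-8"
    )
    first = run_stage(src, dst, collapse_spaces).read_text(encoding="utf-8")
    second = run_stage(src, dst, collapse_spaces).read_text(encoding="utf-8")

print(first == second)
```

Because the stage never modifies its input directory and always rewrites its output in full, running it twice produces the same file, which is the property that makes single-stage re-runs safe.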
| Phase | Stages | What happens |
|---|---|---|
| Collection | 1-raw → 3-consolidated | Scrape & merge raw MCQs from web sources and OCR'd exam PDFs |
| Normalization | 4-rtl-aligned → 14-bidi-isolated | RTL alignment, quote/character/punctuation normalization, schema canonicalization, option-prefix stripping, blank normalization, exact & fuzzy dedup, bidi isolation |
| Filtering & sampling | 15-english-filtered → 17-batching | Drop non-Urdu rows, cap per-subdomain counts, build annotation batches |
| Annotation | 18-assignments → 22-final-combined | Group-aware dual annotation, then combine & finalize annotator verdicts |
| Release | 23-anonymize → 27-hf | Anonymize annotators, final dedup, then split into the eval snapshot (26-eval) and the published release (27-hf) |
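
As an illustration of what the bidi-isolation step in the Normalization phase might involve, here is a minimal sketch that wraps left-to-right runs (Latin letters and digits) embedded in Urdu text in Unicode directional isolates. The function name, the regular expression, and the choice to isolate only ASCII alphanumeric runs are assumptions made for this example; the repository's actual stage may work differently.

```python
import re

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Assumption: an LTR run is ASCII letters/digits, optionally joined by
# common inline punctuation (e.g. "3.14", "x + y", "km/h").
LTR_RUN = re.compile(r"[A-Za-z0-9]+(?:[ .,:/+\-=()%]*[A-Za-z0-9]+)*")


def isolate_ltr_runs(text: str) -> str:
    """Wrap each LTR run in LRI ... PDI so it renders stably inside RTL text."""
    # Strip isolates from a previous pass first, so the function is idempotent.
    # (Simplification: this would also drop a PDI that closed an RLI or FSI.)
    bare = text.replace(LRI, "").replace(PDI, "")
    return LTR_RUN.sub(lambda m: f"{LRI}{m.group(0)}{PDI}", bare)


question = "پانی کا کیمیائی فارمولا H2O ہے"
isolated = isolate_ltr_runs(question)
print(isolated == isolate_ltr_runs(isolated))
```

Isolates are invisible when rendered, but they stop the surrounding right-to-left text from reordering formulas, numbers, and option labels.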
The pipeline ends in two sibling snapshots, both slim KEEP_KEYS views of the final data:
- `data/26-eval/`: evaluation snapshot with the original ids, unsorted. Every eval config points here, so locked results join back by `id`. Built by `src/eval_snapshot/build.py`; holds `mcqs.json`, `mcqs_eval.jsonl` (flattened for lm-eval), `mcqs_dev.json` (few-shot pool), and `id_map.json` (the `27-hf → 26-eval` id crosswalk).
- `data/27-hf/`: the Hugging Face release, sorted by domain → subdomain and renumbered `0…N`. Built by `src/hf/build.py` from `26-eval`; holds `urdummlu.json`, `stats.json`, the dataset card, and logo. This is the only folder published to the Hub.
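
The relationship between the two snapshots (sort, renumber, and keep a crosswalk back to the original ids) can be sketched roughly as below. The field names `id`, `domain`, and `subdomain`, the sample ids, and the `build_release` helper are assumptions for illustration; they are not copied from `src/hf/build.py`.

```python
def build_release(eval_records: list) -> tuple:
    """Sort by domain then subdomain, renumber from 0, and record old ids."""
    # sorted() is stable, so items within a subdomain keep their eval order.
    ordered = sorted(eval_records, key=lambda r: (r["domain"], r["subdomain"]))
    release, id_map = [], {}
    for new_id, record in enumerate(ordered):
        release.append({**record, "id": new_id})
        id_map[new_id] = record["id"]  # release id -> original eval id
    return release, id_map


# Hypothetical records standing in for the evaluation snapshot.
eval_records = [
    {"id": "q-17", "domain": "STEM", "subdomain": "Physics"},
    {"id": "q-03", "domain": "Humanities", "subdomain": "History"},
    {"id": "q-42", "domain": "STEM", "subdomain": "Biology"},
]
release, id_map = build_release(eval_records)
print(id_map)
```

With a crosswalk of this shape, results keyed by release ids can be joined back to results stored against the original evaluation ids.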
```
urdu-mmlu/
├── data/        # pipeline stages 1-27
├── src/         # one module per pipeline stage (+ ocr/, analysis/)
├── web/         # the public site (landing, annotator, admin, preview)
├── scripts/     # build_site.py, deploy.py, prep_lm_eval.py
├── docs/        # generated leaderboard / site assets
├── config.yaml  # LLM evaluation experiment config
└── eval.sh      # lm-evaluation-harness runner (0/3/5-shot)
```
Treat each item as a single-answer MCQ: present question + options, compare the
model's chosen key against correct_key. Exact-match accuracy is the primary metric;
we recommend reporting it broken down by domain and level.
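
A minimal sketch of the scoring described above, assuming each item exposes `id`, `domain`, `level`, and `correct_key` fields and that model outputs have already been reduced to a chosen key per item. Only `correct_key` is named in this README; the other field names, the sample items, and the `accuracy_by` helper are hypothetical.

```python
from collections import defaultdict


def accuracy_by(items: list, predictions: dict, field: str) -> dict:
    """Exact-match accuracy grouped by the given item field."""
    totals, hits = defaultdict(int), defaultdict(int)
    for item in items:
        group = item[field]
        totals[group] += 1
        # A missing prediction counts as wrong.
        if predictions.get(item["id"]) == item["correct_key"]:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Toy items and predictions, made up for the example.
items = [
    {"id": 0, "domain": "STEM", "level": "SSC-I", "correct_key": "A"},
    {"id": 1, "domain": "STEM", "level": "HSSC-I", "correct_key": "C"},
    {"id": 2, "domain": "Humanities", "level": "SSC-I", "correct_key": "B"},
    {"id": 3, "domain": "Humanities", "level": "HSSC-I", "correct_key": "D"},
]
predictions = {0: "A", 1: "B", 2: "B", 3: "D"}

overall = sum(predictions.get(i["id"]) == i["correct_key"] for i in items) / len(items)
by_domain = accuracy_by(items, predictions, "domain")
by_level = accuracy_by(items, predictions, "level")
print(overall, by_domain, by_level)
```

Reporting the per-domain and per-level breakdowns alongside the overall figure makes it easier to see where a model's errors concentrate.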
Open models (via lm-evaluation-harness):
```bash
bash eval.sh   # runs 0/3/5-shot tasks; results in output/lm_eval/
```

API models (0-shot, Urdu & English prompts):

```bash
python src/run_experiment.py --config config.yaml
```

To set up the repository:

```bash
git clone git@github.com:mbzuai-nlp/urdu-mmlu.git
cd urdu-mmlu
pip install -r requirements.txt   # or: uv sync
```

Provide API keys for any providers you use (page classification / OCR / API-model eval) via a `.env` file in the project root.

- Code (the pipeline in `src/`, scripts, and harness): GNU GPL v3.0.
- Dataset (MBZUAI/UrduMMLU): CC BY 4.0.

```bibtex
@misc{tabassum2026urdummlumassivemultitaskbenchmark,
  title={UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding},
  author={Ahmer Tabassum and Sarfraz Ahmad and Hasan Iqbal and Owais Aijaz and Momina Ahsan and Preslav Nakov},
  year={2026},
  eprint={2606.07167},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.07167},
}
```