UrduMMLU

A Massive Multitask Benchmark for Urdu Language Understanding

TL;DR

A massive, human-curated multiple-choice benchmark for Urdu language understanding — 26,431 native Urdu questions spanning Pakistani secondary and higher-secondary curricula (SSC-I through HSSC-II) across the humanities, social sciences, STEM, professional studies, and general knowledge.

This repository holds the build pipeline, evaluation harness, and source code. The released dataset lives on the Hugging Face Hub at MBZUAI/UrduMMLU:

from datasets import load_dataset

ds = load_dataset("MBZUAI/UrduMMLU", split="test")

Dataset at a glance


Property	Value
Questions	26,431
Language	Urdu (`ur`)
Format	Single-answer multiple choice (4–5 options)
Levels	SSC-I, SSC-II, HSSC-I, HSSC-II
Domains	5 (Humanities, Social Sciences, STEM, Profession, Other), 26 subdomains
Sources	9 native Urdu exam & practice repositories + provincial boards

Unlike machine-translated MMLU variants, every item is sourced from native Urdu exam material, then cleaned, de-duplicated, schema-normalized, and human-verified.

How it's built

The benchmark is produced by a 27-stage pipeline. Each stage is a small, idempotent module under src/<stage>/ that reads one data/<N>-<name>/ directory and writes the next, so any single transform can be re-run without disturbing the rest. The numeric prefix on each data/ directory gives that stage's position in the pipeline.

Phase	Stages	What happens
Collection	`1-raw` → `3-consolidated`	Scrape & merge raw MCQs from web sources and OCR'd exam PDFs
Normalization	`4-rtl-aligned` → `14-bidi-isolated`	RTL alignment, quote/character/punctuation normalization, schema canonicalization, option-prefix stripping, blank normalization, exact & fuzzy dedup, bidi isolation
Filtering & sampling	`15-english-filtered` → `17-batching`	Drop non-Urdu rows, cap per-subdomain counts, build annotation batches
Annotation	`18-assignments` → `22-final-combined`	Group-aware dual annotation, then combine & finalize annotator verdicts
Release	`23-anonymize` → `27-hf`	Anonymize annotators, final dedup, then split into the eval snapshot (`26-eval`) and the published release (`27-hf`)
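The stage convention described above (read one directory, write the next, safe to re-run) can be sketched roughly as follows. This is an illustration only, not the repository's actual code: the `run_stage` and `strip_option_prefixes` helpers, the `PREFIX` pattern, the `options` field, and the use of `mcqs.json` as the per-stage filename are all assumptions made for the example.

```python
import json
import re
from pathlib import Path

# Hypothetical label pattern: "A)", "(b)", "1." followed by whitespace.
# The real option-prefix stripping stage may use different rules.
PREFIX = re.compile(r"^\s*\(?(?:[A-Ea-e]|[1-5])[\).:]\s+")


def strip_option_prefixes(row: dict) -> dict:
    """Return a copy of `row` with leading option labels removed."""
    cleaned = dict(row)
    cleaned["options"] = [PREFIX.sub("", opt) for opt in row["options"]]
    return cleaned


def run_stage(src_dir: Path, dst_dir: Path, transform, filename: str = "mcqs.json") -> int:
    """Read `src_dir/filename`, apply `transform` to each row, write `dst_dir/filename`.

    The output is rebuilt entirely from the input on every call, so running
    the stage twice yields the same file (the idempotence property).
    """
    rows = json.loads((src_dir / filename).read_text(encoding="utf-8"))
    out = [transform(row) for row in rows]
    dst_dir.mkdir(parents=True, exist_ok=True)
    (dst_dir / filename).write_text(
        json.dumps(out, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(out)
```

Because each stage only reads its predecessor's directory and never edits it in place, re-running one transform leaves every earlier stage untouched.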

The pipeline ends in two sibling snapshots, both slim KEEP_KEYS views of the final data:

data/26-eval/ — evaluation snapshot with the original ids, unsorted. Every eval config points here, so locked results join back by id. Built by src/eval_snapshot/build.py; holds mcqs.json, mcqs_eval.jsonl (flattened for lm-eval), mcqs_dev.json (few-shot pool), and id_map.json (the 27-hf → 26-eval id crosswalk).
data/27-hf/ — the Hugging Face release, sorted by domain → subdomain and renumbered 0…N. Built by src/hf/build.py from 26-eval; holds urdummlu.json, stats.json, the dataset card, and logo. This is the only folder published to the Hub.
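As a rough illustration of the relationship between the two snapshots, the sketch below sorts rows by domain and subdomain, renumbers them from 0, and records a crosswalk from each new id back to the original one. It is not the code in src/hf/build.py: the `build_release` function, the `domain`, `subdomain`, and `id` field names, and the plain alphabetical sort order are assumptions.

```python
def build_release(rows: list[dict]) -> tuple[list[dict], dict]:
    """Sort rows by (domain, subdomain), renumber 0..N-1, and build an id crosswalk.

    Returns the renumbered rows and a mapping {release id: original eval id}.
    """
    # sorted() is stable, so rows within one subdomain keep their original order.
    ordered = sorted(rows, key=lambda r: (r["domain"], r["subdomain"]))
    release, id_map = [], {}
    for new_id, row in enumerate(ordered):
        release.append({**row, "id": new_id})  # copy the row; the input is not mutated
        id_map[new_id] = row["id"]
    return release, id_map
```

With a crosswalk like this, results keyed by the evaluation ids can be joined to the published rows. Note that integer keys become strings if the mapping is written to JSON.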

Repository layout

urdu-mmlu/
├── data/              # pipeline stages 1-27
├── src/               # one module per pipeline stage (+ ocr/, analysis/)
├── web/               # the public site (landing, annotator, admin, preview)
├── scripts/           # build_site.py, deploy.py, prep_lm_eval.py
├── docs/              # generated leaderboard / site assets
├── config.yaml        # LLM evaluation experiment config
└── eval.sh            # lm-evaluation-harness runner (0/3/5-shot)

Evaluation

Treat each item as a single-answer MCQ: present the question and its options, then compare the model's chosen key against correct_key. Exact-match accuracy is the primary metric; we recommend reporting it broken down by domain and level.
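A minimal sketch of the recommended breakdown, assuming each row carries `id`, `domain`, `level`, and `correct_key` fields (only `correct_key` is named above; the others are assumed) and that predictions are held in a dict keyed by `id`. The `accuracy_by` helper is hypothetical and not part of the repository.

```python
from collections import defaultdict


def accuracy_by(rows: list[dict], predictions: dict, field: str) -> dict:
    """Exact-match accuracy grouped by `field` (for example "domain" or "level").

    A row with no prediction counts as incorrect.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        group = row[field]
        totals[group] += 1
        if predictions.get(row["id"]) == row["correct_key"]:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}
```

Calling it once with `"domain"` and once with `"level"` gives the two per-group tables; overall accuracy is the same computation without grouping.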

Open models (via lm-evaluation-harness):

bash eval.sh        # runs 0/3/5-shot tasks; results in output/lm_eval/

API models (0-shot, Urdu & English prompts):

python src/run_experiment.py --config config.yaml

Setup

git clone git@github.com:mbzuai-nlp/urdu-mmlu.git
cd urdu-mmlu
pip install -r requirements.txt      # or: uv sync

Provide API keys for any providers you use (page classification / OCR / API-model eval) via a .env file in the project root.

License

Code (the pipeline in src/, scripts, and harness) — GNU GPL v3.0.
Dataset (MBZUAI/UrduMMLU) — CC BY 4.0.

Citation

@misc{tabassum2026urdummlumassivemultitaskbenchmark,
      title={UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding},
      author={Ahmer Tabassum and Sarfraz Ahmad and Hasan Iqbal and Owais Aijaz and Momina Ahsan and Preslav Nakov},
      year={2026},
      eprint={2606.07167},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.07167},
}
