mbzuai-nlp/UrduMMLU

UrduMMLU

A Massive Multitask Benchmark for Urdu Language Understanding

Language: Urdu · Code License: GPL 3.0 · Data License: CC BY 4.0 · Python 3.10+

TL;DR

A massive, human-curated multiple-choice benchmark for Urdu language understanding — 26,431 native Urdu questions spanning Pakistani secondary and higher-secondary curricula (SSC-I through HSSC-II) across the humanities, social sciences, STEM, professional studies, and general knowledge.

This repository holds the build pipeline, evaluation harness, and source code. The released dataset lives on the Hugging Face Hub at MBZUAI/UrduMMLU:

from datasets import load_dataset

ds = load_dataset("MBZUAI/UrduMMLU", split="test")

Dataset at a glance

| Field | Value |
| --- | --- |
| Questions | 26,431 |
| Language | Urdu (ur) |
| Format | Single-answer multiple choice (4–5 options) |
| Levels | SSC-I, SSC-II, HSSC-I, HSSC-II |
| Domains | 5 (Humanities, Social Sciences, STEM, Profession, Other), with 26 subdomains |
| Sources | 9 native Urdu exam & practice repositories + provincial boards |

Unlike machine-translated MMLU variants, every item is sourced from native Urdu exam material, then cleaned, de-duplicated, schema-normalized, and human-verified.

How it's built

The benchmark is produced by a 27-stage pipeline. Each stage is a small, idempotent module under src/<stage>/ that reads data/<N>-<name>/ and writes the next stage's directory, so any single transform can be re-run without disturbing the rest. The numeric prefix on each data/ directory is its stage number.

| Phase | Stages | What happens |
| --- | --- | --- |
| Collection | 1-raw → 3-consolidated | Scrape & merge raw MCQs from web sources and OCR'd exam PDFs |
| Normalization | 4-rtl-aligned → 14-bidi-isolated | RTL alignment, quote/character/punctuation normalization, schema canonicalization, option-prefix stripping, blank normalization, exact & fuzzy dedup, bidi isolation |
| Filtering & sampling | 15-english-filtered → 17-batching | Drop non-Urdu rows, cap per-subdomain counts, build annotation batches |
| Annotation | 18-assignments → 22-final-combined | Group-aware dual annotation, then combine & finalize annotator verdicts |
| Release | 23-anonymize → 27-hf | Anonymize annotators, final dedup, then split into the eval snapshot (26-eval) and the published release (27-hf) |
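
To make "small, idempotent stage" concrete, here is a minimal sketch of what one such stage could look like, using option-prefix stripping as the transform. It is an illustration only, not the repository's code: the file name mcqs.json, the options field, the prefix pattern, and the function names are all assumptions.

```python
import json
import re
import tempfile
from pathlib import Path

# Hypothetical pattern: one Latin letter or digit, optionally parenthesised,
# followed by ")", "." or ":" and whitespace, e.g. "A) ", "(b) ", "1. ".
PREFIX = re.compile(r"^\s*\(?[A-Za-z0-9][).:]\s+")


def strip_prefix(option: str) -> str:
    """Remove a single leading option label, if one is present."""
    return PREFIX.sub("", option, count=1)


def run_stage(src_dir: Path, dst_dir: Path) -> Path:
    """Read the previous stage's rows, transform them, write the next stage.

    The output depends only on the input directory, so re-running the stage
    overwrites the destination with identical content.
    """
    rows = json.loads((src_dir / "mcqs.json").read_text(encoding="utf-8"))
    out = [
        {**row, "options": [strip_prefix(o) for o in row["options"]]}
        for row in rows
    ]
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / "mcqs.json"
    dst.write_text(json.dumps(out, ensure_ascii=False, indent=2), encoding="utf-8")
    return dst


# Demo with throwaway directories (the directory names are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "in", Path(tmp) / "out"
    src.mkdir()
    sample = [{"id": 1, "options": ["A) لاہور", "(b) کراچی", "3.5 کلومیٹر"]}]
    (src / "mcqs.json").write_text(
        json.dumps(sample, ensure_ascii=False), encoding="utf-8"
    )
    first = run_stage(src, dst).read_bytes()
    second = run_stage(src, dst).read_bytes()
    assert first == second  # re-running the stage changes nothing
```

Because the stage never reads its own output, running it twice is safe, which is the property the pipeline description relies on.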

The pipeline ends in two sibling snapshots, both slim KEEP_KEYS views of the final data:

  • data/26-eval/: the evaluation snapshot, with the original ids, unsorted. Every eval config points here, so locked results join back by id. Built by src/eval_snapshot/build.py; holds mcqs.json, mcqs_eval.jsonl (flattened for lm-eval), mcqs_dev.json (few-shot pool), and id_map.json (the 27-hf → 26-eval id crosswalk).
  • data/27-hf/: the Hugging Face release, sorted by domain → subdomain and renumbered 0…N. Built by src/hf/build.py from 26-eval; holds urdummlu.json, stats.json, the dataset card, and logo. This is the only folder published to the Hub.
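
The relationship between the two snapshots can be sketched as a sort, a renumbering, and a crosswalk from new ids to original ids. This is a minimal illustration rather than the contents of src/hf/build.py; the field names id, domain, and subdomain and the function name build_release are assumptions.

```python
def build_release(eval_rows):
    """Sort by (domain, subdomain), renumber from 0, and record the crosswalk.

    Returns the renumbered rows plus a dict mapping each release id to the
    original evaluation id. sorted() is stable, so rows within the same
    subdomain keep their original relative order.
    """
    ordered = sorted(eval_rows, key=lambda r: (r["domain"], r["subdomain"]))
    release, id_map = [], {}
    for new_id, row in enumerate(ordered):
        release.append({**row, "id": new_id})
        id_map[new_id] = row["id"]
    return release, id_map


# Toy rows with made-up ids, for illustration only.
eval_rows = [
    {"id": 901, "domain": "STEM", "subdomain": "Physics"},
    {"id": 17, "domain": "Humanities", "subdomain": "History"},
    {"id": 402, "domain": "STEM", "subdomain": "Biology"},
]
release, id_map = build_release(eval_rows)
print(id_map)  # {0: 17, 1: 402, 2: 901}
```

With such a map, a result computed on the release ids can be joined back to the evaluation snapshot by looking up id_map[release_id].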

Repository layout

urdu-mmlu/
├── data/              # pipeline stages 1-27
├── src/               # one module per pipeline stage (+ ocr/, analysis/)
├── web/               # the public site (landing, annotator, admin, preview)
├── scripts/           # build_site.py, deploy.py, prep_lm_eval.py
├── docs/              # generated leaderboard / site assets
├── config.yaml        # LLM evaluation experiment config
└── eval.sh            # lm-evaluation-harness runner (0/3/5-shot)

Evaluation

Treat each item as a single-answer MCQ: present question + options, compare the model's chosen key against correct_key. Exact-match accuracy is the primary metric; we recommend reporting it broken down by domain and level.
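
As a rough sketch of the recommended reporting, the snippet below computes exact-match accuracy overall and per group. It is not the repository's evaluation harness: correct_key comes from the description above, while the id, domain, and level field names, the predictions mapping, and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by(rows, predictions, field=None):
    """Exact-match accuracy, optionally grouped by a row field.

    predictions maps a row id to the model's chosen key. A missing
    prediction is counted as wrong.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        group = row[field] if field else "overall"
        totals[group] += 1
        hits[group] += int(predictions.get(row["id"]) == row["correct_key"])
    return {group: hits[group] / totals[group] for group in totals}


# Toy data, for illustration only.
rows = [
    {"id": 0, "domain": "STEM", "level": "SSC-I", "correct_key": "A"},
    {"id": 1, "domain": "STEM", "level": "HSSC-I", "correct_key": "C"},
    {"id": 2, "domain": "Humanities", "level": "SSC-I", "correct_key": "B"},
    {"id": 3, "domain": "Humanities", "level": "SSC-I", "correct_key": "D"},
]
predictions = {0: "A", 1: "B", 2: "B"}  # no prediction for id 3

print(accuracy_by(rows, predictions))            # {'overall': 0.5}
print(accuracy_by(rows, predictions, "domain"))  # {'STEM': 0.5, 'Humanities': 0.5}
print(accuracy_by(rows, predictions, "level"))
```

Counting unanswered items as wrong keeps the denominator fixed across models, which makes per-domain and per-level numbers comparable.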

Open models (via lm-evaluation-harness):

bash eval.sh        # runs 0/3/5-shot tasks; results in output/lm_eval/

API models (0-shot, Urdu & English prompts):

python src/run_experiment.py --config config.yaml

Setup

git clone git@github.com:mbzuai-nlp/urdu-mmlu.git
cd urdu-mmlu
pip install -r requirements.txt      # or: uv sync

Provide API keys for any providers you use (page classification / OCR / API-model eval) via a .env file in the project root.

License

Code is released under GPL 3.0; the dataset is released under CC BY 4.0.

Citation

@misc{tabassum2026urdummlumassivemultitaskbenchmark,
      title={UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding},
      author={Ahmer Tabassum and Sarfraz Ahmad and Hasan Iqbal and Owais Aijaz and Momina Ahsan and Preslav Nakov},
      year={2026},
      eprint={2606.07167},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.07167},
}
