MALINDO BLiMP (Malay/Indonesian Benchmark of Linguistic Minimal Pairs)

Introduction

MALINDO BLiMP is a dataset for targeted syntactic evaluations of language models in Malay (zsm) and Indonesian (ind). In building MALINDO BLiMP, we closely followed the procedure adopted by the developers of JBLiMP (Japanese Benchmark of Linguistic Minimal Pairs). We collected our data from linguistics journals and books and created minimal pairs. These minimal pairs are classified into 12 phenomena consisting of 45 paradigms.
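
As a rough illustration of how minimal pairs are typically used in BLiMP-style evaluation, the sketch below counts a pair as correct when a model scores the acceptable sentence higher than the unacceptable one. The function names and the `toy_score` stand-in are hypothetical and are not part of MALINDO BLiMP; in practice `score` would be a language model's sentence log-probability.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (good, bad) pairs where the good sentence scores strictly higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no minimal pairs given")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in scorer: fewer tokens means a higher score.

    Replace this with a real model's sentence log-probability.
    """
    return -float(len(sentence.split()))


# Placeholder token strings, not real data from the benchmark.
toy_pairs = [("a b", "a b c"), ("a b c d", "a b")]
accuracy = minimal_pair_accuracy(toy_pairs, toy_score)
```

A strict comparison is used here, so ties count as errors; whether ties should be handled differently is a design choice left open.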

MALINDO BLiMP is still under construction. This release contains only part of the dataset; the full version will follow.

MALINDO BLiMP is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

How to Cite

Nomoto, Hiroki, Sri Budi Lestari, David Moeljadi, Farhan Athirah binti Abdul Razak, Kazuya Inagaki and Masashi Furihata. 2026. Challenges in building a benchmark of linguistic minimal pairs for low resource languages: The case of Malay and Indonesian. Proceedings of the Thirty-Second Annual Meeting of the Association for Natural Language Processing, 381-386.

@InProceedings{NomotoEtAl26,
    author = {Nomoto, Hiroki and Lestari, Sri Budi and Moeljadi, David and Farhan Athirah binti Abdul Razak and Inagaki, Kazuya and Furihata, Masashi},
    year = {2026},
    title = {Challenges in building a benchmark of linguistic minimal pairs for low resource languages: The case of Malay and Indonesian},
    booktitle = {Proceedings of the Thirty-Second Annual Meeting of the {A}ssociation for {N}atural {L}anguage {P}rocessing},
    pages = {381--386},
    url = {https://www.anlp.jp/proceedings/annual_meeting/2026/pdf_dir/Q1-6.pdf}
}

Contents

core/raw/: Minimal pairs used for human validation (300 pairs for both Malay and Indonesian)
core/validated/: Minimal pairs after human validation (174 pairs for Malay and 189 pairs for Indonesian)
core/validation/: Results of human validation (acceptability judgement experiment)
sources.bib: List of data sources

Data Format

Name	Description
ID	ID of the minimal pair
original_language	Language of the original sentence
{good/bad}_diacritic	Acceptability judgement in the source (`g` = no diacritic)
{good/bad}_sentence_raw	Raw sentence in the source (not necessarily in the target language)
{good/bad}_sentence	MALINDO BLiMP sentence (translated if the raw sentence is not in the target language)
{good/bad}_translation	English translation
{good/bad}_source	Author and publication year of the source
{good/bad}_page	Page number where the relevant sentence appears in the source
{good/bad}_num	Example number of the relevant sentence in the source
phenomenon	Categorization based on linguistic phenomenon
paradigm	Sub-categorization of phenomenon
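
The fields above could be read along the following lines. This is a minimal sketch under two assumptions that the table does not confirm: that the data files are tab-separated with a header row, and that `{good/bad}_sentence` expands to the column names `good_sentence` and `bad_sentence`. The sample rows are made-up placeholders, not entries from the dataset.

```python
import csv
import io
from collections import Counter

# Placeholder content standing in for a data file; only a subset of columns is shown.
SAMPLE_TSV = (
    "ID\tgood_sentence\tbad_sentence\tphenomenon\tparadigm\n"
    "1\tGOOD_1\tBAD_1\tphenomenon_a\tparadigm_a1\n"
    "2\tGOOD_2\tBAD_2\tphenomenon_a\tparadigm_a2\n"
    "3\tGOOD_3\tBAD_3\tphenomenon_b\tparadigm_b1\n"
)


def read_pairs(handle):
    """Read tab-separated rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def pairs_per_phenomenon(rows):
    """Count how many minimal pairs fall under each phenomenon."""
    return Counter(row["phenomenon"] for row in rows)


rows = read_pairs(io.StringIO(SAMPLE_TSV))
counts = pairs_per_phenomenon(rows)
sentence_pairs = [(row["good_sentence"], row["bad_sentence"]) for row in rows]
```

To read an actual file, pass an open file handle to `read_pairs` instead of the `io.StringIO` wrapper, adjusting the delimiter and column names if the files differ from the assumptions above.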
