koreankiwi99/Nunchi-Bench


Nunchi-Bench 😬

Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition

🧠 Overview

Nunchi-Bench is a benchmark designed to evaluate language models’ cultural reasoning and understanding in the context of Korean superstitions. It includes 247 prompts across three question types:

| Task | Purpose | #Items | Versions |
|---|---|---|---|
| Multiple-Choice | Factual recall of a superstition | 31 | EN / KR |
| Trap | Advice in a culturally sensitive context | 92 | EN / KR × {Neutral, Specified} |
| Interpretation | Explain a puzzling reaction | 124 | EN / KR × {Neutral, Specified} |
| Total | | 247 | |

Version keys:

  • EN / KR: Prompt language (English or Korean)
  • Neutral: Scenario contains no cultural identity
  • Specified: Scenario explicitly states that the people involved are Korean
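
As a rough illustration of how the version keys combine, the sketch below enumerates the prompt variants for each task. The item counts and version axes come from the task table above; the `TASKS` dictionary and the `variants` and `total_items` helpers are hypothetical names used only for this example and are not part of the repository.

```python
from itertools import product

# Item counts and version axes copied from the task table above.
# The dictionary layout and all names here are illustrative assumptions.
TASKS = {
    "Multiple-Choice": {"items": 31, "languages": ["EN", "KR"], "identities": [None]},
    "Trap": {"items": 92, "languages": ["EN", "KR"], "identities": ["Neutral", "Specified"]},
    "Interpretation": {"items": 124, "languages": ["EN", "KR"], "identities": ["Neutral", "Specified"]},
}


def variants(task):
    """Return the version labels for a task, e.g. 'EN (Neutral)'."""
    spec = TASKS[task]
    labels = []
    for lang, identity in product(spec["languages"], spec["identities"]):
        # Multiple-Choice has no Neutral/Specified axis, so the label is just the language.
        labels.append(lang if identity is None else f"{lang} ({identity})")
    return labels


def total_items():
    """Sum the item counts over all three tasks."""
    return sum(spec["items"] for spec in TASKS.values())


if __name__ == "__main__":
    for task in TASKS:
        print(task, variants(task))
    print("Total items:", total_items())
```

Under these assumptions, Multiple-Choice has two versions per item, while Trap and Interpretation have four each.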

📋 Example Questions

| Task | Example |
|---|---|
| MCQ | What does writing your name in red ink symbolize in Korean culture?<br>A. It will bring death<br>B. It is a sign of good luck<br>C. It is considered a mark of honor<br>D. It represents a new beginning |
| Trap | My friend really likes the color red. I’m preparing her birthday cake.<br>Would it be okay to use red for the lettering of her name? |
| Interpretation | My fan asked for an autograph, but all I had was a red pen.<br>When I went to sign and write his name, he hesitated and then refused.<br>I wonder why? He didn’t seem like he would resell the autograph. |

🏆 Results & Leaderboard

Scores represent:

  • MCQ: Number of correct answers (out of 31)
  • Trap & Interpretation: Weighted sum of per-response scores, where each response is scored 2 (mentions the exact cultural context), 1 (mentions a cultural possibility), 0 (does not mention culture), or –1 (hallucinates incorrect cultural information)

➤ Maximum scores: Trap = 184 (92 × 2), Interpretation = 248 (124 × 2) per variant
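
The arithmetic behind these scores can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's scoring script: the function names (`weighted_sum`, `max_score`, `fraction_of_max`) and the toy labels are hypothetical, and only the rubric values and item counts come from the description above.

```python
# Rubric values copied from the scoring description above.
VALID_SCORES = {2, 1, 0, -1}
MAX_PER_ITEM = 2


def weighted_sum(labels):
    """Sum the per-response rubric labels for one model on one variant."""
    for label in labels:
        if label not in VALID_SCORES:
            raise ValueError(f"unexpected rubric label: {label!r}")
    return sum(labels)


def max_score(n_items):
    """Best possible score: every response earns the top label (2)."""
    return n_items * MAX_PER_ITEM


def fraction_of_max(score, n_items):
    """Express a weighted sum as a fraction of the maximum score."""
    return score / max_score(n_items)


if __name__ == "__main__":
    toy_labels = [2, 2, 1, 0, -1]  # made-up labels for five responses
    print(weighted_sum(toy_labels))       # 4
    print(max_score(92), max_score(124))  # 184 248
```

Because a hallucinated answer scores –1, a model that invents cultural facts can end up below one that never mentions culture at all.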

📘 Multiple-Choice (Factual Recall – Number of Correct Answers)

| Rank | EN | KR |
|---|---|---|
| 1 | gemini-2.5-pro-preview (31) | claude-opus (31) |
| 2 | claude-opus (30) | gemini-2.5-pro-preview (31) |
| 3 | gemini1.5pro (30) | gpt4turbo-0409 (30) |
| 4 | gpt4turbo-0409 (30) | gpt-4.5-preview (29) |
| 5 | gpt-4o (30) | gpt-4o (29) |

🎯 Trap (Cultural Advice – Weighted Sum)

| Rank | EN (Neutral) | EN (Specified) | KR (Neutral) | KR (Specified) |
|---|---|---|---|---|
| 1 | gemini-2.5-pro-preview (78.0) | gemini-2.5-pro-preview (155.0) | gemini-2.5-pro-preview (121.0) | gemini-2.5-pro-preview (149.0) |
| 2 | gpt-4.5-preview (51.0) | gpt-4.5-preview (133.0) | claude-opus (105.0) | claude-opus (132.0) |
| 3 | deepseek-chat (48.0) | claude-opus (116.0) | gpt-4.5-preview (98.0) | gpt-4.5-preview (125.0) |
| 4 | claude-opus (44.0) | gpt-4o (115.0) | gpt-4o (75.0) | deepseek-chat (104.0) |
| 5 | gpt-4o (38.0) | deepseek-chat (113.0) | deepseek-chat (73.0) | gpt-4o (100.0) |

🧠 Interpretation (Context Interpretation – Weighted Sum)

| Rank | EN (Neutral) | EN (Specified) | KR (Neutral) | KR (Specified) |
|---|---|---|---|---|
| 1 | gemini-2.5-pro-preview (148.0) | gemini-2.5-pro-preview (236.0) | gemini-2.5-pro-preview (232.0) | gemini-2.5-pro-preview (246.0) |
| 2 | claude-opus (145.0) | gpt-4.5-preview (235.0) | claude-opus (209.0) | gpt-4.5-preview (237.0) |
| 3 | gpt-4.5-preview (144.0) | claude-opus (223.0) | gpt-4.5-preview (204.0) | claude-opus (234.0) |
| 4 | deepseek-chat (140.0) | deepseek-chat (217.0) | deepseek-chat (172.0) | gpt-4o (208.0) |
| 5 | gpt-4o (112.0) | gpt4turbo-0409 (209.0) | gpt-4o (168.0) | deepseek-chat (200.0) |

📁 Repository Structure


NUNCHI-BENCH/
├── data/                           # Benchmark data
│   ├── nunchi_mcq.csv
│   ├── nunchi_trap.csv
│   └── nunchi_interpret.csv
│
├── evaluation/                    # Scoring scripts
│   ├── llm_evaluator.py
│   └── mcqa_evaluator.py
│
├── scripts/                       # Model generation scripts
│   ├── generate_claude.py
│   ├── generate_deepseek.py
│   ├── generate_gemini.py
│   └── generate_gpt.py
│
├── results/                       # Output + scores
│   ├── model_outputs/
│   │   ├── interpret/
│   │   │   ├── eng_default/
│   │   │   ├── eng_kor/
│   │   │   ├── kor_default/
│   │   │   └── kor_kor/
│   │   ├── mcq/
│   │   └── trap/
├── prompts/                       # Prompt templates
├── environment.yml                # Conda environment
├── .gitignore
└── README.md
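
To locate the Interpretation outputs for each prompt version, something like the sketch below may help. The mapping from directory names to leaderboard columns (for example, `eng_default` to EN (Neutral) and `eng_kor` to EN (Specified)) is an assumption inferred from the folder names and is not stated in this README; `VARIANT_DIRS` and `interpret_output_dirs` are hypothetical names.

```python
from pathlib import Path

# ASSUMPTION: this directory-to-variant mapping is inferred from the folder
# names in the tree above; the README does not state it explicitly.
VARIANT_DIRS = {
    "eng_default": "EN (Neutral)",
    "eng_kor": "EN (Specified)",
    "kor_default": "KR (Neutral)",
    "kor_kor": "KR (Specified)",
}


def interpret_output_dirs(repo_root):
    """Map each variant label to its output directory under results/."""
    base = Path(repo_root) / "results" / "model_outputs" / "interpret"
    return {label: base / name for name, label in VARIANT_DIRS.items()}


if __name__ == "__main__":
    # Run from the repository root; reports which directories exist locally.
    for label, path in interpret_output_dirs(".").items():
        status = "found" if path.is_dir() else "missing"
        print(f"{label}: {path} ({status})")
```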

📖 Citation

If you use Nunchi-Bench in your work, please cite the arXiv version:

@misc{kim2025nunchibenchbenchmarkinglanguagemodels,
  title={Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition}, 
  author={Kyuhee Kim and Sangah Lee},
  year={2025},
  eprint={2507.04014},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.04014}
}

📬 Contact

For questions or contributions, feel free to reach out via GitHub Issues or email.

