mmlu

Here are 26 public repositories matching this topic...

baichuan-inc / Baichuan-7B

A large-scale 7B pretraining language model developed by BaiChuan-Inc.

natural-language-processing artificial-intelligence chinese llama huggingface ceval gpt-4 large-language-models chatgpt mmlu

Updated Jul 18, 2024
Python

baichuan-inc / Baichuan2

A series of large language models developed by Baichuan Intelligent Technology

benchmark natural-language-processing artificial-intelligence chinese gpt huggingface ceval gpt-4 large-language-models chatgpt mmlu llama2

Updated Nov 8, 2024
Python

baichuan-inc / Baichuan-13B

A 13B large language model developed by Baichuan Intelligent Technology

benchmark natural-language-processing artificial-intelligence chinese huggingface ceval gpt-4 large-language-models chatgpt mmlu

Updated Sep 6, 2023
Python

microsoft / MMLU-CF

A Contamination-free Multi-task Language Understanding Benchmark [Official, ACL 2025]

benchmark contamination llm mmlu

Updated May 17, 2025

ExplainableML / in-context-impersonation

[NeurIPS 2023 Spotlight] In-Context Impersonation Reveals Large Language Models' Strengths and Biases

chatbot text-generation artificial-intelligence llama clip reasoning bandit neurips-2023 mmlu llama2 in-context-impersonation

Updated Nov 30, 2024
Python

vignesh2027 / LLM-Evaluation-Framework

Production-grade LLM Evaluation & Benchmarking Framework — GPT-4, Claude, Gemini, Mistral. Accuracy, latency, cost, hallucination, reasoning metrics.

Updated Jun 7, 2026
Python

SS47816 / AGI-Elo

[NeurIPS 2025] AGI-Elo: How Far Are We From Mastering A Task?

benchmark leaderboard agi imagenet coco artificial-general-intelligence datasets evaluation-metrics elo-rating rating-system evaluation-framework sota ai-benchmarks waymo-open-dataset mmlu vision-language-action ai-evaluation-framework livecodebench navsim

Updated Oct 28, 2025
Python

notwitcheer / llm-bench-rig

Dual-engine (llama.cpp + vLLM) LLM benchmarking pipeline for GGUF & safetensors on NVIDIA GPUs — speed, quality, live dashboard, publishable cards.

python benchmarking machine-learning cuda nvidia fastapi llm lm-evaluation-harness llama-cpp vllm mmlu gguf

Updated Jun 8, 2026
Python

mbzuai-nlp / UrduMMLU

A Massive Multitask Benchmark for Urdu Language Understanding

urdu multiple-choice mmlu

Updated Jun 8, 2026
Jupyter Notebook

he-yufeng / LiteBench

A pip-installable benchmark runner for LLMs and agents. Five minutes to your first eval.

python agent cli benchmark evaluation humaneval llm gsm8k mmlu litellm

Updated May 12, 2026
Python

RenaudGaudron / llm-quantisation-performance-study

Code and data accompanying the article "The impact of quantising a small open source LLM". This repository explores how quantisation affects performance, VRAM usage, and inference speed in Qwen3 1.7B.

open-source ai quantization llm generative-ai mmlu

Updated Jul 5, 2025
Python

RobotStudyCompanion / Benchmark_LM

Benchmark suite for open-source language models on the edge. Evaluates inference efficiency, MMLU accuracy, and LLM-rated teaching effectiveness.

python raspberry-pi benchmark language-models reproducibility edge-computing social-robots educational-robotics llm mmlu ollama teaching-effectiveness arso2026

Updated Apr 21, 2026
Python

NahuelGiudizi / llm-evaluation

Enterprise-grade LLM evaluation framework | Multi-model benchmarking, honest dashboards, system profiling | Academic metrics: MMLU, TruthfulQA, HellaSwag | Zero fake data | PyPI: llm-benchmark-toolkit | Blog: https://dev.to/nahuelgiudizi/building-an-honest-llm-evaluation-framework-from-fake-metrics-to-real-benchmarks-2b90

visualization python benchmarking machine-learning performance-testing academic-metrics mmlu ollama llm-evaluation truthfulqa hellaswag

Updated Dec 5, 2025
Python

sergeyklay / factly

CLI tool to evaluate LLM factuality on MMLU benchmark.

cli benchmark openai factuality ai-evaluation llm prompt-engineering chatgpt mmlu llm-evaluation

Updated Nov 26, 2025
Python

RenaudGaudron / MMLU_benchmark

An easy-to-use and standardised framework for evaluating Large Language Models (LLMs) on the Massive Multitask Language Understanding (MMLU) dataset. Currently supported: Hugging Face transformer models and Bedrock models.

open-source benchmark ai llm generative-ai mmlu

Updated Jul 12, 2025
Python

chengjun-xu / ai-eval-platform

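LLM evaluation platform: three-way backend (local / API / HuggingFace / OpenCompass), with support for data production (Self-Instruct/Evol-Instruct), long-tail scenario generation, weakness mining, regression analysis, contamination detection, and bad-case attribution. Extensible benchmark system and LLM-as-Judge automatic scoring.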

python flask humaneval ai-evaluation gsm8k mmlu llm-evaluation benchmark-platform rag-evaluation llm-as-judge opencompass llm-benchmark data-contamination-detection

Updated Jun 7, 2026
Python

abhigupta2909 / LLMPerformanceLab

LLMs' performance analysis on CPU, GPU, Execution Time and Energy Usage

javascript mysql java spring-boot reactjs flask-restful humaneval llms mmlu ollama-api

Updated Apr 1, 2024
Java

AndrewHeller17 / Effect-of-Emotional-Framing-on-LLM-Performance

Evaluated the impact of emotional prompt framing on LLM reasoning accuracy across industry benchmarks (MMLU, GPQA) using controlled experimental conditions.

python nlp machine-learning research llm chatgpt mmlu gpqa

Updated Mar 3, 2026
Jupyter Notebook

caiocezarq / llm-comparison-benchmark

Modular Python framework for benchmarking and reproducible analysis of LLMs, with execution via APIs, structured response collection, automatic metrics (BLEU, ROUGE, BERTScore, MMLU, HellaSwag), rankings, and consolidated reports.

python ai rouge rouge-metric bleu-score llm bertscore mmlu llms-benchmarking evidently-ai hellaswag

Updated Mar 6, 2026
HTML

AWSWind / ollama-model-evaluator

Benchmark local LLMs running under Ollama - quality, speed, and side-by-side comparison, with a proper Web UI.

react python benchmark typescript evaluation hypothesis property-testing tailwindcss fastapi ai-tools humaneval llm gsm8k mmlu ollama llm-evaluation

Updated May 10, 2026
Python
