A large-scale 7B pretraining language model developed by BaiChuan-Inc.
A series of large language models developed by Baichuan Intelligent Technology
A 13B large language model developed by Baichuan Intelligent Technology
A Contamination-free Multi-task Language Understanding Benchmark [Official, ACL 2025]
[NeurIPS 2023 Spotlight] In-Context Impersonation Reveals Large Language Models' Strengths and Biases
Production-grade LLM Evaluation & Benchmarking Framework — GPT-4, Claude, Gemini, Mistral. Accuracy, latency, cost, hallucination, reasoning metrics.
[NeurIPS 2025] AGI-Elo: How Far Are We From Mastering A Task?
Dual-engine (llama.cpp + vLLM) LLM benchmarking pipeline for GGUF & safetensors on NVIDIA GPUs — speed, quality, live dashboard, publishable cards.
A Massive Multitask Benchmark for Urdu Language Understanding
Code and data accompanying the article "The impact of quantising a small open source LLM". This repository explores how quantisation affects performance, VRAM usage, and inference speed in Qwen3 1.7B.
Benchmark suite for open-source language models on the edge. Evaluates inference efficiency, MMLU accuracy, and LLM-rated teaching effectiveness.
Enterprise-grade LLM evaluation framework | Multi-model benchmarking, honest dashboards, system profiling | Academic metrics: MMLU, TruthfulQA, HellaSwag | Zero fake data | PyPI: llm-benchmark-toolkit | Blog: https://dev.to/nahuelgiudizi/building-an-honest-llm-evaluation-framework-from-fake-metrics-to-real-benchmarks-2b90
CLI tool to evaluate LLM factuality on MMLU benchmark.
An easy-to-use and standardised framework for evaluating Large Language Models (LLMs) on the Massive Multitask Language Understanding (MMLU) dataset. Currently supported: Hugging Face transformer models and Bedrock models.
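LLM evaluation platform with a three-way backend (local/API/HuggingFace/OpenCompass). Supports data production (Self-Instruct/Evol-Instruct), long-tail scenario generation, weakness mining, regression analysis, contamination detection, and bad-case attribution. Includes an extensible benchmark system and LLM-as-Judge automatic scoring.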
LLMs' performance analysis on CPU, GPU, Execution Time and Energy Usage
Modular Python framework for benchmarking and reproducible analysis of LLMs, with execution via APIs, structured response collection, automatic metrics (BLEU, ROUGE, BERTScore, MMLU, HellaSwag), rankings, and consolidated reports.
Benchmark local LLMs running under Ollama - quality, speed, and side-by-side comparison, with a proper Web UI.