Update with the new mainstream structure by Jeronymous · Pull Request #5 · OpenLLM-France/lighteval

Jeronymous · 2026-04-22T09:28:34Z

No description provided.

* option1 * also debugging the judge * also debugging the judge * debug * eval tracker fix 1 * likely fix for the GSM+ issue * stringify model judge + change max_length to what's actually passed instead of setting a bunch of overwrites * more memory for flow judge

…several combinations (huggingface#1017) * fix * added a warning message * fix unit tests * fix unit tests 2 * mini fix * minifix * test * update new metrics name * updated var names

…huggingface#828) Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>

* homogeneize k and n in parametrizable metrics * updated aime, last metric fixs * fix * restore rm import * restore * update doc * gpqa fix * pass at * recall * test

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>

* use inspect-ai to evaluate aime25 and gsm8k * revert file * working for 3 tasks * parallel evals of tasks * adds gpqa diamond to inspect * move tasks to individual files * move tasks to individual files * enable extended tasks as well * run precomit hook * fix mkqa * chaange extended suite to lighteval * chaange extended suite to lighteval * add metdata to tasks * add metdata to tasks * remove license notice and put docstring on top of file * homogenize tags * add docstring for all multilingual tasks * add docstring for all multilingual tasks * add name and dataset to metadata * use TASKS_TABLE for multilingual tasks * use TASKS_TABLE for default tasks * use TASKS_TABLE for default tasks * loads all tasks correclty * move community tasks to default tasks and update doc * move community tasks to default tasks and update doc * revert uneeded changes * fix doc build * fix doc build * remove custom tasks and let user decide if loading multilingual tasks * load-tasks multilingual fix * update doc * remove uneeded file * update readme * update readme * update readme * fix test * add back the custom tasks * add back the custom tasks * fix tasks * fix tasks * fix tasks * fix tests * fix tests

adds inspect-ai as backend for lighteval! Offloading backend implementation and maintenance - this allows for: - better logs - better paralelixzation - easier to add tasks tasks compatible with inspect ai (at term all the tasks will be compatible): - gpqa (fewshot compatible) - ifeval - hle - gsm8k (fewshot compatible) - agieval - aime24,25 ### run llama3.1-8b using all providers on `hf-inference-providers` on `gpqa`, `agieval` and `aime25`: ``` lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \ "lighteval|gpqa|0,lighteval|agieval|0,lighteval|aime25|0" \ max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1 ``` result: ``` | Model |agieval|aime25|gpqa| |----------------------------------------------------------------------|------:|-----:|---:| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras | 0.53| 0|0.33| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai| 0.71| 1|0.75| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai | 0.71| 0|0.25| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius | 0.53| 0|0.20| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita | 0.65| 0|0.75| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova | 0.71| 0|0.25| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway | 0.35| 0|0.25| ``` ### compare few shots diff on gsm8k ``` lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \ "lighteval|gsm8k|0,lighteval|gsm8k|3" \ max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1 ``` ``` | Model |gsm8k|gsm8k_3_shots| |----------------------------------------------------------------------|----:|------------:| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras | 0.6| 0.7| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai| 0.7| 0.7| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai | 0.7| 0.8| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius | 0.6| 0.7| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita | 0.5| 0.7| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova | 0.7| 0.7| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway | 0.4| 0.8| ``` --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* adds mmlu-pro * adds mmlu-pro * add mmlu-pro with inspectai

* adds mmlu-pro * adds mmlu-pro * add mmlu-pro with inspectai * fix reasoning effrot

…1034)

…#992) * fix * revert uneeded changes --------- Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>

* run all hf-providers * add example * remove uneeded params

* remove suites and make fewshot optional * fix docs to remove suites and fewshots * fix tests * fix tests * fix tests * fix tests * fix tests * fix tests * fix tests

Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>

…uggingface#1051) * remove suites and make fewshot optional * fix docs to remove suites and fewshots * fix tests * fix tests * fix tests * fix tests * fix tests * fix tests * fix tests * Remove suite argument iin task config * Remove suite argument iin task config * fix try to cache functool.partial function * fix styling

…gface#1052) * add a task dump in registry for better documentation of tasks * Update src/lighteval/tasks/registry.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update src/lighteval/tasks/registry.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update src/lighteval/tasks/registry.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix * remove * fix aimo --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

set(1,2,3) -> {1,2,3} Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai> Co-authored-by: Francesco Bertolotti <francesco.bertolotti@gmail.com>

Co-authored-by: Kangda Wei <kangdawei@Kangdas-MacBook-Pro.local>

even though vllm produces openai compatible endpoint, to make work you have to use provider as hosted_vllm and use a hosted_vllm prefix prior to model name

moves all the prompts from `default_prompts.py` to their respective task file

Upstream refactor splits src/lighteval/tasks into per-task files under src/lighteval/tasks/tasks/ and src/lighteval/tasks/multilingual/tasks/, drops default_tasks.py / default_prompts.py / multilingual/tasks.py, and removes the suite field from LightevalTaskConfig. Port our edits to the new structure: - tasks/gsm_plus.py: generation_size 16384 - tasks/gsm8k.py: generation_size 2048 - tasks/mgsm.py: hf_revision, suffix exact_match + expr_gold_metric, language-specific stop sequences for all 11 subsets - tasks/piqa.py: switch to lighteval/piqa mirror - tasks/siqa.py: pin hf_revision - tasks/mmlu_pro.py: fix upstream's hardcoded ABCD letters so the prompt uses dynamic letters based on the number of options; add a parallel mmlu_pro_raw task exposing the handmade prompt (no inspect_ai) - tasks/ruler.py: new home for the ruler prompt helper - tasks/advbench.py: move here from community_tasks/ - multilingual/tasks/mathalea.py: move here from community_tasks/ - multilingual/tasks/french.py: keep jzhang86/fr_ifeval fallback and the generative GPQA-fr-diamond variant with prompt_gpqa_fr_instruct Other conflict resolutions: - pyproject.toml: take upstream unpinned transformers, vllm>=0.11.0, new inspect-ai and openai deps - vllm_model.py: keep max_seq_len_to_capture fallback, Mistral eos_token guard, prefix-cache None-skip in logprob loop, and skip_reading_prefix_cache via guarded attribute assignment; adopt upstream's build_vllm_token_prompts helper - llm_as_judge.py: keep max_model_len=65536, adopt upstream's api_key/base_url litellm pass-through - lighteval_task.py: preserve name/data_dir fallback in load_dataset while picking up upstream's data_files support; keep partial args detail in __str__ for deterministic cache hashing - cache_management.py: adopt name-only task_to_configs lookup; keep regex that strips function memory addresses for hash determinism

litellm.completion expects an int, not a (N,) tuple.

Current RAG-style tasks need the row-specific retrieved context to live in the system role, not prepended to the user query. Opt-in flag keeps all existing tasks unchanged.

…ge LLM (to avoid some memory errors)

…ntually called)

squad_v2 was filtering out questions with no answer, which is exactly the half of the dataset that tests refusal behavior. Replace the filter with an explicit "unanswerable" choice.

…options, not all the possible ones. Also increase generation_size from 100 to 1024 (for thinking models)

The generator had been narrowed to MCFFormulation + the ALL label only, which dropped the _cf/_hybrid variants and the CA/CS/UNK labels. Restore the full formulation list and sensitivity labels.

…kken + max_images to skip vision profiling)

…dict=False to get token ids, not a BatchEncoding)

…huggingface#1067 regression)

…text' setting

NathanHB and others added 30 commits October 14, 2025 16:06

Split up enhancement and features in release notes template (huggingf…

c2b83e2

…ace#984)

Fix nltk import failing (huggingface#1013)

3af8925

Fix 999: always provide parameters in the metric name to allow using …

70acb85

…several combinations (huggingface#1017) * fix * added a warning message * fix unit tests * fix unit tests 2 * mini fix * minifix * test * update new metrics name * updated var names

added fallback for incomplete configs for vlm models launched as llms (…

e7d885c

…huggingface#828) Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>

Fixing naming for sample evals + adding reqs in aime24 (huggingface#989)

161d47c

* homogeneize k and n in parametrizable metrics * updated aime, last metric fixs * fix * restore rm import * restore * update doc * gpqa fix * pass at * recall * test

add translation literals indic (huggingface#1015)

bf8b547

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>

adds mmlu-pro (huggingface#1031)

fa4860f

* adds mmlu-pro * adds mmlu-pro * add mmlu-pro with inspectai

Fix inspect reasoning effrot (huggingface#1033)

17e024b

* adds mmlu-pro * adds mmlu-pro * add mmlu-pro with inspectai * fix reasoning effrot

Update huggingface-cli login to use newer hf auth login (huggingface#…

97303ac

…1034)

add openai and inspect ai lower bound (huggingface#1035)

5aa09c5

fix lighteval task inspect command and tiny bench task (huggingface…

b5cbd91

…#992) * fix * revert uneeded changes --------- Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>

run all hf providers with :all (huggingface#1039)

5b7ca62

* run all hf-providers * add example * remove uneeded params

remove suites and make fewshot optional (huggingface#1038)

31433cc

* remove suites and make fewshot optional * fix docs to remove suites and fewshots * fix tests * fix tests * fix tests * fix tests * fix tests * fix tests * fix tests

put lower bound on typer to use literal type (huggingface#1042)

566a7be

remove suites from serbian_eval.py (huggingface#1044)

d04e4f9

neater bundle and logdir (huggingface#1043)

cd91dde

not forcing use_logits at True (huggingface#1050)

2247df7

Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>

wrong attribute self.k -> self.n (huggingface#1049)

6524c6a

Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>

Fix set using wrong syntax (huggingface#1057)

cb97d5c

set(1,2,3) -> {1,2,3} Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai> Co-authored-by: Francesco Bertolotti <francesco.bertolotti@gmail.com>

Fix: correct argument order in MajAtN.compute (huggingface#1058)

391d5b4

Co-authored-by: Kangda Wei <kangdawei@Kangdas-MacBook-Pro.local>

Update LiteLLM configuration for hosted_vllm provider (huggingface#1060)

af6b5b4

even though vllm produces openai compatible endpoint, to make work you have to use provider as hosted_vllm and use a hosted_vllm prefix prior to model name

use correct hf subset for ifbench multiturn (huggingface#1061)

ad58fed

One file one task definition (huggingface#1059)

babeec9

moves all the prompts from `default_prompts.py` to their respective task file

adding satrred tag for frontend

d9ea404

Adding AA Omniscience task (huggingface#1066)

5425c33

Jeronymous requested review from Lduignan1 and Oligou April 22, 2026 09:28

Jeronymous force-pushed the merge_hf_main branch from d621dce to d1cf663 Compare April 22, 2026 10:18

Jeronymous and others added 20 commits April 22, 2026 14:06

Fix ruff style and lint after merge

180975c

Solve version incompatibility in project install

2466d64

less differences with the upstream branch

68494ca

Add copyright

9ca1f4b

less differences with the upstream branch

6ee2a9e

do not build doc on fork

d9fe736

Add safety / red-teaming benchmarks

379ed71

fix max_tokens tuple bug in JudgeLM litellm call

a7febad

litellm.completion expects an int, not a (N,) tuple.

support per-doc system role via Doc.specific["instruction_as_system"]

b68623f

Current RAG-style tasks need the row-specific retrieved context to live in the system role, not prepended to the user query. Opt-in flag keeps all existing tasks unchanged.

Add environment variable to possibly tune the memory usage of the jud…

4ecdb69

…ge LLM (to avoid some memory errors)

make sure the memory of the LLM is freed (before the judge LLM is eve…

9f90fba

…ntually called)

Add generative task variant for MathAlea

ca639a2

keep unanswerable rows in squad_v2

5946dea

squad_v2 was filtering out questions with no answer, which is exactly the half of the dataset that tests refusal behavior. Replace the filter with an explicit "unanswerable" choice.

Fix MixEval: For FreeForm, the judge was onloy seeing the first good …

84f1717

…options, not all the possible ones. Also increase generation_size from 100 to 1024 (for thinking models)

add luciole_rag citation-aware grounded QA benchmark

7dabf27

Add Exo7 benchmark

cfb4c2d

Remove unsupported 'suite' argument from safety task configs

e03dd8a

Remove unsupported 'suite' argument from registry docstring example

eb76c0c

Restore CF/Hybrid formulations and sensitivity labels in global_mmlu

e0b6b4e

The generator had been narrowed to MCFFormulation + the ALL label only, which dropped the _cf/_hybrid variants and the CA/CS/UNK labels. Restore the full formulation list and sensitivity labels.

Add comet and metricx metrics to flores200

f122b15

Jeronymous force-pushed the merge_hf_main branch from 53a2e84 to f122b15 Compare June 17, 2026 13:38

Jeronymous added 5 commits June 17, 2026 17:45

vllm: fix Ministral on transformers v5 (mistral tokenizer_mode for te…

c36d670

…kken + max_images to skip vision profiling)

judge: fix vLLM judge on transformers v5 (apply_chat_template return_…

32707db

…dict=False to get token ids, not a BatchEncoding)

metrics: fix apply_metric for batched metrics returning list-of-dicts (…

e50ecd3

…huggingface#1067 regression)

Safety benchmarks: Use Llama Guard 4 judge. And don't compute 'no_con…

5535c34

…text' setting

Add AyaRedTeaming benchmark

9ab9827

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Update with the new mainstream structure#5

Update with the new mainstream structure#5
Jeronymous wants to merge 95 commits into
mainfrom
merge_hf_main

Jeronymous commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Uh oh!

Conversation

Jeronymous commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants