Skip to content

Update with the new mainstream structure#5

Open
Jeronymous wants to merge 95 commits into
mainfrom
merge_hf_main
Open

Update with the new mainstream structure#5
Jeronymous wants to merge 95 commits into
mainfrom
merge_hf_main

Conversation

@Jeronymous

Copy link
Copy Markdown
Member

No description provided.

NathanHB and others added 30 commits October 14, 2025 16:06
* option1

* also debugging the judge

* also debugging the judge

* debug

* eval tracker fix 1

* likely fix for the GSM+ issue

* stringify model judge + change max_length to what's actually passed instead of setting a bunch of overwrites

* more memory for flow judge
…several combinations (huggingface#1017)

* fix

* added a warning message

* fix unit tests

* fix unit tests 2

* mini fix

* minifix

* test

* update new metrics name

* updated var names
…huggingface#828)

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
* homogeneize k and n in parametrizable metrics

* updated aime, last metric fixs

* fix

* restore rm import

* restore

* update doc

* gpqa fix

* pass at

* recall

* test
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* use inspect-ai to evaluate aime25 and gsm8k

* revert file

* working for 3 tasks

* parallel evals of tasks

* adds gpqa diamond to inspect

* move tasks to individual files

* move tasks to individual files

* enable extended tasks as well

* run precomit hook

* fix mkqa

* chaange extended suite to lighteval

* chaange extended suite to lighteval

* add metdata to tasks

* add metdata to tasks

* remove license notice and put docstring on top of file

* homogenize tags

* add docstring for all multilingual tasks

* add docstring for all multilingual tasks

* add name and dataset to metadata

* use TASKS_TABLE for multilingual tasks

* use TASKS_TABLE for default tasks

* use TASKS_TABLE for default tasks

* loads all tasks correclty

* move community tasks to default tasks and update doc

* move community tasks to default tasks and update doc

* revert uneeded changes

* fix doc build

* fix doc build

* remove custom tasks and let user decide if loading multilingual tasks

* load-tasks multilingual fix

* update doc

* remove uneeded file

* update readme

* update readme

* update readme

* fix test

* add back the custom tasks

* add back the custom tasks

* fix tasks

* fix tasks

* fix tasks

* fix tests

* fix tests
adds inspect-ai as backend for lighteval! Offloading backend implementation and maintenance

- this allows for:
- better logs
- better paralelixzation
- easier to add tasks

tasks compatible with inspect ai (at term all the tasks will be compatible):

- gpqa (fewshot compatible)
- ifeval
- hle
- gsm8k (fewshot compatible)
- agieval
- aime24,25

### run llama3.1-8b using all providers on `hf-inference-providers` on `gpqa`, `agieval` and `aime25`:

```
lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
"lighteval|gpqa|0,lighteval|agieval|0,lighteval|aime25|0" \
max-connections 50 --timeout 30  --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1
```

result:

```
|                                Model                                 |agieval|aime25|gpqa|
|----------------------------------------------------------------------|------:|-----:|---:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras      |   0.53|     0|0.33|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai|   0.71|     1|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai  |   0.71|     0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius        |   0.53|     0|0.20|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita        |   0.65|     0|0.75|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova     |   0.71|     0|0.25|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway      |   0.35|     0|0.25|
```


### compare few shots diff on gsm8k

```
lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \
hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \
"lighteval|gsm8k|0,lighteval|gsm8k|3" \
max-connections 50 --timeout 30  --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1
```

```
|                                Model                                 |gsm8k|gsm8k_3_shots|
|----------------------------------------------------------------------|----:|------------:|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras      |  0.6|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai|  0.7|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai  |  0.7|          0.8|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius        |  0.6|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita        |  0.5|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova     |  0.7|          0.7|
|hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway      |  0.4|          0.8|
```

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* adds mmlu-pro

* adds mmlu-pro

* add mmlu-pro with inspectai
* adds mmlu-pro

* adds mmlu-pro

* add mmlu-pro with inspectai

* fix reasoning effrot
…#992)

* fix

* revert uneeded changes

---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* run all hf-providers

* add example

* remove uneeded params
* remove suites and make fewshot optional

* fix docs to remove suites and fewshots

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
…uggingface#1051)

* remove suites and make fewshot optional

* fix docs to remove suites and fewshots

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* fix tests

* Remove suite argument iin task config

* Remove suite argument iin task config

* fix try to cache functool.partial function

* fix styling
…gface#1052)

* add a task dump in registry for better documentation of tasks

* Update src/lighteval/tasks/registry.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/lighteval/tasks/registry.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/lighteval/tasks/registry.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix

* remove

* fix aimo

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
set(1,2,3) -> {1,2,3}

Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@gmail.com>
Co-authored-by: Kangda Wei <kangdawei@Kangdas-MacBook-Pro.local>
even though vllm produces openai compatible endpoint, to make work you have to use provider as hosted_vllm and use a hosted_vllm prefix prior to model name
moves all the prompts from `default_prompts.py` to their respective task file
@Jeronymous Jeronymous requested review from Lduignan1 and Oligou April 22, 2026 09:28
Upstream refactor splits src/lighteval/tasks into per-task files under
src/lighteval/tasks/tasks/ and src/lighteval/tasks/multilingual/tasks/,
drops default_tasks.py / default_prompts.py / multilingual/tasks.py, and
removes the suite field from LightevalTaskConfig.

Port our edits to the new structure:
- tasks/gsm_plus.py: generation_size 16384
- tasks/gsm8k.py: generation_size 2048
- tasks/mgsm.py: hf_revision, suffix exact_match + expr_gold_metric,
  language-specific stop sequences for all 11 subsets
- tasks/piqa.py: switch to lighteval/piqa mirror
- tasks/siqa.py: pin hf_revision
- tasks/mmlu_pro.py: fix upstream's hardcoded ABCD letters so the prompt
  uses dynamic letters based on the number of options; add a parallel
  mmlu_pro_raw task exposing the handmade prompt (no inspect_ai)
- tasks/ruler.py: new home for the ruler prompt helper
- tasks/advbench.py: move here from community_tasks/
- multilingual/tasks/mathalea.py: move here from community_tasks/
- multilingual/tasks/french.py: keep jzhang86/fr_ifeval fallback and the
  generative GPQA-fr-diamond variant with prompt_gpqa_fr_instruct

Other conflict resolutions:
- pyproject.toml: take upstream unpinned transformers, vllm>=0.11.0,
  new inspect-ai and openai deps
- vllm_model.py: keep max_seq_len_to_capture fallback, Mistral eos_token
  guard, prefix-cache None-skip in logprob loop, and
  skip_reading_prefix_cache via guarded attribute assignment; adopt
  upstream's build_vllm_token_prompts helper
- llm_as_judge.py: keep max_model_len=65536, adopt upstream's
  api_key/base_url litellm pass-through
- lighteval_task.py: preserve name/data_dir fallback in load_dataset
  while picking up upstream's data_files support; keep partial args
  detail in __str__ for deterministic cache hashing
- cache_management.py: adopt name-only task_to_configs lookup; keep
  regex that strips function memory addresses for hash determinism
Jeronymous and others added 20 commits April 22, 2026 14:06
litellm.completion expects an int, not a (N,) tuple.
Current RAG-style tasks need the row-specific retrieved context to
live in the system role, not prepended to the user query. Opt-in
flag keeps all existing tasks unchanged.
squad_v2 was filtering out questions with no answer, which is
exactly the half of the dataset that tests refusal behavior.
Replace the filter with an explicit "unanswerable" choice.
…options, not all the possible ones. Also increase generation_size from 100 to 1024 (for thinking models)
The generator had been narrowed to MCFFormulation + the ALL label only,
which dropped the _cf/_hybrid variants and the CA/CS/UNK labels. Restore
the full formulation list and sensitivity labels.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.