Update with the new mainstream structure#5
Open
Jeronymous wants to merge 95 commits into
Open
Conversation
* option1 * also debugging the judge * also debugging the judge * debug * eval tracker fix 1 * likely fix for the GSM+ issue * stringify model judge + change max_length to what's actually passed instead of setting a bunch of overwrites * more memory for flow judge
…several combinations (huggingface#1017) * fix * added a warning message * fix unit tests * fix unit tests 2 * mini fix * minifix * test * update new metrics name * updated var names
…huggingface#828) Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
* homogeneize k and n in parametrizable metrics * updated aime, last metric fixs * fix * restore rm import * restore * update doc * gpqa fix * pass at * recall * test
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* use inspect-ai to evaluate aime25 and gsm8k * revert file * working for 3 tasks * parallel evals of tasks * adds gpqa diamond to inspect * move tasks to individual files * move tasks to individual files * enable extended tasks as well * run precomit hook * fix mkqa * chaange extended suite to lighteval * chaange extended suite to lighteval * add metdata to tasks * add metdata to tasks * remove license notice and put docstring on top of file * homogenize tags * add docstring for all multilingual tasks * add docstring for all multilingual tasks * add name and dataset to metadata * use TASKS_TABLE for multilingual tasks * use TASKS_TABLE for default tasks * use TASKS_TABLE for default tasks * loads all tasks correclty * move community tasks to default tasks and update doc * move community tasks to default tasks and update doc * revert uneeded changes * fix doc build * fix doc build * remove custom tasks and let user decide if loading multilingual tasks * load-tasks multilingual fix * update doc * remove uneeded file * update readme * update readme * update readme * fix test * add back the custom tasks * add back the custom tasks * fix tasks * fix tasks * fix tasks * fix tests * fix tests
adds inspect-ai as backend for lighteval! Offloading backend implementation and maintenance - this allows for: - better logs - better paralelixzation - easier to add tasks tasks compatible with inspect ai (at term all the tasks will be compatible): - gpqa (fewshot compatible) - ifeval - hle - gsm8k (fewshot compatible) - agieval - aime24,25 ### run llama3.1-8b using all providers on `hf-inference-providers` on `gpqa`, `agieval` and `aime25`: ``` lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \ "lighteval|gpqa|0,lighteval|agieval|0,lighteval|aime25|0" \ max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1 ``` result: ``` | Model |agieval|aime25|gpqa| |----------------------------------------------------------------------|------:|-----:|---:| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras | 0.53| 0|0.33| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai| 0.71| 1|0.75| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai | 0.71| 0|0.25| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius | 0.53| 0|0.20| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita | 0.65| 0|0.75| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova | 0.71| 0|0.25| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway | 0.35| 0|0.25| ``` ### compare few shots diff on gsm8k ``` lighteval eval hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway \ hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nscale \ "lighteval|gsm8k|0,lighteval|gsm8k|3" \ max-connections 50 --timeout 30 --retry-on-error 1 --max-retries 5 --epochs 1 --max-samples 1 ``` ``` | Model |gsm8k|gsm8k_3_shots| |----------------------------------------------------------------------|----:|------------:| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:cerebras | 0.6| 0.7| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:featherless-ai| 0.7| 0.7| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:fireworks-ai | 0.7| 0.8| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:nebius | 0.6| 0.7| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:novita | 0.5| 0.7| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:sambanova | 0.7| 0.7| |hf-inference-providers/meta-llama/Llama-3.1-8B-Instruct:scaleway | 0.4| 0.8| ``` --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* adds mmlu-pro * adds mmlu-pro * add mmlu-pro with inspectai
* adds mmlu-pro * adds mmlu-pro * add mmlu-pro with inspectai * fix reasoning effrot
…#992) * fix * revert uneeded changes --------- Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* run all hf-providers * add example * remove uneeded params
* remove suites and make fewshot optional * fix docs to remove suites and fewshots * fix tests * fix tests * fix tests * fix tests * fix tests * fix tests * fix tests
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
…uggingface#1051) * remove suites and make fewshot optional * fix docs to remove suites and fewshots * fix tests * fix tests * fix tests * fix tests * fix tests * fix tests * fix tests * Remove suite argument iin task config * Remove suite argument iin task config * fix try to cache functool.partial function * fix styling
…gface#1052) * add a task dump in registry for better documentation of tasks * Update src/lighteval/tasks/registry.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update src/lighteval/tasks/registry.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update src/lighteval/tasks/registry.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix * remove * fix aimo --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
set(1,2,3) -> {1,2,3}
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@gmail.com>
Co-authored-by: Kangda Wei <kangdawei@Kangdas-MacBook-Pro.local>
even though vllm produces openai compatible endpoint, to make work you have to use provider as hosted_vllm and use a hosted_vllm prefix prior to model name
moves all the prompts from `default_prompts.py` to their respective task file
Upstream refactor splits src/lighteval/tasks into per-task files under src/lighteval/tasks/tasks/ and src/lighteval/tasks/multilingual/tasks/, drops default_tasks.py / default_prompts.py / multilingual/tasks.py, and removes the suite field from LightevalTaskConfig. Port our edits to the new structure: - tasks/gsm_plus.py: generation_size 16384 - tasks/gsm8k.py: generation_size 2048 - tasks/mgsm.py: hf_revision, suffix exact_match + expr_gold_metric, language-specific stop sequences for all 11 subsets - tasks/piqa.py: switch to lighteval/piqa mirror - tasks/siqa.py: pin hf_revision - tasks/mmlu_pro.py: fix upstream's hardcoded ABCD letters so the prompt uses dynamic letters based on the number of options; add a parallel mmlu_pro_raw task exposing the handmade prompt (no inspect_ai) - tasks/ruler.py: new home for the ruler prompt helper - tasks/advbench.py: move here from community_tasks/ - multilingual/tasks/mathalea.py: move here from community_tasks/ - multilingual/tasks/french.py: keep jzhang86/fr_ifeval fallback and the generative GPQA-fr-diamond variant with prompt_gpqa_fr_instruct Other conflict resolutions: - pyproject.toml: take upstream unpinned transformers, vllm>=0.11.0, new inspect-ai and openai deps - vllm_model.py: keep max_seq_len_to_capture fallback, Mistral eos_token guard, prefix-cache None-skip in logprob loop, and skip_reading_prefix_cache via guarded attribute assignment; adopt upstream's build_vllm_token_prompts helper - llm_as_judge.py: keep max_model_len=65536, adopt upstream's api_key/base_url litellm pass-through - lighteval_task.py: preserve name/data_dir fallback in load_dataset while picking up upstream's data_files support; keep partial args detail in __str__ for deterministic cache hashing - cache_management.py: adopt name-only task_to_configs lookup; keep regex that strips function memory addresses for hash determinism
d621dce to
d1cf663
Compare
litellm.completion expects an int, not a (N,) tuple.
Current RAG-style tasks need the row-specific retrieved context to live in the system role, not prepended to the user query. Opt-in flag keeps all existing tasks unchanged.
…ge LLM (to avoid some memory errors)
squad_v2 was filtering out questions with no answer, which is exactly the half of the dataset that tests refusal behavior. Replace the filter with an explicit "unanswerable" choice.
…options, not all the possible ones. Also increase generation_size from 100 to 1024 (for thinking models)
The generator had been narrowed to MCFFormulation + the ALL label only, which dropped the _cf/_hybrid variants and the CA/CS/UNK labels. Restore the full formulation list and sensitivity labels.
53a2e84 to
f122b15
Compare
…kken + max_images to skip vision profiling)
…dict=False to get token ids, not a BatchEncoding)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.