bump lm-eval to 0.4.11 by dtrifiro · Pull Request #48 · openshift-psap/llm-eval-test

dtrifiro · 2026-06-16T08:44:40Z

original author: @dhuangnm

Signed-off-by: Dan Huang <dahuang@redhat.com>

Summary by CodeRabbit

Chores
- Bumped project version to 0.4.3
- Updated multiple dependency versions including core libraries and ML evaluation packages

coderabbitai · 2026-06-16T08:44:54Z

📝 Walkthrough

Project version bumped to 0.4.3 in pyproject.toml, with lm-eval[api] pinned to 0.4.11 and the leaderboard optional dependency minimum raised to >=0.4.11. pylock.toml is regenerated with updated resolved versions for all transitive dependencies including ML, data, networking, and dev tooling packages.

Changes

lm-eval Upgrade and Lock Regeneration

| Layer / File(s) | Summary |
| --- | --- |
| Version and constraint bumps `pyproject.toml` | Project version set to `0.4.3`; `lm-eval[api]` pinned to `0.4.11`; `leaderboard` optional minimum raised to `>=0.4.11`. |
| lm-eval and ML tooling lock entries `pylock.toml` | `lm-eval` entry upgraded with revised dependency list; `latex2sympy2-extended`, `math-verify`, `sacrebleu`, `sentencepiece`, `tiktoken`, `scikit-learn`, `scipy`, `torch`, and `tqdm` versions updated. |
| Data and HuggingFace stack lock entries `pylock.toml` | `datasets`, `huggingface-hub`, `hf-xet`, `fsspec`, `evaluate`, `pyarrow`, `multiprocess`, `dill`, `xxhash`, and `zstandard` entries updated with new versions and constraint adjustments. |
| Networking and core runtime lock entries `pylock.toml` | `aiohttp`, `anyio`, `httpcore`, `httpx`, `requests`, `urllib3`, `frozenlist`, `multidict`, `yarl`, `certifi`, `charset-normalizer`, and `tenacity` entries updated. |
| Numeric, NLP, and text processing lock entries `pylock.toml` | `numpy`, `pandas`, `nltk`, `regex`, `lxml`, `more-itertools`, `narwhals`, `networkx`, `packaging`, `joblib`, and `tabledata`/`tabulate` entries updated. |
| Dev tooling lock entries `pylock.toml` | `pre-commit`, `ruff`, `virtualenv` (adds `python-discovery`), `nodeenv`, `identify`, `distlib`, `platformdirs`, `setuptools`, `typer`, `pygments`, `rich`, `pyyaml`, `pytz`, and related entries updated. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

openshift-psap/llm-eval-test#44: Also bumps lm-eval[api] version and updates pyproject.toml project metadata and lock file in the same pattern.
openshift-psap/llm-eval-test#46: Directly overlaps — bumps lm-eval[api] to 0.4.11 and raises the leaderboard optional minimum to >=0.4.11 in pyproject.toml.
openshift-psap/llm-eval-test#43: Modifies pyproject.toml to bump lm-eval[api] and associated optional dependency constraints at the same packaging level.

Poem

🐇 Hop, hop, the versions grow,
lm-eval jumps to 0.4.11 in a row!
The lockfile hashes spin and churn,
Each package bumped in its turn.
A new 0.4.3 tag gleams—
All patched up, or so it seems! 🌟

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: bumping lm-eval to version 0.4.11, which is the primary modification across both pyproject.toml and pylock.toml. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

pyproject.toml (1)
3-3: API compatibility checks out; consider adding test coverage for the lm-eval 0.4.11 upgrade.

The simple_evaluate() and TaskManager() APIs used in lm_eval_wrapper.py and downloader.py are compatible with lm-eval 0.4.11. The function parameters (e.g., apply_chat_template, fewshot_as_multiturn, task_manager) and TaskManager initialization pattern (include_path, include_defaults, verbosity) match the documented 0.4.11 API. The optional extras (ifeval, math, sentencepiece) are also available in this version. However, no test or validation documentation is present in the PR to demonstrate this upgrade was tested against the current code paths.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pyproject.toml` at line 3, The version bump to 0.4.11 in pyproject.toml lacks
supporting test coverage or validation documentation. Add test cases or test
documentation that explicitly verify the lm-eval 0.4.11 upgrade works with the
current code paths used in lm_eval_wrapper.py and downloader.py, including
validation of the simple_evaluate() function calls and TaskManager
initialization patterns to demonstrate the upgrade was properly tested.
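One lightweight way to back the compatibility claim in this nitpick is a signature smoke check. The sketch below is an assumption-laden example, not the project's test suite: `missing_parameters` and `check_lm_eval` are hypothetical helper names, and the keyword lists are copied from the comment above. `check_lm_eval` assumes `simple_evaluate` is importable from `lm_eval` and `TaskManager` from `lm_eval.tasks`, and it only runs where lm-eval is installed.

```python
import inspect
from typing import Callable, Iterable

# Keyword names taken from the review comment above.
SIMPLE_EVALUATE_KWARGS = ("apply_chat_template", "fewshot_as_multiturn", "task_manager")
TASK_MANAGER_KWARGS = ("include_path", "include_defaults", "verbosity")


def missing_parameters(func: Callable, expected: Iterable[str]) -> list[str]:
    """Return the expected keyword names that func's signature does not accept."""
    params = inspect.signature(func).parameters
    # A **kwargs catch-all accepts any keyword, so nothing can be reported as missing.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [name for name in expected if name not in params]


def check_lm_eval() -> dict[str, list[str]]:
    """Check the installed lm-eval; raises ImportError if lm-eval is not installed."""
    # Assumed import locations; adjust if the installed version exposes them elsewhere.
    from lm_eval import simple_evaluate
    from lm_eval.tasks import TaskManager

    return {
        "simple_evaluate": missing_parameters(simple_evaluate, SIMPLE_EVALUATE_KWARGS),
        "TaskManager": missing_parameters(TaskManager, TASK_MANAGER_KWARGS),
    }
```

An empty list for each entry means the wrapper's keyword arguments are still accepted. This checks signatures only, so it does not replace running an actual evaluation.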

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pylock.toml`:
- Line 849: The lockfile contains sqlitedict==2.1.0 at line 883, which has a
HIGH severity deserialization vulnerability (GHSA-g4r7-86gm-pgqc) and cannot be
released to production. This is a transitive dependency of lm-eval==0.4.11
located at lines 412-434. To resolve this blocking security issue, you must
either upgrade lm-eval to a newer version that depends on a patched sqlitedict
(version >2.1.0), or add an explicit sqlitedict constraint in the dependencies
to pin it to a patched version. Check the lm-eval project for available releases
that address this CVE, update the lm-eval version constraint accordingly, and
regenerate the lockfile to ensure sqlitedict is resolved to a patched version.
- Line 170: The PR introduces significant dependency version upgrades (pandas
3.0.3, datasets 5.0.0, numpy 2.4.6, torch 2.12.0) in the pylock.toml lockfile
without providing test evidence of compatibility. You need to run the evaluation
pipeline tests to validate that the codebase remains functional with these new
versions, specifically testing the numpy operations in truthfulqa/utils.py, the
downstream code paths in lm_eval_wrapper.py and downloader.py, the benchmark
tasks under benchmarks/tasks/**, and the leaderboard extra tasks (math, ifeval,
sentencepiece). Execute the comprehensive test suite, capture the results
demonstrating successful execution and numerical correctness of the scoring
pipeline, and add this test evidence to the PR description or commit message to
confirm no regressions were introduced by these dependency updates.

---

Nitpick comments:
In `@pyproject.toml`:
- Line 3: The version bump to 0.4.11 in pyproject.toml lacks supporting test
coverage or validation documentation. Add test cases or test documentation that
explicitly verify the lm-eval 0.4.11 upgrade works with the current code paths
used in lm_eval_wrapper.py and downloader.py, including validation of the
simple_evaluate() function calls and TaskManager initialization patterns to
demonstrate the upgrade was properly tested.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1a70bf93-fec9-4888-bf85-bc0a2be7c432

📥 Commits

Reviewing files that changed from the base of the PR and between b227674 and 9bfd2db.

⛔ Files ignored due to path filters (1)

pdm.lock is excluded by !**/*.lock

📒 Files selected for processing (2)

pylock.toml
pyproject.toml

coderabbitai · 2026-06-16T08:50:36Z

 name = "datasets"
-version = "3.6.0"
-requires-python = ">=3.9.0"
+version = "5.0.0"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the repository structure
git ls-files | head -20

Repository: openshift-psap/llm-eval-test

Length of output: 861

🏁 Script executed:

# Find and examine the pylock.toml file
find . -name "pylock.toml" -o -name "*.lock" -o -name "requirements*.txt" | head -20

Repository: openshift-psap/llm-eval-test

Length of output: 116

🏁 Script executed:

# Look for test files
find . -name "*test*.py" -o -name "tests" -type d | head -20

Repository: openshift-psap/llm-eval-test

Length of output: 54

🏁 Script executed:

# Search for CI/CD configuration
find . -name ".github" -type d -o -name ".gitlab-ci.yml" -o -name "*.yml" | grep -E "(github|gitlab|circle)" | head -20

Repository: openshift-psap/llm-eval-test

Length of output: 117

🏁 Script executed:

# Read pylock.toml to verify the versions mentioned in the review
cat -n pylock.toml | head -100

Repository: openshift-psap/llm-eval-test

Length of output: 3021

🏁 Script executed:

# Check if these specific lines have the versions mentioned
sed -n '170p;596p;578p;664p;963p' pylock.toml

Repository: openshift-psap/llm-eval-test

Length of output: 164

🏁 Script executed:

# Look for test files and CI configuration
find . -type f -name "*test*.py" -o -name "conftest.py" 2>/dev/null | head -20

Repository: openshift-psap/llm-eval-test

Length of output: 54

🏁 Script executed:

# Check the CI workflow
cat -n .github/workflows/pre-commit.yml

Repository: openshift-psap/llm-eval-test

Length of output: 496

🏁 Script executed:

# Search for the mentioned files
find . -name "lm_eval_wrapper.py" -o -name "downloader.py" -o -name "truthfulqa" -type d 2>/dev/null

Repository: openshift-psap/llm-eval-test

Length of output: 193

🏁 Script executed:

# Check what package is on line 664
sed -n '660,670p' pylock.toml

Repository: openshift-psap/llm-eval-test

Length of output: 280

🏁 Script executed:

# Look for any test files more thoroughly
find . -type f \( -name "test_*.py" -o -name "*_test.py" \) 2>/dev/null

Repository: openshift-psap/llm-eval-test

Length of output: 54

🏁 Script executed:

# Check pyproject.toml for test configuration
cat pyproject.toml

Repository: openshift-psap/llm-eval-test

Length of output: 1422

🏁 Script executed:

# Look at the mentioned files to understand what they do
cat -n ./src/llm_eval_test/downloader.py | head -50

Repository: openshift-psap/llm-eval-test

Length of output: 2542

🏁 Script executed:

# Check lm_eval_wrapper.py
cat -n ./src/llm_eval_test/lm_eval_wrapper.py | head -50

Repository: openshift-psap/llm-eval-test

Length of output: 2545

🏁 Script executed:

# Check if there are any test scripts or pytest config
find . -name "pytest.ini" -o -name "tox.ini" -o -name ".pytest.ini" 2>/dev/null

Repository: openshift-psap/llm-eval-test

Length of output: 54

🏁 Script executed:

# Check the truthfulqa utils file for numpy usage
cat -n ./src/llm_eval_test/benchmarks/tasks/truthfulqa/utils.py

Repository: openshift-psap/llm-eval-test

Length of output: 7519

🏁 Script executed:

# Check if there's any actual numpy or pandas usage in the codebase
rg "import numpy|from numpy|import pandas|from pandas" --type py

Repository: openshift-psap/llm-eval-test

Length of output: 146

🏁 Script executed:

# Check what exactly is imported from lm_eval
rg "from lm_eval|import lm_eval" --type py

Repository: openshift-psap/llm-eval-test

Length of output: 1199

🏁 Script executed:

# Check if there are any GitHub Actions workflow files or test commands
find . -type f -name "*.yml" -o -name "*.yaml" 2>/dev/null | xargs cat 2>/dev/null

Repository: openshift-psap/llm-eval-test

Length of output: 50384

🏁 Script executed:

# Check git history to see if tests were recently removed or if there's a test branch
git log --oneline --all | head -20

Repository: openshift-psap/llm-eval-test

Length of output: 850

🏁 Script executed:

# Check if there are any scripts that run tests or validation
find . -name "*.sh" -o -name "Makefile" -o -name "justfile" 2>/dev/null

Repository: openshift-psap/llm-eval-test

Length of output: 54

🏁 Script executed:

# Look for any pandas/numpy/torch direct usage in the codebase
rg "pandas\.|pd\.|np\.|torch\." --type py -c

Repository: openshift-psap/llm-eval-test

Length of output: 196

🏁 Script executed:

# Check the actual dependency versions in pdm.lock to cross-verify
grep -A 2 'name = "pandas"' pdm.lock | head -10

Repository: openshift-psap/llm-eval-test

Length of output: 133

🏁 Script executed:

# Check what lm-eval 0.4.11 changed
web_search lm-eval 0.4.11 release notes breaking changes compatibility

Repository: openshift-psap/llm-eval-test

Length of output: 169

🌐 Web query:

lm-eval 0.4.11 release notes breaking changes

💡 Result:

The lm-eval v0.4.11 release (published February 13, 2026) is characterized as a minor release with no breaking changes in its own changelog [1][2]. Users should note that major breaking changes to installation and CLI usage were introduced in the preceding v0.4.10 release, rather than v0.4.11 [3][4]. If migrating from versions prior to v0.4.10, the following changes are applicable:

- Lightweight Core Installation: The base package lm-eval no longer includes heavy dependencies like transformers or torch by default [5][6][4]. Users must now install required backends explicitly using extras [4]:
  - `pip install lm_eval[hf]` for HuggingFace/transformers/torch
  - `pip install lm_eval[vllm]` for vLLM
  - `pip install lm_eval[api]` for OpenAI, Anthropic, etc.
- CLI Refactor: The CLI was refactored to support explicit subcommands (run, ls, validate) and YAML configuration files [5][7][4]. While the previous command syntax (e.g., `lm-eval --model hf --tasks hellaswag`) remains backward compatible, the new recommended usage is `lm-eval run --model hf --tasks hellaswag` [3][4].
- API Changes: Accessing model classes directly via attribute no longer works (e.g., `import lm_eval.models; lm_eval.models.huggingface.HFLM` now raises an AttributeError) [3][4]. Direct imports such as `from lm_eval.models.huggingface import HFLM` remain functional [3][4].
- Task Versions: The v0.4.11 release includes updates to specific task versions (such as afrobench_belebele, evalita_llm, and mgsm_direct), which may impact the comparability of results with previous versions [1][2].

Citations:

1: https://github.com/EleutherAI/lm-evaluation-harness/releases/tag/v0.4.11

2: https://zenodo.org/records/18636344

3: https://github.com/EleutherAI/lm-evaluation-harness/releases

4: https://github.com/EleutherAI/lm-evaluation-harness/releases/tag/v0.4.10

5: https://pypi.org/project/lm-eval/0.4.11/

6: https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.11

7: https://github.com/EleutherAI/lm-evaluation-harness/blob/v0.4.11/README.md

🌐 Web query:

pandas 3.0.0 breaking changes dtype handling

💡 Result:

Pandas 3.0.0 introduced significant breaking changes to dtype handling, primarily centered around a new default string data type and consistent treatment of missing values [1][2][3].

Dedicated String Dtype

Starting in version 3.0, pandas infers string data as the new dedicated str dtype instead of the generic numpy object dtype [1][2]. This change is intended to improve performance and type safety [2]. If the pyarrow library is installed, this new dtype is backed by PyArrow; otherwise, it uses an object-dtype fallback [1][4]. Key breaking consequences of this change include:

- Type Checking: Code that checks for object dtype (e.g., `ser.dtype == object`) when expecting strings will break because the new str dtype is a pandas extension dtype, not a numpy dtype [5][2][6]. It is incompatible with functions that expect numpy types, such as `np.issubdtype` [5][7].
- Missing Value Sentinels: The new str dtype consistently uses NaN (`np.nan`) as the missing value sentinel [5][6]. Previously, object-dtype columns could contain a mix of None and NaN [5][6]. While pandas still accepts None as input, it is now converted to NaN [6].
- Strictness: Because the new string dtype is strictly typed for strings, workflows that previously relied on object dtype to store mixed-type data may fail unless explicitly forced back to object dtype [6].

Consistent NaN and NA Treatment

Pandas 3.0 also standardized the treatment of missing values across nullable dtypes (e.g., Float64Dtype, int64[pyarrow]) [1]. NaN is now consistently treated as equivalent to NA in all cases, including in constructors, setitem, and contains operations [1]. Consequently, arithmetic and universal function operations that previously resulted in NaN will now produce NA [1]. An option to distinguish these values remains available for future use via `pd.options.future.distinguish_nan_and_na` [1].

For detailed migration guidance, users should refer to the official pandas migration guide for the new string data type [5][7].

Citations:

1: https://pandas.pydata.org/docs/whatsnew/v3.0.0.html

2: https://pandas.pydata.org/community/blog/pandas-3.0.html

3: https://github.com/pandas-dev/pandas/releases/tag/v3.0.0

4: https://pandas.pydata.org/pandas-docs/version/3.0.1/user_guide/text.html

5: https://pandas.pydata.org/pandas-docs/version/3.0/user_guide/migration-3-strings.html

6: String dtype: overview of breaking behaviour changes pandas-dev/pandas#59328

7: https://pandas.pydata.org/pandas-docs/version/3.0.1/user_guide/migration-3-strings.html

🌐 Web query:

numpy 2.0 breaking changes behavior

💡 Result:

NumPy 2.0 introduced significant breaking changes, primarily impacting the Application Binary Interface (ABI), the Python API namespace, and data type promotion rules [1][2]. Because these changes were incompatible with previous versions, they required a major release [1][3]. Key breaking changes include:

- ABI and C API Breakage: NumPy 2.0 introduced an ABI break, meaning binaries of packages that depend on the NumPy C API and were built against NumPy 1.x versions will not work with NumPy 2.0 [1][2]. Such packages typically raise an ImportError upon loading [1][2]. The internal structures, such as PyArray_Descr, were made more opaque, and complex type implementations were updated to use native C99 types, necessitating changes for projects using these low-level APIs [4].
- Namespace Cleanup: To simplify the library, approximately 100 members were removed, moved, or deprecated from the main numpy (np) namespace [5][4]. This effort aimed to reduce clutter, remove outdated functionality, and clarify the distinction between public and private API elements [4][6][3]. Users are advised to replace deprecated aliases with their recommended, often backwards-compatible, alternatives [5][6].
- Data Type Promotion: NumPy 2.0 adopted NEP 50, which changed how data type promotion (combining dissimilar types) occurs [5][4]. Additionally, the default integer type was changed to be consistent as np.intp on all platforms (previously, it could be equivalent to the C long type depending on the system/history) [5][4].
- Migration Resources: To assist with the transition, the NumPy team released a comprehensive NumPy 2.0 migration guide [5][7][3]. Furthermore, developers provided a Ruff plugin to help automatically update Python code so that it remains compatible with both NumPy 1.x and 2.x versions [3]. Downstream packages are generally advised to recompile against NumPy 2.0 to ensure compatibility while maintaining support for 1.26.x if needed [1][2][3].

Top results: [1][5][4][3]

Citations:

1: https://numpy.org/doc/2.0/release/2.0.0-notes.html

2: https://github.com/numpy/numpy/releases/tag/v2.0.0

3: https://blog.scientific-python.org/numpy/numpy2/

4: https://numpy.org/doc/2.0/numpy_2_0_migration_guide.html

5: https://numpy.org/devdocs/numpy_2_0_migration_guide.html

6: https://www.datacamp.com/tutorial/numpy-2-release

7: https://numpy.org/doc/2.2/numpy_2_0_migration_guide.html

Provide test evidence for major dependency version updates.

The lockfile includes significant version bumps (pandas 3.0.3, datasets 5.0.0, numpy 2.4.6, torch 2.12.0), but no test results, CI/CD validation, or integration testing is provided in the PR to confirm compatibility. The codebase uses numpy operations (np.exp(), np.array(), np.sum(), np.nanmax() in truthfulqa/utils.py) that should remain functional with numpy 2.0, but without test evidence it is unclear whether:

- The downstream code in lm_eval_wrapper.py, downloader.py, and benchmarks/tasks/** works with these new versions
- The leaderboard extra tasks (math, ifeval, sentencepiece) execute without errors
- Numerical scoring in truthfulqa produces correct results
- Any regressions were introduced

Provide test results demonstrating successful execution of the evaluation pipeline with these dependency versions, or run a comprehensive test suite to validate integration.
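Before running the pipeline itself, it can help to confirm that the environment under test actually matches the versions named in this comment. The sketch below is a minimal example using only `importlib.metadata`; the helper names `installed_version` and `version_mismatches` are hypothetical, and the expected versions are the four quoted above. It verifies what is installed, not whether the code behaves correctly with it.

```python
from importlib import metadata
from typing import Callable, Optional

# Versions quoted in the review comment above.
EXPECTED = {"pandas": "3.0.3", "datasets": "5.0.0", "numpy": "2.4.6", "torch": "2.12.0"}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(
    expected: dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> dict[str, tuple[str, Optional[str]]]:
    """Map each package whose installed version differs from the expectation to (expected, found)."""
    mismatches = {}
    for name, want in expected.items():
        found = lookup(name)
        if found != want:
            mismatches[name] = (want, found)
    return mismatches


if __name__ == "__main__":
    # An empty dict means the environment matches the versions listed in EXPECTED.
    print(version_mismatches(EXPECTED))
```

The `lookup` parameter exists so the comparison can be exercised without installing the packages; recording the output alongside evaluation results would give reviewers the kind of evidence this comment asks for.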

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@pylock.toml` at line 170, The PR introduces significant dependency version upgrades (pandas 3.0.3, datasets 5.0.0, numpy 2.4.6, torch 2.12.0) in the pylock.toml lockfile without providing test evidence of compatibility. You need to run the evaluation pipeline tests to validate that the codebase remains functional with these new versions, specifically testing the numpy operations in truthfulqa/utils.py, the downstream code paths in lm_eval_wrapper.py and downloader.py, the benchmark tasks under benchmarks/tasks/**, and the leaderboard extra tasks (math, ifeval, sentencepiece). Execute the comprehensive test suite, capture the results demonstrating successful execution and numerical correctness of the scoring pipeline, and add this test evidence to the PR description or commit message to confirm no regressions were introduced by these dependency updates.

bump lm-eval to 0.4.11

9bfd2db

Signed-off-by: Dan Huang <dahuang@redhat.com>

coderabbitai Bot reviewed Jun 16, 2026

dtrifiro merged commit 357ff9a into openshift-psap:main Jun 16, 2026
2 checks passed

dtrifiro deleted the bump-lm-eval branch June 16, 2026 09:08

This was referenced Jun 16, 2026

update lm eval to 0.4.10 and bumped project version to 0.4.3 #45

Closed

Update files for lm-eva 0.4.11 #47

Closed
