bump lm-eval to 0.4.11 #48

Merged
dtrifiro merged 1 commit into openshift-psap:main from dtrifiro:bump-lm-eval on Jun 16, 2026

Conversation

@dtrifiro dtrifiro commented Jun 16, 2026

Collaborator

original author: @dhuangnm

Signed-off-by: Dan Huang <dahuang@redhat.com>

Summary by CodeRabbit

  • Chores
    • Bumped project version to 0.4.3
    • Updated multiple dependency versions including core libraries and ML evaluation packages

Signed-off-by: Dan Huang <dahuang@redhat.com>
coderabbitai Bot commented Jun 16, 2026

Walkthrough

Project version bumped to 0.4.3 in pyproject.toml, with lm-eval[api] pinned to 0.4.11 and the leaderboard optional dependency minimum raised to >=0.4.11. pylock.toml is regenerated with updated resolved versions for all transitive dependencies including ML, data, networking, and dev tooling packages.

Changes

lm-eval Upgrade and Lock Regeneration

| Layer / File(s) | Summary |
|---|---|
| Version and constraint bumps (pyproject.toml) | Project version set to 0.4.3; lm-eval[api] pinned to 0.4.11; leaderboard optional minimum raised to >=0.4.11. |
| lm-eval and ML tooling lock entries (pylock.toml) | lm-eval entry upgraded with revised dependency list; latex2sympy2-extended, math-verify, sacrebleu, sentencepiece, tiktoken, scikit-learn, scipy, torch, and tqdm versions updated. |
| Data and HuggingFace stack lock entries (pylock.toml) | datasets, huggingface-hub, hf-xet, fsspec, evaluate, pyarrow, multiprocess, dill, xxhash, and zstandard entries updated with new versions and constraint adjustments. |
| Networking and core runtime lock entries (pylock.toml) | aiohttp, anyio, httpcore, httpx, requests, urllib3, frozenlist, multidict, yarl, certifi, charset-normalizer, and tenacity entries updated. |
| Numeric, NLP, and text processing lock entries (pylock.toml) | numpy, pandas, nltk, regex, lxml, more-itertools, narwhals, networkx, packaging, joblib, and tabledata/tabulate entries updated. |
| Dev tooling lock entries (pylock.toml) | pre-commit, ruff, virtualenv (adds python-discovery), nodeenv, identify, distlib, platformdirs, setuptools, typer, pygments, rich, pyyaml, pytz, and related entries updated. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐇 Hop, hop, the versions grow,
lm-eval jumps to 0.4.11 in a row!
The lockfile hashes spin and churn,
Each package bumped in its turn.
A new 0.4.3 tag gleams—
All patched up, or so it seems! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: bumping lm-eval to version 0.4.11, which is the primary modification across both pyproject.toml and pylock.toml. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
pyproject.toml (1)

3-3: API compatibility checks out; consider adding test coverage for the lm-eval 0.4.11 upgrade.

The simple_evaluate() and TaskManager() APIs used in lm_eval_wrapper.py and downloader.py are compatible with lm-eval 0.4.11. The function parameters (e.g., apply_chat_template, fewshot_as_multiturn, task_manager) and TaskManager initialization pattern (include_path, include_defaults, verbosity) match the documented 0.4.11 API. The optional extras (ifeval, math, sentencepiece) are also available in this version. However, no test or validation documentation is present in the PR to demonstrate this upgrade was tested against the current code paths.
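
One lightweight way to record this kind of compatibility check is to compare the keyword arguments a wrapper passes against the signature of the function it calls. The sketch below uses only the standard library; `fake_simple_evaluate` is a hypothetical stand-in, not the real `lm_eval.simple_evaluate` signature, and with lm-eval installed the real function would be passed instead.

```python
import inspect


def missing_parameters(func, expected):
    """Return the names in `expected` that `func` does not accept as parameters."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter accepts any keyword, so nothing can be reported missing.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [name for name in expected if name not in params]


# Hypothetical stand-in for lm_eval.simple_evaluate (assumed shape, for illustration).
def fake_simple_evaluate(model, tasks=None, apply_chat_template=False,
                         fewshot_as_multiturn=False, task_manager=None):
    return {}


# Parameter names mentioned in the review comment above.
EXPECTED = ["apply_chat_template", "fewshot_as_multiturn", "task_manager"]

print(missing_parameters(fake_simple_evaluate, EXPECTED))          # []
print(missing_parameters(fake_simple_evaluate, ["include_path"]))  # ['include_path']
```

A signature check only shows that the call would be accepted; it says nothing about behavioral changes, so it complements an end-to-end run rather than replacing one.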

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pyproject.toml` at line 3, The version bump to 0.4.11 in pyproject.toml lacks
supporting test coverage or validation documentation. Add test cases or test
documentation that explicitly verify the lm-eval 0.4.11 upgrade works with the
current code paths used in lm_eval_wrapper.py and downloader.py, including
validation of the simple_evaluate() function calls and TaskManager
initialization patterns to demonstrate the upgrade was properly tested.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pylock.toml`:
- Line 849: The lockfile contains sqlitedict==2.1.0 at line 883, which has a
HIGH severity deserialization vulnerability (GHSA-g4r7-86gm-pgqc) and cannot be
released to production. This is a transitive dependency of lm-eval==0.4.11
located at lines 412-434. To resolve this blocking security issue, you must
either upgrade lm-eval to a newer version that depends on a patched sqlitedict
(version >2.1.0), or add an explicit sqlitedict constraint in the dependencies
to pin it to a patched version. Check the lm-eval project for available releases
that address this CVE, update the lm-eval version constraint accordingly, and
regenerate the lockfile to ensure sqlitedict is resolved to a patched version.
- Line 170: The PR introduces significant dependency version upgrades (pandas
3.0.3, datasets 5.0.0, numpy 2.4.6, torch 2.12.0) in the pylock.toml lockfile
without providing test evidence of compatibility. You need to run the evaluation
pipeline tests to validate that the codebase remains functional with these new
versions, specifically testing the numpy operations in truthfulqa/utils.py, the
downstream code paths in lm_eval_wrapper.py and downloader.py, the benchmark
tasks under benchmarks/tasks/**, and the leaderboard extra tasks (math, ifeval,
sentencepiece). Execute the comprehensive test suite, capture the results
demonstrating successful execution and numerical correctness of the scoring
pipeline, and add this test evidence to the PR description or commit message to
confirm no regressions were introduced by these dependency updates.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1a70bf93-fec9-4888-bf85-bc0a2be7c432

📥 Commits

Reviewing files that changed from the base of the PR and between b227674 and 9bfd2db.

⛔ Files ignored due to path filters (1)
  • pdm.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • pylock.toml
  • pyproject.toml

Comment thread pylock.toml
name = "datasets"
version = "3.6.0"
requires-python = ">=3.9.0"
version = "5.0.0"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the repository structure
git ls-files | head -20

Repository: openshift-psap/llm-eval-test

Length of output: 861


🏁 Script executed:

# Find and examine the pylock.toml file
find . -name "pylock.toml" -o -name "*.lock" -o -name "requirements*.txt" | head -20

Repository: openshift-psap/llm-eval-test

Length of output: 116


🏁 Script executed:

# Look for test files
find . -name "*test*.py" -o -name "tests" -type d | head -20

Repository: openshift-psap/llm-eval-test

Length of output: 54


🏁 Script executed:

# Search for CI/CD configuration
find . -name ".github" -type d -o -name ".gitlab-ci.yml" -o -name "*.yml" | grep -E "(github|gitlab|circle)" | head -20

Repository: openshift-psap/llm-eval-test

Length of output: 117


🏁 Script executed:

# Read pylock.toml to verify the versions mentioned in the review
cat -n pylock.toml | head -100

Repository: openshift-psap/llm-eval-test

Length of output: 3021


🏁 Script executed:

# Check if these specific lines have the versions mentioned
sed -n '170p;596p;578p;664p;963p' pylock.toml

Repository: openshift-psap/llm-eval-test

Length of output: 164


🏁 Script executed:

# Look for test files and CI configuration
find . -type f -name "*test*.py" -o -name "conftest.py" 2>/dev/null | head -20

Repository: openshift-psap/llm-eval-test

Length of output: 54


🏁 Script executed:

# Check the CI workflow
cat -n .github/workflows/pre-commit.yml

Repository: openshift-psap/llm-eval-test

Length of output: 496


🏁 Script executed:

# Search for the mentioned files
find . -name "lm_eval_wrapper.py" -o -name "downloader.py" -o -name "truthfulqa" -type d 2>/dev/null

Repository: openshift-psap/llm-eval-test

Length of output: 193


🏁 Script executed:

# Check what package is on line 664
sed -n '660,670p' pylock.toml

Repository: openshift-psap/llm-eval-test

Length of output: 280


🏁 Script executed:

# Look for any test files more thoroughly
find . -type f \( -name "test_*.py" -o -name "*_test.py" \) 2>/dev/null

Repository: openshift-psap/llm-eval-test

Length of output: 54


🏁 Script executed:

# Check pyproject.toml for test configuration
cat pyproject.toml

Repository: openshift-psap/llm-eval-test

Length of output: 1422


🏁 Script executed:

# Look at the mentioned files to understand what they do
cat -n ./src/llm_eval_test/downloader.py | head -50

Repository: openshift-psap/llm-eval-test

Length of output: 2542


🏁 Script executed:

# Check lm_eval_wrapper.py
cat -n ./src/llm_eval_test/lm_eval_wrapper.py | head -50

Repository: openshift-psap/llm-eval-test

Length of output: 2545


🏁 Script executed:

# Check if there are any test scripts or pytest config
find . -name "pytest.ini" -o -name "tox.ini" -o -name ".pytest.ini" 2>/dev/null

Repository: openshift-psap/llm-eval-test

Length of output: 54


🏁 Script executed:

# Check the truthfulqa utils file for numpy usage
cat -n ./src/llm_eval_test/benchmarks/tasks/truthfulqa/utils.py

Repository: openshift-psap/llm-eval-test

Length of output: 7519


🏁 Script executed:

# Check if there's any actual numpy or pandas usage in the codebase
rg "import numpy|from numpy|import pandas|from pandas" --type py

Repository: openshift-psap/llm-eval-test

Length of output: 146


🏁 Script executed:

# Check what exactly is imported from lm_eval
rg "from lm_eval|import lm_eval" --type py

Repository: openshift-psap/llm-eval-test

Length of output: 1199


🏁 Script executed:

# Check if there are any GitHub Actions workflow files or test commands
find . -type f -name "*.yml" -o -name "*.yaml" 2>/dev/null | xargs cat 2>/dev/null

Repository: openshift-psap/llm-eval-test

Length of output: 50384


🏁 Script executed:

# Check git history to see if tests were recently removed or if there's a test branch
git log --oneline --all | head -20

Repository: openshift-psap/llm-eval-test

Length of output: 850


🏁 Script executed:

# Check if there are any scripts that run tests or validation
find . -name "*.sh" -o -name "Makefile" -o -name "justfile" 2>/dev/null

Repository: openshift-psap/llm-eval-test

Length of output: 54


🏁 Script executed:

# Look for any pandas/numpy/torch direct usage in the codebase
rg "pandas\.|pd\.|np\.|torch\." --type py -c

Repository: openshift-psap/llm-eval-test

Length of output: 196


🏁 Script executed:

# Check the actual dependency versions in pdm.lock to cross-verify
grep -A 2 'name = "pandas"' pdm.lock | head -10

Repository: openshift-psap/llm-eval-test

Length of output: 133


🏁 Script executed:

# Check what lm-eval 0.4.11 changed
web_search
lm-eval 0.4.11 release notes breaking changes compatibility

Repository: openshift-psap/llm-eval-test

Length of output: 169


🌐 Web query:

lm-eval 0.4.11 release notes breaking changes

💡 Result:

The lm-eval v0.4.11 release (published February 13, 2026) is characterized as a minor release with no breaking changes in its own changelog [1][2]. Users should note that major breaking changes to installation and CLI usage were introduced in the preceding v0.4.10 release, rather than v0.4.11 [3][4]. If migrating from versions prior to v0.4.10, the following changes are applicable:

  • Lightweight Core Installation: The base package lm-eval no longer includes heavy dependencies like transformers or torch by default [5][6][4]. Users must now install required backends explicitly using extras [4]:
      pip install lm_eval[hf]    # For HuggingFace/transformers/torch
      pip install lm_eval[vllm]  # For vLLM
      pip install lm_eval[api]   # For OpenAI, Anthropic, etc.
  • CLI Refactor: The CLI was refactored to support explicit subcommands (run, ls, validate) and YAML configuration files [5][7][4]. While the previous command syntax (e.g., lm-eval --model hf --tasks hellaswag) remains backward compatible, the new recommended usage is lm-eval run --model hf --tasks hellaswag [3][4].
  • API Changes: Accessing model classes directly via attribute no longer works (e.g., import lm_eval.models; lm_eval.models.huggingface.HFLM now raises an AttributeError) [3][4]. Direct imports such as from lm_eval.models.huggingface import HFLM remain functional [3][4].
  • Task Versions: The v0.4.11 release includes updates to specific task versions (such as afrobench_belebele, evalita_llm, and mgsm_direct), which may impact the comparability of results with previous versions [1][2].
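
The API change described in this result comes down to importing the submodule explicitly instead of relying on attribute access through the parent package. A minimal, hedged sketch of that pattern using only the standard library; `load_attr` is a hypothetical helper, and the demonstration uses `json.decoder` so that it runs without lm-eval installed.

```python
import importlib


def load_attr(module_path, attr):
    """Import `module_path` explicitly and return `attr` from it.

    Returns None when the module cannot be imported or lacks the attribute.
    """
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return None
    return getattr(module, attr, None)


# Stdlib demonstration of the explicit-submodule pattern.
decoder_cls = load_attr("json.decoder", "JSONDecoder")
print(decoder_cls)

# With lm-eval and its hf extra installed, the equivalent call would be:
#   HFLM = load_attr("lm_eval.models.huggingface", "HFLM")
```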

Citations:


🌐 Web query:

pandas 3.0.0 breaking changes dtype handling

💡 Result:

Pandas 3.0.0 introduced significant breaking changes to dtype handling, primarily centered around a new default string data type and consistent treatment of missing values [1][2][3].

Dedicated String Dtype

Starting in version 3.0, pandas infers string data as the new dedicated str dtype instead of the generic numpy object dtype [1][2]. This change is intended to improve performance and type safety [2]. If the pyarrow library is installed, this new dtype is backed by PyArrow; otherwise, it uses an object-dtype fallback [1][4]. Key breaking consequences of this change include:

  • Type Checking: Code that checks for object dtype (e.g., ser.dtype == object) when expecting strings will break because the new str dtype is a pandas extension dtype, not a numpy dtype [5][2][6]. It is incompatible with functions that expect numpy types, such as np.issubdtype [5][7].
  • Missing Value Sentinels: The new str dtype consistently uses NaN (np.nan) as the missing value sentinel [5][6]. Previously, object-dtype columns could contain a mix of None and NaN [5][6]. While pandas still accepts None as input, it is now converted to NaN [6].
  • Strictness: Because the new string dtype is strictly typed for strings, workflows that previously relied on object dtype to store mixed-type data may fail unless explicitly forced back to object dtype [6].

Consistent NaN and NA Treatment

Pandas 3.0 also standardized the treatment of missing values across nullable dtypes (e.g., Float64Dtype, int64[pyarrow]) [1]. NaN is now consistently treated as equivalent to NA in all cases, including in constructors, setitem, and contains operations [1]. Consequently, arithmetic and universal function operations that previously resulted in NaN will now produce NA [1]. An option to distinguish these values remains available for future use via pd.options.future.distinguish_nan_and_na [1].

For detailed migration guidance, users should refer to the official pandas migration guide for the new string data type [5][7].

Citations:


🌐 Web query:

numpy 2.0 breaking changes behavior

💡 Result:

NumPy 2.0 introduced significant breaking changes, primarily impacting the Application Binary Interface (ABI), the Python API namespace, and data type promotion rules [1][2]. Because these changes were incompatible with previous versions, they required a major release [1][3]. Key breaking changes include:

  • ABI and C API Breakage: NumPy 2.0 introduced an ABI break, meaning binaries of packages that depend on the NumPy C API and were built against NumPy 1.x versions will not work with NumPy 2.0 [1][2]. Such packages typically raise an ImportError upon loading [1][2]. The internal structures, such as PyArray_Descr, were made more opaque, and complex type implementations were updated to use native C99 types, necessitating changes for projects using these low-level APIs [4].
  • Namespace Cleanup: To simplify the library, approximately 100 members were removed, moved, or deprecated from the main numpy (np) namespace [5][4]. This effort aimed to reduce clutter, remove outdated functionality, and clarify the distinction between public and private API elements [4][6][3]. Users are advised to replace deprecated aliases with their recommended, often backwards-compatible, alternatives [5][6].
  • Data Type Promotion: NumPy 2.0 adopted NEP 50, which changed how data type promotion (combining dissimilar types) occurs [5][4]. Additionally, the default integer type was changed to be consistent as np.intp on all platforms (previously, it could be equivalent to the C long type depending on the system/history) [5][4].
  • Migration Resources: To assist with the transition, the NumPy team released a comprehensive NumPy 2.0 migration guide [5][7][3]. Furthermore, developers provided a Ruff plugin to help automatically update Python code so that it remains compatible with both NumPy 1.x and 2.x versions [3]. Downstream packages are generally advised to recompile against NumPy 2.0 to ensure compatibility while maintaining support for 1.26.x if needed [1][2][3].

Top results: [1][5][4][3]

Citations:


Provide test evidence for major dependency version updates.

The lockfile includes significant version bumps (pandas 3.0.3, datasets 5.0.0, numpy 2.4.6, torch 2.12.0), but no test results, CI/CD validation, or integration testing is provided in the PR to confirm compatibility. The codebase uses numpy operations (np.exp(), np.array(), np.sum(), np.nanmax() in truthfulqa/utils.py) that should remain functional with numpy 2.0, but without test evidence it is unclear whether:

  • The downstream code in lm_eval_wrapper.py, downloader.py, and benchmarks/tasks/** works with these new versions
  • The leaderboard extra tasks (math, ifeval, sentencepiece) execute without errors
  • Numerical scoring in truthfulqa produces correct results
  • Any regressions were introduced

Provide test results demonstrating successful execution of the evaluation pipeline with these dependency versions, or run a comprehensive test suite to validate integration.
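
As one piece of such evidence, the resolved versions in the environment that ran the pipeline can be recorded alongside the results. The sketch below uses only `importlib.metadata` from the standard library; the package list comes from this comment, while the helper names and the report format are assumptions for illustration.

```python
from importlib import metadata

# Package names taken from the review comment above.
PACKAGES = ["lm-eval", "pandas", "datasets", "numpy", "torch"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def format_report(versions):
    """Render the versions as aligned 'name  version' lines for a PR description."""
    width = max(len(name) for name in versions)
    return "\n".join(
        f"{name.ljust(width)}  {version or 'not installed'}"
        for name, version in versions.items()
    )


print(format_report(installed_versions(PACKAGES)))
```

Pasting this report next to the evaluation output ties the scores to the exact dependency set that produced them.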

Comment thread pylock.toml
@dtrifiro dtrifiro merged commit 357ff9a into openshift-psap:main Jun 16, 2026
2 checks passed
@dtrifiro dtrifiro deleted the bump-lm-eval branch June 16, 2026 09:08