bump lm-eval to 0.4.11 #48
Signed-off-by: Dan Huang <dahuang@redhat.com>
📝 Walkthrough
Project version bumped to
Changes: lm-eval Upgrade and Lock Regeneration
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 2
🧹 Nitpick comments (1)
pyproject.toml (1)
3-3: API compatibility checks out; consider adding test coverage for the lm-eval 0.4.11 upgrade.
The `simple_evaluate()` and `TaskManager()` APIs used in `lm_eval_wrapper.py` and `downloader.py` are compatible with lm-eval 0.4.11. The function parameters (e.g., `apply_chat_template`, `fewshot_as_multiturn`, `task_manager`) and the TaskManager initialization pattern (`include_path`, `include_defaults`, `verbosity`) match the documented 0.4.11 API. The optional extras (`ifeval`, `math`, `sentencepiece`) are also available in this version. However, no test or validation documentation is present in the PR to demonstrate this upgrade was tested against the current code paths.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pyproject.toml` at line 3, The version bump to 0.4.11 in pyproject.toml lacks supporting test coverage or validation documentation. Add test cases or test documentation that explicitly verify the lm-eval 0.4.11 upgrade works with the current code paths used in lm_eval_wrapper.py and downloader.py, including validation of the simple_evaluate() function calls and TaskManager initialization patterns to demonstrate the upgrade was properly tested.
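One lightweight way to back the compatibility claim above with a test is to inspect a function's signature for the keyword arguments the wrapper passes. The sketch below is only an illustration: `missing_parameters` and `fake_simple_evaluate` are hypothetical names, and the stand-in function exists solely so the example runs without lm-eval installed.

```python
import inspect


def missing_parameters(func, expected):
    """Return the names in `expected` that `func` does not accept."""
    params = inspect.signature(func).parameters
    # A **kwargs catch-all accepts any keyword, so nothing counts as missing.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [name for name in expected if name not in params]


# Hypothetical stand-in so the sketch runs without lm-eval installed.
def fake_simple_evaluate(model, tasks=None, apply_chat_template=False,
                         fewshot_as_multiturn=False, task_manager=None):
    return {}


# Parameter names taken from the review comment above.
EXPECTED = ["apply_chat_template", "fewshot_as_multiturn", "task_manager"]
print(missing_parameters(fake_simple_evaluate, EXPECTED))  # []
```

In an actual test, the real `simple_evaluate` would be passed in place of the stand-in, and a non-empty result would fail the test.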
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pylock.toml`:
- Line 849: The lockfile contains sqlitedict==2.1.0 at line 883, which has a
HIGH severity deserialization vulnerability (GHSA-g4r7-86gm-pgqc) and cannot be
released to production. This is a transitive dependency of lm-eval==0.4.11
located at lines 412-434. To resolve this blocking security issue, you must
either upgrade lm-eval to a newer version that depends on a patched sqlitedict
(version >2.1.0), or add an explicit sqlitedict constraint in the dependencies
to pin it to a patched version. Check the lm-eval project for available releases
that address this CVE, update the lm-eval version constraint accordingly, and
regenerate the lockfile to ensure sqlitedict is resolved to a patched version.
- Line 170: The PR introduces significant dependency version upgrades (pandas
3.0.3, datasets 5.0.0, numpy 2.4.6, torch 2.12.0) in the pylock.toml lockfile
without providing test evidence of compatibility. You need to run the evaluation
pipeline tests to validate that the codebase remains functional with these new
versions, specifically testing the numpy operations in truthfulqa/utils.py, the
downstream code paths in lm_eval_wrapper.py and downloader.py, the benchmark
tasks under benchmarks/tasks/**, and the leaderboard extra tasks (math, ifeval,
sentencepiece). Execute the comprehensive test suite, capture the results
demonstrating successful execution and numerical correctness of the scoring
pipeline, and add this test evidence to the PR description or commit message to
confirm no regressions were introduced by these dependency updates.
---
Nitpick comments:
In `@pyproject.toml`:
- Line 3: The version bump to 0.4.11 in pyproject.toml lacks supporting test
coverage or validation documentation. Add test cases or test documentation that
explicitly verify the lm-eval 0.4.11 upgrade works with the current code paths
used in lm_eval_wrapper.py and downloader.py, including validation of the
simple_evaluate() function calls and TaskManager initialization patterns to
demonstrate the upgrade was properly tested.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1a70bf93-fec9-4888-bf85-bc0a2be7c432
⛔ Files ignored due to path filters (1)
`pdm.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (2)
`pylock.toml`, `pyproject.toml`
name = "datasets"
version = "3.6.0"
requires-python = ">=3.9.0"
version = "5.0.0"
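Jumps like the `datasets` change from 3.6.0 to 5.0.0 shown above are the ones most likely to need test evidence, so it can help to list them automatically. This is a rough sketch that assumes the old and new pins have already been extracted into plain dictionaries. `major` and `major_bumps` are hypothetical helper names.

```python
def major(version):
    """Leading release number of a version string, e.g. '5.0.0' -> 5."""
    return int(version.split(".")[0])


def major_bumps(old_pins, new_pins):
    """Map package -> (old, new) for packages whose major version increased."""
    return {
        name: (old_pins[name], new)
        for name, new in new_pins.items()
        if name in old_pins and major(new) > major(old_pins[name])
    }


# Versions taken from the diff hunk above.
old = {"datasets": "3.6.0"}
new = {"datasets": "5.0.0"}
print(major_bumps(old, new))  # {'datasets': ('3.6.0', '5.0.0')}
```

Minor and patch updates are deliberately ignored here. That is a design choice of the sketch, not a claim that such updates are always safe.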
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the repository structure
git ls-files | head -20
Repository: openshift-psap/llm-eval-test
Length of output: 861
🏁 Script executed:
# Find and examine the pylock.toml file
find . -name "pylock.toml" -o -name "*.lock" -o -name "requirements*.txt" | head -20
Repository: openshift-psap/llm-eval-test
Length of output: 116
🏁 Script executed:
# Look for test files
find . -name "*test*.py" -o -name "tests" -type d | head -20
Repository: openshift-psap/llm-eval-test
Length of output: 54
🏁 Script executed:
# Search for CI/CD configuration
find . -name ".github" -type d -o -name ".gitlab-ci.yml" -o -name "*.yml" | grep -E "(github|gitlab|circle)" | head -20
Repository: openshift-psap/llm-eval-test
Length of output: 117
🏁 Script executed:
# Read pylock.toml to verify the versions mentioned in the review
cat -n pylock.toml | head -100
Repository: openshift-psap/llm-eval-test
Length of output: 3021
🏁 Script executed:
# Check if these specific lines have the versions mentioned
sed -n '170p;596p;578p;664p;963p' pylock.toml
Repository: openshift-psap/llm-eval-test
Length of output: 164
🏁 Script executed:
# Look for test files and CI configuration
find . -type f -name "*test*.py" -o -name "conftest.py" 2>/dev/null | head -20
Repository: openshift-psap/llm-eval-test
Length of output: 54
🏁 Script executed:
# Check the CI workflow
cat -n .github/workflows/pre-commit.yml
Repository: openshift-psap/llm-eval-test
Length of output: 496
🏁 Script executed:
# Search for the mentioned files
find . -name "lm_eval_wrapper.py" -o -name "downloader.py" -o -name "truthfulqa" -type d 2>/dev/null
Repository: openshift-psap/llm-eval-test
Length of output: 193
🏁 Script executed:
# Check what package is on line 664
sed -n '660,670p' pylock.toml
Repository: openshift-psap/llm-eval-test
Length of output: 280
🏁 Script executed:
# Look for any test files more thoroughly
find . -type f \( -name "test_*.py" -o -name "*_test.py" \) 2>/dev/null
Repository: openshift-psap/llm-eval-test
Length of output: 54
🏁 Script executed:
# Check pyproject.toml for test configuration
cat pyproject.toml
Repository: openshift-psap/llm-eval-test
Length of output: 1422
🏁 Script executed:
# Look at the mentioned files to understand what they do
cat -n ./src/llm_eval_test/downloader.py | head -50
Repository: openshift-psap/llm-eval-test
Length of output: 2542
🏁 Script executed:
# Check lm_eval_wrapper.py
cat -n ./src/llm_eval_test/lm_eval_wrapper.py | head -50
Repository: openshift-psap/llm-eval-test
Length of output: 2545
🏁 Script executed:
# Check if there are any test scripts or pytest config
find . -name "pytest.ini" -o -name "tox.ini" -o -name ".pytest.ini" 2>/dev/null
Repository: openshift-psap/llm-eval-test
Length of output: 54
🏁 Script executed:
# Check the truthfulqa utils file for numpy usage
cat -n ./src/llm_eval_test/benchmarks/tasks/truthfulqa/utils.py
Repository: openshift-psap/llm-eval-test
Length of output: 7519
🏁 Script executed:
# Check if there's any actual numpy or pandas usage in the codebase
rg "import numpy|from numpy|import pandas|from pandas" --type py
Repository: openshift-psap/llm-eval-test
Length of output: 146
🏁 Script executed:
# Check what exactly is imported from lm_eval
rg "from lm_eval|import lm_eval" --type py
Repository: openshift-psap/llm-eval-test
Length of output: 1199
🏁 Script executed:
# Check if there are any GitHub Actions workflow files or test commands
find . -type f -name "*.yml" -o -name "*.yaml" 2>/dev/null | xargs cat 2>/dev/null
Repository: openshift-psap/llm-eval-test
Length of output: 50384
🏁 Script executed:
# Check git history to see if tests were recently removed or if there's a test branch
git log --oneline --all | head -20
Repository: openshift-psap/llm-eval-test
Length of output: 850
🏁 Script executed:
# Check if there are any scripts that run tests or validation
find . -name "*.sh" -o -name "Makefile" -o -name "justfile" 2>/dev/null
Repository: openshift-psap/llm-eval-test
Length of output: 54
🏁 Script executed:
# Look for any pandas/numpy/torch direct usage in the codebase
rg "pandas\.|pd\.|np\.|torch\." --type py -c
Repository: openshift-psap/llm-eval-test
Length of output: 196
🏁 Script executed:
# Check the actual dependency versions in pdm.lock to cross-verify
grep -A 2 'name = "pandas"' pdm.lock | head -10
Repository: openshift-psap/llm-eval-test
Length of output: 133
🏁 Script executed:
# Check what lm-eval 0.4.11 changed
web_search
lm-eval 0.4.11 release notes breaking changes compatibility
Repository: openshift-psap/llm-eval-test
Length of output: 169
🌐 Web query:
lm-eval 0.4.11 release notes breaking changes
💡 Result:
The lm-eval v0.4.11 release (published February 13, 2026) is characterized as a minor release with no breaking changes in its own changelog [1][2]. Users should note that major breaking changes to installation and CLI usage were introduced in the preceding v0.4.10 release, rather than v0.4.11 [3][4]. If migrating from versions prior to v0.4.10, the following changes are applicable:
- Lightweight Core Installation: The base package lm-eval no longer includes heavy dependencies like transformers or torch by default [5][6][4]. Users must now install required backends explicitly using extras [4]: `pip install lm_eval[hf]` (for HuggingFace/transformers/torch), `pip install lm_eval[vllm]` (for vLLM), `pip install lm_eval[api]` (for OpenAI, Anthropic, etc.).
- CLI Refactor: The CLI was refactored to support explicit subcommands (`run`, `ls`, `validate`) and YAML configuration files [5][7][4]. While the previous command syntax (e.g., `lm-eval --model hf --tasks hellaswag`) remains backward compatible, the new recommended usage is `lm-eval run --model hf --tasks hellaswag` [3][4].
- API Changes: Accessing model classes directly via attribute no longer works (e.g., `import lm_eval.models; lm_eval.models.huggingface.HFLM` now raises an AttributeError) [3][4]. Direct imports such as `from lm_eval.models.huggingface import HFLM` remain functional [3][4].
- Task Versions: The v0.4.11 release includes updates to specific task versions (such as afrobench_belebele, evalita_llm, and mgsm_direct), which may impact the comparability of results with previous versions [1][2].
Citations:
- 1: https://github.com/EleutherAI/lm-evaluation-harness/releases/tag/v0.4.11
- 2: https://zenodo.org/records/18636344
- 3: https://github.com/EleutherAI/lm-evaluation-harness/releases
- 4: https://github.com/EleutherAI/lm-evaluation-harness/releases/tag/v0.4.10
- 5: https://pypi.org/project/lm-eval/0.4.11/
- 6: https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.11
- 7: https://github.com/EleutherAI/lm-evaluation-harness/blob/v0.4.11/README.md
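Two points from the result above lend themselves to a quick environment check: whether the installed lm-eval matches the version this PR pins, and whether classes are loaded through an explicit module import rather than attribute access. The sketch below uses only the standard library. `installed_version` and `load_attr` are hypothetical helper names.

```python
import importlib
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def load_attr(module_path, attr):
    """Import the module by its full path, then look up the attribute."""
    module = importlib.import_module(module_path)
    return getattr(module, attr)


PINNED = "0.4.11"  # version named in this PR's title
found = installed_version("lm-eval")
if found is None:
    print("lm-eval is not installed in this environment")
elif found != PINNED:
    print(f"expected lm-eval {PINNED}, found {found}")
else:
    print("lm-eval matches the pinned version")
```

With the `hf` extra installed, `load_attr("lm_eval.models.huggingface", "HFLM")` would mirror the direct-import form that the release notes describe as still working.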
🌐 Web query:
pandas 3.0.0 breaking changes dtype handling
💡 Result:
Pandas 3.0.0 introduced significant breaking changes to dtype handling, primarily centered around a new default string data type and consistent treatment of missing values [1][2][3].
Dedicated String Dtype: Starting in version 3.0, pandas infers string data as the new dedicated str dtype instead of the generic numpy object dtype [1][2]. This change is intended to improve performance and type safety [2]. If the pyarrow library is installed, this new dtype is backed by PyArrow; otherwise, it uses an object-dtype fallback [1][4]. Key breaking consequences of this change include:
- Type Checking: Code that checks for object dtype (e.g., `ser.dtype == object`) when expecting strings will break because the new str dtype is a pandas extension dtype, not a numpy dtype [5][2][6]. It is incompatible with functions that expect numpy types, such as `np.issubdtype` [5][7].
- Missing Value Sentinels: The new str dtype consistently uses NaN (`np.nan`) as the missing value sentinel [5][6]. Previously, object-dtype columns could contain a mix of None and NaN [5][6]. While pandas still accepts None as input, it is now converted to NaN [6].
- Strictness: Because the new string dtype is strictly typed for strings, workflows that previously relied on object dtype to store mixed-type data may fail unless explicitly forced back to object dtype [6].
Consistent NaN and NA Treatment: Pandas 3.0 also standardized the treatment of missing values across nullable dtypes (e.g., Float64Dtype, int64[pyarrow]) [1]. NaN is now consistently treated as equivalent to NA in all cases, including in constructors, setitem, and contains operations [1]. Consequently, arithmetic and universal function operations that previously resulted in NaN will now produce NA [1]. An option to distinguish these values remains available for future use via `pd.options.future.distinguish_nan_and_na` [1].
For detailed migration guidance, users should refer to the official pandas migration guide for the new string data type [5][7].
Citations:
- 1: https://pandas.pydata.org/docs/whatsnew/v3.0.0.html
- 2: https://pandas.pydata.org/community/blog/pandas-3.0.html
- 3: https://github.com/pandas-dev/pandas/releases/tag/v3.0.0
- 4: https://pandas.pydata.org/pandas-docs/version/3.0.1/user_guide/text.html
- 5: https://pandas.pydata.org/pandas-docs/version/3.0/user_guide/migration-3-strings.html
- 6: String dtype: overview of breaking behaviour changes pandas-dev/pandas#59328
- 7: https://pandas.pydata.org/pandas-docs/version/3.0.1/user_guide/migration-3-strings.html
🌐 Web query:
numpy 2.0 breaking changes behavior
💡 Result:
NumPy 2.0 introduced significant breaking changes, primarily impacting the Application Binary Interface (ABI), the Python API namespace, and data type promotion rules [1][2]. Because these changes were incompatible with previous versions, they required a major release [1][3]. Key breaking changes include:
- ABI and C API Breakage: NumPy 2.0 introduced an ABI break, meaning binaries of packages that depend on the NumPy C API and were built against NumPy 1.x versions will not work with NumPy 2.0 [1][2]. Such packages typically raise an ImportError upon loading [1][2]. The internal structures, such as PyArray_Descr, were made more opaque, and complex type implementations were updated to use native C99 types, necessitating changes for projects using these low-level APIs [4].
- Namespace Cleanup: To simplify the library, approximately 100 members were removed, moved, or deprecated from the main numpy (np) namespace [5][4]. This effort aimed to reduce clutter, remove outdated functionality, and clarify the distinction between public and private API elements [4][6][3]. Users are advised to replace deprecated aliases with their recommended, often backwards-compatible, alternatives [5][6].
- Data Type Promotion: NumPy 2.0 adopted NEP 50, which changed how data type promotion (combining dissimilar types) occurs [5][4]. Additionally, the default integer type was changed to be consistent as np.intp on all platforms (previously, it could be equivalent to the C long type depending on the system/history) [5][4].
- Migration Resources: To assist with the transition, the NumPy team released a comprehensive NumPy 2.0 migration guide [5][7][3]. Furthermore, developers provided a Ruff plugin to help automatically update Python code so that it remains compatible with both NumPy 1.x and 2.x versions [3]. Downstream packages are generally advised to recompile against NumPy 2.0 to ensure compatibility while maintaining support for 1.26.x if needed [1][2][3].
Top results: [1][5][4][3]
Citations:
- 1: https://numpy.org/doc/2.0/release/2.0.0-notes.html
- 2: https://github.com/numpy/numpy/releases/tag/v2.0.0
- 3: https://blog.scientific-python.org/numpy/numpy2/
- 4: https://numpy.org/doc/2.0/numpy_2_0_migration_guide.html
- 5: https://numpy.org/devdocs/numpy_2_0_migration_guide.html
- 6: https://www.datacamp.com/tutorial/numpy-2-release
- 7: https://numpy.org/doc/2.2/numpy_2_0_migration_guide.html
Provide test evidence for major dependency version updates.
The lockfile includes significant version bumps (pandas 3.0.3, datasets 5.0.0, numpy 2.4.6, torch 2.12.0), but no test results, CI/CD validation, or integration testing is provided in the PR to confirm compatibility. The codebase uses numpy operations (np.exp(), np.array(), np.sum(), np.nanmax() in truthfulqa/utils.py) that should remain functional with numpy 2.0, but without test evidence it is unclear whether:
- The downstream code in `lm_eval_wrapper.py`, `downloader.py`, and `benchmarks/tasks/**` works with these new versions
- The leaderboard extra tasks (math, ifeval, sentencepiece) execute without errors
- Numerical scoring in truthfulqa produces correct results
- Any regressions were introduced
Provide test results demonstrating successful execution of the evaluation pipeline with these dependency versions, or run a comprehensive test suite to validate integration.
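As one possible form of the numerical evidence requested above, a score computed with numpy could be compared against a pure-Python reference that does not depend on the upgraded packages. The sketch below is an assumption-laden illustration and not the repository's scoring code. `true_mass_reference` and `matches_reference` are hypothetical names, and the formula is only modeled on the exp-and-sum pattern the comment mentions.

```python
import math


def true_mass_reference(ll_true, ll_false):
    """Share of probability mass on true answers, from log-likelihoods."""
    p_true = [math.exp(x) for x in ll_true]
    p_false = [math.exp(x) for x in ll_false]
    return sum(p_true) / (sum(p_true) + sum(p_false))


def matches_reference(candidate, ll_true, ll_false, rel_tol=1e-9):
    """True if `candidate` agrees with the pure-Python reference."""
    expected = true_mass_reference(ll_true, ll_false)
    return math.isclose(candidate, expected, rel_tol=rel_tol)


# One true and one false answer with equal likelihood split the mass evenly.
score = true_mass_reference([math.log(0.25)], [math.log(0.25)])
print(math.isclose(score, 0.5))  # True
```

In a regression test, the value produced by the numpy-based scoring function would be passed as `candidate`, so a mismatch after the dependency bump would surface as a failed assertion.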
original author: @dhuangnm
Signed-off-by: Dan Huang <dahuang@redhat.com>