[PB] Packaging fixes to various lib to enable non-editable installing #82

Merged: thesofakillers merged 4 commits into openai:main from stalkermustang:fix/fixes-to-paperbench on Oct 28, 2025

@stalkermustang stalkermustang commented Oct 26, 2025

Hello OpenAI team,

I'm currently working on adapting PaperBench for the PrimeIntellect Environments hub (PR). This initiative closely resembles the previously reviewed MLE Bench Port—many thanks to @thesofakillers for an extensive review there!

I'm facing an issue while trying to reuse the existing Judge code from PaperBench, as I'd prefer adhering to the reference implementation rather than reinventing the wheel. To achieve this, I introduced PaperBench as a dependency and attempted to import SimpleJudge as follows:

from paperbench.judge.simple import SimpleJudge

However, this raises a couple of dependency-related issues:

  • alcatraz: The import chain eventually needs JudgeOutput, which in turn leads to:

    from alcatraz.clusters.local import BaseAlcatrazCluster, ClusterConfig, LocalConfig

  • SimpleJudge: Internally attempts to import:

    from nanoeval.solvers.computer_tasks.code_execution_interface import ComputerInterface

How to reproduce in a clean environment (5 min):

  1. mkdir <some_name> && cd <some_name> && uv init --name frontier_evals_fix
  2. Update pyproject.toml with this:
[project]
name = "frontier-evals-fix"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "paperbench",
]

[tool.uv.sources]
paperbench = {git = "https://github.com/openai/frontier-evals.git", subdirectory = "project/paperbench", rev = "650e03aede9b5d5a81e9045bbc9843dd665d8fe4"}

  3. uv sync runs fine
  4. Update main.py with this:
def main():
    from paperbench.judge.simple import SimpleJudge
    print('Import is fine, ', SimpleJudge.__name__)

if __name__ == "__main__":
    main()
  5. uv run main.py will show the issue:
Traceback (most recent call last):
  File "/Users/seeall/Documents/_WORKSPACES/tmp_fr_ev/main.py", line 6, in <module>
    main()
  File "/Users/seeall/Documents/_WORKSPACES/tmp_fr_ev/main.py", line 2, in main
    from paperbench.judge.simple import SimpleJudge
  File "/Users/seeall/Documents/_WORKSPACES/tmp_fr_ev/.venv/lib/python3.12/site-packages/paperbench/judge/simple.py", line 26, in <module>
    from nanoeval.solvers.computer_tasks.code_execution_interface import ComputerInterface
ModuleNotFoundError: No module named 'nanoeval.solvers'
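
To narrow down which part of an import chain is missing from a non-editable install, a small check along the following lines may help. This is only a sketch and not part of the PR: the helper name `first_missing_module` is hypothetical, and it relies solely on `importlib.util.find_spec` from the standard library. Note that `find_spec` imports parent packages as a side effect.

```python
import importlib.util
from typing import Optional


def first_missing_module(dotted_name: str) -> Optional[str]:
    """Return the shortest prefix of `dotted_name` that cannot be located.

    Returns None when every prefix (and the full name) resolves.
    Hypothetical helper for illustration only.
    """
    parts = dotted_name.split(".")
    for i in range(1, len(parts) + 1):
        prefix = ".".join(parts[:i])
        try:
            # find_spec imports the parent package to search its __path__.
            spec = importlib.util.find_spec(prefix)
        except ModuleNotFoundError:
            # Raised when the parent is missing or is a plain module, not a package.
            spec = None
        if spec is None:
            return prefix
    return None


if __name__ == "__main__":
    # Standard-library names are used here so the example runs anywhere.
    print(first_missing_module("json.decoder"))
    print(first_missing_module("json.no_such_submodule_xyz"))
```

In the reproduction environment, calling it with `"nanoeval.solvers.computer_tasks.code_execution_interface"` should, going by the traceback above, point at `nanoeval.solvers` as the first piece that was not installed.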

After a short debugging session with Codex, the only viable solution we found involved using an admittedly inelegant stub that modifies the PATH. I've looked into several adjustments to pyproject.toml, but unfortunately, none resolved the underlying issue.

The proposed fix in this PR addresses the problem by explicitly expanding the modules exposed for these two packages. An alternative approach for alcatraz might involve adding __init__.py files; however, it appears this was intentionally avoided in your design. For nanoeval, although __init__.py files exist, the exportable scope remains limited, perhaps intentionally, though I'm unsure.

Please let me know if you have a preferred approach or any suggestions for implementing this differently.

Additionally, I've updated the CONTEXT_WINDOW_LENGTHS mapping to include various GPT-5 family models. I've set the context length to 272k instead of 400k, as the existing implementation seems primarily concerned with input length rather than total tokens. If this interpretation is incorrect, please advise, and I can update the values accordingly.
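
The difference between the two candidate values can be illustrated with a minimal sketch. It assumes the mapping is used to check prompt (input) length, as described above; the key `"gpt-5"` and the helper `fits_in_context` are hypothetical, and the real mapping in the repository may use different keys.

```python
# Hypothetical entry for illustration; the real mapping lives in the repository.
CONTEXT_WINDOW_LENGTHS = {
    "gpt-5": 272_000,  # input budget chosen in the PR description, not the 400k total
}


def fits_in_context(model: str, n_input_tokens: int) -> bool:
    """Return True if the prompt's token count is within the model's input budget."""
    try:
        limit = CONTEXT_WINDOW_LENGTHS[model]
    except KeyError:
        raise ValueError(f"Unknown model: {model!r}") from None
    return n_input_tokens <= limit


if __name__ == "__main__":
    # A 300k-token prompt would pass a 400k check but is rejected with 272k.
    print(fits_in_context("gpt-5", 300_000))
```

With 400k in the mapping, the same 300k-token prompt would be accepted even though it exceeds the input budget, which is the case the 272k value is meant to guard against.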

Thanks in advance for your feedback!


@thesofakillers thesofakillers left a comment

Thank you for flagging this.

You are right: we had not properly packaged some of the libraries, and we didn't catch this because we typically work with them installed in editable mode.

I've left some comments adjusting your fixes. Thanks!

Comment thread project/common/preparedness_turn_completer/preparedness_turn_completer/utils.py Outdated
Comment thread project/common/nanoeval/pyproject.toml Outdated
Comment thread project/common/nanoeval/pyproject.toml Outdated
Comment thread project/common/alcatraz/pyproject.toml Outdated
Comment thread project/common/nanoeval/pyproject.toml
Comment thread project/common/alcatraz/pyproject.toml
@thesofakillers thesofakillers changed the title Fixes to paperbench as an installable package for benchmarking [PB] Fixes to various libraries to enable non-editable installing Oct 27, 2025
@thesofakillers thesofakillers changed the title [PB] Fixes to various libraries to enable non-editable installing [PB] Packaging fixes to various lib to enable non-editable installing Oct 27, 2025
@thesofakillers

This looks good to me. Not sure what's going on with CI; I think it's still the OPENAI_API_KEY. I'll just test it locally now and override CI if the tests pass.

@thesofakillers thesofakillers merged commit 7acfcc7 into openai:main Oct 28, 2025
6 of 8 checks passed
@stalkermustang

> This looks good to me. Not sure what's going on with CI; I think it's still the OPENAI_API_KEY. I'll just test it locally now and override CI if the tests pass.

Thanks for merging.
I've checked the CI logs; it seems one needs to specify GRADER_OPENAI_API_KEY as well?

@thesofakillers

> Thanks for merging. I've checked the CI logs; it seems one needs to specify GRADER_OPENAI_API_KEY as well?

I don't think so. We only have OPENAI_API_KEY configured in the GitHub settings here, and that seems to be enough. Could just be weirdness with how CI interacts with forks 🤷🏻‍♂️
