[PB] Packaging fixes to various libs to enable non-editable installing #82
… dirs + added gpt-5 models to the context len mapping
thesofakillers
left a comment
Thank you for flagging this.
You are right: we had not properly packaged some of the libraries, and we didn't catch this because we typically work with them installed in editable mode.
I've left some comments adjusting your fixes. Thanks!
…ll packages in project/common
This looks good to me. Not sure what's going on with CI; I think it's still the OPENAI_API_KEY. I'll just test it locally now and override it if the tests pass.
Thanks for merging.
I don't think so, we only have
Hello OpenAI team,
I'm currently working on adapting PaperBench for the PrimeIntellect Environments hub (PR). This initiative closely resembles the previously reviewed MLE Bench Port—many thanks to @thesofakillers for an extensive review there!
I'm facing an issue while trying to reuse the existing Judge code from PaperBench, as I'd prefer adhering to the reference implementation rather than reinventing the wheel. To achieve this, I introduced PaperBench as a dependency and attempted to import `SimpleJudge` as follows:

However, this raises a couple of dependency-related issues:

- `alcatraz`: The import chain eventually requires accessing `JudgeOutput` that leads to:
- `SimpleJudge`: Internally attempts to import:

How to reproduce in a clean environment (5 min):

1. `mkdir <some_name>` & cd there & `uv init --name frontier_evals_fix`
2. `pyproject.toml` with this:
3. `uv sync` runs fine
4. `uv run main.py` will show the issue:

After a short debugging session with Codex, the only viable solution we found involved using an admittedly inelegant stub that modifies the `PATH`. I've looked into several adjustments to `pyproject.toml`, but unfortunately, none resolved the underlying issue.

The proposed fix in this PR addresses the problem by explicitly expanding the modules exposed for these two packages. An alternative approach for `alcatraz` might involve adding `__init__.py` files; however, it appears this was intentionally avoided in your design. For `nanoeval`, although `__init__.py` files exist, the exportable scope remains limited, perhaps intentionally, though I'm unsure.

Please let me know if you have a preferred approach or any suggestions for implementing this differently.
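To make the "explicitly expanding the modules exposed" idea concrete, here is a minimal sketch of how one might enumerate the sub-packages that would need to be listed explicitly. It is not the code from this PR. The helper names (`discover_packages`, `missing_init`, `demo`) and the `pkg`/`sub` layout are hypothetical, and it assumes a build configuration that only picks up directories containing `__init__.py` unless packages are listed by hand.

```python
import tempfile
from pathlib import Path


def discover_packages(src_root: Path, top_level: str) -> list[str]:
    """Return dotted names of every directory under src_root/top_level holding .py files."""
    base = src_root / top_level
    candidates = [base, *sorted(p for p in base.rglob("*") if p.is_dir())]
    found = []
    for directory in candidates:
        if directory.name == "__pycache__":
            continue
        # A directory counts as a package candidate if it directly contains modules.
        if any(directory.glob("*.py")):
            found.append(".".join(directory.relative_to(src_root).parts))
    return found


def missing_init(src_root: Path, packages: list[str]) -> list[str]:
    """Return the dotted names whose directory has no __init__.py."""
    return [
        pkg
        for pkg in packages
        if not (src_root.joinpath(*pkg.split(".")) / "__init__.py").exists()
    ]


def demo() -> tuple[list[str], list[str]]:
    """Build a throwaway tree (hypothetical layout) and inspect it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "pkg" / "sub").mkdir(parents=True)
        (root / "pkg" / "__init__.py").write_text("")
        (root / "pkg" / "sub" / "mod.py").write_text("X = 1\n")
        packages = discover_packages(root, "pkg")
        return packages, missing_init(root, packages)
```

Here `demo()` reports `pkg.sub` as a directory that holds modules but lacks `__init__.py`. Under the stated assumption, that is the kind of sub-package that works in an editable install but goes missing from a regular one unless it is listed explicitly.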
Additionally, I've updated the `CONTEXT_WINDOW_LENGTHS` mapping to include various GPT-5 family models. I've set the context length to 272k instead of 400k, as the existing implementation seems primarily concerned with input length rather than total tokens. If this interpretation is incorrect, please advise, and I can update the values accordingly.

Thanks in advance for your feedback!
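The 272k versus 400k choice can be illustrated with a small sketch. It assumes the 400k figure is the total window and that the remaining 128k is reserved for model output; that split is an assumption made here, not something stated in the PR. The mapping name `EXAMPLE_CONTEXT_WINDOW_LENGTHS`, the `"gpt-5"` key, and both helper functions are hypothetical and do not reproduce the real `CONTEXT_WINDOW_LENGTHS` structure.

```python
# Assumed figures: 400k total window, of which 128k is reserved for output.
TOTAL_CONTEXT_TOKENS = 400_000
ASSUMED_MAX_OUTPUT_TOKENS = 128_000


def input_budget(total_tokens: int, max_output_tokens: int) -> int:
    """Tokens left for the prompt once the output reservation is subtracted."""
    if max_output_tokens >= total_tokens:
        raise ValueError("output reservation must be smaller than the total window")
    return total_tokens - max_output_tokens


# Hypothetical mapping keyed by model name; values are input-side limits.
EXAMPLE_CONTEXT_WINDOW_LENGTHS = {
    "gpt-5": input_budget(TOTAL_CONTEXT_TOKENS, ASSUMED_MAX_OUTPUT_TOKENS),
}


def fits_in_context(model: str, prompt_tokens: int) -> bool:
    """True if a prompt of the given size stays within the model's input limit."""
    return prompt_tokens <= EXAMPLE_CONTEXT_WINDOW_LENGTHS[model]
```

If the consuming code instead compares prompt plus completion tokens against the limit, the total figure would be the right value to store, which is the open question raised above.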