feat(coding_agent_rl): add SWE-bench harness evaluation + uniagent mode #250
Open
aoshen02 wants to merge 1 commit into
Add `swebench_metadata` as a third evaluation route in `sandbox.evaluate()`, alongside the existing `swepro` and `eval_cmd` paths. Eval script generation and output parsing delegate to `uni_agent.reward.swe_bench` (`make_eval_script` / `parse_eval_output`), keeping the logic in one place.

Also adds:
- a `SWE_AGENT_MODE` env switch in `generate.py` for uniagent/claude_code mode selection
- an adapter shim on `:18001` (Anthropic Messages API -> vLLM generate)
- `_abort()` degradation (`rollout_log_probs=[0.0]` + `loss_mask=[0]` placeholder)

Docker: add uni-agent as a pip dependency (brings swebench transitively).

Validated on 500-instance SWE-bench Verified eval with Qwen3.6-35B-A3B: uniagent mode 71.3% (355/498), matching the official 71.6%.

Mirror of THUDM/slime#2079.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
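To make the `_abort()` degradation concrete: a placeholder with `rollout_log_probs=[0.0]` and `loss_mask=[0]` keeps an aborted sample well-formed while masking its single token out of any loss-mask-weighted average. The sketch below is only an assumption about how such a mask is typically consumed, not the code in `generate.py`; `Sample` and `masked_mean` are hypothetical names introduced for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical stand-in for a rollout sample (not the project's type)."""
    rollout_log_probs: list[float]
    loss_mask: list[int]


def aborted_placeholder() -> Sample:
    # Values taken from the PR description: one dummy token, masked out.
    return Sample(rollout_log_probs=[0.0], loss_mask=[0])


def masked_mean(samples: list[Sample]) -> float:
    """Average per-token values over tokens whose mask is 1 (assumed usage)."""
    total = 0.0
    count = 0
    for s in samples:
        for value, mask in zip(s.rollout_log_probs, s.loss_mask):
            total += value * mask
            count += mask
    return total / count if count else 0.0


normal = Sample(rollout_log_probs=[-0.5, -1.5], loss_mask=[1, 1])
with_abort = masked_mean([normal, aborted_placeholder()])
without_abort = masked_mean([normal])
```

Under this assumption the aborted sample changes neither the numerator nor the denominator, so `with_abort` equals `without_abort`.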
Code Review
This pull request adds support for SWE-bench evaluation using the uni-agent library, introduces a Modal sandbox backend alongside the existing E2B backend, and allows overriding the adapter URL via the ADAPTER_URL_OVERRIDE environment variable. However, a critical issue was identified where importing ModalSandbox from vime.agent.sandbox will fail with an ImportError because it is not defined or exported in that module.
      from pathlib import Path

    - from vime.agent.sandbox import E2BSandbox, Sandbox
    + from vime.agent.sandbox import E2BSandbox, ModalSandbox, Sandbox
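The failure mode described in the review can be reproduced without the project itself. The sketch below builds a throwaway in-memory module (`fake_sandbox` is a hypothetical stand-in for `vime.agent.sandbox`) that defines only two of the three imported names. It then shows that the `from ... import` statement raises `ImportError`, and that a small helper (`missing_exports`, also hypothetical) can list the missing names up front.

```python
import importlib
import sys
import types

# Stand-in module that defines E2BSandbox and Sandbox but not ModalSandbox.
fake = types.ModuleType("fake_sandbox")
fake.E2BSandbox = type("E2BSandbox", (), {})
fake.Sandbox = type("Sandbox", (), {})
sys.modules["fake_sandbox"] = fake


def missing_exports(module_name: str, names: list[str]) -> list[str]:
    """Return the names that the module does not define."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


wanted = ["E2BSandbox", "ModalSandbox", "Sandbox"]
missing = missing_exports("fake_sandbox", wanted)

try:
    from fake_sandbox import E2BSandbox, ModalSandbox, Sandbox  # noqa: F401
    import_failed = False
except ImportError:
    # Raised because ModalSandbox is not an attribute of the module.
    import_failed = True
```

Running a check like this against the real module in CI would surface the problem before the import is reached at runtime.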
Summary
Add `swebench_metadata` as a third evaluation route in `sandbox.evaluate()`, alongside the existing `swepro` and `eval_cmd` paths. Also adds uniagent mode support to `generate.py`.
Changes
- `docker/Dockerfile`: add `uni-agent` dependency (brings `swebench` transitively)
- `examples/coding_agent_rl/sandbox.py`: add `swebench_metadata` param to `evaluate()`, add `_run_swebench_eval()`, which delegates to `uni_agent.reward.swe_bench.make_eval_script` / `parse_eval_output`
- `examples/coding_agent_rl/generate.py`: `SWE_AGENT_MODE` env switch for uniagent/claude_code; adapter shim on `:18001` (Anthropic Messages API → vLLM generate); `swebench_metadata` passthrough; `_abort()` degradation (`rollout_log_probs=[0.0]` + `loss_mask=[0]`)
Evaluation priority
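The priority order itself is not reproduced on this page. As a hedged illustration of what a three-way dispatch between the routes could look like, the sketch below assumes (without confirmation from the PR) that `swebench_metadata` is checked first, then `swepro`, then `eval_cmd`. The function name `select_eval_route`, its parameter types, and the order are all assumptions for illustration, not the behavior of `sandbox.evaluate()`.

```python
from __future__ import annotations

from typing import Any


def select_eval_route(
    swebench_metadata: dict[str, Any] | None = None,
    swepro: dict[str, Any] | None = None,
    eval_cmd: str | None = None,
) -> str:
    """Pick one evaluation route; the order here is ASSUMED, not confirmed."""
    if swebench_metadata:
        return "swebench_metadata"
    if swepro:
        return "swepro"
    if eval_cmd:
        return "eval_cmd"
    raise ValueError("no evaluation route configured for this sample")


# Example: when several inputs are present, the first matching route wins.
route = select_eval_route(
    swebench_metadata={"instance_id": "example"},  # hypothetical metadata
    eval_cmd="pytest -q",
)
```

Keeping the selection in one small function makes the precedence explicit and easy to unit-test, whatever the actual order is.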
Context
Validated on 500-instance SWE-bench Verified eval with Qwen3.6-35B-A3B: uniagent mode 71.3% (355/498), matching the official 71.6%.
Mirror of THUDM/slime#2079.
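The pass rate quoted under Context can be checked with a few lines of arithmetic. The numbers below are taken directly from the description. The denominator is 498, as reported, rather than 500, and the page does not say why two instances are excluded.

```python
resolved = 355
evaluated = 498          # denominator as reported in the description
official_rate = 71.6     # percent, as quoted

rate = 100 * resolved / evaluated   # percent resolved over evaluated instances
gap = official_rate - rate          # difference in percentage points

print(f"resolved rate: {rate:.1f}%")
print(f"gap to official: {gap:.1f} points")
```

The reported 71.3% therefore sits about 0.3 percentage points below the quoted official figure.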
Test plan
- `docker build` completes with uni-agent installed
- `python -c "from uni_agent.reward.swe_bench import make_eval_script, parse_eval_output; print('ok')"` inside container
- `swebench_metadata` in sample metadata

🤖 Generated with Claude Code