Skip to content

feat(coding_agent_rl): add SWE-bench harness evaluation + uniagent mode#250

Open
aoshen02 wants to merge 1 commit into
vllm-project:mainfrom
aoshen02:feat/swebench-eval-path
Open

feat(coding_agent_rl): add SWE-bench harness evaluation + uniagent mode#250
aoshen02 wants to merge 1 commit into
vllm-project:mainfrom
aoshen02:feat/swebench-eval-path

Conversation

@aoshen02

Copy link
Copy Markdown
Collaborator

Summary

Add swebench_metadata as a third evaluation route in sandbox.evaluate(), alongside the existing swepro and eval_cmd paths. Also adds uniagent mode support to generate.py.

Changes

  • docker/Dockerfile: add uni-agent dependency (brings swebench transitively)
  • examples/coding_agent_rl/sandbox.py: add swebench_metadata param to evaluate(), add _run_swebench_eval() — delegates to uni_agent.reward.swe_bench.make_eval_script / parse_eval_output
  • examples/coding_agent_rl/generate.py: SWE_AGENT_MODE env switch for uniagent/claude_code; adapter shim on :18001 (Anthropic Messages API → vLLM generate); swebench_metadata passthrough; _abort() degradation (rollout_log_probs=[0.0] + loss_mask=[0])

Evaluation priority

swepro            → SWEPro custom scripts (existing, unchanged)
swebench_metadata → SWE-bench official harness (new)
eval_cmd          → shell command fallback (existing, unchanged)

Context

Validated on 500-instance SWE-bench Verified eval with Qwen3.6-35B-A3B: uniagent mode 71.3% (355/498), matching the official 71.6%.

Mirror of THUDM/slime#2079.

Test plan

  • docker build completes with uni-agent installed
  • python -c "from uni_agent.reward.swe_bench import make_eval_script, parse_eval_output; print('ok')" inside container
  • Run a small SWE-bench eval with swebench_metadata in sample metadata

🤖 Generated with Claude Code

Add `swebench_metadata` as a third evaluation route in `sandbox.evaluate()`,
alongside the existing `swepro` and `eval_cmd` paths. Eval script generation
and output parsing delegate to `uni_agent.reward.swe_bench` (make_eval_script /
parse_eval_output), keeping the logic in one place.

Also adds `SWE_AGENT_MODE` env switch in generate.py for uniagent/claude_code
mode selection, adapter shim on :18001 (Anthropic Messages API -> vLLM generate),
and `_abort()` degradation (rollout_log_probs=[0.0] + loss_mask=[0] placeholder).

Docker: add uni-agent as a pip dependency (brings swebench transitively).

Validated on 500-instance SWE-bench Verified eval with Qwen3.6-35B-A3B:
uniagent mode 71.3% (355/498), matching the official 71.6%.

Mirror of THUDM/slime#2079.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for SWE-bench evaluation using the uni-agent library, introduces a Modal sandbox backend alongside the existing E2B backend, and allows overriding the adapter URL via the ADAPTER_URL_OVERRIDE environment variable. However, a critical issue was identified where importing ModalSandbox from vime.agent.sandbox will fail with an ImportError because it is not defined or exported in that module.

from pathlib import Path

from vime.agent.sandbox import E2BSandbox, Sandbox
from vime.agent.sandbox import E2BSandbox, ModalSandbox, Sandbox

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The import of ModalSandbox from vime.agent.sandbox will fail with an ImportError because ModalSandbox is not defined or exported in vime/agent/sandbox.py. Please implement and export ModalSandbox in vime/agent/sandbox.py or remove this import.

@read-the-docs-community

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant