feat(coding_agent_rl): add SWE-bench harness evaluation + uniagent mode by aoshen02 · Pull Request #250 · vllm-project/vime

aoshen02 · 2026-06-15T04:32:31Z

Summary

Add swebench_metadata as a third evaluation route in sandbox.evaluate(), alongside the existing swepro and eval_cmd paths. Also adds uniagent mode support to generate.py.

Changes

docker/Dockerfile: add uni-agent dependency (brings swebench transitively)
examples/coding_agent_rl/sandbox.py: add swebench_metadata param to evaluate(), add _run_swebench_eval() — delegates to uni_agent.reward.swe_bench.make_eval_script / parse_eval_output
examples/coding_agent_rl/generate.py: SWE_AGENT_MODE env switch selects uniagent or claude_code mode; adapter shim on :18001 translates the Anthropic Messages API to vLLM generate; swebench_metadata is passed through to evaluation; _abort() degrades to placeholder values (rollout_log_probs=[0.0], loss_mask=[0])
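
The SWE_AGENT_MODE switch and the _abort() placeholders listed above can be sketched roughly as follows. This is an illustration only, not the code in generate.py: the helper names `resolve_agent_mode` and `abort_placeholders`, the `claude_code` default, and the dict return shape are all assumptions. Only the env variable name, the two mode names, and the placeholder values come from this PR.

```python
import os
from typing import Dict, Mapping, Optional

# Mode names taken from the PR description.
VALID_MODES = ("uniagent", "claude_code")


def resolve_agent_mode(
    env: Optional[Mapping[str, str]] = None,
    default: str = "claude_code",  # assumption: the PR does not state the default
) -> str:
    """Read SWE_AGENT_MODE and reject values other than the two known modes."""
    source = os.environ if env is None else env
    mode = source.get("SWE_AGENT_MODE", default).strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"SWE_AGENT_MODE must be one of {VALID_MODES}, got {mode!r}")
    return mode


def abort_placeholders() -> Dict[str, list]:
    """Hypothetical helper returning the placeholder values named in the PR.

    A single log-prob of 0.0 paired with a loss mask of 0 keeps the two lists
    the same length while marking the one placeholder token as masked out.
    """
    return {"rollout_log_probs": [0.0], "loss_mask": [0]}
```

Passing the environment as an argument keeps the switch easy to test without mutating `os.environ`.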

Evaluation priority

swepro            → SWEPro custom scripts (existing, unchanged)
swebench_metadata → SWE-bench official harness (new)
eval_cmd          → shell command fallback (existing, unchanged)
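
The priority order above amounts to a first-match dispatch. A minimal sketch, assuming the three routes are selected by whichever key is present and truthy in a metadata mapping; the function name `select_eval_route` and the truthiness check are hypothetical, and the real `sandbox.evaluate()` may take these as separate parameters:

```python
from typing import Any, Mapping, Optional

# Route names in the priority order given in the PR description.
ROUTE_PRIORITY = ("swepro", "swebench_metadata", "eval_cmd")


def select_eval_route(metadata: Mapping[str, Any]) -> Optional[str]:
    """Return the first route whose key is present and truthy, else None."""
    for key in ROUTE_PRIORITY:
        if metadata.get(key):
            return key
    return None
```

With this ordering, a sample that carries both `swebench_metadata` and `eval_cmd` is evaluated by the SWE-bench harness, and `eval_cmd` is used only when neither of the other two is set.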

Context

Validated on 500-instance SWE-bench Verified eval with Qwen3.6-35B-A3B: uniagent mode 71.3% (355/498), matching the official 71.6%.
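
The reported percentage can be reproduced from the counts given above. The snippet uses the 355 resolved and 498 scored instances exactly as reported; the PR does not say why the denominator is 498 rather than 500, and the helper name `resolve_rate` is only illustrative.

```python
def resolve_rate(resolved: int, evaluated: int) -> float:
    """Return the resolve rate as a percentage."""
    return 100.0 * resolved / evaluated


rate = resolve_rate(355, 498)
print(f"{rate:.1f}%")  # prints 71.3%
print(f"gap to the official 71.6%: {71.6 - rate:.1f} points")  # prints 0.3 points
```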

Mirror of THUDM/slime#2079.

Test plan

docker build completes with uni-agent installed
python -c "from uni_agent.reward.swe_bench import make_eval_script, parse_eval_output; print('ok')" inside container
Run a small SWE-bench eval with swebench_metadata in sample metadata
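
The import check in the test plan can also be written as a reusable helper that reports a missing dependency instead of raising. This is a sketch under assumptions: the helper name `helpers_available` is hypothetical, and only the module path and the two function names come from this PR.

```python
import importlib
from typing import Iterable


def helpers_available(
    module_name: str = "uni_agent.reward.swe_bench",
    names: Iterable[str] = ("make_eval_script", "parse_eval_output"),
) -> bool:
    """Return True if the module imports and exposes every name as a callable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError too, e.g. when uni-agent is not installed.
        return False
    return all(callable(getattr(module, name, None)) for name in names)
```

Inside the container, `helpers_available()` should return True once the docker build has installed uni-agent.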

🤖 Generated with Claude Code

Add `swebench_metadata` as a third evaluation route in `sandbox.evaluate()`, alongside the existing `swepro` and `eval_cmd` paths. Eval script generation and output parsing delegate to `uni_agent.reward.swe_bench` (make_eval_script / parse_eval_output), keeping the logic in one place.

Also adds `SWE_AGENT_MODE` env switch in generate.py for uniagent/claude_code mode selection, adapter shim on :18001 (Anthropic Messages API -> vLLM generate), and `_abort()` degradation (rollout_log_probs=[0.0] + loss_mask=[0] placeholder).

Docker: add uni-agent as a pip dependency (brings swebench transitively).

Validated on 500-instance SWE-bench Verified eval with Qwen3.6-35B-A3B: uniagent mode 71.3% (355/498), matching the official 71.6%.

Mirror of THUDM/slime#2079.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>

gemini-code-assist

Code Review

This pull request adds support for SWE-bench evaluation using the uni-agent library, introduces a Modal sandbox backend alongside the existing E2B backend, and allows overriding the adapter URL via the ADAPTER_URL_OVERRIDE environment variable. However, a critical issue was identified where importing ModalSandbox from vime.agent.sandbox will fail with an ImportError because it is not defined or exported in that module.

gemini-code-assist · 2026-06-15T04:33:44Z

 from pathlib import Path

-from vime.agent.sandbox import E2BSandbox, Sandbox
+from vime.agent.sandbox import E2BSandbox, ModalSandbox, Sandbox


The import of ModalSandbox from vime.agent.sandbox will fail with an ImportError because ModalSandbox is not defined or exported in vime/agent/sandbox.py. Please implement and export ModalSandbox in vime/agent/sandbox.py or remove this import.

read-the-docs-community · 2026-06-15T04:35:07Z

Documentation build overview

📚 vime | 🛠️ Build #33139557 | 📁 Comparing 8484445 against latest (491665d)

🔍 Preview build

26 files changed · ± 26 modified


aoshen02 mentioned this pull request Jun 15, 2026

[reward] refactor: extract make_eval_script / parse_eval_output as public helpers verl-project/uni-agent#62

Open

2 tasks

aoshen02 wants to merge 1 commit into vllm-project:main from aoshen02:feat/swebench-eval-path
