Add agentic arena benchmark #447
Walkthrough
Two standalone benchmark runners are added under
Changes
Agentic Benchmark Runner
Arena Benchmark Runner
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benches/run_agentic_benchmark.py`:
- Around line 101-105: The run_terminal method executes arbitrary commands
derived from model output without restricting what commands can be executed. To
fix this, implement a command allowlist that validates and restricts the
commands that can be executed to only safe, benchmark-appropriate operations.
Alternatively, implement proper sandboxing using containerization (Docker) or
system-level sandboxing tools (nsjail/firejail) to isolate command execution and
prevent access to the host system's network and files outside the benchmark
workspace. Apply this security fix to all locations where subprocess.run is
called with untrusted model-generated commands, including the run_terminal
method and any other similar command execution points around line 283.
- Around line 891-899: The Pie path is computing task.goal_cached before calling
task.setup(ws), while the offline mode (around lines 842-844) calls setup first.
This ordering inconsistency causes divergent behavior when task goals depend on
setup state. Move the line task.goal_cached = task.goal(ws) to execute after the
task.setup(ws) call to match the offline mode's ordering and ensure setup state
is available before computing the goal.
- Around line 87-96: The write_file, read_file, and exists methods in the
Workspace class join the rel parameter onto the workspace path without
validating that the result stays within the workspace directory, allowing path
traversal through absolute paths or .. segments. Add path validation in each
method by resolving both the workspace path (self.path.resolve()) and the target
path, then verify that the resolved target is the workspace root or has the root
among its parents. A plain string-prefix comparison is not enough, because it
also accepts sibling directories whose names merely begin with the workspace
name. Raise a ValueError or similar exception if the target escapes the
workspace bounds before proceeding with the file operation.
In `@benches/run_arena_benchmark.py`:
- Around line 689-708: The auth_by_token() call is executed outside the
try/finally block, which means if authentication fails, the client connection is
never closed in the finally block. Move the conditional check for args.token and
the await client.auth_by_token(args.token) call inside the try block, right
before the loop that iterates through tasks. This ensures that the finally block
will always execute and properly close the client connection regardless of
whether the auth call succeeds or fails.
In `@benches/test_agentic_benchmark.py`:
- Around line 203-205: The test uses assert False to signal an expected failure
when SystemExit is not raised during B.make_web_backend(live), but this
assertion gets removed when Python runs with optimization flag -O, causing the
test to silently pass when it should fail. Replace the assert False statement
with an explicit raise AssertionError("expected SystemExit when the key env var
is missing") to ensure the test properly fails regardless of Python optimization
flags.
- Line 10: The module docstring contains an incorrect command path for running
the benchmark test. Update the docstring instruction at line 10 that currently
shows `python3 tests/test_agentic_benchmark.py` to reflect the correct path
`python3 benches/test_agentic_benchmark.py`, since the file is located in the
benches directory, not the tests directory. This will ensure users can
successfully copy-paste and run the command from the module docstring.
In `@benches/test_arena_benchmark.py`:
- Line 5: The module docstring in test_arena_benchmark.py contains an outdated
direct-run command that incorrectly references the file location as
tests/test_arena_benchmark.py when the file is actually located in the benches/
directory. Update the example command in the module docstring to use the correct
file path that accurately reflects where test_arena_benchmark.py resides in the
project structure.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c5164c2f-1270-4d28-8c78-120d9a87fa30
📒 Files selected for processing (5)
benches/benchmark_tasks.example.json
benches/run_agentic_benchmark.py
benches/run_arena_benchmark.py
benches/test_agentic_benchmark.py
benches/test_arena_benchmark.py
    def write_file(self, rel: str, content: str) -> None:
        target = self.path / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")

    def read_file(self, rel: str) -> str:
        return (self.path / rel).read_text(encoding="utf-8")

    def exists(self, rel: str) -> bool:
        return (self.path / rel).exists()
🔒 Security & Privacy | 🔴 Critical | ⚡ Quick win
Enforce workspace path containment for file tools.
Line 88 and Line 93 trust rel directly, so absolute paths or .. segments can escape the temp workspace and read/write host files. That breaks the isolation guarantee and is a real sandbox escape.
🔧 Proposed fix
 class Workspace:
     """An isolated temp dir the agent acts on, with real file + terminal access."""
@@
+    def _resolve_inside_workspace(self, rel: str) -> Path:
+        root = self.path.resolve()
+        target = (self.path / rel).resolve()
+        if target != root and root not in target.parents:
+            raise ValueError(f"path escapes workspace: {rel}")
+        return target
+
     def write_file(self, rel: str, content: str) -> None:
-        target = self.path / rel
+        target = self._resolve_inside_workspace(rel)
         target.parent.mkdir(parents=True, exist_ok=True)
         target.write_text(content, encoding="utf-8")
     def read_file(self, rel: str) -> str:
-        return (self.path / rel).read_text(encoding="utf-8")
+        return self._resolve_inside_workspace(rel).read_text(encoding="utf-8")
     def exists(self, rel: str) -> bool:
-        return (self.path / rel).exists()
+        try:
+            return self._resolve_inside_workspace(rel).exists()
+        except ValueError:
+            return False
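The containment rule can be exercised in isolation. The sketch below is not code from this PR: `resolve_inside` and `escapes` are hypothetical free-function names, and the sample paths are made up. It suggests why checking `target.parents` is safer than a string-prefix comparison, which would accept a sibling directory such as `ws-evil`.

```python
import tempfile
from pathlib import Path


def resolve_inside(root: Path, rel: str) -> Path:
    """Resolve rel against root; raise ValueError if the result leaves root."""
    root = root.resolve()
    target = (root / rel).resolve()
    if target != root and root not in target.parents:
        raise ValueError(f"path escapes workspace: {rel}")
    return target


def escapes(root: Path, rel: str) -> bool:
    """Return True when rel would land outside root."""
    try:
        resolve_inside(root, rel)
    except ValueError:
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    root = (Path(tmp) / "ws").resolve()
    root.mkdir()
    sibling = (root / "../ws-evil/a.txt").resolve()
    # A naive string-prefix test accepts the sibling directory "ws-evil".
    naive_prefix_accepts_sibling = str(sibling).startswith(str(root))
    results = {
        rel: escapes(root, rel)
        for rel in ("notes/a.txt", ".", "../ws-evil/a.txt", "/etc/passwd")
    }
```

Note that resolving before comparing also follows symlinks, so a link inside the workspace that points outside it is rejected as well.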
    def run_terminal(self, command: list[str], timeout: int = 30) -> dict[str, Any]:
        start = time.time()
        try:
            proc = subprocess.run(command, cwd=self.path, capture_output=True,
                                  text=True, timeout=timeout)
🔒 Security & Privacy | 🟠 Major | 🏗️ Heavy lift
run_terminal currently executes untrusted model commands on the host.
Line 101 and Line 283 execute arbitrary commands derived from model output. Even with shell=False, this is still direct command execution and can access network/files outside the benchmark workspace. Consider a hard sandbox boundary (container/nsjail/firejail) or a strict command allowlist for benchmark tasks.
Also applies to: 283-285
🧰 Tools
🪛 ast-grep (0.44.0)
[error] 103-104: Use of unsanitized data to create processes
Context: subprocess.run(command, cwd=self.path, capture_output=True,
text=True, timeout=timeout)
Note: [CWE-78] Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection').
(os-system-unsanitized-data)
[error] 103-104: Command coming from incoming request
Context: subprocess.run(command, cwd=self.path, capture_output=True,
text=True, timeout=timeout)
Note: [CWE-78] Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection').
(subprocess-from-request)
🪛 Ruff (0.15.18)
[error] 104-104: subprocess call: check for execution of untrusted input
(S603)
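One way the allowlist option could look is sketched below. This is an assumption-laden illustration rather than the PR's code: `ALLOWED_COMMANDS` and `check_command` are hypothetical names, and the set of permitted executables is a guess based on the tools the PR description mentions (`grep`, `wc`, Python).

```python
from pathlib import PurePath

# Hypothetical allowlist; the real set depends on what the benchmark tasks need.
ALLOWED_COMMANDS = frozenset({"python3", "grep", "wc", "ls", "cat"})


def check_command(command: list[str]) -> list[str]:
    """Return command unchanged if its executable is allowlisted, else raise."""
    if not command:
        raise PermissionError("empty command")
    exe = command[0]
    # Reject explicit paths so "/tmp/evil/python3" cannot pose as "python3".
    if PurePath(exe).name != exe or exe not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowed: {exe!r}")
    return command
```

A caller would pass the validated list to `subprocess.run`. An allowlist like this is only a partial mitigation: an allowed interpreter such as `python3` can still reach the network or files outside the workspace, so container or nsjail/firejail isolation remains the stronger boundary.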
    task.goal_cached = task.goal(ws)  # stable text for modular-cache modules
    astep = make_pie_step_fn(run_inferlet_fn, client, family, language, task, mode, args)
    history: list[dict[str, Any]] = []
    counts = {"tool_calls": 0, "tool_errors": 0, "tool_hallucinations": 0, "invalid": 0,
              "had_error": False, "recovered": False}
    final_answer, raw_parts = "", []
    try:
        task.setup(ws)
        start = time.time()
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Fix setup/goal ordering inconsistency in Pie path.
Line 891 computes task.goal_cached before Line 898 task.setup(ws), while offline mode does setup first (Line 842-844). Tasks whose goals depend on setup state will diverge between offline and Pie runs.
🔧 Proposed fix
 async def run_task_pie(run_inferlet_fn, client, family, language, task, mode, args, web=None) -> dict[str, Any]:
     ws = Workspace()
     web = web or FrozenWebSearch()
-    task.goal_cached = task.goal(ws)  # stable text for modular-cache modules
     astep = make_pie_step_fn(run_inferlet_fn, client, family, language, task, mode, args)
@@
     try:
         task.setup(ws)
+        task.goal_cached = task.goal(ws)  # stable text for modular-cache modules
         start = time.time()
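The effect of the ordering can be reproduced with a toy task. Everything here is hypothetical: `DemoTask` stands in for a task whose goal text depends on files created during setup, and a plain dict stands in for the workspace.

```python
class DemoTask:
    """Hypothetical task whose goal text lists the files that setup() creates."""

    def setup(self, ws: dict) -> None:
        ws["input.csv"] = "a,b\n1,2\n"

    def goal(self, ws: dict) -> str:
        names = ", ".join(sorted(ws)) or "<no files>"
        return f"Summarize these files: {names}"


def goal_before_setup(task: DemoTask) -> str:
    """Order used by the Pie path before the fix: goal first, then setup."""
    ws: dict = {}
    goal = task.goal(ws)
    task.setup(ws)
    return goal


def goal_after_setup(task: DemoTask) -> str:
    """Order used by offline mode: setup first, then goal."""
    ws: dict = {}
    task.setup(ws)
    return task.goal(ws)
```

With this toy task, the two orderings produce different goal strings, which is the divergence between offline and Pie runs that the comment describes.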
    client = PieClient(args.url)
    await client.connect()
    if args.token and args.token.lower() != "none":
        await client.auth_by_token(args.token)

    rows: list[dict[str, Any]] = []
    try:
        for task in tasks:
            print(f"\n== {task['id']} ({task['category']}) ==")
            for language in languages:
                for family in families:
                    print(f"  {family}-{language}", flush=True)
                    if family == "modular-cache":
                        rows.extend(await run_modular_cache(run_inferlet_fn, client, language, task, args))
                    elif family == "hierarchical-attention":
                        rows.extend(await run_hierarchical_attention(run_inferlet_fn, client, language, task, args))
                    elif family == "mcts":
                        rows.extend(await run_mcts(run_inferlet_fn, client, language, task, args))
    finally:
        await client.close()
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Close the client if auth_by_token fails.
connect() and auth_by_token() run before the try/finally, so an auth failure skips client.close() and leaks the connection. Move the auth call inside the try so cleanup always runs after a successful connect().
🔒 Proposed fix
     await client.connect()
-    if args.token and args.token.lower() != "none":
-        await client.auth_by_token(args.token)
-
     rows: list[dict[str, Any]] = []
     try:
+        if args.token and args.token.lower() != "none":
+            await client.auth_by_token(args.token)
         for task in tasks:
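The cleanup guarantee can be checked without a Pie server. In this sketch, `FakeClient` is a hypothetical stand-in for `PieClient` whose authentication always fails; only the call order around `try`/`finally` is meant to mirror the proposed fix.

```python
import asyncio


class FakeClient:
    """Stand-in for PieClient; auth always fails so the cleanup path is exercised."""

    def __init__(self) -> None:
        self.closed = False

    async def connect(self) -> None:
        pass

    async def auth_by_token(self, token: str) -> None:
        raise PermissionError("bad token")

    async def close(self) -> None:
        self.closed = True


async def run(client: FakeClient, token: str) -> None:
    await client.connect()
    try:
        # Auth sits inside the try block, so a failure still reaches finally.
        if token and token.lower() != "none":
            await client.auth_by_token(token)
        # ... the benchmark loop would run here ...
    finally:
        await client.close()


async def demo() -> bool:
    client = FakeClient()
    try:
        await run(client, "secret")
    except PermissionError:
        pass
    return client.closed
```

Running `asyncio.run(demo())` reports whether the connection was closed after the failed auth call.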
Run directly:

    python3 tests/test_agentic_benchmark.py
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Fix the direct-run command in the module docstring.
Line 10 points to tests/test_agentic_benchmark.py, but this file lives under benches/; copy-pasting the current command from repo root will fail.
Suggested patch
-    python3 tests/test_agentic_benchmark.py
+    python3 benches/test_agentic_benchmark.py
        B.make_web_backend(live)
        assert False, "expected SystemExit when the key env var is missing"
    except SystemExit:
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Replace assert False in the expected-failure branch.
Line 204 uses assert False, which is removed with python -O; the test can silently pass if SystemExit is not raised. Raise AssertionError explicitly instead.
Suggested patch
     try:
         B.make_web_backend(live)
-        assert False, "expected SystemExit when the key env var is missing"
+        raise AssertionError("expected SystemExit when the key env var is missing")
     except SystemExit:
         pass
🧰 Tools
🪛 Ruff (0.15.18)
[warning] 204-204: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
Source: Linters/SAST tools
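The same expectation can also be factored into a small helper that does not rely on `assert` at all. This is a sketch with a hypothetical name (`expect_system_exit`), not a helper that exists in the test module.

```python
import sys
from typing import Callable


def expect_system_exit(fn: Callable[[], object]) -> None:
    """Fail loudly unless fn() raises SystemExit; unaffected by `python -O`."""
    try:
        fn()
    except SystemExit:
        return
    # An explicit raise survives -O, whereas `assert False` is compiled away.
    raise AssertionError("expected SystemExit")
```

A test would then call something like `expect_system_exit(lambda: B.make_web_backend(live))`, assuming `B` and `live` are set up as in the existing test.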
Run directly:

    python3 tests/test_arena_benchmark.py
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Fix stale direct-run path in module docstring.
The command points to tests/test_arena_benchmark.py, but this file lives under benches/, so the documented command is misleading.
Suggested fix
-    python3 tests/test_arena_benchmark.py
+    python3 benches/test_arena_benchmark.py
Summary
This PR adds a realistic agentic benchmark for evaluating Pie inferlets, especially the test-time-scaling examples: modular-cache, hierarchical-attention, and MCTS.
The benchmark is a local, reproducible stand-in for Agent Arena-style evaluation. Agent Arena itself is live and human-voted, so it cannot be reproduced directly from the repo. Instead, this benchmark gives Pie a concrete local harness with real tools, objective grading, pairwise scoring, and baseline-vs-method comparisons.
What Changed
Added:
benches/run_agentic_benchmark.py
benches/run_arena_benchmark.py
benches/test_agentic_benchmark.py
benches/test_arena_benchmark.py
benches/benchmark_tasks.example.json
The main benchmark runs a ReAct-style agent loop. On each step, the model emits either a tool call or a final answer.
The harness executes the tool, records the observation, adds it back into the context, and continues until the task is complete or the step budget is reached.
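A loop of this shape can be sketched in a few lines. The message format below is an assumption made for illustration (a JSON object with either `tool` and `args` keys or a `final` key); the benchmark's actual wire format is not shown on this page, and `run_agent_loop` is a hypothetical name.

```python
import json
from typing import Any, Callable


def run_agent_loop(step_fn: Callable[[list], str],
                   tools: dict[str, Callable[..., Any]],
                   max_steps: int = 8) -> dict[str, Any]:
    """Drive a model until it returns a final answer or the step budget runs out."""
    history: list[dict[str, Any]] = []
    for _ in range(max_steps):
        raw = step_fn(history)
        try:
            msg = json.loads(raw)
        except json.JSONDecodeError:
            msg = None
        if not isinstance(msg, dict):
            history.append({"error": "invalid step"})
            continue
        if "final" in msg:
            return {"answer": msg["final"], "history": history}
        tool = tools.get(msg.get("tool"))
        if tool is None:
            # The model named a tool that does not exist (a "tool hallucination").
            history.append({"error": f"unknown tool: {msg.get('tool')}"})
            continue
        observation = tool(**msg.get("args", {}))
        history.append({"tool": msg["tool"], "observation": observation})
    return {"answer": None, "history": history}


# Scripted stand-in for a model: one tool call, then a final answer.
replies = iter(['{"tool": "add", "args": {"a": 1, "b": 2}}', '{"final": "3"}'])
result = run_agent_loop(lambda history: next(replies), {"add": lambda a, b: a + b})
```

The history list plays the role of the context that observations are appended to between steps.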
Tools
The benchmark includes three agent tool categories:
filesystem: real reads/writes in an isolated temporary workspace
terminal: real subprocess execution with real exit codes
web_search: deterministic local fake web by default, with optional live search support
The local fake web keeps default runs reproducible. A live backend can be enabled separately with an API key.
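To illustrate why a local corpus keeps runs reproducible, here is a minimal deterministic search over an in-memory set of pages. `TinyFrozenSearch`, its scoring rule, and the example URLs are all hypothetical; this is not the `FrozenWebSearch` implementation from the PR.

```python
class TinyFrozenSearch:
    """Deterministic keyword search over a fixed, in-memory set of pages."""

    def __init__(self, pages: dict[str, str]) -> None:
        self.pages = pages

    def search(self, query: str, k: int = 3) -> list[str]:
        terms = query.lower().split()
        scored = []
        for url, text in self.pages.items():
            score = sum(text.lower().count(term) for term in terms)
            if score > 0:
                scored.append((-score, url))
        # Sorting by (-score, url) breaks ties by URL, so results never reorder.
        return [url for _, url in sorted(scored)[:k]]


web = TinyFrozenSearch({
    "https://example.test/pie": "Pie serves inferlets",
    "https://example.test/cake": "cake recipes",
    "https://example.test/apple-pie": "apple pie and more pie",
})
hits = web.search("pie")
```

Because neither the corpus nor the ranking depends on time or network state, repeated runs see identical observations.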
Tasks And Grading
The built-in tasks cover file workflows, coding/debugging, repo navigation, web research, and data processing. They are graded objectively using concrete ground truth: parsed JSON files, terminal exit codes, grep/wc output, or exact answer checks.
The benchmark also supports loading additional tasks from JSON with --tasks-file; benches/benchmark_tasks.example.json shows the format.
Scoring
The benchmark reports objective metrics such as task success, steps, tool calls, tool errors, tool hallucinations, error recovery and steerability.
It also builds an Arena-style leaderboard using pairwise Bradley-Terry scoring over shared task outcomes, plus component attribution over method, mode and language features.
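As a rough illustration of pairwise Bradley-Terry scoring over shared task outcomes, the sketch below counts a win whenever one method succeeds on a task where the other fails, then fits strengths with the standard minorization-maximization update. The function names, the treatment of ties (ignored), and the sample outcomes are assumptions; the PR's own fitting procedure may differ.

```python
from itertools import combinations


def pairwise_wins(outcomes: dict[str, dict[str, bool]]) -> dict[tuple[str, str], int]:
    """Count wins per ordered pair (winner, loser) over tasks both methods ran."""
    wins: dict[tuple[str, str], int] = {}
    for a, b in combinations(sorted(outcomes), 2):
        for task in outcomes[a].keys() & outcomes[b].keys():
            if outcomes[a][task] == outcomes[b][task]:
                continue  # ties carry no information in this sketch
            pair = (a, b) if outcomes[a][task] else (b, a)
            wins[pair] = wins.get(pair, 0) + 1
    return wins


def bradley_terry(wins: dict[tuple[str, str], int], iters: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry strengths with the classic MM iteration."""
    players = sorted({p for pair in wins for p in pair})
    strength = {p: 1.0 for p in players}
    for _ in range(iters):
        updated = {}
        for i in players:
            won = sum(n for (winner, _), n in wins.items() if winner == i)
            denom = 0.0
            for j in players:
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if j != i and games:
                    denom += games / (strength[i] + strength[j])
            updated[i] = won / denom if denom else strength[i]
        total = sum(updated.values())
        # Rescale so the strengths average to 1; only ratios are meaningful.
        strength = {p: v * len(players) / total for p, v in updated.items()}
    return strength


# Made-up outcomes: method -> task id -> success flag.
outcomes = {
    "baseline": {"t1": True, "t2": False, "t3": False, "t4": False},
    "mcts": {"t1": False, "t2": True, "t3": True, "t4": True},
}
ratings = bradley_terry(pairwise_wins(outcomes))
```

Only the ratio between two strengths is interpretable: it estimates the odds that one method beats the other on a task where their outcomes differ.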