Add agentic arena benchmark by agmankaruse · Pull Request #447 · pie-project/pie

agmankaruse · 2026-06-23T13:07:28Z

Summary

This PR adds a realistic agentic benchmark for evaluating Pie inferlets, especially the test-time-scaling examples: modular-cache, hierarchical-attention, and MCTS.

The benchmark is a local, reproducible stand-in for Agent Arena-style evaluation. Agent Arena itself is live and human-voted, so it cannot be reproduced directly from the repo. Instead, this benchmark gives Pie a concrete local harness with real tools, objective grading, pairwise scoring, and baseline-vs-method comparisons.

What Changed

Added:

- benches/run_agentic_benchmark.py
- benches/run_arena_benchmark.py
- benches/test_agentic_benchmark.py
- benches/test_arena_benchmark.py
- benches/benchmark_tasks.example.json

The main benchmark runs a ReAct-style agent loop. On each step, the model emits either a tool call:

ACTION: <tool>
ARGS: <json>

or a final answer:

FINAL: <answer>

The harness executes the tool, records the observation, adds it back into the context, and continues until the task is complete or the step budget is reached.
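The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in `benches/run_agentic_benchmark.py`: the names `ParsedStep`, `parse_step`, and `run_loop` are hypothetical, and it assumes the ARGS payload is a single-line JSON object.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ParsedStep:
    kind: str                                   # "action", "final", or "invalid"
    tool: str = ""
    args: dict[str, Any] = field(default_factory=dict)
    answer: str = ""

_ACTION_RE = re.compile(r"^ACTION:\s*(\S+)\s*$", re.MULTILINE)
_ARGS_RE = re.compile(r"^ARGS:\s*(.+)$", re.MULTILINE)
_FINAL_RE = re.compile(r"^FINAL:\s*(.*)$", re.MULTILINE | re.DOTALL)

def parse_step(text: str) -> ParsedStep:
    """Classify one model output as a tool call, a final answer, or invalid."""
    action = _ACTION_RE.search(text)
    if action:
        args_match = _ARGS_RE.search(text, action.end())
        try:
            args = json.loads(args_match.group(1)) if args_match else {}
        except json.JSONDecodeError:
            return ParsedStep(kind="invalid")
        if not isinstance(args, dict):          # assumption: ARGS must be a JSON object
            return ParsedStep(kind="invalid")
        return ParsedStep(kind="action", tool=action.group(1), args=args)
    final = _FINAL_RE.search(text)
    if final:
        return ParsedStep(kind="final", answer=final.group(1).strip())
    return ParsedStep(kind="invalid")

def run_loop(step_fn: Callable[[list], str], tools: dict[str, Callable], max_steps: int = 8):
    """Call step_fn until it emits FINAL or the step budget runs out."""
    history: list[dict[str, Any]] = []
    for _ in range(max_steps):
        step = parse_step(step_fn(history))
        if step.kind == "final":
            return step.answer, history
        if step.kind == "invalid":
            observation = {"error": "could not parse step"}
        elif step.tool not in tools:            # would count as a tool hallucination
            observation = {"error": f"unknown tool: {step.tool}"}
        else:
            try:
                observation = tools[step.tool](**step.args)
            except Exception as exc:            # tool errors become observations
                observation = {"error": str(exc)}
        history.append({"tool": step.tool, "args": step.args, "observation": observation})
    return None, history                        # budget exhausted without FINAL
```

Feeding every failure back as an observation, rather than aborting, is what lets a harness of this shape measure error recovery.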

Tools

The benchmark includes three agent tool categories:

- filesystem: real reads/writes in an isolated temporary workspace
- terminal: real subprocess execution with real exit codes
- web_search: deterministic local fake web by default, with optional live search support

The local fake web keeps default runs reproducible. A live backend can be enabled separately with an API key.

Tasks And Grading

The built-in tasks cover file workflows, coding/debugging, repo navigation, web research, and data processing. They are graded objectively using concrete ground truth: parsed JSON files, terminal exit codes, grep/wc output, or exact answer checks.
The benchmark also supports loading additional tasks from JSON with --tasks-file; benches/benchmark_tasks.example.json shows the format.

Scoring

The benchmark reports objective metrics such as task success, steps, tool calls, tool errors, tool hallucinations, error recovery and steerability.
It also builds an Arena-style leaderboard using pairwise Bradley-Terry scoring over shared task outcomes, plus component attribution over method, mode and language features.
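For readers unfamiliar with Bradley-Terry scoring, here is a minimal sketch of fitting strengths from pairwise win counts with the standard minorize-maximize update. It is not the PR's `bradley_terry` function: the name `bradley_terry_strengths`, the `prior` smoothing term, and the input layout are assumptions, and ties are not modelled.

```python
from collections import defaultdict

def bradley_terry_strengths(wins, prior=0.5, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] is how many shared tasks method a won against method b.
    Returns a dict of strengths normalised to a mean of 1.0.
    """
    pair_games = defaultdict(float)   # games played per unordered pair
    total_wins = defaultdict(float)   # (smoothed) wins per method
    for (a, b), n in wins.items():
        if a == b:
            continue
        pair_games[frozenset((a, b))] += n
        total_wins[a] += n
    # Smoothing (an assumption of this sketch): add `prior` virtual wins in each
    # direction for every pair that met, so no strength collapses to zero.
    for pair in list(pair_games):
        a, b = tuple(pair)
        pair_games[pair] += 2 * prior
        total_wins[a] += prior
        total_wins[b] += prior

    methods = sorted(total_wins)
    strength = {m: 1.0 for m in methods}
    for _ in range(iters):
        updated = {}
        for i in methods:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                n / (strength[i] + strength[j])
                for pair, n in pair_games.items() if i in pair
                for j in pair if j != i
            )
            updated[i] = total_wins[i] / denom
        scale = len(methods) / sum(updated.values())
        updated = {m: v * scale for m, v in updated.items()}
        converged = max(abs(updated[m] - strength[m]) for m in methods) < tol
        strength = updated
        if converged:
            break
    return strength

# Example: A usually beats B, and B usually beats C.
scores = bradley_terry_strengths({("A", "B"): 8, ("B", "A"): 2, ("B", "C"): 7, ("C", "B"): 3})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under the model, the probability that method `i` beats method `j` is `p_i / (p_i + p_j)`, so only the ratios of the fitted strengths matter, not their absolute scale.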

Summary by CodeRabbit

New Features

- Added comprehensive benchmarking infrastructure for evaluating agents with tools and comparing algorithm variants.
- Supports offline oracle evaluation mode and online integration with external inferlets.
- Includes configurable grading strategies, statistical ranking via Bradley–Terry analysis, and confidence intervals.

Tests

- Added test suites for agentic and arena-style benchmarking runners with end-to-end validation.

coderabbitai · 2026-06-23T13:07:43Z

Walkthrough

Two standalone benchmark runners are added under benches/. run_agentic_benchmark.py implements a tool-using agent loop with an isolated workspace, frozen/live web search, objective task grading, Bradley–Terry pairwise ranking, and offline oracle plus Pie-driven execution paths. run_arena_benchmark.py adds an Arena-style runner targeting three Pie inferlet families with proxy grading and family-specific control-flow metric parsers. Both include self-contained test suites and example/supporting data files.

Changes

Agentic Benchmark Runner

| Layer / File(s) | Summary |
| --- | --- |
| Workspace, web search, and agent protocol `benches/run_agentic_benchmark.py` | `Workspace` creates an isolated temp directory with filesystem and subprocess ops. `FrozenWebSearch`, `HttpWebSearch`, and `make_web_backend` provide offline/live search with API key enforcement. `Action`, `parse_action`, `execute_tool`, `render_transcript`, and `run_agent` implement the full ACTION/ARGS protocol with allowlisted tool execution and hallucination/error tracking. |
| Built-in tasks, JSON task spec, and example task file `benches/run_agentic_benchmark.py`, `benches/benchmark_tasks.example.json` | `AgenticTask` subclasses implement setup, oracle trajectories, and objective grading for filesystem write, terminal bug-fix, grep navigation, web research, and CSV counting tasks. `grade_from_spec`, `JsonTask`, and `load_tasks_from_json` support fully JSON-defined tasks. The example JSON file defines two graded tasks (`json-write-settings`, `json-web-fact`). |
| Trajectory metrics, Bradley–Terry ranking, and leaderboard `benches/run_agentic_benchmark.py` | `trajectory_metrics` aggregates error recovery and steerability signals. `pairwise_outcome`, `bradley_terry`, `extended_bradley_terry`, `wilson_interval`, `build_leaderboard`, and `render_leaderboard_md` compute and render ranked method comparisons with component attribution. |
| Offline/Pie execution, output writing, and CLI `benches/run_agentic_benchmark.py` | `run_task_offline` runs the oracle loop end-to-end. `make_pie_step_fn` and `run_task_pie` drive external Pie inferlets step-by-step. `write_results` emits JSONL/CSV/markdown. `run_offline_self_check`, `run_pie`, `run_report`, `make_parser`, and `main` wire the full CLI with mode dispatch. |
| Agentic benchmark test suite `benches/test_agentic_benchmark.py` | Tests cover action parsing, workspace roundtrips, real subprocess execution, end-to-end oracle trajectories for all tasks, trajectory/pairwise/BT/Wilson/leaderboard correctness, `selected_tasks` filtering, web search adapter behavior, spec grading variants, and JSON task loading. |
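The walkthrough names a `wilson_interval` helper without showing it. As background, the Wilson score interval for a success rate can be computed as in the sketch below. The function name `wilson_score_interval` and its signature are assumptions for illustration, not the PR's actual code.

```python
import math

def wilson_score_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if trials == 0:
        return (0.0, 1.0)                      # no data: the rate is unconstrained
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

low, high = wilson_score_interval(8, 10)       # 8 of 10 tasks solved
```

Unlike the plain normal approximation, this interval stays inside [0, 1] and remains informative at 0% or 100% success, which matters when a benchmark has only a handful of tasks.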

Arena Benchmark Runner

| Layer / File(s) | Summary |
| --- | --- |
| Task catalog, prompt builders, proxy grading, and metric parsers `benches/run_arena_benchmark.py` | `TASKS` defines the static benchmark catalog with per-task term constraints. `make_modules`, `make_ha_prompt`, and `make_mcts_prompt` build family-specific prompts. `proxy_grade` uses deterministic keyword detection. `parse_modular_metrics`, `parse_ha_metrics`, and `parse_mcts_metrics` extract control-flow signals from inferlet logs. |
| Family runners: modular-cache, hierarchical-attention, and mcts `benches/run_arena_benchmark.py` | `run_modular_cache` executes a baseline run then three cache-reuse method runs, computing `control_flow_ok` from expected vs. observed cache-hit module counts. `run_hierarchical_attention` and `run_mcts` parameterize mode-specific launches, parse metrics, and set `control_flow_ok` per row. |
| Output, CLI, orchestration, and tests `benches/run_arena_benchmark.py`, `benches/test_arena_benchmark.py` | `write_results`, `print_summary`, and `dry_run_plan` handle output and planning. The async `run` loop authenticates, iterates task/family/language combinations with `finally`-guarded client close, and returns a pass/fail exit code. `make_parser` and `main` wire the CLI. The test suite validates parsers, proxy grading, and dry-run filtering. |

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 13.95% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add agentic arena benchmark' directly and clearly summarizes the main change: adding a complete agentic arena benchmark system with multiple new files and comprehensive evaluation harness. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benches/run_agentic_benchmark.py`:
- Around line 101-105: The run_terminal method executes arbitrary commands
derived from model output without restricting what commands can be executed. To
fix this, implement a command allowlist that validates and restricts the
commands that can be executed to only safe, benchmark-appropriate operations.
Alternatively, implement proper sandboxing using containerization (Docker) or
system-level sandboxing tools (nsjail/firejail) to isolate command execution and
prevent access to the host system's network and files outside the benchmark
workspace. Apply this security fix to all locations where subprocess.run is
called with untrusted model-generated commands, including the run_terminal
method and any other similar command execution points around line 283.
- Around line 891-899: The Pie path is computing task.goal_cached before calling
task.setup(ws), while the offline mode (around lines 842-844) calls setup first.
This ordering inconsistency causes divergent behavior when task goals depend on
setup state. Move the line task.goal_cached = task.goal(ws) to execute after the
task.setup(ws) call to match the offline mode's ordering and ensure setup state
is available before computing the goal.
- Around line 87-96: The write_file, read_file, and exists methods in the
Workspace class join the rel parameter onto the workspace path without
validating that the result stays inside the workspace directory, allowing path
traversal through absolute paths or .. segments. Add path validation in each
method by resolving both the workspace path (self.path.resolve()) and the target
path, then verify that the resolved target is the workspace root or has it among
its parents (a plain string-prefix comparison would also accept sibling
directories whose names merely begin with the workspace name). Raise a
ValueError or similar exception if the target escapes the workspace bounds
before proceeding with the file operation.

In `@benches/run_arena_benchmark.py`:
- Around line 689-708: The auth_by_token() call is executed outside the
try/finally block, which means if authentication fails, the client connection is
never closed in the finally block. Move the conditional check for args.token and
the await client.auth_by_token(args.token) call inside the try block, right
before the loop that iterates through tasks. This ensures that the finally block
will always execute and properly close the client connection regardless of
whether the auth call succeeds or fails.

In `@benches/test_agentic_benchmark.py`:
- Around line 203-205: The test uses assert False to signal an expected failure
when SystemExit is not raised during B.make_web_backend(live), but this
assertion gets removed when Python runs with optimization flag -O, causing the
test to silently pass when it should fail. Replace the assert False statement
with an explicit raise AssertionError("expected SystemExit when the key env var
is missing") to ensure the test properly fails regardless of Python optimization
flags.
- Line 10: The module docstring contains an incorrect command path for running
the benchmark test. Update the docstring instruction at line 10 that currently
shows `python3 tests/test_agentic_benchmark.py` to reflect the correct path
`python3 benches/test_agentic_benchmark.py`, since the file is located in the
benches directory, not the tests directory. This will ensure users can
successfully copy-paste and run the command from the module docstring.

In `@benches/test_arena_benchmark.py`:
- Line 5: The module docstring in test_arena_benchmark.py contains an outdated
direct-run command that incorrectly references the file location as
tests/test_arena_benchmark.py when the file is actually located in the benches/
directory. Update the example command in the module docstring to use the correct
file path that accurately reflects where test_arena_benchmark.py resides in the
project structure.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c5164c2f-1270-4d28-8c78-120d9a87fa30

📥 Commits

Reviewing files that changed from the base of the PR and between 9ba726a and f9981cd.

📒 Files selected for processing (5)

benches/benchmark_tasks.example.json
benches/run_agentic_benchmark.py
benches/run_arena_benchmark.py
benches/test_agentic_benchmark.py
benches/test_arena_benchmark.py

coderabbitai · 2026-06-23T13:14:30Z

+    def write_file(self, rel: str, content: str) -> None:
+        target = self.path / rel
+        target.parent.mkdir(parents=True, exist_ok=True)
+        target.write_text(content, encoding="utf-8")
+
+    def read_file(self, rel: str) -> str:
+        return (self.path / rel).read_text(encoding="utf-8")
+
+    def exists(self, rel: str) -> bool:
+        return (self.path / rel).exists()


🔒 Security & Privacy | 🔴 Critical | ⚡ Quick win

Enforce workspace path containment for file tools.

Line 88 and Line 93 trust rel directly, so absolute paths or .. segments can escape the temp workspace and read/write host files. That breaks the isolation guarantee and is a real sandbox escape.

🔧 Proposed fix

 class Workspace:
     """An isolated temp dir the agent acts on, with real file + terminal access."""
@@
+    def _resolve_inside_workspace(self, rel: str) -> Path:
+        root = self.path.resolve()
+        target = (self.path / rel).resolve()
+        if target != root and root not in target.parents:
+            raise ValueError(f"path escapes workspace: {rel}")
+        return target
+
     def write_file(self, rel: str, content: str) -> None:
-        target = self.path / rel
+        target = self._resolve_inside_workspace(rel)
         target.parent.mkdir(parents=True, exist_ok=True)
         target.write_text(content, encoding="utf-8")

     def read_file(self, rel: str) -> str:
-        return (self.path / rel).read_text(encoding="utf-8")
+        return self._resolve_inside_workspace(rel).read_text(encoding="utf-8")

     def exists(self, rel: str) -> bool:
-        return (self.path / rel).exists()
+        try:
+            return self._resolve_inside_workspace(rel).exists()
+        except ValueError:
+            return False
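To see why the containment check compares against `target.parents` rather than string prefixes, the standalone sketch below contrasts the two. The helper names `resolve_inside` and `naive_prefix_check` are hypothetical, and the directory names are made up for the demonstration.

```python
import tempfile
from pathlib import Path

def resolve_inside(root: Path, rel: str) -> Path:
    """Resolve rel under root, refusing anything that lands outside root."""
    root = root.resolve()
    target = (root / rel).resolve()            # collapses ".." and follows symlinks
    if target != root and root not in target.parents:
        raise ValueError(f"path escapes workspace: {rel}")
    return target

def naive_prefix_check(root: Path, rel: str) -> bool:
    """String-prefix comparison, shown only to illustrate its blind spot."""
    return str((root / rel).resolve()).startswith(str(root.resolve()))

with tempfile.TemporaryDirectory() as base:
    ws = Path(base) / "ws"
    ws.mkdir()
    inside = resolve_inside(ws, "notes/a.txt")                 # accepted
    fooled = naive_prefix_check(ws, "../ws-evil/secret.txt")   # True: ".../ws-evil" starts with ".../ws"
    rejected = []
    for attempt in ("../ws-evil/secret.txt", "a/../../x", str(Path(base).resolve() / "outside")):
        try:
            resolve_inside(ws, attempt)
        except ValueError:
            rejected.append(attempt)
```

Joining an absolute path onto `root` discards `root` entirely in `pathlib`, which is why absolute inputs need the same check as `..` segments. Resolving first narrows symlink tricks but does not remove the race between the check and the later file operation.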

coderabbitai · 2026-06-23T13:14:30Z

+    def run_terminal(self, command: list[str], timeout: int = 30) -> dict[str, Any]:
+        start = time.time()
+        try:
+            proc = subprocess.run(command, cwd=self.path, capture_output=True,
+                                  text=True, timeout=timeout)


🔒 Security & Privacy | 🟠 Major | 🏗️ Heavy lift

run_terminal currently executes untrusted model commands on the host.

Line 101 and Line 283 execute arbitrary commands derived from model output. Even with shell=False, this is still direct command execution and can access network/files outside the benchmark workspace. Consider a hard sandbox boundary (container/nsjail/firejail) or a strict command allowlist for benchmark tasks.

Also applies to: 283-285
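A strict allowlist of the kind suggested here might look like the following sketch. The set of permitted executables and the names `check_command` and `run_allowlisted` are assumptions for illustration. An allowlist narrows the attack surface but is not a sandbox: an allowed interpreter such as `python3` can still run arbitrary code.

```python
import subprocess
from pathlib import Path

# Hypothetical allowlist; a real benchmark would derive it from its task set.
ALLOWED_EXECUTABLES = frozenset({"python3", "grep", "wc", "ls", "cat"})

def check_command(command: list[str], allowed=ALLOWED_EXECUTABLES) -> None:
    """Raise PermissionError unless command[0] is a bare, allowlisted name."""
    if not command:
        raise PermissionError("empty command")
    exe = command[0]
    if Path(exe).name != exe:                  # rejects ./tool, /bin/sh, ../x
        raise PermissionError(f"executable must be a bare name: {exe!r}")
    if exe not in allowed:
        raise PermissionError(f"executable not allowlisted: {exe!r}")

def run_allowlisted(command: list[str], cwd: Path, timeout: int = 30) -> dict:
    """Validate, then run without a shell so arguments are never re-parsed."""
    check_command(command)
    proc = subprocess.run(command, cwd=cwd, capture_output=True,
                          text=True, timeout=timeout)
    return {"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}
```

Validation happens before `subprocess.run`, so a rejected command never starts a process. Network and filesystem isolation would still need a container or a tool such as nsjail or firejail.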

🧰 Tools

🪛 ast-grep (0.44.0)

[error] 103-104: Use of unsanitized data to create processes
Context: subprocess.run(command, cwd=self.path, capture_output=True,
text=True, timeout=timeout)
Note: [CWE-78] Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection').

(os-system-unsanitized-data)

[error] 103-104: Command coming from incoming request
Context: subprocess.run(command, cwd=self.path, capture_output=True,
text=True, timeout=timeout)
Note: [CWE-78] Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection').

(subprocess-from-request)

🪛 Ruff (0.15.18)

[error] 104-104: subprocess call: check for execution of untrusted input

(S603)

coderabbitai · 2026-06-23T13:14:30Z

+    task.goal_cached = task.goal(ws)  # stable text for modular-cache modules
+    astep = make_pie_step_fn(run_inferlet_fn, client, family, language, task, mode, args)
+    history: list[dict[str, Any]] = []
+    counts = {"tool_calls": 0, "tool_errors": 0, "tool_hallucinations": 0, "invalid": 0,
+              "had_error": False, "recovered": False}
+    final_answer, raw_parts = "", []
+    try:
+        task.setup(ws)
+        start = time.time()


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Fix setup/goal ordering inconsistency in Pie path.

Line 891 computes task.goal_cached before Line 898 task.setup(ws), while offline mode does setup first (Line 842-844). Tasks whose goals depend on setup state will diverge between offline and Pie runs.

🔧 Proposed fix

 async def run_task_pie(run_inferlet_fn, client, family, language, task, mode, args, web=None) -> dict[str, Any]:
     ws = Workspace()
     web = web or FrozenWebSearch()
-    task.goal_cached = task.goal(ws)  # stable text for modular-cache modules
     astep = make_pie_step_fn(run_inferlet_fn, client, family, language, task, mode, args)
@@
     try:
         task.setup(ws)
+        task.goal_cached = task.goal(ws)  # stable text for modular-cache modules
         start = time.time()
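A toy reproduction makes the ordering issue concrete. `ToyTask` below is a hypothetical task whose goal text depends on a file that `setup` creates. It is not one of the PR's built-in tasks.

```python
import tempfile
from pathlib import Path

class ToyTask:
    """Hypothetical task: the goal mentions how many rows setup() wrote."""

    def setup(self, ws: Path) -> None:
        (ws / "data.csv").write_text("id\n1\n2\n3\n", encoding="utf-8")

    def goal(self, ws: Path) -> str:
        data = ws / "data.csv"
        rows = len(data.read_text(encoding="utf-8").splitlines()) - 1 if data.exists() else 0
        return f"Count the {rows} rows in data.csv"

def goal_then_setup(task: ToyTask, ws: Path) -> str:
    goal = task.goal(ws)       # the ordering flagged in the Pie path
    task.setup(ws)
    return goal

def setup_then_goal(task: ToyTask, ws: Path) -> str:
    task.setup(ws)             # the ordering used by the offline path
    return task.goal(ws)

with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    early = goal_then_setup(ToyTask(), Path(a))
    late = setup_then_goal(ToyTask(), Path(b))
```

Here `early` describes an empty workspace while `late` describes the populated one, so the same task would present different prompts depending on the execution path.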

coderabbitai · 2026-06-23T13:14:31Z

+    client = PieClient(args.url)
+    await client.connect()
+    if args.token and args.token.lower() != "none":
+        await client.auth_by_token(args.token)
+
+    rows: list[dict[str, Any]] = []
+    try:
+        for task in tasks:
+            print(f"\n== {task['id']} ({task['category']}) ==")
+            for language in languages:
+                for family in families:
+                    print(f"  {family}-{language}", flush=True)
+                    if family == "modular-cache":
+                        rows.extend(await run_modular_cache(run_inferlet_fn, client, language, task, args))
+                    elif family == "hierarchical-attention":
+                        rows.extend(await run_hierarchical_attention(run_inferlet_fn, client, language, task, args))
+                    elif family == "mcts":
+                        rows.extend(await run_mcts(run_inferlet_fn, client, language, task, args))
+    finally:
+        await client.close()


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Close the client if auth_by_token fails.

connect() and auth_by_token() run before the try/finally, so an auth failure skips client.close() and leaks the connection. Move the auth call inside the try so cleanup always runs after a successful connect().

🔒 Proposed fix

     await client.connect()
-    if args.token and args.token.lower() != "none":
-        await client.auth_by_token(args.token)
-
     rows: list[dict[str, Any]] = []
     try:
+        if args.token and args.token.lower() != "none":
+            await client.auth_by_token(args.token)
         for task in tasks:
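The effect of moving the auth call inside the `try` can be checked with a stand-in client. `FakeClient` below only mimics the three method names used in the snippet. It is an assumption for illustration and says nothing about the real `PieClient` API.

```python
import asyncio

class FakeClient:
    """Stand-in client that records whether close() ran."""

    def __init__(self, fail_auth: bool) -> None:
        self.fail_auth = fail_auth
        self.closed = False

    async def connect(self) -> None:
        pass

    async def auth_by_token(self, token: str) -> None:
        if self.fail_auth:
            raise PermissionError("bad token")

    async def close(self) -> None:
        self.closed = True

async def run(client: FakeClient, token: str) -> str:
    await client.connect()
    try:
        # Auth sits inside the try, so a failure here still reaches finally.
        if token and token.lower() != "none":
            await client.auth_by_token(token)
        return "ok"
    finally:
        await client.close()

client = FakeClient(fail_auth=True)
try:
    asyncio.run(run(client, "secret"))
except PermissionError:
    pass
```

After the failed run, `client.closed` is `True`. With the auth call placed before the `try`, the exception would propagate without `close()` ever being awaited.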

coderabbitai · 2026-06-23T13:14:31Z

+
+Run directly:
+
+    python3 tests/test_agentic_benchmark.py


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Fix the direct-run command in the module docstring.

Line 10 points to tests/test_agentic_benchmark.py, but this file lives under benches/; copy-pasting the current command from repo root will fail.

Suggested patch

-    python3 tests/test_agentic_benchmark.py
+    python3 benches/test_agentic_benchmark.py

coderabbitai · 2026-06-23T13:14:31Z

+        B.make_web_backend(live)
+        assert False, "expected SystemExit when the key env var is missing"
+    except SystemExit:


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Replace assert False in the expected-failure branch.

Line 204 uses assert False, which is removed with python -O; the test can silently pass if SystemExit is not raised. Raise AssertionError explicitly instead.

Suggested patch

     try:
         B.make_web_backend(live)
-        assert False, "expected SystemExit when the key env var is missing"
+        raise AssertionError("expected SystemExit when the key env var is missing")
     except SystemExit:
         pass
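For test files that run as plain scripts, the same pattern can be factored into a small helper so the failure branch never relies on an `assert` statement. The helper name `expect_system_exit` is hypothetical and not part of the PR.

```python
def expect_system_exit(fn, *args, **kwargs) -> None:
    """Call fn and fail loudly unless it raises SystemExit.

    The explicit raise survives `python -O`, which strips assert statements.
    """
    try:
        fn(*args, **kwargs)
    except SystemExit:
        return
    raise AssertionError(f"expected SystemExit from {getattr(fn, '__name__', fn)!r}")

def exits() -> None:
    raise SystemExit(2)

def returns_normally() -> None:
    pass

expect_system_exit(exits)      # passes silently
```

Placing the `raise` after the `try` block, rather than inside it, also means it cannot be swallowed by the `except` clause if that clause is later broadened.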

🧰 Tools

🪛 Ruff (0.15.18)

[warning] 204-204: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

Source: Linters/SAST tools

coderabbitai · 2026-06-23T13:14:31Z

+
+Run directly:
+
+    python3 tests/test_arena_benchmark.py


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Fix stale direct-run path in module docstring.

The command points to tests/test_arena_benchmark.py, but this file lives under benches/, so the documented command is misleading.

Suggested fix

-    python3 tests/test_arena_benchmark.py
+    python3 benches/test_arena_benchmark.py

test: add agentic arena benchmark

f9981cd

agmankaruse wants to merge 1 commit into pie-project:main from agmankaruse:inferlet-arena-tester
