Add agentic arena benchmark #447

Open
agmankaruse wants to merge 1 commit into pie-project:main from agmankaruse:inferlet-arena-tester

agmankaruse (Contributor) commented Jun 23, 2026

Summary

This PR adds a realistic agentic benchmark for evaluating Pie inferlets, especially the test-time-scaling examples: modular-cache, hierarchical-attention, and MCTS.

The benchmark is a local, reproducible stand-in for Agent Arena-style evaluation. Agent Arena itself is live and human-voted, so it cannot be reproduced directly from the repo. Instead, this benchmark gives Pie a concrete local harness with real tools, objective grading, pairwise scoring, and baseline-vs-method comparisons.

What Changed

Added:

  • benches/run_agentic_benchmark.py
  • benches/run_arena_benchmark.py
  • benches/test_agentic_benchmark.py
  • benches/test_arena_benchmark.py
  • benches/benchmark_tasks.example.json

The main benchmark runs a ReAct-style agent loop. On each step, the model emits either a tool call:

ACTION: <tool>
ARGS: <json>

or a final answer:

FINAL: <answer>

The harness executes the tool, records the observation, adds it back into the context, and continues until the task is complete or the step budget is reached.
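The turn format above can be parsed in a few lines. The following is a minimal sketch, not the PR's `parse_action`: the names `ParsedStep` and `parse_step` are hypothetical, and it assumes each marker sits on its own line and that `ARGS` holds a JSON object.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ParsedStep:
    """One parsed model turn: a tool call, a final answer, or invalid output."""
    kind: str                                   # "action", "final", or "invalid"
    tool: Optional[str] = None
    args: dict[str, Any] = field(default_factory=dict)
    answer: Optional[str] = None


def parse_step(text: str) -> ParsedStep:
    """Parse one ACTION/ARGS or FINAL turn (hypothetical helper)."""
    tool = None
    args_raw = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("FINAL:"):
            return ParsedStep(kind="final", answer=line[len("FINAL:"):].strip())
        if line.startswith("ACTION:"):
            tool = line[len("ACTION:"):].strip()
        elif line.startswith("ARGS:"):
            args_raw = line[len("ARGS:"):].strip()
    if not tool:
        return ParsedStep(kind="invalid")
    try:
        args = json.loads(args_raw) if args_raw else {}
    except json.JSONDecodeError:
        return ParsedStep(kind="invalid")
    if not isinstance(args, dict):
        # ARGS must decode to a JSON object, not a list or scalar.
        return ParsedStep(kind="invalid")
    return ParsedStep(kind="action", tool=tool, args=args)
```

Returning an explicit "invalid" result instead of raising lets a harness count malformed turns and keep the loop running until the step budget is reached.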

Tools

The benchmark includes three agent tool categories:
  • filesystem: real reads/writes in an isolated temporary workspace
  • terminal: real subprocess execution with real exit codes
  • web_search: deterministic local fake web by default, with optional live search support
The local fake web keeps default runs reproducible. A live backend can be enabled separately with an API key.
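To illustrate why a local fake web keeps runs reproducible, here is a minimal sketch of a deterministic search over a fixed corpus. It is not the PR's `FrozenWebSearch`: the `Page` and `TinyFrozenSearch` names, the keyword-overlap scoring, and the result fields are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    title: str
    text: str


class TinyFrozenSearch:
    """Deterministic search over a fixed in-memory corpus (illustrative only)."""

    def __init__(self, pages: list[Page]) -> None:
        self.pages = list(pages)

    def search(self, query: str, k: int = 3) -> list[dict[str, str]]:
        terms = set(query.lower().split())
        scored = []
        for page in self.pages:
            words = set((page.title + " " + page.text).lower().split())
            score = len(terms & words)          # number of query terms found
            if score:
                scored.append((score, page))
        # Sort by score descending, then URL, so ties break the same way every run.
        scored.sort(key=lambda item: (-item[0], item[1].url))
        return [
            {"url": p.url, "title": p.title, "snippet": p.text[:80]}
            for _, p in scored[:k]
        ]
```

Because the corpus is fixed and ties are broken by URL, the same query returns the same ranked results on every run, which is what makes grading against exact answers possible.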

Tasks And Grading

The built-in tasks cover file workflows, coding/debugging, repo navigation, web research, and data processing. They are graded objectively using concrete ground truth: parsed JSON files, terminal exit codes, grep/wc output, or exact answer checks.
The benchmark also supports loading additional tasks from JSON with --tasks-file; benches/benchmark_tasks.example.json shows the format.

Scoring

The benchmark reports objective metrics such as task success, steps, tool calls, tool errors, tool hallucinations, error recovery and steerability.
It also builds an Arena-style leaderboard using pairwise Bradley-Terry scoring over shared task outcomes, plus component attribution over method, mode and language features.
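As a rough illustration of pairwise Bradley-Terry scoring, the sketch below fits strengths from (winner, loser) pairs with the standard minorization-maximization update. It is not the PR's `bradley_terry`: the function name, the `prior` smoothing, and the decision to ignore ties are assumptions.

```python
from collections import defaultdict


def bradley_terry_sketch(outcomes, iters=200, prior=0.5):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict of strengths that sum to 1. `prior` adds that many virtual
    wins in each direction for every pair, which keeps strengths finite when
    one method never loses (an assumption of this sketch).
    """
    wins = defaultdict(float)      # total wins per method
    games = defaultdict(float)     # comparisons per unordered pair
    methods = set()
    for winner, loser in outcomes:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
        methods.update((winner, loser))
    methods = sorted(methods)
    for i, a in enumerate(methods):
        for b in methods[i + 1:]:
            wins[a] += prior
            wins[b] += prior
            games[frozenset((a, b))] += 2 * prior

    p = {m: 1.0 / len(methods) for m in methods}
    for _ in range(iters):
        new = {}
        for i in methods:
            denom = sum(
                games[frozenset((i, j))] / (p[i] + p[j])
                for j in methods if j != i
            )
            new[i] = wins[i] / denom if denom else p[i]
        total = sum(new.values())
        p = {m: v / total for m, v in new.items()}   # renormalize each round
    return p
```

With two methods the fit reduces to the smoothed win fraction: three wins and one loss with `prior=0.5` gives 3.5 / 5 = 0.7 for the stronger method.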

Summary by CodeRabbit

  • New Features

    • Added comprehensive benchmarking infrastructure for evaluating agents with tools and comparing algorithm variants.
    • Supports offline oracle evaluation mode and online integration with external inferlets.
    • Includes configurable grading strategies, statistical ranking via Bradley–Terry analysis, and confidence intervals.
  • Tests

    • Added test suites for agentic and arena-style benchmarking runners with end-to-end validation.
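For the confidence intervals mentioned above, the Wilson score interval is a common choice for success rates over a small number of tasks. The sketch below uses the textbook formula; the function name and the `z = 1.96` default (roughly 95% coverage) are assumptions, not the PR's `wilson_interval` signature.

```python
import math


def wilson_interval_sketch(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (illustrative helper)."""
    if n == 0:
        return (0.0, 1.0)                      # no data: the whole range is plausible
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))
```

Unlike the naive `p ± z * sqrt(p(1-p)/n)` interval, this one stays inside [0, 1] and does not collapse to zero width when every task passes or fails.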

coderabbitai Bot commented Jun 23, 2026

Walkthrough

Two standalone benchmark runners are added under benches/. run_agentic_benchmark.py implements a tool-using agent loop with an isolated workspace, frozen/live web search, objective task grading, Bradley–Terry pairwise ranking, and offline oracle plus Pie-driven execution paths. run_arena_benchmark.py adds an Arena-style runner targeting three Pie inferlet families with proxy grading and family-specific control-flow metric parsers. Both include self-contained test suites and example/supporting data files.

Changes

Agentic Benchmark Runner

| Layer | File(s) | Summary |
|---|---|---|
| Workspace, web search, and agent protocol | benches/run_agentic_benchmark.py | Workspace creates an isolated temp directory with filesystem and subprocess ops. FrozenWebSearch, HttpWebSearch, and make_web_backend provide offline/live search with API key enforcement. Action, parse_action, execute_tool, render_transcript, and run_agent implement the full ACTION/ARGS protocol with allowlisted tool execution and hallucination/error tracking. |
| Built-in tasks, JSON task spec, and example task file | benches/run_agentic_benchmark.py, benches/benchmark_tasks.example.json | AgenticTask subclasses implement setup, oracle trajectories, and objective grading for filesystem write, terminal bug-fix, grep navigation, web research, and CSV counting tasks. grade_from_spec, JsonTask, and load_tasks_from_json support fully JSON-defined tasks. The example JSON file defines two graded tasks (json-write-settings, json-web-fact). |
| Trajectory metrics, Bradley–Terry ranking, and leaderboard | benches/run_agentic_benchmark.py | trajectory_metrics aggregates error recovery and steerability signals. pairwise_outcome, bradley_terry, extended_bradley_terry, wilson_interval, build_leaderboard, and render_leaderboard_md compute and render ranked method comparisons with component attribution. |
| Offline/Pie execution, output writing, and CLI | benches/run_agentic_benchmark.py | run_task_offline runs the oracle loop end-to-end. make_pie_step_fn and run_task_pie drive external Pie inferlets step-by-step. write_results emits JSONL/CSV/markdown. run_offline_self_check, run_pie, run_report, make_parser, and main wire the full CLI with mode dispatch. |
| Agentic benchmark test suite | benches/test_agentic_benchmark.py | Tests cover action parsing, workspace roundtrips, real subprocess execution, end-to-end oracle trajectories for all tasks, trajectory/pairwise/BT/Wilson/leaderboard correctness, selected_tasks filtering, web search adapter behavior, spec grading variants, and JSON task loading. |

Arena Benchmark Runner

| Layer | File(s) | Summary |
|---|---|---|
| Task catalog, prompt builders, proxy grading, and metric parsers | benches/run_arena_benchmark.py | TASKS defines the static benchmark catalog with per-task term constraints. make_modules, make_ha_prompt, and make_mcts_prompt build family-specific prompts. proxy_grade uses deterministic keyword detection. parse_modular_metrics, parse_ha_metrics, and parse_mcts_metrics extract control-flow signals from inferlet logs. |
| Family runners: modular-cache, hierarchical-attention, and mcts | benches/run_arena_benchmark.py | run_modular_cache executes a baseline run then three cache-reuse method runs, computing control_flow_ok from expected vs. observed cache-hit module counts. run_hierarchical_attention and run_mcts parameterize mode-specific launches, parse metrics, and set control_flow_ok per row. |
| Output, CLI, orchestration, and tests | benches/run_arena_benchmark.py, benches/test_arena_benchmark.py | write_results, print_summary, and dry_run_plan handle output and planning. The async run loop authenticates, iterates task/family/language combinations with finally-guarded client close, and returns a pass/fail exit code. make_parser and main wire the CLI. The test suite validates parsers, proxy grading, and dry-run filtering. |
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 13.95% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add agentic arena benchmark' directly and clearly summarizes the main change: adding a complete agentic arena benchmark system with multiple new files and comprehensive evaluation harness. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benches/run_agentic_benchmark.py`:
- Around line 101-105: The run_terminal method executes arbitrary commands
derived from model output without restricting what commands can be executed. To
fix this, implement a command allowlist that validates and restricts the
commands that can be executed to only safe, benchmark-appropriate operations.
Alternatively, implement proper sandboxing using containerization (Docker) or
system-level sandboxing tools (nsjail/firejail) to isolate command execution and
prevent access to the host system's network and files outside the benchmark
workspace. Apply this security fix to all locations where subprocess.run is
called with untrusted model-generated commands, including the run_terminal
method and any other similar command execution points around line 283.
- Around line 891-899: The Pie path is computing task.goal_cached before calling
task.setup(ws), while the offline mode (around lines 842-844) calls setup first.
This ordering inconsistency causes divergent behavior when task goals depend on
setup state. Move the line task.goal_cached = task.goal(ws) to execute after the
task.setup(ws) call to match the offline mode's ordering and ensure setup state
is available before computing the goal.
- Around line 87-96: The write_file, read_file, and exists methods in the
Workspace class join the rel parameter onto the workspace path without validating
that the result stays within the workspace directory, allowing path traversal
using absolute paths or .. segments. Add path validation in each method by
resolving both the workspace path (self.path.resolve()) and the target path,
then verify that the resolved target is the workspace root or has the root among
its parents (a plain string-prefix check would also accept sibling directories
that share the prefix). Raise a ValueError or similar exception if the target
escapes the workspace bounds before proceeding with the file operation.

In `@benches/run_arena_benchmark.py`:
- Around line 689-708: The auth_by_token() call is executed outside the
try/finally block, which means if authentication fails, the client connection is
never closed in the finally block. Move the conditional check for args.token and
the await client.auth_by_token(args.token) call inside the try block, right
before the loop that iterates through tasks. This ensures that the finally block
will always execute and properly close the client connection regardless of
whether the auth call succeeds or fails.

In `@benches/test_agentic_benchmark.py`:
- Around line 203-205: The test uses assert False to signal an expected failure
when SystemExit is not raised during B.make_web_backend(live), but this
assertion gets removed when Python runs with optimization flag -O, causing the
test to silently pass when it should fail. Replace the assert False statement
with an explicit raise AssertionError("expected SystemExit when the key env var
is missing") to ensure the test properly fails regardless of Python optimization
flags.
- Line 10: The module docstring contains an incorrect command path for running
the benchmark test. Update the docstring instruction at line 10 that currently
shows `python3 tests/test_agentic_benchmark.py` to reflect the correct path
`python3 benches/test_agentic_benchmark.py`, since the file is located in the
benches directory, not the tests directory. This will ensure users can
successfully copy-paste and run the command from the module docstring.

In `@benches/test_arena_benchmark.py`:
- Line 5: The module docstring in test_arena_benchmark.py contains an outdated
direct-run command that incorrectly references the file location as
tests/test_arena_benchmark.py when the file is actually located in the benches/
directory. Update the example command in the module docstring to use the correct
file path that accurately reflects where test_arena_benchmark.py resides in the
project structure.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c5164c2f-1270-4d28-8c78-120d9a87fa30

📥 Commits

Reviewing files that changed from the base of the PR and between 9ba726a and f9981cd.

📒 Files selected for processing (5)
  • benches/benchmark_tasks.example.json
  • benches/run_agentic_benchmark.py
  • benches/run_arena_benchmark.py
  • benches/test_agentic_benchmark.py
  • benches/test_arena_benchmark.py

Comment on lines +87 to +96
    def write_file(self, rel: str, content: str) -> None:
        target = self.path / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")

    def read_file(self, rel: str) -> str:
        return (self.path / rel).read_text(encoding="utf-8")

    def exists(self, rel: str) -> bool:
        return (self.path / rel).exists()

🔒 Security & Privacy | 🔴 Critical | ⚡ Quick win

Enforce workspace path containment for file tools.

Line 88 and Line 93 join rel onto the workspace path without validation (as does exists on Line 96), so absolute paths or .. segments can escape the temp workspace and read or write host files. That breaks the isolation guarantee and is a real sandbox escape.

🔧 Proposed fix
 class Workspace:
     """An isolated temp dir the agent acts on, with real file + terminal access."""
@@
+    def _resolve_inside_workspace(self, rel: str) -> Path:
+        root = self.path.resolve()
+        target = (self.path / rel).resolve()
+        if target != root and root not in target.parents:
+            raise ValueError(f"path escapes workspace: {rel}")
+        return target
+
     def write_file(self, rel: str, content: str) -> None:
-        target = self.path / rel
+        target = self._resolve_inside_workspace(rel)
         target.parent.mkdir(parents=True, exist_ok=True)
         target.write_text(content, encoding="utf-8")
 
     def read_file(self, rel: str) -> str:
-        return (self.path / rel).read_text(encoding="utf-8")
+        return self._resolve_inside_workspace(rel).read_text(encoding="utf-8")
 
     def exists(self, rel: str) -> bool:
-        return (self.path / rel).exists()
+        try:
+            return self._resolve_inside_workspace(rel).exists()
+        except ValueError:
+            return False
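
The containment check in the proposed fix can be exercised on its own. The sketch below lifts the same logic into a free function; `resolve_inside` and `escapes` are hypothetical names used only for this demonstration.

```python
from pathlib import Path


def resolve_inside(root: Path, rel: str) -> Path:
    """Return root/rel fully resolved, or raise ValueError if it leaves root."""
    root = root.resolve()
    # Joining an absolute `rel` discards `root`, so the check below still catches it.
    target = (root / rel).resolve()
    if target != root and root not in target.parents:
        raise ValueError(f"path escapes workspace: {rel}")
    return target


def escapes(root: Path, rel: str) -> bool:
    """True when `rel` would resolve outside `root`."""
    try:
        resolve_inside(root, rel)
        return False
    except ValueError:
        return True
```

Comparing against `target.parents` rather than a string prefix matters: for a workspace at `/tmp/ws`, the sibling path `/tmp/ws-evil/x` starts with the same characters but is correctly rejected here.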

Comment on lines +101 to +105
    def run_terminal(self, command: list[str], timeout: int = 30) -> dict[str, Any]:
        start = time.time()
        try:
            proc = subprocess.run(command, cwd=self.path, capture_output=True,
                                  text=True, timeout=timeout)

🔒 Security & Privacy | 🟠 Major | 🏗️ Heavy lift

run_terminal currently executes untrusted model commands on the host.

Line 101 and Line 283 execute arbitrary commands derived from model output. Even with shell=False, this is still direct command execution and can access network/files outside the benchmark workspace. Consider a hard sandbox boundary (container/nsjail/firejail) or a strict command allowlist for benchmark tasks.

Also applies to: 283-285
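
A strict command allowlist, one of the two options named above, might look like the following sketch. The allowed set, `check_command`, and `run_allowlisted` are assumptions for illustration and do not come from the PR. Note that an allowlist only narrows what can run: it does not check arguments, and allowing an interpreter such as `python3` would defeat it, so a container or nsjail/firejail boundary is still the stronger control.

```python
import subprocess
from pathlib import Path

# Assumed set of benchmark-appropriate executables; adjust per task suite.
ALLOWED_COMMANDS = frozenset({"grep", "wc", "cat", "ls"})


def check_command(command: list[str]) -> None:
    """Raise PermissionError unless command[0] is a bare, allowlisted name."""
    if not command:
        raise PermissionError("empty command")
    exe = command[0]
    # Reject anything with a directory component so "/bin/grep" or "../grep"
    # cannot be used to pick an arbitrary binary.
    if Path(exe).name != exe or exe not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowed: {exe!r}")


def run_allowlisted(command: list[str], cwd: Path, timeout: int = 30) -> subprocess.CompletedProcess:
    """Validate the command, then run it without a shell inside `cwd`."""
    check_command(command)
    return subprocess.run(command, cwd=cwd, capture_output=True,
                          text=True, timeout=timeout)
```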

🧰 Tools
🪛 ast-grep (0.44.0)

[error] 103-104: Use of unsanitized data to create processes
Context: subprocess.run(command, cwd=self.path, capture_output=True,
text=True, timeout=timeout)
Note: [CWE-78] Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection').

(os-system-unsanitized-data)


[error] 103-104: Command coming from incoming request
Context: subprocess.run(command, cwd=self.path, capture_output=True,
text=True, timeout=timeout)
Note: [CWE-78] Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection').

(subprocess-from-request)

🪛 Ruff (0.15.18)

[error] 104-104: subprocess call: check for execution of untrusted input

(S603)


Comment on lines +891 to +899
    task.goal_cached = task.goal(ws)  # stable text for modular-cache modules
    astep = make_pie_step_fn(run_inferlet_fn, client, family, language, task, mode, args)
    history: list[dict[str, Any]] = []
    counts = {"tool_calls": 0, "tool_errors": 0, "tool_hallucinations": 0, "invalid": 0,
              "had_error": False, "recovered": False}
    final_answer, raw_parts = "", []
    try:
        task.setup(ws)
        start = time.time()

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Fix setup/goal ordering inconsistency in Pie path.

Line 891 computes task.goal_cached before task.setup(ws) runs on Line 898, while offline mode does setup first (Lines 842-844). Tasks whose goals depend on setup state will diverge between offline and Pie runs.

🔧 Proposed fix
 async def run_task_pie(run_inferlet_fn, client, family, language, task, mode, args, web=None) -> dict[str, Any]:
     ws = Workspace()
     web = web or FrozenWebSearch()
-    task.goal_cached = task.goal(ws)  # stable text for modular-cache modules
     astep = make_pie_step_fn(run_inferlet_fn, client, family, language, task, mode, args)
@@
     try:
         task.setup(ws)
+        task.goal_cached = task.goal(ws)  # stable text for modular-cache modules
         start = time.time()
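
To see why the ordering matters, consider a toy task whose goal text is derived from files that `setup` creates. `DemoTask` and the dict-backed workspace below are hypothetical stand-ins, not classes from the PR.

```python
class DemoTask:
    """Hypothetical task whose goal text depends on files created by setup()."""

    def setup(self, ws: dict[str, str]) -> None:
        # Stand-in for writing fixture files into the workspace.
        ws["data.csv"] = "id\n1\n2\n"

    def goal(self, ws: dict[str, str]) -> str:
        names = ", ".join(sorted(ws)) or "<no files>"
        return f"Count the rows in: {names}"


ws: dict[str, str] = {}
task = DemoTask()
goal_before_setup = task.goal(ws)   # what the Pie path would cache today
task.setup(ws)
goal_after_setup = task.goal(ws)    # what the offline path sees
```

Here `goal_before_setup` names no files while `goal_after_setup` names `data.csv`, so the two execution paths would prompt the model with different goals for the same task.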

Comment on lines +689 to +708
    client = PieClient(args.url)
    await client.connect()
    if args.token and args.token.lower() != "none":
        await client.auth_by_token(args.token)

    rows: list[dict[str, Any]] = []
    try:
        for task in tasks:
            print(f"\n== {task['id']} ({task['category']}) ==")
            for language in languages:
                for family in families:
                    print(f" {family}-{language}", flush=True)
                    if family == "modular-cache":
                        rows.extend(await run_modular_cache(run_inferlet_fn, client, language, task, args))
                    elif family == "hierarchical-attention":
                        rows.extend(await run_hierarchical_attention(run_inferlet_fn, client, language, task, args))
                    elif family == "mcts":
                        rows.extend(await run_mcts(run_inferlet_fn, client, language, task, args))
    finally:
        await client.close()

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Close the client if auth_by_token fails.

connect() and auth_by_token() run before the try/finally, so an auth failure skips client.close() and leaks the connection. Move the auth call inside the try so cleanup always runs after a successful connect().

🔒 Proposed fix
     await client.connect()
-    if args.token and args.token.lower() != "none":
-        await client.auth_by_token(args.token)
-
     rows: list[dict[str, Any]] = []
     try:
+        if args.token and args.token.lower() != "none":
+            await client.auth_by_token(args.token)
         for task in tasks:
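
The effect of moving the auth call inside the `try` can be checked with a stand-in client. `FakeClient` and `run_with_cleanup` below are hypothetical; only the method names `connect`, `auth_by_token`, and `close` mirror the calls shown in the excerpt.

```python
import asyncio


class FakeClient:
    """Stand-in for the real client; records whether close() was awaited."""

    def __init__(self, fail_auth: bool) -> None:
        self.fail_auth = fail_auth
        self.closed = False

    async def connect(self) -> None:
        pass

    async def auth_by_token(self, token: str) -> None:
        if self.fail_auth:
            raise PermissionError("bad token")

    async def close(self) -> None:
        self.closed = True


async def run_with_cleanup(client: FakeClient, token: str) -> str:
    await client.connect()
    try:
        # Auth now sits inside the try, so a failure still reaches finally.
        if token and token.lower() != "none":
            await client.auth_by_token(token)
        return "ran"
    finally:
        await client.close()
```

An `async with` context manager would give the same guarantee if the client supports one; that is not shown in the PR, so the explicit `try`/`finally` is kept here.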


Run directly:

python3 tests/test_agentic_benchmark.py

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Fix the direct-run command in the module docstring.

Line 10 points to tests/test_agentic_benchmark.py, but this file lives under benches/; copy-pasting the current command from repo root will fail.

Suggested patch
-    python3 tests/test_agentic_benchmark.py
+    python3 benches/test_agentic_benchmark.py

Comment on lines +203 to +205
        B.make_web_backend(live)
        assert False, "expected SystemExit when the key env var is missing"
    except SystemExit:

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Replace assert False in the expected-failure branch.

Line 204 uses assert False, which is removed with python -O; the test can silently pass if SystemExit is not raised. Raise AssertionError explicitly instead.

Suggested patch
     try:
         B.make_web_backend(live)
-        assert False, "expected SystemExit when the key env var is missing"
+        raise AssertionError("expected SystemExit when the key env var is missing")
     except SystemExit:
         pass
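
The same pattern can be factored into a small helper so each expected-exit test stays on one line and never relies on `assert`. `expect_system_exit` is a hypothetical name; the test file in this PR does not necessarily define it.

```python
import sys


def expect_system_exit(fn, *args, **kwargs) -> None:
    """Call fn and raise AssertionError unless it exits via SystemExit."""
    try:
        fn(*args, **kwargs)
    except SystemExit:
        return
    # An explicit raise is not stripped under `python -O`, unlike `assert False`.
    raise AssertionError(f"expected SystemExit from {getattr(fn, '__name__', fn)!r}")


# Passes silently: sys.exit raises SystemExit.
expect_system_exit(sys.exit, 1)
```

If the suite is later run under a framework, `unittest.TestCase.assertRaises` or `pytest.raises` provide the same check.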
🧰 Tools
🪛 Ruff (0.15.18)

[warning] 204-204: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


Source: Linters/SAST tools


Run directly:

python3 tests/test_arena_benchmark.py

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Fix stale direct-run path in module docstring.

The command points to tests/test_arena_benchmark.py, but this file lives under benches/, so the documented command is misleading.

Suggested fix
-    python3 tests/test_arena_benchmark.py
+    python3 benches/test_arena_benchmark.py
