| title | PR Review RL Environment | |
|---|---|---|
| emoji | 🚀 | |
| colorFrom | blue | |
| colorTo | indigo | |
| sdk | docker | |
| pinned | false | |
| tags | | |
| license | apache-2.0 | |
| short_description | RL-Improvement | |
A Meta OpenEnv Hackathon Submission
Current Large Language Models (LLMs) are frequently used for code generation, but training autonomous agents to review code requires a completely different skill set: spatial awareness, codebase navigation, and the ability to distinguish between genuine vulnerabilities and safe abstractions.
This OpenEnv project provides a high-fidelity simulation of a Senior Software Engineer's Pull Request workflow. Instead of spoon-feeding the agent a single static diff, this environment forces the agent to interactively traverse the repository, read dependent files, and leave precise, line-level spatial comments to secure the codebase.
- Interactive Dependency Traversal: Bugs rarely exist in a vacuum. A change in `api.py` might introduce a vulnerability because of a constant defined in `config.py`. Agents must actively use the `read_file` tool to explore the `file_tree` and gather context before making a decision.
- Spatial Action Space: Real reviewers don't leave massive global comments. The environment forces agents to map their text generation to precise spatial coordinates (`file` and `line`), converting a simple text task into a complex alignment task.
- Anti-Hallucination Rigor: The grading engine includes a strict `-0.20` penalty for False Rejections (rejecting a perfectly clean, bug-free PR). This penalty discourages agents from lazily guessing bugs just to farm points.
- Strict Spec Compliance: Complete Pydantic typing and strict clamping keep all rewards and final scores within `(0.01, 0.99)`, ensuring stability across the OpenEnv evaluation pipeline.
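The clamping and false-rejection rules above can be sketched as follows. This is a minimal illustration, not the project's grading engine: `clamp_score` and `final_score` are hypothetical names, and it assumes the episode score is the clamped sum of step rewards (which is consistent with the sample log in this README, where rewards of 0.01, 0.68, and 0.31 produce a score of 0.990). The bounds are treated as inclusive because the log shows both 0.01 and 0.990 being emitted.

```python
# Bounds and penalty are taken from the README; everything else is illustrative.
SCORE_MIN = 0.01
SCORE_MAX = 0.99
FALSE_REJECTION_PENALTY = -0.20


def clamp_score(raw: float) -> float:
    """Clamp a raw value into the inclusive range [SCORE_MIN, SCORE_MAX]."""
    return max(SCORE_MIN, min(SCORE_MAX, raw))


def final_score(step_rewards: list[float], false_rejection: bool = False) -> float:
    """Sum step rewards, apply the false-rejection penalty, then clamp.

    Assumption: the real grader may combine these terms differently.
    """
    total = sum(step_rewards)
    if false_rejection:
        total += FALSE_REJECTION_PENALTY
    return clamp_score(total)
```

Clamping last means no combination of rewards and penalties can push the reported score outside the bounds, whatever order the terms are accumulated in.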
The environment provides 25 distinct scenarios across three tiers of increasing complexity. Some scenarios are intentionally bug-free, so the agent is also trained not to hallucinate defects.
| Task ID | Max Steps | Success Threshold | Description |
|---|---|---|---|
| `easy` | 8 | 0.70 | Single-file PRs. Obvious errors like off-by-one loops, missing imports, and null dereferences. |
| `medium` | 15 | 0.60 | Multi-file PRs. Requires navigating the `file_tree` to spot SQL injections, path traversals, and cross-file logic inconsistencies. |
| `hard` | 20 | 0.50 | Complex PRs. Subtle security vulnerabilities (timing attacks), TOCTOU race conditions, and late-binding closure bugs. |
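As a rough sketch, the tiers in the table could be represented as a small lookup with a success check. The `TaskTier` class and `is_success` helper are hypothetical; in particular, treating the threshold as inclusive (`>=`) and requiring the step budget to be respected are assumptions, not documented behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTier:
    task_id: str
    max_steps: int
    success_threshold: float


# Values copied from the task table.
TIERS = {
    "easy": TaskTier("easy", 8, 0.70),
    "medium": TaskTier("medium", 15, 0.60),
    "hard": TaskTier("hard", 20, 0.50),
}


def is_success(task_id: str, score: float, steps_used: int) -> bool:
    """Assumption: success means the score meets the tier threshold
    and the agent stayed within the tier's step budget."""
    tier = TIERS[task_id]
    return steps_used <= tier.max_steps and score >= tier.success_threshold
```

Note that the threshold drops as difficulty rises, so the same score of 0.65 would pass on `hard` but fail on `easy` under this reading.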
The agent receives a rich state payload at every step:
observation_space:
pr_title: string # The PR Title
pr_description: string # The PR Description
diff: string # The initial git diff
file_tree: list[string] # Available repository files
current_file_path: string # The file currently being read
current_file_content: string # The contents of the active file
comments_so_far: list[object] # Spatial history of agent's comments
step_count: integer # Current step
done: boolean # Episode termination flag
scenario_id: string # The current task identifier
The agent interacts using exactly one of four typed actions per turn:
action_space:
action_type: "enum[read_file, comment, approve, request_changes]"
file: "string (optional)" # Target file for reading or commenting
line: "integer (optional)" # Target line number for comments
body: "string" # The text of the code review comment
- Neutral Exploration (`0.01`): Granted when the agent successfully navigates to and reads a valid file. Encourages gathering context without penalizing step count.
- Partial Success (`Variable up to 0.68`): Granted when the agent leaves a `comment` on the correct `file` and `line` (within a ±3 line buffer) that matches a hidden bug keyword.
- Terminal Decision (`0.31` or `0.01`): Granted at the end of the episode for correctly choosing `approve` or `request_changes`.
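The comment-matching rule above might look roughly like the sketch below. `HiddenBug`, `comment_matches`, and `comment_reward` are hypothetical names. Case-insensitive substring matching and awarding the full 0.68 on any match are assumptions; how the real grader scales the "variable" reward is not described here.

```python
from dataclasses import dataclass

LINE_TOLERANCE = 3         # "within a ±3 line buffer"
MAX_COMMENT_REWARD = 0.68  # upper bound stated in the reward list
MIN_REWARD = 0.01          # lower clamp bound


@dataclass(frozen=True)
class HiddenBug:  # hypothetical ground-truth record for one scenario bug
    file: str
    line: int
    keywords: tuple[str, ...]


def comment_matches(bug: HiddenBug, file: str, line: int, body: str) -> bool:
    """True when the comment targets the right file, lands within the line
    tolerance, and mentions at least one keyword (case-insensitive)."""
    if file != bug.file or abs(line - bug.line) > LINE_TOLERANCE:
        return False
    text = body.lower()
    return any(keyword.lower() in text for keyword in bug.keywords)


def comment_reward(bug: HiddenBug, file: str, line: int, body: str) -> float:
    # Assumption: a match earns the maximum, a miss earns the floor.
    if comment_matches(bug, file, line, body):
        return MAX_COMMENT_REWARD
    return MIN_REWARD
```

For example, with a bug recorded at line 5 of `src/db.py` and the keyword `"sql injection"`, a comment on line 3 whose body mentions "SQL injection vulnerability" falls inside the tolerance and earns 0.68, while the same comment on line 9 does not.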
The environment is fully containerized and deployable to Hugging Face Spaces using the Docker SDK.
# Build and start the environment server
docker-compose up --build env
If you prefer to run the environment outside of Docker for rapid iteration:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Start the FastAPI environment server
uvicorn src.api:app --host 0.0.0.0 --port 7860
The baseline agent utilizes a Multi-Turn ReAct Architecture to interact with the environment, navigating the file system before making a decision.
To run the agent against the server, you must provide your Hugging Face or OpenAI credentials:
# 1. Point the agent to the running environment (Update if using HF Spaces)
export ENV_URL="http://localhost:7860"
# 2. Set your LLM Inference credentials
export API_BASE_URL="https://api-inference.huggingface.co/v1/"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="hf_your_token_here"
# 3. Run the baseline evaluation
python inference.py
The `inference.py` script follows the OpenEnv standard stdout logging format:
[START] task=medium env=PRReviewEnv model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action='{"action_type": "read_file", "file": "src/db.py"}' reward=0.01 done=false error=null
[STEP] step=2 action='{"action_type": "comment", "file": "src/db.py", "line": 3, "body": "SQL injection vulnerability..."}' reward=0.68 done=false error=null
[STEP] step=3 action='{"action_type": "request_changes"}' reward=0.31 done=true error=null
[END] success=true steps=3 score=0.990 rewards=0.01,0.68,0.31
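The log lines above could be produced by formatters along these lines. This is a sketch inferred only from the sample output: the function names are hypothetical, and the field formats (two decimals for rewards, three for the score, lowercase booleans, `null` for a missing error) are read off the example rather than taken from the OpenEnv specification.

```python
import json


def _flag(value: bool) -> str:
    """Render a boolean the way the sample log does (lowercase)."""
    return "true" if value else "false"


def format_start(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, action: dict, reward: float, done: bool, error=None) -> str:
    action_json = json.dumps(action)  # default separators match the sample
    error_text = "null" if error is None else str(error)
    return (
        f"[STEP] step={step} action='{action_json}' "
        f"reward={reward:.2f} done={_flag(done)} error={error_text}"
    )


def format_end(success: bool, rewards: list[float], score: float) -> str:
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return (
        f"[END] success={_flag(success)} steps={len(rewards)} "
        f"score={score:.3f} rewards={joined}"
    )
```

Calling `format_step(1, {"action_type": "read_file", "file": "src/db.py"}, 0.01, False)` reproduces the first `[STEP]` line of the sample.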
The repository includes a standalone verification script to smoke-test the grading engine, Pydantic models, and strict mathematical reward clamping without requiring a live LLM or Docker container.
chmod +x verify.sh
./verify.sh
Expected Output: PASSED: 49 / FAILED: 0