Reproduction
OpenRewardSpec.environment_factory must supply GRPO with every ORS tool the rollout session can call. GRPO inspects bound Python methods once at trainer construction (before reset()), so the tool surface has to match what the live episode exposes.
If discovery uses only environment.list_tools() (OpenReward SDK → ORS GET /{env_name}/tools), only shared tools are returned. ORS also defines GET /{env_name}/task_tools (with X-Session-ID), which returns shared + task-specific tools (e.g. @tool(shared=False) and tools from list_task_tools()). See ORS HTTP API — tools vs task_tools and OpenReward — Using task-specific tools.
When task-scoped tools are omitted from binding, any rollout that invokes them fails (tool not found / not in schema), even though the same tool appears in session.list_tools() during a real session.
Related integration: #5696 (OpenReward Standard / OpenRewardSpec).
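The failure mode described above can be sketched without a server. This is a toy model, not TRL or OpenReward code: `ToyEnvironment` and `callable_tools` are hypothetical names. The only assumption is that tools are attached to the environment object once, at construction, from whatever names discovery returned.

```python
from typing import Callable, Iterable


class ToyEnvironment:
    """Hypothetical stand-in for the object returned by environment_factory()."""

    def __init__(self, tool_names: Iterable[str]) -> None:
        # Attach one callable per discovered tool, once, at construction time.
        for name in tool_names:
            setattr(self, name, self._make_tool(name))

    @staticmethod
    def _make_tool(name: str) -> Callable[..., str]:
        def call(**kwargs: object) -> str:
            return f"{name} called with {kwargs}"

        call.__name__ = name
        return call


def callable_tools(env: object, requested: Iterable[str]) -> dict[str, bool]:
    """Report which requested tool names resolve to a callable on env."""
    return {name: callable(getattr(env, name, None)) for name in requested}


SHARED = ["echo"]           # names a shared-only listing would return
SESSION = ["echo", "hint"]  # names the live session can actually call

shared_only = ToyEnvironment(SHARED)
with_task_tools = ToyEnvironment(SESSION)

print(callable_tools(shared_only, SESSION))      # {'echo': True, 'hint': False}
print(callable_tools(with_task_tools, SESSION))  # {'echo': True, 'hint': True}
```

Because attachment happens only once, a tool that is missing at construction stays missing for every later rollout, which is why the discovery call has to return the full session tool set.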
Reproduction
import os
import socket
import subprocess
import sys
import tempfile
import textwrap
import time

import requests

from trl.experimental.openreward import OpenRewardSpec

# Minimal ORS environment: one shared tool (echo) and one task-scoped tool (hint).
ECHO_ENV = textwrap.dedent("""
    from openreward.environments import (
        Environment, JSONObject, ListToolsOutput, Server,
        TextBlock, ToolOutput, ToolSpec, tool,
    )
    from pydantic import BaseModel

    TRAIN_TASKS = [{"id": "echo-0", "target": "hello"}]

    class EchoTaskSpec(BaseModel):
        id: str
        target: str

    class EchoParams(BaseModel):
        text: str

    class HintParams(BaseModel):
        pass

    class EchoEnvironment(Environment):
        def __init__(self, task_spec={}, secrets={}):
            super().__init__(task_spec)
            self.config = EchoTaskSpec.model_validate(task_spec)

        @classmethod
        def list_splits(cls):
            return ["train"]

        @classmethod
        def list_tasks(cls, split):
            return TRAIN_TASKS

        def get_prompt(self):
            return [TextBlock(type="text", text=f"Echo '{self.config.target}' to win.")]

        def list_task_tools(self):
            return ListToolsOutput(tools=[
                ToolSpec(
                    name="hint",
                    description="Task-scoped tool, only visible via /task_tools.",
                    input_schema=HintParams.model_json_schema(),
                )
            ])

        # Shared tool: listed by GET /{env_name}/tools.
        @tool
        async def echo(self, params: EchoParams) -> ToolOutput:
            correct = params.text == self.config.target
            return ToolOutput(
                blocks=[TextBlock(type="text", text="match" if correct else "no match")],
                reward=1.0 if correct else 0.0,
                finished=correct,
            )

        # Task-scoped tool: listed only by GET /{env_name}/task_tools.
        @tool(shared=False)
        async def hint(self, params: HintParams) -> ToolOutput:
            return ToolOutput(blocks=[TextBlock(text="try echo(text=...)")])

    if __name__ == "__main__":
        import os
        Server([EchoEnvironment]).run(host="127.0.0.1", port=int(os.environ["PORT"]))
""")


def _free_port():
    # Ask the OS for an unused TCP port on localhost.
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


# Write the environment to a temporary file and serve it in a subprocess.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(ECHO_ENV)
    echo_env_path = f.name

port = _free_port()
proc = subprocess.Popen(
    [sys.executable, echo_env_path],
    env={**os.environ, "PORT": str(port)},
)
url = f"http://127.0.0.1:{port}"

# Poll /health until the server is up; stop early if it never responds.
for _ in range(30):
    try:
        if requests.get(f"{url}/health", timeout=1.0).status_code == 200:
            break
    except requests.RequestException:
        pass
    time.sleep(0.2)
else:
    proc.terminate()
    raise RuntimeError("echo environment server did not become healthy")

os.environ["OPENREWARD_API_URL"] = url
os.environ["OPENREWARD_SESSION_URL"] = url
os.environ.setdefault("OPENREWARD_API_KEY", "test")

spec = OpenRewardSpec(url, env_name="echoenvironment", num_tasks=1, discover_task_tools=False)
env = spec.environment_factory()
print(callable(env.echo))    # True: shared tool, bound correctly
print(hasattr(env, "hint"))  # False: task-scoped tool missing from GRPO schema

# Clean up the server process and the temporary file.
proc.terminate()
os.unlink(echo_env_path)
Output:
/home/asyin/swapnil/trl-openreward/repro.py:13: TRLExperimentalWarning: You are importing from 'trl.experimental'. APIs here are unstable and may change or be removed without notice. Silence this warning by setting environment variable TRL_EXPERIMENTAL_SILENCE=1.
from trl.experimental.openreward import OpenRewardSpec
2026-05-07T17:39:31.737159Z [info ] server_starting [openreward.environments.server] build_sha=None host=127.0.0.1 port=55097 version=0.1.112
INFO: Started server process [3135731]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:55097 (Press CTRL+C to quit)
2026-05-07T17:39:31.803839Z [info ] request_handled [openreward.environments.server] httpRequest={'latency': '0.000782s', 'status': 200} method=GET path=/health session_id=
2026-05-07T17:39:33.339204Z [info ] request_handled [openreward.environments.server] httpRequest={'latency': '0.006219s', 'status': 200} method=POST path=/create_session session_id=
2026-05-07T17:39:33.342228Z [info ] request_handled [openreward.environments.server] httpRequest={'latency': '0.000317s', 'status': 200} method=GET path=/echoenvironment/tools session_id=0b9062c8-b86c-450e-9fde-903b359fa506
2026-05-07T17:39:33.343444Z [info ] request_handled [openreward.environments.server] httpRequest={'latency': '0.000159s', 'status': 200} method=POST path=/delete_session session_id=0b9062c8-b86c-450e-9fde-903b359fa506
True
False
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [3135731]
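The log above shows which endpoints discovery actually hit. As a rough sketch, the `method=` and `path=` fields can be pulled out of such lines to confirm that `/task_tools` was never requested. The helper name `requested_endpoints` is hypothetical, and the sample lines are trimmed copies of the `request_handled` entries above; the regular expression assumes the `method=... path=...` layout seen in this run, which may differ in other server versions.

```python
import re

# Matches the "method=GET path=/..." fields of a request_handled log line.
REQUEST_RE = re.compile(r"\bmethod=(?P<method>[A-Z]+)\s+path=(?P<path>\S+)")


def requested_endpoints(log_text: str) -> list[tuple[str, str]]:
    """Return (method, path) pairs in the order they appear in the log."""
    return [(m["method"], m["path"]) for m in REQUEST_RE.finditer(log_text)]


# Trimmed copies of the request_handled lines from the run above.
SAMPLE_LOG = """\
[info ] request_handled method=GET path=/health session_id=
[info ] request_handled method=POST path=/create_session session_id=
[info ] request_handled method=GET path=/echoenvironment/tools session_id=0b9062c8
[info ] request_handled method=POST path=/delete_session session_id=0b9062c8
"""

calls = requested_endpoints(SAMPLE_LOG)
hit_task_tools = any(path.endswith("/task_tools") for _, path in calls)

print(calls)
print(hit_task_tools)  # False
```

In this run the only tool-listing request is `GET /echoenvironment/tools`, which matches the missing `hint` attribute printed as `False`.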
System Info
- Platform: Linux-6.8.0-1049-oracle-x86_64-with-glibc2.35
- Python version: 3.13.9
- TRL version: 1.4.0.dev0+8a6cc03
- PyTorch version: 2.10.0
- accelerator(s): NVIDIA GeForce RTX 4090
- Transformers version: 5.8.0
- Accelerate version: 1.13.0
- Accelerate config: not found
- Datasets version: 4.8.5
- HF Hub version: 1.14.0
- bitsandbytes version: 0.49.2
- DeepSpeed version: 0.18.9
- Liger-Kernel version: 0.8.0
- PEFT version: 0.19.1
- vLLM version: not installed
Checklist