OpenReward binding only discovers shared /tools, misses task-specific /task_tools #5727

@rycerzes

Description

OpenRewardSpec.environment_factory must supply GRPO with every ORS tool the rollout session can call. GRPO inspects bound Python methods once at trainer construction (before reset()), so the tool surface has to match what the live episode exposes.
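The timing problem can be sketched in a few lines (class and attribute names here are illustrative, not TRL internals): binding inspects the Python object once, before any session exists, so a tool that only materializes per-session never enters the schema.

```python
# Sketch of the timing problem: tool binding happens once, at trainer
# construction, so anything only discoverable per-session is invisible.
class BoundEnv:
    def echo(self, text): ...  # bound at construction (shared tool)

# Static inspection of the bound object yields only the shared surface.
tool_schema = [name for name in dir(BoundEnv) if not name.startswith("_")]
print(tool_schema)  # ['echo'] — 'hint' can only appear after a session starts
```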

If discovery uses only environment.list_tools() (OpenReward SDK → ORS GET /{env_name}/tools), only shared tools are returned. ORS also defines GET /{env_name}/task_tools (with X-Session-ID), which returns shared + task-specific tools (e.g. @tool(shared=False) and tools from list_task_tools()). See ORS HTTP API — tools vs task_tools and OpenReward — Using task-specific tools.
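The set relationship between the two endpoints is the crux; a minimal sketch (tool names taken from the repro below, the endpoint comments from the ORS API described above):

```python
# /tools returns the shared surface; /task_tools (with X-Session-ID) returns
# shared + task-specific. Anything in the difference is never bound by GRPO.
shared_tools = {"echo"}           # GET /{env_name}/tools
task_tools = {"echo", "hint"}     # GET /{env_name}/task_tools

missing_from_binding = task_tools - shared_tools
print(missing_from_binding)  # {'hint'} — callable in a live session, absent from the schema
```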

When task-scoped tools are omitted from binding, any rollout that invokes them fails (tool not found / not in schema), even though the same tool appears in session.list_tools() during a real session.
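A possible shape for the fix is to merge both endpoints' specs by name before binding. This is a hypothetical sketch, not the TRL implementation; the dict-shaped specs and the function name are assumptions:

```python
# Hypothetical fix sketch: union the shared and task-scoped tool specs by
# name so binding sees the full surface a live session exposes.
def merge_tool_specs(shared: list[dict], task_scoped: list[dict]) -> list[dict]:
    by_name = {spec["name"]: spec for spec in shared}
    for spec in task_scoped:
        by_name.setdefault(spec["name"], spec)  # task specs fill gaps; shared wins on collision
    return list(by_name.values())

shared = [{"name": "echo", "shared": True}]
task = [{"name": "echo", "shared": True}, {"name": "hint", "shared": False}]
print([s["name"] for s in merge_tool_specs(shared, task)])  # ['echo', 'hint']
```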

Related integration: #5696 (OpenReward Standard / OpenRewardSpec).

Reproduction

import os
import socket
import subprocess
import sys
import tempfile
import time
import textwrap
import requests

from trl.experimental.openreward import OpenRewardSpec


ECHO_ENV = textwrap.dedent("""
    from openreward.environments import (
        Environment, JSONObject, ListToolsOutput, Server,
        TextBlock, ToolOutput, ToolSpec, tool,
    )
    from pydantic import BaseModel

    TRAIN_TASKS = [{"id": "echo-0", "target": "hello"}]

    class EchoTaskSpec(BaseModel):
        id: str
        target: str

    class EchoParams(BaseModel):
        text: str

    class HintParams(BaseModel):
        pass

    class EchoEnvironment(Environment):
        def __init__(self, task_spec={}, secrets={}):
            super().__init__(task_spec)
            self.config = EchoTaskSpec.model_validate(task_spec)

        @classmethod
        def list_splits(cls):
            return ["train"]

        @classmethod
        def list_tasks(cls, split):
            return TRAIN_TASKS

        def get_prompt(self):
            return [TextBlock(type="text", text=f"Echo '{self.config.target}' to win.")]

        def list_task_tools(self):
            return ListToolsOutput(tools=[
                ToolSpec(
                    name="hint",
                    description="Task-scoped tool, only visible via /task_tools.",
                    input_schema=HintParams.model_json_schema(),
                )
            ])

        @tool
        async def echo(self, params: EchoParams) -> ToolOutput:
            correct = params.text == self.config.target
            return ToolOutput(
                blocks=[TextBlock(type="text", text="match" if correct else "no match")],
                reward=1.0 if correct else 0.0,
                finished=correct,
            )

        @tool(shared=False)
        async def hint(self, params: HintParams) -> ToolOutput:
            return ToolOutput(blocks=[TextBlock(text="try echo(text=...)")])

    if __name__ == "__main__":
        import os
        Server([EchoEnvironment]).run(host="127.0.0.1", port=int(os.environ["PORT"]))
""")


def _free_port():
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(ECHO_ENV)
    echo_env_path = f.name

port = _free_port()
proc = subprocess.Popen(
    [sys.executable, echo_env_path],
    env={**os.environ, "PORT": str(port)},
)
url = f"http://127.0.0.1:{port}"

for _ in range(30):
    try:
        if requests.get(f"{url}/health", timeout=1.0).status_code == 200:
            break
    except requests.RequestException:
        pass
    time.sleep(0.2)
else:
    proc.terminate()
    raise RuntimeError("echo environment server never became healthy")

os.environ["OPENREWARD_API_URL"] = url
os.environ["OPENREWARD_SESSION_URL"] = url
os.environ.setdefault("OPENREWARD_API_KEY", "test")

spec = OpenRewardSpec(url, env_name="echoenvironment", num_tasks=1, discover_task_tools=False)
env = spec.environment_factory()

print(callable(env.echo))     # True  — shared tool, bound correctly
print(hasattr(env, "hint"))   # False — task-scoped tool missing from GRPO schema

proc.terminate()
os.unlink(echo_env_path)

Output:

/home/asyin/swapnil/trl-openreward/repro.py:13: TRLExperimentalWarning: You are importing from 'trl.experimental'. APIs here are unstable and may change or be removed without notice. Silence this warning by setting environment variable TRL_EXPERIMENTAL_SILENCE=1.
  from trl.experimental.openreward import OpenRewardSpec
2026-05-07T17:39:31.737159Z [info     ] server_starting                [openreward.environments.server] build_sha=None host=127.0.0.1 port=55097 version=0.1.112
INFO:     Started server process [3135731]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:55097 (Press CTRL+C to quit)
2026-05-07T17:39:31.803839Z [info     ] request_handled                [openreward.environments.server] httpRequest={'latency': '0.000782s', 'status': 200} method=GET path=/health session_id=
2026-05-07T17:39:33.339204Z [info     ] request_handled                [openreward.environments.server] httpRequest={'latency': '0.006219s', 'status': 200} method=POST path=/create_session session_id=
2026-05-07T17:39:33.342228Z [info     ] request_handled                [openreward.environments.server] httpRequest={'latency': '0.000317s', 'status': 200} method=GET path=/echoenvironment/tools session_id=0b9062c8-b86c-450e-9fde-903b359fa506
2026-05-07T17:39:33.343444Z [info     ] request_handled                [openreward.environments.server] httpRequest={'latency': '0.000159s', 'status': 200} method=POST path=/delete_session session_id=0b9062c8-b86c-450e-9fde-903b359fa506
True
False
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [3135731]

System Info

  • Platform: Linux-6.8.0-1049-oracle-x86_64-with-glibc2.35
  • Python version: 3.13.9
  • TRL version: 1.4.0.dev0+8a6cc03
  • PyTorch version: 2.10.0
  • accelerator(s): NVIDIA GeForce RTX 4090
  • Transformers version: 5.8.0
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • Datasets version: 4.8.5
  • HF Hub version: 1.14.0
  • bitsandbytes version: 0.49.2
  • DeepSpeed version: 0.18.9
  • Liger-Kernel version: 0.8.0
  • PEFT version: 0.19.1
  • vLLM version: not installed

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshot, more on code blocks)
  • Any traceback provided is complete
