OpenReward binding only discovers shared `/tools`, misses task-specific `/task_tools` #5727

### Reproduction

`OpenRewardSpec.environment_factory` must supply GRPO with every ORS tool the rollout session can call. GRPO inspects bound Python methods **once** at trainer construction (before `reset()`), so the tool surface has to match what the live episode exposes.

If discovery uses only **`environment.list_tools()`** (OpenReward SDK → ORS **`GET /{env_name}/tools`**), only **shared** tools are returned. ORS also defines **`GET /{env_name}/task_tools`** (with `X-Session-ID`), which returns **shared + task-specific** tools (e.g. `@tool(shared=False)` and tools from `list_task_tools()`). See [ORS HTTP API — tools vs task_tools](https://openrewardstandard.io/specification/http-api.md) and [OpenReward — Using task-specific tools](https://docs.openreward.ai/environments/using-task-specific-tools.md).
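The difference between the two endpoints can be checked directly against a running ORS server. The following is a minimal sketch, not part of TRL or the OpenReward SDK: the helper names `fetch_tool_names` and `task_only_tools` are hypothetical, and the response shape `{"tools": [{"name": ...}, ...]}` is an assumption inferred from `ListToolsOutput(tools=[ToolSpec(name=...)])`.

```python
import json
import urllib.request


def fetch_tool_names(base_url: str, env_name: str, endpoint: str, session_id: str) -> set[str]:
    """Return the tool names listed by GET /{env_name}/{endpoint}.

    `endpoint` is either "tools" (shared only) or "task_tools" (shared + task-specific).
    Assumption: the JSON body looks like {"tools": [{"name": ...}, ...]}.
    """
    request = urllib.request.Request(
        f"{base_url}/{env_name}/{endpoint}",
        headers={"X-Session-ID": session_id},
    )
    with urllib.request.urlopen(request, timeout=5.0) as response:
        payload = json.load(response)
    return {tool["name"] for tool in payload["tools"]}


def task_only_tools(shared: set[str], shared_and_task: set[str]) -> list[str]:
    """Names that /task_tools reports but /tools does not."""
    return sorted(shared_and_task - shared)
```

With an open session, `task_only_tools(fetch_tool_names(url, env, "tools", sid), fetch_tool_names(url, env, "task_tools", sid))` would list exactly the tools that a `/tools`-only discovery never binds.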

When task-scoped tools are omitted from binding, any rollout that invokes them fails (tool not found / not in schema), even though the same tool appears in `session.list_tools()` during a real session.
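To illustrate why the omission matters at binding time, here is a hedged sketch of binding the union of both tool lists at construction. It is not the TRL implementation: `merge_tool_specs` and `BoundEnv` are hypothetical names, tool specs are modeled as plain dicts with a `name` key, and `call_tool` stands in for whatever forwards a call to the live session.

```python
from typing import Any, Callable


def merge_tool_specs(shared: list[dict[str, Any]], task: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Union of both lists keyed by tool name; the task-level entry wins on a clash."""
    merged = {spec["name"]: spec for spec in shared}
    merged.update({spec["name"]: spec for spec in task})
    return list(merged.values())


class BoundEnv:
    """Hypothetical stand-in for the object returned by `environment_factory`."""

    def __init__(self, specs: list[dict[str, Any]], call_tool: Callable[[str, dict], Any]):
        # Attributes are attached once, here; tools missing from `specs`
        # can never be called through this object later.
        for spec in specs:
            setattr(self, spec["name"], self._make_callable(spec["name"], call_tool))

    @staticmethod
    def _make_callable(name: str, call_tool: Callable[[str, dict], Any]) -> Callable[..., Any]:
        def tool_callable(**kwargs: Any) -> Any:
            return call_tool(name, kwargs)

        tool_callable.__name__ = name
        return tool_callable
```

Binding only the shared list reproduces the symptom reported here (`hasattr(env, "hint")` is `False`), while binding the merged list exposes both `echo` and `hint`.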

Related integration: [#5696](https://github.com/huggingface/trl/pull/5696) (OpenReward Standard / `OpenRewardSpec`).

Minimal reproduction script:
```python
import os
import socket
import subprocess
import sys
import tempfile
import time
import textwrap
import requests

from trl.experimental.openreward import OpenRewardSpec


ECHO_ENV = textwrap.dedent("""
    from openreward.environments import (
        Environment, JSONObject, ListToolsOutput, Server,
        TextBlock, ToolOutput, ToolSpec, tool,
    )
    from pydantic import BaseModel

    TRAIN_TASKS = [{"id": "echo-0", "target": "hello"}]

    class EchoTaskSpec(BaseModel):
        id: str
        target: str

    class EchoParams(BaseModel):
        text: str

    class HintParams(BaseModel):
        pass

    class EchoEnvironment(Environment):
        def __init__(self, task_spec={}, secrets={}):
            super().__init__(task_spec)
            self.config = EchoTaskSpec.model_validate(task_spec)

        @classmethod
        def list_splits(cls):
            return ["train"]

        @classmethod
        def list_tasks(cls, split):
            return TRAIN_TASKS

        def get_prompt(self):
            return [TextBlock(type="text", text=f"Echo '{self.config.target}' to win.")]

        def list_task_tools(self):
            return ListToolsOutput(tools=[
                ToolSpec(
                    name="hint",
                    description="Task-scoped tool, only visible via /task_tools.",
                    input_schema=HintParams.model_json_schema(),
                )
            ])

        @tool
        async def echo(self, params: EchoParams) -> ToolOutput:
            correct = params.text == self.config.target
            return ToolOutput(
                blocks=[TextBlock(type="text", text="match" if correct else "no match")],
                reward=1.0 if correct else 0.0,
                finished=correct,
            )

        @tool(shared=False)
        async def hint(self, params: HintParams) -> ToolOutput:
            return ToolOutput(blocks=[TextBlock(type="text", text="try echo(text=...)")])

    if __name__ == "__main__":
        import os
        Server([EchoEnvironment]).run(host="127.0.0.1", port=int(os.environ["PORT"]))
""")


def _free_port():
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(ECHO_ENV)
    echo_env_path = f.name

port = _free_port()
proc = subprocess.Popen(
    [sys.executable, echo_env_path],
    env={**os.environ, "PORT": str(port)},
)
url = f"http://127.0.0.1:{port}"

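# Poll /health up to 30 times; fail loudly if the server never comes up.
for _ in range(30):
    try:
        if requests.get(f"{url}/health", timeout=1.0).status_code == 200:
            break
    except requests.RequestException:
        pass
    time.sleep(0.2)
else:
    proc.terminate()
    raise RuntimeError("echo environment server did not become healthy")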

os.environ["OPENREWARD_API_URL"] = url
os.environ["OPENREWARD_SESSION_URL"] = url
os.environ.setdefault("OPENREWARD_API_KEY", "test")

spec = OpenRewardSpec(url, env_name="echoenvironment", num_tasks=1, discover_task_tools=False)
env = spec.environment_factory()

print(callable(env.echo))     # True  — shared tool, bound correctly
print(hasattr(env, "hint"))   # False — task-scoped tool missing from GRPO schema

proc.terminate()
proc.wait(timeout=10)  # reap the server process before removing its script
os.unlink(echo_env_path)
```

Output:
```
/home/asyin/swapnil/trl-openreward/repro.py:13: TRLExperimentalWarning: You are importing from 'trl.experimental'. APIs here are unstable and may change or be removed without notice. Silence this warning by setting environment variable TRL_EXPERIMENTAL_SILENCE=1.
  from trl.experimental.openreward import OpenRewardSpec
2026-05-07T17:39:31.737159Z [info     ] server_starting                [openreward.environments.server] build_sha=None host=127.0.0.1 port=55097 version=0.1.112
INFO:     Started server process [3135731]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:55097 (Press CTRL+C to quit)
2026-05-07T17:39:31.803839Z [info     ] request_handled                [openreward.environments.server] httpRequest={'latency': '0.000782s', 'status': 200} method=GET path=/health session_id=
2026-05-07T17:39:33.339204Z [info     ] request_handled                [openreward.environments.server] httpRequest={'latency': '0.006219s', 'status': 200} method=POST path=/create_session session_id=
2026-05-07T17:39:33.342228Z [info     ] request_handled                [openreward.environments.server] httpRequest={'latency': '0.000317s', 'status': 200} method=GET path=/echoenvironment/tools session_id=0b9062c8-b86c-450e-9fde-903b359fa506
2026-05-07T17:39:33.343444Z [info     ] request_handled                [openreward.environments.server] httpRequest={'latency': '0.000159s', 'status': 200} method=POST path=/delete_session session_id=0b9062c8-b86c-450e-9fde-903b359fa506
True
False
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [3135731]
```

### System Info

- Platform: Linux-6.8.0-1049-oracle-x86_64-with-glibc2.35
- Python version: 3.13.9
- TRL version: 1.4.0.dev0+8a6cc03
- PyTorch version: 2.10.0
- accelerator(s): NVIDIA GeForce RTX 4090
- Transformers version: 5.8.0
- Accelerate version: 1.13.0
- Accelerate config: not found
- Datasets version: 4.8.5
- HF Hub version: 1.14.0
- bitsandbytes version: 0.49.2
- DeepSpeed version: 0.18.9
- Liger-Kernel version: 0.8.0
- PEFT version: 0.19.1
- vLLM version: not installed

### Checklist

- [x] I have checked that my issue isn't already filed (see [open issues](https://github.com/huggingface/trl/issues?q=is%3Aissue))
- [x] I have included my system information
- [x] Any code provided is minimal, complete, and reproducible ([more on MREs](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks))
- [x] Any code provided is properly formatted in code blocks, (no screenshot, [more on code blocks](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks))
- [x] Any traceback provided is complete
