diff --git a/docs/dev/model-service/proxy.md b/docs/dev/model-service/proxy.md
new file mode 100644
index 0000000000..b0f77da0f3
--- /dev/null
+++ b/docs/dev/model-service/proxy.md
@@ -0,0 +1,154 @@
+# model-service `proxy` mode
+
+The proxy mode of `rock model-service` exposes an OpenAI-compatible forwarding layer at
+`/v1/chat/completions`. It runs in one of two mutually exclusive modes:
+
+| Mode      | Trigger                             | Upstream call | Disk writes                 |
+|-----------|-------------------------------------|---------------|-----------------------------|
+| Recording | Default                             | Real call     | Appends to JSONL trajectory |
+| Replay    | `--replay-file` / `replay_file` set | None          | None                        |
+
+The goal is to let agent frameworks such as SWE-agent, mini-swe-agent, and OpenHands switch between
+recording and replay transparently: the agent stays the same and only the base URL changes.
+
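+All commands below start the service with `rock model-service start`. That subcommand ultimately spawns
+`rock.sdk.model.server.main` via `subprocess`, and both accept the same flags. When debugging, you can run
+`python -m rock.sdk.model.server.main` directly to skip PID file management.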
+
+---
+
+## 1. Recording (default)
+
+Requests are forwarded to a single upstream, and each call appends one JSONL line to `recording_file`
+(default `LOG_DIR/LLMTraj.jsonl`, where `LOG_DIR = $ROCK_MODEL_SERVICE_DATA_DIR`):
+
+```bash
+export OPENAI_API_KEY="sk-..."
+export ROCK_MODEL_SERVICE_DATA_DIR=/tmp/rock-traj
+
+rock model-service start \
+    --type proxy \
+    --proxy-base-url https://api.openai.com/v1 \
+    --port 8080
+```
+
+Send a request:
+
+```bash
+curl -X POST http://127.0.0.1:8080/v1/chat/completions \
+    -H "Authorization: Bearer $OPENAI_API_KEY" \
+    -H "Content-Type: application/json" \
+    -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"hi"}]}'
+
+jq '.model, .response.choices[0].message.content' /tmp/rock-traj/LLMTraj.jsonl
+```
+
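+Streaming is supported too: upstream bytes are passed to the client unchanged while the recorder aggregates
+the final `ChatCompletion` on the side and writes it to disk (it uses the openai SDK's `ChatCompletionStreamState`,
+so fields spliced across chunks, such as `tool_calls.function.arguments`, are restored to their complete form):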
+
+```bash
+curl -N -X POST http://127.0.0.1:8080/v1/chat/completions \
+    -H "Authorization: Bearer $OPENAI_API_KEY" \
+    -H "Content-Type: application/json" \
+    -d '{"model":"gpt-3.5-turbo","stream":true,"messages":[{"role":"user","content":"count to 5"}]}'
+```
+
+To write to a different path explicitly:
+
+```bash
+rock model-service start \
+    --type proxy \
+    --proxy-base-url https://api.openai.com/v1 \
+    --recording-file /tmp/my-session.jsonl \
+    --port 8080
+```
+
+---
+
+## 2. Replay
+
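+Point `--replay-file` at a recorded JSONL file: the proxy no longer contacts a real LLM and returns the
+recorded responses in order. To replay, the agent only needs to switch its base URL to `http://127.0.0.1:8081/v1`: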
+
+```bash
+rock model-service start \
+    --type proxy \
+    --replay-file /tmp/rock-traj/LLMTraj.jsonl \
+    --port 8081
+```
+
+Behavior details:
+
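+- The cursor advances monotonically and each request consumes one record; once the records are exhausted the proxy returns **404**.
+- For streaming requests, the recorded `ChatCompletion` is re-emitted as a single SSE chunk followed by `[DONE]`.
+  The `index` field of `tool_calls` is injected automatically (OpenAI's streaming protocol requires `index` on chunk deltas,
+  but the recorded `message.tool_calls` does not carry one).
+- The `model` in the request is compared with the recorded `model`; a mismatch only logs a warning and does not block the request.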
+
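+`recording_file` and `replay_file` are **mutually exclusive**: configuring both (via CLI or YAML) is rejected at startup
+by a Pydantic `model_validator` with a `ValidationError`, which prevents subtle bugs such as overwriting the source file mid-recording.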
+
+---
+
+## 3. Retries and timeouts
+
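+- By default, connection errors, timeouts, and `retryable_status_codes` (default `[429, 500]`) trigger a retry: up to 6 attempts
+  with exponential backoff (2s base, ×2 per attempt, jittered). If the last attempt still returns a retryable status, a non-streaming
+  request receives the upstream response and status code verbatim (**not** wrapped into 502/504); only exhausted connection errors and timeouts surface as 502/504.
+- For **streaming** requests, retries happen only **before** the first byte reaches the client. Once forwarding has begun,
+  a dropped connection is not retried (bytes already sent cannot be recalled).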
+
+```bash
+rock model-service start \
+    --type proxy \
+    --proxy-base-url https://api.openai.com/v1 \
+    --retryable-status-codes 429,500,502,503 \
+    --request-timeout 60 \
+    --port 8080
+```
+
+---
+
+## 4. Multi-model routing (YAML)
+
+Routing by model name to different upstreams requires YAML (the CLI exposes only a single `--proxy-base-url`). Create `routes.yaml`:
+
+```yaml
+proxy_rules:
+  gpt-3.5-turbo: "https://api.openai.com/v1"
+  gpt-4o:        "https://api.openai.com/v1"
+  default:       "https://api-inference.modelscope.cn/v1"
+
+retryable_status_codes: [429, 500, 502]
+request_timeout: 60
+recording_file: /tmp/rock-traj/multi.jsonl
+```
+
+Start the service:
+
+```bash
+rock model-service start \
+    --type proxy \
+    --config-file routes.yaml \
+    --port 8080
+```
+
+CLI flags (`--proxy-base-url` / `--port` / `--retryable-status-codes` / ...) override the YAML fields of the same name.
+Route resolution order: `proxy_base_url` → `proxy_rules[model]` → `proxy_rules["default"]`; if none matches, the proxy returns 400.
+
+---
+
+## 5. Implementation notes (for reference only)
+
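+- The `chat_completions` endpoint dispatches each request to `app.state.backend`, which is either a `ForwardBackend`
+  or a `ReplayBackend`; `_configure_proxy_integrations` picks one at startup depending on whether `replay_file`
+  is set.
+- `ForwardBackend` passes bytes through httpx: non-stream uses `await resp.aread()`, and stream yields
+  `resp.aiter_bytes()` straight to the client **without** any SDK deserialization or re-serialization, so arbitrary
+  vendor fields returned by the upstream, such as `reasoning_content` or `provider_specific_fields`, are never dropped.
+  On a separate path, the recorder feeds the byte stream into the openai SDK's stream-state aggregator, solely for writing to disk.
+- `ReplayBackend` is fully local and holds no httpx client.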
+
+For a deeper code walkthrough, see the module docstring at the top of
+[rock/sdk/model/server/api/proxy.py](../../../rock/sdk/model/server/api/proxy.py).
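
Note on the replay streaming behavior documented in section 2 of `proxy.md`: the sketch below illustrates, under stated assumptions, how a recorded `chat.completion` could be reshaped into a single `chat.completion.chunk` with `index` injected into each tool call. The helper names `completion_to_chunk` and `encode_sse` are hypothetical and stand in for the `sse.py` helpers, whose actual implementation may differ.

```python
import json


def completion_to_chunk(completion: dict) -> dict:
    """Hypothetical helper: reshape a recorded chat.completion into one chunk dict."""
    choices = []
    for choice in completion.get("choices", []):
        delta = dict(choice.get("message") or {})
        tool_calls = delta.get("tool_calls")
        if tool_calls:
            # Streaming deltas carry a positional `index` on every tool call;
            # recorded messages do not, so it is injected here.
            delta["tool_calls"] = [{"index": i, **tc} for i, tc in enumerate(tool_calls)]
        choices.append(
            {
                "index": choice.get("index", 0),
                "delta": delta,  # `message` is renamed to `delta`
                "finish_reason": choice.get("finish_reason"),
            }
        )
    chunk = {k: v for k, v in completion.items() if k != "choices"}
    chunk["object"] = "chat.completion.chunk"
    chunk["choices"] = choices
    return chunk


def encode_sse(payload: dict) -> bytes:
    """Encode one payload as a single SSE event."""
    return b"data: " + json.dumps(payload).encode("utf-8") + b"\n\n"


# Made-up record for illustration only.
recorded = {
    "id": "chatcmpl-example",
    "object": "chat.completion",
    "model": "gpt-3.5-turbo",
    "choices": [
        {
            "index": 0,
            "finish_reason": "tool_calls",
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {"id": "call_1", "type": "function",
                     "function": {"name": "ls", "arguments": "{}"}}
                ],
            },
        }
    ],
}

chunk = completion_to_chunk(recorded)
frames = [encode_sse(chunk), b"data: [DONE]\n\n"]
```

Emitting the whole response as one chunk keeps replay simple: a client that aggregates deltas ends up with the same final message it would have received from the non-streaming path.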
diff --git a/pyproject.toml b/pyproject.toml
index badb7d1a4b..d7d7a591b0 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -86,6 +86,8 @@ model-service = [
     "psutil",
     "swebench",
     "alibabacloud_cr20181201==2.0.5",
+    "openai>=1.50.0",
+    "httpx",
 ]
 
 
diff --git a/rock/cli/command/model_service.py b/rock/cli/command/model_service.py
index 87e6ca60e6..03cc59582d 100644
--- a/rock/cli/command/model_service.py
+++ b/rock/cli/command/model_service.py
@@ -82,6 +82,8 @@ async def arun(self, args: argparse.Namespace):
                 proxy_base_url=args.proxy_base_url,
                 retryable_status_codes=args.retryable_status_codes,
                 request_timeout=args.request_timeout,
+                recording_file=args.recording_file,
+                replay_file=args.replay_file,
             )
             logger.info(f"model service started, pid: {pid}")
             with open(self.DEFAULT_MODEL_SERVICE_PID_FILE, "w") as f:
@@ -178,6 +180,18 @@ async def add_parser_to(subparsers: argparse._SubParsersAction):
             default=None,
             help="Request timeout in seconds. Overrides config file.",
         )
+        start_parser.add_argument(
+            "--recording-file",
+            type=str,
+            default=None,
+            help="Proxy mode only: where to write the trajectory JSONL. Defaults to LOG_DIR/LLMTraj.jsonl.",
+        )
+        start_parser.add_argument(
+            "--replay-file",
+            type=str,
+            default=None,
+            help="Proxy mode only: replay from a recorded .jsonl traj file. Mutually exclusive with --recording-file.",
+        )
 
         watch_agent_parser = model_service_subparsers.add_parser(
             "watch-agent",
diff --git a/rock/sdk/model/server/api/proxy.py b/rock/sdk/model/server/api/proxy.py
index fb2b7bec3c..73f74e3f62 100644
--- a/rock/sdk/model/server/api/proxy.py
+++ b/rock/sdk/model/server/api/proxy.py
@@ -1,13 +1,45 @@
+"""OpenAI-compatible chat/completions proxy with trajectory record/replay.
+
+Two backends share the ``/v1/chat/completions`` route:
+
+1. **ForwardBackend** (default) — body bytes are POSTed verbatim to the
+   configured upstream via plain ``httpx``. The upstream response is forwarded
+   byte-for-byte back to the client (raw JSON for non-stream, raw SSE bytes
+   for stream). On the side we run a parser (``ChatCompletionChunk`` +
+   ``ChatCompletionStreamState`` from the openai SDK) to aggregate streaming
+   chunks into a final ChatCompletion that the recorder writes to JSONL. The
+   forward path itself does NOT depend on OpenAI types — anything the upstream
+   returns (provider-specific ``reasoning_content``, ``citations``, ...) is
+   passed through untouched.
+
+2. **ReplayBackend** (``replay_file`` set) — the request is served
+   directly from the next record in the ``SequentialCursor`` without any
+   upstream call. Streaming emits the recorded response as one SSE chunk +
+   ``[DONE]``.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+import random
+import time
+from collections.abc import AsyncIterator
 from typing import Any
 
 import httpx
 from fastapi import APIRouter, HTTPException, Request
-from fastapi.responses import JSONResponse
+from fastapi.responses import JSONResponse, Response, StreamingResponse
 
 from rock.logger import init_logger
 from rock.sdk.model.server.config import ModelServiceConfig
-from rock.sdk.model.server.utils import record_traj
-from rock.utils import retry_async
+from rock.sdk.model.server.sse import (
+    SSE_DONE,
+    completion_to_chunk_dict,
+    encode_sse_event,
+    parse_sse_data_chunks,
+)
+from rock.sdk.model.server.traj import SequentialCursor, TrajectoryExhausted, TrajectoryRecorder
 
 logger = init_logger(__name__)
 
@@ -15,111 +47,352 @@
 proxy_router = APIRouter()
 
 
-# Global HTTP client with a persistent connection pool
-http_client = httpx.AsyncClient()
+# Headers we never forward upstream:
+#   - host / content-length: rebuilt by httpx for the upstream request
+#   - transfer-encoding / connection: RFC 7230 hop-by-hop, scoped to one connection
+_HEADERS_NOT_TO_FORWARD = frozenset({"host", "content-length", "transfer-encoding", "connection"})
 
+# Retry knobs for upstream POST. Read at call-time so tests can monkeypatch them.
+# Default: up to 6 attempts; each sleep is uniform in [0, 2 × delay], so the means are 2s → 4s → 8s → 16s → 32s.
+_RETRY_MAX_ATTEMPTS = 6
+_RETRY_DELAY_SECONDS = 2.0
+_RETRY_BACKOFF = 2.0
 
-@retry_async(
-    max_attempts=6,
-    delay_seconds=2.0,
-    backoff=2.0,  # Exponential backoff (2s, 4s, 8s, 16s, 32s).
-    jitter=True,  # Adds randomness to prevent "thundering herd" effect on the backend.
-    exceptions=(httpx.TimeoutException, httpx.ConnectError, httpx.HTTPStatusError),
-)
-async def perform_llm_request(url: str, body: dict, headers: dict, config: ModelServiceConfig):
-    """
-    Forwards the request and triggers retry ONLY if the status code
-    is in the explicit retryable whitelist.
+
+async def _send_with_retry(
+    client: httpx.AsyncClient,
+    url: str,
+    *,
+    body_bytes: bytes,
+    headers: dict[str, str],
+    retryable_codes: list[int],
+) -> httpx.Response:
+    """POST with retry on connection errors and whitelisted statuses, returning
+    an open streaming response.
+
+    Always uses ``stream=True`` so the same path serves both stream and non-stream
+    callers — non-stream just calls ``await resp.aread()`` to materialize the body.
+    Assumes a failed upstream returns its error body before any byte is yielded
+    to downstream (so retry can still discard it cleanly).
+
+    Caller MUST ``await resp.aclose()`` after consuming.
     """
-    response = await http_client.post(url, json=body, headers=headers, timeout=config.request_timeout)
-    status_code = response.status_code
+    last_exc: Exception | None = None
+    delay = _RETRY_DELAY_SECONDS
+    for attempt in range(1, _RETRY_MAX_ATTEMPTS + 1):
+        try:
+            resp = await client.send(
+                client.build_request("POST", url, content=body_bytes, headers=headers),
+                stream=True,
+            )
+        except (httpx.TimeoutException, httpx.ConnectError) as exc:
+            last_exc = exc
+            if attempt >= _RETRY_MAX_ATTEMPTS:
+                raise
+            logger.warning(f"connect failed (attempt {attempt}/{_RETRY_MAX_ATTEMPTS}): {exc}")
+            await asyncio.sleep(random.uniform(0, delay * 2))
+            delay *= _RETRY_BACKOFF
+            continue
 
-    # Check against the explicit whitelist
-    if status_code in config.retryable_status_codes:
-        logger.warning(f"Retryable error detected: {status_code}. Triggering retry for {url}...")
-        response.raise_for_status()
+        if resp.status_code in retryable_codes and attempt < _RETRY_MAX_ATTEMPTS:
+            await resp.aclose()
+            logger.warning(f"upstream status {resp.status_code}, retry {attempt}/{_RETRY_MAX_ATTEMPTS}")
+            await asyncio.sleep(random.uniform(0, delay * 2))
+            delay *= _RETRY_BACKOFF
+            continue
 
-    return response
+        return resp
 
+    raise last_exc  # pragma: no cover  # unreachable
 
-def get_base_url(model_name: str, config: ModelServiceConfig) -> str:
-    """
-    Selects the target backend URL based on model name matching.
 
-    If proxy_base_url is configured, it takes precedence over proxy_rules.
-    """
-    # If direct proxy base URL is configured, return it directly (bypass model name matching)
-    if config.proxy_base_url:
-        return config.proxy_base_url.rstrip("/")
-
-    if not model_name:
-        raise HTTPException(status_code=400, detail="Model name is required for routing.")
-
-    rules = config.proxy_rules
-    base_url = rules.get(model_name) or rules.get("default")
-    if not base_url:
-        raise HTTPException(
-            status_code=400, detail=f"Model '{model_name}' is not configured and no 'default' rule found."
+def _filter_headers(headers) -> dict[str, str]:
+    """Drop headers that are scoped to the client↔proxy hop or rebuilt by httpx.
+    ``Authorization`` is forwarded verbatim — proxy stays stateless about which
+    API key the client uses."""
+    out = {}
+    for key, value in headers.items():
+        if key.lower() in _HEADERS_NOT_TO_FORWARD:
+            continue
+        out[key] = value
+    return out
+
+
+class ReplayBackend:
+    """Serves requests from a pre-recorded trajectory; no upstream calls made."""
+
+    def __init__(self, cursor: SequentialCursor) -> None:
+        self._cursor = cursor
+
+    async def serve(self, *, model_name: str, is_stream: bool, **_: Any) -> Response:
+        try:
+            record = await self._cursor.next(expected_model=model_name)
+        except TrajectoryExhausted as exc:
+            raise HTTPException(status_code=404, detail=str(exc))
+
+        response_dict = record.get("response")
+        if not isinstance(response_dict, dict):
+            raise HTTPException(
+                status_code=500,
+                detail=f"replay record at step {self._cursor.position - 1} has no usable response dict",
+            )
+        logger.info(f"[replay] step {self._cursor.position}/{self._cursor.total} served for model={model_name!r}")
+
+        if is_stream:
+            return StreamingResponse(
+                self._sse_iter(response_dict, model=model_name),
+                media_type="text/event-stream",
+            )
+        return JSONResponse(status_code=200, content=response_dict)
+
+    @staticmethod
+    async def _sse_iter(response: dict, *, model: str) -> AsyncIterator[bytes]:
+        """Emit a recorded response as one SSE chunk + ``[DONE]``."""
+        yield encode_sse_event(completion_to_chunk_dict(response, model=model))
+        yield SSE_DONE
+
+
+class ForwardBackend:
+    """Forwards requests byte-for-byte to the upstream and optionally records the trajectory."""
+
+    def __init__(self, config: ModelServiceConfig, recorder: TrajectoryRecorder | None = None) -> None:
+        self._config = config
+        self._recorder = recorder
+
+    def _resolve_base_url(self, model_name: str) -> str:
+        """Pick the upstream base URL by model name.
+
+        ``proxy_base_url`` takes precedence; falls back to ``proxy_rules[model]`` and
+        then ``proxy_rules["default"]``. Trailing slashes are stripped so the caller
+        can append ``/chat/completions`` directly.
+        """
+        if self._config.proxy_base_url:
+            return self._config.proxy_base_url.rstrip("/")
+
+        if not model_name:
+            raise HTTPException(status_code=400, detail="Model name is required for routing.")
+
+        rules = self._config.proxy_rules
+        base_url = rules.get(model_name) or rules.get("default")
+        if not base_url:
+            raise HTTPException(
+                status_code=400,
+                detail=f"Model '{model_name}' is not configured and no 'default' rule found.",
+            )
+
+        return base_url.rstrip("/")
+
+    async def serve(
+        self,
+        *,
+        model_name: str,
+        is_stream: bool,
+        body_bytes: bytes,
+        fwd_headers: dict[str, str],
+        request_dict: dict[str, Any],
+        **_: Any,
+    ) -> Response:
+        upstream_url = f"{self._resolve_base_url(model_name)}/chat/completions"
+        logger.info(f"Routing model {model_name!r} to {upstream_url}")
+
+        if is_stream:
+            return StreamingResponse(
+                self._stream_and_record(
+                    upstream_url=upstream_url,
+                    body_bytes=body_bytes,
+                    fwd_headers=fwd_headers,
+                    request_dict=request_dict,
+                ),
+                media_type="text/event-stream",
+            )
+
+        # Non-stream: same retry path as stream (open with stream=True), then aread() the body.
+        start = time.time()
+        async with httpx.AsyncClient(timeout=self._config.request_timeout) as client:
+            try:
+                resp = await _send_with_retry(
+                    client,
+                    upstream_url,
+                    body_bytes=body_bytes,
+                    headers=fwd_headers,
+                    retryable_codes=self._config.retryable_status_codes,
+                )
+            except httpx.TimeoutException as exc:
+                if self._recorder is not None:
+                    await self._recorder.record(
+                        request=request_dict,
+                        response=None,
+                        status="failure",
+                        start_time=start,
+                        end_time=time.time(),
+                        error=f"timeout: {exc}",
+                    )
+                raise HTTPException(status_code=504, detail=f"Upstream timed out: {exc}")
+            except httpx.RequestError as exc:
+                if self._recorder is not None:
+                    await self._recorder.record(
+                        request=request_dict,
+                        response=None,
+                        status="failure",
+                        start_time=start,
+                        end_time=time.time(),
+                        error=f"{type(exc).__name__}: {exc}",
+                    )
+                raise HTTPException(status_code=502, detail=f"Upstream request failed: {exc}")
+
+            try:
+                response_bytes = await resp.aread()
+                status_code = resp.status_code
+                content_type = resp.headers.get("content-type", "application/json")
+            finally:
+                await resp.aclose()
+
+        response_text = response_bytes.decode("utf-8", errors="replace")
+        response_dict: dict | None = None
+        try:
+            parsed = json.loads(response_text) if response_text else None
+            if isinstance(parsed, dict):
+                response_dict = parsed
+        except json.JSONDecodeError:
+            pass
+
+        if self._recorder is not None:
+            await self._recorder.record(
+                request=request_dict,
+                response=response_dict,
+                status="success" if status_code < 400 else "failure",
+                start_time=start,
+                end_time=time.time(),
+                error=None if status_code < 400 else f"upstream_status={status_code}",
+            )
+
+        # Forward bytes verbatim — preserves any provider-specific fields untouched.
+        return Response(content=response_bytes, status_code=status_code, media_type=content_type)
+
+    async def _stream_and_record(
+        self,
+        *,
+        upstream_url: str,
+        body_bytes: bytes,
+        fwd_headers: dict[str, str],
+        request_dict: dict[str, Any],
+    ) -> AsyncIterator[bytes]:
+        """SSE bytes are forwarded verbatim; chunks are parsed in parallel and
+        aggregated into the final ChatCompletion that the recorder writes to JSONL.
+
+        Retry on connection errors and whitelisted statuses happens BEFORE any byte
+        is yielded; mid-stream connection drops are not retried (would corrupt the
+        client transmission)."""
+        # openai SDK is used purely as a stream-aggregation parser — keep the import
+        # local so module load doesn't pull it in for callers that never stream.
+        from openai.lib.streaming.chat import ChatCompletionStreamState
+        from openai.types.chat import ChatCompletionChunk
+
+        state = ChatCompletionStreamState()
+        start = time.time()
+        parse_buffer = b""
+        upstream_status = 0
+
+        async with httpx.AsyncClient(timeout=self._config.request_timeout) as client:
+            try:
+                resp = await _send_with_retry(
+                    client,
+                    upstream_url,
+                    body_bytes=body_bytes,
+                    headers=fwd_headers,
+                    retryable_codes=self._config.retryable_status_codes,
+                )
+            except httpx.RequestError as exc:
+                if self._recorder is not None:
+                    await self._recorder.record(
+                        request=request_dict,
+                        response=None,
+                        status="failure",
+                        start_time=start,
+                        end_time=time.time(),
+                        error=f"{type(exc).__name__}: {exc}",
+                    )
+                return
+
+            try:
+                upstream_status = resp.status_code
+                async for chunk in resp.aiter_bytes():
+                    yield chunk
+                    chunk_dicts, parse_buffer = parse_sse_data_chunks(parse_buffer + chunk)
+                    for chunk_dict in chunk_dicts:
+                        try:
+                            state.handle_chunk(ChatCompletionChunk.model_validate(chunk_dict))
+                        except Exception as exc:  # parser error: forward continues, traj will be partial
+                            logger.debug(f"[record] chunk parse failed (forward continues): {exc}")
+            except httpx.RequestError as exc:
+                # Connection died mid-stream — bytes already sent reach the client;
+                # record what we got and return.
+                if self._recorder is not None:
+                    await self._recorder.record(
+                        request=request_dict,
+                        response=None,
+                        status="failure",
+                        start_time=start,
+                        end_time=time.time(),
+                        error=f"{type(exc).__name__}: {exc}",
+                    )
+                return
+            finally:
+                await resp.aclose()
+
+        if self._recorder is None:
+            return
+
+        status = "success" if upstream_status < 400 else "failure"
+        final_dict: dict | None = None
+        if status == "success":
+            try:
+                final_dict = state.get_final_completion().model_dump()
+            except Exception as exc:
+                logger.warning(f"[record] stream aggregation failed: {exc}")
+
+        await self._recorder.record(
+            request=request_dict,
+            response=final_dict,
+            status=status,
+            start_time=start,
+            end_time=time.time(),
+            error=None if status == "success" else f"upstream_status={upstream_status}",
         )
 
-    return base_url.rstrip("/")
+
+CompletionBackend = ReplayBackend | ForwardBackend
+
+
+def _get_backend(request: Request) -> CompletionBackend:
+    """Typed accessor for the backend attached at startup by ``_configure_proxy_integrations``."""
+    return request.app.state.backend
 
 
 @proxy_router.post("/v1/chat/completions")
-@record_traj
-async def chat_completions(body: dict[str, Any], request: Request):
-    """
-    OpenAI-compatible chat completions proxy endpoint.
-    Handles routing, header transparent forwarding, and automatic retries.
-    """
-    config = request.app.state.model_service_config
-
-    # Step 1: Model Routing
-    model_name = body.get("model", "")
-    base_url = get_base_url(model_name, config)
-    target_url = f"{base_url}/chat/completions"
-    logger.info(f"Routing model '{model_name}' to URL: {target_url}")
-
-    # Step 2: Header Cleaning
-    # Preserve 'Authorization' for authentication while removing hop-by-hop transport headers.
-    forwarded_headers = {}
-    for key, value in request.headers.items():
-        if key.lower() in ["host", "content-length", "content-type", "transfer-encoding"]:
-            continue
-        forwarded_headers[key] = value
-
-    # Step 3: Strategy Enforcement
-    # Force non-streaming mode for the MVP phase to ensure stability.
-    if body.get("stream") is True:
-        raise HTTPException(
-            status_code=400,
-            detail="Streaming requests (stream=True) are not supported in the current version. Please set stream=False or omit the stream parameter.",
-        )
-    body["stream"] = False
+async def chat_completions(request: Request):
+    """OpenAI-compatible chat completions proxy endpoint.
 
+    Reads the body as raw bytes (no parsing on the forward path) and delegates
+    to the backend attached at startup (replay or forward).
+    """
+    body_bytes = await request.body()
     try:
-        # Step 4: Execute Request with Retry Logic
-        response = await perform_llm_request(target_url, body, forwarded_headers, config)
-        return JSONResponse(status_code=response.status_code, content=response.json())
-
-    except httpx.HTTPStatusError as e:
-        # Forward the raw backend error message to the client.
-        # This allows the Agent-side logic to detect keywords like 'context length exceeded'
-        # or 'content violation' and raise appropriate exceptions.
-        error_text = e.response.text if e.response else "No error details"
-        status_code = e.response.status_code if e.response else 502
-        logger.error(f"Final failure after retries. Status: {status_code}, Response: {error_text}")
-        return JSONResponse(
-            status_code=status_code,
-            content={
-                "error": {
-                    "message": f"LLM backend error: {error_text}",
-                    "type": "proxy_retry_failed",
-                    "code": status_code,
-                }
-            },
-        )
-    except Exception as e:
-        logger.error(f"Unexpected proxy error: {str(e)}")
-        # Raise standard 500 for non-HTTP related errors or system errors
-        raise HTTPException(status_code=500, detail=str(e))
+        request_dict = json.loads(body_bytes) if body_bytes else {}
+    except json.JSONDecodeError:
+        raise HTTPException(status_code=400, detail="Request body is not valid JSON.")
+    if not isinstance(request_dict, dict):
+        raise HTTPException(status_code=400, detail="Request body must be a JSON object.")
+
+    model_name = request_dict.get("model", "")
+    is_stream = bool(request_dict.get("stream"))
+    fwd_headers = _filter_headers(request.headers)
+
+    backend = _get_backend(request)
+    return await backend.serve(
+        model_name=model_name,
+        is_stream=is_stream,
+        body_bytes=body_bytes,
+        fwd_headers=fwd_headers,
+        request_dict=request_dict,
+    )
diff --git a/rock/sdk/model/server/config.py b/rock/sdk/model/server/config.py
index 2c96992b5c..e734c29878 100644
--- a/rock/sdk/model/server/config.py
+++ b/rock/sdk/model/server/config.py
@@ -1,7 +1,7 @@
 from pathlib import Path
 
 import yaml
-from pydantic import BaseModel, Field
+from pydantic import BaseModel, ConfigDict, Field, model_validator
 
 from rock import env_vars
 
@@ -27,6 +27,10 @@
 class ModelServiceConfig(BaseModel):
     """Configuration for the LLM Model Service."""
 
+    # validate_assignment=True so the recording/replay mutex below also fires when
+    # CLI overrides are applied field-by-field (not only at construction time).
+    model_config = ConfigDict(validate_assignment=True)
+
     host: str = "0.0.0.0"
     """Server host address."""
 
@@ -51,6 +55,23 @@ class ModelServiceConfig(BaseModel):
     request_timeout: int = Field(default=120)
     """Request timeout in seconds."""
 
+    recording_file: str | None = Field(default=None)
+    """Recording mode output: where ForwardBackend writes the trajectory JSONL.
+    None → uses TRAJ_FILE (LOG_DIR/LLMTraj.jsonl)."""
+
+    replay_file: str | None = Field(default=None)
+    """Replay mode input: a .jsonl trajectory file. When set, ReplayBackend serves
+    requests from recorded responses instead of calling a real upstream."""
+
+    @model_validator(mode="after")
+    def _recording_replay_mutually_exclusive(self):
+        if self.recording_file and self.replay_file:
+            raise ValueError(
+                "recording_file and replay_file are mutually exclusive — "
+                "set one (recording mode) or the other (replay mode), not both."
+            )
+        return self
+
     @classmethod
     def from_file(cls, config_path: str | None = None):
         """
diff --git a/rock/sdk/model/server/main.py b/rock/sdk/model/server/main.py
index 7f8dabebe2..89e87ac0f9 100644
--- a/rock/sdk/model/server/main.py
+++ b/rock/sdk/model/server/main.py
@@ -11,7 +11,7 @@
 from rock.logger import init_logger
 from rock.sdk.model.server.api.local import init_local_api, local_router
 from rock.sdk.model.server.api.proxy import proxy_router
-from rock.sdk.model.server.config import ModelServiceConfig
+from rock.sdk.model.server.config import TRAJ_FILE, ModelServiceConfig
 
 # Configure logging
 logger = init_logger(__name__)
@@ -52,6 +52,33 @@ async def global_exception_handler(request, exc):
     return app
 
 
+def _configure_proxy_integrations(app: FastAPI, config: ModelServiceConfig) -> None:
+    """Attach the appropriate backend to ``app.state.backend``.
+
+    - Replay mode (``replay_file`` set): ``ReplayBackend`` wrapping a
+      ``SequentialCursor``; no recorder — replaying back into the source file
+      would corrupt it.
+    - Forward mode (default): ``ForwardBackend`` with a ``TrajectoryRecorder``
+      writing to ``recording_file`` (or ``TRAJ_FILE`` if unset).
+    """
+    from rock.sdk.model.server.api.proxy import ForwardBackend, ReplayBackend
+
+    if config.replay_file:
+        from rock.sdk.model.server.traj import SequentialCursor
+
+        cursor = SequentialCursor.load(config.replay_file)
+        app.state.backend = ReplayBackend(cursor)
+        logger.info(f"replay backend attached, replay_file={config.replay_file}")
+        return
+
+    from rock.sdk.model.server.traj import TrajectoryRecorder
+
+    recording_path = config.recording_file or TRAJ_FILE
+    recorder = TrajectoryRecorder(traj_file=recording_path)
+    app.state.backend = ForwardBackend(config, recorder=recorder)
+    logger.info(f"forward backend attached, recording_file={recording_path}")
+
+
 def main(
     model_servie_type: str,
     config: ModelServiceConfig,
@@ -63,6 +90,7 @@ def main(
         asyncio.run(init_local_api())
         app.include_router(local_router, prefix="", tags=["local"])
     else:
+        _configure_proxy_integrations(app, config)
         app.include_router(proxy_router, prefix="", tags=["proxy"])
 
     logger.info(f"Starting LLM Service on {config.host}:{config.port}, type: {model_servie_type}")
@@ -100,6 +128,12 @@ def create_config_from_args(args) -> ModelServiceConfig:
     if args.request_timeout:
         config.request_timeout = args.request_timeout
         logger.info(f"request_timeout set from command line: {args.request_timeout}s")
+    if args.recording_file:
+        config.recording_file = args.recording_file
+        logger.info(f"recording_file set from command line: {args.recording_file}")
+    if args.replay_file:
+        config.replay_file = args.replay_file
+        logger.info(f"replay mode enabled via --replay-file: {args.replay_file}")
 
     return config
 
@@ -142,6 +176,18 @@ def create_config_from_args(args) -> ModelServiceConfig:
     parser.add_argument(
         "--request-timeout", type=int, default=None, help="Request timeout in seconds. Overrides config file."
     )
+    parser.add_argument(
+        "--recording-file",
+        type=str,
+        default=None,
+        help="Forward mode: where to write the trajectory JSONL. Defaults to TRAJ_FILE.",
+    )
+    parser.add_argument(
+        "--replay-file",
+        type=str,
+        default=None,
+        help="Replay mode: path to a recorded .jsonl traj file. Disables real LLM upstreams.",
+    )
     args = parser.parse_args()
 
     config = create_config_from_args(args)
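
Reviewer note: the backend selection in `_configure_proxy_integrations` means that when both `replay_file` and `recording_file` are set, replay wins and `recording_file` is ignored. The sketch below isolates that precedence rule. `Cfg`, `select_backend`, and the `default_traj` value are hypothetical stand-ins for illustration, not names from this PR.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Cfg:
    """Hypothetical stand-in for ModelServiceConfig (only the two new fields)."""

    recording_file: str | None = None
    replay_file: str | None = None


def select_backend(cfg: Cfg, default_traj: str = "LLMTraj.jsonl") -> tuple[str, str]:
    """Mirror the branch order in the diff: replay is checked first and returns early."""
    if cfg.replay_file:
        # Replay mode: no recorder is attached, so recording_file is never consulted.
        return ("replay", cfg.replay_file)
    # Forward mode: fall back to the default trajectory path when unset.
    return ("forward", cfg.recording_file or default_traj)


both = select_backend(Cfg(recording_file="out.jsonl", replay_file="in.jsonl"))
neither = select_backend(Cfg())
explicit = select_backend(Cfg(recording_file="out.jsonl"))
```

If silently ignoring `--recording-file` in replay mode is not intended, rejecting the combination at argument-parsing time would be one alternative.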
diff --git a/rock/sdk/model/server/sse.py b/rock/sdk/model/server/sse.py
new file mode 100644
index 0000000000..f1cca034e6
--- /dev/null
+++ b/rock/sdk/model/server/sse.py
@@ -0,0 +1,99 @@
+"""SSE codec utilities for the chat/completions proxy.
+
+Three pure helpers, no openai/litellm dependencies:
+
+- :func:`parse_sse_data_chunks` — incremental SSE byte stream → list of decoded
+  ``data:`` payload dicts (used by the forward path to feed chunks into the
+  stream-state aggregator while bytes pass through verbatim to the client).
+- :func:`completion_to_chunk_dict` — convert a non-streaming ``chat.completion``
+  response into a single ``chat.completion.chunk`` dict, by renaming
+  ``message`` → ``delta``. Used by the replay path's streaming output.
+- :func:`encode_sse_event` — encode a payload dict as ``data: <json>\\n\\n``
+  bytes (one SSE event).
+"""
+
+from __future__ import annotations
+
+import json
+import time
+import uuid
+from typing import Final
+
+# Terminal SSE event sent at the end of a chat/completions stream.
+SSE_DONE: Final[bytes] = b"data: [DONE]\n\n"
+
+
+def parse_sse_data_chunks(buffer: bytes) -> tuple[list[dict], bytes]:
+    """Extract complete SSE events from a (possibly partial) byte buffer.
+
+    Returns ``(chunks, leftover)``: the parsed ``data:`` JSON payload dicts and
+    the bytes that did not yet form a complete event (``\\n\\n``-terminated).
+
+    - ``data: [DONE]`` is skipped (terminal marker, has no JSON payload).
+    - Lines that don't start with ``data:`` (``event:`` / ``id:`` / blank)
+      are ignored.
+    - Malformed JSON in a ``data:`` line is silently skipped; nothing is
+      returned or logged for it, so callers cannot observe dropped events.
+
+    Caller pattern::
+
+        chunks, buffer = parse_sse_data_chunks(buffer + new_bytes)
+        for chunk_dict in chunks:
+            ... feed to aggregator, etc ...
+    """
+    chunks: list[dict] = []
+    while b"\n\n" in buffer:
+        event, buffer = buffer.split(b"\n\n", 1)
+        for raw_line in event.split(b"\n"):
+            line = raw_line.decode("utf-8", errors="replace").strip()
+            if not line.startswith("data:"):
+                continue
+            payload = line[len("data:") :].strip()
+            if not payload or payload == "[DONE]":
+                continue
+            try:
+                chunks.append(json.loads(payload))
+            except json.JSONDecodeError:
+                continue
+    return chunks, buffer
+
+
+def completion_to_chunk_dict(response: dict, *, model: str) -> dict:
+    """Convert a recorded ``chat.completion`` dict into a single
+    ``chat.completion.chunk`` dict, suitable for re-streaming.
+
+    Only ``message`` → ``delta`` is renamed; every other field (including
+    provider-specific extras like ``reasoning_content`` inside the message)
+    flows through unchanged. ``id`` / ``created`` are synthesized when missing.
+
+    ``tool_calls`` items get a positional ``index`` injected if missing — the
+    OpenAI streaming spec requires it on chunk deltas (a recorded non-stream
+    ``message.tool_calls`` carries no ``index``, but downstream stream parsers
+    e.g. the openai SDK will reject the chunk without one).
+    """
+    choices_in = response.get("choices") or []
+    choices_out = []
+    for choice in choices_in:
+        delta = dict(choice.get("message") or {})
+        if "tool_calls" in delta and delta["tool_calls"]:
+            delta["tool_calls"] = [{"index": tc.get("index", i), **tc} for i, tc in enumerate(delta["tool_calls"])]
+        choices_out.append(
+            {
+                "index": choice.get("index", 0),
+                "delta": delta,
+                "finish_reason": choice.get("finish_reason"),
+                "logprobs": choice.get("logprobs"),
+            }
+        )
+    return {
+        "id": response.get("id") or f"chatcmpl-{uuid.uuid4()}",
+        "object": "chat.completion.chunk",
+        "created": response.get("created") or int(time.time()),
+        "model": response.get("model") or model,
+        "choices": choices_out,
+    }
+
+
+def encode_sse_event(data: dict) -> bytes:
+    """Encode a JSON payload as one SSE ``data:`` event (terminated by ``\\n\\n``)."""
+    return f"data: {json.dumps(data, ensure_ascii=False)}\n\n".encode()
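
Reviewer note: the caller pattern in the `parse_sse_data_chunks` docstring relies on carrying the leftover bytes into the next read. The sketch below shows that behaviour when an event is split in the middle of its JSON payload. The function body is a standalone copy of the helper so the example runs without the `rock` package; the `reads` list is made-up sample data.

```python
import json


def parse_sse_data_chunks(buffer: bytes) -> tuple[list[dict], bytes]:
    # Standalone copy of the helper from sse.py, kept identical in logic.
    chunks: list[dict] = []
    while b"\n\n" in buffer:
        event, buffer = buffer.split(b"\n\n", 1)
        for raw_line in event.split(b"\n"):
            line = raw_line.decode("utf-8", errors="replace").strip()
            if not line.startswith("data:"):
                continue
            payload = line[len("data:"):].strip()
            if not payload or payload == "[DONE]":
                continue
            try:
                chunks.append(json.loads(payload))
            except json.JSONDecodeError:
                continue
    return chunks, buffer


# Hypothetical network reads: the first event is cut mid-JSON across two reads.
reads = [
    b'data: {"choices": [{"delta": {"con',
    b'tent": "he"}}]}\n\ndata: {"choices": [{"delta": {"content": "llo"}}]}\n\n',
    b"data: [DONE]\n\n",
]

buffer = b""
text_parts: list[str] = []
per_read_counts: list[int] = []
for new_bytes in reads:
    # Prepend the leftover from the previous read before parsing.
    chunks, buffer = parse_sse_data_chunks(buffer + new_bytes)
    per_read_counts.append(len(chunks))
    for chunk in chunks:
        text_parts.append(chunk["choices"][0]["delta"]["content"])

assembled = "".join(text_parts)
```

The first read yields no chunks because no `\n\n` terminator has arrived yet; both events surface on the second read. Note that the helper only splits on `\n\n`, so an upstream that terminates events with `\r\n\r\n` would not be split by this logic.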
diff --git a/rock/sdk/model/server/traj.py b/rock/sdk/model/server/traj.py
new file mode 100644
index 0000000000..e12c229c7f
--- /dev/null
+++ b/rock/sdk/model/server/traj.py
@@ -0,0 +1,156 @@
+"""Trajectory record + replay for the chat/completions proxy.
+
+Two halves around the same JSONL schema (one record per line):
+
+- :class:`TrajectoryRecorder` — invoked by the forward path after each upstream
+  call (success or failure). Appends a dict with ``model`` / ``stream`` /
+  ``status`` / ``response_time`` / ``start_time`` / ``end_time`` / ``request`` /
+  ``response`` / ``error``, and reports OTLP RT/count metrics. Stores responses verbatim
+  (provider-specific fields like ``reasoning_content`` survive); for streaming
+  calls ``response`` is the aggregated final ChatCompletion produced by
+  ``ChatCompletionStreamState.get_final_completion().model_dump()``.
+
+- :class:`SequentialCursor` — loads a JSONL trajectory once at startup;
+  ``await cursor.next(expected_model=...)`` hands out the next record (full
+  payload dict) and advances. Going past the end raises
+  :class:`TrajectoryExhausted` so the proxy can return a clean 404.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+import os
+from pathlib import Path
+from typing import Any
+
+from rock.logger import init_logger
+from rock.sdk.model.server.utils import (
+    MODEL_SERVICE_REQUEST_COUNT,
+    MODEL_SERVICE_REQUEST_RT,
+    _get_or_create_metrics_monitor,
+)
+
+logger = init_logger(__name__)
+
+
+# ---------------------------------------------------------------------------
+# Recorder
+# ---------------------------------------------------------------------------
+
+
+class TrajectoryRecorder:
+    """Appends one JSONL line per chat/completions call and reports OTLP metrics."""
+
+    def __init__(self, traj_file: str | os.PathLike) -> None:
+        self.traj_file = Path(traj_file)
+        self.traj_file.parent.mkdir(parents=True, exist_ok=True)
+        self._lock = asyncio.Lock()
+        self._monitor = _get_or_create_metrics_monitor()
+
+    async def record(
+        self,
+        *,
+        request: dict[str, Any],
+        response: dict[str, Any] | None,
+        status: str,
+        start_time: float,
+        end_time: float,
+        error: str | None = None,
+    ) -> None:
+        rt_seconds = end_time - start_time
+        payload = {
+            "model": request.get("model"),
+            "stream": bool(request.get("stream")),
+            "status": status,
+            "response_time": rt_seconds,
+            "start_time": start_time,
+            "end_time": end_time,
+            "request": request,
+            "response": response,
+            "error": error,
+        }
+
+        line = json.dumps(payload, ensure_ascii=False, default=str) + "\n"
+        async with self._lock:
+            await asyncio.to_thread(self._write_line, line)
+
+        attrs = {
+            "type": "chat_completions",
+            "status": status,
+            "sandbox_id": os.getenv("ROCK_SANDBOX_ID", "unknown"),
+        }
+        self._monitor.record_gauge_by_name(MODEL_SERVICE_REQUEST_RT, rt_seconds * 1000.0, attributes=attrs)
+        self._monitor.record_counter_by_name(MODEL_SERVICE_REQUEST_COUNT, 1, attributes=attrs)
+
+    def _write_line(self, line: str) -> None:
+        with self.traj_file.open("a", encoding="utf-8") as f:
+            f.write(line)
+
+
+# ---------------------------------------------------------------------------
+# Replay cursor
+# ---------------------------------------------------------------------------
+
+
+class TrajectoryExhausted(Exception):
+    """Raised by ``SequentialCursor.next`` when all recorded steps have been served."""
+
+    def __init__(self, position: int, total: int) -> None:
+        super().__init__(f"trajectory exhausted at step {position} (total recorded steps={total})")
+        self.position = position
+        self.total = total
+
+
+class SequentialCursor:
+    """Hands out trajectory records one at a time, in recorded order."""
+
+    def __init__(self, records: list[dict]) -> None:
+        self.records = records
+        self._idx = 0
+        self._lock = asyncio.Lock()
+
+    @classmethod
+    def load(cls, path: str | os.PathLike) -> SequentialCursor:
+        path = Path(path)
+        if not path.is_file():
+            raise FileNotFoundError(f"traj file not found: {path}")
+
+        records: list[dict] = []
+        with path.open("r", encoding="utf-8") as fp:
+            for line in fp:
+                line = line.strip()
+                if not line:
+                    continue
+                records.append(json.loads(line))
+
+        logger.info(f"[traj-replay] loaded {len(records)} record(s) from {path}")
+        return cls(records)
+
+    async def next(self, expected_model: str | None = None) -> dict:
+        async with self._lock:
+            if self._idx >= len(self.records):
+                raise TrajectoryExhausted(position=self._idx, total=len(self.records))
+            record = self.records[self._idx]
+            self._idx += 1
+            current_idx = self._idx - 1
+
+        if expected_model:
+            recorded_model = record.get("model")
+            if recorded_model and recorded_model != expected_model:
+                logger.warning(
+                    f"[traj-replay] step {current_idx} model mismatch: "
+                    f"recorded={recorded_model!r} requested={expected_model!r}"
+                )
+        return record
+
+    def reset(self) -> None:
+        self._idx = 0
+
+    @property
+    def position(self) -> int:
+        return self._idx
+
+    @property
+    def total(self) -> int:
+        return len(self.records)
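
Reviewer note: a minimal sketch of the replay contract described in the `traj.py` module docstring, namely that records are served strictly in file order and the call after the last record raises. `MiniCursor` is a condensed, hypothetical stand-in for `SequentialCursor` (it omits the model-mismatch warning), and the record contents, including the `"success"` status value, are made-up sample data.

```python
import asyncio
import json
import tempfile
from pathlib import Path


class TrajectoryExhausted(Exception):
    """Stand-in for the exception of the same name in traj.py."""


class MiniCursor:
    """Condensed stand-in for SequentialCursor: load once, serve in order."""

    def __init__(self, records: list) -> None:
        self.records = records
        self._idx = 0
        self._lock = asyncio.Lock()

    @classmethod
    def load(cls, path) -> "MiniCursor":
        lines = Path(path).read_text(encoding="utf-8").splitlines()
        # Blank lines are skipped, as in the original loader.
        return cls([json.loads(line) for line in lines if line.strip()])

    async def next(self) -> dict:
        async with self._lock:
            if self._idx >= len(self.records):
                raise TrajectoryExhausted(f"exhausted at step {self._idx}")
            record = self.records[self._idx]
            self._idx += 1
            return record


async def replay_all(path) -> list:
    cursor = MiniCursor.load(path)
    served = []
    try:
        while True:
            served.append((await cursor.next())["response"]["id"])
    except TrajectoryExhausted:
        pass  # the proxy would map this to an HTTP error response
    return served


with tempfile.TemporaryDirectory() as tmp:
    traj = Path(tmp) / "LLMTraj.jsonl"
    rows = [
        {"model": "m", "status": "success", "response": {"id": f"chatcmpl-{i}"}}
        for i in range(3)
    ]
    traj.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")
    served_ids = asyncio.run(replay_all(traj))
```

Because the cursor is positional rather than keyed on request content, replay only reproduces a run faithfully when the agent issues its requests in the same order as during recording.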
diff --git a/rock/sdk/model/server/utils.py b/rock/sdk/model/server/utils.py
index 20ae8896dc..639ca3995b 100644
--- a/rock/sdk/model/server/utils.py
+++ b/rock/sdk/model/server/utils.py
@@ -38,7 +38,7 @@ def _write_traj(data: dict):
 
 
 def record_traj(func: Callable):
-    """Decorator to record chat completions input/output as traj."""
+    """Decorator to record chat completions input/output as traj (local mode only)."""
 
     @wraps(func)
     async def wrapper(*args, **kwargs):
diff --git a/rock/sdk/model/service.py b/rock/sdk/model/service.py
index b1b523ed27..24cd7ede38 100644
--- a/rock/sdk/model/service.py
+++ b/rock/sdk/model/service.py
@@ -17,6 +17,8 @@ def start_sandbox_service(
         proxy_base_url: str | None = None,
         retryable_status_codes: str | None = None,
         request_timeout: int | None = None,
+        recording_file: str | None = None,
+        replay_file: str | None = None,
     ) -> subprocess.Popen:
         """start sandbox service"""
         current_file = Path(__file__).resolve()
@@ -38,6 +40,10 @@ def start_sandbox_service(
             cmd.extend(["--retryable-status-codes", retryable_status_codes])
         if request_timeout:
             cmd.extend(["--request-timeout", str(request_timeout)])
+        if recording_file:
+            cmd.extend(["--recording-file", recording_file])
+        if replay_file:
+            cmd.extend(["--replay-file", replay_file])
         process = subprocess.Popen(cmd, cwd=str(service_dir))
         return process
 
@@ -51,6 +57,8 @@ async def start(
         proxy_base_url: str | None = None,
         retryable_status_codes: str | None = None,
         request_timeout: int | None = None,
+        recording_file: str | None = None,
+        replay_file: str | None = None,
     ) -> str:
         process = self.start_sandbox_service(
             model_service_type=model_service_type,
@@ -60,6 +68,8 @@ async def start(
             proxy_base_url=proxy_base_url,
             retryable_status_codes=retryable_status_codes,
             request_timeout=request_timeout,
+            recording_file=recording_file,
+            replay_file=replay_file,
         )
         pid = process.pid
 
diff --git a/tests/unit/cli/command/test_model_service.py b/tests/unit/cli/command/test_model_service.py
new file mode 100644
index 0000000000..86849c718b
--- /dev/null
+++ b/tests/unit/cli/command/test_model_service.py
@@ -0,0 +1,120 @@
+"""Unit tests for rock.cli.command.model_service.ModelServiceCommand.
+
+Drive the sub-parser end-to-end with argparse so the surface that users
+actually type at the terminal is what we exercise. ``ModelService.start`` is
+mocked — these tests assert wiring (argparse → handler → SDK call), not the
+subprocess command construction (covered separately in
+tests/unit/sdk/model/test_service.py).
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+from unittest.mock import AsyncMock
+
+import pytest
+
+from rock.cli.command.model_service import ModelServiceCommand
+
+
+def _build_parser() -> argparse.ArgumentParser:
+    """Top-level parser with `model-service` subcommand wired in, same as the CLI."""
+    top = argparse.ArgumentParser(prog="rock")
+    subparsers = top.add_subparsers(dest="command")
+    asyncio.run(ModelServiceCommand.add_parser_to(subparsers))
+    return top
+
+
+@pytest.fixture
+def isolate_pid_file(monkeypatch, tmp_path):
+    """Redirect PID dir/file into tmp so arun() doesn't touch ./data/cli/model."""
+    monkeypatch.setattr(ModelServiceCommand, "DEFAULT_MODEL_SERVICE_DIR", str(tmp_path))
+    monkeypatch.setattr(ModelServiceCommand, "DEFAULT_MODEL_SERVICE_PID_FILE", str(tmp_path / "pid.txt"))
+
+
+@pytest.fixture
+def fake_start(monkeypatch):
+    """Replace ModelService.start with an AsyncMock returning a fixed pid."""
+    mock = AsyncMock(return_value="12345")
+    monkeypatch.setattr("rock.cli.command.model_service.ModelService.start", mock)
+    return mock
+
+
+# ---------- argparse: the new flags must parse ----------
+
+
+def test_recording_file_flag_parses():
+    parser = _build_parser()
+    ns = parser.parse_args(["model-service", "start", "--type", "proxy", "--recording-file", "/tmp/out.jsonl"])
+    assert ns.recording_file == "/tmp/out.jsonl"
+    assert ns.replay_file is None
+
+
+def test_replay_file_flag_parses():
+    parser = _build_parser()
+    ns = parser.parse_args(["model-service", "start", "--type", "proxy", "--replay-file", "/tmp/in.jsonl"])
+    assert ns.replay_file == "/tmp/in.jsonl"
+    assert ns.recording_file is None
+
+
+def test_neither_flag_defaults_to_none():
+    parser = _build_parser()
+    ns = parser.parse_args(["model-service", "start", "--type", "proxy"])
+    assert ns.recording_file is None
+    assert ns.replay_file is None
+
+
+# ---------- handler: passes parsed args through to ModelService.start ----------
+
+
+def test_start_handler_forwards_recording_file(isolate_pid_file, fake_start):
+    parser = _build_parser()
+    ns = parser.parse_args(
+        [
+            "model-service",
+            "start",
+            "--type",
+            "proxy",
+            "--proxy-base-url",
+            "https://api.openai.com/v1",
+            "--recording-file",
+            "/tmp/out.jsonl",
+        ]
+    )
+    asyncio.run(ModelServiceCommand().arun(ns))
+
+    kwargs = fake_start.call_args.kwargs
+    assert kwargs["recording_file"] == "/tmp/out.jsonl"
+    assert kwargs["replay_file"] is None
+    assert kwargs["proxy_base_url"] == "https://api.openai.com/v1"
+    assert kwargs["model_service_type"] == "proxy"
+
+
+def test_start_handler_forwards_replay_file(isolate_pid_file, fake_start):
+    parser = _build_parser()
+    ns = parser.parse_args(
+        [
+            "model-service",
+            "start",
+            "--type",
+            "proxy",
+            "--replay-file",
+            "/tmp/in.jsonl",
+        ]
+    )
+    asyncio.run(ModelServiceCommand().arun(ns))
+
+    kwargs = fake_start.call_args.kwargs
+    assert kwargs["replay_file"] == "/tmp/in.jsonl"
+    assert kwargs["recording_file"] is None
+
+
+def test_start_handler_omits_both_when_unset(isolate_pid_file, fake_start):
+    parser = _build_parser()
+    ns = parser.parse_args(["model-service", "start", "--type", "proxy"])
+    asyncio.run(ModelServiceCommand().arun(ns))
+
+    kwargs = fake_start.call_args.kwargs
+    assert kwargs["recording_file"] is None
+    assert kwargs["replay_file"] is None
diff --git a/tests/unit/sdk/model/test_proxy.py b/tests/unit/sdk/model/test_proxy.py
index edce5584cb..345ea31775 100644
--- a/tests/unit/sdk/model/test_proxy.py
+++ b/tests/unit/sdk/model/test_proxy.py
@@ -1,14 +1,24 @@
-from unittest.mock import AsyncMock, MagicMock, patch
+"""Tests for the chat/completions proxy.
+
+Forward path is exercised by pointing the proxy at an httpx ``MockTransport``
+(no real network). Replay path is exercised end-to-end via the FastAPI test
+client. Config / CLI / metrics-singleton tests round out the file.
+"""
+
+import argparse
+import json
+from unittest.mock import MagicMock, patch
 
 import httpx
 import pytest
 import yaml
-from fastapi import FastAPI, Request
-from httpx import ASGITransport, AsyncClient, HTTPStatusError, Request, Response
+from fastapi import FastAPI
+from httpx import ASGITransport, AsyncClient
 
-from rock.sdk.model.server.api.proxy import perform_llm_request, proxy_router
+from rock.sdk.model.server.api.proxy import proxy_router
 from rock.sdk.model.server.config import ModelServiceConfig
 from rock.sdk.model.server.main import create_config_from_args, lifespan
+from rock.sdk.model.server.traj import SequentialCursor
 from rock.sdk.model.server.utils import (
     MODEL_SERVICE_REQUEST_COUNT,
     MODEL_SERVICE_REQUEST_RT,
@@ -16,361 +26,568 @@
     record_traj,
 )
 
-# Initialize a temporary FastAPI application for testing the router
-test_app = FastAPI()
-test_app.include_router(proxy_router)
 
-mock_config = ModelServiceConfig()
-test_app.state.model_service_config = mock_config
+def _build_app(config: ModelServiceConfig, *, replay_cursor=None, recorder=None) -> FastAPI:
+    """Build a FastAPI app with the proxy router and the given config attached."""
+    from rock.sdk.model.server.api.proxy import ForwardBackend, ReplayBackend
+
+    app = FastAPI()
+    app.state.model_service_config = config
+    if replay_cursor is not None:
+        app.state.backend = ReplayBackend(replay_cursor)
+    else:
+        app.state.backend = ForwardBackend(config, recorder=recorder)
+    app.include_router(proxy_router)
+    return app
+
+
+def _patch_httpx_with_handler(handler):
+    """Patch ``proxy.httpx.AsyncClient`` so each ``async with httpx.AsyncClient(...)``
+    returns a real client wrapping ``MockTransport(handler)``."""
+    real_client_cls = httpx.AsyncClient  # capture before patching kicks in
+    transport = httpx.MockTransport(handler)
+
+    def factory(*args, **kwargs):
+        kwargs.pop("timeout", None)  # transport supplies the response, no timeout needed
+        return real_client_cls(transport=transport, **kwargs)
+
+    return patch("rock.sdk.model.server.api.proxy.httpx.AsyncClient", side_effect=factory)
+
+
+def _success_response_json(*, model: str = "gpt-3.5-turbo", content: str = "hi") -> dict:
+    return {
+        "id": "chatcmpl-1",
+        "object": "chat.completion",
+        "created": 1234,
+        "model": model,
+        "choices": [
+            {
+                "index": 0,
+                "message": {"role": "assistant", "content": content},
+                "finish_reason": "stop",
+            }
+        ],
+        "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
+    }
+
+
+# ---------- Forward path: routing ----------
 
 
 @pytest.mark.asyncio
-async def test_chat_completions_routing_success():
-    """
-    Test the high-level routing logic.
-    """
-    patch_path = "rock.sdk.model.server.api.proxy.perform_llm_request"
-
-    with patch(patch_path, new_callable=AsyncMock) as mock_request:
-        mock_resp = MagicMock(spec=Response)
-        mock_resp.status_code = 200
-        mock_resp.json.return_value = {"id": "chat-123", "choices": []}
-        mock_request.return_value = mock_resp
-
-        transport = ASGITransport(app=test_app)
+async def test_forward_routes_by_model_name_to_proxy_rules():
+    captured = {}
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        captured["url"] = str(request.url)
+        captured["body"] = json.loads(request.content)
+        return httpx.Response(200, json=_success_response_json())
+
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
         async with AsyncClient(transport=transport, base_url="http://test") as ac:
-            payload = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hello"}]}
-            response = await ac.post("/v1/chat/completions", json=payload)
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
 
-        assert response.status_code == 200
-        call_args = mock_request.call_args[0]
-        assert call_args[0] == "https://api.openai.com/v1/chat/completions"
-        assert mock_request.called
+    assert r.status_code == 200
+    assert captured["url"] == "https://api.openai.com/v1/chat/completions"
+    assert captured["body"]["model"] == "gpt-3.5-turbo"
 
 
 @pytest.mark.asyncio
-async def test_chat_completions_fallback_to_default_when_not_found():
-    """
-    Test that an unrecognized model name correctly falls back to the 'default' URL.
-    """
-    patch_path = "rock.sdk.model.server.api.proxy.perform_llm_request"
-
-    with patch(patch_path, new_callable=AsyncMock) as mock_request:
-        mock_resp = MagicMock(spec=Response)
-        mock_resp.status_code = 200
-        mock_resp.json.return_value = {"id": "chat-fallback", "choices": []}
-        mock_request.return_value = mock_resp
-
-        config = test_app.state.model_service_config
-        default_base_url = config.proxy_rules["default"].rstrip("/")
-        expected_target_url = f"{default_base_url}/chat/completions"
-
-        transport = ASGITransport(app=test_app)
+async def test_forward_falls_back_to_default_for_unknown_model():
+    captured = {}
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        captured["url"] = str(request.url)
+        return httpx.Response(200, json=_success_response_json(model="some-random"))
+
+    config = ModelServiceConfig()
+    expected_default = config.proxy_rules["default"].rstrip("/") + "/chat/completions"
+    app = _build_app(config)
+
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
         async with AsyncClient(transport=transport, base_url="http://test") as ac:
-            payload = {
-                "model": "some-random-unsupported-model",  # This model is NOT in proxy_rules
-                "messages": [{"role": "user", "content": "hello"}],
-            }
-            response = await ac.post("/v1/chat/completions", json=payload)
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "some-random", "messages": [{"role": "user", "content": "hi"}]},
+            )
+
+    assert r.status_code == 200
+    assert captured["url"] == expected_default
 
-        assert response.status_code == 200
 
-        # Verify that perform_llm_request was called with the DEFAULT URL
-        call_args = mock_request.call_args[0]
-        actual_url = call_args[0]
+@pytest.mark.asyncio
+async def test_forward_400_when_no_rule_and_no_default():
+    config = ModelServiceConfig()
+    config.proxy_rules = {}
+    app = _build_app(config)
+
+    transport = ASGITransport(app=app)
+    async with AsyncClient(transport=transport, base_url="http://test") as ac:
+        r = await ac.post(
+            "/v1/chat/completions",
+            json={"model": "any", "messages": [{"role": "user", "content": "hi"}]},
+        )
 
-        assert actual_url == expected_target_url
-        assert mock_request.called
+    assert r.status_code == 400
+    assert "not configured" in r.json()["detail"]
 
 
 @pytest.mark.asyncio
-async def test_chat_completions_routing_absolute_fail():
-    """
-    Test that both the specific model and the 'default' rule are missing.
-    """
-    empty_config = ModelServiceConfig()
-    empty_config.proxy_rules = {}
-
-    with patch.object(test_app.state, "model_service_config", empty_config):
-        transport = ASGITransport(app=test_app)
+async def test_forward_proxy_base_url_overrides_proxy_rules():
+    captured = {}
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        captured["url"] = str(request.url)
+        return httpx.Response(200, json=_success_response_json())
+
+    config = ModelServiceConfig()
+    config.proxy_base_url = "https://custom-endpoint.example.com/v1"
+    app = _build_app(config)
+
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
         async with AsyncClient(transport=transport, base_url="http://test") as ac:
-            payload = {"model": "any-model", "messages": [{"role": "user", "content": "hello"}]}
-            response = await ac.post("/v1/chat/completions", json=payload)
+            await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
 
-    assert response.status_code == 400
-    detail = response.json()["detail"]
-    assert "not configured" in detail
+    assert captured["url"] == "https://custom-endpoint.example.com/v1/chat/completions"
 
 
-@pytest.mark.asyncio
-async def test_perform_llm_request_retry_on_whitelist():
-    """
-    Test that the proxy retries when receiving a whitelisted error code.
-    """
-    client_post_path = "rock.sdk.model.server.api.proxy.http_client.post"
+# ---------- Forward path: byte passthrough ----------
 
-    # Patch asyncio.sleep inside the retry module to avoid actual waiting
-    with (
-        patch(client_post_path, new_callable=AsyncMock) as mock_post,
-        patch("rock.utils.retry.asyncio.sleep", return_value=None),
-    ):
-        # 1. Setup Failed Response (429)
-        resp_429 = MagicMock(spec=Response)
-        resp_429.status_code = 429
-        error_429 = HTTPStatusError("Rate Limited", request=MagicMock(spec=Request), response=resp_429)
 
-        # 2. Setup Success Response (200)
-        resp_200 = MagicMock(spec=Response)
-        resp_200.status_code = 200
-        resp_200.json.return_value = {"ok": True}
+@pytest.mark.asyncio
+async def test_forward_response_body_is_byte_for_byte_passthrough():
+    """Upstream's exact JSON bytes (incl. provider-specific fields) reach the client."""
+    upstream_payload = {
+        "id": "x",
+        "object": "chat.completion",
+        "model": "glm-5",
+        "choices": [
+            {
+                "index": 0,
+                "message": {"role": "assistant", "content": "hi", "reasoning_content": "...think..."},
+                "finish_reason": "stop",
+            }
+        ],
+        "provider_specific_fields": {"vendor_field": "vendor_value"},
+    }
 
-        # Sequence: Fail with 429, then Succeed with 200
-        mock_post.side_effect = [error_429, resp_200]
+    def handler(request: httpx.Request) -> httpx.Response:
+        return httpx.Response(200, json=upstream_payload)
 
-        result = await perform_llm_request("http://fake.url", {}, {}, mock_config)
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "glm-5", "messages": [{"role": "user", "content": "hi"}]},
+            )
 
-        assert result.status_code == 200
-        assert mock_post.call_count == 2
+    body = r.json()
+    assert body["choices"][0]["message"]["reasoning_content"] == "...think..."
+    assert body["provider_specific_fields"] == {"vendor_field": "vendor_value"}
 
 
 @pytest.mark.asyncio
-async def test_perform_llm_request_no_retry_on_non_whitelist():
-    """
-    Test that the proxy DOES NOT retry for non-retryable codes (e.g., 401).
-    It should return the error response immediately.
-    """
-    client_post_path = "rock.sdk.model.server.api.proxy.http_client.post"
+async def test_forward_propagates_upstream_status_and_body_on_4xx():
+    """Upstream 4xx is forwarded verbatim — proxy doesn't re-shape error JSON."""
+    err_body = {"error": {"message": "context length exceeded", "type": "BadRequestError"}}
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        return httpx.Response(400, json=err_body)
+
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
+
+    assert r.status_code == 400
+    assert r.json() == err_body
 
-    with patch(client_post_path, new_callable=AsyncMock) as mock_post:
-        # Mock 401 Unauthorized (NOT in the retry whitelist)
-        resp_401 = MagicMock(spec=Response)
-        resp_401.status_code = 401
-        resp_401.json.return_value = {"error": "Invalid API Key"}
 
-        # The function should return this response directly
-        mock_post.return_value = resp_401
+@pytest.mark.asyncio
+async def test_forward_authorization_header_passes_through():
+    captured = {}
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        captured["headers"] = dict(request.headers)
+        return httpx.Response(200, json=_success_response_json())
 
-        result = await perform_llm_request("http://fake.url", {}, {}, mock_config)
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+                headers={"Authorization": "Bearer sk-abc", "X-Trace": "t1"},
+            )
 
-        assert result.status_code == 401
-        # Call count must be 1, meaning no retries were attempted
-        assert mock_post.call_count == 1
+    # Authorization and custom X-* headers are forwarded verbatim. We don't assert
+    # on framing headers (connection / content-length / accept-encoding) because
+    # httpx rebuilds them itself for the outgoing request.
+    # Normalize header names to lowercase so the assertions are case-insensitive.
+    fwd = {k.lower(): v for k, v in captured["headers"].items()}
+    assert fwd.get("authorization") == "Bearer sk-abc"
+    assert fwd.get("x-trace") == "t1"
 
 
 @pytest.mark.asyncio
-async def test_perform_llm_request_network_timeout_retry():
-    """
-    Test that network-level exceptions (like Timeout) also trigger retries.
-    """
-    client_post_path = "rock.sdk.model.server.api.proxy.http_client.post"
+async def test_forward_502_on_upstream_connection_failure(monkeypatch):
+    """ConnectError → 502. Retry disabled here to keep the test fast."""
+    monkeypatch.setattr("rock.sdk.model.server.api.proxy._RETRY_MAX_ATTEMPTS", 1)
 
-    with (
-        patch(client_post_path, new_callable=AsyncMock) as mock_post,
-        patch("rock.utils.retry.asyncio.sleep", return_value=None),
-    ):
-        resp_200 = MagicMock(spec=Response)
-        resp_200.status_code = 200
+    def handler(request: httpx.Request) -> httpx.Response:
+        raise httpx.ConnectError("upstream is down")
+
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
 
-        mock_post.side_effect = [httpx.TimeoutException("Network Timeout"), resp_200]
+    assert r.status_code == 502
 
-        result = await perform_llm_request("http://fake.url", {}, {}, mock_config)
 
-        assert result.status_code == 200
-        assert mock_post.call_count == 2
+# ---------- Forward path: retry ----------
 
 
 @pytest.mark.asyncio
-async def test_lifespan_initialization_with_config(tmp_path):
-    """
-    Test that the application correctly initializes and overrides defaults
-    when a valid configuration file path is provided.
-    """
-    conf_file = tmp_path / "proxy.yml"
-    conf_file.write_text(yaml.dump({"proxy_rules": {"my-model": "http://custom-url"}, "request_timeout": 50}))
+async def test_forward_retries_on_retryable_status_then_succeeds(monkeypatch):
+    """429 on the first two attempts is retried; the third attempt's 200 is returned to the client."""
+    monkeypatch.setattr("rock.sdk.model.server.api.proxy._RETRY_DELAY_SECONDS", 0.0)
 
-    # Initialize App and load config from file
-    config = ModelServiceConfig.from_file(str(conf_file))
-    app = FastAPI(lifespan=lambda app: lifespan(app, config))
+    attempts = []
 
-    async with lifespan(app, config):
-        app_config = app.state.model_service_config
-        # Verify that the config reflects file content instead of defaults
-        assert app_config.proxy_rules["my-model"] == "http://custom-url"
-        assert app_config.request_timeout == 50
-        assert "gpt-3.5-turbo" not in app_config.proxy_rules
+    def handler(request: httpx.Request) -> httpx.Response:
+        attempts.append(1)
+        if len(attempts) < 3:
+            return httpx.Response(429, json={"error": "rate limited"})
+        return httpx.Response(200, json=_success_response_json(content="finally"))
+
+    app = _build_app(ModelServiceConfig())  # default retryable_status_codes = [429, 500]
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
+
+    assert r.status_code == 200
+    assert r.json()["choices"][0]["message"]["content"] == "finally"
+    assert len(attempts) == 3
 
 
 @pytest.mark.asyncio
-async def test_lifespan_initialization_no_config():
-    """
-    Test that the application initializes with default ModelServiceConfig
-    settings when no configuration file path is provided.
-    """
-    config = ModelServiceConfig()
-    app = FastAPI(lifespan=lambda app: lifespan(app, config))
+async def test_forward_returns_last_response_when_retries_exhausted(monkeypatch):
+    """All attempts return 429 → the final 429 body+status is forwarded verbatim."""
+    monkeypatch.setattr("rock.sdk.model.server.api.proxy._RETRY_MAX_ATTEMPTS", 3)
+    monkeypatch.setattr("rock.sdk.model.server.api.proxy._RETRY_DELAY_SECONDS", 0.0)
 
-    async with lifespan(app, config):
-        app_config = app.state.model_service_config
-        # Verify that default rules (e.g., 'gpt-3.5-turbo') are loaded
-        assert "gpt-3.5-turbo" in app_config.proxy_rules
-        assert app_config.request_timeout == 120
+    attempts = []
 
+    def handler(request: httpx.Request) -> httpx.Response:
+        attempts.append(1)
+        return httpx.Response(429, json={"error": "still rate limited"})
 
-@pytest.mark.asyncio
-async def test_lifespan_invalid_config_path():
-    """
-    Test that providing a non-existent configuration file path causes
-    ModelServiceConfig.from_file to raise a FileNotFoundError.
-    """
-    # Expect FileNotFoundError when loading from non-existent file
-    with pytest.raises(FileNotFoundError):
-        ModelServiceConfig.from_file("/tmp/non_existent_file.yml")
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
+
+    assert r.status_code == 429
+    assert r.json() == {"error": "still rate limited"}
+    assert len(attempts) == 3
 
 
 @pytest.mark.asyncio
-async def test_proxy_base_url_overrides_proxy_rules(tmp_path):
-    """
-    Test that when proxy_base_url is set, all requests are forwarded to that URL,
-    bypassing proxy_rules entirely.
-    """
-    config = ModelServiceConfig()
-    config.proxy_base_url = "https://custom-endpoint.example.com/v1"
+async def test_forward_does_not_retry_non_whitelisted_status(monkeypatch):
+    """400 is not in retryable_status_codes → forwarded immediately, no retry."""
+    monkeypatch.setattr("rock.sdk.model.server.api.proxy._RETRY_DELAY_SECONDS", 0.0)
 
-    test_app = FastAPI()
-    test_app.state.model_service_config = config
-    test_app.include_router(proxy_router)
+    attempts = []
 
-    with patch("rock.sdk.model.server.api.proxy.perform_llm_request", new_callable=AsyncMock) as mock_request:
-        mock_resp = MagicMock(spec=Response)
-        mock_resp.status_code = 200
-        mock_resp.json.return_value = {"id": "chat-123", "choices": []}
-        mock_request.return_value = mock_resp
+    def handler(request: httpx.Request) -> httpx.Response:
+        attempts.append(1)
+        return httpx.Response(400, json={"error": "bad request"})
 
-        transport = ASGITransport(app=test_app)
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
         async with AsyncClient(transport=transport, base_url="http://test") as ac:
-            # Even when requesting gpt-3.5-turbo, should forward to proxy_base_url
-            payload = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hello"}]}
-            response = await ac.post("/v1/chat/completions", json=payload)
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
 
-        assert response.status_code == 200
-        # Verify request was sent to proxy_base_url
-        call_args = mock_request.call_args[0]
-        assert call_args[0] == "https://custom-endpoint.example.com/v1/chat/completions"
+    assert r.status_code == 400
+    assert len(attempts) == 1
 
 
 @pytest.mark.asyncio
-async def test_config_loads_host_and_port_from_file(tmp_path):
-    """
-    Test that ModelServiceConfig correctly loads host and port from config file.
-    """
-    conf_file = tmp_path / "proxy.yml"
-    conf_file.write_text(
-        yaml.dump({"host": "127.0.0.1", "port": 9000, "proxy_rules": {"my-model": "http://my-backend"}})
+async def test_forward_stream_retries_on_retryable_status_then_succeeds(monkeypatch):
+    """Streaming: 500 on first attempt, then 200 SSE on second — client sees only the 200 body."""
+    monkeypatch.setattr("rock.sdk.model.server.api.proxy._RETRY_DELAY_SECONDS", 0.0)
+
+    attempts = []
+    sse_body = (
+        b'data: {"id":"x","object":"chat.completion.chunk","choices":[{"index":0,'
+        b'"delta":{"content":"hello"},"finish_reason":null}]}\n\n'
+        b"data: [DONE]\n\n"
     )
 
-    config = ModelServiceConfig.from_file(str(conf_file))
+    def handler(request: httpx.Request) -> httpx.Response:
+        attempts.append(1)
+        if len(attempts) < 2:
+            return httpx.Response(500, json={"error": "internal"})
+        return httpx.Response(200, content=sse_body, headers={"content-type": "text/event-stream"})
 
-    assert config.host == "127.0.0.1"
-    assert config.port == 9000
-    assert config.proxy_rules["my-model"] == "http://my-backend"
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "stream": True, "messages": [{"role": "user", "content": "hi"}]},
+            )
 
+    body = r.text
+    assert "hello" in body
+    assert "[DONE]" in body
+    assert "internal" not in body  # the 500 attempt is not leaked to client
+    assert len(attempts) == 2
+
+
+# ---------- Forward path: recording ----------
+
+
+@pytest.mark.asyncio
+async def test_forward_invokes_recorder_on_success(tmp_path):
+    """When a recorder is attached to the backend, each successful call writes one JSONL line."""
+    from rock.sdk.model.server.traj import TrajectoryRecorder
+
+    upstream_payload = _success_response_json(content="recorded reply")
+    traj_file = tmp_path / "traj.jsonl"
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        return httpx.Response(200, json=upstream_payload)
 
-def test_config_default_host_and_port():
-    """
-    Test default values for host and port.
-    """
     config = ModelServiceConfig()
 
-    assert config.host == "0.0.0.0"
-    assert config.port == 8080
+    with _patch_httpx_with_handler(handler):
+        recorder = TrajectoryRecorder(traj_file=traj_file)
+        app = _build_app(config, recorder=recorder)
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
+
+    line = traj_file.read_text(encoding="utf-8").strip()
+    record = json.loads(line)
+    assert record["status"] == "success"
+    assert record["model"] == "gpt-3.5-turbo"
+    assert record["stream"] is False
+    assert record["request"]["messages"][0]["content"] == "hi"
+    assert record["response"] == upstream_payload
+
+
+# ---------- Replay path ----------
 
 
 @pytest.mark.asyncio
-async def test_config_loads_retryable_status_codes_from_file(tmp_path):
-    """
-    Test that ModelServiceConfig correctly loads retryable_status_codes from config file.
-    """
-    conf_file = tmp_path / "proxy.yml"
-    conf_file.write_text(yaml.dump({"retryable_status_codes": [429, 500, 502, 503]}))
+async def test_replay_returns_recorded_response_no_upstream_call(tmp_path):
+    record = {
+        "model": "gpt-3.5-turbo",
+        "response": {
+            "id": "rec-1",
+            "object": "chat.completion",
+            "model": "gpt-3.5-turbo",
+            "choices": [
+                {
+                    "index": 0,
+                    "message": {"role": "assistant", "content": "recorded reply"},
+                    "finish_reason": "stop",
+                }
+            ],
+        },
+    }
+    traj = tmp_path / "t.jsonl"
+    traj.write_text(json.dumps(record) + "\n", encoding="utf-8")
 
-    config = ModelServiceConfig.from_file(str(conf_file))
+    config = ModelServiceConfig()
+    config.replay_file = str(traj)
+    app = _build_app(config, replay_cursor=SequentialCursor.load(traj))
+
+    transport = ASGITransport(app=app)
+    async with AsyncClient(transport=transport, base_url="http://test") as ac:
+        r = await ac.post(
+            "/v1/chat/completions",
+            json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+        )
+
+    assert r.status_code == 200
+    assert r.json()["choices"][0]["message"]["content"] == "recorded reply"
 
-    assert config.retryable_status_codes == [429, 500, 502, 503]
 
+@pytest.mark.asyncio
+async def test_replay_streaming_emits_recorded_response_as_sse(tmp_path):
+    record = {
+        "model": "gpt-3.5-turbo",
+        "response": {
+            "id": "rec-stream",
+            "object": "chat.completion",
+            "model": "gpt-3.5-turbo",
+            "choices": [
+                {
+                    "index": 0,
+                    "message": {"role": "assistant", "content": "streamed reply"},
+                    "finish_reason": "tool_calls",
+                }
+            ],
+        },
+    }
+    traj = tmp_path / "t.jsonl"
+    traj.write_text(json.dumps(record) + "\n", encoding="utf-8")
 
-def test_config_default_retryable_status_codes():
-    """
-    Test default values for retryable_status_codes.
-    """
     config = ModelServiceConfig()
+    config.replay_file = str(traj)
+    app = _build_app(config, replay_cursor=SequentialCursor.load(traj))
+
+    transport = ASGITransport(app=app)
+    async with AsyncClient(transport=transport, base_url="http://test") as ac:
+        r = await ac.post(
+            "/v1/chat/completions",
+            json={"model": "gpt-3.5-turbo", "stream": True, "messages": [{"role": "user", "content": "hi"}]},
+        )
 
-    assert config.retryable_status_codes == [429, 500]
+    body = r.text
+    assert "data: [DONE]" in body
+    assert '"object": "chat.completion.chunk"' in body
+    assert '"delta": {"role": "assistant", "content": "streamed reply"}' in body
+    assert '"finish_reason": "tool_calls"' in body
 
 
 @pytest.mark.asyncio
-async def test_perform_llm_request_respects_custom_retryable_codes():
-    """
-    Test that custom retryable_status_codes are respected (502 retries, 401 does not).
-    """
+async def test_replay_returns_404_when_cursor_exhausted(tmp_path):
+    record = {
+        "model": "gpt-3.5-turbo",
+        "response": {
+            "id": "only",
+            "choices": [{"index": 0, "message": {"role": "assistant", "content": "x"}, "finish_reason": "stop"}],
+        },
+    }
+    traj = tmp_path / "t.jsonl"
+    traj.write_text(json.dumps(record) + "\n", encoding="utf-8")
+
     config = ModelServiceConfig()
-    config.retryable_status_codes = [502, 503, 504]  # Custom retryable status codes
+    config.replay_file = str(traj)
+    app = _build_app(config, replay_cursor=SequentialCursor.load(traj))
+
+    transport = ASGITransport(app=app)
+    async with AsyncClient(transport=transport, base_url="http://test") as ac:
+        await ac.post(
+            "/v1/chat/completions",
+            json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+        )
+        second = await ac.post(
+            "/v1/chat/completions",
+            json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "again"}]},
+        )
 
-    client_post_path = "rock.sdk.model.server.api.proxy.http_client.post"
+    assert second.status_code == 404
+    assert "exhausted" in second.json()["detail"]
 
-    with (
-        patch(client_post_path, new_callable=AsyncMock) as mock_post,
-        patch("rock.utils.retry.asyncio.sleep", return_value=None),
-    ):
-        # 502 should retry (in custom list)
-        resp_502 = MagicMock(spec=Response)
-        resp_502.status_code = 502
-        error_502 = HTTPStatusError("Bad Gateway", request=MagicMock(spec=Request), response=resp_502)
 
-        resp_200 = MagicMock(spec=Response)
-        resp_200.status_code = 200
-        resp_200.json.return_value = {"ok": True}
+# ---------- Lifespan + Config ----------
 
-        # Sequence: 502 fail, then 200 success
-        mock_post.side_effect = [error_502, resp_200]
 
-        result = await perform_llm_request("http://fake.url", {}, {}, config)
+@pytest.mark.asyncio
+async def test_lifespan_initialization_with_config(tmp_path):
+    conf_file = tmp_path / "proxy.yml"
+    conf_file.write_text(yaml.dump({"proxy_rules": {"my-model": "http://custom-url"}, "request_timeout": 50}))
 
-        assert result.status_code == 200
-        assert mock_post.call_count == 2
+    config = ModelServiceConfig.from_file(str(conf_file))
+    app = FastAPI(lifespan=lambda app: lifespan(app, config))
+
+    async with lifespan(app, config):
+        assert app.state.model_service_config.proxy_rules["my-model"] == "http://custom-url"
+        assert app.state.model_service_config.request_timeout == 50
 
 
 @pytest.mark.asyncio
-async def test_perform_llm_request_non_retryable_code_not_retried():
-    """
-    Test that 401 (not in custom retryable_status_codes) does not trigger retry.
-    """
+async def test_lifespan_invalid_config_path():
+    with pytest.raises(FileNotFoundError):
+        ModelServiceConfig.from_file("/tmp/non_existent_file.yml")
+
+
+def test_config_default_host_and_port():
     config = ModelServiceConfig()
-    config.retryable_status_codes = [502, 503, 504]  # Custom retryable status codes, excluding 401
+    assert config.host == "0.0.0.0"
+    assert config.port == 8080
 
-    client_post_path = "rock.sdk.model.server.api.proxy.http_client.post"
 
-    with patch(client_post_path, new_callable=AsyncMock) as mock_post:
-        # 401 should not retry (not in custom list)
-        resp_401 = MagicMock(spec=Response)
-        resp_401.status_code = 401
-        resp_401.json.return_value = {"error": "Invalid API Key"}
+def test_config_default_recording_and_replay():
+    config = ModelServiceConfig()
+    assert config.recording_file is None
+    assert config.replay_file is None
+
 
-        mock_post.return_value = resp_401
+@pytest.mark.asyncio
+async def test_config_loads_recording_file_from_yaml(tmp_path):
+    conf_file = tmp_path / "proxy.yml"
+    conf_file.write_text(yaml.dump({"recording_file": "/tmp/my-traj.jsonl"}))
+    config = ModelServiceConfig.from_file(str(conf_file))
+    assert config.recording_file == "/tmp/my-traj.jsonl"
+    assert config.replay_file is None
 
-        result = await perform_llm_request("http://fake.url", {}, {}, config)
 
-        assert result.status_code == 401
-        assert mock_post.call_count == 1  # No retry
+@pytest.mark.asyncio
+async def test_config_loads_replay_file_from_yaml(tmp_path):
+    conf_file = tmp_path / "proxy.yml"
+    conf_file.write_text(yaml.dump({"replay_file": "/tmp/in.jsonl"}))
+    config = ModelServiceConfig.from_file(str(conf_file))
+    assert config.replay_file == "/tmp/in.jsonl"
+    assert config.recording_file is None
 
 
-def test_cli_args_override_config_file(tmp_path):
-    """
-    Test that CLI arguments override config file settings.
-    This tests the logic in create_config_from_args().
-    """
-    import argparse
+def test_config_recording_and_replay_are_mutually_exclusive():
+    """Setting both at construction time fails Pydantic validation."""
+    with pytest.raises(ValueError, match="mutually exclusive"):
+        ModelServiceConfig(recording_file="/tmp/a.jsonl", replay_file="/tmp/b.jsonl")
 
-    # Create args with config file and CLI parameters
+
+def test_config_recording_replay_mutex_fires_on_assignment():
+    """With validate_assignment=True, CLI-style field-by-field overrides also trip the mutex."""
+    config = ModelServiceConfig(recording_file="/tmp/a.jsonl")
+    with pytest.raises(ValueError, match="mutually exclusive"):
+        config.replay_file = "/tmp/b.jsonl"
+
+
+def test_cli_args_override_config_file(tmp_path):
     conf_file = tmp_path / "proxy.yml"
     conf_file.write_text(
         yaml.dump(
@@ -378,144 +595,67 @@ def test_cli_args_override_config_file(tmp_path):
                 "host": "192.168.1.1",
                 "port": 8080,
                 "proxy_base_url": "https://config-url.example.com/v1",
-                "retryable_status_codes": [429, 500],
                 "request_timeout": 60,
             }
         )
     )
-
     args = argparse.Namespace(
         config_file=str(conf_file),
-        host="0.0.0.0",  # CLI overrides config file
-        port=9000,  # CLI overrides config file
-        proxy_base_url="https://cli-url.example.com/v1",  # CLI overrides config file
-        retryable_status_codes="502,503",  # CLI overrides config file
-        request_timeout=30,  # CLI overrides config file
+        host="0.0.0.0",
+        port=9000,
+        proxy_base_url="https://cli-url.example.com/v1",
+        retryable_status_codes=None,
+        request_timeout=30,
+        recording_file=None,
+        replay_file=None,
     )
-
     config = create_config_from_args(args)
-
-    # Verify CLI arguments override config file
     assert config.host == "0.0.0.0"
     assert config.port == 9000
     assert config.proxy_base_url == "https://cli-url.example.com/v1"
-    assert config.retryable_status_codes == [502, 503]
     assert config.request_timeout == 30
 
 
-@pytest.mark.asyncio
-async def test_config_file_overrides_defaults(tmp_path):
-    """
-    Test that config file values override default values.
-    """
-    conf_file = tmp_path / "proxy.yml"
-    conf_file.write_text(
-        yaml.dump(
-            {
-                "host": "10.0.0.1",
-                "port": 8888,
-                "request_timeout": 300,
-                "proxy_rules": {"test-model": "http://test-backend"},
-            }
-        )
+def test_cli_replay_file_enables_replay():
+    args = argparse.Namespace(
+        config_file=None,
+        host=None,
+        port=None,
+        proxy_base_url=None,
+        retryable_status_codes=None,
+        request_timeout=None,
+        recording_file=None,
+        replay_file="/tmp/in.jsonl",
     )
+    config = create_config_from_args(args)
+    assert config.replay_file == "/tmp/in.jsonl"
 
-    config = ModelServiceConfig.from_file(str(conf_file))
 
-    # Verify config file overrides defaults
-    assert config.host == "10.0.0.1"
-    assert config.port == 8888
-    assert config.request_timeout == 300
-    assert config.proxy_rules["test-model"] == "http://test-backend"
-    # Verify other fields remain as defaults
-    assert config.proxy_base_url is None
+# ---------- Metrics singleton + legacy record_traj (still used by local mode) ----------
 
 
 def test_metrics_monitor_is_singleton():
-    """
-    Test that _get_or_create_metrics_monitor returns the same instance
-    on repeated calls (module-level singleton, created only once).
-    """
     import rock.sdk.model.server.utils as utils_module
 
     with patch("rock.sdk.model.server.utils.MetricsMonitor") as mock_cls:
-        mock_monitor = MagicMock()
-        mock_cls.create.return_value = mock_monitor
-
-        # Reset singleton so the test is isolated
+        mock_cls.create.return_value = MagicMock()
         utils_module._metrics_monitor = None
-
         first = _get_or_create_metrics_monitor()
         second = _get_or_create_metrics_monitor()
-
         assert first is second
-        assert mock_cls.create.call_count == 1
-
-        # Cleanup
-        utils_module._metrics_monitor = None
-
-
-def test_metrics_monitor_uses_env_endpoint():
-    """
-    Test that ROCK_METRICS_ENDPOINT env var is passed to MetricsMonitor.create().
-    """
-    import rock.sdk.model.server.utils as utils_module
-
-    custom_endpoint = "http://my-otel-collector:4318/v1/metrics"
-
-    with (
-        patch("rock.sdk.model.server.utils.MetricsMonitor") as mock_cls,
-        patch.dict("os.environ", {"ROCK_METRICS_ENDPOINT": custom_endpoint}),
-    ):
-        mock_monitor = MagicMock()
-        mock_cls.create.return_value = mock_monitor
-
-        utils_module._metrics_monitor = None
-        _get_or_create_metrics_monitor()
-
-        mock_cls.create.assert_called_once_with(metrics_endpoint=custom_endpoint)
-
-        utils_module._metrics_monitor = None
-
-
-def test_metrics_monitor_registers_gauge_and_counter():
-    """
-    Test that _get_or_create_metrics_monitor registers both
-    the RT gauge and request count counter on first creation.
-    """
-    import rock.sdk.model.server.utils as utils_module
-
-    with patch("rock.sdk.model.server.utils.MetricsMonitor") as mock_cls:
-        mock_monitor = MagicMock()
-        mock_cls.create.return_value = mock_monitor
-
-        utils_module._metrics_monitor = None
-        _get_or_create_metrics_monitor()
-
-        mock_monitor._register_gauge.assert_called_once_with(
-            MODEL_SERVICE_REQUEST_RT, "total execution time for request", "ms"
-        )
-        mock_monitor._register_counter.assert_called_once_with(
-            MODEL_SERVICE_REQUEST_COUNT, "total request count", "count"
-        )
-
         utils_module._metrics_monitor = None
 
 
 @pytest.mark.asyncio
-async def test_record_traj_reports_rt_and_count():
-    """
-    Test that record_traj decorator calls record_gauge_by_name (RT)
-    and record_counter_by_name (count) with correct metric names and attributes.
-    """
+async def test_record_traj_decorator_reports_rt_and_count():
+    """Legacy record_traj decorator (still used by local mode) reports RT/count."""
     import rock.sdk.model.server.utils as utils_module
 
-    mock_monitor = MagicMock()
-
     with (
         patch("rock.sdk.model.server.utils.MetricsMonitor") as mock_cls,
-        patch.dict("os.environ", {"ROCK_SANDBOX_ID": "sandbox-test-001"}),
+        patch.dict("os.environ", {"ROCK_SANDBOX_ID": "sandbox-test"}),
     ):
+        mock_monitor = MagicMock()
         mock_cls.create.return_value = mock_monitor
         utils_module._metrics_monitor = None
 
@@ -525,45 +665,11 @@ async def fake_handler(body: dict):
 
         await fake_handler({"model": "gpt-4", "messages": []})
 
-        mock_monitor.record_gauge_by_name.assert_called_once()
         gauge_call = mock_monitor.record_gauge_by_name.call_args
         assert gauge_call[0][0] == MODEL_SERVICE_REQUEST_RT
-        assert gauge_call[1]["attributes"]["type"] == "chat_completions"
-        assert gauge_call[1]["attributes"]["sandbox_id"] == "sandbox-test-001"
+        assert gauge_call[1]["attributes"]["sandbox_id"] == "sandbox-test"
 
-        mock_monitor.record_counter_by_name.assert_called_once()
         counter_call = mock_monitor.record_counter_by_name.call_args
         assert counter_call[0][0] == MODEL_SERVICE_REQUEST_COUNT
-        assert counter_call[0][1] == 1
-        assert counter_call[1]["attributes"]["sandbox_id"] == "sandbox-test-001"
-
-        utils_module._metrics_monitor = None
-
-
-@pytest.mark.asyncio
-async def test_record_traj_sandbox_id_defaults_to_unknown():
-    """
-    Test that sandbox_id defaults to 'unknown' when ROCK_SANDBOX_ID is not set.
-    """
-    import rock.sdk.model.server.utils as utils_module
-
-    mock_monitor = MagicMock()
-
-    with patch("rock.sdk.model.server.utils.MetricsMonitor") as mock_cls, patch.dict("os.environ", {}, clear=False):
-        # Ensure ROCK_SANDBOX_ID is not set
-        os_env = __import__("os").environ
-        os_env.pop("ROCK_SANDBOX_ID", None)
-
-        mock_cls.create.return_value = mock_monitor
-        utils_module._metrics_monitor = None
-
-        @record_traj
-        async def fake_handler(body: dict):
-            return {"id": "resp-2", "choices": []}
-
-        await fake_handler({"model": "gpt-4", "messages": []})
-
-        gauge_call = mock_monitor.record_gauge_by_name.call_args
-        assert gauge_call[1]["attributes"]["sandbox_id"] == "unknown"
 
         utils_module._metrics_monitor = None
diff --git a/tests/unit/sdk/model/test_proxy_record_replay_e2e.py b/tests/unit/sdk/model/test_proxy_record_replay_e2e.py
new file mode 100644
index 0000000000..0b70ed0cf8
--- /dev/null
+++ b/tests/unit/sdk/model/test_proxy_record_replay_e2e.py
@@ -0,0 +1,332 @@
+"""End-to-end: real in-process TCP mock upstream + real proxy router + recorder.
+
+The mock upstream is a tiny FastAPI app served by uvicorn in a background thread
+(real TCP). The proxy stays in-process and is hit via FastAPI's ``TestClient``;
+its outbound ``httpx.AsyncClient`` makes a real TCP call to the mock — production
+code path, no transport injection, no patching.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+import threading
+import time
+from collections.abc import Iterator
+from pathlib import Path
+
+import pytest
+import uvicorn
+from fastapi import FastAPI, Request
+from fastapi.responses import JSONResponse, StreamingResponse
+from fastapi.testclient import TestClient
+
+from rock.sdk.model.server.api.proxy import ForwardBackend, ReplayBackend, proxy_router
+from rock.sdk.model.server.config import ModelServiceConfig
+from rock.sdk.model.server.sse import parse_sse_data_chunks
+from rock.sdk.model.server.traj import SequentialCursor, TrajectoryRecorder
+from rock.utils.system import find_free_port
+
+# ---------------------------------------------------------------------------
+# Mock upstream — a tiny FastAPI app behind a real TCP uvicorn in a thread.
+# Owns both the canned reply AND the assertion helper, so the response shape
+# and the expectations stay in sync if either is edited.
+# ---------------------------------------------------------------------------
+
+
+class MockUpstream:
+    """Mock OpenAI-compatible upstream.
+
+    The single canonical reply (returned for both stream and non-stream requests)
+    contains three fields the proxy must preserve end-to-end:
+      - ``content``            (plain text)
+      - ``reasoning_content``  (vendor-specific thinking)
+      - ``tool_calls``         (a function call)
+    The streaming variant splits each field into multiple deltas so the
+    recorder also exercises the openai SDK's stream-state aggregator.
+
+    Use as ``with MockUpstream() as mock: ...``; ``mock.base_url`` points at
+    the running server. ``mock.assert_message(msg)`` checks that a received
+    assistant message matches the canonical reply.
+    """
+
+    # Canonical reply values — change here, both the handler and the assertion
+    # helper pick them up automatically. Two parallel tool_calls cover the
+    # multi-tool-call case (modern LLMs commonly emit several at once).
+    EXPECTED_CONTENT = "Checking weather and time for you."
+    EXPECTED_REASONING = "User wants weather + time; calling both tools in parallel."
+    EXPECTED_TOOL_CALLS = [
+        {
+            "id": "call_weather",
+            "type": "function",
+            "function": {"name": "get_weather", "arguments": '{"city":"Tokyo","unit":"celsius"}'},
+        },
+        {
+            "id": "call_time",
+            "type": "function",
+            "function": {"name": "get_time", "arguments": '{"city":"Tokyo"}'},
+        },
+    ]
+
+    def __init__(self) -> None:
+        port = asyncio.run(find_free_port())
+        config = uvicorn.Config(self._build_app(), host="127.0.0.1", port=port, log_level="warning", access_log=False)
+        self._server = uvicorn.Server(config)
+        self._thread = threading.Thread(target=self._server.run, daemon=True)
+        self.base_url = f"http://127.0.0.1:{port}/v1"
+
+    # ---- lifecycle ----
+
+    def __enter__(self) -> MockUpstream:
+        self._thread.start()
+        deadline = time.time() + 5.0
+        while not self._server.started:
+            if time.time() > deadline:
+                raise RuntimeError("mock upstream did not start within 5s")
+            time.sleep(0.02)
+        return self
+
+    def __exit__(self, *_exc) -> None:
+        self._server.should_exit = True
+        self._thread.join(timeout=5)
+
+    # ---- assertion helper ----
+
+    def assert_message(self, msg: dict) -> None:
+        """Assert ``msg`` is the canonical full message (content + reasoning + 2 parallel tool_calls)."""
+        assert msg["content"] == self.EXPECTED_CONTENT
+        assert msg["reasoning_content"] == self.EXPECTED_REASONING
+        tcs = msg["tool_calls"]
+        assert len(tcs) == len(self.EXPECTED_TOOL_CALLS)
+        for actual, expected in zip(tcs, self.EXPECTED_TOOL_CALLS, strict=True):
+            assert actual["id"] == expected["id"]
+            assert actual["type"] == expected["type"]
+            assert actual["function"]["name"] == expected["function"]["name"]
+            assert json.loads(actual["function"]["arguments"]) == json.loads(expected["function"]["arguments"])
+
+    # ---- internal: FastAPI app + handlers ----
+
+    def _build_app(self) -> FastAPI:
+        app = FastAPI()
+
+        @app.post("/v1/chat/completions")
+        async def chat_completions(request: Request):
+            body = await request.json()
+            model = body.get("model", "mock")
+            if body.get("stream"):
+                return StreamingResponse(self._stream_gen(model), media_type="text/event-stream")
+            return JSONResponse(status_code=200, content=self._completion_json(model))
+
+        return app
+
+    def _completion_json(self, model: str) -> dict:
+        return {
+            "id": "chatcmpl-mock-1",
+            "object": "chat.completion",
+            "created": 0,
+            "model": model,
+            "choices": [
+                {
+                    "index": 0,
+                    "message": {
+                        "role": "assistant",
+                        "content": self.EXPECTED_CONTENT,
+                        "reasoning_content": self.EXPECTED_REASONING,
+                        "tool_calls": self.EXPECTED_TOOL_CALLS,
+                    },
+                    "finish_reason": "tool_calls",
+                }
+            ],
+            "usage": {"prompt_tokens": 12, "completion_tokens": 24, "total_tokens": 36},
+        }
+
+    async def _stream_gen(self, model: str):
+        base = {"id": "chatcmpl-mock-1", "object": "chat.completion.chunk", "created": 0, "model": model}
+
+        def emit(delta: dict, finish_reason=None) -> bytes:
+            payload = {**base, "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}]}
+            return f"data: {json.dumps(payload, ensure_ascii=False)}\n\n".encode()
+
+        # 1-2. Reasoning split in two deltas
+        yield emit({"role": "assistant", "reasoning_content": "User wants weather + time; "})
+        await asyncio.sleep(0.005)
+        yield emit({"reasoning_content": "calling both tools in parallel."})
+        await asyncio.sleep(0.005)
+        # 3-4. Content split in two deltas
+        yield emit({"content": "Checking weather"})
+        await asyncio.sleep(0.005)
+        yield emit({"content": " and time for you."})
+        await asyncio.sleep(0.005)
+
+        # 5-7. tool_call[0] (get_weather): announce, then arguments in two pieces
+        yield emit(
+            {
+                "tool_calls": [
+                    {
+                        "index": 0,
+                        "id": "call_weather",
+                        "type": "function",
+                        "function": {"name": "get_weather", "arguments": ""},
+                    }
+                ]
+            }
+        )
+        await asyncio.sleep(0.005)
+        yield emit({"tool_calls": [{"index": 0, "function": {"arguments": '{"city":"Tokyo",'}}]})
+        await asyncio.sleep(0.005)
+        yield emit({"tool_calls": [{"index": 0, "function": {"arguments": '"unit":"celsius"}'}}]})
+        await asyncio.sleep(0.005)
+
+        # 8-9. tool_call[1] (get_time): announce + arguments in one piece
+        yield emit(
+            {
+                "tool_calls": [
+                    {
+                        "index": 1,
+                        "id": "call_time",
+                        "type": "function",
+                        "function": {"name": "get_time", "arguments": ""},
+                    }
+                ]
+            }
+        )
+        await asyncio.sleep(0.005)
+        yield emit({"tool_calls": [{"index": 1, "function": {"arguments": '{"city":"Tokyo"}'}}]})
+        await asyncio.sleep(0.005)
+
+        # 10. Finish
+        yield emit({}, finish_reason="tool_calls")
+        yield b"data: [DONE]\n\n"
+
+
+@pytest.fixture
+def mock_upstream() -> Iterator[MockUpstream]:
+    with MockUpstream() as m:
+        yield m
+
+
+# ---------------------------------------------------------------------------
+# Proxy app builder + request helper (module-level, generic)
+# ---------------------------------------------------------------------------
+
+
+def _build_proxy_app(*, mock_url: str | None = None, traj_file: Path | None = None, replay_cursor=None) -> FastAPI:
+    config = ModelServiceConfig()
+    # ReplayBackend never calls upstream, so mock_url is only relevant for forward mode.
+    if mock_url is not None:
+        config.proxy_base_url = mock_url
+
+    app = FastAPI()
+    app.state.model_service_config = config
+    if replay_cursor is not None:
+        app.state.backend = ReplayBackend(replay_cursor)
+    else:
+        recorder = TrajectoryRecorder(traj_file=traj_file) if traj_file is not None else None
+        app.state.backend = ForwardBackend(config, recorder=recorder)
+    app.include_router(proxy_router)
+    return app
+
+
+def _call_chat_completions(client: TestClient, *, stream: bool) -> dict:
+    """One chat.completions call. Returns the assistant message dict.
+
+    - non-stream: just unwraps ``choices[0].message``.
+    - stream: returns the first chunk's ``delta``. Replay emits exactly one chunk
+      + ``[DONE]`` (see ``completion_to_chunk_dict``), so that delta IS the full
+      message; against the forwarding proxy it is only the first upstream delta.
+    """
+    payload = {"model": "mock-model", "messages": [{"role": "user", "content": "hi"}]}
+    if not stream:
+        r = client.post("/v1/chat/completions", json=payload)
+        assert r.status_code == 200
+        return r.json()["choices"][0]["message"]
+
+    with client.stream("POST", "/v1/chat/completions", json={**payload, "stream": True}) as r:
+        assert r.status_code == 200
+        body_bytes = b"".join(r.iter_bytes())
+    chunks, _ = parse_sse_data_chunks(body_bytes)
+    return chunks[0]["choices"][0]["delta"]
+
+
+# ---------------------------------------------------------------------------
+# Tests
+# ---------------------------------------------------------------------------
+
+
+class TestProxyRecordReplay:
+    """End-to-end: real TCP mock upstream <-> real proxy router + recorder/replayer."""
+
+    def test_forward_non_stream(self, mock_upstream: MockUpstream, tmp_path):
+        """Vendor field reaches the client; recorder writes a JSONL line with the full response."""
+        traj_file = tmp_path / "traj.jsonl"
+        proxy_app = _build_proxy_app(mock_url=mock_upstream.base_url, traj_file=traj_file)
+
+        with TestClient(proxy_app) as client:
+            r = client.post(
+                "/v1/chat/completions",
+                json={"model": "mock-model", "messages": [{"role": "user", "content": "hi"}]},
+                headers={"Authorization": "Bearer test-key"},
+            )
+
+        assert r.status_code == 200
+        body = r.json()
+        assert body["choices"][0]["finish_reason"] == "tool_calls"
+        mock_upstream.assert_message(body["choices"][0]["message"])
+
+        rec = json.loads(traj_file.read_text(encoding="utf-8").strip())
+        assert rec["status"] == "success"
+        assert rec["stream"] is False
+        assert rec["response"]["choices"][0]["finish_reason"] == "tool_calls"
+        mock_upstream.assert_message(rec["response"]["choices"][0]["message"])
+
+    def test_forward_stream(self, mock_upstream: MockUpstream, tmp_path):
+        """Each upstream SSE chunk reaches the client; recorder gets the aggregated final completion
+        with reasoning_content concatenated and tool_calls.arguments assembled from deltas."""
+        traj_file = tmp_path / "traj.jsonl"
+        proxy_app = _build_proxy_app(mock_url=mock_upstream.base_url, traj_file=traj_file)
+
+        with TestClient(proxy_app) as client:
+            with client.stream(
+                "POST",
+                "/v1/chat/completions",
+                json={"model": "mock-model", "stream": True, "messages": [{"role": "user", "content": "hi"}]},
+                headers={"Authorization": "Bearer test-key"},
+            ) as r:
+                body = b"".join(r.iter_bytes()).decode("utf-8")
+
+        # Raw chunks make it to the client untouched
+        assert '"reasoning_content": "User wants weather + time; "' in body
+        assert '"reasoning_content": "calling both tools in parallel."' in body
+        assert '"content": "Checking weather"' in body
+        assert '"content": " and time for you."' in body
+        assert '"name": "get_weather"' in body
+        assert '"name": "get_time"' in body
+        assert '"finish_reason": "tool_calls"' in body
+        assert body.rstrip().endswith("data: [DONE]")
+
+        # Recorder's aggregated message matches the canonical reply
+        rec = json.loads(traj_file.read_text(encoding="utf-8").strip())
+        assert rec["status"] == "success"
+        assert rec["stream"] is True
+        assert rec["response"]["choices"][0]["finish_reason"] == "tool_calls"
+        mock_upstream.assert_message(rec["response"]["choices"][0]["message"])
+
+    @pytest.mark.parametrize("replay_stream", [False, True], ids=["replay_nonstream", "replay_stream"])
+    @pytest.mark.parametrize("record_stream", [False, True], ids=["record_nonstream", "record_stream"])
+    def test_replay(self, mock_upstream: MockUpstream, tmp_path, record_stream: bool, replay_stream: bool):
+        """The streaming mode used to record and the one used to replay are independent: all 4
+        combinations of stream/non-stream on each side must yield the same full message."""
+        traj_file = tmp_path / "traj.jsonl"
+
+        # ---- record phase ----
+        proxy_record = _build_proxy_app(mock_url=mock_upstream.base_url, traj_file=traj_file)
+        with TestClient(proxy_record) as client:
+            _call_chat_completions(client, stream=record_stream)
+
+        # ---- replay phase: no upstream URL needed — ReplayBackend never calls upstream ----
+        cursor = SequentialCursor.load(traj_file)
+        proxy_replay = _build_proxy_app(replay_cursor=cursor)
+        with TestClient(proxy_replay) as client:
+            msg = _call_chat_completions(client, stream=replay_stream)
+
+        mock_upstream.assert_message(msg)
diff --git a/tests/unit/sdk/model/test_service_subprocess.py b/tests/unit/sdk/model/test_service_subprocess.py
new file mode 100644
index 0000000000..61176173bd
--- /dev/null
+++ b/tests/unit/sdk/model/test_service_subprocess.py
@@ -0,0 +1,38 @@
+"""Tests for ModelService.start_sandbox_service subprocess command construction.
+
+Covers the CLI flag wiring without actually spawning a subprocess: mock Popen
+and inspect the argv it would have been called with.
+"""
+
+from unittest.mock import patch
+
+from rock.sdk.model.service import ModelService
+
+
+def _captured_argv(**start_kwargs) -> list[str]:
+    with patch("rock.sdk.model.service.subprocess.Popen") as mock_popen:
+        ModelService().start_sandbox_service(**start_kwargs)
+    return mock_popen.call_args[0][0]
+
+
+def test_start_sandbox_service_omits_recording_and_replay_flags_by_default():
+    argv = _captured_argv(model_service_type="proxy", proxy_base_url="https://api.openai.com/v1", port=8080)
+    assert argv[1:5] == ["-m", "main", "--type", "proxy"]
+    assert "--proxy-base-url" in argv and "https://api.openai.com/v1" in argv
+    assert "--port" in argv and "8080" in argv
+    assert "--recording-file" not in argv
+    assert "--replay-file" not in argv
+
+
+def test_start_sandbox_service_passes_recording_file():
+    argv = _captured_argv(model_service_type="proxy", recording_file="/tmp/my-traj.jsonl")
+    idx = argv.index("--recording-file")
+    assert argv[idx + 1] == "/tmp/my-traj.jsonl"
+    assert "--replay-file" not in argv
+
+
+def test_start_sandbox_service_passes_replay_file():
+    argv = _captured_argv(model_service_type="proxy", replay_file="/tmp/in.jsonl")
+    idx = argv.index("--replay-file")
+    assert argv[idx + 1] == "/tmp/in.jsonl"
+    assert "--recording-file" not in argv
diff --git a/tests/unit/sdk/model/test_sse.py b/tests/unit/sdk/model/test_sse.py
new file mode 100644
index 0000000000..251016a0a8
--- /dev/null
+++ b/tests/unit/sdk/model/test_sse.py
@@ -0,0 +1,223 @@
+"""Tests for the pure SSE codec utilities (no openai/litellm dependencies)."""
+
+import json
+
+from rock.sdk.model.server.sse import (
+    SSE_DONE,
+    completion_to_chunk_dict,
+    encode_sse_event,
+    parse_sse_data_chunks,
+)
+
+# ---------- parse_sse_data_chunks ----------
+
+
+def test_parse_returns_complete_events_and_leftover_buffer():
+    raw = b'data: {"a": 1}\n\ndata: {"a": 2}\n\ndata: {"a": 3}'  # 3rd event is incomplete
+    chunks, leftover = parse_sse_data_chunks(raw)
+
+    assert chunks == [{"a": 1}, {"a": 2}]
+    assert leftover == b'data: {"a": 3}'
+
+
+def test_parse_skips_done_marker():
+    raw = b'data: {"x": 1}\n\ndata: [DONE]\n\n'
+    chunks, leftover = parse_sse_data_chunks(raw)
+
+    assert chunks == [{"x": 1}]
+    assert leftover == b""
+
+
+def test_parse_skips_non_data_lines():
+    raw = b'event: progress\ndata: {"y": 2}\nid: abc\n\n'
+    chunks, leftover = parse_sse_data_chunks(raw)
+
+    assert chunks == [{"y": 2}]
+    assert leftover == b""
+
+
+def test_parse_silently_skips_malformed_json():
+    raw = b'data: not-json-at-all\n\ndata: {"ok": true}\n\n'
+    chunks, leftover = parse_sse_data_chunks(raw)
+
+    assert chunks == [{"ok": True}]
+    assert leftover == b""
+
+
+def test_parse_handles_empty_buffer():
+    chunks, leftover = parse_sse_data_chunks(b"")
+    assert chunks == []
+    assert leftover == b""
+
+
+def test_parse_incremental_streaming_pattern():
+    """Feed the stream in 5-byte fragments, carrying the leftover buffer forward; every event must be recovered."""
+    full_stream = b'data: {"i": 0}\n\ndata: {"i": 1}\n\ndata: {"i": 2}\n\ndata: [DONE]\n\n'
+    fragments = [full_stream[i : i + 5] for i in range(0, len(full_stream), 5)]
+
+    buffer = b""
+    collected: list[dict] = []
+    for frag in fragments:
+        new_chunks, buffer = parse_sse_data_chunks(buffer + frag)
+        collected.extend(new_chunks)
+
+    assert collected == [{"i": 0}, {"i": 1}, {"i": 2}]
+    assert buffer == b""
+
+
+def test_parse_handles_unicode_payload():
+    raw = b'data: {"content": "\xe4\xbd\xa0\xe5\xa5\xbd"}\n\n'  # "你好" UTF-8
+    chunks, _ = parse_sse_data_chunks(raw)
+    assert chunks == [{"content": "你好"}]
+
+
+# ---------- completion_to_chunk_dict ----------
+
+
+def test_completion_to_chunk_renames_message_to_delta():
+    response = {
+        "id": "rec-1",
+        "object": "chat.completion",
+        "created": 100,
+        "model": "gpt-4",
+        "choices": [
+            {
+                "index": 0,
+                "message": {"role": "assistant", "content": "hi"},
+                "finish_reason": "stop",
+            }
+        ],
+    }
+    chunk = completion_to_chunk_dict(response, model="gpt-4")
+
+    assert chunk["object"] == "chat.completion.chunk"
+    assert chunk["id"] == "rec-1"
+    assert chunk["created"] == 100
+    assert chunk["model"] == "gpt-4"
+    assert chunk["choices"][0]["delta"] == {"role": "assistant", "content": "hi"}
+    assert chunk["choices"][0]["finish_reason"] == "stop"
+    assert chunk["choices"][0]["index"] == 0
+    assert "message" not in chunk["choices"][0]
+
+
+def test_completion_to_chunk_preserves_provider_specific_message_fields():
+    """reasoning_content kept verbatim; tool_calls get a positional index injected
+    (required by the OpenAI streaming spec — see test below)."""
+    response = {
+        "choices": [
+            {
+                "index": 0,
+                "message": {
+                    "role": "assistant",
+                    "content": "answer",
+                    "reasoning_content": "step-by-step thinking",
+                    "tool_calls": [{"id": "t1", "type": "function"}],
+                },
+                "finish_reason": "tool_calls",
+            }
+        ],
+    }
+    chunk = completion_to_chunk_dict(response, model="glm-5")
+
+    assert chunk["choices"][0]["delta"]["reasoning_content"] == "step-by-step thinking"
+    assert chunk["choices"][0]["delta"]["tool_calls"] == [{"index": 0, "id": "t1", "type": "function"}]
+    assert chunk["choices"][0]["finish_reason"] == "tool_calls"
+
+
+def test_completion_to_chunk_injects_tool_call_index_for_openai_sdk_compat():
+    """A recorded non-stream message has tool_calls without 'index'; the OpenAI
+    streaming spec requires it on chunk deltas, and the openai SDK's
+    ChatCompletionChunk.model_validate() rejects the chunk otherwise. We inject
+    a positional index so replay-stream output is parseable by strict clients."""
+    response = {
+        "choices": [
+            {
+                "index": 0,
+                "message": {
+                    "role": "assistant",
+                    "tool_calls": [
+                        {"id": "a", "type": "function", "function": {"name": "f1", "arguments": "{}"}},
+                        {"id": "b", "type": "function", "function": {"name": "f2", "arguments": "{}"}},
+                    ],
+                },
+                "finish_reason": "tool_calls",
+            }
+        ],
+    }
+    chunk = completion_to_chunk_dict(response, model="m")
+    tcs = chunk["choices"][0]["delta"]["tool_calls"]
+    assert [tc["index"] for tc in tcs] == [0, 1]
+
+    # End-to-end: openai SDK accepts the chunk
+    from openai.types.chat import ChatCompletionChunk
+
+    ChatCompletionChunk.model_validate(chunk)  # must not raise
+
+
+def test_completion_to_chunk_preserves_explicit_tool_call_index():
+    """If the recorded tool_calls already have 'index', we don't overwrite it."""
+    response = {
+        "choices": [
+            {
+                "index": 0,
+                "message": {
+                    "role": "assistant",
+                    "tool_calls": [
+                        {"index": 5, "id": "a", "type": "function", "function": {"name": "f", "arguments": "{}"}},
+                    ],
+                },
+                "finish_reason": "tool_calls",
+            }
+        ],
+    }
+    chunk = completion_to_chunk_dict(response, model="m")
+    assert chunk["choices"][0]["delta"]["tool_calls"][0]["index"] == 5
+
+
+def test_completion_to_chunk_synthesizes_id_and_created_when_missing():
+    chunk = completion_to_chunk_dict(
+        {"choices": [{"index": 0, "message": {"role": "assistant"}, "finish_reason": "stop"}]},
+        model="any",
+    )
+    assert chunk["id"].startswith("chatcmpl-")
+    assert isinstance(chunk["created"], int) and chunk["created"] > 0
+    assert chunk["model"] == "any"
+
+
+def test_completion_to_chunk_handles_empty_choices():
+    chunk = completion_to_chunk_dict({"choices": []}, model="m")
+    assert chunk["choices"] == []
+
+
+# ---------- encode_sse_event ----------
+
+
+def test_encode_sse_event_appends_double_newline_terminator():
+    out = encode_sse_event({"k": "v"})
+    assert out.endswith(b"\n\n")
+    assert out.startswith(b"data: ")
+    body = out[len(b"data: ") : -len(b"\n\n")]
+    assert json.loads(body) == {"k": "v"}
+
+
+def test_encode_sse_event_preserves_unicode_without_escapes():
+    out = encode_sse_event({"content": "你好"})
+    # ensure_ascii=False keeps non-ASCII text (here Chinese) unescaped and readable on the wire
+    assert "你好".encode() in out
+
+
+def test_sse_done_constant():
+    assert SSE_DONE == b"data: [DONE]\n\n"
+
+
+# ---------- round-trip ----------
+
+
+def test_roundtrip_encode_then_parse():
+    """encode → parse must round-trip a payload dict."""
+    payloads = [{"i": 0, "text": "alpha"}, {"i": 1, "text": "beta 中文"}]
+    wire = b"".join(encode_sse_event(p) for p in payloads) + SSE_DONE
+    chunks, leftover = parse_sse_data_chunks(wire)
+
+    assert chunks == payloads
+    assert leftover == b""
diff --git a/tests/unit/sdk/model/test_traj_recorder.py b/tests/unit/sdk/model/test_traj_recorder.py
new file mode 100644
index 0000000000..3f06481639
--- /dev/null
+++ b/tests/unit/sdk/model/test_traj_recorder.py
@@ -0,0 +1,141 @@
+"""Tests for TrajectoryRecorder (explicit-call API, no longer a litellm CustomLogger)."""
+
+import json
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+from rock.sdk.model.server.traj import TrajectoryRecorder
+
+
+@pytest.fixture
+def mock_monitor():
+    monitor = MagicMock()
+    with patch(
+        "rock.sdk.model.server.traj._get_or_create_metrics_monitor",
+        return_value=monitor,
+    ):
+        yield monitor
+
+
+def _make_recorder(traj_file) -> TrajectoryRecorder:
+    return TrajectoryRecorder(traj_file=traj_file)
+
+
+@pytest.mark.asyncio
+async def test_recorder_appends_each_call_as_jsonl_line(tmp_path, mock_monitor):
+    traj_file = tmp_path / "traj.jsonl"
+    recorder = _make_recorder(traj_file)
+
+    await recorder.record(
+        request={"model": "gpt-4", "messages": [{"role": "user", "content": "hi"}]},
+        response={"id": "a", "choices": []},
+        status="success",
+        start_time=100.0,
+        end_time=100.5,
+    )
+    await recorder.record(
+        request={"model": "gpt-4", "messages": [{"role": "user", "content": "again"}]},
+        response={"id": "b", "choices": []},
+        status="success",
+        start_time=101.0,
+        end_time=101.2,
+    )
+
+    lines = traj_file.read_text(encoding="utf-8").strip().split("\n")
+    assert len(lines) == 2
+    assert json.loads(lines[0])["response"]["id"] == "a"
+    assert json.loads(lines[1])["response"]["id"] == "b"
+
+
+@pytest.mark.asyncio
+async def test_recorder_writes_request_and_response_verbatim(tmp_path, mock_monitor):
+    """Provider-specific fields (reasoning_content, citations, ...) survive untouched."""
+    traj_file = tmp_path / "traj.jsonl"
+    recorder = _make_recorder(traj_file)
+
+    request = {"model": "glm-5", "stream": True, "messages": [{"role": "user", "content": "你是谁"}]}
+    response = {
+        "id": "x",
+        "choices": [
+            {
+                "index": 0,
+                "message": {"role": "assistant", "content": "我是 GLM", "reasoning_content": "用户问..."},
+                "finish_reason": "stop",
+            }
+        ],
+    }
+    await recorder.record(request=request, response=response, status="success", start_time=0.0, end_time=1.0)
+
+    record = json.loads(traj_file.read_text(encoding="utf-8").strip())
+    assert record["model"] == "glm-5"
+    assert record["stream"] is True
+    assert record["request"] == request
+    assert record["response"] == response
+    assert record["response_time"] == 1.0
+
+
+@pytest.mark.asyncio
+async def test_recorder_emits_metrics_with_status_and_sandbox_id(tmp_path, mock_monitor):
+    traj_file = tmp_path / "traj.jsonl"
+    recorder = _make_recorder(traj_file)
+
+    with patch.dict("os.environ", {"ROCK_SANDBOX_ID": "sandbox-xyz"}):
+        await recorder.record(
+            request={"model": "gpt-4"},
+            response={"id": "x", "choices": []},
+            status="success",
+            start_time=0.0,
+            end_time=0.5,
+        )
+
+    gauge_call = mock_monitor.record_gauge_by_name.call_args
+    assert gauge_call[0][0] == "model_service.request.rt"
+    assert gauge_call[0][1] == 500.0  # 0.5s -> 500 ms
+    assert gauge_call[1]["attributes"]["status"] == "success"
+    assert gauge_call[1]["attributes"]["sandbox_id"] == "sandbox-xyz"
+    assert gauge_call[1]["attributes"]["type"] == "chat_completions"
+
+    mock_monitor.record_counter_by_name.assert_called_once_with(
+        "model_service.request.count", 1, attributes=gauge_call[1]["attributes"]
+    )
+
+
+@pytest.mark.asyncio
+async def test_recorder_records_failure_with_error_text(tmp_path, mock_monitor):
+    traj_file = tmp_path / "traj.jsonl"
+    recorder = _make_recorder(traj_file)
+
+    await recorder.record(
+        request={"model": "gpt-4"},
+        response=None,
+        status="failure",
+        start_time=0.0,
+        end_time=1.0,
+        error="upstream_status=429",
+    )
+
+    record = json.loads(traj_file.read_text(encoding="utf-8").strip())
+    assert record["status"] == "failure"
+    assert record["error"] == "upstream_status=429"
+    assert record["response"] is None
+
+    gauge_call = mock_monitor.record_gauge_by_name.call_args
+    assert gauge_call[1]["attributes"]["status"] == "failure"
+
+
+@pytest.mark.asyncio
+async def test_recorder_creates_parent_directory(tmp_path, mock_monitor):
+    traj_file = tmp_path / "deep" / "nested" / "traj.jsonl"
+    recorder = _make_recorder(traj_file)
+
+    await recorder.record(
+        request={"model": "gpt-4"},
+        response={"id": "x", "choices": []},
+        status="success",
+        start_time=0.0,
+        end_time=0.5,
+    )
+
+    assert traj_file.exists()
+    assert traj_file.parent.is_dir()
diff --git a/tests/unit/sdk/model/test_traj_replayer.py b/tests/unit/sdk/model/test_traj_replayer.py
new file mode 100644
index 0000000000..ffcc5c4011
--- /dev/null
+++ b/tests/unit/sdk/model/test_traj_replayer.py
@@ -0,0 +1,122 @@
+"""Tests for SequentialCursor (the replay cursor used by proxy.py).
+
+The proxy serves replay responses directly — there is no CustomLLM-based
+``TrajectoryReplayer`` anymore. End-to-end replay coverage (cursor + SSE chunk
+emit + cursor-exhausted → 404) lives in ``test_proxy.py``.
+"""
+
+import json
+
+import pytest
+
+from rock.sdk.model.server.traj import SequentialCursor, TrajectoryExhausted
+
+
+def _record(*, msg: str, model: str = "gpt-3.5-turbo", call_id: str = "x") -> dict:
+    return {
+        "id": call_id,
+        "model": model,
+        "messages": [{"role": "user", "content": msg}],
+        "response": {
+            "id": call_id,
+            "object": "chat.completion",
+            "model": model,
+            "choices": [
+                {
+                    "index": 0,
+                    "message": {"role": "assistant", "content": f"reply: {msg}"},
+                    "finish_reason": "stop",
+                }
+            ],
+            "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
+        },
+    }
+
+
+def _write_jsonl(path, records):
+    with path.open("w", encoding="utf-8") as f:
+        for r in records:
+            f.write(json.dumps(r) + "\n")
+
+
+def test_cursor_load_from_single_file(tmp_path):
+    p = tmp_path / "traj.jsonl"
+    _write_jsonl(p, [_record(msg="a"), _record(msg="b")])
+
+    cur = SequentialCursor.load(p)
+    assert cur.total == 2
+    assert cur.position == 0
+
+
+def test_cursor_load_skips_empty_lines(tmp_path):
+    p = tmp_path / "traj.jsonl"
+    p.write_text(
+        json.dumps(_record(msg="a")) + "\n\n  \n" + json.dumps(_record(msg="b")) + "\n",
+        encoding="utf-8",
+    )
+
+    cur = SequentialCursor.load(p)
+    assert cur.total == 2
+
+
+def test_cursor_load_missing_file_raises(tmp_path):
+    with pytest.raises(FileNotFoundError):
+        SequentialCursor.load(tmp_path / "missing.jsonl")
+
+
+def test_cursor_load_directory_raises(tmp_path):
+    """Path must be a single .jsonl file, not a directory."""
+    with pytest.raises(FileNotFoundError):
+        SequentialCursor.load(tmp_path)
+
+
+@pytest.mark.asyncio
+async def test_cursor_next_returns_records_in_order(tmp_path):
+    p = tmp_path / "traj.jsonl"
+    _write_jsonl(p, [_record(msg="a", call_id="1"), _record(msg="b", call_id="2")])
+
+    cur = SequentialCursor.load(p)
+    first = await cur.next()
+    second = await cur.next()
+
+    assert first["id"] == "1"
+    assert second["id"] == "2"
+    assert cur.position == 2
+
+
+@pytest.mark.asyncio
+async def test_cursor_next_raises_trajectory_exhausted_when_done(tmp_path):
+    p = tmp_path / "traj.jsonl"
+    _write_jsonl(p, [_record(msg="only")])
+
+    cur = SequentialCursor.load(p)
+    await cur.next()
+
+    with pytest.raises(TrajectoryExhausted) as exc_info:
+        await cur.next()
+    assert exc_info.value.position == 1
+    assert exc_info.value.total == 1
+
+
+@pytest.mark.asyncio
+async def test_cursor_reset_replays_from_start(tmp_path):
+    p = tmp_path / "traj.jsonl"
+    _write_jsonl(p, [_record(msg="a"), _record(msg="b")])
+
+    cur = SequentialCursor.load(p)
+    await cur.next()
+    await cur.next()
+    cur.reset()
+
+    again = await cur.next()
+    assert again["messages"][0]["content"] == "a"
+
+
+@pytest.mark.asyncio
+async def test_cursor_model_mismatch_only_warns(tmp_path):
+    p = tmp_path / "traj.jsonl"
+    _write_jsonl(p, [_record(msg="a", model="gpt-3.5-turbo")])
+
+    cur = SequentialCursor.load(p)
+    record = await cur.next(expected_model="gpt-4o")  # different model -> warn but don't raise
+    assert record["id"] == "x"
diff --git a/uv.lock b/uv.lock
index e00a7f86b3..cfed10409c 100644
--- a/uv.lock
+++ b/uv.lock
@@ -1196,6 +1196,15 @@ wheels = [
     { url = "https://mirrors.aliyun.com/pypi/packages/33/6b/e0547afaf41bf2c42e52430072fa5658766e3d65bd4b03a563d1b6336f57/distlib-0.4.0-py2.py3-none-any.whl", hash = "sha256:9659f7d87e46584a30b5780e43ac7a2143098441670ff0a49d5f9034c54a6c16" },
 ]
 
+[[package]]
+name = "distro"
+version = "1.9.0"
+source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
+sdist = { url = "https://mirrors.aliyun.com/pypi/packages/fc/f8/98eea607f65de6527f8a2e8885fc8015d3e6f5775df186e443e0964a11c3/distro-1.9.0.tar.gz", hash = "sha256:2fa77c6fd8940f116ee1d6b94a2f90b13b5ea8d019b98bc8bafdcabcdd9bdbed" }
+wheels = [
+    { url = "https://mirrors.aliyun.com/pypi/packages/12/b3/231ffd4ab1fc9d679809f356cebee130ac7daa00d6d6f3206dd4fd137e9e/distro-1.9.0-py3-none-any.whl", hash = "sha256:7bffd925d65168f85027d8da9af6bddab658135b840670a223589bc0c8ef02b2" },
+]
+
 [[package]]
 name = "docker"
 version = "7.1.0"
@@ -1919,6 +1928,109 @@ wheels = [
     { url = "https://mirrors.aliyun.com/pypi/packages/cb/b1/3846dd7f199d53cb17f49cba7e651e9ce294d8497c8c150530ed11865bb8/iniconfig-2.3.0-py3-none-any.whl", hash = "sha256:f631c04d2c48c52b84d0d0549c99ff3859c98df65b3101406327ecc7d53fbf12" },
 ]
 
+[[package]]
+name = "jiter"
+version = "0.14.0"
+source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
+sdist = { url = "https://mirrors.aliyun.com/pypi/packages/6e/c1/0cddc6eb17d4c53a99840953f95dd3accdc5cfc7a337b0e9b26476276be9/jiter-0.14.0.tar.gz", hash = "sha256:e8a39e66dac7153cf3f964a12aad515afa8d74938ec5cc0018adcdae5367c79e" }
+wheels = [
+    { url = "https://mirrors.aliyun.com/pypi/packages/64/2e/a9959997739c403378d0a4a3a1c4ed80b60aeace216c4d37b303a9fc60a4/jiter-0.14.0-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:02f36a5c700f105ac04a6556fe664a59037a2c200db3b7e88784fac2ddf02531" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/27/72/b6de8a531e0adbadd839bec301165feb1fccf00e9ff55073ba2dd20f0043/jiter-0.14.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:41eab6c09ceffb6f0fe25e214b3068146edb1eda3649ca2aee2a061029c7ba2e" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/db/d8/2040b9efa13c917f855c40890ae4119fe02c25b7c7677d5b4fa820a851fc/jiter-0.14.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5cf4d4c109641f9cfaf4a7b6aebd51654e405cd00fa9ebbf87163b8b97b325aa" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/49/62/655c0ad5ce6a8e90f9068c175b8a236877d753e460762b3183c136db1c5b/jiter-0.14.0-cp310-cp310-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:b80c7b41a628e6be2213ad0ece763c5f88aa5ee003fa394d58acaaee1f4b8342" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f1/66/549c40fa068f08710b7570869c306a051eb67a29758bd64f4114f730554c/jiter-0.14.0-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:fb3dbf7cc0d4dbe73cce307ebe7eefa7f73a7d3d854dd119ea0c243f03e40927" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/25/2f/97a32a05fed14ed58a18e181fdfb619e05163f3726b54ee6080ec0539c09/jiter-0.14.0-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:7054adcdeb06b46efd17b5734f75817a44a2d06d3748e36c3a023a1bb52af9ec" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/2a/3b/4347e1d6c2a973d653bbb7a2d671a2d2426e54b52ba735b8ff0d0a29b75c/jiter-0.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d597cd1bf6790376f3fffc7c708766e57301d99a19314824ea0ccc9c3c70e1e2" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ef/24/ca452fbf2ea33548ed30ce68a39a50442d3f7c9bf0704a7af958a930c057/jiter-0.14.0-cp310-cp310-manylinux_2_31_riscv64.whl", hash = "sha256:df63a14878da754427926281626fd3ee249424a186e25a274e78176d42945264" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e3/a3/94470a0d199287caabeb4da2bb2ae5f6d17f3cf05dfc975d7cb064d58e0f/jiter-0.14.0-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:4ea73187627bcc5810e085df715e8a99da8bdfd96a7eb36b4b4df700ba6d4c9c" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/cf/71/6768edc09d7c45c39f093feb3de105fa718a3e982b5208b8a2ed6382b44b/jiter-0.14.0-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:9f541eaf7bb8382367a1a23d6fc3d6aad57f8dd8c18c3c17f838bee20f217220" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/3d/6b/5c2e17559a0f4e96e934479f7137df46c939e983fa05244e674815befb73/jiter-0.14.0-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:107465250de4fce00fdb47166bcd51df8e634e049541174fe3c71848e44f52ce" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b1/83/c25f3556a60fc74d11199100f1b6cc0c006b815c8494dea8ca16fe398732/jiter-0.14.0-cp310-cp310-win32.whl", hash = "sha256:ffb2a08a406465bb076b7cc1df41d833106d3cf7905076cc73f0cb90078c7d10" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/2e/99/781a1b413f0989b7f2ea203b094b331685f1a35e52e0a45e5d000ecaab27/jiter-0.14.0-cp310-cp310-win_amd64.whl", hash = "sha256:cb8b682d10cb0cce7ff4c1af7244af7022c9b01ae16d46c357bdd0df13afb25d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/8a/1f/198ae537fccb7080a0ed655eb56abf64a92f79489dfbf79f40fa34225bcd/jiter-0.14.0-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:7e791e247b8044512e070bd1f3633dc08350d32776d2d6e7473309d0edf256a2" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/cf/34/da67cff3fce964a36d03c3e365fb0f8726ade2a6cfd4d3c70107e216ead6/jiter-0.14.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:71527ce13fd5a0c4e40ad37331f8c547177dbb2dd0a93e5278b6a5eecf748804" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ed/36/4c72e67180d4e71a4f5dcf7886d0840e83c49ab11788172177a77570326e/jiter-0.14.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:02c4a7ab56f746014874f2c525584c0daca1dec37f66fd707ecef3b7e5c2228c" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/bc/db/9b39e09ceafa9878235c0fc29e3e3f9b12a4c6a98ea3085b998cadf3accc/jiter-0.14.0-cp311-cp311-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:376e9dafff914253bb9d46cdc5f7965607fbe7feb0a491c34e35f92b2770702e" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b0/96/0dcba1d7a82c1b720774b48ef239376addbaf30df24c34742ac4a57b67b2/jiter-0.14.0-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:23ad2a7a9da1935575c820428dd8d2490ce4d23189691ce33da1fc0a58e14e1c" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f1/e3/f61b71543e746e6b8b805e7755814fc242715c16f1dba58e1cbccb8032c2/jiter-0.14.0-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:54b3ddf5786bc7732d293bba3411ac637ecfa200a39983166d1df86a59a43c9f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ad/5e/0ddeb7096aca099114abe36c4921016e8d251e6f35f5890240b31f1f60ae/jiter-0.14.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5c001d5a646c2a50dc055dd526dad5d5245969e8234d2b1131d0451e81f3a373" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e9/d1/fe0c46cd7fda9cad8f1ff9ad217dc61f1e4280b21052ec6dfe88c1446ef2/jiter-0.14.0-cp311-cp311-manylinux_2_31_riscv64.whl", hash = "sha256:834bb5bdabca2e91592a03d373838a8d0a1b8bbde7077ae6913fd2fc51812d00" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ac/21/f5317f91729b501019184771c80d60abd89907009e7bfa6c7e348c5bdd44/jiter-0.14.0-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:4e9178be60e229b1b2b0710f61b9e24d1f4f8556985a83ff4c4f95920eea7314" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e9/05/79d8f33fb2bf168db0df5c9cd16fe440a8ada57e929d3677b22712c2568f/jiter-0.14.0-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:a7e4ccff04ec03614e62c613e976a3a5860dc9714ce8266f44328bdc8b1cab2c" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/5c/00/d1e3ff3d2a465e67f08507d74bafb2dcd29eba91dc939820e39e8dea38b8/jiter-0.14.0-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:69539d936fb5d55caf6ecd33e2e884de083ff0ea28579780d56c4403094bb8d9" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/60/5b/bbb2189f62ace8d95e869aa4c84c9946616f301e2d02895a6f20dcc3bba3/jiter-0.14.0-cp311-cp311-win32.whl", hash = "sha256:4927d09b3e572787cc5e0a5318601448e1ab9391bcef95677f5840c2d00eaa6d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b8/86/c500b53dcbf08575f5963e536ebd757a1f7c568272ba5d180b212c9a87fb/jiter-0.14.0-cp311-cp311-win_amd64.whl", hash = "sha256:42d6ed359ac49eb922fdd565f209c57340aa06d589c84c8413e42a0f9ae1b842" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/75/4a/a676249049d42cb29bef82233e4fe0524d414cbe3606c7a4b311193c2f77/jiter-0.14.0-cp311-cp311-win_arm64.whl", hash = "sha256:6dd689f5f4a5a33747b28686e051095beb214fe28cfda5e9fe58a295a788f593" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/5a/68/7390a418f10897da93b158f2d5a8bd0bcd73a0f9ec3bb36917085bb759ef/jiter-0.14.0-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:2fb2ce3a7bc331256dfb14cefc34832366bb28a9aca81deaf43bbf2a5659e607" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/60/a0/5854ac00ff63551c52c6c89534ec6aba4b93474e7924d64e860b1c94165b/jiter-0.14.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:5252a7ca23785cef5d02d4ece6077a1b556a410c591b379f82091c3001e14844" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/41/a1/4f44832650a16b18e8391f1bf1d6ca4909bc738351826bcc198bba4357f4/jiter-0.14.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c409578cbd77c338975670ada777add4efd53379667edf0aceea730cabede6fb" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/48/64/a329e9d469f86307203594b1707e11ae51c3348d03bfd514a5f997870012/jiter-0.14.0-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:7ede4331a1899d604463369c730dbb961ffdc5312bc7f16c41c2896415b1304a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/94/c1/5e3dfc59635aa4d4c7bd20a820ac1d09b8ed851568356802cf1c08edb3cf/jiter-0.14.0-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:92cd8b6025981a041f5310430310b55b25ca593972c16407af8837d3d7d2ca01" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e3/1b/dd157009dbc058f7b00108f545ccb72a2d56461395c4fc7b9cfdccb00af4/jiter-0.14.0-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:351bf6eda4e3a7ceb876377840c702e9a3e4ecc4624dbfb2d6463c67ae52637d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/91/78/256013667b7c10b8834f8e6e54cd3e562d4c6e34227a1596addccc05e38c/jiter-0.14.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c1dcfbeb93d9ecd9ca128bbf8910120367777973fa193fb9a39c31237d8df165" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/de/d9/137d65ade9093a409fe80955ce60b12bb753722c986467aeda47faf450ad/jiter-0.14.0-cp312-cp312-manylinux_2_31_riscv64.whl", hash = "sha256:ae039aaef8de3f8157ecc1fdd4d85043ac4f57538c245a0afaecb8321ec951c3" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/2e/48/76750835b87029342727c1a268bea8878ab988caf81ee4e7b880900eeb5a/jiter-0.14.0-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:7d9d51eb96c82a9652933bd769fe6de66877d6eb2b2440e281f2938c51b5643e" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/a6/60/456c4e81d5c8045279aefe60e9e483be08793828800a4e64add8fdde7f2a/jiter-0.14.0-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:d824ca4148b705970bf4e120924a212fdfca9859a73e42bd7889a63a4ea6bb98" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/a8/9f/2020e0984c235f678dced38fe4eec3058cf528e6af36ebf969b410305941/jiter-0.14.0-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:ff3a6465b3a0f54b1a430f45c3c0ba7d61ceb45cbc3e33f9e1a7f638d690baf3" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ef/32/e2d298e1a22a4bbe6062136d1c7192db7dba003a6975e51d9a9eecabc4c2/jiter-0.14.0-cp312-cp312-win32.whl", hash = "sha256:5dec7c0a3e98d2a3f8a2e67382d0d7c3ac60c69103a4b271da889b4e8bb1e129" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/36/ac/96369141b3d8a4a8e4590e983085efe1c436f35c0cda940dd76d942e3e40/jiter-0.14.0-cp312-cp312-win_amd64.whl", hash = "sha256:fc7e37b4b8bc7e80a63ad6cfa5fc11fab27dbfea4cc4ae644b1ab3f273dc348f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/01/c3/75d847f264647017d7e3052bbcc8b1e24b95fa139c320c5f5066fa7a0bdd/jiter-0.14.0-cp312-cp312-win_arm64.whl", hash = "sha256:ee4a72f12847ef29b072aee9ad5474041ab2924106bdca9fcf5d7d965853e057" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/97/2a/09f70020898507a89279659a1afe3364d57fc1b2c89949081975d135f6f5/jiter-0.14.0-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:af72f204cf4d44258e5b4c1745130ac45ddab0e71a06333b01de660ab4187a94" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/d6/be/080c96a45cd74f9fce5db4fd68510b88087fb37ffe2541ff73c12db92535/jiter-0.14.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:4b77da71f6e819be5fbcec11a453fde5b1d0267ef6ed487e2a392fd8e14e4e3a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/7d/5e/2d0fee155826a968a832cc32438de5e2a193292c8721ca70d0b53e58245b/jiter-0.14.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:77f4ea612fe8b84b8b04e51d0e78029ecf3466348e25973f953de6e6a59aa4c1" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/70/af/bf9ee0d3a4f8dc0d679fc1337f874fe60cdbf841ebbb304b374e1c9aaceb/jiter-0.14.0-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:62fe2451f8fcc0240261e6a4df18ecbcd58327857e61e625b2393ea3b468aac9" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/0f/83/8e8561eadba31f4d3948a5b712fb0447ec71c3560b57a855449e7b8ddc98/jiter-0.14.0-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:6112f26f5afc75bcb475787d29da3aa92f9d09c7858f632f4be6ffe607be82e9" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f6/c9/c5299e826a5fe6108d172b344033f61c69b1bb979dd8d9ddd4278a160971/jiter-0.14.0-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:215a6cb8fb7dc702aa35d475cc00ddc7f970e5c0b1417fb4b4ac5d82fa2a29db" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/5d/37/c16d9d15c0a471b8644b1abe3c82668092a707d9bedcf076f24ff2e380cd/jiter-0.14.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fc4ab96a30fb3cb2c7e0cd33f7616c8860da5f5674438988a54ac717caccdbaa" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/58/ea/8050cb0dc654e728e1bfacbc0c640772f2181af5dedd13ae70145743a439/jiter-0.14.0-cp313-cp313-manylinux_2_31_riscv64.whl", hash = "sha256:3a99c1387b1f2928f799a9de899193484d66206a50e98233b6b088a7f0c1edb2" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b0/3b/cf71506d270e5f84d97326bf220e47aed9b95e9a4a060758fb07772170ab/jiter-0.14.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:ab18d11074485438695f8d34a1b6da61db9754248f96d51341956607a8f39985" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b0/cc/8c6c74a3efb5bd671bfd14f51e8a73375464ca914b1551bc3b40e26ac2c9/jiter-0.14.0-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:801028dcfc26ac0895e4964cbc0fd62c73be9fd4a7d7b1aaf6e5790033a719b7" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/41/24/68d7b883ec959884ddf00d019b2e0e82ba81b167e1253684fa90519ce33c/jiter-0.14.0-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:ad425b087aafb4a1c7e1e98a279200743b9aaf30c3e0ba723aec93f061bd9bc8" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b6/89/b1a0985223bbf3150ff9e8f46f98fc9360c1de94f48abe271bbe1b465682/jiter-0.14.0-cp313-cp313-win32.whl", hash = "sha256:882bcb9b334318e233950b8be366fe5f92c86b66a7e449e76975dfd6d776a01f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/4c/19/3f339a5a7f14a11730e67f6be34f9d5105751d547b615ef593fa122a5ded/jiter-0.14.0-cp313-cp313-win_amd64.whl", hash = "sha256:9b8c571a5dba09b98bd3462b5a53f27209a5cbbe85670391692ede71974e979f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/50/56/752dd89c84be0e022a8ea3720bcfa0a8431db79a962578544812ce061739/jiter-0.14.0-cp313-cp313-win_arm64.whl", hash = "sha256:34f19dcc35cb1abe7c369b3756babf8c7f04595c0807a848df8f26ef8298ef92" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/91/28/292916f354f25a1fe8cf2c918d1415c699a4a659ae00be0430e1c5d9ffea/jiter-0.14.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:e89bcd7d426a75bb4952c696b267075790d854a07aad4c9894551a82c5b574ab" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ad/c7/b002a7d8b8957ac3d469bd59c18ef4b1595a5216ae0de639a287b9816023/jiter-0.14.0-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7b25beaa0d4447ea8c7ae0c18c688905d34840d7d0b937f2f7bdd52162c98a40" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f9/3b/f8d07580d8706021d255a6356b8fab13ee4c869412995550ce6ed4ddf97d/jiter-0.14.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:651a8758dd413c51e3b7f6557cdc6921faf70b14106f45f969f091f5cda990ea" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/47/5b/ac1a974da29e35507230383110ffec59998b290a8732585d04e19a9eb5ba/jiter-0.14.0-cp313-cp313t-win_amd64.whl", hash = "sha256:e1a7eead856a5038a8d291f1447176ab0b525c77a279a058121b5fccee257f6f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/96/6d/9fc8433d667d2454271378a79747d8c76c10b51b482b454e6190e511f244/jiter-0.14.0-cp313-cp313t-win_arm64.whl", hash = "sha256:2e692633a12cda97e352fdcd1c4acc971b1c28707e1e33aeef782b0cbf051975" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/4f/1e/354ed92461b165bd581f9ef5150971a572c873ec3b68a916d5aa91da3cc2/jiter-0.14.0-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:6f396837fc7577871ca8c12edaf239ed9ccef3bbe39904ae9b8b63ce0a48b140" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/a6/95/8c7c7028aa8636ac21b7a55faef3e34215e6ed0cbf5ae58258427f621aa3/jiter-0.14.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:a4d50ea3d8ba4176f79754333bd35f1bbcd28e91adc13eb9b7ca91bc52a6cef9" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/47/40/e2a852a44c4a089f2681a16611b7ce113224a80fd8504c46d78491b47220/jiter-0.14.0-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ce17f8a050447d1b4153bda4fb7d26e6a9e74eb4f4a41913f30934c5075bf615" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/fc/1f/670f92adee1e9895eac41e8a4d623b6da68c4d46249d8b556b60b63f949e/jiter-0.14.0-cp314-cp314-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:f4f1c4b125e1652aefbc2e2c1617b60a160ab789d180e3d423c41439e5f32850" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/01/2f/541c9ba567d05de1c4874a0f8f8c5e3fd78e2b874266623da9a775cf46e0/jiter-0.14.0-cp314-cp314-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:be808176a6a3a14321d18c603f2d40741858a7c4fc982f83232842689fe86dd9" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ce/a9/c31cbec09627e0d5de7aeaec7690dba03e090caa808fefd8133137cf45bc/jiter-0.14.0-cp314-cp314-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:26679d58ba816f88c3849306dd58cb863a90a1cf352cdd4ef67e30ccf8a77994" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/50/02/3c05c1666c41904a2f607475a73e7a4763d1cbde2d18229c4f85b22dc253/jiter-0.14.0-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:80381f5a19af8fa9aef743f080e34f6b25ebd89656475f8cf0470ec6157052aa" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/7d/97/e15b33545c2b13518f560d695f974b9891b311641bdcf178d63177e8801e/jiter-0.14.0-cp314-cp314-manylinux_2_31_riscv64.whl", hash = "sha256:004df5fdb8ecbd6d99f3227df18ba1a259254c4359736a2e6f036c944e02d7c5" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ad/d2/8b1461def6b96ba44530df20d07ef7a1c7da22f3f9bf1727e2d611077bf1/jiter-0.14.0-cp314-cp314-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:cff5708f7ed0fa098f2b53446c6fa74c48469118e5cd7497b4f1cd569ab06928" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e3/88/837566dd6ed6e452e8d3205355afd484ce44b2533edfa4ed73a298ea893e/jiter-0.14.0-cp314-cp314-musllinux_1_1_aarch64.whl", hash = "sha256:2492e5f06c36a976d25c7cc347a60e26d5470178d44cde1b9b75e60b4e519f28" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/89/6b/b00b45c4d1b4c031777fe161d620b755b5b02cdade1e316dcb46e4471d63/jiter-0.14.0-cp314-cp314-musllinux_1_1_x86_64.whl", hash = "sha256:7609cfbe3a03d37bfdbf5052012d5a879e72b83168a363deae7b3a26564d57de" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ad/d8/6fe5b42011d19397433d345716eac16728ac241862a2aac9c91923c7509a/jiter-0.14.0-cp314-cp314-win32.whl", hash = "sha256:7282342d32e357543565286b6450378c3cd402eea333fc1ebe146f1fabb306fc" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e5/43/5c2e08da1efad5e410f0eaaabeadd954812612c33fbbd8fd5328b489139d/jiter-0.14.0-cp314-cp314-win_amd64.whl", hash = "sha256:bd77945f38866a448e73b0b7637366afa814d4617790ecd88a18ca74377e6c02" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/aa/1f/6e39ac0b4cdfa23e606af5b245df5f9adaa76f35e0c5096790da430ca506/jiter-0.14.0-cp314-cp314-win_arm64.whl", hash = "sha256:f2d4c61da0821ee42e0cdf5489da60a6d074306313a377c2b35af464955a3611" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/05/57/7dbc0ffbbb5176a27e3518716608aa464aee2e2887dc938f0b900a120449/jiter-0.14.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:1bf7ff85517dd2f20a5750081d2b75083c1b269cf75afc7511bdf1f9548beb3b" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/83/6e/7b3314398d8983f06b557aa21b670511ec72d3b79a68ee5e4d9bff972286/jiter-0.14.0-cp314-cp314t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c8ef8791c3e78d6c6b157c6d360fbb5c715bebb8113bc6a9303c5caff012754a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ae/4f/8dc674bcd7db6dba566de73c08c763c337058baff1dbeb34567045b27cdc/jiter-0.14.0-cp314-cp314t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:e74663b8b10da1fe0f4e4703fd7980d24ad17174b6bb35d8498d6e3ebce2ae6a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/3b/5f/188e09a1f20906f98bbdec44ed820e19f4e8eb8aff88b9d1a5a497587ff3/jiter-0.14.0-cp314-cp314t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:1aca29ba52913f78362ec9c2da62f22cdc4c3083313403f90c15460979b84d9b" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ac/f0/19046ef965ed8f349e8554775bb12ff4352f443fbe12b95d31f575891256/jiter-0.14.0-cp314-cp314t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:8b39b7d87a952b79949af5fef44d2544e58c21a28da7f1bae3ef166455c61746" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/c4/c3/da43bd8431ee175695777ee78cf0e93eacbb47393ff493f18c45231b427d/jiter-0.14.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:78d918a68b26e9fab068c2b5453577ef04943ab2807b9a6275df2a812599a310" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/72/26/e054771be889707c6161dbdec9c23d33a9ec70945395d70f07cfea1e9a6f/jiter-0.14.0-cp314-cp314t-manylinux_2_31_riscv64.whl", hash = "sha256:b08997c35aee1201c1a5361466a8fb9162d03ae7bf6568df70b6c859f1e654a4" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/c3/0f/7bea65ea2a6d91f2bf989ff11a18136644392bf2b0497a1fa50934c30a9c/jiter-0.14.0-cp314-cp314t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:260bf7ca20704d58d41f669e5e9fe7fe2fa72901a6b324e79056f5d52e9c9be2" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/3c/a1/b1ff7d70deef61ac0b7c6c2f12d2ace950cdeecb4fdc94500a0926802857/jiter-0.14.0-cp314-cp314t-musllinux_1_1_aarch64.whl", hash = "sha256:37826e3df29e60f30a382f9294348d0238ef127f4b5d7f5f8da78b5b9e050560" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/0b/7b/3b0649983cbaf15eda26a414b5b1982e910c67bd6f7b1b490f3cfc76896a/jiter-0.14.0-cp314-cp314t-musllinux_1_1_x86_64.whl", hash = "sha256:645be49c46f2900937ba0eaf871ad5183c96858c0af74b6becc7f4e367e36e06" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/97/f8/33d78c83bd93ae0c0af05293a6660f88a1977caef39a6d72a84afab94ce0/jiter-0.14.0-cp314-cp314t-win32.whl", hash = "sha256:2f7877ed45118de283786178eceaf877110abacd04fde31efff3940ae9672674" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/d6/ac/2b760516c03e2227826d1f7025d89bf6bf6357a28fe75c2a2800873c50bf/jiter-0.14.0-cp314-cp314t-win_amd64.whl", hash = "sha256:14c0cb10337c49f5eafe8e7364daca5e29a020ea03580b8f8e6c597fed4e1588" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/dc/2e/a44c20c58aeed0355f2d326969a181696aeb551a25195f47563908a815be/jiter-0.14.0-cp314-cp314t-win_arm64.whl", hash = "sha256:5419d4aa2024961da9fe12a9cfe7484996735dca99e8e090b5c88595ef1951ff" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/32/a1/ef34ca2cab2962598591636a1804b93645821201cc0095d4a93a9a329c9d/jiter-0.14.0-graalpy311-graalpy242_311_native-macosx_10_12_x86_64.whl", hash = "sha256:a25ffa2dbbdf8721855612f6dca15c108224b12d0c4024d0ac3d7902132b4211" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/60/bb/520576a532a6b8a6f42747afed289c8448c879a34d7802fe2c832d4fd38f/jiter-0.14.0-graalpy311-graalpy242_311_native-macosx_11_0_arm64.whl", hash = "sha256:0ac9cbaa86c10996b92bd12c91659b60f939f8e28fcfa6bc11a0e90a774ce95b" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b2/7c/c16db114ea1f2f532f198aa8dc39585026af45af362c69a0492f31bc4821/jiter-0.14.0-graalpy311-graalpy242_311_native-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:844e73b6c56b505e9e169234ea3bdea2ea43f769f847f47ac559ba1d2361ebea" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/99/8f/15e7741ff19e9bcd4d753f7ff22f988fd54592f134ca13701c13ea8c20e0/jiter-0.14.0-graalpy311-graalpy242_311_native-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e52c076f187405fc21523c746c04399c9af8ece566077ed147b2126f2bcba577" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/21/42/9042c3f3019de4adcb8c16591c325ec7255beea9fcd33a42a43f3b0b1000/jiter-0.14.0-graalpy312-graalpy250_312_native-macosx_10_12_x86_64.whl", hash = "sha256:fbd9e482663ca9d005d051330e4d2d8150bb208a209409c10f7e7dfdf7c49da9" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/60/cf/a7e19b308bd86bb04776803b1f01a5f9a287a4c55205f4708827ee487fbf/jiter-0.14.0-graalpy312-graalpy250_312_native-macosx_11_0_arm64.whl", hash = "sha256:33a20d838b91ef376b3a56896d5b04e725c7df5bc4864cc6569cf046a8d73b6d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ca/44/e26ede3f0caeff93f222559cb0cc4ca68579f07d009d7b6010c5b586f9b1/jiter-0.14.0-graalpy312-graalpy250_312_native-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:432c4db5255d86a259efde91e55cb4c8d18c0521d844c9e2e7efcce3899fb016" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/da/e9/1f9ada30cef7b05e74bb06f52127e7a724976c225f46adb65c37b1dadfb6/jiter-0.14.0-graalpy312-graalpy250_312_native-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:67f00d94b281174144d6532a04b66a12cb866cbdc47c3af3bfe2973677f9861a" },
+]
+
 [[package]]
 name = "jmespath"
 version = "0.10.0"
@@ -2653,6 +2765,25 @@ wheels = [
     { url = "https://mirrors.aliyun.com/pypi/packages/be/9c/92789c596b8df838baa98fa71844d84283302f7604ed565dafe5a6b5041a/oauthlib-3.3.1-py3-none-any.whl", hash = "sha256:88119c938d2b8fb88561af5f6ee0eec8cc8d552b7bb1f712743136eb7523b7a1" },
 ]
 
+[[package]]
+name = "openai"
+version = "2.36.0"
+source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
+dependencies = [
+    { name = "anyio" },
+    { name = "distro" },
+    { name = "httpx" },
+    { name = "jiter" },
+    { name = "pydantic" },
+    { name = "sniffio" },
+    { name = "tqdm" },
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://mirrors.aliyun.com/pypi/packages/f4/a1/4d5e84cf51720fc1526cc49e10ac1961abcccb55b0efb3d970db1e9a2728/openai-2.36.0.tar.gz", hash = "sha256:139dea0edd2f1b30c33d46ae1a6929e03906254140318e4608e98fe8c566f2e7" }
+wheels = [
+    { url = "https://mirrors.aliyun.com/pypi/packages/9d/1c/5d43735b2553baae2a5e899dcbcd0670a86930d993184d72ca909bf11c9b/openai-2.36.0-py3-none-any.whl", hash = "sha256:143f6194b548dbc2c921af1f1b03b9f14c85fed8a75b5b516f5bcc11a2a50c63" },
+]
+
 [[package]]
 name = "opencensus"
 version = "0.11.4"
@@ -4118,6 +4249,8 @@ builder = [
 model-service = [
     { name = "alibabacloud-cr20181201" },
     { name = "fastapi" },
+    { name = "httpx" },
+    { name = "openai" },
     { name = "psutil" },
     { name = "swebench" },
     { name = "uvicorn" },
@@ -4180,10 +4313,12 @@ requires-dist = [
     { name = "gem-llm", marker = "extra == 'rocklet'", specifier = ">=0.1.0" },
     { name = "gem-llm", marker = "extra == 'sandbox-actor'", specifier = ">=0.1.0" },
     { name = "httpx" },
+    { name = "httpx", marker = "extra == 'model-service'" },
     { name = "kubernetes", marker = "extra == 'admin'", specifier = ">=35.0.0" },
     { name = "nacos-sdk-python", marker = "extra == 'admin'", specifier = ">=0.1.14" },
     { name = "nacos-sdk-python", marker = "extra == 'sandbox-actor'", specifier = ">=0.1.14" },
     { name = "numpy", marker = "extra == 'rocklet'", specifier = "<=2.2.6" },
+    { name = "openai", marker = "extra == 'model-service'", specifier = ">=1.50.0" },
     { name = "opentelemetry-api" },
     { name = "opentelemetry-exporter-otlp" },
     { name = "opentelemetry-exporter-prometheus" },