From 55ec3142467344b07b074ea635dcc142019e201f Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 02:52:26 +0000
Subject: [PATCH 01/25] refactor(model-service): rebuild proxy on litellm SDK
 with traj record/replay
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

替换 model-service `proxy` 模式手写的 httpx forward + retry_async,改为基于
litellm SDK 调用,同时新增 chat/completions 轨迹的录制与顺序回放能力,服务
SWE-agent / mini-swe-agent / OpenHands 等 deterministic agent 的无 LLM 成本调试。

主要改动:
- proxy.py 改用 litellm.acompletion(num_retries / extra_headers / streaming)
- 新增 TrajectoryRecorder(CustomLogger) 录制 StandardLoggingPayload 到 JSONL
- 新增 TrajectoryReplayer(CustomLLM) + SequentialCursor 顺序回放单个 jsonl 文件
- ModelServiceConfig 新增 num_retries / traj_enabled / traj_file / replay_traj_path
- CLI 新增 --num-retries / --traj-file(同时承担 replay 入口)
- local 模式保留旧 record_traj 装饰器,不受影响
- 删除 examples 旧 YAML,改 README 主推纯 CLI 启动方式
- docs/dev/litellm_proxy_refactor.md 写明设计与 breaking change

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 docs/dev/litellm_proxy_refactor.md            | 535 ++++++++++++++++++
 examples/model_service/README.md              |  90 +++
 pyproject.toml                                |   1 +
 rock/sdk/model/server/api/proxy.py            | 203 ++++---
 rock/sdk/model/server/config.py               |  13 +
 .../sdk/model/server/integrations/__init__.py |   0
 .../server/integrations/traj_recorder.py      |  77 +++
 .../server/integrations/traj_replayer.py      | 139 +++++
 rock/sdk/model/server/main.py                 |  55 +-
 rock/sdk/model/server/utils.py                |  15 +-
 tests/unit/sdk/model/test_proxy.py            | 455 +++++++--------
 tests/unit/sdk/model/test_traj_recorder.py    | 170 ++++++
 tests/unit/sdk/model/test_traj_replayer.py    | 204 +++++++
 uv.lock                                       | 404 +++++++++++++
 14 files changed, 2025 insertions(+), 336 deletions(-)
 create mode 100644 docs/dev/litellm_proxy_refactor.md
 create mode 100644 examples/model_service/README.md
 create mode 100644 rock/sdk/model/server/integrations/__init__.py
 create mode 100644 rock/sdk/model/server/integrations/traj_recorder.py
 create mode 100644 rock/sdk/model/server/integrations/traj_replayer.py
 create mode 100644 tests/unit/sdk/model/test_traj_recorder.py
 create mode 100644 tests/unit/sdk/model/test_traj_replayer.py

diff --git a/docs/dev/litellm_proxy_refactor.md b/docs/dev/litellm_proxy_refactor.md
new file mode 100644
index 0000000000..be30dbb4b2
--- /dev/null
+++ b/docs/dev/litellm_proxy_refactor.md
@@ -0,0 +1,535 @@
+# LiteLLM 重构 model-service proxy + 加 record/replay —— Handoff 文档
+
+> 这份文档是给"接手者"(可能是另一个 Claude session 或人)看的,目的是让接手者**完全不看上一段对话**也能从我离开的地方继续往下做。文档放在 `docs/dev/litellm_proxy_refactor.md`。
+
+---
+
+## 0. TL;DR
+
+**目标**:把 `rock model-service --type proxy` 的自写 httpx forward + retry 替换为基于 `litellm` SDK 的实现;同时把 chat/completions 轨迹的"录制 + 顺序回放"作为一等公民能力做进来,服务 SWE-agent / mini-swe-agent / OpenHands 类 deterministic agent 的"无 LLM 成本"调试。
+
+**当前状态**:**代码改动、单元测试、lint 全部完成通过**。下一步是集成验证(实际起 proxy + curl)和写 PR。
+
+**完成清单**:
+- ✅ `pyproject.toml` `model-service` extras 加 `litellm>=1.50.0`
+- ✅ `ModelServiceConfig` 加 `traj_enabled / traj_file / traj_append / replay_enabled / replay_traj_path / num_retries` 6 个字段
+- ✅ 新模块 `rock/sdk/model/server/integrations/{__init__.py, traj_recorder.py, traj_replayer.py}`
+- ✅ `rock/sdk/model/server/api/proxy.py` 整文件重写为 litellm SDK 调用
+- ✅ `rock/sdk/model/server/main.py` 加 `_configure_litellm_for_proxy()` + 新 CLI flags(`--num-retries / --traj-file / --no-traj / --replay-traj`)
+- ✅ `rock/sdk/model/server/utils.py` 保留 `record_traj` 装饰器(给 local 模式继续用),proxy 模式不再用
+- ✅ `tests/unit/sdk/model/test_proxy.py` 改造完成(把 `patch perform_llm_request` 改为 `patch litellm.acompletion`)
+- ✅ 新测试 `tests/unit/sdk/model/test_traj_recorder.py` + `test_traj_replayer.py`
+- ✅ `examples/model_service/config_record.yaml` + `config_replay.yaml`
+- ✅ **单元测试全部通过**(`uv run pytest tests/unit/sdk/model/` → 47 passed)
+- ✅ **Lint/format 全部干净**(`ruff check` + `ruff format --check`，修了一个 `Optional[str]` → `str | None` 的 UP045)
+
+**未完成 / 阻塞**:
+- ⏳ **集成验证**(实际起 proxy + curl + agent 端到端,见第 4.4 节)
+- ⏳ **PR 描述里的 breaking change 提示**(见第 5 节)
+
+**原始 plan 文件**(更详细的设计推演):`/home/xinshi/.claude/plans/litellm-chat-completions-traj-replay-ser-lucky-rainbow.md`(在主 Claude 配置目录,不在 rock 仓内)。
+
+---
+
+## 1. 背景与目标
+
+### 起因
+
+用户问:"litellm 能支持把 chat/completions 接口的轨迹落盘吗,然后我想看看能否支持根据 traj 文件做一个 replay server, 比如给一些其他的 agent (swe-agent, openhands) 等用来做 traj 回放"。
+
+### 需求方向的几次迭代(避免接手者重走弯路)
+
+1. **第一版方向**:做一个独立 Python 项目 `litellm-traj`,里面定义 `CustomLogger` 子类(record)和 `CustomLLM` 子类(replay),通过 dotted-path 注册到 litellm proxy 的 `config.yaml`。**已废弃**。
+2. **第二版方向**:在 rock 仓内把这个能力做进 `rock/sdk/model/server/api/proxy.py`(rock 已有 model-service)。但用户进一步要求:**重构掉 rock 自写的 proxy 实现,改为基于 litellm**。
+3. **最终方向(本次)**:用 **litellm SDK** 替换 `proxy.py` 内手写的 httpx forward + `retry_async`;record 接 `CustomLogger`,replay 接 `CustomLLM` provider。`rock model-service` CLI、`local` 模式、FastAPI app/health/metrics 全部保留不动 —— 只动 proxy 模式。
+
+### 为什么是 litellm SDK 而不是 litellm proxy
+
+我们已经有 rock 自己的 FastAPI app + CLI + auth/metrics middleware,只需要一个"OpenAI 兼容上游调用 + 错误归一化 + 流式聚合 + record/replay 接入点"。**litellm SDK 是这层能力的最小外加**,不需要把 litellm proxy 整套生命周期/配置体系拽进来。litellm proxy 适合"完全没有 server 的人"用,我们已经有 server。
+
+### 用户最终拍板的 4 个关键设计选择
+
+| 维度 | 选择 | 理由 |
+|---|---|---|
+| 集成模式 | **litellm SDK** | 改动面最小,保留 rock 既有 FastAPI/CLI/metrics |
+| traj schema | **`StandardLoggingPayload`(litellm 原生)** | 字段最全(messages/response/usage/timing/error_information/trace_id),与 litellm 生态互通 |
+| 是否本期做 replay | **是,record + replay 一起** | 用户原始诉求就是回放;基础设施一次性铺好 |
+| 流式 | **顺便解禁** | litellm 自动聚合,record/replay 走流式不增加复杂度 |
+
+---
+
+## 2. 改动清单(按文件)
+
+### 2.1 `pyproject.toml` —— 修改
+
+`[project.optional-dependencies]` 的 `model-service` 数组追加一项 `"litellm>=1.50.0"`。其它 extras 不动。
+
+```toml
+model-service = [
+    "fastapi",
+    "uvicorn",
+    "psutil",
+    "swebench",
+    "alibabacloud_cr20181201==2.0.5",
+    "litellm>=1.50.0",   # ← 这一行新加
+]
+```
+
+为什么是 `>=1.50.0`:这个版本之后 `StandardLoggingPayload`、`CustomLogger.async_log_success_event` 接口、`async_mock_completion_streaming_obj` 都已稳定。本仓现有 model-service 测试集没装过 litellm,所以全新引入,不存在升级冲突。
+
+### 2.2 `rock/sdk/model/server/config.py` —— 修改
+
+在 `ModelServiceConfig` 末尾新增 6 个字段(注意顺序、类型、默认值):
+
+```python
+num_retries: int = Field(default=6)
+
+traj_enabled: bool = Field(default=True)
+traj_file: str | None = Field(default=None)
+traj_append: bool = Field(default=True)   # 注意:旧默认是 False(覆盖),这里翻成 True
+
+replay_enabled: bool = Field(default=False)
+replay_traj_path: str | None = Field(default=None)
+```
+
+每个字段的语义和取值范围都写在 docstring 里。`traj_append=True` 是这次的**默认行为变更**(旧的 `_write_traj` 默认覆盖,被认为是 bug)。`TRAJ_FILE`、`LOG_FILE`、`LOG_DIR` 模块级常量保留不动。
+
+### 2.3 `rock/sdk/model/server/integrations/__init__.py` —— 新增(空文件)
+
+只为了让 `integrations` 成为一个包,内容为空。
+
+### 2.4 `rock/sdk/model/server/integrations/traj_recorder.py` —— 新增
+
+`TrajectoryRecorder(CustomLogger)`,实现两个钩子:`async_log_success_event` 和 `async_log_failure_event`。每次调用从 `kwargs["standard_logging_object"]` 取出 `StandardLoggingPayload`(dict 形态),append 一行 JSON 到 `traj_file`,同时上报 OTLP `model_service.request.{rt,count}` metrics。
+
+关键设计点(展开见第 3.1 节):
+- streaming 不分支(litellm 已在 callback 触发前把 chunks 聚合写入 `payload.response`)
+- `asyncio.Lock` per recorder + `asyncio.to_thread` 包同步写,避免在 event loop 阻塞
+- `append=False` 模式只在**首次写**时截断(避免每次调用覆盖)
+- metrics 复用 `rock.sdk.model.server.utils._get_or_create_metrics_monitor`,`MODEL_SERVICE_REQUEST_RT/COUNT` 常量
+
+### 2.5 `rock/sdk/model/server/integrations/traj_replayer.py` —— 新增
+
+包含两个类 + 两个 helper:
+
+- `SequentialCursor`:从 jsonl 文件或目录加载 records,`async next()` 返回下一条并推进游标,越界 raise `CustomLLMError(404)`。带 `asyncio.Lock` 防并发推进。`reset()` 用于回到起点。
+- `_record_to_model_response(record)` / `_extract_assistant_text(record)`:把 record 还原成 `litellm.types.utils.ModelResponse` 或抽出 assistant text(给 streaming 用)。
+- `TrajectoryReplayer(CustomLLM)`:实现 `acompletion` 和 `astreaming`。流式拆分直接调 `litellm.utils.async_mock_completion_streaming_obj`,不自己造轮子。
+
+`acompletion`/`astreaming` 的签名是 `(self, model, messages, *args, **kwargs)`。litellm 调 CustomLLM 时**全部用关键字参数**(litellm/main.py:4302-4319 实测),所以 `kwargs.get("model_response")` 能可靠拿到流式拆分需要的目标对象。
+
+### 2.6 `rock/sdk/model/server/utils.py` —— 修改(保留 + 注释更新)
+
+**关键决定**:不删 `record_traj` / `_write_traj`。原因:`local.py` 仍在用 `@record_traj`,plan 阶段说过"local 模式不动";所以 record_traj 保留,docstring 加一段说明"proxy 不再用,只给 local 用",新引导走 `TrajectoryRecorder`。
+
+`_get_or_create_metrics_monitor` / `MODEL_SERVICE_REQUEST_RT` / `MODEL_SERVICE_REQUEST_COUNT` 不动 —— `traj_recorder.py` 复用之。
+
+### 2.7 `rock/sdk/model/server/api/proxy.py` —— 整文件重写
+
+旧实现:
+- `httpx.AsyncClient` 全局 + `@retry_async` 6 次指数退避
+- `perform_llm_request(url, body, headers, config)` 自管 retry
+- `@record_traj` 挂在 handler 上同步落盘 + metrics
+- 强制 `stream=False`(MVP 限制)
+
+新实现:
+- `litellm.acompletion(model, api_base, extra_headers, timeout, num_retries, **body)`
+- 错误归一化:catch `RateLimitError / APIError / BadRequestError / AuthenticationError / Timeout` → `_format_error_response()` 回退到 `{error:{message,type,code}}` schema(agent 端关键字检测兼容)
+- 流式开放:`stream=True` 走 `StreamingResponse(_sse_iter(...))`
+- 不再有装饰器 —— record 落盘改由 `main.py` 在启动时挂的 `litellm.callbacks` 完成
+
+`get_base_url()` 路由优先级**完全保留**(`proxy_base_url` > `proxy_rules[model]` > `proxy_rules["default"]`)。`_filter_headers()` 把 hop-by-hop headers(host/content-length/content-type/transfer-encoding/connection)滤掉,Authorization 等保留。
+
+replay 模式下:`litellm_model = f"traj-replay/{model_name}"`,`api_base=None`。litellm 看到 `traj-replay/` 前缀会查 `litellm.custom_provider_map`,找到 `TrajectoryReplayer` 实例并调它的 `acompletion`/`astreaming`。
+
+### 2.8 `rock/sdk/model/server/main.py` —— 修改
+
+新增私有函数 `_configure_litellm_for_proxy(config)`,在 `main()` 进入 proxy 分支时(`include_router(proxy_router)` 之前)调用一次。两个分支:
+
+```python
+if config.replay_enabled:
+    # 注册 TrajectoryReplayer 到 litellm.custom_provider_map
+    ...
+elif config.traj_enabled:
+    # 把 TrajectoryRecorder 加到 litellm.callbacks
+    ...
+```
+
+**注意**:replay 和 record 互斥(replay 不要再录,否则录回放结果会污染 source-of-truth)。
+
+`create_config_from_args()` 新增 4 个 CLI override:`--num-retries / --traj-file / --no-traj / --replay-traj`。所有用 `getattr(args, "<name>", default)` 的方式取,这样老的调用方(传不带这些字段的 Namespace)不会炸。
+
+`from rock.sdk.model.server.config import TRAJ_FILE, ModelServiceConfig` —— 新增 `TRAJ_FILE` 导入,因为 `_configure_litellm_for_proxy` 在 `traj_file` 未指定时回退到 `TRAJ_FILE`。
+
+### 2.9 `tests/unit/sdk/model/test_proxy.py` —— 重写
+
+- 删除:`test_perform_llm_request_*`(4 个,perform_llm_request 已不存在)
+- 改造:`test_chat_completions_routing_*`、`test_proxy_base_url_overrides_proxy_rules` —— `patch_path` 从 `proxy.perform_llm_request` 改为 `proxy.litellm.acompletion`
+- 改造:断言从"perform_llm_request 第一个位置参数 == URL"改为"litellm.acompletion kwargs 中 `api_base == 期望值`,`model == 'openai/<name>'`"
+- 新增:`test_chat_completions_passes_num_retries_and_timeout` / `test_chat_completions_litellm_error_returns_proxy_schema` / `test_chat_completions_replay_mode_uses_traj_replay_provider` / `test_chat_completions_strips_hop_by_hop_headers` / `test_config_default_traj_and_replay` / `test_config_loads_traj_and_replay_from_file` / `test_cli_replay_traj_enables_replay`
+- 保留:所有 lifespan / config-load / metrics-monitor / record_traj 测试(record_traj 在 utils.py 还在,给 local 用)
+
+mock 返回的 ModelResponse:用 `SimpleNamespace(model_dump=lambda: payload)` 假装一个 pydantic 对象 —— 因为 handler 只调 `.model_dump()`,不需要真 import 整个 ModelResponse。
+
+### 2.10 `tests/unit/sdk/model/test_traj_recorder.py` —— 新增
+
+7 个测试:JSONL append / `append=False` 首次截断 / metrics + sandbox_id / failure 落盘 / 缺 standard_logging_object 跳过 / 自动建父目录 / `response_time` 缺失时回退到 `endTime - startTime`。
+
+mock 思路:`patch("rock.sdk.model.server.integrations.traj_recorder._get_or_create_metrics_monitor", return_value=mock_monitor)` —— recorder 内部 import 了这个函数,mock 它的引用。
+
+### 2.11 `tests/unit/sdk/model/test_traj_replayer.py` —— 新增
+
+11 个测试:cursor 加载单文件/目录(按文件名 sort)/空行/缺失文件 raise / `next()` 顺序返回 / 越界 raise / `reset()` 回到起点 / model mismatch 只 warn / Replayer.acompletion 命中 record / cursor 推进 / streaming chunk 拼回 == 原文 / 越界 raise CustomLLMError。
+
+streaming 测试构造一个 `SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(role=None, content=None), index=0)])` 当 model_response,因为 `async_mock_completion_streaming_obj` 内部会写 `model_response.choices[0].delta.content = ...`。
+
+### 2.12 `examples/model_service/config_record.yaml` 和 `config_replay.yaml` —— 新增
+
+两份开箱即用的 yaml,带详细注释。`config_record.yaml` 默认开 `traj_enabled: true / traj_append: true`,关 replay。`config_replay.yaml` 默认关 traj_enabled / 开 replay,`replay_traj_path: "/data/logs/LLMTraj.jsonl"` 占位 —— 实际部署时根据 traj 位置改。
+
+### 2.13 `/mnt/xinshi/github/litellm-traj/` —— 已删除
+
+第一版独立项目骨架(`pyproject.toml / src/litellm_traj/cursor.py / .gitignore / LICENSE`)在方向变更时已 `rm -rf`。所有有效内容都迁回了 rock 的 integrations/ 模块。
+
+---
+
+## 3. 关键代码细节(踩坑点 + "为什么这么写")
+
+下文展开几个最容易让接手者迷失的设计选择。每一项都标了 litellm 仓内的源码定位(litellm 主仓在 `/mnt/xinshi/github/litellm/`),便于交叉验证。
+
+### 3.1 Streaming 聚合在 litellm 内部完成,Recorder 不需要分支
+
+`StandardLoggingPayload.response` 字段在 `success_handler` 触发前**已经是聚合完整的 OpenAI shape dict**。流式与非流式走同一条路径:litellm 在 streaming 结束时调用 `stream_chunk_builder` 拼出 `complete_streaming_response`(litellm 仓 `litellm/litellm_core_utils/litellm_logging.py:1930-1955`),然后写入 `standard_logging_object.response`。
+
+实际后果:`TrajectoryRecorder.async_log_success_event` 拿到的 payload 永远含完整 response,我**不需要写 `async_log_stream_event`**。这也是为什么 stream 解禁几乎"零成本" —— 录制端无任何额外代码。
+
+### 3.2 `model: "openai/<name>"` 前缀的含义
+
+litellm 把"provider"前缀作为路由依据。`openai/gpt-3.5-turbo` 表示"上游是 OpenAI 兼容协议的服务,模型名叫 gpt-3.5-turbo"。配合 `api_base="https://api.modelscope.cn/v1"` 这种第三方 OpenAI 兼容 endpoint 也能用 —— 这正是 rock 现有 `proxy_rules` 里的 ModelScope/OpenAI 等场景。
+
+`traj-replay/<name>` 是我们注册的自定义 provider。litellm 看到这个前缀会查 `litellm.custom_provider_map`,匹配到 `provider == "traj-replay"` 的项,把 `custom_handler.acompletion`/`astreaming` 当上游调(litellm 仓 `litellm/main.py:4280-4326`)。
+
+### 3.3 错误归一化:为什么 catch 那 5 个 exception
+
+`proxy.py` catch 顺序:`RateLimitError, APIError, BadRequestError, AuthenticationError, Timeout`。这五个在 `litellm/exceptions.py` 全部继承自 `openai.OpenAIError` 派生类,**都带 `.status_code` 属性**。`_format_error_response` 用 `getattr(exc, "status_code", None) or 502` 提取上游真实状态码;message 走 `str(exc)` —— litellm 异常的 `__str__` 已经包含"上游原始 error message",所以 agent 端的关键字检测(如 `"context length exceeded"` / `"content violation"`)继续工作。
+
+`type` 字段用 `type(exc).__name__`(`"BadRequestError"` 等),不再是旧的固定 `"proxy_retry_failed"`。这是 schema 的语义变化:同一个 `error.type` 字段,旧版本返回固定字符串,新版本返回 exception 类名。如果有下游消费 `error.type` 做分支,需要适配。
+
+兜底 `except Exception` 走 `HTTPException(500)`,会被 `main.py` 里的 `global_exception_handler` 接住,返回 `{error:{message,type:"internal_error",code:"internal_error"}}` —— 这条路径与重构前完全一致。
+
+### 3.4 retry 行为:从 `retry_async` 切到 `litellm.num_retries`
+
+旧实现:`@retry_async(max_attempts=6, delay_seconds=2.0, backoff=2.0, jitter=True, exceptions=(TimeoutException, ConnectError, HTTPStatusError))`。仅在 `status_code in retryable_status_codes` 时 raise,这样 401 不会触发 retry,而 429/500 会。
+
+新实现:`config.num_retries`(默认 6) 直接传给 `litellm.acompletion(num_retries=...)`。litellm 内部对 `RateLimitError / APIError / Timeout / ServiceUnavailableError` 自动重试,**不暴露 `retryable_status_codes` 维度**。我保留 `retryable_status_codes` 字段在 config 里,但当前**handler 没用它**(向后兼容旧 yaml,不会因为多了字段而 reject)。
+
+如果将来有人投诉"自定义重试码列表失效",这是已知的语义差异。fallback 方案:在 handler 里手写 `for attempt in range(config.num_retries):` 包一层,根据 status code 做白名单。本期不做,因为 litellm 默认行为已经覆盖最常见的 429/500。
+
+### 3.5 `_filter_headers` 黑名单 vs 白名单
+
+我用黑名单:`host / content-length / content-type / transfer-encoding / connection` 不转发,其余全部透传给 litellm 的 `extra_headers`。这与旧实现保持一致(旧的也是去掉前 4 个,新增 connection 是为了更标准)。Authorization/X-* 等都自动通过。
+
+注意:`extra_headers` 在 litellm 里被合并到上游 HTTP 请求里(litellm 自己的 OpenAI client),不会覆盖 litellm 自己生成的 `Authorization: Bearer <api_key>`。如果 rock 不主动设 `OPENAI_API_KEY`,而 client 又传了 Authorization header,litellm 会用 client 的;反之 litellm 会用环境变量。这一层逻辑全在 litellm 自己。
+
+### 3.6 `traj_append=False` 的"首次截断"行为
+
+旧 `_write_traj` 在 `append=False` 时**每次调用都 `mode="w"`**,导致 jsonl 永远只有最后一行 —— 这是个 bug。
+
+新 `TrajectoryRecorder` 的修复:维护一个 `self._truncated` 实例标志;`append=False` 时,**第一次写**用 `mode="w"`(覆盖上一进程留下的旧 traj),**后续写**用 `mode="a"`(本进程内 append)。所以:
+- 进程启动时:旧 traj 文件清空(如果存在)
+- 进程运行中:每次调用 append 一行
+- 进程重启:再次清空,从头记
+
+效果上等于"per-run 一份完整 traj"。我把这个语义在 docstring 里讲清楚了,因为这是和旧默认行为最不同的一点。
+
+`traj_append=True`(新默认)就是纯 append-only,不管旧文件。
+
+### 3.7 SequentialCursor 的并发模型
+
+`async next()` 用 `asyncio.Lock` 保护索引 + 自增。**单进程多并发请求场景下** cursor 推进是原子的,但**含义是"按到达顺序消费"**,所以多个 agent 并发打过来会被串成一个伪顺序 —— 这是 v1 的已知约束(plan 里明确列出),约定"单 agent 串行回放"。
+
+**model mismatch 只 warn 不 raise**:expected_model 来自调用方传入,recorded model 来自 record 内的 `model` 字段。两者不一致只打 warning,record 仍然返回。理由:agent 端可能切换了 base_url 但没改 model 名(常见调试场景),不该硬阻塞。
+
+### 3.8 CustomLLM 的调用约定 —— `*args, **kwargs` 收尾很重要
+
+`litellm/main.py:4302-4319` 实测调用方式是**全关键字参数**:
+```python
+response = handler_fn(
+    model=model, messages=messages, headers=headers,
+    model_response=model_response, print_verbose=...,
+    api_key=..., api_base=..., acompletion=..., logging_obj=...,
+    optional_params=..., litellm_params=..., logger_fn=...,
+    timeout=..., custom_prompt_dict=..., client=..., encoding=...,
+)
+```
+但 litellm 各小版本会不会增减字段不确定。`TrajectoryReplayer.acompletion(self, model, messages, *args, **kwargs)` 这种"显式 model+messages,其余吞掉"的签名,既能 PEP-484 注解,又对 litellm 后续加字段免疫。
+
+**不要改成 `def acompletion(self, model, messages, *, optional_params, ...)`** 否则 litellm 加新字段时会 TypeError。
+
+### 3.9 `LITELLM_TRAJ_FILE` env vs `traj_file` 字段
+
+我没引入新 env var。`config.traj_file` 在 `main.py:_configure_litellm_for_proxy` 里通过 `config.traj_file or TRAJ_FILE` 取值,而 `TRAJ_FILE` 来自 `config.py:13`,= `LOG_DIR + "/LLMTraj.jsonl"`,`LOG_DIR = env_vars.ROCK_MODEL_SERVICE_DATA_DIR`(默认 `/data/logs`)。
+
+所以路径优先级:`--traj-file CLI` > `traj_file: yaml` > `LOG_DIR/LLMTraj.jsonl`(LOG_DIR 受 `ROCK_MODEL_SERVICE_DATA_DIR` env 控制)。和旧体系一致。
+
+### 3.10 `record_traj` 装饰器为什么保留
+
+`local.py:75` 仍然用 `@record_traj` 装饰它的 chat_completions handler。local 模式不调 litellm,FileHandler 直接通过文件 marker 跟 Roll 通信 —— 没有 litellm callback 触发的窗口。所以为了保留 local 模式的"调用次数 + RT 上报",我把 `record_traj` 留在 `utils.py`,让 local 继续用,docstring 写明"proxy 模式不再用,改走 TrajectoryRecorder"。
+
+代价:local 模式录的 traj schema 是旧的 `{request, response}`,proxy 模式是 `StandardLoggingPayload`。两种 schema 共存于同一个 `LLMTraj.jsonl` 文件路径上(因为 `TRAJ_FILE` 是同一个常量)。**实际部署时 local 和 proxy 用同一个进程的概率为 0**(`--type` 互斥),所以同一个 traj 文件不会混合两种 schema。但如果有人定时切换 `--type` 跑 + `traj_append=true` 不轮换文件,会出现混合。文档建议:**replay 时只读 proxy 模式录的 traj**(StandardLoggingPayload 格式),local 模式的 traj 仅用于 local 调试。
+
+---
+
+## 4. 跑测试 / 验证步骤(接手者从这里继续)
+
+### 4.1 准备 Python 环境
+
+**已验证**:`uv sync` 后 litellm 已正常安装。使用 `uv run` 执行,不需要手动激活 venv。
+
+```bash
+cd /mnt/xinshi/github/Self-ROCK
+uv sync --extra model-service --group test
+```
+
+验证依赖(已通过):
+
+```bash
+uv run python -c "from litellm.integrations.custom_logger import CustomLogger; print('ok')"
+uv run python -c "from litellm.llms.custom_llm import CustomLLM, CustomLLMError; print('ok')"
+uv run python -c "from litellm.utils import async_mock_completion_streaming_obj; print('ok')"
+```
+
+### 4.2 静态检查 / lint
+
+```bash
+uv run ruff check rock/sdk/model/server/ tests/unit/sdk/model/
+uv run ruff format --check rock/sdk/model/server/ tests/unit/sdk/model/
+```
+
+如果 ruff format 报 diff,直接 `uv run ruff format rock/sdk/model/server/ tests/unit/sdk/model/` 修。代码写的时候我没跑 ruff,可能有 line-length / import 排序之类的小问题。
+
+### 4.3 单测(已全部通过)
+
+```bash
+uv run pytest tests/unit/sdk/model/ -v
+# → 47 passed in ~4s
+```
+
+**已验证通过的测试集**:
+- `test_proxy.py` (27 个):routing/error/replay/header/cli/config/metrics
+- `test_traj_recorder.py` (7 个):JSONL append/truncate/metrics/failure/missing payload/mkdir/rt fallback
+- `test_traj_replayer.py` (11 个):cursor 加载/顺序/越界/reset/model mismatch/acompletion/streaming/exhaustion
+- `test_model_client.py` (2 个):原有测试保留通过
+
+**已知但不影响测试的边界情况**(生产注意):
+- tool_calls 场景下 `_extract_assistant_text` 返回 `""`,replay 流式会返回空流(已知限制,不在本期范围)
+- `litellm.callbacks` 是全局 list,测试隔离靠 patch,生产只起一次 server 无问题
+
+### 4.4 集成验证(测试通过后)
+
+#### Record 模式
+
+```bash
+# 终端 1
+export OPENAI_API_KEY="sk-..."
+export ROCK_MODEL_SERVICE_DATA_DIR=/tmp/rock-traj
+mkdir -p /tmp/rock-traj
+uv run python -m rock.sdk.model.server.main \
+    --type proxy \
+    --config-file examples/model_service/config_record.yaml \
+    --port 8080
+
+# 终端 2
+curl -X POST http://127.0.0.1:8080/v1/chat/completions \
+    -H "Authorization: Bearer $OPENAI_API_KEY" \
+    -H "Content-Type: application/json" \
+    -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"say hi"}]}'
+
+# 验证 traj
+cat /tmp/rock-traj/LLMTraj.jsonl | jq '.id, .model, .response.choices[0].message.content'
+# 应该看到 chatcmpl-xxx / gpt-3.5-turbo / "..."
+```
+
+#### Replay 模式
+
+```bash
+# 终端 1
+uv run python -m rock.sdk.model.server.main \
+    --type proxy \
+    --replay-traj /tmp/rock-traj/LLMTraj.jsonl \
+    --port 8081
+
+# 终端 2 - 同样的 curl 打 8081
+curl -X POST http://127.0.0.1:8081/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"anything (replay ignores msgs)"}]}'
+
+# 应该返回与录制时同样的 response.choices[0].message.content
+# 第二次 curl 会 404(traj exhausted),证明 cursor 在工作
+```
+
+#### Streaming 验证
+
+```bash
+curl -N -X POST http://127.0.0.1:8080/v1/chat/completions \
+    -H "Authorization: Bearer $OPENAI_API_KEY" \
+    -H "Content-Type: application/json" \
+    -d '{"model":"gpt-3.5-turbo","stream":true,"messages":[{"role":"user","content":"count to 5"}]}'
+# 应该看到 SSE chunks: data: {...}\n\n ... data: [DONE]\n\n
+# traj 文件里那一行的 .stream == true,.response 是聚合后的完整 dict
+```
+
+#### Agent 端到端(最终验证)
+
+`mini-swe-agent` 跑一个 SWE-bench 实例,base_url 指向 8080(record),完了用同 instance 接 8081(replay),期望 agent 最终生成的 patch 与录制时一致。这是最强 check,但跑起来麻烦,可以在 PR review 阶段再做。
+
+---
+
+## 5. Breaking Changes(PR 描述里必须写清楚)
+
+### 5.1 traj 文件 schema 改变
+
+`LLMTraj.jsonl` 每行从 `{"request": {...}, "response": {...}}` 变成 `StandardLoggingPayload`(几十个字段:`id/trace_id/model/messages/response/model_parameters/usage/startTime/endTime/status/...`)。
+
+如果有下游消费者依赖旧的两字段 schema(脚本、UI、统计),会破坏。本期不提供"双格式输出"或"旧→新转换"工具,如有需要可单独写 `scripts/convert_traj.py`。
+
+### 5.2 `traj_append` 默认值翻转
+
+旧的 `ROCK_MODEL_SERVICE_TRAJ_APPEND_MODE` 默认 `"false"` → `_write_traj` 用 `mode="w"`,实际表现是"每次调用覆盖,文件只剩最后一条"。新的 `ModelServiceConfig.traj_append` 默认 `True`(append-only)。
+
+如果有人**之前依赖每次都覆盖来获取"最近一次调用"**(很罕见但可能),需要在 yaml 显式设 `traj_append: false`。
+
+### 5.3 `error.type` 字段语义变化
+
+旧值:固定字符串 `"proxy_retry_failed"`(retry 用尽)或 `"internal_error"`(其他)。
+新值:litellm 异常类名,如 `"BadRequestError" / "RateLimitError" / "Timeout" / "AuthenticationError" / "APIError"`。
+
+`error.message` 仍以 `"LLM backend error: ..."` 开头,关键字检测兼容。
+
+### 5.4 `retryable_status_codes` 字段不再生效
+
+旧版本根据 `retryable_status_codes` 白名单决定哪些状态码触发 retry(如 401 不 retry,429/500 retry)。新版本由 litellm 内部决定(对 `RateLimitError / APIError / Timeout / ServiceUnavailableError` 自动 retry,4xx 一般不 retry)。
+
+字段保留在 yaml 不报错,但 handler 不读它。如果将来需要恢复白名单,见 3.4 节"fallback 方案"。
+
+### 5.5 `stream=true` 不再被强制拒绝
+
+旧版本对 `stream=true` 返回 400 + `"Streaming requests (stream=True) are not supported"`。新版本正常处理,返回 SSE。
+
+如果有 client 之前**依赖** 400 来探测"是否启用流式",会破坏。但这种用法很反常,基本不会有。
+
+### 5.6 `perform_llm_request` 函数已删除
+
+下游不应该 import 这个 —— 它本来就是 proxy.py 内的 helper。如果有 test/script 直接 import 它,需要适配。`tests/unit/sdk/model/test_proxy.py` 我已改完。
+
+### 5.7 新的依赖
+
+`pip install rl-rock[model-service]` 会多装 litellm(及其依赖链:`openai>=1.x / tiktoken / aiohttp / tokenizers / ...`)。包大小 +~50MB。
+
+---
+
+## 6. 已知坑 / 接手时的注意事项
+
+### 6.1 `local.py` 仍在 import `record_traj`
+
+我**没改 local.py**(plan 明确"local 不动")。`local.py:12` 的 `from rock.sdk.model.server.utils import record_traj` 仍然成立,因为 utils.py 保留了 record_traj。如果接手者看到这个 import 想清理,**不要清理** —— 那会破坏 local 模式。
+
+### 6.2 `litellm.callbacks` 是全局 list
+
+`main.py:_configure_litellm_for_proxy` 用 `litellm.callbacks.append(recorder)`。如果同一进程多次启动(测试场景),会注册多次,导致每次调用落多份 traj。生产部署只跑一次没问题。**如果要写"重复初始化也安全"的逻辑**,可以改成 `if not any(isinstance(cb, TrajectoryRecorder) for cb in litellm.callbacks): litellm.callbacks.append(recorder)`。我没做,因为生产路径是"启动一次"。
+
+同理 `litellm.custom_provider_map = [...]` 是赋值不是 append,所以 replay 重复初始化是幂等的。
+
+### 6.3 SequentialCursor 在测试里要小心 cursor 跨用例
+
+`SequentialCursor` 是实例属性 `self._idx`,每个测试自己 `SequentialCursor.load(p)` 都是新实例,不会跨用例污染。但如果有人写"模块级单例 replayer + 多个测试调它"的 fixture,会撞 idx。当前测试都是 per-test 实例,OK。
+
+### 6.4 `litellm` import 较慢
+
+litellm import 时会加载几个 OpenAI/HuggingFace 客户端,首次 import 可能 1-2 秒。`main.py` 把 `import litellm` 放在 `_configure_litellm_for_proxy()` 内部(函数级延迟 import),只在 proxy 模式启动时触发。`proxy.py` 是模块顶级 `import litellm`,handler 文件首次加载就触发 —— 这是 fastapi 路由注册时的开销,不影响请求路径性能。
+
+### 6.5 `pyproject.toml` 的 `tzdata` 依赖
+
+我看到 pyproject.toml 里 ide_diagnostics 报 `httpx/uuid/anyio/tzdata/...` 未安装 —— 这是 ide 当前 Python 环境没装 rock 主仓依赖,与本次改动无关。`uv sync` 后这些 hint 自动消失。
+
+### 6.6 `__pycache__` 残留
+
+旧 `proxy.py` 有 `__pycache__/proxy.cpython-310.pyc`。重写后第一次 import 会重新生成,**正常情况下没问题**。如果跑测试时报 `ImportError: cannot import name 'perform_llm_request'`,先 `find rock -name __pycache__ -exec rm -rf {} +` 清掉缓存。
+
+### 6.7 别忘了 `extra_headers` 可能含敏感信息
+
+`_filter_headers` 把所有非 hop-by-hop header 透传给上游,包括 client 传的 `Authorization`。这是**故意的** —— 让 client 自己带 API key 是 rock 现有约定。但意味着 traj 录的 `StandardLoggingPayload.metadata.headers`(如果有) 可能含 Bearer token。litellm 自己有 `turn_off_message_logging` / `redact_user_api_key_info` 等开关,**目前没启用**。如果将来 traj 文件要分发,需要先脱敏。
+
+---
+
+## 7. 不在本次范围 / 后续扩展(v2)
+
+### 不在范围(明确不做)
+
+- local 模式(`--type local`)的任何改动
+- DB 持久化(traj 只走 JSONL)
+- 旧 `{request, response}` traj 的兼容读取(replay 只接受新 schema)
+- SWE-agent / OpenHands 原生 traj 格式互转
+- replay 时 streaming 的细粒度时序还原(只保证 chunk 序列正确)
+- tool_calls 的增量流式拆分(本期 streaming replay 只到 message-level chunk)
+
+### 后续扩展(留了接口)
+
+- **基于 messages hash 的乱序匹配**:`SequentialCursor` 旁加 `HashMatcher`,通过 `replay_mode: sequential | hash` 切换。当 agent 内部不严格按录制顺序调 LLM(分支/retry)时用。
+- **多并发回放**:用请求 metadata 中的 `run_id` 路由到不同 cursor;`SequentialCursor` 改成 `dict[run_id, Cursor]`。
+- **passthrough on miss**:cursor 用尽时回落到真 LLM(`import litellm; await litellm.acompletion(...)`)。用于"录到一半 traj 不够长"的调试场景。
+- **`/admin/reset` HTTP 端点**:不重启 proxy 即可把 cursor 归零。
+- **`scripts/convert_traj.py`**:把 SWE-agent `.traj` 或 OpenHands event log 转成 StandardLoggingPayload,反向也行。
+- **traj 脱敏 hook**:写盘前过 `redact_keys: list[str]` 把指定字段抹掉。
+
+---
+
+## 8. 关键路径速查
+
+### Rock 仓内(本次改动的)
+
+| 路径 | 角色 |
+|---|---|
+| `pyproject.toml` | model-service extras 加 litellm |
+| `rock/sdk/model/server/config.py` | ModelServiceConfig 新字段 |
+| `rock/sdk/model/server/api/proxy.py` | 重写为 litellm SDK |
+| `rock/sdk/model/server/main.py` | `_configure_litellm_for_proxy` + 新 CLI flags |
+| `rock/sdk/model/server/utils.py` | 保留 record_traj 给 local |
+| `rock/sdk/model/server/integrations/__init__.py` | 空,只为成包 |
+| `rock/sdk/model/server/integrations/traj_recorder.py` | TrajectoryRecorder(CustomLogger) |
+| `rock/sdk/model/server/integrations/traj_replayer.py` | SequentialCursor + TrajectoryReplayer(CustomLLM) |
+| `rock/sdk/model/server/api/local.py` | **没改**(仍用 record_traj) |
+| `tests/unit/sdk/model/test_proxy.py` | 改造完 |
+| `tests/unit/sdk/model/test_traj_recorder.py` | 新 |
+| `tests/unit/sdk/model/test_traj_replayer.py` | 新 |
+| `examples/model_service/config_record.yaml` | 新 |
+| `examples/model_service/config_replay.yaml` | 新 |
+
+### litellm 仓(交叉验证用,在 `/mnt/xinshi/github/litellm/`)
+
+| 关注点 | 路径 |
+|---|---|
+| CustomLogger 接口(基类) | `litellm/integrations/custom_logger.py:67` |
+| CustomLLM 接口(基类) | `litellm/llms/custom_llm.py:47` |
+| StandardLoggingPayload schema | `litellm/types/utils.py:2764` |
+| streaming 聚合写入 payload | `litellm/litellm_core_utils/litellm_logging.py:1930-1955` |
+| async_mock_completion_streaming_obj | `litellm/utils.py:6831` |
+| custom_provider_map 加载流程(实际是怎么调 acompletion 的) | `litellm/main.py:4280-4326` |
+| LiteLLM 异常基类(status_code 来源) | `litellm/exceptions.py` |
+
+### 历史 / 对话产物
+
+- 原始 plan 文件(详细设计推演): `/home/xinshi/.claude/plans/litellm-chat-completions-traj-replay-ser-lucky-rainbow.md`
+- 已废弃的独立项目骨架: `/mnt/xinshi/github/litellm-traj/`(已 `rm -rf`)
+
+---
+
+## 9. 给接手者的 1 分钟上手
+
+1. `cd /mnt/xinshi/github/Self-ROCK && uv sync --extra model-service --group test`
+2. `uv run pytest tests/unit/sdk/model/ -v` → 应得 **47 passed**(已验证)
+3. 跑集成验证(第 4.4 节)
+4. 写 PR 描述,**重点说第 5 节的 breaking changes**
+5. PR 评审里如果有人问"为什么不沿用 retry_async 的 status code 白名单",答:见第 3.4 节(litellm 默认 retry 已覆盖最常见场景,白名单后续可选加)
+
+如果想了解整个项目背景而不只是这次 refactor,看顶层 `CLAUDE.md`。如果想知道 litellm 内部细节,看 `/mnt/xinshi/github/litellm/CLAUDE.md`(litellm 主仓的)。
diff --git a/examples/model_service/README.md b/examples/model_service/README.md
new file mode 100644
index 0000000000..7a169764fe
--- /dev/null
+++ b/examples/model_service/README.md
@@ -0,0 +1,90 @@
+# model-service proxy 用法示例
+
+`rock model-service` 的 `proxy` 模式把 `/v1/chat/completions` 转发到上游 LLM，并把每次调用以
+`StandardLoggingPayload` 格式 append 到 JSONL traj 文件。配合 `--traj-file` 可以让相同 base URL 的
+agent（SWE-agent / mini-swe-agent / OpenHands）从录制的 traj 回放，实现"无 LLM 成本"调试。
+
+下面所有命令都用 `python -m rock.sdk.model.server.main` 启动，等价于 `rock model-service start`。
+
+## 1. Record 模式（默认）
+
+转发到单个上游，每次调用 append 到 `LOG_DIR/LLMTraj.jsonl`：
+
+```bash
+export OPENAI_API_KEY="sk-..."
+export ROCK_MODEL_SERVICE_DATA_DIR=/tmp/rock-traj   # traj 文件落盘根目录
+
+python -m rock.sdk.model.server.main \
+    --type proxy \
+    --proxy-base-url https://api.openai.com/v1 \
+    --port 8080
+```
+
+调用：
+
+```bash
+curl -X POST http://127.0.0.1:8080/v1/chat/completions \
+    -H "Authorization: Bearer $OPENAI_API_KEY" \
+    -H "Content-Type: application/json" \
+    -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"hi"}]}'
+
+# 查看 traj
+cat /tmp/rock-traj/LLMTraj.jsonl | jq '.id, .model, .response.choices[0].message.content'
+```
+
+支持流式（litellm 自动聚合写入 traj）：
+
+```bash
+curl -N -X POST http://127.0.0.1:8080/v1/chat/completions \
+    -H "Authorization: Bearer $OPENAI_API_KEY" \
+    -H "Content-Type: application/json" \
+    -d '{"model":"gpt-3.5-turbo","stream":true,"messages":[{"role":"user","content":"count to 5"}]}'
+```
+
+## 2. Replay 模式
+
+把 `--traj-file` 指到一个录好的 jsonl，proxy 不再访问真实 LLM，按录制顺序返回响应：
+
+```bash
+python -m rock.sdk.model.server.main \
+    --type proxy \
+    --traj-file /tmp/rock-traj/LLMTraj.jsonl \
+    --port 8081
+```
+
+agent 把 base URL 换成 `http://127.0.0.1:8081/v1` 即可重放，cursor 用尽后返回 404。
+`--traj-file` 必须是单个 jsonl 文件路径。
+
+## 3. 调整重试和超时
+
+```bash
+python -m rock.sdk.model.server.main \
+    --type proxy \
+    --proxy-base-url https://api.openai.com/v1 \
+    --num-retries 3 \
+    --request-timeout 60 \
+    --port 8080
+```
+
+## 4. 多模型路由（需要 YAML）
+
+只有在按 model name 分流到不同上游时才需要 YAML（CLI 只暴露单一 `--proxy-base-url`）。新建
+`routes.yaml`：
+
+```yaml
+proxy_rules:
+  gpt-3.5-turbo: "https://api.openai.com/v1"
+  gpt-4o: "https://api.openai.com/v1"
+  default: "https://api-inference.modelscope.cn/v1"
+```
+
+启动时配合 CLI：
+
+```bash
+python -m rock.sdk.model.server.main \
+    --type proxy \
+    --config-file routes.yaml \
+    --port 8080
+```
+
+CLI 上指定的 `--proxy-base-url` / `--port` / `--num-retries` 等仍会覆盖 YAML 的同名字段。
diff --git a/pyproject.toml b/pyproject.toml
index badb7d1a4b..bf814e0aa1 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -86,6 +86,7 @@ model-service = [
     "psutil",
     "swebench",
     "alibabacloud_cr20181201==2.0.5",
+    "litellm>=1.50.0",
 ]
 
 
diff --git a/rock/sdk/model/server/api/proxy.py b/rock/sdk/model/server/api/proxy.py
index fb2b7bec3c..4894430641 100644
--- a/rock/sdk/model/server/api/proxy.py
+++ b/rock/sdk/model/server/api/proxy.py
@@ -1,13 +1,28 @@
+"""OpenAI-compatible chat/completions proxy backed by the litellm SDK.
+
+The proxy ``/v1/chat/completions`` handler routes a request to the configured
+upstream LLM (or to the in-process traj-replay handler when ``replay_traj_path``
+is set), forwards header/body, and applies retry via litellm's ``num_retries``.
+
+Trajectory recording is wired up at startup in
+``rock.sdk.model.server.main`` by registering ``TrajectoryRecorder`` as a
+``litellm.callbacks`` entry — this handler does not carry a ``@record_traj``
+decorator anymore.
+"""
+
+from __future__ import annotations
+
+import json
+from collections.abc import AsyncIterator
 from typing import Any
 
-import httpx
+import litellm
 from fastapi import APIRouter, HTTPException, Request
-from fastapi.responses import JSONResponse
+from fastapi.responses import JSONResponse, StreamingResponse
+from litellm.exceptions import APIError, AuthenticationError, BadRequestError, RateLimitError, Timeout
 
 from rock.logger import init_logger
 from rock.sdk.model.server.config import ModelServiceConfig
-from rock.sdk.model.server.utils import record_traj
-from rock.utils import retry_async
 
 logger = init_logger(__name__)
 
@@ -15,40 +30,21 @@
 proxy_router = APIRouter()
 
 
-# Global HTTP client with a persistent connection pool
-http_client = httpx.AsyncClient()
-
-
-@retry_async(
-    max_attempts=6,
-    delay_seconds=2.0,
-    backoff=2.0,  # Exponential backoff (2s, 4s, 8s, 16s, 32s).
-    jitter=True,  # Adds randomness to prevent "thundering herd" effect on the backend.
-    exceptions=(httpx.TimeoutException, httpx.ConnectError, httpx.HTTPStatusError),
-)
-async def perform_llm_request(url: str, body: dict, headers: dict, config: ModelServiceConfig):
-    """
-    Forwards the request and triggers retry ONLY if the status code
-    is in the explicit retryable whitelist.
-    """
-    response = await http_client.post(url, json=body, headers=headers, timeout=config.request_timeout)
-    status_code = response.status_code
-
-    # Check against the explicit whitelist
-    if status_code in config.retryable_status_codes:
-        logger.warning(f"Retryable error detected: {status_code}. Triggering retry for {url}...")
-        response.raise_for_status()
-
-    return response
+# Headers we never forward upstream:
+#   - host / content-length / content-type: litellm rewrites the body and re-targets,
+#     so the client's values would be wrong or misleading
+#   - transfer-encoding / connection: true RFC 7230 hop-by-hop headers, scoped to
+#     the client↔proxy connection only
+_HEADERS_NOT_TO_FORWARD = frozenset({"host", "content-length", "content-type", "transfer-encoding", "connection"})
 
 
 def get_base_url(model_name: str, config: ModelServiceConfig) -> str:
-    """
-    Selects the target backend URL based on model name matching.
+    """Pick the upstream base URL by model name.
 
-    If proxy_base_url is configured, it takes precedence over proxy_rules.
+    ``proxy_base_url`` takes precedence; falls back to ``proxy_rules[model]`` and
+    then ``proxy_rules["default"]``. Trailing slashes are stripped so the caller
+    can append ``/chat/completions`` directly.
     """
-    # If direct proxy base URL is configured, return it directly (bypass model name matching)
     if config.proxy_base_url:
         return config.proxy_base_url.rstrip("/")
 
@@ -59,67 +55,108 @@ def get_base_url(model_name: str, config: ModelServiceConfig) -> str:
     base_url = rules.get(model_name) or rules.get("default")
     if not base_url:
         raise HTTPException(
-            status_code=400, detail=f"Model '{model_name}' is not configured and no 'default' rule found."
+            status_code=400,
+            detail=f"Model '{model_name}' is not configured and no 'default' rule found.",
         )
 
     return base_url.rstrip("/")
 
 
+def _filter_headers(headers) -> dict[str, str]:
+    forwarded = {}
+    for key, value in headers.items():
+        if key.lower() in _HEADERS_NOT_TO_FORWARD:
+            continue
+        forwarded[key] = value
+    return forwarded
+
+
+def _format_error_response(exc: Exception) -> JSONResponse:
+    """Render a litellm exception as the legacy ``{error:{message,type,code}}`` JSON.
+
+    Agent-side logic keys off message substrings (e.g. "context length exceeded",
+    "content violation"), so we keep the message verbatim from the upstream.
+    """
+    status_code = getattr(exc, "status_code", None) or 502
+    message = str(exc)
+    error_type = type(exc).__name__
+    return JSONResponse(
+        status_code=status_code,
+        content={
+            "error": {
+                "message": f"LLM backend error: {message}",
+                "type": error_type,
+                "code": status_code,
+            }
+        },
+    )
+
+
+async def _sse_iter(stream: AsyncIterator[Any]) -> AsyncIterator[bytes]:
+    """Convert a litellm async chunk stream into Server-Sent Events bytes."""
+    try:
+        async for chunk in stream:
+            payload = chunk.model_dump() if hasattr(chunk, "model_dump") else chunk
+            yield f"data: {json.dumps(payload, ensure_ascii=False)}\n\n".encode()
+    finally:
+        yield b"data: [DONE]\n\n"
+
+
 @proxy_router.post("/v1/chat/completions")
-@record_traj
 async def chat_completions(body: dict[str, Any], request: Request):
+    """OpenAI-compatible chat completions proxy endpoint.
+
+    Routes via ``proxy_base_url`` / ``proxy_rules``, forwards Authorization-style
+    headers, supports streaming, retries via litellm. In replay mode the request
+    is dispatched to the registered ``traj-replay`` CustomLLM provider instead
+    of being forwarded upstream.
     """
-    OpenAI-compatible chat completions proxy endpoint.
-    Handles routing, header transparent forwarding, and automatic retries.
-    """
-    config = request.app.state.model_service_config
+    config: ModelServiceConfig = request.app.state.model_service_config
 
-    # Step 1: Model Routing
     model_name = body.get("model", "")
-    base_url = get_base_url(model_name, config)
-    target_url = f"{base_url}/chat/completions"
-    logger.info(f"Routing model '{model_name}' to URL: {target_url}")
-
-    # Step 2: Header Cleaning
-    # Preserve 'Authorization' for authentication while removing hop-by-hop transport headers.
-    forwarded_headers = {}
-    for key, value in request.headers.items():
-        if key.lower() in ["host", "content-length", "content-type", "transfer-encoding"]:
-            continue
-        forwarded_headers[key] = value
 
-    # Step 3: Strategy Enforcement
-    # Force non-streaming mode for the MVP phase to ensure stability.
-    if body.get("stream") is True:
-        raise HTTPException(
-            status_code=400,
-            detail="Streaming requests (stream=True) are not supported in the current version. Please set stream=False or omit the stream parameter.",
-        )
-    body["stream"] = False
+    # 1. Route selection
+    if config.replay_traj_path:
+        litellm_model = f"traj-replay/{model_name or 'replay'}"
+        api_base: str | None = None
+        logger.info(f"[replay] dispatching '{model_name}' to traj-replay handler")
+    else:
+        api_base = get_base_url(model_name, config)
+        # Tell litellm to treat the upstream as an OpenAI-compatible server.
+        litellm_model = f"openai/{model_name}" if model_name else "openai/default"
+        logger.info(f"Routing model '{model_name}' to {api_base}")
+
+    # 2. Header forwarding (preserve Authorization, drop hop-by-hop)
+    extra_headers = _filter_headers(request.headers)
+
+    # 3. Build call kwargs (transparent passthrough of body fields)
+    call_kwargs = dict(body)
+    call_kwargs.pop("model", None)  # avoid duplicate kwargs
+    is_stream = bool(call_kwargs.get("stream"))
 
     try:
-        # Step 4: Execute Request with Retry Logic
-        response = await perform_llm_request(target_url, body, forwarded_headers, config)
-        return JSONResponse(status_code=response.status_code, content=response.json())
-
-    except httpx.HTTPStatusError as e:
-        # Forward the raw backend error message to the client.
-        # This allows the Agent-side logic to detect keywords like 'context length exceeded'
-        # or 'content violation' and raise appropriate exceptions.
-        error_text = e.response.text if e.response else "No error details"
-        status_code = e.response.status_code if e.response else 502
-        logger.error(f"Final failure after retries. Status: {status_code}, Response: {error_text}")
-        return JSONResponse(
-            status_code=status_code,
-            content={
-                "error": {
-                    "message": f"LLM backend error: {error_text}",
-                    "type": "proxy_retry_failed",
-                    "code": status_code,
-                }
-            },
+        response = await litellm.acompletion(
+            model=litellm_model,
+            api_base=api_base,
+            extra_headers=extra_headers,
+            timeout=config.request_timeout,
+            num_retries=config.num_retries,
+            **call_kwargs,
         )
-    except Exception as e:
-        logger.error(f"Unexpected proxy error: {str(e)}")
-        # Raise standard 500 for non-HTTP related errors or system errors
-        raise HTTPException(status_code=500, detail=str(e))
+    except (RateLimitError, APIError, BadRequestError, AuthenticationError, Timeout) as exc:
+        logger.warning(f"litellm error for model '{model_name}': {exc}")
+        return _format_error_response(exc)
+    except Exception as exc:  # pragma: no cover - last-resort safety net
+        logger.error(f"Unexpected proxy error: {exc}", exc_info=True)
+        raise HTTPException(status_code=500, detail=str(exc))
+
+    # 4. Streaming vs non-streaming response
+    if is_stream:
+        return StreamingResponse(_sse_iter(response), media_type="text/event-stream")
+
+    # litellm returns a ModelResponse pydantic; expose the OpenAI-shape dict.
+    if hasattr(response, "model_dump"):
+        body_out = response.model_dump()
+    else:
+        body_out = response  # already a dict (replay path can short-circuit)
+    return JSONResponse(status_code=200, content=body_out)
diff --git a/rock/sdk/model/server/config.py b/rock/sdk/model/server/config.py
index 2c96992b5c..8c992fb4b3 100644
--- a/rock/sdk/model/server/config.py
+++ b/rock/sdk/model/server/config.py
@@ -51,6 +51,19 @@ class ModelServiceConfig(BaseModel):
     request_timeout: int = Field(default=120)
     """Request timeout in seconds."""
 
+    num_retries: int = Field(default=6)
+    """Number of retries for retryable failures (passed through to litellm)."""
+
+    traj_enabled: bool = Field(default=True)
+    """When True, write each chat/completions call as a JSONL trajectory line."""
+
+    traj_file: str | None = Field(default=None)
+    """Override default trajectory file path. None → uses TRAJ_FILE (LOG_DIR/LLMTraj.jsonl)."""
+
+    replay_traj_path: str | None = Field(default=None)
+    """Path to a .jsonl trajectory file or a directory of .jsonl files for replay mode.
+    When set, requests are served from recorded responses instead of a real upstream."""
+
     @classmethod
     def from_file(cls, config_path: str | None = None):
         """
diff --git a/rock/sdk/model/server/integrations/__init__.py b/rock/sdk/model/server/integrations/__init__.py
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/rock/sdk/model/server/integrations/traj_recorder.py b/rock/sdk/model/server/integrations/traj_recorder.py
new file mode 100644
index 0000000000..6aa01a8eed
--- /dev/null
+++ b/rock/sdk/model/server/integrations/traj_recorder.py
@@ -0,0 +1,77 @@
+"""Record chat/completions trajectories as JSONL via litellm's CustomLogger hook.
+
+One line per call, each line is a ``StandardLoggingPayload`` dict from litellm.
+Streaming chunks are aggregated by litellm before this callback fires (see
+litellm/litellm_core_utils/litellm_logging.py around line 1930), so we don't
+need to handle the streaming/non-streaming split ourselves.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+import os
+from pathlib import Path
+
+from litellm.integrations.custom_logger import CustomLogger
+
+from rock.logger import init_logger
+from rock.sdk.model.server.utils import (
+    MODEL_SERVICE_REQUEST_COUNT,
+    MODEL_SERVICE_REQUEST_RT,
+    _get_or_create_metrics_monitor,
+)
+
+logger = init_logger(__name__)
+
+
+class TrajectoryRecorder(CustomLogger):
+    """litellm CustomLogger that appends each call's StandardLoggingPayload to JSONL
+    and reports OTLP RT/count metrics."""
+
+    def __init__(self, traj_file: str | os.PathLike) -> None:
+        super().__init__()
+        self.traj_file = Path(traj_file)
+        self.traj_file.parent.mkdir(parents=True, exist_ok=True)
+        self._lock = asyncio.Lock()
+        self._monitor = _get_or_create_metrics_monitor()
+
+    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
+        payload = kwargs.get("standard_logging_object")
+        if payload is None:
+            logger.debug("[traj-recorder] success event without standard_logging_object, skipping")
+            return
+        await self._append_jsonl(payload)
+        self._record_metrics(payload, status="success")
+
+    async def async_log_failure_event(self, kwargs, response_obj, start_time, end_time):
+        payload = kwargs.get("standard_logging_object")
+        if payload is None:
+            return
+        await self._append_jsonl(payload)
+        self._record_metrics(payload, status="failure")
+
+    async def _append_jsonl(self, payload: dict) -> None:
+        line = json.dumps(payload, ensure_ascii=False, default=str) + "\n"
+        async with self._lock:
+            await asyncio.to_thread(self._write_line, line)
+
+    def _write_line(self, line: str) -> None:
+        with self.traj_file.open("a", encoding="utf-8") as f:
+            f.write(line)
+
+    def _record_metrics(self, payload: dict, *, status: str) -> None:
+        rt_seconds = payload.get("response_time")
+        if rt_seconds is None:
+            start = payload.get("startTime")
+            end = payload.get("endTime")
+            rt_seconds = (end - start) if (start is not None and end is not None) else 0.0
+        rt_ms = float(rt_seconds) * 1000.0
+
+        attrs = {
+            "type": "chat_completions",
+            "status": status,
+            "sandbox_id": os.getenv("ROCK_SANDBOX_ID", "unknown"),
+        }
+        self._monitor.record_gauge_by_name(MODEL_SERVICE_REQUEST_RT, rt_ms, attributes=attrs)
+        self._monitor.record_counter_by_name(MODEL_SERVICE_REQUEST_COUNT, 1, attributes=attrs)
diff --git a/rock/sdk/model/server/integrations/traj_replayer.py b/rock/sdk/model/server/integrations/traj_replayer.py
new file mode 100644
index 0000000000..c87c0fe75f
--- /dev/null
+++ b/rock/sdk/model/server/integrations/traj_replayer.py
@@ -0,0 +1,139 @@
+"""Replay a recorded trajectory by registering a litellm CustomLLM provider.
+
+Loads a single JSONL trajectory file on init, then hands records out one at a
+time in recorded order. This is the simplest matching strategy and works for
+deterministic agent runs that replay the same sequence of LLM calls
+(SWE-agent / mini-swe-agent / OpenHands).
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+import os
+from collections.abc import AsyncIterator
+from pathlib import Path
+from typing import Any
+
+from litellm.llms.custom_llm import CustomLLM, CustomLLMError
+from litellm.types.utils import GenericStreamingChunk, ModelResponse
+from litellm.utils import async_mock_completion_streaming_obj
+
+from rock.logger import init_logger
+
+logger = init_logger(__name__)
+
+
+class SequentialCursor:
+    """Hands out trajectory records one at a time, in recorded order.
+
+    Going past the end raises CustomLLMError(404) so the proxy returns a clear
+    error to the caller.
+    """
+
+    def __init__(self, records: list[dict]) -> None:
+        self.records = records
+        self._idx = 0
+        self._lock = asyncio.Lock()
+
+    @classmethod
+    def load(cls, path: str | os.PathLike) -> SequentialCursor:
+        path = Path(path)
+        if not path.is_file():
+            raise FileNotFoundError(f"traj file not found: {path}")
+
+        records: list[dict] = []
+        with path.open("r", encoding="utf-8") as fp:
+            for line in fp:
+                line = line.strip()
+                if not line:
+                    continue
+                records.append(json.loads(line))
+
+        logger.info(f"[traj-replay] loaded {len(records)} record(s) from {path}")
+        return cls(records)
+
+    async def next(self, expected_model: str | None = None) -> dict:
+        async with self._lock:
+            if self._idx >= len(self.records):
+                raise CustomLLMError(
+                    status_code=404,
+                    message=(f"trajectory exhausted at step {self._idx} (total recorded steps={len(self.records)})"),
+                )
+            record = self.records[self._idx]
+            self._idx += 1
+            current_idx = self._idx - 1
+
+        if expected_model:
+            recorded_model = record.get("model")
+            if recorded_model and recorded_model != expected_model:
+                logger.warning(
+                    f"[traj-replay] step {current_idx} model mismatch: "
+                    f"recorded={recorded_model!r} requested={expected_model!r}"
+                )
+        return record
+
+    def reset(self) -> None:
+        self._idx = 0
+
+    @property
+    def position(self) -> int:
+        return self._idx
+
+    @property
+    def total(self) -> int:
+        return len(self.records)
+
+
+def _record_to_model_response(record: dict) -> ModelResponse:
+    response = record.get("response")
+    if not isinstance(response, dict):
+        raise CustomLLMError(
+            status_code=500,
+            message=f"traj record at step has no usable 'response' dict: got {type(response).__name__}",
+        )
+    return ModelResponse(**response)
+
+
+def _extract_assistant_text(record: dict) -> str:
+    response = record.get("response") or {}
+    choices = response.get("choices") or []
+    if not choices:
+        return ""
+    message = choices[0].get("message") or {}
+    return message.get("content") or ""
+
+
+class TrajectoryReplayer(CustomLLM):
+    """litellm CustomLLM that returns recorded responses in sequential order."""
+
+    def __init__(self, traj_path: str | os.PathLike) -> None:
+        super().__init__()
+        self.cursor = SequentialCursor.load(traj_path)
+
+    async def acompletion(
+        self,
+        model: str,
+        messages: list,
+        *args: Any,
+        **kwargs: Any,
+    ) -> ModelResponse:
+        record = await self.cursor.next(expected_model=model)
+        return _record_to_model_response(record)
+
+    async def astreaming(
+        self,
+        model: str,
+        messages: list,
+        *args: Any,
+        **kwargs: Any,
+    ) -> AsyncIterator[GenericStreamingChunk]:
+        record = await self.cursor.next(expected_model=model)
+        text = _extract_assistant_text(record)
+        model_response = kwargs.get("model_response")
+        async for chunk in async_mock_completion_streaming_obj(
+            model_response=model_response,
+            mock_response=text,
+            model=model,
+        ):
+            yield chunk
diff --git a/rock/sdk/model/server/main.py b/rock/sdk/model/server/main.py
index 7f8dabebe2..e2263a1858 100644
--- a/rock/sdk/model/server/main.py
+++ b/rock/sdk/model/server/main.py
@@ -11,7 +11,7 @@
 from rock.logger import init_logger
 from rock.sdk.model.server.api.local import init_local_api, local_router
 from rock.sdk.model.server.api.proxy import proxy_router
-from rock.sdk.model.server.config import ModelServiceConfig
+from rock.sdk.model.server.config import TRAJ_FILE, ModelServiceConfig
 
 # Configure logging
 logger = init_logger(__name__)
@@ -52,6 +52,39 @@ async def global_exception_handler(request, exc):
     return app
 
 
+def _configure_litellm_for_proxy(config: ModelServiceConfig) -> None:
+    """Wire up litellm record/replay integrations for the proxy mode.
+
+    - When ``replay_traj_path`` is set, register ``TrajectoryReplayer`` as a
+      custom provider so requests routed to ``traj-replay/<model>`` return
+      recorded responses without hitting any upstream.
+    - When recording is enabled (default), register ``TrajectoryRecorder`` as
+      a litellm callback so every chat/completions call appends a JSONL line.
+
+    Replay and record are mutually exclusive: in replay mode we don't record,
+    since replayed responses re-traversing the recorder would inflate metrics
+    and overwrite the source-of-truth file.
+    """
+    import litellm
+
+    from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
+    from rock.sdk.model.server.integrations.traj_replayer import TrajectoryReplayer
+
+    if config.replay_traj_path:
+        replayer = TrajectoryReplayer(config.replay_traj_path)
+        litellm.custom_provider_map = [
+            {"provider": "traj-replay", "custom_handler": replayer},
+        ]
+        logger.info(f"litellm replay handler registered, traj_path={config.replay_traj_path}")
+        return
+
+    if config.traj_enabled:
+        traj_path = config.traj_file or TRAJ_FILE
+        recorder = TrajectoryRecorder(traj_file=traj_path)
+        litellm.callbacks.append(recorder)
+        logger.info(f"litellm trajectory recorder registered, traj_file={traj_path}")
+
+
 def main(
     model_servie_type: str,
     config: ModelServiceConfig,
@@ -63,6 +96,7 @@ def main(
         asyncio.run(init_local_api())
         app.include_router(local_router, prefix="", tags=["local"])
     else:
+        _configure_litellm_for_proxy(config)
         app.include_router(proxy_router, prefix="", tags=["proxy"])
 
     logger.info(f"Starting LLM Service on {config.host}:{config.port}, type: {model_servie_type}")
@@ -100,6 +134,13 @@ def create_config_from_args(args) -> ModelServiceConfig:
     if args.request_timeout:
         config.request_timeout = args.request_timeout
         logger.info(f"request_timeout set from command line: {args.request_timeout}s")
+    if getattr(args, "num_retries", None) is not None:
+        config.num_retries = args.num_retries
+        logger.info(f"num_retries set from command line: {args.num_retries}")
+    if getattr(args, "traj_file", None):
+        config.replay_traj_path = args.traj_file
+        config.traj_enabled = False
+        logger.info(f"replay mode enabled via --traj-file: {args.traj_file}")
 
     return config
 
@@ -142,6 +183,18 @@ def create_config_from_args(args) -> ModelServiceConfig:
     parser.add_argument(
         "--request-timeout", type=int, default=None, help="Request timeout in seconds. Overrides config file."
     )
+    parser.add_argument(
+        "--num-retries",
+        type=int,
+        default=None,
+        help="Number of retries for retryable failures (passed through to litellm). Overrides config file.",
+    )
+    parser.add_argument(
+        "--traj-file",
+        type=str,
+        default=None,
+        help="Replay mode: path to a recorded .jsonl traj file or directory. Disables real LLM upstreams.",
+    )
     args = parser.parse_args()
 
     config = create_config_from_args(args)
diff --git a/rock/sdk/model/server/utils.py b/rock/sdk/model/server/utils.py
index 20ae8896dc..86b7414e29 100644
--- a/rock/sdk/model/server/utils.py
+++ b/rock/sdk/model/server/utils.py
@@ -26,7 +26,13 @@ def _get_or_create_metrics_monitor() -> MetricsMonitor:
 
 
 def _write_traj(data: dict):
-    """Write traj data to file in JSONL format."""
+    """Write traj data to file in JSONL format.
+
+    Used by the legacy ``@record_traj`` decorator on the ``local`` model-service
+    flow. The proxy flow now persists trajectories via
+    :class:`rock.sdk.model.server.integrations.traj_recorder.TrajectoryRecorder`
+    instead, which uses litellm's StandardLoggingPayload schema.
+    """
     from rock import env_vars
 
     append = env_vars.ROCK_MODEL_SERVICE_TRAJ_APPEND_MODE
@@ -38,7 +44,12 @@ def _write_traj(data: dict):
 
 
 def record_traj(func: Callable):
-    """Decorator to record chat completions input/output as traj."""
+    """Decorator to record chat completions input/output as traj.
+
+    Kept for the ``local`` model-service mode (rock/sdk/model/server/api/local.py).
+    The ``proxy`` mode no longer uses this decorator — it relies on the
+    TrajectoryRecorder litellm callback for richer payloads.
+    """
 
     @wraps(func)
     async def wrapper(*args, **kwargs):
diff --git a/tests/unit/sdk/model/test_proxy.py b/tests/unit/sdk/model/test_proxy.py
index edce5584cb..c994c1c9d6 100644
--- a/tests/unit/sdk/model/test_proxy.py
+++ b/tests/unit/sdk/model/test_proxy.py
@@ -1,12 +1,12 @@
+from types import SimpleNamespace
 from unittest.mock import AsyncMock, MagicMock, patch
 
-import httpx
 import pytest
 import yaml
-from fastapi import FastAPI, Request
-from httpx import ASGITransport, AsyncClient, HTTPStatusError, Request, Response
+from fastapi import FastAPI
+from httpx import ASGITransport, AsyncClient
 
-from rock.sdk.model.server.api.proxy import perform_llm_request, proxy_router
+from rock.sdk.model.server.api.proxy import proxy_router
 from rock.sdk.model.server.config import ModelServiceConfig
 from rock.sdk.model.server.main import create_config_from_args, lifespan
 from rock.sdk.model.server.utils import (
@@ -24,18 +24,30 @@
 test_app.state.model_service_config = mock_config
 
 
+# Patch path for the litellm.acompletion symbol as imported inside proxy.py.
+ACOMPLETION_PATCH = "rock.sdk.model.server.api.proxy.litellm.acompletion"
+
+
+def _fake_model_response(*, id="chat-123", choices=None) -> SimpleNamespace:
+    """Build a litellm-shaped object that exposes .model_dump() like a Pydantic model."""
+    payload = {
+        "id": id,
+        "object": "chat.completion",
+        "model": "gpt-3.5-turbo",
+        "choices": choices
+        or [
+            {"index": 0, "message": {"role": "assistant", "content": "hi"}, "finish_reason": "stop"},
+        ],
+        "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
+    }
+    return SimpleNamespace(model_dump=lambda: payload)
+
+
 @pytest.mark.asyncio
 async def test_chat_completions_routing_success():
-    """
-    Test the high-level routing logic.
-    """
-    patch_path = "rock.sdk.model.server.api.proxy.perform_llm_request"
-
-    with patch(patch_path, new_callable=AsyncMock) as mock_request:
-        mock_resp = MagicMock(spec=Response)
-        mock_resp.status_code = 200
-        mock_resp.json.return_value = {"id": "chat-123", "choices": []}
-        mock_request.return_value = mock_resp
+    """Routing: model name maps to its proxy_rules entry, passed to litellm as api_base."""
+    with patch(ACOMPLETION_PATCH, new_callable=AsyncMock) as mock_acompletion:
+        mock_acompletion.return_value = _fake_model_response()
 
         transport = ASGITransport(app=test_app)
         async with AsyncClient(transport=transport, base_url="http://test") as ac:
@@ -43,51 +55,38 @@ async def test_chat_completions_routing_success():
             response = await ac.post("/v1/chat/completions", json=payload)
 
         assert response.status_code == 200
-        call_args = mock_request.call_args[0]
-        assert call_args[0] == "https://api.openai.com/v1/chat/completions"
-        assert mock_request.called
+        assert mock_acompletion.called
+        call_kwargs = mock_acompletion.call_args.kwargs
+        assert call_kwargs["api_base"] == "https://api.openai.com/v1"
+        assert call_kwargs["model"] == "openai/gpt-3.5-turbo"
+        assert call_kwargs["messages"] == [{"role": "user", "content": "hello"}]
 
 
 @pytest.mark.asyncio
 async def test_chat_completions_fallback_to_default_when_not_found():
-    """
-    Test that an unrecognized model name correctly falls back to the 'default' URL.
-    """
-    patch_path = "rock.sdk.model.server.api.proxy.perform_llm_request"
-
-    with patch(patch_path, new_callable=AsyncMock) as mock_request:
-        mock_resp = MagicMock(spec=Response)
-        mock_resp.status_code = 200
-        mock_resp.json.return_value = {"id": "chat-fallback", "choices": []}
-        mock_request.return_value = mock_resp
+    """Unrecognized model name → falls back to the 'default' base URL."""
+    with patch(ACOMPLETION_PATCH, new_callable=AsyncMock) as mock_acompletion:
+        mock_acompletion.return_value = _fake_model_response(id="chat-fallback")
 
         config = test_app.state.model_service_config
         default_base_url = config.proxy_rules["default"].rstrip("/")
-        expected_target_url = f"{default_base_url}/chat/completions"
 
         transport = ASGITransport(app=test_app)
         async with AsyncClient(transport=transport, base_url="http://test") as ac:
             payload = {
-                "model": "some-random-unsupported-model",  # This model is NOT in proxy_rules
+                "model": "some-random-unsupported-model",
                 "messages": [{"role": "user", "content": "hello"}],
             }
             response = await ac.post("/v1/chat/completions", json=payload)
 
         assert response.status_code == 200
-
-        # Verify that perform_llm_request was called with the DEFAULT URL
-        call_args = mock_request.call_args[0]
-        actual_url = call_args[0]
-
-        assert actual_url == expected_target_url
-        assert mock_request.called
+        call_kwargs = mock_acompletion.call_args.kwargs
+        assert call_kwargs["api_base"] == default_base_url
 
 
 @pytest.mark.asyncio
 async def test_chat_completions_routing_absolute_fail():
-    """
-    Test that both the specific model and the 'default' rule are missing.
-    """
+    """No matching rule and no 'default' → 400."""
     empty_config = ModelServiceConfig()
     empty_config.proxy_rules = {}
 
@@ -103,98 +102,143 @@ async def test_chat_completions_routing_absolute_fail():
 
 
 @pytest.mark.asyncio
-async def test_perform_llm_request_retry_on_whitelist():
-    """
-    Test that the proxy retries when receiving a whitelisted error code.
-    """
-    client_post_path = "rock.sdk.model.server.api.proxy.http_client.post"
-
-    # Patch asyncio.sleep inside the retry module to avoid actual waiting
-    with (
-        patch(client_post_path, new_callable=AsyncMock) as mock_post,
-        patch("rock.utils.retry.asyncio.sleep", return_value=None),
-    ):
-        # 1. Setup Failed Response (429)
-        resp_429 = MagicMock(spec=Response)
-        resp_429.status_code = 429
-        error_429 = HTTPStatusError("Rate Limited", request=MagicMock(spec=Request), response=resp_429)
+async def test_proxy_base_url_overrides_proxy_rules():
+    """When proxy_base_url is set, all requests go to that URL, ignoring proxy_rules."""
+    config = ModelServiceConfig()
+    config.proxy_base_url = "https://custom-endpoint.example.com/v1"
 
-        # 2. Setup Success Response (200)
-        resp_200 = MagicMock(spec=Response)
-        resp_200.status_code = 200
-        resp_200.json.return_value = {"ok": True}
+    local_app = FastAPI()
+    local_app.state.model_service_config = config
+    local_app.include_router(proxy_router)
 
-        # Sequence: Fail with 429, then Succeed with 200
-        mock_post.side_effect = [error_429, resp_200]
+    with patch(ACOMPLETION_PATCH, new_callable=AsyncMock) as mock_acompletion:
+        mock_acompletion.return_value = _fake_model_response()
 
-        result = await perform_llm_request("http://fake.url", {}, {}, mock_config)
+        transport = ASGITransport(app=local_app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            payload = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hello"}]}
+            response = await ac.post("/v1/chat/completions", json=payload)
 
-        assert result.status_code == 200
-        assert mock_post.call_count == 2
+        assert response.status_code == 200
+        call_kwargs = mock_acompletion.call_args.kwargs
+        assert call_kwargs["api_base"] == "https://custom-endpoint.example.com/v1"
 
 
 @pytest.mark.asyncio
-async def test_perform_llm_request_no_retry_on_non_whitelist():
-    """
-    Test that the proxy DOES NOT retry for non-retryable codes (e.g., 401).
-    It should return the error response immediately.
-    """
-    client_post_path = "rock.sdk.model.server.api.proxy.http_client.post"
+async def test_chat_completions_passes_num_retries_and_timeout():
+    """num_retries and request_timeout from config flow through to litellm.acompletion."""
+    config = ModelServiceConfig()
+    config.num_retries = 3
+    config.request_timeout = 45
 
-    with patch(client_post_path, new_callable=AsyncMock) as mock_post:
-        # Mock 401 Unauthorized (NOT in the retry whitelist)
-        resp_401 = MagicMock(spec=Response)
-        resp_401.status_code = 401
-        resp_401.json.return_value = {"error": "Invalid API Key"}
+    local_app = FastAPI()
+    local_app.state.model_service_config = config
+    local_app.include_router(proxy_router)
 
-        # The function should return this response directly
-        mock_post.return_value = resp_401
+    with patch(ACOMPLETION_PATCH, new_callable=AsyncMock) as mock_acompletion:
+        mock_acompletion.return_value = _fake_model_response()
 
-        result = await perform_llm_request("http://fake.url", {}, {}, mock_config)
+        transport = ASGITransport(app=local_app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            payload = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]}
+            await ac.post("/v1/chat/completions", json=payload)
 
-        assert result.status_code == 401
-        # Call count must be 1, meaning no retries were attempted
-        assert mock_post.call_count == 1
+        call_kwargs = mock_acompletion.call_args.kwargs
+        assert call_kwargs["num_retries"] == 3
+        assert call_kwargs["timeout"] == 45
 
 
 @pytest.mark.asyncio
-async def test_perform_llm_request_network_timeout_retry():
-    """
-    Test that network-level exceptions (like Timeout) also trigger retries.
-    """
-    client_post_path = "rock.sdk.model.server.api.proxy.http_client.post"
+async def test_chat_completions_litellm_error_returns_proxy_schema():
+    """A litellm exception is converted to {error:{message,type,code}} JSON
+    so agent-side keyword detection (e.g. 'context length exceeded') keeps working."""
+    from litellm.exceptions import BadRequestError
+
+    err = BadRequestError(
+        message="context length exceeded for this model",
+        model="gpt-3.5-turbo",
+        llm_provider="openai",
+    )
 
-    with (
-        patch(client_post_path, new_callable=AsyncMock) as mock_post,
-        patch("rock.utils.retry.asyncio.sleep", return_value=None),
-    ):
-        resp_200 = MagicMock(spec=Response)
-        resp_200.status_code = 200
+    with patch(ACOMPLETION_PATCH, new_callable=AsyncMock) as mock_acompletion:
+        mock_acompletion.side_effect = err
 
-        mock_post.side_effect = [httpx.TimeoutException("Network Timeout"), resp_200]
+        transport = ASGITransport(app=test_app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            payload = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hello"}]}
+            response = await ac.post("/v1/chat/completions", json=payload)
+
+        body = response.json()
+        assert "error" in body
+        assert "context length exceeded" in body["error"]["message"]
+        assert body["error"]["type"] == "BadRequestError"
+        assert body["error"]["code"] == response.status_code
+
+
+@pytest.mark.asyncio
+async def test_chat_completions_replay_mode_uses_traj_replay_provider():
+    """In replay mode the proxy targets traj-replay/<model> instead of a real upstream."""
+    config = ModelServiceConfig()
+    config.replay_traj_path = "/tmp/does-not-matter-for-this-test"
+
+    local_app = FastAPI()
+    local_app.state.model_service_config = config
+    local_app.include_router(proxy_router)
+
+    with patch(ACOMPLETION_PATCH, new_callable=AsyncMock) as mock_acompletion:
+        mock_acompletion.return_value = _fake_model_response()
+
+        transport = ASGITransport(app=local_app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            payload = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]}
+            response = await ac.post("/v1/chat/completions", json=payload)
+
+        assert response.status_code == 200
+        call_kwargs = mock_acompletion.call_args.kwargs
+        assert call_kwargs["model"] == "traj-replay/gpt-3.5-turbo"
+        assert call_kwargs["api_base"] is None
+
+
+@pytest.mark.asyncio
+async def test_chat_completions_strips_hop_by_hop_headers():
+    """host / content-length / transfer-encoding etc. are not forwarded."""
+    captured = {}
 
-        result = await perform_llm_request("http://fake.url", {}, {}, mock_config)
+    async def capture(*args, **kwargs):
+        captured.update(kwargs)
+        return _fake_model_response()
 
-        assert result.status_code == 200
-        assert mock_post.call_count == 2
+    with patch(ACOMPLETION_PATCH, new=capture):
+        transport = ASGITransport(app=test_app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            payload = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]}
+            await ac.post(
+                "/v1/chat/completions",
+                json=payload,
+                headers={"Authorization": "Bearer abc", "X-Trace": "t1"},
+            )
+
+    forwarded = captured["extra_headers"]
+    forwarded_lower = {k.lower() for k in forwarded}
+    assert "authorization" in forwarded_lower
+    assert "x-trace" in forwarded_lower
+    assert "host" not in forwarded_lower
+    assert "content-length" not in forwarded_lower
+    assert "content-type" not in forwarded_lower
+    assert "transfer-encoding" not in forwarded_lower
 
 
 @pytest.mark.asyncio
 async def test_lifespan_initialization_with_config(tmp_path):
-    """
-    Test that the application correctly initializes and overrides defaults
-    when a valid configuration file path is provided.
-    """
+    """Application initializes correctly when a valid config file is provided."""
     conf_file = tmp_path / "proxy.yml"
     conf_file.write_text(yaml.dump({"proxy_rules": {"my-model": "http://custom-url"}, "request_timeout": 50}))
 
-    # Initialize App and load config from file
     config = ModelServiceConfig.from_file(str(conf_file))
     app = FastAPI(lifespan=lambda app: lifespan(app, config))
 
     async with lifespan(app, config):
         app_config = app.state.model_service_config
-        # Verify that the config reflects file content instead of defaults
         assert app_config.proxy_rules["my-model"] == "http://custom-url"
         assert app_config.request_timeout == 50
         assert "gpt-3.5-turbo" not in app_config.proxy_rules
@@ -202,67 +246,26 @@ async def test_lifespan_initialization_with_config(tmp_path):
 
 @pytest.mark.asyncio
 async def test_lifespan_initialization_no_config():
-    """
-    Test that the application initializes with default ModelServiceConfig
-    settings when no configuration file path is provided.
-    """
+    """Defaults are loaded when no config file is provided."""
     config = ModelServiceConfig()
     app = FastAPI(lifespan=lambda app: lifespan(app, config))
 
     async with lifespan(app, config):
         app_config = app.state.model_service_config
-        # Verify that default rules (e.g., 'gpt-3.5-turbo') are loaded
         assert "gpt-3.5-turbo" in app_config.proxy_rules
         assert app_config.request_timeout == 120
 
 
 @pytest.mark.asyncio
 async def test_lifespan_invalid_config_path():
-    """
-    Test that providing a non-existent configuration file path causes
-    ModelServiceConfig.from_file to raise a FileNotFoundError.
-    """
-    # Expect FileNotFoundError when loading from non-existent file
+    """Non-existent config path → FileNotFoundError."""
     with pytest.raises(FileNotFoundError):
         ModelServiceConfig.from_file("/tmp/non_existent_file.yml")
 
 
-@pytest.mark.asyncio
-async def test_proxy_base_url_overrides_proxy_rules(tmp_path):
-    """
-    Test that when proxy_base_url is set, all requests are forwarded to that URL,
-    bypassing proxy_rules entirely.
-    """
-    config = ModelServiceConfig()
-    config.proxy_base_url = "https://custom-endpoint.example.com/v1"
-
-    test_app = FastAPI()
-    test_app.state.model_service_config = config
-    test_app.include_router(proxy_router)
-
-    with patch("rock.sdk.model.server.api.proxy.perform_llm_request", new_callable=AsyncMock) as mock_request:
-        mock_resp = MagicMock(spec=Response)
-        mock_resp.status_code = 200
-        mock_resp.json.return_value = {"id": "chat-123", "choices": []}
-        mock_request.return_value = mock_resp
-
-        transport = ASGITransport(app=test_app)
-        async with AsyncClient(transport=transport, base_url="http://test") as ac:
-            # Even when requesting gpt-3.5-turbo, should forward to proxy_base_url
-            payload = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hello"}]}
-            response = await ac.post("/v1/chat/completions", json=payload)
-
-        assert response.status_code == 200
-        # Verify request was sent to proxy_base_url
-        call_args = mock_request.call_args[0]
-        assert call_args[0] == "https://custom-endpoint.example.com/v1/chat/completions"
-
-
 @pytest.mark.asyncio
 async def test_config_loads_host_and_port_from_file(tmp_path):
-    """
-    Test that ModelServiceConfig correctly loads host and port from config file.
-    """
+    """ModelServiceConfig loads host and port from config file."""
     conf_file = tmp_path / "proxy.yml"
     conf_file.write_text(
         yaml.dump({"host": "127.0.0.1", "port": 9000, "proxy_rules": {"my-model": "http://my-backend"}})
@@ -276,101 +279,59 @@ async def test_config_loads_host_and_port_from_file(tmp_path):
 
 
 def test_config_default_host_and_port():
-    """
-    Test default values for host and port.
-    """
     config = ModelServiceConfig()
-
     assert config.host == "0.0.0.0"
     assert config.port == 8080
 
 
 @pytest.mark.asyncio
 async def test_config_loads_retryable_status_codes_from_file(tmp_path):
-    """
-    Test that ModelServiceConfig correctly loads retryable_status_codes from config file.
-    """
     conf_file = tmp_path / "proxy.yml"
     conf_file.write_text(yaml.dump({"retryable_status_codes": [429, 500, 502, 503]}))
 
     config = ModelServiceConfig.from_file(str(conf_file))
-
     assert config.retryable_status_codes == [429, 500, 502, 503]
 
 
 def test_config_default_retryable_status_codes():
-    """
-    Test default values for retryable_status_codes.
-    """
     config = ModelServiceConfig()
-
     assert config.retryable_status_codes == [429, 500]
 
 
-@pytest.mark.asyncio
-async def test_perform_llm_request_respects_custom_retryable_codes():
-    """
-    Test that custom retryable_status_codes are respected (502 retries, 401 does not).
-    """
+def test_config_default_traj_and_replay():
+    """New traj/replay defaults: recording on (append=True), replay off."""
     config = ModelServiceConfig()
-    config.retryable_status_codes = [502, 503, 504]  # Custom retryable status codes
-
-    client_post_path = "rock.sdk.model.server.api.proxy.http_client.post"
-
-    with (
-        patch(client_post_path, new_callable=AsyncMock) as mock_post,
-        patch("rock.utils.retry.asyncio.sleep", return_value=None),
-    ):
-        # 502 should retry (in custom list)
-        resp_502 = MagicMock(spec=Response)
-        resp_502.status_code = 502
-        error_502 = HTTPStatusError("Bad Gateway", request=MagicMock(spec=Request), response=resp_502)
-
-        resp_200 = MagicMock(spec=Response)
-        resp_200.status_code = 200
-        resp_200.json.return_value = {"ok": True}
-
-        # Sequence: 502 fail, then 200 success
-        mock_post.side_effect = [error_502, resp_200]
-
-        result = await perform_llm_request("http://fake.url", {}, {}, config)
-
-        assert result.status_code == 200
-        assert mock_post.call_count == 2
+    assert config.traj_enabled is True
+    assert config.traj_file is None
+    assert config.replay_traj_path is None
+    assert config.num_retries == 6
 
 
 @pytest.mark.asyncio
-async def test_perform_llm_request_non_retryable_code_not_retried():
-    """
-    Test that 401 (not in custom retryable_status_codes) does not trigger retry.
-    """
-    config = ModelServiceConfig()
-    config.retryable_status_codes = [502, 503, 504]  # Custom retryable status codes, excluding 401
-
-    client_post_path = "rock.sdk.model.server.api.proxy.http_client.post"
-
-    with patch(client_post_path, new_callable=AsyncMock) as mock_post:
-        # 401 should not retry (not in custom list)
-        resp_401 = MagicMock(spec=Response)
-        resp_401.status_code = 401
-        resp_401.json.return_value = {"error": "Invalid API Key"}
-
-        mock_post.return_value = resp_401
-
-        result = await perform_llm_request("http://fake.url", {}, {}, config)
+async def test_config_loads_traj_and_replay_from_file(tmp_path):
+    conf_file = tmp_path / "proxy.yml"
+    conf_file.write_text(
+        yaml.dump(
+            {
+                "traj_enabled": False,
+                "traj_file": "/tmp/my-traj.jsonl",
+                "replay_traj_path": "/tmp/in.jsonl",
+                "num_retries": 2,
+            }
+        )
+    )
 
-        assert result.status_code == 401
-        assert mock_post.call_count == 1  # No retry
+    config = ModelServiceConfig.from_file(str(conf_file))
+    assert config.traj_enabled is False
+    assert config.traj_file == "/tmp/my-traj.jsonl"
+    assert config.replay_traj_path == "/tmp/in.jsonl"
+    assert config.num_retries == 2
 
 
 def test_cli_args_override_config_file(tmp_path):
-    """
-    Test that CLI arguments override config file settings.
-    This tests the logic in create_config_from_args().
-    """
+    """CLI arguments override config file settings."""
     import argparse
 
-    # Create args with config file and CLI parameters
     conf_file = tmp_path / "proxy.yml"
     conf_file.write_text(
         yaml.dump(
@@ -386,28 +347,47 @@ def test_cli_args_override_config_file(tmp_path):
 
     args = argparse.Namespace(
         config_file=str(conf_file),
-        host="0.0.0.0",  # CLI overrides config file
-        port=9000,  # CLI overrides config file
-        proxy_base_url="https://cli-url.example.com/v1",  # CLI overrides config file
-        retryable_status_codes="502,503",  # CLI overrides config file
-        request_timeout=30,  # CLI overrides config file
+        host="0.0.0.0",
+        port=9000,
+        proxy_base_url="https://cli-url.example.com/v1",
+        retryable_status_codes="502,503",
+        request_timeout=30,
+        num_retries=4,
+        traj_file=None,
     )
 
     config = create_config_from_args(args)
 
-    # Verify CLI arguments override config file
     assert config.host == "0.0.0.0"
     assert config.port == 9000
     assert config.proxy_base_url == "https://cli-url.example.com/v1"
     assert config.retryable_status_codes == [502, 503]
     assert config.request_timeout == 30
+    assert config.num_retries == 4
+
+
+def test_cli_traj_file_enables_replay():
+    """--traj-file sets replay_enabled, replay_traj_path, and disables recording."""
+    import argparse
+
+    args = argparse.Namespace(
+        config_file=None,
+        host=None,
+        port=None,
+        proxy_base_url=None,
+        retryable_status_codes=None,
+        request_timeout=None,
+        num_retries=None,
+        traj_file="/tmp/in.jsonl",
+    )
+
+    config = create_config_from_args(args)
+    assert config.replay_traj_path == "/tmp/in.jsonl"
+    assert config.traj_enabled is False
 
 
 @pytest.mark.asyncio
 async def test_config_file_overrides_defaults(tmp_path):
-    """
-    Test that config file values override default values.
-    """
     conf_file = tmp_path / "proxy.yml"
     conf_file.write_text(
         yaml.dump(
@@ -422,27 +402,20 @@ async def test_config_file_overrides_defaults(tmp_path):
 
     config = ModelServiceConfig.from_file(str(conf_file))
 
-    # Verify config file overrides defaults
     assert config.host == "10.0.0.1"
     assert config.port == 8888
     assert config.request_timeout == 300
     assert config.proxy_rules["test-model"] == "http://test-backend"
-    # Verify other fields remain as defaults
     assert config.proxy_base_url is None
 
 
 def test_metrics_monitor_is_singleton():
-    """
-    Test that _get_or_create_metrics_monitor returns the same instance
-    on repeated calls (module-level singleton, created only once).
-    """
+    """_get_or_create_metrics_monitor returns the same instance on repeated calls."""
     import rock.sdk.model.server.utils as utils_module
 
     with patch("rock.sdk.model.server.utils.MetricsMonitor") as mock_cls:
         mock_monitor = MagicMock()
         mock_cls.create.return_value = mock_monitor
-
-        # Reset singleton so the test is isolated
         utils_module._metrics_monitor = None
 
         first = _get_or_create_metrics_monitor()
@@ -450,15 +423,11 @@ def test_metrics_monitor_is_singleton():
 
         assert first is second
         assert mock_cls.create.call_count == 1
-
-        # Cleanup
         utils_module._metrics_monitor = None
 
 
 def test_metrics_monitor_uses_env_endpoint():
-    """
-    Test that ROCK_METRICS_ENDPOINT env var is passed to MetricsMonitor.create().
-    """
+    """ROCK_METRICS_ENDPOINT env var is passed to MetricsMonitor.create()."""
     import rock.sdk.model.server.utils as utils_module
 
     custom_endpoint = "http://my-otel-collector:4318/v1/metrics"
@@ -469,26 +438,19 @@ def test_metrics_monitor_uses_env_endpoint():
     ):
         mock_monitor = MagicMock()
         mock_cls.create.return_value = mock_monitor
-
         utils_module._metrics_monitor = None
         _get_or_create_metrics_monitor()
-
         mock_cls.create.assert_called_once_with(metrics_endpoint=custom_endpoint)
-
         utils_module._metrics_monitor = None
 
 
 def test_metrics_monitor_registers_gauge_and_counter():
-    """
-    Test that _get_or_create_metrics_monitor registers both
-    the RT gauge and request count counter on first creation.
-    """
+    """_get_or_create_metrics_monitor registers both metrics on first creation."""
     import rock.sdk.model.server.utils as utils_module
 
     with patch("rock.sdk.model.server.utils.MetricsMonitor") as mock_cls:
         mock_monitor = MagicMock()
         mock_cls.create.return_value = mock_monitor
-
         utils_module._metrics_monitor = None
         _get_or_create_metrics_monitor()
 
@@ -498,16 +460,12 @@ def test_metrics_monitor_registers_gauge_and_counter():
         mock_monitor._register_counter.assert_called_once_with(
             MODEL_SERVICE_REQUEST_COUNT, "total request count", "count"
         )
-
         utils_module._metrics_monitor = None
 
 
 @pytest.mark.asyncio
 async def test_record_traj_reports_rt_and_count():
-    """
-    Test that record_traj decorator calls record_gauge_by_name (RT)
-    and record_counter_by_name (count) with correct metric names and attributes.
-    """
+    """Legacy record_traj decorator (still used by local mode) reports RT/count."""
     import rock.sdk.model.server.utils as utils_module
 
     mock_monitor = MagicMock()
@@ -542,15 +500,12 @@ async def fake_handler(body: dict):
 
 @pytest.mark.asyncio
 async def test_record_traj_sandbox_id_defaults_to_unknown():
-    """
-    Test that sandbox_id defaults to 'unknown' when ROCK_SANDBOX_ID is not set.
-    """
+    """sandbox_id defaults to 'unknown' when ROCK_SANDBOX_ID is not set."""
     import rock.sdk.model.server.utils as utils_module
 
     mock_monitor = MagicMock()
 
     with patch("rock.sdk.model.server.utils.MetricsMonitor") as mock_cls, patch.dict("os.environ", {}, clear=False):
-        # Ensure ROCK_SANDBOX_ID is not set
         os_env = __import__("os").environ
         os_env.pop("ROCK_SANDBOX_ID", None)
 
diff --git a/tests/unit/sdk/model/test_traj_recorder.py b/tests/unit/sdk/model/test_traj_recorder.py
new file mode 100644
index 0000000000..c9b1c20197
--- /dev/null
+++ b/tests/unit/sdk/model/test_traj_recorder.py
@@ -0,0 +1,170 @@
+"""Tests for TrajectoryRecorder (litellm CustomLogger that writes JSONL + emits OTLP metrics)."""
+
+import json
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
+
+
+def _sample_payload(**overrides):
+    payload = {
+        "id": "chatcmpl-abc",
+        "trace_id": "trace-1",
+        "call_type": "acompletion",
+        "stream": False,
+        "status": "success",
+        "model": "gpt-3.5-turbo",
+        "model_id": None,
+        "model_group": None,
+        "api_base": "https://api.openai.com/v1",
+        "messages": [{"role": "user", "content": "hi"}],
+        "response": {
+            "id": "chatcmpl-abc",
+            "choices": [
+                {
+                    "index": 0,
+                    "message": {"role": "assistant", "content": "hello back"},
+                    "finish_reason": "stop",
+                }
+            ],
+        },
+        "model_parameters": {"temperature": 0.7},
+        "startTime": 100.0,
+        "endTime": 100.5,
+        "completionStartTime": 100.5,
+        "response_time": 0.5,
+        "total_tokens": 12,
+        "prompt_tokens": 4,
+        "completion_tokens": 8,
+        "metadata": {},
+    }
+    payload.update(overrides)
+    return payload
+
+
+@pytest.fixture
+def mock_monitor():
+    monitor = MagicMock()
+    with patch(
+        "rock.sdk.model.server.integrations.traj_recorder._get_or_create_metrics_monitor",
+        return_value=monitor,
+    ):
+        yield monitor
+
+
+@pytest.mark.asyncio
+async def test_recorder_appends_each_call_as_jsonl_line(tmp_path, mock_monitor):
+    """Each successful call adds one JSONL line (always append-only)."""
+    traj_file = tmp_path / "traj.jsonl"
+    recorder = TrajectoryRecorder(traj_file=traj_file)
+
+    payload_a = _sample_payload(id="a", trace_id="run-1")
+    payload_b = _sample_payload(id="b", trace_id="run-1")
+
+    await recorder.async_log_success_event(
+        kwargs={"standard_logging_object": payload_a}, response_obj=None, start_time=0, end_time=1
+    )
+    await recorder.async_log_success_event(
+        kwargs={"standard_logging_object": payload_b}, response_obj=None, start_time=0, end_time=1
+    )
+
+    lines = traj_file.read_text(encoding="utf-8").strip().split("\n")
+    assert len(lines) == 2
+    assert json.loads(lines[0])["id"] == "a"
+    assert json.loads(lines[1])["id"] == "b"
+
+
+@pytest.mark.asyncio
+async def test_recorder_emits_metrics_with_sandbox_id(tmp_path, mock_monitor):
+    traj_file = tmp_path / "traj.jsonl"
+    recorder = TrajectoryRecorder(traj_file=traj_file)
+
+    with patch.dict("os.environ", {"ROCK_SANDBOX_ID": "sandbox-xyz"}):
+        await recorder.async_log_success_event(
+            kwargs={"standard_logging_object": _sample_payload()},
+            response_obj=None,
+            start_time=0,
+            end_time=1,
+        )
+
+    mock_monitor.record_gauge_by_name.assert_called_once()
+    gauge_args = mock_monitor.record_gauge_by_name.call_args
+    assert gauge_args.args[0] == "model_service.request.rt"
+    # response_time of 0.5s → 500 ms
+    assert gauge_args.args[1] == 500.0
+    assert gauge_args.kwargs["attributes"]["status"] == "success"
+    assert gauge_args.kwargs["attributes"]["sandbox_id"] == "sandbox-xyz"
+    assert gauge_args.kwargs["attributes"]["type"] == "chat_completions"
+
+    mock_monitor.record_counter_by_name.assert_called_once_with(
+        "model_service.request.count", 1, attributes=gauge_args.kwargs["attributes"]
+    )
+
+
+@pytest.mark.asyncio
+async def test_recorder_records_failure_with_failure_status(tmp_path, mock_monitor):
+    traj_file = tmp_path / "traj.jsonl"
+    recorder = TrajectoryRecorder(traj_file=traj_file)
+
+    failed_payload = _sample_payload(status="failure", error_information={"error_class": "RateLimitError"})
+
+    await recorder.async_log_failure_event(
+        kwargs={"standard_logging_object": failed_payload},
+        response_obj=None,
+        start_time=0,
+        end_time=1,
+    )
+
+    lines = traj_file.read_text(encoding="utf-8").strip().split("\n")
+    assert len(lines) == 1
+    assert json.loads(lines[0])["status"] == "failure"
+
+    gauge_args = mock_monitor.record_gauge_by_name.call_args
+    assert gauge_args.kwargs["attributes"]["status"] == "failure"
+
+
+@pytest.mark.asyncio
+async def test_recorder_skips_when_payload_missing(tmp_path, mock_monitor):
+    """If litellm doesn't attach a standard_logging_object, the recorder no-ops."""
+    traj_file = tmp_path / "traj.jsonl"
+    recorder = TrajectoryRecorder(traj_file=traj_file)
+
+    await recorder.async_log_success_event(kwargs={}, response_obj=None, start_time=0, end_time=1)
+
+    assert not traj_file.exists() or traj_file.read_text() == ""
+    mock_monitor.record_gauge_by_name.assert_not_called()
+    mock_monitor.record_counter_by_name.assert_not_called()
+
+
+@pytest.mark.asyncio
+async def test_recorder_creates_parent_directory(tmp_path, mock_monitor):
+    traj_file = tmp_path / "deep" / "nested" / "traj.jsonl"
+
+    recorder = TrajectoryRecorder(traj_file=traj_file)
+    await recorder.async_log_success_event(
+        kwargs={"standard_logging_object": _sample_payload()},
+        response_obj=None,
+        start_time=0,
+        end_time=1,
+    )
+
+    assert traj_file.exists()
+    assert traj_file.parent.is_dir()
+
+
+@pytest.mark.asyncio
+async def test_recorder_falls_back_to_start_end_time_when_response_time_missing(tmp_path, mock_monitor):
+    traj_file = tmp_path / "traj.jsonl"
+    recorder = TrajectoryRecorder(traj_file=traj_file)
+
+    payload = _sample_payload(startTime=10.0, endTime=10.25)
+    payload.pop("response_time", None)
+
+    await recorder.async_log_success_event(
+        kwargs={"standard_logging_object": payload}, response_obj=None, start_time=0, end_time=1
+    )
+
+    gauge_args = mock_monitor.record_gauge_by_name.call_args
+    assert abs(gauge_args.args[1] - 250.0) < 1e-6
diff --git a/tests/unit/sdk/model/test_traj_replayer.py b/tests/unit/sdk/model/test_traj_replayer.py
new file mode 100644
index 0000000000..7bfe30ef4e
--- /dev/null
+++ b/tests/unit/sdk/model/test_traj_replayer.py
@@ -0,0 +1,204 @@
+"""Tests for SequentialCursor + TrajectoryReplayer."""
+
+import json
+from types import SimpleNamespace
+
+import pytest
+from litellm.llms.custom_llm import CustomLLMError
+
+from rock.sdk.model.server.integrations.traj_replayer import (
+    SequentialCursor,
+    TrajectoryReplayer,
+)
+
+
+def _record(*, msg: str, model: str = "gpt-3.5-turbo", call_id: str = "x") -> dict:
+    """Build a minimal StandardLoggingPayload-shaped record."""
+    return {
+        "id": call_id,
+        "model": model,
+        "messages": [{"role": "user", "content": msg}],
+        "response": {
+            "id": call_id,
+            "object": "chat.completion",
+            "model": model,
+            "choices": [
+                {
+                    "index": 0,
+                    "message": {"role": "assistant", "content": f"reply: {msg}"},
+                    "finish_reason": "stop",
+                }
+            ],
+            "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
+        },
+    }
+
+
+def _write_jsonl(path, records):
+    with path.open("w", encoding="utf-8") as f:
+        for r in records:
+            f.write(json.dumps(r) + "\n")
+
+
+# ----- SequentialCursor -----
+
+
+def test_cursor_load_from_single_file(tmp_path):
+    p = tmp_path / "traj.jsonl"
+    _write_jsonl(p, [_record(msg="a"), _record(msg="b")])
+
+    cur = SequentialCursor.load(p)
+    assert cur.total == 2
+    assert cur.position == 0
+
+
+def test_cursor_load_skips_empty_lines(tmp_path):
+    p = tmp_path / "traj.jsonl"
+    p.write_text(
+        json.dumps(_record(msg="a")) + "\n\n  \n" + json.dumps(_record(msg="b")) + "\n",
+        encoding="utf-8",
+    )
+
+    cur = SequentialCursor.load(p)
+    assert cur.total == 2
+
+
+def test_cursor_load_missing_file_raises(tmp_path):
+    with pytest.raises(FileNotFoundError):
+        SequentialCursor.load(tmp_path / "missing.jsonl")
+
+
+def test_cursor_load_directory_raises(tmp_path):
+    """A directory is no longer a valid traj_file — must point to a single .jsonl."""
+    with pytest.raises(FileNotFoundError):
+        SequentialCursor.load(tmp_path)
+
+
+@pytest.mark.asyncio
+async def test_cursor_next_returns_records_in_order(tmp_path):
+    p = tmp_path / "traj.jsonl"
+    _write_jsonl(p, [_record(msg="a", call_id="1"), _record(msg="b", call_id="2")])
+
+    cur = SequentialCursor.load(p)
+    first = await cur.next()
+    second = await cur.next()
+
+    assert first["id"] == "1"
+    assert second["id"] == "2"
+    assert cur.position == 2
+
+
+@pytest.mark.asyncio
+async def test_cursor_next_raises_when_exhausted(tmp_path):
+    p = tmp_path / "traj.jsonl"
+    _write_jsonl(p, [_record(msg="only")])
+
+    cur = SequentialCursor.load(p)
+    await cur.next()
+
+    with pytest.raises(CustomLLMError) as exc_info:
+        await cur.next()
+    assert exc_info.value.status_code == 404
+    assert "exhausted" in exc_info.value.message
+
+
+@pytest.mark.asyncio
+async def test_cursor_reset_replays_from_start(tmp_path):
+    p = tmp_path / "traj.jsonl"
+    _write_jsonl(p, [_record(msg="a"), _record(msg="b")])
+
+    cur = SequentialCursor.load(p)
+    await cur.next()
+    await cur.next()
+    cur.reset()
+
+    again = await cur.next()
+    assert again["messages"][0]["content"] == "a"
+
+
+@pytest.mark.asyncio
+async def test_cursor_model_mismatch_only_warns(tmp_path, caplog):
+    p = tmp_path / "traj.jsonl"
+    _write_jsonl(p, [_record(msg="a", model="gpt-3.5-turbo")])
+
+    cur = SequentialCursor.load(p)
+    record = await cur.next(expected_model="gpt-4o")  # different model -> warn but don't raise
+    assert record["id"] == "x"
+
+
+# ----- TrajectoryReplayer -----
+
+
+@pytest.mark.asyncio
+async def test_replayer_acompletion_returns_recorded_response(tmp_path):
+    p = tmp_path / "traj.jsonl"
+    _write_jsonl(p, [_record(msg="a", call_id="step-1")])
+
+    replayer = TrajectoryReplayer(p)
+    response = await replayer.acompletion(
+        model="gpt-3.5-turbo",
+        messages=[{"role": "user", "content": "anything"}],
+    )
+
+    assert response.id == "step-1"
+    assert response.choices[0].message.content == "reply: a"
+
+
+@pytest.mark.asyncio
+async def test_replayer_acompletion_advances_cursor(tmp_path):
+    p = tmp_path / "traj.jsonl"
+    _write_jsonl(
+        p,
+        [
+            _record(msg="a", call_id="step-1"),
+            _record(msg="b", call_id="step-2"),
+        ],
+    )
+
+    replayer = TrajectoryReplayer(p)
+    r1 = await replayer.acompletion(model="gpt-3.5-turbo", messages=[])
+    r2 = await replayer.acompletion(model="gpt-3.5-turbo", messages=[])
+
+    assert r1.id == "step-1"
+    assert r2.id == "step-2"
+
+
+@pytest.mark.asyncio
+async def test_replayer_astreaming_yields_chunks_that_recompose_the_text(tmp_path):
+    """The chunks produced by astreaming should reassemble into the recorded text."""
+    p = tmp_path / "traj.jsonl"
+    recorded_text = "Hello world, this is a deterministic replay."
+    record = _record(msg="hi")
+    record["response"]["choices"][0]["message"]["content"] = recorded_text
+    _write_jsonl(p, [record])
+
+    replayer = TrajectoryReplayer(p)
+
+    # Build a litellm-shaped ModelResponse mock with one Choice/Delta slot.
+    fake_choice = SimpleNamespace(delta=SimpleNamespace(role=None, content=None), index=0)
+    fake_response = SimpleNamespace(choices=[fake_choice])
+
+    chunks_text = []
+    async for chunk in replayer.astreaming(
+        model="gpt-3.5-turbo",
+        messages=[],
+        model_response=fake_response,
+    ):
+        if hasattr(chunk, "choices") and chunk.choices and getattr(chunk.choices[0], "delta", None):
+            piece = chunk.choices[0].delta.content
+            if piece:
+                chunks_text.append(piece)
+
+    assert "".join(chunks_text) == recorded_text
+
+
+@pytest.mark.asyncio
+async def test_replayer_acompletion_raises_on_exhaustion(tmp_path):
+    p = tmp_path / "traj.jsonl"
+    _write_jsonl(p, [_record(msg="only")])
+
+    replayer = TrajectoryReplayer(p)
+    await replayer.acompletion(model="gpt-3.5-turbo", messages=[])
+
+    with pytest.raises(CustomLLMError):
+        await replayer.acompletion(model="gpt-3.5-turbo", messages=[])
diff --git a/uv.lock b/uv.lock
index e00a7f86b3..6a3efc2ba4 100644
--- a/uv.lock
+++ b/uv.lock
@@ -1196,6 +1196,15 @@ wheels = [
     { url = "https://mirrors.aliyun.com/pypi/packages/33/6b/e0547afaf41bf2c42e52430072fa5658766e3d65bd4b03a563d1b6336f57/distlib-0.4.0-py2.py3-none-any.whl", hash = "sha256:9659f7d87e46584a30b5780e43ac7a2143098441670ff0a49d5f9034c54a6c16" },
 ]
 
+[[package]]
+name = "distro"
+version = "1.9.0"
+source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
+sdist = { url = "https://mirrors.aliyun.com/pypi/packages/fc/f8/98eea607f65de6527f8a2e8885fc8015d3e6f5775df186e443e0964a11c3/distro-1.9.0.tar.gz", hash = "sha256:2fa77c6fd8940f116ee1d6b94a2f90b13b5ea8d019b98bc8bafdcabcdd9bdbed" }
+wheels = [
+    { url = "https://mirrors.aliyun.com/pypi/packages/12/b3/231ffd4ab1fc9d679809f356cebee130ac7daa00d6d6f3206dd4fd137e9e/distro-1.9.0-py3-none-any.whl", hash = "sha256:7bffd925d65168f85027d8da9af6bddab658135b840670a223589bc0c8ef02b2" },
+]
+
 [[package]]
 name = "docker"
 version = "7.1.0"
@@ -1294,6 +1303,69 @@ wheels = [
     { url = "https://mirrors.aliyun.com/pypi/packages/2e/7a/c11883a98676e74a405d6503d65f58c3fa076ddd9c0cee6044884f6eac38/fastcore-1.8.15-py3-none-any.whl", hash = "sha256:d005d10d7ee5c2abb7ac0544da7c9f0a0a2f7706b48892a27c1906487ca6dea9" },
 ]
 
+[[package]]
+name = "fastuuid"
+version = "0.14.0"
+source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
+sdist = { url = "https://mirrors.aliyun.com/pypi/packages/c3/7d/d9daedf0f2ebcacd20d599928f8913e9d2aea1d56d2d355a93bfa2b611d7/fastuuid-0.14.0.tar.gz", hash = "sha256:178947fc2f995b38497a74172adee64fdeb8b7ec18f2a5934d037641ba265d26" }
+wheels = [
+    { url = "https://mirrors.aliyun.com/pypi/packages/ad/b2/731a6696e37cd20eed353f69a09f37a984a43c9713764ee3f7ad5f57f7f9/fastuuid-0.14.0-cp310-cp310-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:6e6243d40f6c793c3e2ee14c13769e341b90be5ef0c23c82fa6515a96145181a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/c5/79/c73c47be2a3b8734d16e628982653517f80bbe0570e27185d91af6096507/fastuuid-0.14.0-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:13ec4f2c3b04271f62be2e1ce7e95ad2dd1cf97e94503a3760db739afbd48f00" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/24/c5/84c1eea05977c8ba5173555b0133e3558dc628bcf868d6bf1689ff14aedc/fastuuid-0.14.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:b2fdd48b5e4236df145a149d7125badb28e0a383372add3fbaac9a6b7a394470" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/0e/23/4e362367b7fa17dbed646922f216b9921efb486e7abe02147e4b917359f8/fastuuid-0.14.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f74631b8322d2780ebcf2d2d75d58045c3e9378625ec51865fe0b5620800c39d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b2/72/3985be633b5a428e9eaec4287ed4b873b7c4c53a9639a8b416637223c4cd/fastuuid-0.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:83cffc144dc93eb604b87b179837f2ce2af44871a7b323f2bfed40e8acb40ba8" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b3/6d/6ef192a6df34e2266d5c9deb39cd3eea986df650cbcfeaf171aa52a059c3/fastuuid-0.14.0-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:1a771f135ab4523eb786e95493803942a5d1fc1610915f131b363f55af53b219" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/9d/11/8a2ea753c68d4fece29d5d7c6f3f903948cc6e82d1823bc9f7f7c0355db3/fastuuid-0.14.0-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:4edc56b877d960b4eda2c4232f953a61490c3134da94f3c28af129fb9c62a4f6" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/23/42/7a32c93b6ce12642d9a152ee4753a078f372c9ebb893bc489d838dd4afd5/fastuuid-0.14.0-cp310-cp310-musllinux_1_1_i686.whl", hash = "sha256:bcc96ee819c282e7c09b2eed2b9bd13084e3b749fdb2faf58c318d498df2efbe" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b9/e9/a5f6f686b46e3ed4ed3b93770111c233baac87dd6586a411b4988018ef1d/fastuuid-0.14.0-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:7a3c0bca61eacc1843ea97b288d6789fbad7400d16db24e36a66c28c268cfe3d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b4/c9/18abc73c9c5b7fc0e476c1733b678783b2e8a35b0be9babd423571d44e98/fastuuid-0.14.0-cp310-cp310-win32.whl", hash = "sha256:7f2f3efade4937fae4e77efae1af571902263de7b78a0aee1a1653795a093b2a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/5e/8a/d9e33f4eb4d4f6d9f2c5c7d7e96b5cdbb535c93f3b1ad6acce97ee9d4bf8/fastuuid-0.14.0-cp310-cp310-win_amd64.whl", hash = "sha256:ae64ba730d179f439b0736208b4c279b8bc9c089b102aec23f86512ea458c8a4" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/98/f3/12481bda4e5b6d3e698fbf525df4443cc7dce746f246b86b6fcb2fba1844/fastuuid-0.14.0-cp311-cp311-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:73946cb950c8caf65127d4e9a325e2b6be0442a224fd51ba3b6ac44e1912ce34" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/59/19/2fc58a1446e4d72b655648eb0879b04e88ed6fa70d474efcf550f640f6ec/fastuuid-0.14.0-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:12ac85024637586a5b69645e7ed986f7535106ed3013640a393a03e461740cb7" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/78/29/3c74756e5b02c40cfcc8b1d8b5bac4edbd532b55917a6bcc9113550e99d1/fastuuid-0.14.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:05a8dde1f395e0c9b4be515b7a521403d1e8349443e7641761af07c7ad1624b1" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/52/96/d761da3fccfa84f0f353ce6e3eb8b7f76b3aa21fd25e1b00a19f9c80a063/fastuuid-0.14.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:09378a05020e3e4883dfdab438926f31fea15fd17604908f3d39cbeb22a0b4dc" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/fc/c2/f84c90167cc7765cb82b3ff7808057608b21c14a38531845d933a4637307/fastuuid-0.14.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bbb0c4b15d66b435d2538f3827f05e44e2baafcc003dd7d8472dc67807ab8fd8" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/af/7b/4bacd03897b88c12348e7bd77943bac32ccf80ff98100598fcff74f75f2e/fastuuid-0.14.0-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:cd5a7f648d4365b41dbf0e38fe8da4884e57bed4e77c83598e076ac0c93995e7" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/c0/a2/584f2c29641df8bd810d00c1f21d408c12e9ad0c0dafdb8b7b29e5ddf787/fastuuid-0.14.0-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:c0a94245afae4d7af8c43b3159d5e3934c53f47140be0be624b96acd672ceb73" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/24/68/c6b77443bb7764c760e211002c8638c0c7cce11cb584927e723215ba1398/fastuuid-0.14.0-cp311-cp311-musllinux_1_1_i686.whl", hash = "sha256:2b29e23c97e77c3a9514d70ce343571e469098ac7f5a269320a0f0b3e193ab36" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/5a/87/93f553111b33f9bb83145be12868c3c475bf8ea87c107063d01377cc0e8e/fastuuid-0.14.0-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:1e690d48f923c253f28151b3a6b4e335f2b06bf669c68a02665bc150b7839e94" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/9e/8c/a04d486ca55b5abb7eaa65b39df8d891b7b1635b22db2163734dc273579a/fastuuid-0.14.0-cp311-cp311-win32.whl", hash = "sha256:a6f46790d59ab38c6aa0e35c681c0484b50dc0acf9e2679c005d61e019313c24" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/9c/b2/2d40bf00820de94b9280366a122cbaa60090c8cf59e89ac3938cf5d75895/fastuuid-0.14.0-cp311-cp311-win_amd64.whl", hash = "sha256:e150eab56c95dc9e3fefc234a0eedb342fac433dacc273cd4d150a5b0871e1fa" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/02/a2/e78fcc5df65467f0d207661b7ef86c5b7ac62eea337c0c0fcedbeee6fb13/fastuuid-0.14.0-cp312-cp312-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:77e94728324b63660ebf8adb27055e92d2e4611645bf12ed9d88d30486471d0a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/2b/b3/c846f933f22f581f558ee63f81f29fa924acd971ce903dab1a9b6701816e/fastuuid-0.14.0-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:caa1f14d2102cb8d353096bc6ef6c13b2c81f347e6ab9d6fbd48b9dea41c153d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/54/ea/682551030f8c4fa9a769d9825570ad28c0c71e30cf34020b85c1f7ee7382/fastuuid-0.14.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:d23ef06f9e67163be38cece704170486715b177f6baae338110983f99a72c070" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/14/dd/5927f0a523d8e6a76b70968e6004966ee7df30322f5fc9b6cdfb0276646a/fastuuid-0.14.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0c9ec605ace243b6dbe3bd27ebdd5d33b00d8d1d3f580b39fdd15cd96fd71796" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/16/6e/c0fb547eef61293153348f12e0f75a06abb322664b34a1573a7760501336/fastuuid-0.14.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:808527f2407f58a76c916d6aa15d58692a4a019fdf8d4c32ac7ff303b7d7af09" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/2d/b1/b9c75e03b768f61cf2e84ee193dc18601aeaf89a4684b20f2f0e9f52b62c/fastuuid-0.14.0-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:2fb3c0d7fef6674bbeacdd6dbd386924a7b60b26de849266d1ff6602937675c8" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/fc/fa/f7395fdac07c7a54f18f801744573707321ca0cee082e638e36452355a9d/fastuuid-0.14.0-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:ab3f5d36e4393e628a4df337c2c039069344db5f4b9d2a3c9cea48284f1dd741" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/66/49/c9fd06a4a0b1f0f048aacb6599e7d96e5d6bc6fa680ed0d46bf111929d1b/fastuuid-0.14.0-cp312-cp312-musllinux_1_1_i686.whl", hash = "sha256:b9a0ca4f03b7e0b01425281ffd44e99d360e15c895f1907ca105854ed85e2057" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/be/9c/909e8c95b494e8e140e8be6165d5fc3f61fdc46198c1554df7b3e1764471/fastuuid-0.14.0-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:3acdf655684cc09e60fb7e4cf524e8f42ea760031945aa8086c7eae2eeeabeb8" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/90/eb/d29d17521976e673c55ef7f210d4cdd72091a9ec6755d0fd4710d9b3c871/fastuuid-0.14.0-cp312-cp312-win32.whl", hash = "sha256:9579618be6280700ae36ac42c3efd157049fe4dd40ca49b021280481c78c3176" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/cc/fc/f5c799a6ea6d877faec0472d0b27c079b47c86b1cdc577720a5386483b36/fastuuid-0.14.0-cp312-cp312-win_amd64.whl", hash = "sha256:d9e4332dc4ba054434a9594cbfaf7823b57993d7d8e7267831c3e059857cf397" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/a5/83/ae12dd39b9a39b55d7f90abb8971f1a5f3c321fd72d5aa83f90dc67fe9ed/fastuuid-0.14.0-cp313-cp313-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:77a09cb7427e7af74c594e409f7731a0cf887221de2f698e1ca0ebf0f3139021" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/53/b0/a4b03ff5d00f563cc7546b933c28cb3f2a07344b2aec5834e874f7d44143/fastuuid-0.14.0-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:9bd57289daf7b153bfa3e8013446aa144ce5e8c825e9e366d455155ede5ea2dc" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/9c/6d/64aee0a0f6a58eeabadd582e55d0d7d70258ffdd01d093b30c53d668303b/fastuuid-0.14.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:ac60fc860cdf3c3f327374db87ab8e064c86566ca8c49d2e30df15eda1b0c2d5" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/60/f5/a7e9cda8369e4f7919d36552db9b2ae21db7915083bc6336f1b0082c8b2e/fastuuid-0.14.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ab32f74bd56565b186f036e33129da77db8be09178cd2f5206a5d4035fb2a23f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f0/d3/8ce11827c783affffd5bd4d6378b28eb6cc6d2ddf41474006b8d62e7448e/fastuuid-0.14.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:33e678459cf4addaedd9936bbb038e35b3f6b2061330fd8f2f6a1d80414c0f87" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/a2/51/680fb6352d0bbade04036da46264a8001f74b7484e2fd1f4da9e3db1c666/fastuuid-0.14.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:1e3cc56742f76cd25ecb98e4b82a25f978ccffba02e4bdce8aba857b6d85d87b" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/fa/7c/2014b5785bd8ebdab04ec857635ebd84d5ee4950186a577db9eff0fb8ff6/fastuuid-0.14.0-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:cb9a030f609194b679e1660f7e32733b7a0f332d519c5d5a6a0a580991290022" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/01/d2/524d4ceeba9160e7a9bc2ea3e8f4ccf1ad78f3bde34090ca0c51f09a5e91/fastuuid-0.14.0-cp313-cp313-musllinux_1_1_i686.whl", hash = "sha256:09098762aad4f8da3a888eb9ae01c84430c907a297b97166b8abc07b640f2995" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/bc/17/354d04951ce114bf4afc78e27a18cfbd6ee319ab1829c2d5fb5e94063ac6/fastuuid-0.14.0-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:1383fff584fa249b16329a059c68ad45d030d5a4b70fb7c73a08d98fd53bcdab" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/fb/be/d7be8670151d16d88f15bb121c5b66cdb5ea6a0c2a362d0dcf30276ade53/fastuuid-0.14.0-cp313-cp313-win32.whl", hash = "sha256:a0809f8cc5731c066c909047f9a314d5f536c871a7a22e815cc4967c110ac9ad" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/22/1d/5573ef3624ceb7abf4a46073d3554e37191c868abc3aecd5289a72f9810a/fastuuid-0.14.0-cp313-cp313-win_amd64.whl", hash = "sha256:0df14e92e7ad3276327631c9e7cec09e32572ce82089c55cb1bb8df71cf394ed" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/16/c9/8c7660d1fe3862e3f8acabd9be7fc9ad71eb270f1c65cce9a2b7a31329ab/fastuuid-0.14.0-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:b852a870a61cfc26c884af205d502881a2e59cc07076b60ab4a951cc0c94d1ad" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/4c/f4/a989c82f9a90d0ad995aa957b3e572ebef163c5299823b4027986f133dfb/fastuuid-0.14.0-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:c7502d6f54cd08024c3ea9b3514e2d6f190feb2f46e6dbcd3747882264bb5f7b" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/da/6c/a1a24f73574ac995482b1326cf7ab41301af0fabaa3e37eeb6b3df00e6e2/fastuuid-0.14.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:1ca61b592120cf314cfd66e662a5b54a578c5a15b26305e1b8b618a6f22df714" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/1a/20/2a9b59185ba7a6c7b37808431477c2d739fcbdabbf63e00243e37bd6bf49/fastuuid-0.14.0-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:aa75b6657ec129d0abded3bec745e6f7ab642e6dba3a5272a68247e85f5f316f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ef/33/4105ca574f6ded0af6a797d39add041bcfb468a1255fbbe82fcb6f592da2/fastuuid-0.14.0-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a8a0dfea3972200f72d4c7df02c8ac70bad1bb4c58d7e0ec1e6f341679073a7f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/fe/8c/fca59f8e21c4deb013f574eae05723737ddb1d2937ce87cb2a5d20992dc3/fastuuid-0.14.0-cp314-cp314-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:1bf539a7a95f35b419f9ad105d5a8a35036df35fdafae48fb2fd2e5f318f0d75" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/cb/e2/f78c271b909c034d429218f2798ca4e89eeda7983f4257d7865976ddbb6c/fastuuid-0.14.0-cp314-cp314-musllinux_1_1_aarch64.whl", hash = "sha256:9a133bf9cc78fdbd1179cb58a59ad0100aa32d8675508150f3658814aeefeaa4" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/1e/f0/5ff209d865897667a2ff3e7a572267a9ced8f7313919f6d6043aed8b1caa/fastuuid-0.14.0-cp314-cp314-musllinux_1_1_i686.whl", hash = "sha256:f54d5b36c56a2d5e1a31e73b950b28a0d83eb0c37b91d10408875a5a29494bad" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e0/c8/2ce1c78f983a2c4987ea865d9516dbdfb141a120fd3abb977ae6f02ba7ca/fastuuid-0.14.0-cp314-cp314-musllinux_1_1_x86_64.whl", hash = "sha256:ec27778c6ca3393ef662e2762dba8af13f4ec1aaa32d08d77f71f2a70ae9feb8" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/df/60/dad662ec9a33b4a5fe44f60699258da64172c39bd041da2994422cdc40fe/fastuuid-0.14.0-cp314-cp314-win32.whl", hash = "sha256:e23fc6a83f112de4be0cc1990e5b127c27663ae43f866353166f87df58e73d06" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/1f/f6/da4db31001e854025ffd26bc9ba0740a9cbba2c3259695f7c5834908b336/fastuuid-0.14.0-cp314-cp314-win_amd64.whl", hash = "sha256:df61342889d0f5e7a32f7284e55ef95103f2110fee433c2ae7c2c0956d76ac8a" },
+]
+
 [[package]]
 name = "filelock"
 version = "3.20.0"
@@ -1919,6 +1991,121 @@ wheels = [
     { url = "https://mirrors.aliyun.com/pypi/packages/cb/b1/3846dd7f199d53cb17f49cba7e651e9ce294d8497c8c150530ed11865bb8/iniconfig-2.3.0-py3-none-any.whl", hash = "sha256:f631c04d2c48c52b84d0d0549c99ff3859c98df65b3101406327ecc7d53fbf12" },
 ]
 
+[[package]]
+name = "jinja2"
+version = "3.1.6"
+source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
+dependencies = [
+    { name = "markupsafe" },
+]
+sdist = { url = "https://mirrors.aliyun.com/pypi/packages/df/bf/f7da0350254c0ed7c72f3e33cef02e048281fec7ecec5f032d4aac52226b/jinja2-3.1.6.tar.gz", hash = "sha256:0137fb05990d35f1275a587e9aee6d56da821fc83491a0fb838183be43f66d6d" }
+wheels = [
+    { url = "https://mirrors.aliyun.com/pypi/packages/62/a1/3d680cbfd5f4b8f15abc1d571870c5fc3e594bb582bc3b64ea099db13e56/jinja2-3.1.6-py3-none-any.whl", hash = "sha256:85ece4451f492d0c13c5dd7c13a64681a86afae63a5f347908daf103ce6d2f67" },
+]
+
+[[package]]
+name = "jiter"
+version = "0.14.0"
+source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
+sdist = { url = "https://mirrors.aliyun.com/pypi/packages/6e/c1/0cddc6eb17d4c53a99840953f95dd3accdc5cfc7a337b0e9b26476276be9/jiter-0.14.0.tar.gz", hash = "sha256:e8a39e66dac7153cf3f964a12aad515afa8d74938ec5cc0018adcdae5367c79e" }
+wheels = [
+    { url = "https://mirrors.aliyun.com/pypi/packages/64/2e/a9959997739c403378d0a4a3a1c4ed80b60aeace216c4d37b303a9fc60a4/jiter-0.14.0-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:02f36a5c700f105ac04a6556fe664a59037a2c200db3b7e88784fac2ddf02531" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/27/72/b6de8a531e0adbadd839bec301165feb1fccf00e9ff55073ba2dd20f0043/jiter-0.14.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:41eab6c09ceffb6f0fe25e214b3068146edb1eda3649ca2aee2a061029c7ba2e" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/db/d8/2040b9efa13c917f855c40890ae4119fe02c25b7c7677d5b4fa820a851fc/jiter-0.14.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5cf4d4c109641f9cfaf4a7b6aebd51654e405cd00fa9ebbf87163b8b97b325aa" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/49/62/655c0ad5ce6a8e90f9068c175b8a236877d753e460762b3183c136db1c5b/jiter-0.14.0-cp310-cp310-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:b80c7b41a628e6be2213ad0ece763c5f88aa5ee003fa394d58acaaee1f4b8342" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f1/66/549c40fa068f08710b7570869c306a051eb67a29758bd64f4114f730554c/jiter-0.14.0-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:fb3dbf7cc0d4dbe73cce307ebe7eefa7f73a7d3d854dd119ea0c243f03e40927" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/25/2f/97a32a05fed14ed58a18e181fdfb619e05163f3726b54ee6080ec0539c09/jiter-0.14.0-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:7054adcdeb06b46efd17b5734f75817a44a2d06d3748e36c3a023a1bb52af9ec" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/2a/3b/4347e1d6c2a973d653bbb7a2d671a2d2426e54b52ba735b8ff0d0a29b75c/jiter-0.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d597cd1bf6790376f3fffc7c708766e57301d99a19314824ea0ccc9c3c70e1e2" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ef/24/ca452fbf2ea33548ed30ce68a39a50442d3f7c9bf0704a7af958a930c057/jiter-0.14.0-cp310-cp310-manylinux_2_31_riscv64.whl", hash = "sha256:df63a14878da754427926281626fd3ee249424a186e25a274e78176d42945264" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e3/a3/94470a0d199287caabeb4da2bb2ae5f6d17f3cf05dfc975d7cb064d58e0f/jiter-0.14.0-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:4ea73187627bcc5810e085df715e8a99da8bdfd96a7eb36b4b4df700ba6d4c9c" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/cf/71/6768edc09d7c45c39f093feb3de105fa718a3e982b5208b8a2ed6382b44b/jiter-0.14.0-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:9f541eaf7bb8382367a1a23d6fc3d6aad57f8dd8c18c3c17f838bee20f217220" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/3d/6b/5c2e17559a0f4e96e934479f7137df46c939e983fa05244e674815befb73/jiter-0.14.0-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:107465250de4fce00fdb47166bcd51df8e634e049541174fe3c71848e44f52ce" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b1/83/c25f3556a60fc74d11199100f1b6cc0c006b815c8494dea8ca16fe398732/jiter-0.14.0-cp310-cp310-win32.whl", hash = "sha256:ffb2a08a406465bb076b7cc1df41d833106d3cf7905076cc73f0cb90078c7d10" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/2e/99/781a1b413f0989b7f2ea203b094b331685f1a35e52e0a45e5d000ecaab27/jiter-0.14.0-cp310-cp310-win_amd64.whl", hash = "sha256:cb8b682d10cb0cce7ff4c1af7244af7022c9b01ae16d46c357bdd0df13afb25d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/8a/1f/198ae537fccb7080a0ed655eb56abf64a92f79489dfbf79f40fa34225bcd/jiter-0.14.0-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:7e791e247b8044512e070bd1f3633dc08350d32776d2d6e7473309d0edf256a2" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/cf/34/da67cff3fce964a36d03c3e365fb0f8726ade2a6cfd4d3c70107e216ead6/jiter-0.14.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:71527ce13fd5a0c4e40ad37331f8c547177dbb2dd0a93e5278b6a5eecf748804" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ed/36/4c72e67180d4e71a4f5dcf7886d0840e83c49ab11788172177a77570326e/jiter-0.14.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:02c4a7ab56f746014874f2c525584c0daca1dec37f66fd707ecef3b7e5c2228c" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/bc/db/9b39e09ceafa9878235c0fc29e3e3f9b12a4c6a98ea3085b998cadf3accc/jiter-0.14.0-cp311-cp311-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:376e9dafff914253bb9d46cdc5f7965607fbe7feb0a491c34e35f92b2770702e" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b0/96/0dcba1d7a82c1b720774b48ef239376addbaf30df24c34742ac4a57b67b2/jiter-0.14.0-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:23ad2a7a9da1935575c820428dd8d2490ce4d23189691ce33da1fc0a58e14e1c" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f1/e3/f61b71543e746e6b8b805e7755814fc242715c16f1dba58e1cbccb8032c2/jiter-0.14.0-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:54b3ddf5786bc7732d293bba3411ac637ecfa200a39983166d1df86a59a43c9f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ad/5e/0ddeb7096aca099114abe36c4921016e8d251e6f35f5890240b31f1f60ae/jiter-0.14.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5c001d5a646c2a50dc055dd526dad5d5245969e8234d2b1131d0451e81f3a373" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e9/d1/fe0c46cd7fda9cad8f1ff9ad217dc61f1e4280b21052ec6dfe88c1446ef2/jiter-0.14.0-cp311-cp311-manylinux_2_31_riscv64.whl", hash = "sha256:834bb5bdabca2e91592a03d373838a8d0a1b8bbde7077ae6913fd2fc51812d00" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ac/21/f5317f91729b501019184771c80d60abd89907009e7bfa6c7e348c5bdd44/jiter-0.14.0-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:4e9178be60e229b1b2b0710f61b9e24d1f4f8556985a83ff4c4f95920eea7314" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e9/05/79d8f33fb2bf168db0df5c9cd16fe440a8ada57e929d3677b22712c2568f/jiter-0.14.0-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:a7e4ccff04ec03614e62c613e976a3a5860dc9714ce8266f44328bdc8b1cab2c" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/5c/00/d1e3ff3d2a465e67f08507d74bafb2dcd29eba91dc939820e39e8dea38b8/jiter-0.14.0-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:69539d936fb5d55caf6ecd33e2e884de083ff0ea28579780d56c4403094bb8d9" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/60/5b/bbb2189f62ace8d95e869aa4c84c9946616f301e2d02895a6f20dcc3bba3/jiter-0.14.0-cp311-cp311-win32.whl", hash = "sha256:4927d09b3e572787cc5e0a5318601448e1ab9391bcef95677f5840c2d00eaa6d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b8/86/c500b53dcbf08575f5963e536ebd757a1f7c568272ba5d180b212c9a87fb/jiter-0.14.0-cp311-cp311-win_amd64.whl", hash = "sha256:42d6ed359ac49eb922fdd565f209c57340aa06d589c84c8413e42a0f9ae1b842" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/75/4a/a676249049d42cb29bef82233e4fe0524d414cbe3606c7a4b311193c2f77/jiter-0.14.0-cp311-cp311-win_arm64.whl", hash = "sha256:6dd689f5f4a5a33747b28686e051095beb214fe28cfda5e9fe58a295a788f593" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/5a/68/7390a418f10897da93b158f2d5a8bd0bcd73a0f9ec3bb36917085bb759ef/jiter-0.14.0-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:2fb2ce3a7bc331256dfb14cefc34832366bb28a9aca81deaf43bbf2a5659e607" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/60/a0/5854ac00ff63551c52c6c89534ec6aba4b93474e7924d64e860b1c94165b/jiter-0.14.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:5252a7ca23785cef5d02d4ece6077a1b556a410c591b379f82091c3001e14844" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/41/a1/4f44832650a16b18e8391f1bf1d6ca4909bc738351826bcc198bba4357f4/jiter-0.14.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c409578cbd77c338975670ada777add4efd53379667edf0aceea730cabede6fb" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/48/64/a329e9d469f86307203594b1707e11ae51c3348d03bfd514a5f997870012/jiter-0.14.0-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:7ede4331a1899d604463369c730dbb961ffdc5312bc7f16c41c2896415b1304a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/94/c1/5e3dfc59635aa4d4c7bd20a820ac1d09b8ed851568356802cf1c08edb3cf/jiter-0.14.0-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:92cd8b6025981a041f5310430310b55b25ca593972c16407af8837d3d7d2ca01" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e3/1b/dd157009dbc058f7b00108f545ccb72a2d56461395c4fc7b9cfdccb00af4/jiter-0.14.0-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:351bf6eda4e3a7ceb876377840c702e9a3e4ecc4624dbfb2d6463c67ae52637d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/91/78/256013667b7c10b8834f8e6e54cd3e562d4c6e34227a1596addccc05e38c/jiter-0.14.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c1dcfbeb93d9ecd9ca128bbf8910120367777973fa193fb9a39c31237d8df165" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/de/d9/137d65ade9093a409fe80955ce60b12bb753722c986467aeda47faf450ad/jiter-0.14.0-cp312-cp312-manylinux_2_31_riscv64.whl", hash = "sha256:ae039aaef8de3f8157ecc1fdd4d85043ac4f57538c245a0afaecb8321ec951c3" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/2e/48/76750835b87029342727c1a268bea8878ab988caf81ee4e7b880900eeb5a/jiter-0.14.0-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:7d9d51eb96c82a9652933bd769fe6de66877d6eb2b2440e281f2938c51b5643e" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/a6/60/456c4e81d5c8045279aefe60e9e483be08793828800a4e64add8fdde7f2a/jiter-0.14.0-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:d824ca4148b705970bf4e120924a212fdfca9859a73e42bd7889a63a4ea6bb98" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/a8/9f/2020e0984c235f678dced38fe4eec3058cf528e6af36ebf969b410305941/jiter-0.14.0-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:ff3a6465b3a0f54b1a430f45c3c0ba7d61ceb45cbc3e33f9e1a7f638d690baf3" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ef/32/e2d298e1a22a4bbe6062136d1c7192db7dba003a6975e51d9a9eecabc4c2/jiter-0.14.0-cp312-cp312-win32.whl", hash = "sha256:5dec7c0a3e98d2a3f8a2e67382d0d7c3ac60c69103a4b271da889b4e8bb1e129" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/36/ac/96369141b3d8a4a8e4590e983085efe1c436f35c0cda940dd76d942e3e40/jiter-0.14.0-cp312-cp312-win_amd64.whl", hash = "sha256:fc7e37b4b8bc7e80a63ad6cfa5fc11fab27dbfea4cc4ae644b1ab3f273dc348f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/01/c3/75d847f264647017d7e3052bbcc8b1e24b95fa139c320c5f5066fa7a0bdd/jiter-0.14.0-cp312-cp312-win_arm64.whl", hash = "sha256:ee4a72f12847ef29b072aee9ad5474041ab2924106bdca9fcf5d7d965853e057" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/97/2a/09f70020898507a89279659a1afe3364d57fc1b2c89949081975d135f6f5/jiter-0.14.0-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:af72f204cf4d44258e5b4c1745130ac45ddab0e71a06333b01de660ab4187a94" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/d6/be/080c96a45cd74f9fce5db4fd68510b88087fb37ffe2541ff73c12db92535/jiter-0.14.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:4b77da71f6e819be5fbcec11a453fde5b1d0267ef6ed487e2a392fd8e14e4e3a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/7d/5e/2d0fee155826a968a832cc32438de5e2a193292c8721ca70d0b53e58245b/jiter-0.14.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:77f4ea612fe8b84b8b04e51d0e78029ecf3466348e25973f953de6e6a59aa4c1" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/70/af/bf9ee0d3a4f8dc0d679fc1337f874fe60cdbf841ebbb304b374e1c9aaceb/jiter-0.14.0-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:62fe2451f8fcc0240261e6a4df18ecbcd58327857e61e625b2393ea3b468aac9" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/0f/83/8e8561eadba31f4d3948a5b712fb0447ec71c3560b57a855449e7b8ddc98/jiter-0.14.0-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:6112f26f5afc75bcb475787d29da3aa92f9d09c7858f632f4be6ffe607be82e9" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f6/c9/c5299e826a5fe6108d172b344033f61c69b1bb979dd8d9ddd4278a160971/jiter-0.14.0-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:215a6cb8fb7dc702aa35d475cc00ddc7f970e5c0b1417fb4b4ac5d82fa2a29db" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/5d/37/c16d9d15c0a471b8644b1abe3c82668092a707d9bedcf076f24ff2e380cd/jiter-0.14.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fc4ab96a30fb3cb2c7e0cd33f7616c8860da5f5674438988a54ac717caccdbaa" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/58/ea/8050cb0dc654e728e1bfacbc0c640772f2181af5dedd13ae70145743a439/jiter-0.14.0-cp313-cp313-manylinux_2_31_riscv64.whl", hash = "sha256:3a99c1387b1f2928f799a9de899193484d66206a50e98233b6b088a7f0c1edb2" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b0/3b/cf71506d270e5f84d97326bf220e47aed9b95e9a4a060758fb07772170ab/jiter-0.14.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:ab18d11074485438695f8d34a1b6da61db9754248f96d51341956607a8f39985" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b0/cc/8c6c74a3efb5bd671bfd14f51e8a73375464ca914b1551bc3b40e26ac2c9/jiter-0.14.0-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:801028dcfc26ac0895e4964cbc0fd62c73be9fd4a7d7b1aaf6e5790033a719b7" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/41/24/68d7b883ec959884ddf00d019b2e0e82ba81b167e1253684fa90519ce33c/jiter-0.14.0-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:ad425b087aafb4a1c7e1e98a279200743b9aaf30c3e0ba723aec93f061bd9bc8" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b6/89/b1a0985223bbf3150ff9e8f46f98fc9360c1de94f48abe271bbe1b465682/jiter-0.14.0-cp313-cp313-win32.whl", hash = "sha256:882bcb9b334318e233950b8be366fe5f92c86b66a7e449e76975dfd6d776a01f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/4c/19/3f339a5a7f14a11730e67f6be34f9d5105751d547b615ef593fa122a5ded/jiter-0.14.0-cp313-cp313-win_amd64.whl", hash = "sha256:9b8c571a5dba09b98bd3462b5a53f27209a5cbbe85670391692ede71974e979f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/50/56/752dd89c84be0e022a8ea3720bcfa0a8431db79a962578544812ce061739/jiter-0.14.0-cp313-cp313-win_arm64.whl", hash = "sha256:34f19dcc35cb1abe7c369b3756babf8c7f04595c0807a848df8f26ef8298ef92" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/91/28/292916f354f25a1fe8cf2c918d1415c699a4a659ae00be0430e1c5d9ffea/jiter-0.14.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:e89bcd7d426a75bb4952c696b267075790d854a07aad4c9894551a82c5b574ab" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ad/c7/b002a7d8b8957ac3d469bd59c18ef4b1595a5216ae0de639a287b9816023/jiter-0.14.0-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7b25beaa0d4447ea8c7ae0c18c688905d34840d7d0b937f2f7bdd52162c98a40" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f9/3b/f8d07580d8706021d255a6356b8fab13ee4c869412995550ce6ed4ddf97d/jiter-0.14.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:651a8758dd413c51e3b7f6557cdc6921faf70b14106f45f969f091f5cda990ea" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/47/5b/ac1a974da29e35507230383110ffec59998b290a8732585d04e19a9eb5ba/jiter-0.14.0-cp313-cp313t-win_amd64.whl", hash = "sha256:e1a7eead856a5038a8d291f1447176ab0b525c77a279a058121b5fccee257f6f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/96/6d/9fc8433d667d2454271378a79747d8c76c10b51b482b454e6190e511f244/jiter-0.14.0-cp313-cp313t-win_arm64.whl", hash = "sha256:2e692633a12cda97e352fdcd1c4acc971b1c28707e1e33aeef782b0cbf051975" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/4f/1e/354ed92461b165bd581f9ef5150971a572c873ec3b68a916d5aa91da3cc2/jiter-0.14.0-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:6f396837fc7577871ca8c12edaf239ed9ccef3bbe39904ae9b8b63ce0a48b140" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/a6/95/8c7c7028aa8636ac21b7a55faef3e34215e6ed0cbf5ae58258427f621aa3/jiter-0.14.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:a4d50ea3d8ba4176f79754333bd35f1bbcd28e91adc13eb9b7ca91bc52a6cef9" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/47/40/e2a852a44c4a089f2681a16611b7ce113224a80fd8504c46d78491b47220/jiter-0.14.0-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ce17f8a050447d1b4153bda4fb7d26e6a9e74eb4f4a41913f30934c5075bf615" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/fc/1f/670f92adee1e9895eac41e8a4d623b6da68c4d46249d8b556b60b63f949e/jiter-0.14.0-cp314-cp314-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:f4f1c4b125e1652aefbc2e2c1617b60a160ab789d180e3d423c41439e5f32850" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/01/2f/541c9ba567d05de1c4874a0f8f8c5e3fd78e2b874266623da9a775cf46e0/jiter-0.14.0-cp314-cp314-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:be808176a6a3a14321d18c603f2d40741858a7c4fc982f83232842689fe86dd9" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ce/a9/c31cbec09627e0d5de7aeaec7690dba03e090caa808fefd8133137cf45bc/jiter-0.14.0-cp314-cp314-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:26679d58ba816f88c3849306dd58cb863a90a1cf352cdd4ef67e30ccf8a77994" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/50/02/3c05c1666c41904a2f607475a73e7a4763d1cbde2d18229c4f85b22dc253/jiter-0.14.0-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:80381f5a19af8fa9aef743f080e34f6b25ebd89656475f8cf0470ec6157052aa" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/7d/97/e15b33545c2b13518f560d695f974b9891b311641bdcf178d63177e8801e/jiter-0.14.0-cp314-cp314-manylinux_2_31_riscv64.whl", hash = "sha256:004df5fdb8ecbd6d99f3227df18ba1a259254c4359736a2e6f036c944e02d7c5" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ad/d2/8b1461def6b96ba44530df20d07ef7a1c7da22f3f9bf1727e2d611077bf1/jiter-0.14.0-cp314-cp314-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:cff5708f7ed0fa098f2b53446c6fa74c48469118e5cd7497b4f1cd569ab06928" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e3/88/837566dd6ed6e452e8d3205355afd484ce44b2533edfa4ed73a298ea893e/jiter-0.14.0-cp314-cp314-musllinux_1_1_aarch64.whl", hash = "sha256:2492e5f06c36a976d25c7cc347a60e26d5470178d44cde1b9b75e60b4e519f28" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/89/6b/b00b45c4d1b4c031777fe161d620b755b5b02cdade1e316dcb46e4471d63/jiter-0.14.0-cp314-cp314-musllinux_1_1_x86_64.whl", hash = "sha256:7609cfbe3a03d37bfdbf5052012d5a879e72b83168a363deae7b3a26564d57de" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ad/d8/6fe5b42011d19397433d345716eac16728ac241862a2aac9c91923c7509a/jiter-0.14.0-cp314-cp314-win32.whl", hash = "sha256:7282342d32e357543565286b6450378c3cd402eea333fc1ebe146f1fabb306fc" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e5/43/5c2e08da1efad5e410f0eaaabeadd954812612c33fbbd8fd5328b489139d/jiter-0.14.0-cp314-cp314-win_amd64.whl", hash = "sha256:bd77945f38866a448e73b0b7637366afa814d4617790ecd88a18ca74377e6c02" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/aa/1f/6e39ac0b4cdfa23e606af5b245df5f9adaa76f35e0c5096790da430ca506/jiter-0.14.0-cp314-cp314-win_arm64.whl", hash = "sha256:f2d4c61da0821ee42e0cdf5489da60a6d074306313a377c2b35af464955a3611" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/05/57/7dbc0ffbbb5176a27e3518716608aa464aee2e2887dc938f0b900a120449/jiter-0.14.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:1bf7ff85517dd2f20a5750081d2b75083c1b269cf75afc7511bdf1f9548beb3b" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/83/6e/7b3314398d8983f06b557aa21b670511ec72d3b79a68ee5e4d9bff972286/jiter-0.14.0-cp314-cp314t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c8ef8791c3e78d6c6b157c6d360fbb5c715bebb8113bc6a9303c5caff012754a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ae/4f/8dc674bcd7db6dba566de73c08c763c337058baff1dbeb34567045b27cdc/jiter-0.14.0-cp314-cp314t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:e74663b8b10da1fe0f4e4703fd7980d24ad17174b6bb35d8498d6e3ebce2ae6a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/3b/5f/188e09a1f20906f98bbdec44ed820e19f4e8eb8aff88b9d1a5a497587ff3/jiter-0.14.0-cp314-cp314t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:1aca29ba52913f78362ec9c2da62f22cdc4c3083313403f90c15460979b84d9b" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ac/f0/19046ef965ed8f349e8554775bb12ff4352f443fbe12b95d31f575891256/jiter-0.14.0-cp314-cp314t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:8b39b7d87a952b79949af5fef44d2544e58c21a28da7f1bae3ef166455c61746" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/c4/c3/da43bd8431ee175695777ee78cf0e93eacbb47393ff493f18c45231b427d/jiter-0.14.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:78d918a68b26e9fab068c2b5453577ef04943ab2807b9a6275df2a812599a310" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/72/26/e054771be889707c6161dbdec9c23d33a9ec70945395d70f07cfea1e9a6f/jiter-0.14.0-cp314-cp314t-manylinux_2_31_riscv64.whl", hash = "sha256:b08997c35aee1201c1a5361466a8fb9162d03ae7bf6568df70b6c859f1e654a4" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/c3/0f/7bea65ea2a6d91f2bf989ff11a18136644392bf2b0497a1fa50934c30a9c/jiter-0.14.0-cp314-cp314t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:260bf7ca20704d58d41f669e5e9fe7fe2fa72901a6b324e79056f5d52e9c9be2" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/3c/a1/b1ff7d70deef61ac0b7c6c2f12d2ace950cdeecb4fdc94500a0926802857/jiter-0.14.0-cp314-cp314t-musllinux_1_1_aarch64.whl", hash = "sha256:37826e3df29e60f30a382f9294348d0238ef127f4b5d7f5f8da78b5b9e050560" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/0b/7b/3b0649983cbaf15eda26a414b5b1982e910c67bd6f7b1b490f3cfc76896a/jiter-0.14.0-cp314-cp314t-musllinux_1_1_x86_64.whl", hash = "sha256:645be49c46f2900937ba0eaf871ad5183c96858c0af74b6becc7f4e367e36e06" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/97/f8/33d78c83bd93ae0c0af05293a6660f88a1977caef39a6d72a84afab94ce0/jiter-0.14.0-cp314-cp314t-win32.whl", hash = "sha256:2f7877ed45118de283786178eceaf877110abacd04fde31efff3940ae9672674" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/d6/ac/2b760516c03e2227826d1f7025d89bf6bf6357a28fe75c2a2800873c50bf/jiter-0.14.0-cp314-cp314t-win_amd64.whl", hash = "sha256:14c0cb10337c49f5eafe8e7364daca5e29a020ea03580b8f8e6c597fed4e1588" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/dc/2e/a44c20c58aeed0355f2d326969a181696aeb551a25195f47563908a815be/jiter-0.14.0-cp314-cp314t-win_arm64.whl", hash = "sha256:5419d4aa2024961da9fe12a9cfe7484996735dca99e8e090b5c88595ef1951ff" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/32/a1/ef34ca2cab2962598591636a1804b93645821201cc0095d4a93a9a329c9d/jiter-0.14.0-graalpy311-graalpy242_311_native-macosx_10_12_x86_64.whl", hash = "sha256:a25ffa2dbbdf8721855612f6dca15c108224b12d0c4024d0ac3d7902132b4211" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/60/bb/520576a532a6b8a6f42747afed289c8448c879a34d7802fe2c832d4fd38f/jiter-0.14.0-graalpy311-graalpy242_311_native-macosx_11_0_arm64.whl", hash = "sha256:0ac9cbaa86c10996b92bd12c91659b60f939f8e28fcfa6bc11a0e90a774ce95b" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b2/7c/c16db114ea1f2f532f198aa8dc39585026af45af362c69a0492f31bc4821/jiter-0.14.0-graalpy311-graalpy242_311_native-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:844e73b6c56b505e9e169234ea3bdea2ea43f769f847f47ac559ba1d2361ebea" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/99/8f/15e7741ff19e9bcd4d753f7ff22f988fd54592f134ca13701c13ea8c20e0/jiter-0.14.0-graalpy311-graalpy242_311_native-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e52c076f187405fc21523c746c04399c9af8ece566077ed147b2126f2bcba577" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/21/42/9042c3f3019de4adcb8c16591c325ec7255beea9fcd33a42a43f3b0b1000/jiter-0.14.0-graalpy312-graalpy250_312_native-macosx_10_12_x86_64.whl", hash = "sha256:fbd9e482663ca9d005d051330e4d2d8150bb208a209409c10f7e7dfdf7c49da9" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/60/cf/a7e19b308bd86bb04776803b1f01a5f9a287a4c55205f4708827ee487fbf/jiter-0.14.0-graalpy312-graalpy250_312_native-macosx_11_0_arm64.whl", hash = "sha256:33a20d838b91ef376b3a56896d5b04e725c7df5bc4864cc6569cf046a8d73b6d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ca/44/e26ede3f0caeff93f222559cb0cc4ca68579f07d009d7b6010c5b586f9b1/jiter-0.14.0-graalpy312-graalpy250_312_native-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:432c4db5255d86a259efde91e55cb4c8d18c0521d844c9e2e7efcce3899fb016" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/da/e9/1f9ada30cef7b05e74bb06f52127e7a724976c225f46adb65c37b1dadfb6/jiter-0.14.0-graalpy312-graalpy250_312_native-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:67f00d94b281174144d6532a04b66a12cb866cbdc47c3af3bfe2973677f9861a" },
+]
+
 [[package]]
 name = "jmespath"
 version = "0.10.0"
@@ -2122,6 +2309,29 @@ antlr4-13-2 = [
     { name = "antlr4-python3-runtime" },
 ]
 
+[[package]]
+name = "litellm"
+version = "1.82.6"
+source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
+dependencies = [
+    { name = "aiohttp" },
+    { name = "click" },
+    { name = "fastuuid" },
+    { name = "httpx" },
+    { name = "importlib-metadata" },
+    { name = "jinja2" },
+    { name = "jsonschema" },
+    { name = "openai" },
+    { name = "pydantic" },
+    { name = "python-dotenv" },
+    { name = "tiktoken" },
+    { name = "tokenizers" },
+]
+sdist = { url = "https://mirrors.aliyun.com/pypi/packages/29/75/1c537aa458426a9127a92bc2273787b2f987f4e5044e21f01f2eed5244fd/litellm-1.82.6.tar.gz", hash = "sha256:2aa1c2da21fe940c33613aa447119674a3ad4d2ad5eb064e4d5ce5ee42420136" }
+wheels = [
+    { url = "https://mirrors.aliyun.com/pypi/packages/02/6c/5327667e6dbe9e98cbfbd4261c8e91386a52e38f41419575854248bbab6a/litellm-1.82.6-py3-none-any.whl", hash = "sha256:164a3ef3e19f309e3cabc199bef3d2045212712fefdfa25fc7f75884a5b5b205" },
+]
+
 [[package]]
 name = "magiccube"
 version = "0.3.0"
@@ -2146,6 +2356,91 @@ wheels = [
     { url = "https://mirrors.aliyun.com/pypi/packages/94/54/e7d793b573f298e1c9013b8c4dade17d481164aa517d1d7148619c2cedbf/markdown_it_py-4.0.0-py3-none-any.whl", hash = "sha256:87327c59b172c5011896038353a81343b6754500a08cd7a4973bb48c6d578147" },
 ]
 
+[[package]]
+name = "markupsafe"
+version = "3.0.3"
+source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
+sdist = { url = "https://mirrors.aliyun.com/pypi/packages/7e/99/7690b6d4034fffd95959cbe0c02de8deb3098cc577c67bb6a24fe5d7caa7/markupsafe-3.0.3.tar.gz", hash = "sha256:722695808f4b6457b320fdc131280796bdceb04ab50fe1795cd540799ebe1698" }
+wheels = [
+    { url = "https://mirrors.aliyun.com/pypi/packages/e8/4b/3541d44f3937ba468b75da9eebcae497dcf67adb65caa16760b0a6807ebb/markupsafe-3.0.3-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:2f981d352f04553a7171b8e44369f2af4055f888dfb147d55e42d29e29e74559" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/98/1b/fbd8eed11021cabd9226c37342fa6ca4e8a98d8188a8d9b66740494960e4/markupsafe-3.0.3-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:e1c1493fb6e50ab01d20a22826e57520f1284df32f2d8601fdd90b6304601419" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/40/01/e560d658dc0bb8ab762670ece35281dec7b6c1b33f5fbc09ebb57a185519/markupsafe-3.0.3-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:1ba88449deb3de88bd40044603fafffb7bc2b055d626a330323a9ed736661695" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/af/cd/ce6e848bbf2c32314c9b237839119c5a564a59725b53157c856e90937b7a/markupsafe-3.0.3-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:f42d0984e947b8adf7dd6dde396e720934d12c506ce84eea8476409563607591" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/c9/2a/b5c12c809f1c3045c4d580b035a743d12fcde53cf685dbc44660826308da/markupsafe-3.0.3-cp310-cp310-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:c0c0b3ade1c0b13b936d7970b1d37a57acde9199dc2aecc4c336773e1d86049c" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/cf/e3/9427a68c82728d0a88c50f890d0fc072a1484de2f3ac1ad0bfc1a7214fd5/markupsafe-3.0.3-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:0303439a41979d9e74d18ff5e2dd8c43ed6c6001fd40e5bf2e43f7bd9bbc523f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/bc/36/23578f29e9e582a4d0278e009b38081dbe363c5e7165113fad546918a232/markupsafe-3.0.3-cp310-cp310-musllinux_1_2_riscv64.whl", hash = "sha256:d2ee202e79d8ed691ceebae8e0486bd9a2cd4794cec4824e1c99b6f5009502f6" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/56/21/dca11354e756ebd03e036bd8ad58d6d7168c80ce1fe5e75218e4945cbab7/markupsafe-3.0.3-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:177b5253b2834fe3678cb4a5f0059808258584c559193998be2601324fdeafb1" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/87/99/faba9369a7ad6e4d10b6a5fbf71fa2a188fe4a593b15f0963b73859a1bbd/markupsafe-3.0.3-cp310-cp310-win32.whl", hash = "sha256:2a15a08b17dd94c53a1da0438822d70ebcd13f8c3a95abe3a9ef9f11a94830aa" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/d6/25/55dc3ab959917602c96985cb1253efaa4ff42f71194bddeb61eb7278b8be/markupsafe-3.0.3-cp310-cp310-win_amd64.whl", hash = "sha256:c4ffb7ebf07cfe8931028e3e4c85f0357459a3f9f9490886198848f4fa002ec8" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/d0/9e/0a02226640c255d1da0b8d12e24ac2aa6734da68bff14c05dd53b94a0fc3/markupsafe-3.0.3-cp310-cp310-win_arm64.whl", hash = "sha256:e2103a929dfa2fcaf9bb4e7c091983a49c9ac3b19c9061b6d5427dd7d14d81a1" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/08/db/fefacb2136439fc8dd20e797950e749aa1f4997ed584c62cfb8ef7c2be0e/markupsafe-3.0.3-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:1cc7ea17a6824959616c525620e387f6dd30fec8cb44f649e31712db02123dad" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e1/2e/5898933336b61975ce9dc04decbc0a7f2fee78c30353c5efba7f2d6ff27a/markupsafe-3.0.3-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:4bd4cd07944443f5a265608cc6aab442e4f74dff8088b0dfc8238647b8f6ae9a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/1d/09/adf2df3699d87d1d8184038df46a9c80d78c0148492323f4693df54e17bb/markupsafe-3.0.3-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:6b5420a1d9450023228968e7e6a9ce57f65d148ab56d2313fcd589eee96a7a50" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/30/ac/0273f6fcb5f42e314c6d8cd99effae6a5354604d461b8d392b5ec9530a54/markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0bf2a864d67e76e5c9a34dc26ec616a66b9888e25e7b9460e1c76d3293bd9dbf" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/19/ae/31c1be199ef767124c042c6c3e904da327a2f7f0cd63a0337e1eca2967a8/markupsafe-3.0.3-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:bc51efed119bc9cfdf792cdeaa4d67e8f6fcccab66ed4bfdd6bde3e59bfcbb2f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b2/76/7edcab99d5349a4532a459e1fe64f0b0467a3365056ae550d3bcf3f79e1e/markupsafe-3.0.3-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:068f375c472b3e7acbe2d5318dea141359e6900156b5b2ba06a30b169086b91a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/a4/28/6e74cdd26d7514849143d69f0bf2399f929c37dc2b31e6829fd2045b2765/markupsafe-3.0.3-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:7be7b61bb172e1ed687f1754f8e7484f1c8019780f6f6b0786e76bb01c2ae115" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/62/7e/a145f36a5c2945673e590850a6f8014318d5577ed7e5920a4b3448e0865d/markupsafe-3.0.3-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:f9e130248f4462aaa8e2552d547f36ddadbeaa573879158d721bbd33dfe4743a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/0f/62/d9c46a7f5c9adbeeeda52f5b8d802e1094e9717705a645efc71b0913a0a8/markupsafe-3.0.3-cp311-cp311-win32.whl", hash = "sha256:0db14f5dafddbb6d9208827849fad01f1a2609380add406671a26386cdf15a19" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/83/8a/4414c03d3f891739326e1783338e48fb49781cc915b2e0ee052aa490d586/markupsafe-3.0.3-cp311-cp311-win_amd64.whl", hash = "sha256:de8a88e63464af587c950061a5e6a67d3632e36df62b986892331d4620a35c01" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/35/73/893072b42e6862f319b5207adc9ae06070f095b358655f077f69a35601f0/markupsafe-3.0.3-cp311-cp311-win_arm64.whl", hash = "sha256:3b562dd9e9ea93f13d53989d23a7e775fdfd1066c33494ff43f5418bc8c58a5c" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/5a/72/147da192e38635ada20e0a2e1a51cf8823d2119ce8883f7053879c2199b5/markupsafe-3.0.3-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:d53197da72cc091b024dd97249dfc7794d6a56530370992a5e1a08983ad9230e" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/9a/81/7e4e08678a1f98521201c3079f77db69fb552acd56067661f8c2f534a718/markupsafe-3.0.3-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:1872df69a4de6aead3491198eaf13810b565bdbeec3ae2dc8780f14458ec73ce" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/1e/2c/799f4742efc39633a1b54a92eec4082e4f815314869865d876824c257c1e/markupsafe-3.0.3-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:3a7e8ae81ae39e62a41ec302f972ba6ae23a5c5396c8e60113e9066ef893da0d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/3c/2e/8d0c2ab90a8c1d9a24f0399058ab8519a3279d1bd4289511d74e909f060e/markupsafe-3.0.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d6dd0be5b5b189d31db7cda48b91d7e0a9795f31430b7f271219ab30f1d3ac9d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/2c/54/887f3092a85238093a0b2154bd629c89444f395618842e8b0c41783898ea/markupsafe-3.0.3-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:94c6f0bb423f739146aec64595853541634bde58b2135f27f61c1ffd1cd4d16a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/c9/2f/336b8c7b6f4a4d95e91119dc8521402461b74a485558d8f238a68312f11c/markupsafe-3.0.3-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:be8813b57049a7dc738189df53d69395eba14fb99345e0a5994914a3864c8a4b" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/32/43/67935f2b7e4982ffb50a4d169b724d74b62a3964bc1a9a527f5ac4f1ee2b/markupsafe-3.0.3-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:83891d0e9fb81a825d9a6d61e3f07550ca70a076484292a70fde82c4b807286f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/89/e0/4486f11e51bbba8b0c041098859e869e304d1c261e59244baa3d295d47b7/markupsafe-3.0.3-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:77f0643abe7495da77fb436f50f8dab76dbc6e5fd25d39589a0f1fe6548bfa2b" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/2f/e1/78ee7a023dac597a5825441ebd17170785a9dab23de95d2c7508ade94e0e/markupsafe-3.0.3-cp312-cp312-win32.whl", hash = "sha256:d88b440e37a16e651bda4c7c2b930eb586fd15ca7406cb39e211fcff3bf3017d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/aa/5b/bec5aa9bbbb2c946ca2733ef9c4ca91c91b6a24580193e891b5f7dbe8e1e/markupsafe-3.0.3-cp312-cp312-win_amd64.whl", hash = "sha256:26a5784ded40c9e318cfc2bdb30fe164bdb8665ded9cd64d500a34fb42067b1c" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e5/f1/216fc1bbfd74011693a4fd837e7026152e89c4bcf3e77b6692fba9923123/markupsafe-3.0.3-cp312-cp312-win_arm64.whl", hash = "sha256:35add3b638a5d900e807944a078b51922212fb3dedb01633a8defc4b01a3c85f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/38/2f/907b9c7bbba283e68f20259574b13d005c121a0fa4c175f9bed27c4597ff/markupsafe-3.0.3-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:e1cf1972137e83c5d4c136c43ced9ac51d0e124706ee1c8aa8532c1287fa8795" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/9c/d9/5f7756922cdd676869eca1c4e3c0cd0df60ed30199ffd775e319089cb3ed/markupsafe-3.0.3-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:116bb52f642a37c115f517494ea5feb03889e04df47eeff5b130b1808ce7c219" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/00/07/575a68c754943058c78f30db02ee03a64b3c638586fba6a6dd56830b30a3/markupsafe-3.0.3-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:133a43e73a802c5562be9bbcd03d090aa5a1fe899db609c29e8c8d815c5f6de6" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/a9/21/9b05698b46f218fc0e118e1f8168395c65c8a2c750ae2bab54fc4bd4e0e8/markupsafe-3.0.3-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:ccfcd093f13f0f0b7fdd0f198b90053bf7b2f02a3927a30e63f3ccc9df56b676" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/7f/71/544260864f893f18b6827315b988c146b559391e6e7e8f7252839b1b846a/markupsafe-3.0.3-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:509fa21c6deb7a7a273d629cf5ec029bc209d1a51178615ddf718f5918992ab9" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/c2/28/b50fc2f74d1ad761af2f5dcce7492648b983d00a65b8c0e0cb457c82ebbe/markupsafe-3.0.3-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:a4afe79fb3de0b7097d81da19090f4df4f8d3a2b3adaa8764138aac2e44f3af1" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ed/76/104b2aa106a208da8b17a2fb72e033a5a9d7073c68f7e508b94916ed47a9/markupsafe-3.0.3-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:795e7751525cae078558e679d646ae45574b47ed6e7771863fcc079a6171a0fc" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b5/99/16a5eb2d140087ebd97180d95249b00a03aa87e29cc224056274f2e45fd6/markupsafe-3.0.3-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:8485f406a96febb5140bfeca44a73e3ce5116b2501ac54fe953e488fb1d03b12" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/19/bc/e7140ed90c5d61d77cea142eed9f9c303f4c4806f60a1044c13e3f1471d0/markupsafe-3.0.3-cp313-cp313-win32.whl", hash = "sha256:bdd37121970bfd8be76c5fb069c7751683bdf373db1ed6c010162b2a130248ed" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/05/73/c4abe620b841b6b791f2edc248f556900667a5a1cf023a6646967ae98335/markupsafe-3.0.3-cp313-cp313-win_amd64.whl", hash = "sha256:9a1abfdc021a164803f4d485104931fb8f8c1efd55bc6b748d2f5774e78b62c5" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f0/3a/fa34a0f7cfef23cf9500d68cb7c32dd64ffd58a12b09225fb03dd37d5b80/markupsafe-3.0.3-cp313-cp313-win_arm64.whl", hash = "sha256:7e68f88e5b8799aa49c85cd116c932a1ac15caaa3f5db09087854d218359e485" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e4/d7/e05cd7efe43a88a17a37b3ae96e79a19e846f3f456fe79c57ca61356ef01/markupsafe-3.0.3-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:218551f6df4868a8d527e3062d0fb968682fe92054e89978594c28e642c43a73" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/99/9e/e412117548182ce2148bdeacdda3bb494260c0b0184360fe0d56389b523b/markupsafe-3.0.3-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:3524b778fe5cfb3452a09d31e7b5adefeea8c5be1d43c4f810ba09f2ceb29d37" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/bc/e6/fa0ffcda717ef64a5108eaa7b4f5ed28d56122c9a6d70ab8b72f9f715c80/markupsafe-3.0.3-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:4e885a3d1efa2eadc93c894a21770e4bc67899e3543680313b09f139e149ab19" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/96/ec/2102e881fe9d25fc16cb4b25d5f5cde50970967ffa5dddafdb771237062d/markupsafe-3.0.3-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:8709b08f4a89aa7586de0aadc8da56180242ee0ada3999749b183aa23df95025" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/4b/30/6f2fce1f1f205fc9323255b216ca8a235b15860c34b6798f810f05828e32/markupsafe-3.0.3-cp313-cp313t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:b8512a91625c9b3da6f127803b166b629725e68af71f8184ae7e7d54686a56d6" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/58/47/4a0ccea4ab9f5dcb6f79c0236d954acb382202721e704223a8aafa38b5c8/markupsafe-3.0.3-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:9b79b7a16f7fedff2495d684f2b59b0457c3b493778c9eed31111be64d58279f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/6a/70/3780e9b72180b6fecb83a4814d84c3bf4b4ae4bf0b19c27196104149734c/markupsafe-3.0.3-cp313-cp313t-musllinux_1_2_riscv64.whl", hash = "sha256:12c63dfb4a98206f045aa9563db46507995f7ef6d83b2f68eda65c307c6829eb" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/98/c5/c03c7f4125180fc215220c035beac6b9cb684bc7a067c84fc69414d315f5/markupsafe-3.0.3-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:8f71bc33915be5186016f675cd83a1e08523649b0e33efdb898db577ef5bb009" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/80/d6/2d1b89f6ca4bff1036499b1e29a1d02d282259f3681540e16563f27ebc23/markupsafe-3.0.3-cp313-cp313t-win32.whl", hash = "sha256:69c0b73548bc525c8cb9a251cddf1931d1db4d2258e9599c28c07ef3580ef354" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/2b/98/e48a4bfba0a0ffcf9925fe2d69240bfaa19c6f7507b8cd09c70684a53c1e/markupsafe-3.0.3-cp313-cp313t-win_amd64.whl", hash = "sha256:1b4b79e8ebf6b55351f0d91fe80f893b4743f104bff22e90697db1590e47a218" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/0e/72/e3cc540f351f316e9ed0f092757459afbc595824ca724cbc5a5d4263713f/markupsafe-3.0.3-cp313-cp313t-win_arm64.whl", hash = "sha256:ad2cf8aa28b8c020ab2fc8287b0f823d0a7d8630784c31e9ee5edea20f406287" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/33/8a/8e42d4838cd89b7dde187011e97fe6c3af66d8c044997d2183fbd6d31352/markupsafe-3.0.3-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:eaa9599de571d72e2daf60164784109f19978b327a3910d3e9de8c97b5b70cfe" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b5/64/7660f8a4a8e53c924d0fa05dc3a55c9cee10bbd82b11c5afb27d44b096ce/markupsafe-3.0.3-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:c47a551199eb8eb2121d4f0f15ae0f923d31350ab9280078d1e5f12b249e0026" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/da/ef/e648bfd021127bef5fa12e1720ffed0c6cbb8310c8d9bea7266337ff06de/markupsafe-3.0.3-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f34c41761022dd093b4b6896d4810782ffbabe30f2d443ff5f083e0cbbb8c737" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/41/3c/a36c2450754618e62008bf7435ccb0f88053e07592e6028a34776213d877/markupsafe-3.0.3-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:457a69a9577064c05a97c41f4e65148652db078a3a509039e64d3467b9e7ef97" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/bc/20/b7fdf89a8456b099837cd1dc21974632a02a999ec9bf7ca3e490aacd98e7/markupsafe-3.0.3-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:e8afc3f2ccfa24215f8cb28dcf43f0113ac3c37c2f0f0806d8c70e4228c5cf4d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/9a/a7/591f592afdc734f47db08a75793a55d7fbcc6902a723ae4cfbab61010cc5/markupsafe-3.0.3-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:ec15a59cf5af7be74194f7ab02d0f59a62bdcf1a537677ce67a2537c9b87fcda" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/7d/33/45b24e4f44195b26521bc6f1a82197118f74df348556594bd2262bda1038/markupsafe-3.0.3-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:0eb9ff8191e8498cca014656ae6b8d61f39da5f95b488805da4bb029cccbfbaf" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ff/0e/53dfaca23a69fbfbbf17a4b64072090e70717344c52eaaaa9c5ddff1e5f0/markupsafe-3.0.3-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:2713baf880df847f2bece4230d4d094280f4e67b1e813eec43b4c0e144a34ffe" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/46/11/f333a06fc16236d5238bfe74daccbca41459dcd8d1fa952e8fbd5dccfb70/markupsafe-3.0.3-cp314-cp314-win32.whl", hash = "sha256:729586769a26dbceff69f7a7dbbf59ab6572b99d94576a5592625d5b411576b9" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/28/52/182836104b33b444e400b14f797212f720cbc9ed6ba34c800639d154e821/markupsafe-3.0.3-cp314-cp314-win_amd64.whl", hash = "sha256:bdc919ead48f234740ad807933cdf545180bfbe9342c2bb451556db2ed958581" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/6f/18/acf23e91bd94fd7b3031558b1f013adfa21a8e407a3fdb32745538730382/markupsafe-3.0.3-cp314-cp314-win_arm64.whl", hash = "sha256:5a7d5dc5140555cf21a6fefbdbf8723f06fcd2f63ef108f2854de715e4422cb4" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/3c/f0/57689aa4076e1b43b15fdfa646b04653969d50cf30c32a102762be2485da/markupsafe-3.0.3-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:1353ef0c1b138e1907ae78e2f6c63ff67501122006b0f9abad68fda5f4ffc6ab" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/89/c3/2e67a7ca217c6912985ec766c6393b636fb0c2344443ff9d91404dc4c79f/markupsafe-3.0.3-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:1085e7fbddd3be5f89cc898938f42c0b3c711fdcb37d75221de2666af647c175" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f0/00/be561dce4e6ca66b15276e184ce4b8aec61fe83662cce2f7d72bd3249d28/markupsafe-3.0.3-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:1b52b4fb9df4eb9ae465f8d0c228a00624de2334f216f178a995ccdcf82c4634" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/50/09/c419f6f5a92e5fadde27efd190eca90f05e1261b10dbd8cbcb39cd8ea1dc/markupsafe-3.0.3-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:fed51ac40f757d41b7c48425901843666a6677e3e8eb0abcff09e4ba6e664f50" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/22/44/a0681611106e0b2921b3033fc19bc53323e0b50bc70cffdd19f7d679bb66/markupsafe-3.0.3-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:f190daf01f13c72eac4efd5c430a8de82489d9cff23c364c3ea822545032993e" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/5f/57/1b0b3f100259dc9fffe780cfb60d4be71375510e435efec3d116b6436d43/markupsafe-3.0.3-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:e56b7d45a839a697b5eb268c82a71bd8c7f6c94d6fd50c3d577fa39a9f1409f5" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/26/6a/4bf6d0c97c4920f1597cc14dd720705eca0bf7c787aebc6bb4d1bead5388/markupsafe-3.0.3-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:f3e98bb3798ead92273dc0e5fd0f31ade220f59a266ffd8a4f6065e0a3ce0523" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/14/c7/ca723101509b518797fedc2fdf79ba57f886b4aca8a7d31857ba3ee8281f/markupsafe-3.0.3-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:5678211cb9333a6468fb8d8be0305520aa073f50d17f089b5b4b477ea6e67fdc" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/fb/df/5bd7a48c256faecd1d36edc13133e51397e41b73bb77e1a69deab746ebac/markupsafe-3.0.3-cp314-cp314t-win32.whl", hash = "sha256:915c04ba3851909ce68ccc2b8e2cd691618c4dc4c4232fb7982bca3f41fd8c3d" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/1a/8a/0402ba61a2f16038b48b39bccca271134be00c5c9f0f623208399333c448/markupsafe-3.0.3-cp314-cp314t-win_amd64.whl", hash = "sha256:4faffd047e07c38848ce017e8725090413cd80cbc23d86e55c587bf979e579c9" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/70/bc/6f1c2f612465f5fa89b95bead1f44dcb607670fd42891d8fdcd5d039f4f4/markupsafe-3.0.3-cp314-cp314t-win_arm64.whl", hash = "sha256:32001d6a8fc98c8cb5c947787c5d08b0a50663d139f1305bac5885d98d9b40fa" },
+]
+
 [[package]]
 name = "math-verify"
 version = "0.8.0"
@@ -2653,6 +2948,25 @@ wheels = [
     { url = "https://mirrors.aliyun.com/pypi/packages/be/9c/92789c596b8df838baa98fa71844d84283302f7604ed565dafe5a6b5041a/oauthlib-3.3.1-py3-none-any.whl", hash = "sha256:88119c938d2b8fb88561af5f6ee0eec8cc8d552b7bb1f712743136eb7523b7a1" },
 ]
 
+[[package]]
+name = "openai"
+version = "2.36.0"
+source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
+dependencies = [
+    { name = "anyio" },
+    { name = "distro" },
+    { name = "httpx" },
+    { name = "jiter" },
+    { name = "pydantic" },
+    { name = "sniffio" },
+    { name = "tqdm" },
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://mirrors.aliyun.com/pypi/packages/f4/a1/4d5e84cf51720fc1526cc49e10ac1961abcccb55b0efb3d970db1e9a2728/openai-2.36.0.tar.gz", hash = "sha256:139dea0edd2f1b30c33d46ae1a6929e03906254140318e4608e98fe8c566f2e7" }
+wheels = [
+    { url = "https://mirrors.aliyun.com/pypi/packages/9d/1c/5d43735b2553baae2a5e899dcbcd0670a86930d993184d72ca909bf11c9b/openai-2.36.0-py3-none-any.whl", hash = "sha256:143f6194b548dbc2c921af1f1b03b9f14c85fed8a75b5b516f5bcc11a2a50c63" },
+]
+
 [[package]]
 name = "opencensus"
 version = "0.11.4"
@@ -4118,6 +4432,7 @@ builder = [
 model-service = [
     { name = "alibabacloud-cr20181201" },
     { name = "fastapi" },
+    { name = "litellm" },
     { name = "psutil" },
     { name = "swebench" },
     { name = "uvicorn" },
@@ -4181,6 +4496,7 @@ requires-dist = [
     { name = "gem-llm", marker = "extra == 'sandbox-actor'", specifier = ">=0.1.0" },
     { name = "httpx" },
     { name = "kubernetes", marker = "extra == 'admin'", specifier = ">=35.0.0" },
+    { name = "litellm", marker = "extra == 'model-service'", specifier = ">=1.50.0" },
     { name = "nacos-sdk-python", marker = "extra == 'admin'", specifier = ">=0.1.14" },
     { name = "nacos-sdk-python", marker = "extra == 'sandbox-actor'", specifier = ">=0.1.14" },
     { name = "numpy", marker = "extra == 'rocklet'", specifier = "<=2.2.6" },
@@ -4633,6 +4949,94 @@ wheels = [
     { url = "https://mirrors.aliyun.com/pypi/packages/e5/30/643397144bfbfec6f6ef821f36f33e57d35946c44a2352d3c9f0ae847619/tenacity-9.1.2-py3-none-any.whl", hash = "sha256:f77bf36710d8b73a50b2dd155c97b870017ad21afe6ab300326b0371b3b05138" },
 ]
 
+[[package]]
+name = "tiktoken"
+version = "0.12.0"
+source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
+dependencies = [
+    { name = "regex" },
+    { name = "requests" },
+]
+sdist = { url = "https://mirrors.aliyun.com/pypi/packages/7d/ab/4d017d0f76ec3171d469d80fc03dfbb4e48a4bcaddaa831b31d526f05edc/tiktoken-0.12.0.tar.gz", hash = "sha256:b18ba7ee2b093863978fcb14f74b3707cdc8d4d4d3836853ce7ec60772139931" }
+wheels = [
+    { url = "https://mirrors.aliyun.com/pypi/packages/89/b3/2cb7c17b6c4cf8ca983204255d3f1d95eda7213e247e6947a0ee2c747a2c/tiktoken-0.12.0-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:3de02f5a491cfd179aec916eddb70331814bd6bf764075d39e21d5862e533970" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/27/0f/df139f1df5f6167194ee5ab24634582ba9a1b62c6b996472b0277ec80f66/tiktoken-0.12.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:b6cfb6d9b7b54d20af21a912bfe63a2727d9cfa8fbda642fd8322c70340aad16" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ef/5d/26a691f28ab220d5edc09b9b787399b130f24327ef824de15e5d85ef21aa/tiktoken-0.12.0-cp310-cp310-manylinux_2_28_aarch64.whl", hash = "sha256:cde24cdb1b8a08368f709124f15b36ab5524aac5fa830cc3fdce9c03d4fb8030" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b2/94/443fab3d4e5ebecac895712abd3849b8da93b7b7dec61c7db5c9c7ebe40c/tiktoken-0.12.0-cp310-cp310-manylinux_2_28_x86_64.whl", hash = "sha256:6de0da39f605992649b9cfa6f84071e3f9ef2cec458d08c5feb1b6f0ff62e134" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/54/35/388f941251b2521c70dd4c5958e598ea6d2c88e28445d2fb8189eecc1dfc/tiktoken-0.12.0-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:6faa0534e0eefbcafaccb75927a4a380463a2eaa7e26000f0173b920e98b720a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f8/00/c6681c7f833dd410576183715a530437a9873fa910265817081f65f9105f/tiktoken-0.12.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:82991e04fc860afb933efb63957affc7ad54f83e2216fe7d319007dab1ba5892" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/5f/d2/82e795a6a9bafa034bf26a58e68fe9a89eeaaa610d51dbeb22106ba04f0a/tiktoken-0.12.0-cp310-cp310-win_amd64.whl", hash = "sha256:6fb2995b487c2e31acf0a9e17647e3b242235a20832642bb7a9d1a181c0c1bb1" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/de/46/21ea696b21f1d6d1efec8639c204bdf20fde8bafb351e1355c72c5d7de52/tiktoken-0.12.0-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:6e227c7f96925003487c33b1b32265fad2fbcec2b7cf4817afb76d416f40f6bb" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/c9/d9/35c5d2d9e22bb2a5f74ba48266fb56c63d76ae6f66e02feb628671c0283e/tiktoken-0.12.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:c06cf0fcc24c2cb2adb5e185c7082a82cba29c17575e828518c2f11a01f445aa" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/01/84/961106c37b8e49b9fdcf33fe007bb3a8fdcc380c528b20cc7fbba80578b8/tiktoken-0.12.0-cp311-cp311-manylinux_2_28_aarch64.whl", hash = "sha256:f18f249b041851954217e9fd8e5c00b024ab2315ffda5ed77665a05fa91f42dc" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/6a/d0/3d9275198e067f8b65076a68894bb52fd253875f3644f0a321a720277b8a/tiktoken-0.12.0-cp311-cp311-manylinux_2_28_x86_64.whl", hash = "sha256:47a5bc270b8c3db00bb46ece01ef34ad050e364b51d406b6f9730b64ac28eded" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/78/db/a58e09687c1698a7c592e1038e01c206569b86a0377828d51635561f8ebf/tiktoken-0.12.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:508fa71810c0efdcd1b898fda574889ee62852989f7c1667414736bcb2b9a4bd" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/9e/1b/a9e4d2bf91d515c0f74afc526fd773a812232dd6cda33ebea7f531202325/tiktoken-0.12.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:a1af81a6c44f008cba48494089dd98cccb8b313f55e961a52f5b222d1e507967" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/9d/15/963819345f1b1fb0809070a79e9dd96938d4ca41297367d471733e79c76c/tiktoken-0.12.0-cp311-cp311-win_amd64.whl", hash = "sha256:3e68e3e593637b53e56f7237be560f7a394451cb8c11079755e80ae64b9e6def" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/a4/85/be65d39d6b647c79800fd9d29241d081d4eeb06271f383bb87200d74cf76/tiktoken-0.12.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:b97f74aca0d78a1ff21b8cd9e9925714c15a9236d6ceacf5c7327c117e6e21e8" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/4a/42/6573e9129bc55c9bf7300b3a35bef2c6b9117018acca0dc760ac2d93dffe/tiktoken-0.12.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:2b90f5ad190a4bb7c3eb30c5fa32e1e182ca1ca79f05e49b448438c3e225a49b" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/66/c5/ed88504d2f4a5fd6856990b230b56d85a777feab84e6129af0822f5d0f70/tiktoken-0.12.0-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:65b26c7a780e2139e73acc193e5c63ac754021f160df919add909c1492c0fb37" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f4/90/3dae6cc5436137ebd38944d396b5849e167896fc2073da643a49f372dc4f/tiktoken-0.12.0-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:edde1ec917dfd21c1f2f8046b86348b0f54a2c0547f68149d8600859598769ad" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/a3/fe/26df24ce53ffde419a42f5f53d755b995c9318908288c17ec3f3448313a3/tiktoken-0.12.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:35a2f8ddd3824608b3d650a000c1ef71f730d0c56486845705a8248da00f9fe5" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/20/cc/b064cae1a0e9fac84b0d2c46b89f4e57051a5f41324e385d10225a984c24/tiktoken-0.12.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:83d16643edb7fa2c99eff2ab7733508aae1eebb03d5dfc46f5565862810f24e3" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/81/10/b8523105c590c5b8349f2587e2fdfe51a69544bd5a76295fc20f2374f470/tiktoken-0.12.0-cp312-cp312-win_amd64.whl", hash = "sha256:ffc5288f34a8bc02e1ea7047b8d041104791d2ddbf42d1e5fa07822cbffe16bd" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/00/61/441588ee21e6b5cdf59d6870f86beb9789e532ee9718c251b391b70c68d6/tiktoken-0.12.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:775c2c55de2310cc1bc9a3ad8826761cbdc87770e586fd7b6da7d4589e13dab3" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/1f/05/dcf94486d5c5c8d34496abe271ac76c5b785507c8eae71b3708f1ad9b45a/tiktoken-0.12.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:a01b12f69052fbe4b080a2cfb867c4de12c704b56178edf1d1d7b273561db160" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/a0/70/5163fe5359b943f8db9946b62f19be2305de8c3d78a16f629d4165e2f40e/tiktoken-0.12.0-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:01d99484dc93b129cd0964f9d34eee953f2737301f18b3c7257bf368d7615baa" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/0c/da/c028aa0babf77315e1cef357d4d768800c5f8a6de04d0eac0f377cb619fa/tiktoken-0.12.0-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:4a1a4fcd021f022bfc81904a911d3df0f6543b9e7627b51411da75ff2fe7a1be" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/a0/5a/886b108b766aa53e295f7216b509be95eb7d60b166049ce2c58416b25f2a/tiktoken-0.12.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:981a81e39812d57031efdc9ec59fa32b2a5a5524d20d4776574c4b4bd2e9014a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f4/f8/4db272048397636ac7a078d22773dd2795b1becee7bc4922fe6207288d57/tiktoken-0.12.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:9baf52f84a3f42eef3ff4e754a0db79a13a27921b457ca9832cf944c6be4f8f3" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/8e/32/45d02e2e0ea2be3a9ed22afc47d93741247e75018aac967b713b2941f8ea/tiktoken-0.12.0-cp313-cp313-win_amd64.whl", hash = "sha256:b8a0cd0c789a61f31bf44851defbd609e8dd1e2c8589c614cc1060940ef1f697" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ce/76/994fc868f88e016e6d05b0da5ac24582a14c47893f4474c3e9744283f1d5/tiktoken-0.12.0-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:d5f89ea5680066b68bcb797ae85219c72916c922ef0fcdd3480c7d2315ffff16" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f6/b8/57ef1456504c43a849821920d582a738a461b76a047f352f18c0b26c6516/tiktoken-0.12.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:b4e7ed1c6a7a8a60a3230965bdedba8cc58f68926b835e519341413370e0399a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/72/90/13da56f664286ffbae9dbcfadcc625439142675845baa62715e49b87b68b/tiktoken-0.12.0-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:fc530a28591a2d74bce821d10b418b26a094bf33839e69042a6e86ddb7a7fb27" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/05/df/4f80030d44682235bdaecd7346c90f67ae87ec8f3df4a3442cb53834f7e4/tiktoken-0.12.0-cp313-cp313t-manylinux_2_28_x86_64.whl", hash = "sha256:06a9f4f49884139013b138920a4c393aa6556b2f8f536345f11819389c703ebb" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/22/1f/ae535223a8c4ef4c0c1192e3f9b82da660be9eb66b9279e95c99288e9dab/tiktoken-0.12.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:04f0e6a985d95913cabc96a741c5ffec525a2c72e9df086ff17ebe35985c800e" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/78/a7/f8ead382fce0243cb625c4f266e66c27f65ae65ee9e77f59ea1653b6d730/tiktoken-0.12.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:0ee8f9ae00c41770b5f9b0bb1235474768884ae157de3beb5439ca0fd70f3e25" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/93/e0/6cc82a562bc6365785a3ff0af27a2a092d57c47d7a81d9e2295d8c36f011/tiktoken-0.12.0-cp313-cp313t-win_amd64.whl", hash = "sha256:dc2dd125a62cb2b3d858484d6c614d136b5b848976794edfb63688d539b8b93f" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/72/05/3abc1db5d2c9aadc4d2c76fa5640134e475e58d9fbb82b5c535dc0de9b01/tiktoken-0.12.0-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:a90388128df3b3abeb2bfd1895b0681412a8d7dc644142519e6f0a97c2111646" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e3/7b/50c2f060412202d6c95f32b20755c7a6273543b125c0985d6fa9465105af/tiktoken-0.12.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:da900aa0ad52247d8794e307d6446bd3cdea8e192769b56276695d34d2c9aa88" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/14/27/bf795595a2b897e271771cd31cb847d479073497344c637966bdf2853da1/tiktoken-0.12.0-cp314-cp314-manylinux_2_28_aarch64.whl", hash = "sha256:285ba9d73ea0d6171e7f9407039a290ca77efcdb026be7769dccc01d2c8d7fff" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/f5/de/9341a6d7a8f1b448573bbf3425fa57669ac58258a667eb48a25dfe916d70/tiktoken-0.12.0-cp314-cp314-manylinux_2_28_x86_64.whl", hash = "sha256:d186a5c60c6a0213f04a7a802264083dea1bbde92a2d4c7069e1a56630aef830" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/75/0d/881866647b8d1be4d67cb24e50d0c26f9f807f994aa1510cb9ba2fe5f612/tiktoken-0.12.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:604831189bd05480f2b885ecd2d1986dc7686f609de48208ebbbddeea071fc0b" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b3/1e/b651ec3059474dab649b8d5b69f5c65cd8fcd8918568c1935bd4136c9392/tiktoken-0.12.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:8f317e8530bb3a222547b85a58583238c8f74fd7a7408305f9f63246d1a0958b" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/80/57/ce64fd16ac390fafde001268c364d559447ba09b509181b2808622420eec/tiktoken-0.12.0-cp314-cp314-win_amd64.whl", hash = "sha256:399c3dd672a6406719d84442299a490420b458c44d3ae65516302a99675888f3" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ac/a4/72eed53e8976a099539cdd5eb36f241987212c29629d0a52c305173e0a68/tiktoken-0.12.0-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:c2c714c72bc00a38ca969dae79e8266ddec999c7ceccd603cc4f0d04ccd76365" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e6/d7/0110b8f54c008466b19672c615f2168896b83706a6611ba6e47313dbc6e9/tiktoken-0.12.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:cbb9a3ba275165a2cb0f9a83f5d7025afe6b9d0ab01a22b50f0e74fee2ad253e" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/5f/77/4f268c41a3957c418b084dd576ea2fad2e95da0d8e1ab705372892c2ca22/tiktoken-0.12.0-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:dfdfaa5ffff8993a3af94d1125870b1d27aed7cb97aa7eb8c1cefdbc87dbee63" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/4e/2b/fc46c90fe5028bd094cd6ee25a7db321cb91d45dc87531e2bdbb26b4867a/tiktoken-0.12.0-cp314-cp314t-manylinux_2_28_x86_64.whl", hash = "sha256:584c3ad3d0c74f5269906eb8a659c8bfc6144a52895d9261cdaf90a0ae5f4de0" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/28/c0/3c7a39ff68022ddfd7d93f3337ad90389a342f761c4d71de99a3ccc57857/tiktoken-0.12.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:54c891b416a0e36b8e2045b12b33dd66fb34a4fe7965565f1b482da50da3e86a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/ab/0d/c1ad6f4016a3968c048545f5d9b8ffebf577774b2ede3e2e352553b685fe/tiktoken-0.12.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:5edb8743b88d5be814b1a8a8854494719080c28faaa1ccbef02e87354fe71ef0" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/af/df/c7891ef9d2712ad774777271d39fdef63941ffba0a9d59b7ad1fd2765e57/tiktoken-0.12.0-cp314-cp314t-win_amd64.whl", hash = "sha256:f61c0aea5565ac82e2ec50a05e02a6c44734e91b51c10510b084ea1b8e633a71" },
+]
+
+[[package]]
+name = "tokenizers"
+version = "0.23.1"
+source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
+dependencies = [
+    { name = "huggingface-hub" },
+]
+sdist = { url = "https://mirrors.aliyun.com/pypi/packages/c1/60/21f715d9faba5f5407ff759472ade058ec4a507ad62bcea47cb847239a73/tokenizers-0.23.1.tar.gz", hash = "sha256:1feeeadf865a7915adc25445dea30e9933e593c31bb96c277cee36de227c8bfa" }
+wheels = [
+    { url = "https://mirrors.aliyun.com/pypi/packages/87/39/b87a87d5bb9470610b80a2d31df42fcffeaf35118b8b97952b2aff598cc7/tokenizers-0.23.1-cp310-abi3-macosx_10_12_x86_64.whl", hash = "sha256:e03d6ffcbe0d56ee9c1ccd070e70a13fa750727c0277e138152acbc0252c2224" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/e2/6a/068ed9f6e444c9d7e9d55ce134181325700f3d7f30410721bdc8f848d727/tokenizers-0.23.1-cp310-abi3-macosx_11_0_arm64.whl", hash = "sha256:e0948bbb1ac1d7cdfc9fb6d62c596e3b7550036ad60ecd654a66ad273326324e" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/6c/36/e006edf031154cba92b8416057d92c3abe3635e4c4b0aa0b5b9bb39dde70/tokenizers-0.23.1-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1bf13402aff9bc533c89cb849ec3b412dc3fbeacc9744840e423d7bf3f7dc0e3" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/a2/ef/7735d226f9c7f874a6bee5e3f27fb25ecabdf207d37b8cf45286d0795893/tokenizers-0.23.1-cp310-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:f836ca703b89ae07919a309f9651f7a88fd5a33d5f718ba5ad0870ec0256bad6" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/b9/d9/24827036f6e21297bfffda0768e58eb6096a4f411e932964a01707857931/tokenizers-0.23.1-cp310-abi3-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ae848657742035523fdf261773630cb819a26995fcd3d9ecae0c1daf6e5a4959" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/0c/9a/22f3582b3a4f49358293a5206e25317621ee4526bfe9cdaa0f07a12e770e/tokenizers-0.23.1-cp310-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:53b09e85775d5187941e7bab30e941b4134ab4a7dd8c68e783d231fb7ca27c51" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/7e/65/b8f8814eef95800f20721384136d9a1d22241d50b2874357cb70542c392f/tokenizers-0.23.1-cp310-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:ea5a0ce170074329faaa8ea3f6400ecde604b6678192688533af80980daae71a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/0d/d5/1353e5f677ec27c2494fb6a6725e82d56c985f53e90ec511369e7e4f02c6/tokenizers-0.23.1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5075b405006415ea148a992d093699c66eb01952bf59f4d5727089a98bda45a4" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/71/89/39b6b8fc073fb6d413d0147aa333dc7eff7be65639ac9d19930a0b21bf33/tokenizers-0.23.1-cp310-abi3-manylinux_2_31_riscv64.whl", hash = "sha256:56f3a77de629917652f876294dc9fe6bad4a0c43bc229dc72e59bb23a0f4729a" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/0f/80/127c854da64827e5b79264ce524993a90dddcb320e5cd42412c5c02f9e8a/tokenizers-0.23.1-cp310-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:9d10a6d957ef01896dc274e890eee27d41bd0e74ef31e60616f0fc311345184e" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/fe/ba/44c2502feb1a058f096ddfb4e0996ef3225a01a388e1a9b094e91689fe93/tokenizers-0.23.1-cp310-abi3-musllinux_1_2_armv7l.whl", hash = "sha256:1974288a609c343774f1b897c8b482c791ab17b75ab5c8c2b1737565c1d82288" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/9e/c1/464019a9fb059870bfe4eebb4ba12208f3042035e258bf5e782906bd3847/tokenizers-0.23.1-cp310-abi3-musllinux_1_2_i686.whl", hash = "sha256:120468fb4c24faf0543c835a4fabafa4deb3f20a035c9b6e83d0b553a97615d4" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/79/94/3ac1432bda31626071e9b6a12709b97ae05131c804b94c8f3ac622c5da32/tokenizers-0.23.1-cp310-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:e3d8f40ea6268047de7046906326abed5134f27d4e8447b23763afe5808c8a96" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/6a/dd/631b21433c771b1382535326f0eca80b9c9cee2e64961dd993bc9ac4669e/tokenizers-0.23.1-cp310-abi3-win32.whl", hash = "sha256:93120a930b919416da7cd10a2f606ac9919cc69cacae7980fa2140e277660948" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/97/c9/2553f72aaf65a2797d4229e37fa7fbe38ffbf3e32912d31bdd78b3323e59/tokenizers-0.23.1-cp310-abi3-win_amd64.whl", hash = "sha256:e7bfaf995c1bdbbd21d13539decb6650967013759318627d85daeb7881af16b7" },
+    { url = "https://mirrors.aliyun.com/pypi/packages/cd/2b/2be299bab55fc595e3d38567edb1a87f86e594842968fa9515a07bdcf422/tokenizers-0.23.1-cp310-abi3-win_arm64.whl", hash = "sha256:a26197957d8e4425dfba746315f3c425ea00cfa8367c5fbc4ec73447893dcea9" },
+]
+
 [[package]]
 name = "toml"
 version = "0.10.2"

From a8fb54c66b7aa2e9376f4c593ca4bce75cb36be1 Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 03:18:28 +0000
Subject: [PATCH 02/25] fix(model-service): pass api_key + use custom_openai
 prefix + suppress cost calc

Three fixes to proxy.py uncovered while testing against DashScope (glm-5):

- Extract Bearer token from incoming Authorization header and pass as litellm
  api_key kwarg; setting it via extra_headers does not work because litellm
  always regenerates Authorization from api_key. Authorization is now stripped
  from the forwarded header set.
- Switch upstream prefix from openai/<model> to custom_openai/<model>. This is
  litellm's standard pattern for OpenAI-compatible third-party endpoints
  (DashScope, ModelScope, Groq, Mistral, ...) and avoids "model isn't mapped"
  on arbitrary upstream model names.
- Pass input_cost_per_token=0 / output_cost_per_token=0 so litellm's cost
  calculator does not raise "model isn't mapped" on unknown models and pollute
  StandardLoggingPayload.response_cost_failure_debug_information.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 rock/sdk/model/server/api/proxy.py | 42 ++++++++++++++++++++++++++----
 tests/unit/sdk/model/test_proxy.py | 12 ++++++---
 2 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/rock/sdk/model/server/api/proxy.py b/rock/sdk/model/server/api/proxy.py
index 4894430641..18d0ead93c 100644
--- a/rock/sdk/model/server/api/proxy.py
+++ b/rock/sdk/model/server/api/proxy.py
@@ -35,7 +35,26 @@
 #     so the client's values would be wrong or misleading
 #   - transfer-encoding / connection: true RFC 7230 hop-by-hop headers, scoped to
 #     the client↔proxy connection only
-_HEADERS_NOT_TO_FORWARD = frozenset({"host", "content-length", "content-type", "transfer-encoding", "connection"})
+_HEADERS_NOT_TO_FORWARD = frozenset(
+    {"host", "content-length", "content-type", "transfer-encoding", "connection", "authorization"}
+)
+
+
+def _extract_bearer_token(headers) -> str | None:
+    """Pull the Bearer token out of the Authorization header.
+
+    litellm's OpenAI client needs the API key as an explicit ``api_key=`` kwarg —
+    setting Authorization in extra_headers does not work because litellm always
+    regenerates that header from ``api_key`` (or env vars). So we extract it here
+    and let the proxy stay stateless about which key the client is using.
+    """
+    auth = headers.get("authorization") or headers.get("Authorization")
+    if not auth:
+        return None
+    parts = auth.split(None, 1)
+    if len(parts) == 2 and parts[0].lower() == "bearer":
+        return parts[1].strip()
+    return auth.strip()
 
 
 def get_base_url(model_name: str, config: ModelServiceConfig) -> str:
@@ -122,14 +141,20 @@ async def chat_completions(body: dict[str, Any], request: Request):
         logger.info(f"[replay] dispatching '{model_name}' to traj-replay handler")
     else:
         api_base = get_base_url(model_name, config)
-        # Tell litellm to treat the upstream as an OpenAI-compatible server.
-        litellm_model = f"openai/{model_name}" if model_name else "openai/default"
+        # custom_openai is litellm's catch-all for OpenAI-compatible third-party endpoints
+        # (DashScope, ModelScope, Groq, Mistral, ...). Unlike `openai/`, it does NOT do
+        # model-name lookup, so arbitrary upstream model names like "glm-5" / "qwen-turbo"
+        # work without "This model isn't mapped yet" errors.
+        litellm_model = f"custom_openai/{model_name}" if model_name else "custom_openai/default"
         logger.info(f"Routing model '{model_name}' to {api_base}")
 
-    # 2. Header forwarding (preserve Authorization, drop hop-by-hop)
+    # 2. Extract Bearer token (litellm needs api_key explicitly, not via headers)
+    api_key = _extract_bearer_token(request.headers)
+
+    # 3. Header forwarding (drop Authorization since we pass it via api_key, plus hop-by-hop)
     extra_headers = _filter_headers(request.headers)
 
-    # 3. Build call kwargs (transparent passthrough of body fields)
+    # 4. Build call kwargs (transparent passthrough of body fields)
     call_kwargs = dict(body)
     call_kwargs.pop("model", None)  # avoid duplicate kwargs
     is_stream = bool(call_kwargs.get("stream"))
@@ -138,9 +163,16 @@ async def chat_completions(body: dict[str, Any], request: Request):
         response = await litellm.acompletion(
             model=litellm_model,
             api_base=api_base,
+            api_key=api_key,
             extra_headers=extra_headers,
             timeout=config.request_timeout,
             num_retries=config.num_retries,
+            # Suppress litellm's "model isn't mapped yet" cost-calc exception for
+            # arbitrary upstream models (glm-5, qwen-turbo, ...) that aren't in
+            # litellm's pricing table. We don't care about cost tracking here, so
+            # zero rates make the calc succeed cleanly with response_cost=0.
+            input_cost_per_token=0,
+            output_cost_per_token=0,
             **call_kwargs,
         )
     except (RateLimitError, APIError, BadRequestError, AuthenticationError, Timeout) as exc:
diff --git a/tests/unit/sdk/model/test_proxy.py b/tests/unit/sdk/model/test_proxy.py
index c994c1c9d6..c1b141f930 100644
--- a/tests/unit/sdk/model/test_proxy.py
+++ b/tests/unit/sdk/model/test_proxy.py
@@ -58,7 +58,7 @@ async def test_chat_completions_routing_success():
         assert mock_acompletion.called
         call_kwargs = mock_acompletion.call_args.kwargs
         assert call_kwargs["api_base"] == "https://api.openai.com/v1"
-        assert call_kwargs["model"] == "openai/gpt-3.5-turbo"
+        assert call_kwargs["model"] == "custom_openai/gpt-3.5-turbo"
         assert call_kwargs["messages"] == [{"role": "user", "content": "hello"}]
 
 
@@ -200,8 +200,10 @@ async def test_chat_completions_replay_mode_uses_traj_replay_provider():
 
 
 @pytest.mark.asyncio
-async def test_chat_completions_strips_hop_by_hop_headers():
-    """host / content-length / transfer-encoding etc. are not forwarded."""
+async def test_chat_completions_extracts_bearer_token_and_strips_framing_headers():
+    """Bearer token goes to api_key kwarg; host / content-length / transfer-encoding /
+    Authorization are not forwarded as extra_headers (litellm regenerates Authorization
+    from api_key, so passing it both ways would conflict). Custom X-* headers pass through."""
     captured = {}
 
     async def capture(*args, **kwargs):
@@ -218,10 +220,12 @@ async def capture(*args, **kwargs):
                 headers={"Authorization": "Bearer abc", "X-Trace": "t1"},
             )
 
+    assert captured["api_key"] == "abc"
+
     forwarded = captured["extra_headers"]
     forwarded_lower = {k.lower() for k in forwarded}
-    assert "authorization" in forwarded_lower
     assert "x-trace" in forwarded_lower
+    assert "authorization" not in forwarded_lower
     assert "host" not in forwarded_lower
     assert "content-length" not in forwarded_lower
     assert "content-type" not in forwarded_lower

From 11852aed2c5956343e3dd471face487e66cfa999 Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 03:36:52 +0000
Subject: [PATCH 03/25] refactor(model-service): drop CustomLLM, serve replay
 directly from cursor
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The replay path no longer goes through litellm. We have a complete OpenAI-shape
response on disk, so routing it through CustomLLM/CustomStreamWrapper just to
translate formats was pure overhead — and the source of every replay-side bug
(cursor-exhausted retried 6× and wrapped as APIConnectionError, GenericStreamingChunk
type gymnastics, finish_reason hardcoded to "stop", reasoning_content dropped on
streaming, tool_calls reconstruction left as TODO).

Changes:
- traj_replayer.py: delete TrajectoryReplayer(CustomLLM) and helpers; keep just
  SequentialCursor. Cursor exhaustion now raises a plain TrajectoryExhausted.
- proxy.py: in replay mode, fetch from app.state.replay_cursor and emit either
  the raw response dict (non-stream) or one SSE chunk + [DONE] (stream). The
  stream path renames message → delta and preserves all fields verbatim
  (finish_reason, tool_calls, reasoning_content, ...).
- main.py: rename _configure_litellm_for_proxy → _configure_proxy_integrations.
  Replay branch now just attaches a SequentialCursor to app.state; no
  litellm.custom_provider_map registration.
- Tests: drop the CustomLLM-based replayer tests; keep cursor tests; add three
  end-to-end proxy replay tests covering non-stream / stream / cursor exhausted.

43 passed. Direct curl against DashScope glm-5: record + replay (both modes)
verified end-to-end.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 rock/sdk/model/server/api/proxy.py            | 146 ++++++++++++------
 .../server/integrations/traj_replayer.py      |  91 ++---------
 rock/sdk/model/server/main.py                 |  43 +++---
 tests/unit/sdk/model/test_proxy.py            | 125 +++++++++++++--
 tests/unit/sdk/model/test_traj_replayer.py    | 108 ++-----------
 5 files changed, 270 insertions(+), 243 deletions(-)

diff --git a/rock/sdk/model/server/api/proxy.py b/rock/sdk/model/server/api/proxy.py
index 18d0ead93c..e7ded21cd2 100644
--- a/rock/sdk/model/server/api/proxy.py
+++ b/rock/sdk/model/server/api/proxy.py
@@ -1,18 +1,27 @@
-"""OpenAI-compatible chat/completions proxy backed by the litellm SDK.
-
-The proxy ``/v1/chat/completions`` handler routes a request to the configured
-upstream LLM (or to the in-process traj-replay handler when ``replay_traj_path``
-is set), forwards header/body, and applies retry via litellm's ``num_retries``.
-
-Trajectory recording is wired up at startup in
-``rock.sdk.model.server.main`` by registering ``TrajectoryRecorder`` as a
-``litellm.callbacks`` entry — this handler does not carry a ``@record_traj``
-decorator anymore.
+"""OpenAI-compatible chat/completions proxy.
+
+Two paths share this handler:
+
+1. **Record / forward mode** (default) — ``litellm.acompletion`` is called with
+   the user-supplied model/messages, the upstream is selected from
+   ``proxy_base_url`` / ``proxy_rules``, retries come from litellm's
+   ``num_retries``, and the recorded JSONL trajectory is written by a
+   ``litellm.callbacks`` entry registered at startup (see
+   ``rock.sdk.model.server.main``).
+
+2. **Replay mode** (``replay_traj_path`` set) — the request is served directly
+   from the next record in ``app.state.replay_cursor`` without going through
+   litellm at all. We have a complete OpenAI-shape response on disk, so there's
+   no value in routing through CustomLLM/CustomStreamWrapper just to translate
+   formats. Streaming emits the recorded response as a single SSE chunk +
+   ``[DONE]``, mirroring litellm's own ``MockResponseIterator`` strategy.
 """
 
 from __future__ import annotations
 
 import json
+import time
+import uuid
 from collections.abc import AsyncIterator
 from typing import Any
 
@@ -23,6 +32,7 @@
 
 from rock.logger import init_logger
 from rock.sdk.model.server.config import ModelServiceConfig
+from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor, TrajectoryExhausted
 
 logger = init_logger(__name__)
 
@@ -35,6 +45,7 @@
 #     so the client's values would be wrong or misleading
 #   - transfer-encoding / connection: true RFC 7230 hop-by-hop headers, scoped to
 #     the client↔proxy connection only
+#   - authorization: extracted into api_key kwarg, see _extract_bearer_token
 _HEADERS_NOT_TO_FORWARD = frozenset(
     {"host", "content-length", "content-type", "transfer-encoding", "connection", "authorization"}
 )
@@ -121,43 +132,97 @@ async def _sse_iter(stream: AsyncIterator[Any]) -> AsyncIterator[bytes]:
         yield b"data: [DONE]\n\n"
 
 
+def _completion_to_chunk(response: dict, *, model: str) -> dict:
+    """Convert a recorded ``chat.completion`` response into a single
+    ``chat.completion.chunk`` shape (move ``message`` → ``delta``).
+
+    Mirrors what litellm's ``convert_model_response_to_streaming`` does for its
+    own non-streaming providers — preserves ``finish_reason``, ``tool_calls``
+    and any other fields verbatim by simply renaming the wrapper key.
+    """
+    choices_in = response.get("choices") or []
+    choices_out = []
+    for choice in choices_in:
+        delta = dict(choice.get("message") or {})
+        choices_out.append(
+            {
+                "index": choice.get("index", 0),
+                "delta": delta,
+                "finish_reason": choice.get("finish_reason"),
+                "logprobs": choice.get("logprobs"),
+            }
+        )
+    return {
+        "id": response.get("id") or f"chatcmpl-{uuid.uuid4()}",
+        "object": "chat.completion.chunk",
+        "created": response.get("created") or int(time.time()),
+        "model": response.get("model") or model,
+        "choices": choices_out,
+    }
+
+
+async def _replay_sse_iter(response: dict, *, model: str) -> AsyncIterator[bytes]:
+    """Emit a recorded response as a single SSE chunk + ``[DONE]``.
+
+    The whole recorded answer goes out in one chunk — same strategy as
+    litellm's ``MockResponseIterator``. Most agents accumulate SSE into a
+    final string anyway; faking finer-grained streaming would just add code
+    without buying anyone anything.
+    """
+    chunk = _completion_to_chunk(response, model=model)
+    yield f"data: {json.dumps(chunk, ensure_ascii=False)}\n\n".encode()
+    yield b"data: [DONE]\n\n"
+
+
 @proxy_router.post("/v1/chat/completions")
 async def chat_completions(body: dict[str, Any], request: Request):
     """OpenAI-compatible chat completions proxy endpoint.
 
-    Routes via ``proxy_base_url`` / ``proxy_rules``, forwards Authorization-style
-    headers, supports streaming, retries via litellm. In replay mode the request
-    is dispatched to the registered ``traj-replay`` CustomLLM provider instead
-    of being forwarded upstream.
+    In replay mode (``replay_traj_path`` set), serves the next record from
+    ``app.state.replay_cursor`` directly — no litellm involvement. Otherwise
+    forwards to the configured upstream via ``litellm.acompletion``.
     """
     config: ModelServiceConfig = request.app.state.model_service_config
-
     model_name = body.get("model", "")
+    is_stream = bool(body.get("stream"))
 
-    # 1. Route selection
+    # ---- Replay mode: short-circuit, never touch litellm ----
     if config.replay_traj_path:
-        litellm_model = f"traj-replay/{model_name or 'replay'}"
-        api_base: str | None = None
-        logger.info(f"[replay] dispatching '{model_name}' to traj-replay handler")
-    else:
-        api_base = get_base_url(model_name, config)
-        # custom_openai is litellm's catch-all for OpenAI-compatible third-party endpoints
-        # (DashScope, ModelScope, Groq, Mistral, ...). Unlike `openai/`, it does NOT do
-        # model-name lookup, so arbitrary upstream model names like "glm-5" / "qwen-turbo"
-        # work without "This model isn't mapped yet" errors.
-        litellm_model = f"custom_openai/{model_name}" if model_name else "custom_openai/default"
-        logger.info(f"Routing model '{model_name}' to {api_base}")
-
-    # 2. Extract Bearer token (litellm needs api_key explicitly, not via headers)
-    api_key = _extract_bearer_token(request.headers)
+        cursor: SequentialCursor = request.app.state.replay_cursor
+        try:
+            record = await cursor.next(expected_model=model_name)
+        except TrajectoryExhausted as exc:
+            raise HTTPException(status_code=404, detail=str(exc))
+
+        response_dict = record.get("response")
+        if not isinstance(response_dict, dict):
+            raise HTTPException(
+                status_code=500,
+                detail=f"replay record at step {cursor.position - 1} has no usable response dict",
+            )
+        logger.info(f"[replay] step {cursor.position}/{cursor.total} served for model={model_name!r}")
+
+        if is_stream:
+            return StreamingResponse(
+                _replay_sse_iter(response_dict, model=model_name),
+                media_type="text/event-stream",
+            )
+        return JSONResponse(status_code=200, content=response_dict)
+
+    # ---- Forward / record mode: go through litellm ----
+    api_base = get_base_url(model_name, config)
+    # custom_openai is litellm's catch-all for OpenAI-compatible third-party endpoints
+    # (DashScope, ModelScope, Groq, Mistral, ...). Unlike `openai/`, it does NOT do
+    # model-name lookup, so arbitrary upstream model names like "glm-5" / "qwen-turbo"
+    # work without "This model isn't mapped yet" errors.
+    litellm_model = f"custom_openai/{model_name}" if model_name else "custom_openai/default"
+    logger.info(f"Routing model '{model_name}' to {api_base}")
 
-    # 3. Header forwarding (drop Authorization since we pass it via api_key, plus hop-by-hop)
+    api_key = _extract_bearer_token(request.headers)
     extra_headers = _filter_headers(request.headers)
 
-    # 4. Build call kwargs (transparent passthrough of body fields)
     call_kwargs = dict(body)
-    call_kwargs.pop("model", None)  # avoid duplicate kwargs
-    is_stream = bool(call_kwargs.get("stream"))
+    call_kwargs.pop("model", None)
 
     try:
         response = await litellm.acompletion(
@@ -167,10 +232,8 @@ async def chat_completions(body: dict[str, Any], request: Request):
             extra_headers=extra_headers,
             timeout=config.request_timeout,
             num_retries=config.num_retries,
-            # Suppress litellm's "model isn't mapped yet" cost-calc exception for
-            # arbitrary upstream models (glm-5, qwen-turbo, ...) that aren't in
-            # litellm's pricing table. We don't care about cost tracking here, so
-            # zero rates make the calc succeed cleanly with response_cost=0.
+            # Zero-cost rates suppress "model isn't mapped yet" from litellm's
+            # post-call cost calculator for arbitrary upstream model names.
             input_cost_per_token=0,
             output_cost_per_token=0,
             **call_kwargs,
@@ -182,13 +245,8 @@ async def chat_completions(body: dict[str, Any], request: Request):
         logger.error(f"Unexpected proxy error: {exc}", exc_info=True)
         raise HTTPException(status_code=500, detail=str(exc))
 
-    # 4. Streaming vs non-streaming response
     if is_stream:
         return StreamingResponse(_sse_iter(response), media_type="text/event-stream")
 
-    # litellm returns a ModelResponse pydantic; expose the OpenAI-shape dict.
-    if hasattr(response, "model_dump"):
-        body_out = response.model_dump()
-    else:
-        body_out = response  # already a dict (replay path can short-circuit)
+    body_out = response.model_dump() if hasattr(response, "model_dump") else response
     return JSONResponse(status_code=200, content=body_out)
diff --git a/rock/sdk/model/server/integrations/traj_replayer.py b/rock/sdk/model/server/integrations/traj_replayer.py
index c87c0fe75f..af2fdd6bb4 100644
--- a/rock/sdk/model/server/integrations/traj_replayer.py
+++ b/rock/sdk/model/server/integrations/traj_replayer.py
@@ -1,9 +1,10 @@
-"""Replay a recorded trajectory by registering a litellm CustomLLM provider.
+"""Sequential cursor over a recorded JSONL trajectory.
 
-Loads a single JSONL trajectory file on init, then hands records out one at a
-time in recorded order. This is the simplest matching strategy and works for
-deterministic agent runs that replay the same sequence of LLM calls
-(SWE-agent / mini-swe-agent / OpenHands).
+Loaded once at startup; ``await cursor.next(expected_model=...)`` hands out the
+next record (full StandardLoggingPayload dict) and advances. Going past the end
+raises :class:`TrajectoryExhausted` so the proxy can return a clean 404 without
+involving litellm — that's the whole point: replay does NOT need to go through
+litellm's CustomLLM machinery, the proxy serves recorded responses directly.
 """
 
 from __future__ import annotations
@@ -11,25 +12,24 @@
 import asyncio
 import json
 import os
-from collections.abc import AsyncIterator
 from pathlib import Path
-from typing import Any
-
-from litellm.llms.custom_llm import CustomLLM, CustomLLMError
-from litellm.types.utils import GenericStreamingChunk, ModelResponse
-from litellm.utils import async_mock_completion_streaming_obj
 
 from rock.logger import init_logger
 
 logger = init_logger(__name__)
 
 
-class SequentialCursor:
-    """Hands out trajectory records one at a time, in recorded order.
+class TrajectoryExhausted(Exception):
+    """Raised by ``SequentialCursor.next`` when all recorded steps have been served."""
+
+    def __init__(self, position: int, total: int) -> None:
+        super().__init__(f"trajectory exhausted at step {position} (total recorded steps={total})")
+        self.position = position
+        self.total = total
+
 
-    Going past the end raises CustomLLMError(404) so the proxy returns a clear
-    error to the caller.
-    """
+class SequentialCursor:
+    """Hands out trajectory records one at a time, in recorded order."""
 
     def __init__(self, records: list[dict]) -> None:
         self.records = records
@@ -56,10 +56,7 @@ def load(cls, path: str | os.PathLike) -> SequentialCursor:
     async def next(self, expected_model: str | None = None) -> dict:
         async with self._lock:
             if self._idx >= len(self.records):
-                raise CustomLLMError(
-                    status_code=404,
-                    message=(f"trajectory exhausted at step {self._idx} (total recorded steps={len(self.records)})"),
-                )
+                raise TrajectoryExhausted(position=self._idx, total=len(self.records))
             record = self.records[self._idx]
             self._idx += 1
             current_idx = self._idx - 1
@@ -83,57 +80,3 @@ def position(self) -> int:
     @property
     def total(self) -> int:
         return len(self.records)
-
-
-def _record_to_model_response(record: dict) -> ModelResponse:
-    response = record.get("response")
-    if not isinstance(response, dict):
-        raise CustomLLMError(
-            status_code=500,
-            message=f"traj record at step has no usable 'response' dict: got {type(response).__name__}",
-        )
-    return ModelResponse(**response)
-
-
-def _extract_assistant_text(record: dict) -> str:
-    response = record.get("response") or {}
-    choices = response.get("choices") or []
-    if not choices:
-        return ""
-    message = choices[0].get("message") or {}
-    return message.get("content") or ""
-
-
-class TrajectoryReplayer(CustomLLM):
-    """litellm CustomLLM that returns recorded responses in sequential order."""
-
-    def __init__(self, traj_path: str | os.PathLike) -> None:
-        super().__init__()
-        self.cursor = SequentialCursor.load(traj_path)
-
-    async def acompletion(
-        self,
-        model: str,
-        messages: list,
-        *args: Any,
-        **kwargs: Any,
-    ) -> ModelResponse:
-        record = await self.cursor.next(expected_model=model)
-        return _record_to_model_response(record)
-
-    async def astreaming(
-        self,
-        model: str,
-        messages: list,
-        *args: Any,
-        **kwargs: Any,
-    ) -> AsyncIterator[GenericStreamingChunk]:
-        record = await self.cursor.next(expected_model=model)
-        text = _extract_assistant_text(record)
-        model_response = kwargs.get("model_response")
-        async for chunk in async_mock_completion_streaming_obj(
-            model_response=model_response,
-            mock_response=text,
-            model=model,
-        ):
-            yield chunk
diff --git a/rock/sdk/model/server/main.py b/rock/sdk/model/server/main.py
index e2263a1858..133605b046 100644
--- a/rock/sdk/model/server/main.py
+++ b/rock/sdk/model/server/main.py
@@ -52,33 +52,34 @@ async def global_exception_handler(request, exc):
     return app
 
 
-def _configure_litellm_for_proxy(config: ModelServiceConfig) -> None:
-    """Wire up litellm record/replay integrations for the proxy mode.
-
-    - When ``replay_traj_path`` is set, register ``TrajectoryReplayer`` as a
-      custom provider so requests routed to ``traj-replay/<model>`` return
-      recorded responses without hitting any upstream.
-    - When recording is enabled (default), register ``TrajectoryRecorder`` as
-      a litellm callback so every chat/completions call appends a JSONL line.
+def _configure_proxy_integrations(app: FastAPI, config: ModelServiceConfig) -> None:
+    """Wire up record/replay integrations for the proxy mode.
+
+    - When ``replay_traj_path`` is set, load the trajectory into a
+      ``SequentialCursor`` and attach it to ``app.state.replay_cursor``. The
+      proxy handler serves recorded responses directly from this cursor; we
+      do NOT register anything with litellm (replay path bypasses litellm
+      entirely so cursor-exhausted errors aren't swallowed by retry logic).
+    - Otherwise (record/forward mode), if ``traj_enabled`` is True, register
+      ``TrajectoryRecorder`` as a ``litellm.callbacks`` entry so every
+      chat/completions call appends a JSONL line.
 
     Replay and record are mutually exclusive: in replay mode we don't record,
-    since replayed responses re-traversing the recorder would inflate metrics
-    and overwrite the source-of-truth file.
+    since replayed responses round-tripping back into the source file would
+    inflate metrics and corrupt the trajectory.
     """
-    import litellm
-
-    from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
-    from rock.sdk.model.server.integrations.traj_replayer import TrajectoryReplayer
-
     if config.replay_traj_path:
-        replayer = TrajectoryReplayer(config.replay_traj_path)
-        litellm.custom_provider_map = [
-            {"provider": "traj-replay", "custom_handler": replayer},
-        ]
-        logger.info(f"litellm replay handler registered, traj_path={config.replay_traj_path}")
+        from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
+
+        app.state.replay_cursor = SequentialCursor.load(config.replay_traj_path)
+        logger.info(f"replay cursor loaded, traj_path={config.replay_traj_path}")
         return
 
     if config.traj_enabled:
+        import litellm
+
+        from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
+
         traj_path = config.traj_file or TRAJ_FILE
         recorder = TrajectoryRecorder(traj_file=traj_path)
         litellm.callbacks.append(recorder)
@@ -96,7 +97,7 @@ def main(
         asyncio.run(init_local_api())
         app.include_router(local_router, prefix="", tags=["local"])
     else:
-        _configure_litellm_for_proxy(config)
+        _configure_proxy_integrations(app, config)
         app.include_router(proxy_router, prefix="", tags=["proxy"])
 
     logger.info(f"Starting LLM Service on {config.host}:{config.port}, type: {model_servie_type}")
diff --git a/tests/unit/sdk/model/test_proxy.py b/tests/unit/sdk/model/test_proxy.py
index c1b141f930..5a4d7e8230 100644
--- a/tests/unit/sdk/model/test_proxy.py
+++ b/tests/unit/sdk/model/test_proxy.py
@@ -1,3 +1,4 @@
+import json
 from types import SimpleNamespace
 from unittest.mock import AsyncMock, MagicMock, patch
 
@@ -176,27 +177,133 @@ async def test_chat_completions_litellm_error_returns_proxy_schema():
 
 
 @pytest.mark.asyncio
-async def test_chat_completions_replay_mode_uses_traj_replay_provider():
-    """In replay mode the proxy targets traj-replay/<model> instead of a real upstream."""
+async def test_replay_mode_returns_recorded_response_without_calling_litellm(tmp_path):
+    """In replay mode the proxy serves the next record directly from app.state.replay_cursor;
+    litellm.acompletion must never be invoked."""
+    from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
+
+    record = {
+        "model": "gpt-3.5-turbo",
+        "response": {
+            "id": "rec-1",
+            "object": "chat.completion",
+            "model": "gpt-3.5-turbo",
+            "choices": [
+                {
+                    "index": 0,
+                    "message": {"role": "assistant", "content": "recorded reply"},
+                    "finish_reason": "stop",
+                }
+            ],
+            "usage": {"prompt_tokens": 1, "completion_tokens": 2, "total_tokens": 3},
+        },
+    }
+    traj = tmp_path / "t.jsonl"
+    traj.write_text(json.dumps(record) + "\n", encoding="utf-8")
+
     config = ModelServiceConfig()
-    config.replay_traj_path = "/tmp/does-not-matter-for-this-test"
+    config.replay_traj_path = str(traj)
 
     local_app = FastAPI()
     local_app.state.model_service_config = config
+    local_app.state.replay_cursor = SequentialCursor.load(traj)
     local_app.include_router(proxy_router)
 
     with patch(ACOMPLETION_PATCH, new_callable=AsyncMock) as mock_acompletion:
-        mock_acompletion.return_value = _fake_model_response()
-
         transport = ASGITransport(app=local_app)
         async with AsyncClient(transport=transport, base_url="http://test") as ac:
             payload = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]}
             response = await ac.post("/v1/chat/completions", json=payload)
 
-        assert response.status_code == 200
-        call_kwargs = mock_acompletion.call_args.kwargs
-        assert call_kwargs["model"] == "traj-replay/gpt-3.5-turbo"
-        assert call_kwargs["api_base"] is None
+    assert response.status_code == 200
+    body = response.json()
+    assert body["choices"][0]["message"]["content"] == "recorded reply"
+    mock_acompletion.assert_not_called()
+
+
+@pytest.mark.asyncio
+async def test_replay_mode_streaming_emits_recorded_response_as_sse(tmp_path):
+    """Replay + stream=True emits one SSE chunk (content moved into delta) plus [DONE]."""
+    from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
+
+    record = {
+        "model": "gpt-3.5-turbo",
+        "response": {
+            "id": "rec-stream",
+            "object": "chat.completion",
+            "model": "gpt-3.5-turbo",
+            "choices": [
+                {
+                    "index": 0,
+                    "message": {"role": "assistant", "content": "streamed reply"},
+                    "finish_reason": "tool_calls",
+                }
+            ],
+        },
+    }
+    traj = tmp_path / "t.jsonl"
+    traj.write_text(json.dumps(record) + "\n", encoding="utf-8")
+
+    config = ModelServiceConfig()
+    config.replay_traj_path = str(traj)
+
+    local_app = FastAPI()
+    local_app.state.model_service_config = config
+    local_app.state.replay_cursor = SequentialCursor.load(traj)
+    local_app.include_router(proxy_router)
+
+    transport = ASGITransport(app=local_app)
+    async with AsyncClient(transport=transport, base_url="http://test") as ac:
+        response = await ac.post(
+            "/v1/chat/completions",
+            json={"model": "gpt-3.5-turbo", "stream": True, "messages": [{"role": "user", "content": "hi"}]},
+        )
+
+    assert response.status_code == 200
+    body = response.text
+    assert "data: [DONE]" in body
+    # The SSE chunk shape is chat.completion.chunk with message → delta, finish_reason preserved
+    assert '"object": "chat.completion.chunk"' in body
+    assert '"delta": {"role": "assistant", "content": "streamed reply"}' in body
+    assert '"finish_reason": "tool_calls"' in body
+
+
+@pytest.mark.asyncio
+async def test_replay_mode_returns_404_when_cursor_exhausted(tmp_path):
+    """Cursor used up → 404 with a clear message; no litellm retries involved."""
+    from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
+
+    record = {
+        "model": "gpt-3.5-turbo",
+        "response": {
+            "id": "only",
+            "choices": [{"index": 0, "message": {"role": "assistant", "content": "x"}, "finish_reason": "stop"}],
+        },
+    }
+    traj = tmp_path / "t.jsonl"
+    traj.write_text(json.dumps(record) + "\n", encoding="utf-8")
+
+    config = ModelServiceConfig()
+    config.replay_traj_path = str(traj)
+
+    local_app = FastAPI()
+    local_app.state.model_service_config = config
+    local_app.state.replay_cursor = SequentialCursor.load(traj)
+    local_app.include_router(proxy_router)
+
+    transport = ASGITransport(app=local_app)
+    async with AsyncClient(transport=transport, base_url="http://test") as ac:
+        await ac.post(
+            "/v1/chat/completions",
+            json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+        )
+        second = await ac.post(
+            "/v1/chat/completions",
+            json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "again"}]},
+        )
+
+    assert second.status_code == 404
+    assert "exhausted" in second.json()["detail"]
 
 
 @pytest.mark.asyncio
diff --git a/tests/unit/sdk/model/test_traj_replayer.py b/tests/unit/sdk/model/test_traj_replayer.py
index 7bfe30ef4e..e4a379bd0d 100644
--- a/tests/unit/sdk/model/test_traj_replayer.py
+++ b/tests/unit/sdk/model/test_traj_replayer.py
@@ -1,19 +1,18 @@
-"""Tests for SequentialCursor + TrajectoryReplayer."""
+"""Tests for SequentialCursor (the replay cursor used by proxy.py).
+
+The proxy serves replay responses directly — there is no CustomLLM-based
+``TrajectoryReplayer`` anymore. End-to-end replay coverage (cursor + SSE chunk
+emit + cursor-exhausted → 404) lives in ``test_proxy.py``.
+"""
 
 import json
-from types import SimpleNamespace
 
 import pytest
-from litellm.llms.custom_llm import CustomLLMError
 
-from rock.sdk.model.server.integrations.traj_replayer import (
-    SequentialCursor,
-    TrajectoryReplayer,
-)
+from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor, TrajectoryExhausted
 
 
 def _record(*, msg: str, model: str = "gpt-3.5-turbo", call_id: str = "x") -> dict:
-    """Build a minimal StandardLoggingPayload-shaped record."""
     return {
         "id": call_id,
         "model": model,
@@ -40,9 +39,6 @@ def _write_jsonl(path, records):
             f.write(json.dumps(r) + "\n")
 
 
-# ----- SequentialCursor -----
-
-
 def test_cursor_load_from_single_file(tmp_path):
     p = tmp_path / "traj.jsonl"
     _write_jsonl(p, [_record(msg="a"), _record(msg="b")])
@@ -69,7 +65,7 @@ def test_cursor_load_missing_file_raises(tmp_path):
 
 
 def test_cursor_load_directory_raises(tmp_path):
-    """A directory is no longer a valid traj_file — must point to a single .jsonl."""
+    """Path must be a single .jsonl file, not a directory."""
     with pytest.raises(FileNotFoundError):
         SequentialCursor.load(tmp_path)
 
@@ -89,17 +85,17 @@ async def test_cursor_next_returns_records_in_order(tmp_path):
 
 
 @pytest.mark.asyncio
-async def test_cursor_next_raises_when_exhausted(tmp_path):
+async def test_cursor_next_raises_trajectory_exhausted_when_done(tmp_path):
     p = tmp_path / "traj.jsonl"
     _write_jsonl(p, [_record(msg="only")])
 
     cur = SequentialCursor.load(p)
     await cur.next()
 
-    with pytest.raises(CustomLLMError) as exc_info:
+    with pytest.raises(TrajectoryExhausted) as exc_info:
         await cur.next()
-    assert exc_info.value.status_code == 404
-    assert "exhausted" in exc_info.value.message
+    assert exc_info.value.position == 1
+    assert exc_info.value.total == 1
 
 
 @pytest.mark.asyncio
@@ -117,88 +113,10 @@ async def test_cursor_reset_replays_from_start(tmp_path):
 
 
 @pytest.mark.asyncio
-async def test_cursor_model_mismatch_only_warns(tmp_path, caplog):
+async def test_cursor_model_mismatch_only_warns(tmp_path):
     p = tmp_path / "traj.jsonl"
     _write_jsonl(p, [_record(msg="a", model="gpt-3.5-turbo")])
 
     cur = SequentialCursor.load(p)
     record = await cur.next(expected_model="gpt-4o")  # different model -> warn but don't raise
     assert record["id"] == "x"
-
-
-# ----- TrajectoryReplayer -----
-
-
-@pytest.mark.asyncio
-async def test_replayer_acompletion_returns_recorded_response(tmp_path):
-    p = tmp_path / "traj.jsonl"
-    _write_jsonl(p, [_record(msg="a", call_id="step-1")])
-
-    replayer = TrajectoryReplayer(p)
-    response = await replayer.acompletion(
-        model="gpt-3.5-turbo",
-        messages=[{"role": "user", "content": "anything"}],
-    )
-
-    assert response.id == "step-1"
-    assert response.choices[0].message.content == "reply: a"
-
-
-@pytest.mark.asyncio
-async def test_replayer_acompletion_advances_cursor(tmp_path):
-    p = tmp_path / "traj.jsonl"
-    _write_jsonl(
-        p,
-        [
-            _record(msg="a", call_id="step-1"),
-            _record(msg="b", call_id="step-2"),
-        ],
-    )
-
-    replayer = TrajectoryReplayer(p)
-    r1 = await replayer.acompletion(model="gpt-3.5-turbo", messages=[])
-    r2 = await replayer.acompletion(model="gpt-3.5-turbo", messages=[])
-
-    assert r1.id == "step-1"
-    assert r2.id == "step-2"
-
-
-@pytest.mark.asyncio
-async def test_replayer_astreaming_yields_chunks_that_recompose_the_text(tmp_path):
-    """The chunks produced by astreaming should reassemble into the recorded text."""
-    p = tmp_path / "traj.jsonl"
-    recorded_text = "Hello world, this is a deterministic replay."
-    record = _record(msg="hi")
-    record["response"]["choices"][0]["message"]["content"] = recorded_text
-    _write_jsonl(p, [record])
-
-    replayer = TrajectoryReplayer(p)
-
-    # Build a litellm-shaped ModelResponse mock with one Choice/Delta slot.
-    fake_choice = SimpleNamespace(delta=SimpleNamespace(role=None, content=None), index=0)
-    fake_response = SimpleNamespace(choices=[fake_choice])
-
-    chunks_text = []
-    async for chunk in replayer.astreaming(
-        model="gpt-3.5-turbo",
-        messages=[],
-        model_response=fake_response,
-    ):
-        if hasattr(chunk, "choices") and chunk.choices and getattr(chunk.choices[0], "delta", None):
-            piece = chunk.choices[0].delta.content
-            if piece:
-                chunks_text.append(piece)
-
-    assert "".join(chunks_text) == recorded_text
-
-
-@pytest.mark.asyncio
-async def test_replayer_acompletion_raises_on_exhaustion(tmp_path):
-    p = tmp_path / "traj.jsonl"
-    _write_jsonl(p, [_record(msg="only")])
-
-    replayer = TrajectoryReplayer(p)
-    await replayer.acompletion(model="gpt-3.5-turbo", messages=[])
-
-    with pytest.raises(CustomLLMError):
-        await replayer.acompletion(model="gpt-3.5-turbo", messages=[])

From 7a4b37fda5c701fd12a8d50168efab8b30e71060 Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 03:58:17 +0000
Subject: [PATCH 04/25] refactor(model-service): drop litellm, use httpx
 byte-passthrough + openai SDK as parser
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replaces litellm.acompletion with raw httpx forwarding. The proxy no longer
parses or rewrites the OpenAI protocol on the forward path — body bytes go
upstream as-is, response bytes come back as-is. The openai SDK is kept solely
as a parser library for the recording side: ChatCompletionChunk + the official
ChatCompletionStreamState aggregate streaming chunks into a final ChatCompletion
that the recorder writes to JSONL.

This restores the proxy's original "transparent forward" intent and eliminates
several litellm-specific pain points encountered during testing:
- No "model isn't mapped yet" cost-calc exception (no calc happens at all).
- No need for the input_cost_per_token=0 / custom_openai prefix workarounds.
- Authorization header passes through verbatim (no api_key extraction kludge).
- Provider-specific fields (reasoning_content, citations, ...) are preserved
  byte-for-byte going to the client AND auto-aggregated in the recorded traj
  (openai SDK uses extra="allow" pydantic mode).
- Cursor exhaustion in replay returns 404 directly, never gets retried.

Changes:
- pyproject.toml: drop litellm>=1.50.0, add openai>=1.50.0 and httpx
- proxy.py: rewrite forward path with httpx; record streams via dual-purpose
  byte forwarding + parallel SSE parsing into ChatCompletionStreamState
- traj_recorder.py: drop CustomLogger inheritance; expose explicit
  recorder.record(request, response, status, ...) API called from proxy.py
- main.py: attach recorder/cursor to app.state instead of registering with
  litellm.callbacks / litellm.custom_provider_map
- test_proxy.py: rewritten to use httpx.MockTransport for upstream mocking;
  cover byte passthrough, provider-specific field preservation, error
  forwarding, recorder invocation, replay paths
- test_traj_recorder.py: rewritten for the explicit-call API

36 passed. End-to-end verified against DashScope glm-5: streaming record,
non-stream replay, streaming replay, cursor exhausted -> 404 all work.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 pyproject.toml                                |   8 +-
 rock/sdk/model/server/api/proxy.py            | 314 +++++----
 .../server/integrations/traj_recorder.py      |  93 +--
 rock/sdk/model/server/main.py                 |  28 +-
 tests/unit/sdk/model/test_proxy.py            | 614 ++++++++----------
 tests/unit/sdk/model/test_traj_recorder.py    | 203 +++---
 uv.lock                                       | 277 +-------
 7 files changed, 612 insertions(+), 925 deletions(-)

diff --git a/pyproject.toml b/pyproject.toml
index bf814e0aa1..ac87c14c41 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -86,7 +86,13 @@ model-service = [
     "psutil",
     "swebench",
     "alibabacloud_cr20181201==2.0.5",
-    "litellm>=1.50.0",
+    # openai SDK is used as a TYPE/parser library only — for ChatCompletionChunk
+    # validation and ChatCompletionStreamState (the official stream chunk aggregator).
+    # We do NOT use AsyncOpenAI as an HTTP client; transport is plain httpx so the
+    # proxy can forward upstream bytes verbatim, including any provider-specific
+    # fields (reasoning_content, citations, ...) without re-encoding OpenAI protocol.
+    "openai>=1.50.0",
+    "httpx",
 ]
 
 
diff --git a/rock/sdk/model/server/api/proxy.py b/rock/sdk/model/server/api/proxy.py
index e7ded21cd2..8161c4adc0 100644
--- a/rock/sdk/model/server/api/proxy.py
+++ b/rock/sdk/model/server/api/proxy.py
@@ -1,20 +1,20 @@
-"""OpenAI-compatible chat/completions proxy.
+"""OpenAI-compatible chat/completions proxy with trajectory record/replay.
 
 Two paths share this handler:
 
-1. **Record / forward mode** (default) — ``litellm.acompletion`` is called with
-   the user-supplied model/messages, the upstream is selected from
-   ``proxy_base_url`` / ``proxy_rules``, retries come from litellm's
-   ``num_retries``, and the recorded JSONL trajectory is written by a
-   ``litellm.callbacks`` entry registered at startup (see
-   ``rock.sdk.model.server.main``).
+1. **Forward / record mode** (default) — body bytes are POSTed verbatim to the
+   configured upstream via plain ``httpx``. The upstream response is forwarded
+   byte-for-byte back to the client (raw JSON for non-stream, raw SSE bytes
+   for stream). On the side we run a parser (``ChatCompletionChunk`` +
+   ``ChatCompletionStreamState`` from the openai SDK) to aggregate streaming
+   chunks into a final ChatCompletion that the recorder writes to JSONL. The
+   forward path itself does NOT depend on OpenAI types — anything the upstream
+   returns (provider-specific ``reasoning_content``, ``citations``, ...) is
+   passed through untouched.
 
 2. **Replay mode** (``replay_traj_path`` set) — the request is served directly
-   from the next record in ``app.state.replay_cursor`` without going through
-   litellm at all. We have a complete OpenAI-shape response on disk, so there's
-   no value in routing through CustomLLM/CustomStreamWrapper just to translate
-   formats. Streaming emits the recorded response as a single SSE chunk +
-   ``[DONE]``, mirroring litellm's own ``MockResponseIterator`` strategy.
+   from the next record in ``app.state.replay_cursor`` without any upstream
+   call. Streaming emits the recorded response as one SSE chunk + ``[DONE]``.
 """
 
 from __future__ import annotations
@@ -25,13 +25,15 @@
 from collections.abc import AsyncIterator
 from typing import Any
 
-import litellm
+import httpx
 from fastapi import APIRouter, HTTPException, Request
-from fastapi.responses import JSONResponse, StreamingResponse
-from litellm.exceptions import APIError, AuthenticationError, BadRequestError, RateLimitError, Timeout
+from fastapi.responses import JSONResponse, Response, StreamingResponse
+from openai.lib.streaming.chat import ChatCompletionStreamState
+from openai.types.chat import ChatCompletionChunk
 
 from rock.logger import init_logger
 from rock.sdk.model.server.config import ModelServiceConfig
+from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
 from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor, TrajectoryExhausted
 
 logger = init_logger(__name__)
@@ -41,31 +43,9 @@
 
 
 # Headers we never forward upstream:
-#   - host / content-length / content-type: litellm rewrites the body and re-targets,
-#     so the client's values would be wrong or misleading
-#   - transfer-encoding / connection: true RFC 7230 hop-by-hop headers, scoped to
-#     the client↔proxy connection only
-#   - authorization: extracted into api_key kwarg, see _extract_bearer_token
-_HEADERS_NOT_TO_FORWARD = frozenset(
-    {"host", "content-length", "content-type", "transfer-encoding", "connection", "authorization"}
-)
-
-
-def _extract_bearer_token(headers) -> str | None:
-    """Pull the Bearer token out of the Authorization header.
-
-    litellm's OpenAI client needs the API key as an explicit ``api_key=`` kwarg —
-    setting Authorization in extra_headers does not work because litellm always
-    regenerates that header from ``api_key`` (or env vars). So we extract it here
-    and let the proxy stay stateless about which key the client is using.
-    """
-    auth = headers.get("authorization") or headers.get("Authorization")
-    if not auth:
-        return None
-    parts = auth.split(None, 1)
-    if len(parts) == 2 and parts[0].lower() == "bearer":
-        return parts[1].strip()
-    return auth.strip()
+#   - host / content-length: rebuilt by httpx for the upstream request
+#   - transfer-encoding / connection: RFC 7230 hop-by-hop, scoped to one connection
+_HEADERS_NOT_TO_FORWARD = frozenset({"host", "content-length", "transfer-encoding", "connection"})
 
 
 def get_base_url(model_name: str, config: ModelServiceConfig) -> str:
@@ -93,53 +73,21 @@ def get_base_url(model_name: str, config: ModelServiceConfig) -> str:
 
 
 def _filter_headers(headers) -> dict[str, str]:
-    forwarded = {}
+    """Drop headers that are scoped to the client↔proxy hop or rebuilt by httpx.
+    ``Authorization`` is forwarded verbatim — proxy stays stateless about which
+    API key the client uses."""
+    out = {}
     for key, value in headers.items():
         if key.lower() in _HEADERS_NOT_TO_FORWARD:
             continue
-        forwarded[key] = value
-    return forwarded
-
-
-def _format_error_response(exc: Exception) -> JSONResponse:
-    """Render a litellm exception as the legacy ``{error:{message,type,code}}`` JSON.
-
-    Agent-side logic keys off message substrings (e.g. "context length exceeded",
-    "content violation"), so we keep the message verbatim from the upstream.
-    """
-    status_code = getattr(exc, "status_code", None) or 502
-    message = str(exc)
-    error_type = type(exc).__name__
-    return JSONResponse(
-        status_code=status_code,
-        content={
-            "error": {
-                "message": f"LLM backend error: {message}",
-                "type": error_type,
-                "code": status_code,
-            }
-        },
-    )
-
-
-async def _sse_iter(stream: AsyncIterator[Any]) -> AsyncIterator[bytes]:
-    """Convert a litellm async chunk stream into Server-Sent Events bytes."""
-    try:
-        async for chunk in stream:
-            payload = chunk.model_dump() if hasattr(chunk, "model_dump") else chunk
-            yield f"data: {json.dumps(payload, ensure_ascii=False)}\n\n".encode()
-    finally:
-        yield b"data: [DONE]\n\n"
+        out[key] = value
+    return out
 
 
 def _completion_to_chunk(response: dict, *, model: str) -> dict:
     """Convert a recorded ``chat.completion`` response into a single
-    ``chat.completion.chunk`` shape (move ``message`` → ``delta``).
-
-    Mirrors what litellm's ``convert_model_response_to_streaming`` does for its
-    own non-streaming providers — preserves ``finish_reason``, ``tool_calls``
-    and any other fields verbatim by simply renaming the wrapper key.
-    """
+    ``chat.completion.chunk`` shape (move ``message`` → ``delta``). Used only by
+    the replay streaming path."""
     choices_in = response.get("choices") or []
     choices_out = []
     for choice in choices_in:
@@ -162,31 +110,112 @@ def _completion_to_chunk(response: dict, *, model: str) -> dict:
 
 
 async def _replay_sse_iter(response: dict, *, model: str) -> AsyncIterator[bytes]:
-    """Emit a recorded response as a single SSE chunk + ``[DONE]``.
-
-    The whole recorded answer goes out in one chunk — same strategy as
-    litellm's ``MockResponseIterator``. Most agents accumulate SSE into a
-    final string anyway; faking finer-grained streaming would just add code
-    without buying anyone anything.
-    """
+    """Emit a recorded response as one SSE chunk + ``[DONE]``."""
     chunk = _completion_to_chunk(response, model=model)
     yield f"data: {json.dumps(chunk, ensure_ascii=False)}\n\n".encode()
     yield b"data: [DONE]\n\n"
 
 
+def _parse_sse_chunks_into_state(buffer: bytes, state: ChatCompletionStreamState) -> bytes:
+    """Pull complete SSE events out of ``buffer`` and feed each ``data:`` line
+    (other than ``[DONE]``) to the openai stream-state aggregator. Returns the
+    leftover bytes that did not yet form a complete event."""
+    while b"\n\n" in buffer:
+        event, buffer = buffer.split(b"\n\n", 1)
+        for raw_line in event.split(b"\n"):
+            line = raw_line.decode("utf-8", errors="replace").strip()
+            if not line.startswith("data:"):
+                continue
+            payload = line[len("data:") :].strip()
+            if not payload or payload == "[DONE]":
+                continue
+            try:
+                state.handle_chunk(ChatCompletionChunk.model_validate(json.loads(payload)))
+            except Exception as exc:  # parser error: forward continues, traj will be partial
+                logger.debug(f"[record] chunk parse failed (forward continues): {exc}")
+    return buffer
+
+
+async def _forward_stream_and_record(
+    *,
+    upstream_url: str,
+    body_bytes: bytes,
+    fwd_headers: dict[str, str],
+    timeout: float,
+    request_dict: dict[str, Any],
+    recorder: TrajectoryRecorder | None,
+) -> AsyncIterator[bytes]:
+    """SSE bytes are forwarded verbatim; chunks are parsed in parallel and
+    aggregated into the final ChatCompletion that the recorder writes to JSONL."""
+    state = ChatCompletionStreamState()
+    start = time.time()
+    parse_buffer = b""
+    upstream_status = 0
+
+    try:
+        async with httpx.AsyncClient(timeout=timeout) as client:
+            async with client.stream("POST", upstream_url, content=body_bytes, headers=fwd_headers) as r:
+                upstream_status = r.status_code
+                async for chunk in r.aiter_bytes():
+                    yield chunk
+                    parse_buffer = _parse_sse_chunks_into_state(parse_buffer + chunk, state)
+    except httpx.RequestError as exc:
+        # Connection died mid-stream. The bytes already sent reach the client;
+        # we still try to record what we got.
+        if recorder is not None:
+            await recorder.record(
+                request=request_dict,
+                response=None,
+                status="failure",
+                start_time=start,
+                end_time=time.time(),
+                error=f"{type(exc).__name__}: {exc}",
+            )
+        return
+
+    if recorder is None:
+        return
+
+    status = "success" if upstream_status < 400 else "failure"
+    final_dict: dict | None = None
+    if status == "success":
+        try:
+            final_dict = state.get_final_completion().model_dump()
+        except Exception as exc:
+            logger.warning(f"[record] stream aggregation failed: {exc}")
+
+    await recorder.record(
+        request=request_dict,
+        response=final_dict,
+        status=status,
+        start_time=start,
+        end_time=time.time(),
+        error=None if status == "success" else f"upstream_status={upstream_status}",
+    )
+
+
 @proxy_router.post("/v1/chat/completions")
-async def chat_completions(body: dict[str, Any], request: Request):
+async def chat_completions(request: Request):
     """OpenAI-compatible chat completions proxy endpoint.
 
-    In replay mode (``replay_traj_path`` set), serves the next record from
-    ``app.state.replay_cursor`` directly — no litellm involvement. Otherwise
-    forwards to the configured upstream via ``litellm.acompletion``.
+    Reads the body as raw bytes (no parsing on the forward path) and either
+    serves it from the replay cursor or forwards it to the configured upstream.
     """
     config: ModelServiceConfig = request.app.state.model_service_config
-    model_name = body.get("model", "")
-    is_stream = bool(body.get("stream"))
+    recorder: TrajectoryRecorder | None = getattr(request.app.state, "recorder", None)
+
+    body_bytes = await request.body()
+    try:
+        request_dict = json.loads(body_bytes) if body_bytes else {}
+    except json.JSONDecodeError:
+        raise HTTPException(status_code=400, detail="Request body is not valid JSON.")
+    if not isinstance(request_dict, dict):
+        raise HTTPException(status_code=400, detail="Request body must be a JSON object.")
 
-    # ---- Replay mode: short-circuit, never touch litellm ----
+    model_name = request_dict.get("model", "")
+    is_stream = bool(request_dict.get("stream"))
+
+    # ---- Replay mode: short-circuit, no upstream call ----
     if config.replay_traj_path:
         cursor: SequentialCursor = request.app.state.replay_cursor
         try:
@@ -209,44 +238,71 @@ async def chat_completions(body: dict[str, Any], request: Request):
             )
         return JSONResponse(status_code=200, content=response_dict)
 
-    # ---- Forward / record mode: go through litellm ----
-    api_base = get_base_url(model_name, config)
-    # custom_openai is litellm's catch-all for OpenAI-compatible third-party endpoints
-    # (DashScope, ModelScope, Groq, Mistral, ...). Unlike `openai/`, it does NOT do
-    # model-name lookup, so arbitrary upstream model names like "glm-5" / "qwen-turbo"
-    # work without "This model isn't mapped yet" errors.
-    litellm_model = f"custom_openai/{model_name}" if model_name else "custom_openai/default"
-    logger.info(f"Routing model '{model_name}' to {api_base}")
+    # ---- Forward / record mode: byte-passthrough via httpx ----
+    upstream_url = f"{get_base_url(model_name, config)}/chat/completions"
+    fwd_headers = _filter_headers(request.headers)
+    logger.info(f"Routing model {model_name!r} to {upstream_url}")
 
-    api_key = _extract_bearer_token(request.headers)
-    extra_headers = _filter_headers(request.headers)
+    if is_stream:
+        return StreamingResponse(
+            _forward_stream_and_record(
+                upstream_url=upstream_url,
+                body_bytes=body_bytes,
+                fwd_headers=fwd_headers,
+                timeout=config.request_timeout,
+                request_dict=request_dict,
+                recorder=recorder,
+            ),
+            media_type="text/event-stream",
+        )
 
-    call_kwargs = dict(body)
-    call_kwargs.pop("model", None)
+    # Non-stream: single POST, return upstream's status + body verbatim, record on the side.
+    start = time.time()
+    try:
+        async with httpx.AsyncClient(timeout=config.request_timeout) as client:
+            r = await client.post(upstream_url, content=body_bytes, headers=fwd_headers)
+    except httpx.TimeoutException as exc:
+        if recorder is not None:
+            await recorder.record(
+                request=request_dict,
+                response=None,
+                status="failure",
+                start_time=start,
+                end_time=time.time(),
+                error=f"timeout: {exc}",
+            )
+        raise HTTPException(status_code=504, detail=f"Upstream timed out: {exc}")
+    except httpx.RequestError as exc:
+        if recorder is not None:
+            await recorder.record(
+                request=request_dict,
+                response=None,
+                status="failure",
+                start_time=start,
+                end_time=time.time(),
+                error=f"{type(exc).__name__}: {exc}",
+            )
+        raise HTTPException(status_code=502, detail=f"Upstream request failed: {exc}")
 
+    response_text = r.text  # bytes already read by httpx; .text decodes once
+    response_dict: dict | None = None
     try:
-        response = await litellm.acompletion(
-            model=litellm_model,
-            api_base=api_base,
-            api_key=api_key,
-            extra_headers=extra_headers,
-            timeout=config.request_timeout,
-            num_retries=config.num_retries,
-            # Zero-cost rates suppress "model isn't mapped yet" from litellm's
-            # post-call cost calculator for arbitrary upstream model names.
-            input_cost_per_token=0,
-            output_cost_per_token=0,
-            **call_kwargs,
+        parsed = json.loads(response_text) if response_text else None
+        if isinstance(parsed, dict):
+            response_dict = parsed
+    except json.JSONDecodeError:
+        pass
+
+    if recorder is not None:
+        await recorder.record(
+            request=request_dict,
+            response=response_dict,
+            status="success" if r.status_code < 400 else "failure",
+            start_time=start,
+            end_time=time.time(),
+            error=None if r.status_code < 400 else f"upstream_status={r.status_code}",
         )
-    except (RateLimitError, APIError, BadRequestError, AuthenticationError, Timeout) as exc:
-        logger.warning(f"litellm error for model '{model_name}': {exc}")
-        return _format_error_response(exc)
-    except Exception as exc:  # pragma: no cover - last-resort safety net
-        logger.error(f"Unexpected proxy error: {exc}", exc_info=True)
-        raise HTTPException(status_code=500, detail=str(exc))
-
-    if is_stream:
-        return StreamingResponse(_sse_iter(response), media_type="text/event-stream")
 
-    body_out = response.model_dump() if hasattr(response, "model_dump") else response
-    return JSONResponse(status_code=200, content=body_out)
+    # Forward bytes verbatim — preserves any provider-specific fields untouched.
+    media_type = r.headers.get("content-type", "application/json")
+    return Response(content=response_text, status_code=r.status_code, media_type=media_type)
diff --git a/rock/sdk/model/server/integrations/traj_recorder.py b/rock/sdk/model/server/integrations/traj_recorder.py
index 6aa01a8eed..a0c7e08fc7 100644
--- a/rock/sdk/model/server/integrations/traj_recorder.py
+++ b/rock/sdk/model/server/integrations/traj_recorder.py
@@ -1,9 +1,13 @@
-"""Record chat/completions trajectories as JSONL via litellm's CustomLogger hook.
+"""Append chat/completions trajectories as JSONL.
 
-One line per call, each line is a ``StandardLoggingPayload`` dict from litellm.
-Streaming chunks are aggregated by litellm before this callback fires (see
-litellm/litellm_core_utils/litellm_logging.py around line 1930), so we don't
-need to handle the streaming/non-streaming split ourselves.
+The recorder is invoked **explicitly** from ``proxy.py`` after each forwarded
+call (success or failure). It is no longer a litellm CustomLogger — we removed
+the litellm SDK dependency in favor of httpx-based byte forwarding, and call
+this object directly so writes stay deterministic and locally testable.
+
+Schema per line: a small dict with ``request`` / ``response`` / ``status`` /
+``response_time`` / ``model`` / ``stream``. Faithful enough to drive the
+sequential replayer; not a full StandardLoggingPayload.
 """
 
 from __future__ import annotations
@@ -11,9 +15,9 @@
 import asyncio
 import json
 import os
+import time
 from pathlib import Path
-
-from litellm.integrations.custom_logger import CustomLogger
+from typing import Any
 
 from rock.logger import init_logger
 from rock.sdk.model.server.utils import (
@@ -25,53 +29,62 @@
 logger = init_logger(__name__)
 
 
-class TrajectoryRecorder(CustomLogger):
-    """litellm CustomLogger that appends each call's StandardLoggingPayload to JSONL
-    and reports OTLP RT/count metrics."""
+class TrajectoryRecorder:
+    """Appends one JSONL line per chat/completions call and reports OTLP metrics."""
 
     def __init__(self, traj_file: str | os.PathLike) -> None:
-        super().__init__()
         self.traj_file = Path(traj_file)
         self.traj_file.parent.mkdir(parents=True, exist_ok=True)
         self._lock = asyncio.Lock()
         self._monitor = _get_or_create_metrics_monitor()
 
-    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
-        payload = kwargs.get("standard_logging_object")
-        if payload is None:
-            logger.debug("[traj-recorder] success event without standard_logging_object, skipping")
-            return
-        await self._append_jsonl(payload)
-        self._record_metrics(payload, status="success")
-
-    async def async_log_failure_event(self, kwargs, response_obj, start_time, end_time):
-        payload = kwargs.get("standard_logging_object")
-        if payload is None:
-            return
-        await self._append_jsonl(payload)
-        self._record_metrics(payload, status="failure")
-
-    async def _append_jsonl(self, payload: dict) -> None:
+    async def record(
+        self,
+        *,
+        request: dict[str, Any],
+        response: dict[str, Any] | None,
+        status: str,
+        start_time: float,
+        end_time: float,
+        error: str | None = None,
+    ) -> None:
+        """Persist one call to the JSONL file and report RT/count metrics.
+
+        ``request`` / ``response`` are stored verbatim (whatever the upstream
+        returned, including provider-specific fields like ``reasoning_content``).
+        For streaming calls, ``response`` is the aggregated final ChatCompletion
+        produced by ``ChatCompletionStreamState.get_final_completion().model_dump()``.
+        """
+        rt_seconds = end_time - start_time
+        payload = {
+            "model": request.get("model"),
+            "stream": bool(request.get("stream")),
+            "status": status,
+            "response_time": rt_seconds,
+            "start_time": start_time,
+            "end_time": end_time,
+            "request": request,
+            "response": response,
+            "error": error,
+        }
+
         line = json.dumps(payload, ensure_ascii=False, default=str) + "\n"
         async with self._lock:
             await asyncio.to_thread(self._write_line, line)
 
-    def _write_line(self, line: str) -> None:
-        with self.traj_file.open("a", encoding="utf-8") as f:
-            f.write(line)
-
-    def _record_metrics(self, payload: dict, *, status: str) -> None:
-        rt_seconds = payload.get("response_time")
-        if rt_seconds is None:
-            start = payload.get("startTime")
-            end = payload.get("endTime")
-            rt_seconds = (end - start) if (start is not None and end is not None) else 0.0
-        rt_ms = float(rt_seconds) * 1000.0
-
         attrs = {
             "type": "chat_completions",
             "status": status,
             "sandbox_id": os.getenv("ROCK_SANDBOX_ID", "unknown"),
         }
-        self._monitor.record_gauge_by_name(MODEL_SERVICE_REQUEST_RT, rt_ms, attributes=attrs)
+        self._monitor.record_gauge_by_name(MODEL_SERVICE_REQUEST_RT, rt_seconds * 1000.0, attributes=attrs)
         self._monitor.record_counter_by_name(MODEL_SERVICE_REQUEST_COUNT, 1, attributes=attrs)
+
+    def _write_line(self, line: str) -> None:
+        with self.traj_file.open("a", encoding="utf-8") as f:
+            f.write(line)
+
+
+def now() -> float:
+    """Wall-clock seconds (single shim so callers don't import time directly)."""
+    return time.time()
diff --git a/rock/sdk/model/server/main.py b/rock/sdk/model/server/main.py
index 133605b046..31b4918e55 100644
--- a/rock/sdk/model/server/main.py
+++ b/rock/sdk/model/server/main.py
@@ -53,18 +53,15 @@ async def global_exception_handler(request, exc):
 
 
 def _configure_proxy_integrations(app: FastAPI, config: ModelServiceConfig) -> None:
-    """Wire up record/replay integrations for the proxy mode.
-
-    - When ``replay_traj_path`` is set, load the trajectory into a
-      ``SequentialCursor`` and attach it to ``app.state.replay_cursor``. The
-      proxy handler serves recorded responses directly from this cursor; we
-      do NOT register anything with litellm (replay path bypasses litellm
-      entirely so cursor-exhausted errors aren't swallowed by retry logic).
-    - Otherwise (record/forward mode), if ``traj_enabled`` is True, register
-      ``TrajectoryRecorder`` as a ``litellm.callbacks`` entry so every
-      chat/completions call appends a JSONL line.
-
-    Replay and record are mutually exclusive: in replay mode we don't record,
+    """Wire up record/replay integrations and attach them to ``app.state``.
+
+    - Replay mode (``replay_traj_path`` set): load the trajectory into a
+      ``SequentialCursor`` and stash it as ``app.state.replay_cursor``.
+    - Forward/record mode (default): if ``traj_enabled`` is True, attach a
+      ``TrajectoryRecorder`` instance as ``app.state.recorder``. The proxy
+      handler invokes it explicitly after each forwarded call.
+
+    Replay and record are mutually exclusive — in replay mode we don't record,
     since replayed responses round-tripping back into the source file would
     inflate metrics and corrupt the trajectory.
     """
@@ -76,14 +73,11 @@ def _configure_proxy_integrations(app: FastAPI, config: ModelServiceConfig) -> N
         return
 
     if config.traj_enabled:
-        import litellm
-
         from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
 
         traj_path = config.traj_file or TRAJ_FILE
-        recorder = TrajectoryRecorder(traj_file=traj_path)
-        litellm.callbacks.append(recorder)
-        logger.info(f"litellm trajectory recorder registered, traj_file={traj_path}")
+        app.state.recorder = TrajectoryRecorder(traj_file=traj_path)
+        logger.info(f"trajectory recorder attached, traj_file={traj_path}")
 
 
 def main(
diff --git a/tests/unit/sdk/model/test_proxy.py b/tests/unit/sdk/model/test_proxy.py
index 5a4d7e8230..88c61edcb3 100644
--- a/tests/unit/sdk/model/test_proxy.py
+++ b/tests/unit/sdk/model/test_proxy.py
@@ -1,7 +1,15 @@
+"""Tests for the chat/completions proxy.
+
+Forward path is exercised by pointing the proxy at an httpx ``MockTransport``
+(no real network). Replay path is exercised end-to-end via the FastAPI test
+client. Config / CLI / metrics-singleton tests round out the file.
+"""
+
+import argparse
 import json
-from types import SimpleNamespace
-from unittest.mock import AsyncMock, MagicMock, patch
+from unittest.mock import MagicMock, patch
 
+import httpx
 import pytest
 import yaml
 from fastapi import FastAPI
@@ -9,6 +17,7 @@
 
 from rock.sdk.model.server.api.proxy import proxy_router
 from rock.sdk.model.server.config import ModelServiceConfig
+from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
 from rock.sdk.model.server.main import create_config_from_args, lifespan
 from rock.sdk.model.server.utils import (
     MODEL_SERVICE_REQUEST_COUNT,
@@ -17,171 +26,287 @@
     record_traj,
 )
 
-# Initialize a temporary FastAPI application for testing the router
-test_app = FastAPI()
-test_app.include_router(proxy_router)
 
-mock_config = ModelServiceConfig()
-test_app.state.model_service_config = mock_config
+def _build_app(config: ModelServiceConfig, *, replay_cursor=None) -> FastAPI:
+    """Build a FastAPI app with the proxy router and the given config attached."""
+    app = FastAPI()
+    app.state.model_service_config = config
+    if replay_cursor is not None:
+        app.state.replay_cursor = replay_cursor
+    app.include_router(proxy_router)
+    return app
+
+
+def _patch_httpx_with_handler(handler):
+    """Patch ``proxy.httpx.AsyncClient`` so each ``async with httpx.AsyncClient(...)``
+    returns a real client wrapping ``MockTransport(handler)``."""
+    real_client_cls = httpx.AsyncClient  # capture before patching kicks in
+    transport = httpx.MockTransport(handler)
 
+    def factory(*args, **kwargs):
+        kwargs.pop("timeout", None)  # transport supplies the response, no timeout needed
+        return real_client_cls(transport=transport, **kwargs)
 
-# Patch path for the litellm.acompletion symbol as imported inside proxy.py.
-ACOMPLETION_PATCH = "rock.sdk.model.server.api.proxy.litellm.acompletion"
+    return patch("rock.sdk.model.server.api.proxy.httpx.AsyncClient", side_effect=factory)
 
 
-def _fake_model_response(*, id="chat-123", choices=None) -> SimpleNamespace:
-    """Build a litellm-shaped object that exposes .model_dump() like a Pydantic model."""
-    payload = {
-        "id": id,
+def _success_response_json(*, model: str = "gpt-3.5-turbo", content: str = "hi") -> dict:
+    return {
+        "id": "chatcmpl-1",
         "object": "chat.completion",
-        "model": "gpt-3.5-turbo",
-        "choices": choices
-        or [
-            {"index": 0, "message": {"role": "assistant", "content": "hi"}, "finish_reason": "stop"},
+        "created": 1234,
+        "model": model,
+        "choices": [
+            {
+                "index": 0,
+                "message": {"role": "assistant", "content": content},
+                "finish_reason": "stop",
+            }
         ],
         "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
     }
-    return SimpleNamespace(model_dump=lambda: payload)
+
+
+# ---------- Forward path: routing ----------
 
 
 @pytest.mark.asyncio
-async def test_chat_completions_routing_success():
-    """Routing: model name maps to its proxy_rules entry, passed to litellm as api_base."""
-    with patch(ACOMPLETION_PATCH, new_callable=AsyncMock) as mock_acompletion:
-        mock_acompletion.return_value = _fake_model_response()
+async def test_forward_routes_by_model_name_to_proxy_rules():
+    captured = {}
 
-        transport = ASGITransport(app=test_app)
+    def handler(request: httpx.Request) -> httpx.Response:
+        captured["url"] = str(request.url)
+        captured["body"] = json.loads(request.content)
+        return httpx.Response(200, json=_success_response_json())
+
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
         async with AsyncClient(transport=transport, base_url="http://test") as ac:
-            payload = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hello"}]}
-            response = await ac.post("/v1/chat/completions", json=payload)
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
 
-        assert response.status_code == 200
-        assert mock_acompletion.called
-        call_kwargs = mock_acompletion.call_args.kwargs
-        assert call_kwargs["api_base"] == "https://api.openai.com/v1"
-        assert call_kwargs["model"] == "custom_openai/gpt-3.5-turbo"
-        assert call_kwargs["messages"] == [{"role": "user", "content": "hello"}]
+    assert r.status_code == 200
+    assert captured["url"] == "https://api.openai.com/v1/chat/completions"
+    assert captured["body"]["model"] == "gpt-3.5-turbo"
 
 
 @pytest.mark.asyncio
-async def test_chat_completions_fallback_to_default_when_not_found():
-    """Unrecognized model name → falls back to the 'default' base URL."""
-    with patch(ACOMPLETION_PATCH, new_callable=AsyncMock) as mock_acompletion:
-        mock_acompletion.return_value = _fake_model_response(id="chat-fallback")
+async def test_forward_falls_back_to_default_for_unknown_model():
+    captured = {}
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        captured["url"] = str(request.url)
+        return httpx.Response(200, json=_success_response_json(model="some-random"))
 
-        config = test_app.state.model_service_config
-        default_base_url = config.proxy_rules["default"].rstrip("/")
+    config = ModelServiceConfig()
+    expected_default = config.proxy_rules["default"].rstrip("/") + "/chat/completions"
+    app = _build_app(config)
 
-        transport = ASGITransport(app=test_app)
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
         async with AsyncClient(transport=transport, base_url="http://test") as ac:
-            payload = {
-                "model": "some-random-unsupported-model",
-                "messages": [{"role": "user", "content": "hello"}],
-            }
-            response = await ac.post("/v1/chat/completions", json=payload)
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "some-random", "messages": [{"role": "user", "content": "hi"}]},
+            )
 
-        assert response.status_code == 200
-        call_kwargs = mock_acompletion.call_args.kwargs
-        assert call_kwargs["api_base"] == default_base_url
+    assert r.status_code == 200
+    assert captured["url"] == expected_default
 
 
 @pytest.mark.asyncio
-async def test_chat_completions_routing_absolute_fail():
-    """No matching rule and no 'default' → 400."""
-    empty_config = ModelServiceConfig()
-    empty_config.proxy_rules = {}
+async def test_forward_400_when_no_rule_and_no_default():
+    config = ModelServiceConfig()
+    config.proxy_rules = {}
+    app = _build_app(config)
 
-    with patch.object(test_app.state, "model_service_config", empty_config):
-        transport = ASGITransport(app=test_app)
-        async with AsyncClient(transport=transport, base_url="http://test") as ac:
-            payload = {"model": "any-model", "messages": [{"role": "user", "content": "hello"}]}
-            response = await ac.post("/v1/chat/completions", json=payload)
+    transport = ASGITransport(app=app)
+    async with AsyncClient(transport=transport, base_url="http://test") as ac:
+        r = await ac.post(
+            "/v1/chat/completions",
+            json={"model": "any", "messages": [{"role": "user", "content": "hi"}]},
+        )
 
-    assert response.status_code == 400
-    detail = response.json()["detail"]
-    assert "not configured" in detail
+    assert r.status_code == 400
+    assert "not configured" in r.json()["detail"]
 
 
 @pytest.mark.asyncio
-async def test_proxy_base_url_overrides_proxy_rules():
-    """When proxy_base_url is set, all requests go to that URL, ignoring proxy_rules."""
+async def test_forward_proxy_base_url_overrides_proxy_rules():
+    captured = {}
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        captured["url"] = str(request.url)
+        return httpx.Response(200, json=_success_response_json())
+
     config = ModelServiceConfig()
     config.proxy_base_url = "https://custom-endpoint.example.com/v1"
+    app = _build_app(config)
 
-    local_app = FastAPI()
-    local_app.state.model_service_config = config
-    local_app.include_router(proxy_router)
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
 
-    with patch(ACOMPLETION_PATCH, new_callable=AsyncMock) as mock_acompletion:
-        mock_acompletion.return_value = _fake_model_response()
+    assert captured["url"] == "https://custom-endpoint.example.com/v1/chat/completions"
 
-        transport = ASGITransport(app=local_app)
-        async with AsyncClient(transport=transport, base_url="http://test") as ac:
-            payload = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hello"}]}
-            response = await ac.post("/v1/chat/completions", json=payload)
 
-        assert response.status_code == 200
-        call_kwargs = mock_acompletion.call_args.kwargs
-        assert call_kwargs["api_base"] == "https://custom-endpoint.example.com/v1"
+# ---------- Forward path: byte passthrough ----------
 
 
 @pytest.mark.asyncio
-async def test_chat_completions_passes_num_retries_and_timeout():
-    """num_retries and request_timeout from config flow through to litellm.acompletion."""
-    config = ModelServiceConfig()
-    config.num_retries = 3
-    config.request_timeout = 45
+async def test_forward_response_body_is_byte_for_byte_passthrough():
+    """Upstream's exact JSON bytes (incl. provider-specific fields) reach the client."""
+    upstream_payload = {
+        "id": "x",
+        "object": "chat.completion",
+        "model": "glm-5",
+        "choices": [
+            {
+                "index": 0,
+                "message": {"role": "assistant", "content": "hi", "reasoning_content": "...think..."},
+                "finish_reason": "stop",
+            }
+        ],
+        "provider_specific_fields": {"vendor_field": "vendor_value"},
+    }
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        return httpx.Response(200, json=upstream_payload)
+
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "glm-5", "messages": [{"role": "user", "content": "hi"}]},
+            )
 
-    local_app = FastAPI()
-    local_app.state.model_service_config = config
-    local_app.include_router(proxy_router)
+    body = r.json()
+    assert body["choices"][0]["message"]["reasoning_content"] == "...think..."
+    assert body["provider_specific_fields"] == {"vendor_field": "vendor_value"}
 
-    with patch(ACOMPLETION_PATCH, new_callable=AsyncMock) as mock_acompletion:
-        mock_acompletion.return_value = _fake_model_response()
 
-        transport = ASGITransport(app=local_app)
+@pytest.mark.asyncio
+async def test_forward_propagates_upstream_status_and_body_on_4xx():
+    """Upstream 4xx is forwarded verbatim — proxy doesn't re-shape error JSON."""
+    err_body = {"error": {"message": "context length exceeded", "type": "BadRequestError"}}
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        return httpx.Response(400, json=err_body)
+
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
         async with AsyncClient(transport=transport, base_url="http://test") as ac:
-            payload = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]}
-            await ac.post("/v1/chat/completions", json=payload)
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
 
-        call_kwargs = mock_acompletion.call_args.kwargs
-        assert call_kwargs["num_retries"] == 3
-        assert call_kwargs["timeout"] == 45
+    assert r.status_code == 400
+    assert r.json() == err_body
 
 
 @pytest.mark.asyncio
-async def test_chat_completions_litellm_error_returns_proxy_schema():
-    """A litellm exception is converted to {error:{message,type,code}} JSON
-    so agent-side keyword detection (e.g. 'context length exceeded') keeps working."""
-    from litellm.exceptions import BadRequestError
-
-    err = BadRequestError(
-        message="context length exceeded for this model",
-        model="gpt-3.5-turbo",
-        llm_provider="openai",
-    )
+async def test_forward_authorization_header_passes_through():
+    captured = {}
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        captured["headers"] = dict(request.headers)
+        return httpx.Response(200, json=_success_response_json())
+
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+                headers={"Authorization": "Bearer sk-abc", "X-Trace": "t1"},
+            )
+
+    # Authorization and custom X-* headers are forwarded verbatim. We don't assert
+    # on framing headers (connection / content-length / accept-encoding) because
+    # httpx rebuilds them itself for the outgoing request.
+    auth_value = captured["headers"].get("Authorization") or captured["headers"].get("authorization")
+    assert auth_value == "Bearer sk-abc"
+    fwd_lower = {k.lower() for k in captured["headers"]}
+    assert "x-trace" in fwd_lower
+
 
-    with patch(ACOMPLETION_PATCH, new_callable=AsyncMock) as mock_acompletion:
-        mock_acompletion.side_effect = err
+@pytest.mark.asyncio
+async def test_forward_502_on_upstream_connection_failure():
+    def handler(request: httpx.Request) -> httpx.Response:
+        raise httpx.ConnectError("upstream is down")
 
-        transport = ASGITransport(app=test_app)
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
         async with AsyncClient(transport=transport, base_url="http://test") as ac:
-            payload = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hello"}]}
-            response = await ac.post("/v1/chat/completions", json=payload)
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
 
-        body = response.json()
-        assert "error" in body
-        assert "context length exceeded" in body["error"]["message"]
-        assert body["error"]["type"] == "BadRequestError"
-        assert body["error"]["code"] == response.status_code
+    assert r.status_code == 502
+
+
+# ---------- Forward path: recording ----------
 
 
 @pytest.mark.asyncio
-async def test_replay_mode_returns_recorded_response_without_calling_litellm(tmp_path):
-    """In replay mode the proxy serves the next record directly from app.state.replay_cursor;
-    litellm.acompletion must never be invoked."""
-    from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
+async def test_forward_invokes_recorder_on_success(tmp_path):
+    """When app.state.recorder is set, success calls write a JSONL line with the
+    request and the upstream response verbatim."""
+    from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
+
+    upstream_payload = _success_response_json(content="recorded reply")
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        return httpx.Response(200, json=upstream_payload)
+
+    config = ModelServiceConfig()
+    app = _build_app(config)
+    traj_file = tmp_path / "traj.jsonl"
+    app.state.recorder = TrajectoryRecorder(traj_file=traj_file)
+
+    with (
+        _patch_httpx_with_handler(handler),
+        patch(
+            "rock.sdk.model.server.integrations.traj_recorder._get_or_create_metrics_monitor", return_value=MagicMock()
+        ),
+    ):
+        # Re-create the recorder so it picks up the patched monitor.
+        app.state.recorder = TrajectoryRecorder(traj_file=traj_file)
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
+
+    line = traj_file.read_text(encoding="utf-8").strip()
+    record = json.loads(line)
+    assert record["status"] == "success"
+    assert record["model"] == "gpt-3.5-turbo"
+    assert record["stream"] is False
+    assert record["request"]["messages"][0]["content"] == "hi"
+    assert record["response"] == upstream_payload
+
 
+# ---------- Replay path ----------
+
+
+@pytest.mark.asyncio
+async def test_replay_returns_recorded_response_no_upstream_call(tmp_path):
     record = {
         "model": "gpt-3.5-turbo",
         "response": {
@@ -195,7 +320,6 @@ async def test_replay_mode_returns_recorded_response_without_calling_litellm(tmp
                     "finish_reason": "stop",
                 }
             ],
-            "usage": {"prompt_tokens": 1, "completion_tokens": 2, "total_tokens": 3},
         },
     }
     traj = tmp_path / "t.jsonl"
@@ -203,29 +327,21 @@ async def test_replay_mode_returns_recorded_response_without_calling_litellm(tmp
 
     config = ModelServiceConfig()
     config.replay_traj_path = str(traj)
+    app = _build_app(config, replay_cursor=SequentialCursor.load(traj))
 
-    local_app = FastAPI()
-    local_app.state.model_service_config = config
-    local_app.state.replay_cursor = SequentialCursor.load(traj)
-    local_app.include_router(proxy_router)
-
-    with patch(ACOMPLETION_PATCH, new_callable=AsyncMock) as mock_acompletion:
-        transport = ASGITransport(app=local_app)
-        async with AsyncClient(transport=transport, base_url="http://test") as ac:
-            payload = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]}
-            response = await ac.post("/v1/chat/completions", json=payload)
+    transport = ASGITransport(app=app)
+    async with AsyncClient(transport=transport, base_url="http://test") as ac:
+        r = await ac.post(
+            "/v1/chat/completions",
+            json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+        )
 
-    assert response.status_code == 200
-    body = response.json()
-    assert body["choices"][0]["message"]["content"] == "recorded reply"
-    mock_acompletion.assert_not_called()
+    assert r.status_code == 200
+    assert r.json()["choices"][0]["message"]["content"] == "recorded reply"
 
 
 @pytest.mark.asyncio
-async def test_replay_mode_streaming_emits_recorded_response_as_sse(tmp_path):
-    """Replay + stream=True emits one SSE chunk (content moved into delta) plus [DONE]."""
-    from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
-
+async def test_replay_streaming_emits_recorded_response_as_sse(tmp_path):
     record = {
         "model": "gpt-3.5-turbo",
         "response": {
@@ -246,33 +362,24 @@ async def test_replay_mode_streaming_emits_recorded_response_as_sse(tmp_path):
 
     config = ModelServiceConfig()
     config.replay_traj_path = str(traj)
+    app = _build_app(config, replay_cursor=SequentialCursor.load(traj))
 
-    local_app = FastAPI()
-    local_app.state.model_service_config = config
-    local_app.state.replay_cursor = SequentialCursor.load(traj)
-    local_app.include_router(proxy_router)
-
-    transport = ASGITransport(app=local_app)
+    transport = ASGITransport(app=app)
     async with AsyncClient(transport=transport, base_url="http://test") as ac:
-        response = await ac.post(
+        r = await ac.post(
             "/v1/chat/completions",
             json={"model": "gpt-3.5-turbo", "stream": True, "messages": [{"role": "user", "content": "hi"}]},
         )
 
-    assert response.status_code == 200
-    body = response.text
+    body = r.text
     assert "data: [DONE]" in body
-    # The SSE chunk shape is chat.completion.chunk with message → delta, finish_reason preserved
     assert '"object": "chat.completion.chunk"' in body
     assert '"delta": {"role": "assistant", "content": "streamed reply"}' in body
     assert '"finish_reason": "tool_calls"' in body
 
 
 @pytest.mark.asyncio
-async def test_replay_mode_returns_404_when_cursor_exhausted(tmp_path):
-    """Cursor used up → 404 with a clear message; no litellm retries involved."""
-    from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
-
+async def test_replay_returns_404_when_cursor_exhausted(tmp_path):
     record = {
         "model": "gpt-3.5-turbo",
         "response": {
@@ -285,13 +392,9 @@ async def test_replay_mode_returns_404_when_cursor_exhausted(tmp_path):
 
     config = ModelServiceConfig()
     config.replay_traj_path = str(traj)
+    app = _build_app(config, replay_cursor=SequentialCursor.load(traj))
 
-    local_app = FastAPI()
-    local_app.state.model_service_config = config
-    local_app.state.replay_cursor = SequentialCursor.load(traj)
-    local_app.include_router(proxy_router)
-
-    transport = ASGITransport(app=local_app)
+    transport = ASGITransport(app=app)
     async with AsyncClient(transport=transport, base_url="http://test") as ac:
         await ac.post(
             "/v1/chat/completions",
@@ -306,42 +409,11 @@ async def test_replay_mode_returns_404_when_cursor_exhausted(tmp_path):
     assert "exhausted" in second.json()["detail"]
 
 
-@pytest.mark.asyncio
-async def test_chat_completions_extracts_bearer_token_and_strips_framing_headers():
-    """Bearer token goes to api_key kwarg; host / content-length / transfer-encoding /
-    Authorization are not forwarded as extra_headers (litellm regenerates Authorization
-    from api_key, so passing it both ways would conflict). Custom X-* headers pass through."""
-    captured = {}
-
-    async def capture(*args, **kwargs):
-        captured.update(kwargs)
-        return _fake_model_response()
-
-    with patch(ACOMPLETION_PATCH, new=capture):
-        transport = ASGITransport(app=test_app)
-        async with AsyncClient(transport=transport, base_url="http://test") as ac:
-            payload = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]}
-            await ac.post(
-                "/v1/chat/completions",
-                json=payload,
-                headers={"Authorization": "Bearer abc", "X-Trace": "t1"},
-            )
-
-    assert captured["api_key"] == "abc"
-
-    forwarded = captured["extra_headers"]
-    forwarded_lower = {k.lower() for k in forwarded}
-    assert "x-trace" in forwarded_lower
-    assert "authorization" not in forwarded_lower
-    assert "host" not in forwarded_lower
-    assert "content-length" not in forwarded_lower
-    assert "content-type" not in forwarded_lower
-    assert "transfer-encoding" not in forwarded_lower
+# ---------- Lifespan + Config ----------
 
 
 @pytest.mark.asyncio
 async def test_lifespan_initialization_with_config(tmp_path):
-    """Application initializes correctly when a valid config file is provided."""
     conf_file = tmp_path / "proxy.yml"
     conf_file.write_text(yaml.dump({"proxy_rules": {"my-model": "http://custom-url"}, "request_timeout": 50}))
 
@@ -349,73 +421,27 @@ async def test_lifespan_initialization_with_config(tmp_path):
     app = FastAPI(lifespan=lambda app: lifespan(app, config))
 
     async with lifespan(app, config):
-        app_config = app.state.model_service_config
-        assert app_config.proxy_rules["my-model"] == "http://custom-url"
-        assert app_config.request_timeout == 50
-        assert "gpt-3.5-turbo" not in app_config.proxy_rules
-
-
-@pytest.mark.asyncio
-async def test_lifespan_initialization_no_config():
-    """Defaults are loaded when no config file is provided."""
-    config = ModelServiceConfig()
-    app = FastAPI(lifespan=lambda app: lifespan(app, config))
-
-    async with lifespan(app, config):
-        app_config = app.state.model_service_config
-        assert "gpt-3.5-turbo" in app_config.proxy_rules
-        assert app_config.request_timeout == 120
+        assert app.state.model_service_config.proxy_rules["my-model"] == "http://custom-url"
+        assert app.state.model_service_config.request_timeout == 50
 
 
 @pytest.mark.asyncio
 async def test_lifespan_invalid_config_path():
-    """Non-existent config path → FileNotFoundError."""
     with pytest.raises(FileNotFoundError):
         ModelServiceConfig.from_file("/tmp/non_existent_file.yml")
 
 
-@pytest.mark.asyncio
-async def test_config_loads_host_and_port_from_file(tmp_path):
-    """ModelServiceConfig loads host and port from config file."""
-    conf_file = tmp_path / "proxy.yml"
-    conf_file.write_text(
-        yaml.dump({"host": "127.0.0.1", "port": 9000, "proxy_rules": {"my-model": "http://my-backend"}})
-    )
-
-    config = ModelServiceConfig.from_file(str(conf_file))
-
-    assert config.host == "127.0.0.1"
-    assert config.port == 9000
-    assert config.proxy_rules["my-model"] == "http://my-backend"
-
-
 def test_config_default_host_and_port():
     config = ModelServiceConfig()
     assert config.host == "0.0.0.0"
     assert config.port == 8080
 
 
-@pytest.mark.asyncio
-async def test_config_loads_retryable_status_codes_from_file(tmp_path):
-    conf_file = tmp_path / "proxy.yml"
-    conf_file.write_text(yaml.dump({"retryable_status_codes": [429, 500, 502, 503]}))
-
-    config = ModelServiceConfig.from_file(str(conf_file))
-    assert config.retryable_status_codes == [429, 500, 502, 503]
-
-
-def test_config_default_retryable_status_codes():
-    config = ModelServiceConfig()
-    assert config.retryable_status_codes == [429, 500]
-
-
 def test_config_default_traj_and_replay():
-    """New traj/replay defaults: recording on (append=True), replay off."""
     config = ModelServiceConfig()
     assert config.traj_enabled is True
     assert config.traj_file is None
     assert config.replay_traj_path is None
-    assert config.num_retries == 6
 
 
 @pytest.mark.asyncio
@@ -427,22 +453,16 @@ async def test_config_loads_traj_and_replay_from_file(tmp_path):
                 "traj_enabled": False,
                 "traj_file": "/tmp/my-traj.jsonl",
                 "replay_traj_path": "/tmp/in.jsonl",
-                "num_retries": 2,
             }
         )
     )
-
     config = ModelServiceConfig.from_file(str(conf_file))
     assert config.traj_enabled is False
     assert config.traj_file == "/tmp/my-traj.jsonl"
     assert config.replay_traj_path == "/tmp/in.jsonl"
-    assert config.num_retries == 2
 
 
 def test_cli_args_override_config_file(tmp_path):
-    """CLI arguments override config file settings."""
-    import argparse
-
     conf_file = tmp_path / "proxy.yml"
     conf_file.write_text(
         yaml.dump(
@@ -450,37 +470,28 @@ def test_cli_args_override_config_file(tmp_path):
                 "host": "192.168.1.1",
                 "port": 8080,
                 "proxy_base_url": "https://config-url.example.com/v1",
-                "retryable_status_codes": [429, 500],
                 "request_timeout": 60,
             }
         )
     )
-
     args = argparse.Namespace(
         config_file=str(conf_file),
         host="0.0.0.0",
         port=9000,
         proxy_base_url="https://cli-url.example.com/v1",
-        retryable_status_codes="502,503",
+        retryable_status_codes=None,
         request_timeout=30,
-        num_retries=4,
+        num_retries=None,
         traj_file=None,
     )
-
     config = create_config_from_args(args)
-
     assert config.host == "0.0.0.0"
     assert config.port == 9000
     assert config.proxy_base_url == "https://cli-url.example.com/v1"
-    assert config.retryable_status_codes == [502, 503]
     assert config.request_timeout == 30
-    assert config.num_retries == 4
 
 
 def test_cli_traj_file_enables_replay():
-    """--traj-file sets replay_enabled, replay_traj_path, and disables recording."""
-    import argparse
-
     args = argparse.Namespace(
         config_file=None,
         host=None,
@@ -491,100 +502,36 @@ def test_cli_traj_file_enables_replay():
         num_retries=None,
         traj_file="/tmp/in.jsonl",
     )
-
     config = create_config_from_args(args)
     assert config.replay_traj_path == "/tmp/in.jsonl"
     assert config.traj_enabled is False
 
 
-@pytest.mark.asyncio
-async def test_config_file_overrides_defaults(tmp_path):
-    conf_file = tmp_path / "proxy.yml"
-    conf_file.write_text(
-        yaml.dump(
-            {
-                "host": "10.0.0.1",
-                "port": 8888,
-                "request_timeout": 300,
-                "proxy_rules": {"test-model": "http://test-backend"},
-            }
-        )
-    )
-
-    config = ModelServiceConfig.from_file(str(conf_file))
-
-    assert config.host == "10.0.0.1"
-    assert config.port == 8888
-    assert config.request_timeout == 300
-    assert config.proxy_rules["test-model"] == "http://test-backend"
-    assert config.proxy_base_url is None
+# ---------- Metrics singleton + legacy record_traj (still used by local mode) ----------
 
 
 def test_metrics_monitor_is_singleton():
-    """_get_or_create_metrics_monitor returns the same instance on repeated calls."""
     import rock.sdk.model.server.utils as utils_module
 
     with patch("rock.sdk.model.server.utils.MetricsMonitor") as mock_cls:
-        mock_monitor = MagicMock()
-        mock_cls.create.return_value = mock_monitor
+        mock_cls.create.return_value = MagicMock()
         utils_module._metrics_monitor = None
-
         first = _get_or_create_metrics_monitor()
         second = _get_or_create_metrics_monitor()
-
         assert first is second
-        assert mock_cls.create.call_count == 1
-        utils_module._metrics_monitor = None
-
-
-def test_metrics_monitor_uses_env_endpoint():
-    """ROCK_METRICS_ENDPOINT env var is passed to MetricsMonitor.create()."""
-    import rock.sdk.model.server.utils as utils_module
-
-    custom_endpoint = "http://my-otel-collector:4318/v1/metrics"
-
-    with (
-        patch("rock.sdk.model.server.utils.MetricsMonitor") as mock_cls,
-        patch.dict("os.environ", {"ROCK_METRICS_ENDPOINT": custom_endpoint}),
-    ):
-        mock_monitor = MagicMock()
-        mock_cls.create.return_value = mock_monitor
-        utils_module._metrics_monitor = None
-        _get_or_create_metrics_monitor()
-        mock_cls.create.assert_called_once_with(metrics_endpoint=custom_endpoint)
-        utils_module._metrics_monitor = None
-
-
-def test_metrics_monitor_registers_gauge_and_counter():
-    """_get_or_create_metrics_monitor registers both metrics on first creation."""
-    import rock.sdk.model.server.utils as utils_module
-
-    with patch("rock.sdk.model.server.utils.MetricsMonitor") as mock_cls:
-        mock_monitor = MagicMock()
-        mock_cls.create.return_value = mock_monitor
-        utils_module._metrics_monitor = None
-        _get_or_create_metrics_monitor()
-
-        mock_monitor._register_gauge.assert_called_once_with(
-            MODEL_SERVICE_REQUEST_RT, "total execution time for request", "ms"
-        )
-        mock_monitor._register_counter.assert_called_once_with(
-            MODEL_SERVICE_REQUEST_COUNT, "total request count", "count"
-        )
         utils_module._metrics_monitor = None
 
 
 @pytest.mark.asyncio
-async def test_record_traj_reports_rt_and_count():
+async def test_record_traj_decorator_reports_rt_and_count():
     """Legacy record_traj decorator (still used by local mode) reports RT/count."""
     import rock.sdk.model.server.utils as utils_module
 
-    mock_monitor = MagicMock()
-
     with (
         patch("rock.sdk.model.server.utils.MetricsMonitor") as mock_cls,
-        patch.dict("os.environ", {"ROCK_SANDBOX_ID": "sandbox-test-001"}),
+        patch.dict("os.environ", {"ROCK_SANDBOX_ID": "sandbox-test"}),
     ):
+        mock_monitor = MagicMock()
         mock_cls.create.return_value = mock_monitor
         utils_module._metrics_monitor = None
 
@@ -594,42 +541,11 @@ async def fake_handler(body: dict):
 
         await fake_handler({"model": "gpt-4", "messages": []})
 
-        mock_monitor.record_gauge_by_name.assert_called_once()
         gauge_call = mock_monitor.record_gauge_by_name.call_args
         assert gauge_call[0][0] == MODEL_SERVICE_REQUEST_RT
-        assert gauge_call[1]["attributes"]["type"] == "chat_completions"
-        assert gauge_call[1]["attributes"]["sandbox_id"] == "sandbox-test-001"
+        assert gauge_call[1]["attributes"]["sandbox_id"] == "sandbox-test"
 
-        mock_monitor.record_counter_by_name.assert_called_once()
         counter_call = mock_monitor.record_counter_by_name.call_args
         assert counter_call[0][0] == MODEL_SERVICE_REQUEST_COUNT
-        assert counter_call[0][1] == 1
-        assert counter_call[1]["attributes"]["sandbox_id"] == "sandbox-test-001"
-
-        utils_module._metrics_monitor = None
-
-
-@pytest.mark.asyncio
-async def test_record_traj_sandbox_id_defaults_to_unknown():
-    """sandbox_id defaults to 'unknown' when ROCK_SANDBOX_ID is not set."""
-    import rock.sdk.model.server.utils as utils_module
-
-    mock_monitor = MagicMock()
-
-    with patch("rock.sdk.model.server.utils.MetricsMonitor") as mock_cls, patch.dict("os.environ", {}, clear=False):
-        os_env = __import__("os").environ
-        os_env.pop("ROCK_SANDBOX_ID", None)
-
-        mock_cls.create.return_value = mock_monitor
-        utils_module._metrics_monitor = None
-
-        @record_traj
-        async def fake_handler(body: dict):
-            return {"id": "resp-2", "choices": []}
-
-        await fake_handler({"model": "gpt-4", "messages": []})
-
-        gauge_call = mock_monitor.record_gauge_by_name.call_args
-        assert gauge_call[1]["attributes"]["sandbox_id"] == "unknown"
 
         utils_module._metrics_monitor = None
diff --git a/tests/unit/sdk/model/test_traj_recorder.py b/tests/unit/sdk/model/test_traj_recorder.py
index c9b1c20197..6eb3b49571 100644
--- a/tests/unit/sdk/model/test_traj_recorder.py
+++ b/tests/unit/sdk/model/test_traj_recorder.py
@@ -1,4 +1,4 @@
-"""Tests for TrajectoryRecorder (litellm CustomLogger that writes JSONL + emits OTLP metrics)."""
+"""Tests for TrajectoryRecorder (explicit-call API, no longer a litellm CustomLogger)."""
 
 import json
 from unittest.mock import MagicMock, patch
@@ -8,42 +8,6 @@
 from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
 
 
-def _sample_payload(**overrides):
-    payload = {
-        "id": "chatcmpl-abc",
-        "trace_id": "trace-1",
-        "call_type": "acompletion",
-        "stream": False,
-        "status": "success",
-        "model": "gpt-3.5-turbo",
-        "model_id": None,
-        "model_group": None,
-        "api_base": "https://api.openai.com/v1",
-        "messages": [{"role": "user", "content": "hi"}],
-        "response": {
-            "id": "chatcmpl-abc",
-            "choices": [
-                {
-                    "index": 0,
-                    "message": {"role": "assistant", "content": "hello back"},
-                    "finish_reason": "stop",
-                }
-            ],
-        },
-        "model_parameters": {"temperature": 0.7},
-        "startTime": 100.0,
-        "endTime": 100.5,
-        "completionStartTime": 100.5,
-        "response_time": 0.5,
-        "total_tokens": 12,
-        "prompt_tokens": 4,
-        "completion_tokens": 8,
-        "metadata": {},
-    }
-    payload.update(overrides)
-    return payload
-
-
 @pytest.fixture
 def mock_monitor():
     monitor = MagicMock()
@@ -54,117 +18,124 @@ def mock_monitor():
         yield monitor
 
 
+def _make_recorder(traj_file) -> TrajectoryRecorder:
+    return TrajectoryRecorder(traj_file=traj_file)
+
+
 @pytest.mark.asyncio
 async def test_recorder_appends_each_call_as_jsonl_line(tmp_path, mock_monitor):
-    """Each successful call adds one JSONL line (always append-only)."""
     traj_file = tmp_path / "traj.jsonl"
-    recorder = TrajectoryRecorder(traj_file=traj_file)
-
-    payload_a = _sample_payload(id="a", trace_id="run-1")
-    payload_b = _sample_payload(id="b", trace_id="run-1")
-
-    await recorder.async_log_success_event(
-        kwargs={"standard_logging_object": payload_a}, response_obj=None, start_time=0, end_time=1
+    recorder = _make_recorder(traj_file)
+
+    await recorder.record(
+        request={"model": "gpt-4", "messages": [{"role": "user", "content": "hi"}]},
+        response={"id": "a", "choices": []},
+        status="success",
+        start_time=100.0,
+        end_time=100.5,
     )
-    await recorder.async_log_success_event(
-        kwargs={"standard_logging_object": payload_b}, response_obj=None, start_time=0, end_time=1
+    await recorder.record(
+        request={"model": "gpt-4", "messages": [{"role": "user", "content": "again"}]},
+        response={"id": "b", "choices": []},
+        status="success",
+        start_time=101.0,
+        end_time=101.2,
     )
 
     lines = traj_file.read_text(encoding="utf-8").strip().split("\n")
     assert len(lines) == 2
-    assert json.loads(lines[0])["id"] == "a"
-    assert json.loads(lines[1])["id"] == "b"
+    assert json.loads(lines[0])["response"]["id"] == "a"
+    assert json.loads(lines[1])["response"]["id"] == "b"
 
 
 @pytest.mark.asyncio
-async def test_recorder_emits_metrics_with_sandbox_id(tmp_path, mock_monitor):
+async def test_recorder_writes_request_and_response_verbatim(tmp_path, mock_monitor):
+    """Provider-specific fields (reasoning_content, citations, ...) survive untouched."""
     traj_file = tmp_path / "traj.jsonl"
-    recorder = TrajectoryRecorder(traj_file=traj_file)
-
-    with patch.dict("os.environ", {"ROCK_SANDBOX_ID": "sandbox-xyz"}):
-        await recorder.async_log_success_event(
-            kwargs={"standard_logging_object": _sample_payload()},
-            response_obj=None,
-            start_time=0,
-            end_time=1,
-        )
-
-    mock_monitor.record_gauge_by_name.assert_called_once()
-    gauge_args = mock_monitor.record_gauge_by_name.call_args
-    assert gauge_args.args[0] == "model_service.request.rt"
-    # response_time of 0.5s → 500 ms
-    assert gauge_args.args[1] == 500.0
-    assert gauge_args.kwargs["attributes"]["status"] == "success"
-    assert gauge_args.kwargs["attributes"]["sandbox_id"] == "sandbox-xyz"
-    assert gauge_args.kwargs["attributes"]["type"] == "chat_completions"
+    recorder = _make_recorder(traj_file)
+
+    request = {"model": "glm-5", "stream": True, "messages": [{"role": "user", "content": "你是谁"}]}
+    response = {
+        "id": "x",
+        "choices": [
+            {
+                "index": 0,
+                "message": {"role": "assistant", "content": "我是 GLM", "reasoning_content": "用户问..."},
+                "finish_reason": "stop",
+            }
+        ],
+    }
+    await recorder.record(request=request, response=response, status="success", start_time=0.0, end_time=1.0)
 
-    mock_monitor.record_counter_by_name.assert_called_once_with(
-        "model_service.request.count", 1, attributes=gauge_args.kwargs["attributes"]
-    )
+    record = json.loads(traj_file.read_text(encoding="utf-8").strip())
+    assert record["model"] == "glm-5"
+    assert record["stream"] is True
+    assert record["request"] == request
+    assert record["response"] == response
+    assert record["response_time"] == 1.0
 
 
 @pytest.mark.asyncio
-async def test_recorder_records_failure_with_failure_status(tmp_path, mock_monitor):
+async def test_recorder_emits_metrics_with_status_and_sandbox_id(tmp_path, mock_monitor):
     traj_file = tmp_path / "traj.jsonl"
-    recorder = TrajectoryRecorder(traj_file=traj_file)
+    recorder = _make_recorder(traj_file)
 
-    failed_payload = _sample_payload(status="failure", error_information={"error_class": "RateLimitError"})
-
-    await recorder.async_log_failure_event(
-        kwargs={"standard_logging_object": failed_payload},
-        response_obj=None,
-        start_time=0,
-        end_time=1,
-    )
+    with patch.dict("os.environ", {"ROCK_SANDBOX_ID": "sandbox-xyz"}):
+        await recorder.record(
+            request={"model": "gpt-4"},
+            response={"id": "x", "choices": []},
+            status="success",
+            start_time=0.0,
+            end_time=0.5,
+        )
 
-    lines = traj_file.read_text(encoding="utf-8").strip().split("\n")
-    assert len(lines) == 1
-    assert json.loads(lines[0])["status"] == "failure"
+    gauge_call = mock_monitor.record_gauge_by_name.call_args
+    assert gauge_call[0][0] == "model_service.request.rt"
+    assert gauge_call[0][1] == 500.0  # 0.5s -> 500 ms
+    assert gauge_call[1]["attributes"]["status"] == "success"
+    assert gauge_call[1]["attributes"]["sandbox_id"] == "sandbox-xyz"
+    assert gauge_call[1]["attributes"]["type"] == "chat_completions"
 
-    gauge_args = mock_monitor.record_gauge_by_name.call_args
-    assert gauge_args.kwargs["attributes"]["status"] == "failure"
+    mock_monitor.record_counter_by_name.assert_called_once_with(
+        "model_service.request.count", 1, attributes=gauge_call[1]["attributes"]
+    )
 
 
 @pytest.mark.asyncio
-async def test_recorder_skips_when_payload_missing(tmp_path, mock_monitor):
-    """If litellm doesn't attach a standard_logging_object, the recorder no-ops."""
+async def test_recorder_records_failure_with_error_text(tmp_path, mock_monitor):
     traj_file = tmp_path / "traj.jsonl"
-    recorder = TrajectoryRecorder(traj_file=traj_file)
+    recorder = _make_recorder(traj_file)
+
+    await recorder.record(
+        request={"model": "gpt-4"},
+        response=None,
+        status="failure",
+        start_time=0.0,
+        end_time=1.0,
+        error="upstream_status=429",
+    )
 
-    await recorder.async_log_success_event(kwargs={}, response_obj=None, start_time=0, end_time=1)
+    record = json.loads(traj_file.read_text(encoding="utf-8").strip())
+    assert record["status"] == "failure"
+    assert record["error"] == "upstream_status=429"
+    assert record["response"] is None
 
-    assert not traj_file.exists() or traj_file.read_text() == ""
-    mock_monitor.record_gauge_by_name.assert_not_called()
-    mock_monitor.record_counter_by_name.assert_not_called()
+    gauge_call = mock_monitor.record_gauge_by_name.call_args
+    assert gauge_call[1]["attributes"]["status"] == "failure"
 
 
 @pytest.mark.asyncio
 async def test_recorder_creates_parent_directory(tmp_path, mock_monitor):
     traj_file = tmp_path / "deep" / "nested" / "traj.jsonl"
-
-    recorder = TrajectoryRecorder(traj_file=traj_file)
-    await recorder.async_log_success_event(
-        kwargs={"standard_logging_object": _sample_payload()},
-        response_obj=None,
-        start_time=0,
-        end_time=1,
+    recorder = _make_recorder(traj_file)
+
+    await recorder.record(
+        request={"model": "gpt-4"},
+        response={"id": "x", "choices": []},
+        status="success",
+        start_time=0.0,
+        end_time=0.5,
     )
 
     assert traj_file.exists()
     assert traj_file.parent.is_dir()
-
-
-@pytest.mark.asyncio
-async def test_recorder_falls_back_to_start_end_time_when_response_time_missing(tmp_path, mock_monitor):
-    traj_file = tmp_path / "traj.jsonl"
-    recorder = TrajectoryRecorder(traj_file=traj_file)
-
-    payload = _sample_payload(startTime=10.0, endTime=10.25)
-    payload.pop("response_time", None)
-
-    await recorder.async_log_success_event(
-        kwargs={"standard_logging_object": payload}, response_obj=None, start_time=0, end_time=1
-    )
-
-    gauge_args = mock_monitor.record_gauge_by_name.call_args
-    assert abs(gauge_args.args[1] - 250.0) < 1e-6
diff --git a/uv.lock b/uv.lock
index 6a3efc2ba4..cfed10409c 100644
--- a/uv.lock
+++ b/uv.lock
@@ -1303,69 +1303,6 @@ wheels = [
     { url = "https://mirrors.aliyun.com/pypi/packages/2e/7a/c11883a98676e74a405d6503d65f58c3fa076ddd9c0cee6044884f6eac38/fastcore-1.8.15-py3-none-any.whl", hash = "sha256:d005d10d7ee5c2abb7ac0544da7c9f0a0a2f7706b48892a27c1906487ca6dea9" },
 ]
 
-[[package]]
-name = "fastuuid"
-version = "0.14.0"
-source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
-sdist = { url = "https://mirrors.aliyun.com/pypi/packages/c3/7d/d9daedf0f2ebcacd20d599928f8913e9d2aea1d56d2d355a93bfa2b611d7/fastuuid-0.14.0.tar.gz", hash = "sha256:178947fc2f995b38497a74172adee64fdeb8b7ec18f2a5934d037641ba265d26" }
-wheels = [
-    { url = "https://mirrors.aliyun.com/pypi/packages/ad/b2/731a6696e37cd20eed353f69a09f37a984a43c9713764ee3f7ad5f57f7f9/fastuuid-0.14.0-cp310-cp310-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:6e6243d40f6c793c3e2ee14c13769e341b90be5ef0c23c82fa6515a96145181a" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/c5/79/c73c47be2a3b8734d16e628982653517f80bbe0570e27185d91af6096507/fastuuid-0.14.0-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:13ec4f2c3b04271f62be2e1ce7e95ad2dd1cf97e94503a3760db739afbd48f00" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/24/c5/84c1eea05977c8ba5173555b0133e3558dc628bcf868d6bf1689ff14aedc/fastuuid-0.14.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:b2fdd48b5e4236df145a149d7125badb28e0a383372add3fbaac9a6b7a394470" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/0e/23/4e362367b7fa17dbed646922f216b9921efb486e7abe02147e4b917359f8/fastuuid-0.14.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f74631b8322d2780ebcf2d2d75d58045c3e9378625ec51865fe0b5620800c39d" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/b2/72/3985be633b5a428e9eaec4287ed4b873b7c4c53a9639a8b416637223c4cd/fastuuid-0.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:83cffc144dc93eb604b87b179837f2ce2af44871a7b323f2bfed40e8acb40ba8" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/b3/6d/6ef192a6df34e2266d5c9deb39cd3eea986df650cbcfeaf171aa52a059c3/fastuuid-0.14.0-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:1a771f135ab4523eb786e95493803942a5d1fc1610915f131b363f55af53b219" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/9d/11/8a2ea753c68d4fece29d5d7c6f3f903948cc6e82d1823bc9f7f7c0355db3/fastuuid-0.14.0-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:4edc56b877d960b4eda2c4232f953a61490c3134da94f3c28af129fb9c62a4f6" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/23/42/7a32c93b6ce12642d9a152ee4753a078f372c9ebb893bc489d838dd4afd5/fastuuid-0.14.0-cp310-cp310-musllinux_1_1_i686.whl", hash = "sha256:bcc96ee819c282e7c09b2eed2b9bd13084e3b749fdb2faf58c318d498df2efbe" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/b9/e9/a5f6f686b46e3ed4ed3b93770111c233baac87dd6586a411b4988018ef1d/fastuuid-0.14.0-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:7a3c0bca61eacc1843ea97b288d6789fbad7400d16db24e36a66c28c268cfe3d" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/b4/c9/18abc73c9c5b7fc0e476c1733b678783b2e8a35b0be9babd423571d44e98/fastuuid-0.14.0-cp310-cp310-win32.whl", hash = "sha256:7f2f3efade4937fae4e77efae1af571902263de7b78a0aee1a1653795a093b2a" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/5e/8a/d9e33f4eb4d4f6d9f2c5c7d7e96b5cdbb535c93f3b1ad6acce97ee9d4bf8/fastuuid-0.14.0-cp310-cp310-win_amd64.whl", hash = "sha256:ae64ba730d179f439b0736208b4c279b8bc9c089b102aec23f86512ea458c8a4" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/98/f3/12481bda4e5b6d3e698fbf525df4443cc7dce746f246b86b6fcb2fba1844/fastuuid-0.14.0-cp311-cp311-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:73946cb950c8caf65127d4e9a325e2b6be0442a224fd51ba3b6ac44e1912ce34" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/59/19/2fc58a1446e4d72b655648eb0879b04e88ed6fa70d474efcf550f640f6ec/fastuuid-0.14.0-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:12ac85024637586a5b69645e7ed986f7535106ed3013640a393a03e461740cb7" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/78/29/3c74756e5b02c40cfcc8b1d8b5bac4edbd532b55917a6bcc9113550e99d1/fastuuid-0.14.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:05a8dde1f395e0c9b4be515b7a521403d1e8349443e7641761af07c7ad1624b1" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/52/96/d761da3fccfa84f0f353ce6e3eb8b7f76b3aa21fd25e1b00a19f9c80a063/fastuuid-0.14.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:09378a05020e3e4883dfdab438926f31fea15fd17604908f3d39cbeb22a0b4dc" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/fc/c2/f84c90167cc7765cb82b3ff7808057608b21c14a38531845d933a4637307/fastuuid-0.14.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bbb0c4b15d66b435d2538f3827f05e44e2baafcc003dd7d8472dc67807ab8fd8" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/af/7b/4bacd03897b88c12348e7bd77943bac32ccf80ff98100598fcff74f75f2e/fastuuid-0.14.0-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:cd5a7f648d4365b41dbf0e38fe8da4884e57bed4e77c83598e076ac0c93995e7" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/c0/a2/584f2c29641df8bd810d00c1f21d408c12e9ad0c0dafdb8b7b29e5ddf787/fastuuid-0.14.0-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:c0a94245afae4d7af8c43b3159d5e3934c53f47140be0be624b96acd672ceb73" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/24/68/c6b77443bb7764c760e211002c8638c0c7cce11cb584927e723215ba1398/fastuuid-0.14.0-cp311-cp311-musllinux_1_1_i686.whl", hash = "sha256:2b29e23c97e77c3a9514d70ce343571e469098ac7f5a269320a0f0b3e193ab36" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/5a/87/93f553111b33f9bb83145be12868c3c475bf8ea87c107063d01377cc0e8e/fastuuid-0.14.0-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:1e690d48f923c253f28151b3a6b4e335f2b06bf669c68a02665bc150b7839e94" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/9e/8c/a04d486ca55b5abb7eaa65b39df8d891b7b1635b22db2163734dc273579a/fastuuid-0.14.0-cp311-cp311-win32.whl", hash = "sha256:a6f46790d59ab38c6aa0e35c681c0484b50dc0acf9e2679c005d61e019313c24" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/9c/b2/2d40bf00820de94b9280366a122cbaa60090c8cf59e89ac3938cf5d75895/fastuuid-0.14.0-cp311-cp311-win_amd64.whl", hash = "sha256:e150eab56c95dc9e3fefc234a0eedb342fac433dacc273cd4d150a5b0871e1fa" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/02/a2/e78fcc5df65467f0d207661b7ef86c5b7ac62eea337c0c0fcedbeee6fb13/fastuuid-0.14.0-cp312-cp312-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:77e94728324b63660ebf8adb27055e92d2e4611645bf12ed9d88d30486471d0a" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/2b/b3/c846f933f22f581f558ee63f81f29fa924acd971ce903dab1a9b6701816e/fastuuid-0.14.0-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:caa1f14d2102cb8d353096bc6ef6c13b2c81f347e6ab9d6fbd48b9dea41c153d" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/54/ea/682551030f8c4fa9a769d9825570ad28c0c71e30cf34020b85c1f7ee7382/fastuuid-0.14.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:d23ef06f9e67163be38cece704170486715b177f6baae338110983f99a72c070" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/14/dd/5927f0a523d8e6a76b70968e6004966ee7df30322f5fc9b6cdfb0276646a/fastuuid-0.14.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0c9ec605ace243b6dbe3bd27ebdd5d33b00d8d1d3f580b39fdd15cd96fd71796" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/16/6e/c0fb547eef61293153348f12e0f75a06abb322664b34a1573a7760501336/fastuuid-0.14.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:808527f2407f58a76c916d6aa15d58692a4a019fdf8d4c32ac7ff303b7d7af09" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/2d/b1/b9c75e03b768f61cf2e84ee193dc18601aeaf89a4684b20f2f0e9f52b62c/fastuuid-0.14.0-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:2fb3c0d7fef6674bbeacdd6dbd386924a7b60b26de849266d1ff6602937675c8" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/fc/fa/f7395fdac07c7a54f18f801744573707321ca0cee082e638e36452355a9d/fastuuid-0.14.0-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:ab3f5d36e4393e628a4df337c2c039069344db5f4b9d2a3c9cea48284f1dd741" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/66/49/c9fd06a4a0b1f0f048aacb6599e7d96e5d6bc6fa680ed0d46bf111929d1b/fastuuid-0.14.0-cp312-cp312-musllinux_1_1_i686.whl", hash = "sha256:b9a0ca4f03b7e0b01425281ffd44e99d360e15c895f1907ca105854ed85e2057" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/be/9c/909e8c95b494e8e140e8be6165d5fc3f61fdc46198c1554df7b3e1764471/fastuuid-0.14.0-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:3acdf655684cc09e60fb7e4cf524e8f42ea760031945aa8086c7eae2eeeabeb8" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/90/eb/d29d17521976e673c55ef7f210d4cdd72091a9ec6755d0fd4710d9b3c871/fastuuid-0.14.0-cp312-cp312-win32.whl", hash = "sha256:9579618be6280700ae36ac42c3efd157049fe4dd40ca49b021280481c78c3176" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/cc/fc/f5c799a6ea6d877faec0472d0b27c079b47c86b1cdc577720a5386483b36/fastuuid-0.14.0-cp312-cp312-win_amd64.whl", hash = "sha256:d9e4332dc4ba054434a9594cbfaf7823b57993d7d8e7267831c3e059857cf397" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/a5/83/ae12dd39b9a39b55d7f90abb8971f1a5f3c321fd72d5aa83f90dc67fe9ed/fastuuid-0.14.0-cp313-cp313-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:77a09cb7427e7af74c594e409f7731a0cf887221de2f698e1ca0ebf0f3139021" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/53/b0/a4b03ff5d00f563cc7546b933c28cb3f2a07344b2aec5834e874f7d44143/fastuuid-0.14.0-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:9bd57289daf7b153bfa3e8013446aa144ce5e8c825e9e366d455155ede5ea2dc" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/9c/6d/64aee0a0f6a58eeabadd582e55d0d7d70258ffdd01d093b30c53d668303b/fastuuid-0.14.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:ac60fc860cdf3c3f327374db87ab8e064c86566ca8c49d2e30df15eda1b0c2d5" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/60/f5/a7e9cda8369e4f7919d36552db9b2ae21db7915083bc6336f1b0082c8b2e/fastuuid-0.14.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ab32f74bd56565b186f036e33129da77db8be09178cd2f5206a5d4035fb2a23f" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/f0/d3/8ce11827c783affffd5bd4d6378b28eb6cc6d2ddf41474006b8d62e7448e/fastuuid-0.14.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:33e678459cf4addaedd9936bbb038e35b3f6b2061330fd8f2f6a1d80414c0f87" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/a2/51/680fb6352d0bbade04036da46264a8001f74b7484e2fd1f4da9e3db1c666/fastuuid-0.14.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:1e3cc56742f76cd25ecb98e4b82a25f978ccffba02e4bdce8aba857b6d85d87b" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/fa/7c/2014b5785bd8ebdab04ec857635ebd84d5ee4950186a577db9eff0fb8ff6/fastuuid-0.14.0-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:cb9a030f609194b679e1660f7e32733b7a0f332d519c5d5a6a0a580991290022" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/01/d2/524d4ceeba9160e7a9bc2ea3e8f4ccf1ad78f3bde34090ca0c51f09a5e91/fastuuid-0.14.0-cp313-cp313-musllinux_1_1_i686.whl", hash = "sha256:09098762aad4f8da3a888eb9ae01c84430c907a297b97166b8abc07b640f2995" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/bc/17/354d04951ce114bf4afc78e27a18cfbd6ee319ab1829c2d5fb5e94063ac6/fastuuid-0.14.0-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:1383fff584fa249b16329a059c68ad45d030d5a4b70fb7c73a08d98fd53bcdab" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/fb/be/d7be8670151d16d88f15bb121c5b66cdb5ea6a0c2a362d0dcf30276ade53/fastuuid-0.14.0-cp313-cp313-win32.whl", hash = "sha256:a0809f8cc5731c066c909047f9a314d5f536c871a7a22e815cc4967c110ac9ad" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/22/1d/5573ef3624ceb7abf4a46073d3554e37191c868abc3aecd5289a72f9810a/fastuuid-0.14.0-cp313-cp313-win_amd64.whl", hash = "sha256:0df14e92e7ad3276327631c9e7cec09e32572ce82089c55cb1bb8df71cf394ed" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/16/c9/8c7660d1fe3862e3f8acabd9be7fc9ad71eb270f1c65cce9a2b7a31329ab/fastuuid-0.14.0-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:b852a870a61cfc26c884af205d502881a2e59cc07076b60ab4a951cc0c94d1ad" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/4c/f4/a989c82f9a90d0ad995aa957b3e572ebef163c5299823b4027986f133dfb/fastuuid-0.14.0-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:c7502d6f54cd08024c3ea9b3514e2d6f190feb2f46e6dbcd3747882264bb5f7b" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/da/6c/a1a24f73574ac995482b1326cf7ab41301af0fabaa3e37eeb6b3df00e6e2/fastuuid-0.14.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:1ca61b592120cf314cfd66e662a5b54a578c5a15b26305e1b8b618a6f22df714" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/1a/20/2a9b59185ba7a6c7b37808431477c2d739fcbdabbf63e00243e37bd6bf49/fastuuid-0.14.0-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:aa75b6657ec129d0abded3bec745e6f7ab642e6dba3a5272a68247e85f5f316f" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/ef/33/4105ca574f6ded0af6a797d39add041bcfb468a1255fbbe82fcb6f592da2/fastuuid-0.14.0-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a8a0dfea3972200f72d4c7df02c8ac70bad1bb4c58d7e0ec1e6f341679073a7f" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/fe/8c/fca59f8e21c4deb013f574eae05723737ddb1d2937ce87cb2a5d20992dc3/fastuuid-0.14.0-cp314-cp314-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:1bf539a7a95f35b419f9ad105d5a8a35036df35fdafae48fb2fd2e5f318f0d75" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/cb/e2/f78c271b909c034d429218f2798ca4e89eeda7983f4257d7865976ddbb6c/fastuuid-0.14.0-cp314-cp314-musllinux_1_1_aarch64.whl", hash = "sha256:9a133bf9cc78fdbd1179cb58a59ad0100aa32d8675508150f3658814aeefeaa4" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/1e/f0/5ff209d865897667a2ff3e7a572267a9ced8f7313919f6d6043aed8b1caa/fastuuid-0.14.0-cp314-cp314-musllinux_1_1_i686.whl", hash = "sha256:f54d5b36c56a2d5e1a31e73b950b28a0d83eb0c37b91d10408875a5a29494bad" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/e0/c8/2ce1c78f983a2c4987ea865d9516dbdfb141a120fd3abb977ae6f02ba7ca/fastuuid-0.14.0-cp314-cp314-musllinux_1_1_x86_64.whl", hash = "sha256:ec27778c6ca3393ef662e2762dba8af13f4ec1aaa32d08d77f71f2a70ae9feb8" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/df/60/dad662ec9a33b4a5fe44f60699258da64172c39bd041da2994422cdc40fe/fastuuid-0.14.0-cp314-cp314-win32.whl", hash = "sha256:e23fc6a83f112de4be0cc1990e5b127c27663ae43f866353166f87df58e73d06" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/1f/f6/da4db31001e854025ffd26bc9ba0740a9cbba2c3259695f7c5834908b336/fastuuid-0.14.0-cp314-cp314-win_amd64.whl", hash = "sha256:df61342889d0f5e7a32f7284e55ef95103f2110fee433c2ae7c2c0956d76ac8a" },
-]
-
 [[package]]
 name = "filelock"
 version = "3.20.0"
@@ -1991,18 +1928,6 @@ wheels = [
     { url = "https://mirrors.aliyun.com/pypi/packages/cb/b1/3846dd7f199d53cb17f49cba7e651e9ce294d8497c8c150530ed11865bb8/iniconfig-2.3.0-py3-none-any.whl", hash = "sha256:f631c04d2c48c52b84d0d0549c99ff3859c98df65b3101406327ecc7d53fbf12" },
 ]
 
-[[package]]
-name = "jinja2"
-version = "3.1.6"
-source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
-dependencies = [
-    { name = "markupsafe" },
-]
-sdist = { url = "https://mirrors.aliyun.com/pypi/packages/df/bf/f7da0350254c0ed7c72f3e33cef02e048281fec7ecec5f032d4aac52226b/jinja2-3.1.6.tar.gz", hash = "sha256:0137fb05990d35f1275a587e9aee6d56da821fc83491a0fb838183be43f66d6d" }
-wheels = [
-    { url = "https://mirrors.aliyun.com/pypi/packages/62/a1/3d680cbfd5f4b8f15abc1d571870c5fc3e594bb582bc3b64ea099db13e56/jinja2-3.1.6-py3-none-any.whl", hash = "sha256:85ece4451f492d0c13c5dd7c13a64681a86afae63a5f347908daf103ce6d2f67" },
-]
-
 [[package]]
 name = "jiter"
 version = "0.14.0"
@@ -2309,29 +2234,6 @@ antlr4-13-2 = [
     { name = "antlr4-python3-runtime" },
 ]
 
-[[package]]
-name = "litellm"
-version = "1.82.6"
-source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
-dependencies = [
-    { name = "aiohttp" },
-    { name = "click" },
-    { name = "fastuuid" },
-    { name = "httpx" },
-    { name = "importlib-metadata" },
-    { name = "jinja2" },
-    { name = "jsonschema" },
-    { name = "openai" },
-    { name = "pydantic" },
-    { name = "python-dotenv" },
-    { name = "tiktoken" },
-    { name = "tokenizers" },
-]
-sdist = { url = "https://mirrors.aliyun.com/pypi/packages/29/75/1c537aa458426a9127a92bc2273787b2f987f4e5044e21f01f2eed5244fd/litellm-1.82.6.tar.gz", hash = "sha256:2aa1c2da21fe940c33613aa447119674a3ad4d2ad5eb064e4d5ce5ee42420136" }
-wheels = [
-    { url = "https://mirrors.aliyun.com/pypi/packages/02/6c/5327667e6dbe9e98cbfbd4261c8e91386a52e38f41419575854248bbab6a/litellm-1.82.6-py3-none-any.whl", hash = "sha256:164a3ef3e19f309e3cabc199bef3d2045212712fefdfa25fc7f75884a5b5b205" },
-]
-
 [[package]]
 name = "magiccube"
 version = "0.3.0"
@@ -2356,91 +2258,6 @@ wheels = [
     { url = "https://mirrors.aliyun.com/pypi/packages/94/54/e7d793b573f298e1c9013b8c4dade17d481164aa517d1d7148619c2cedbf/markdown_it_py-4.0.0-py3-none-any.whl", hash = "sha256:87327c59b172c5011896038353a81343b6754500a08cd7a4973bb48c6d578147" },
 ]
 
-[[package]]
-name = "markupsafe"
-version = "3.0.3"
-source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
-sdist = { url = "https://mirrors.aliyun.com/pypi/packages/7e/99/7690b6d4034fffd95959cbe0c02de8deb3098cc577c67bb6a24fe5d7caa7/markupsafe-3.0.3.tar.gz", hash = "sha256:722695808f4b6457b320fdc131280796bdceb04ab50fe1795cd540799ebe1698" }
-wheels = [
-    { url = "https://mirrors.aliyun.com/pypi/packages/e8/4b/3541d44f3937ba468b75da9eebcae497dcf67adb65caa16760b0a6807ebb/markupsafe-3.0.3-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:2f981d352f04553a7171b8e44369f2af4055f888dfb147d55e42d29e29e74559" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/98/1b/fbd8eed11021cabd9226c37342fa6ca4e8a98d8188a8d9b66740494960e4/markupsafe-3.0.3-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:e1c1493fb6e50ab01d20a22826e57520f1284df32f2d8601fdd90b6304601419" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/40/01/e560d658dc0bb8ab762670ece35281dec7b6c1b33f5fbc09ebb57a185519/markupsafe-3.0.3-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:1ba88449deb3de88bd40044603fafffb7bc2b055d626a330323a9ed736661695" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/af/cd/ce6e848bbf2c32314c9b237839119c5a564a59725b53157c856e90937b7a/markupsafe-3.0.3-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:f42d0984e947b8adf7dd6dde396e720934d12c506ce84eea8476409563607591" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/c9/2a/b5c12c809f1c3045c4d580b035a743d12fcde53cf685dbc44660826308da/markupsafe-3.0.3-cp310-cp310-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:c0c0b3ade1c0b13b936d7970b1d37a57acde9199dc2aecc4c336773e1d86049c" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/cf/e3/9427a68c82728d0a88c50f890d0fc072a1484de2f3ac1ad0bfc1a7214fd5/markupsafe-3.0.3-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:0303439a41979d9e74d18ff5e2dd8c43ed6c6001fd40e5bf2e43f7bd9bbc523f" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/bc/36/23578f29e9e582a4d0278e009b38081dbe363c5e7165113fad546918a232/markupsafe-3.0.3-cp310-cp310-musllinux_1_2_riscv64.whl", hash = "sha256:d2ee202e79d8ed691ceebae8e0486bd9a2cd4794cec4824e1c99b6f5009502f6" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/56/21/dca11354e756ebd03e036bd8ad58d6d7168c80ce1fe5e75218e4945cbab7/markupsafe-3.0.3-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:177b5253b2834fe3678cb4a5f0059808258584c559193998be2601324fdeafb1" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/87/99/faba9369a7ad6e4d10b6a5fbf71fa2a188fe4a593b15f0963b73859a1bbd/markupsafe-3.0.3-cp310-cp310-win32.whl", hash = "sha256:2a15a08b17dd94c53a1da0438822d70ebcd13f8c3a95abe3a9ef9f11a94830aa" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/d6/25/55dc3ab959917602c96985cb1253efaa4ff42f71194bddeb61eb7278b8be/markupsafe-3.0.3-cp310-cp310-win_amd64.whl", hash = "sha256:c4ffb7ebf07cfe8931028e3e4c85f0357459a3f9f9490886198848f4fa002ec8" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/d0/9e/0a02226640c255d1da0b8d12e24ac2aa6734da68bff14c05dd53b94a0fc3/markupsafe-3.0.3-cp310-cp310-win_arm64.whl", hash = "sha256:e2103a929dfa2fcaf9bb4e7c091983a49c9ac3b19c9061b6d5427dd7d14d81a1" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/08/db/fefacb2136439fc8dd20e797950e749aa1f4997ed584c62cfb8ef7c2be0e/markupsafe-3.0.3-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:1cc7ea17a6824959616c525620e387f6dd30fec8cb44f649e31712db02123dad" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/e1/2e/5898933336b61975ce9dc04decbc0a7f2fee78c30353c5efba7f2d6ff27a/markupsafe-3.0.3-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:4bd4cd07944443f5a265608cc6aab442e4f74dff8088b0dfc8238647b8f6ae9a" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/1d/09/adf2df3699d87d1d8184038df46a9c80d78c0148492323f4693df54e17bb/markupsafe-3.0.3-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:6b5420a1d9450023228968e7e6a9ce57f65d148ab56d2313fcd589eee96a7a50" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/30/ac/0273f6fcb5f42e314c6d8cd99effae6a5354604d461b8d392b5ec9530a54/markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0bf2a864d67e76e5c9a34dc26ec616a66b9888e25e7b9460e1c76d3293bd9dbf" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/19/ae/31c1be199ef767124c042c6c3e904da327a2f7f0cd63a0337e1eca2967a8/markupsafe-3.0.3-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:bc51efed119bc9cfdf792cdeaa4d67e8f6fcccab66ed4bfdd6bde3e59bfcbb2f" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/b2/76/7edcab99d5349a4532a459e1fe64f0b0467a3365056ae550d3bcf3f79e1e/markupsafe-3.0.3-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:068f375c472b3e7acbe2d5318dea141359e6900156b5b2ba06a30b169086b91a" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/a4/28/6e74cdd26d7514849143d69f0bf2399f929c37dc2b31e6829fd2045b2765/markupsafe-3.0.3-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:7be7b61bb172e1ed687f1754f8e7484f1c8019780f6f6b0786e76bb01c2ae115" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/62/7e/a145f36a5c2945673e590850a6f8014318d5577ed7e5920a4b3448e0865d/markupsafe-3.0.3-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:f9e130248f4462aaa8e2552d547f36ddadbeaa573879158d721bbd33dfe4743a" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/0f/62/d9c46a7f5c9adbeeeda52f5b8d802e1094e9717705a645efc71b0913a0a8/markupsafe-3.0.3-cp311-cp311-win32.whl", hash = "sha256:0db14f5dafddbb6d9208827849fad01f1a2609380add406671a26386cdf15a19" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/83/8a/4414c03d3f891739326e1783338e48fb49781cc915b2e0ee052aa490d586/markupsafe-3.0.3-cp311-cp311-win_amd64.whl", hash = "sha256:de8a88e63464af587c950061a5e6a67d3632e36df62b986892331d4620a35c01" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/35/73/893072b42e6862f319b5207adc9ae06070f095b358655f077f69a35601f0/markupsafe-3.0.3-cp311-cp311-win_arm64.whl", hash = "sha256:3b562dd9e9ea93f13d53989d23a7e775fdfd1066c33494ff43f5418bc8c58a5c" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/5a/72/147da192e38635ada20e0a2e1a51cf8823d2119ce8883f7053879c2199b5/markupsafe-3.0.3-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:d53197da72cc091b024dd97249dfc7794d6a56530370992a5e1a08983ad9230e" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/9a/81/7e4e08678a1f98521201c3079f77db69fb552acd56067661f8c2f534a718/markupsafe-3.0.3-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:1872df69a4de6aead3491198eaf13810b565bdbeec3ae2dc8780f14458ec73ce" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/1e/2c/799f4742efc39633a1b54a92eec4082e4f815314869865d876824c257c1e/markupsafe-3.0.3-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:3a7e8ae81ae39e62a41ec302f972ba6ae23a5c5396c8e60113e9066ef893da0d" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/3c/2e/8d0c2ab90a8c1d9a24f0399058ab8519a3279d1bd4289511d74e909f060e/markupsafe-3.0.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d6dd0be5b5b189d31db7cda48b91d7e0a9795f31430b7f271219ab30f1d3ac9d" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/2c/54/887f3092a85238093a0b2154bd629c89444f395618842e8b0c41783898ea/markupsafe-3.0.3-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:94c6f0bb423f739146aec64595853541634bde58b2135f27f61c1ffd1cd4d16a" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/c9/2f/336b8c7b6f4a4d95e91119dc8521402461b74a485558d8f238a68312f11c/markupsafe-3.0.3-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:be8813b57049a7dc738189df53d69395eba14fb99345e0a5994914a3864c8a4b" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/32/43/67935f2b7e4982ffb50a4d169b724d74b62a3964bc1a9a527f5ac4f1ee2b/markupsafe-3.0.3-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:83891d0e9fb81a825d9a6d61e3f07550ca70a076484292a70fde82c4b807286f" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/89/e0/4486f11e51bbba8b0c041098859e869e304d1c261e59244baa3d295d47b7/markupsafe-3.0.3-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:77f0643abe7495da77fb436f50f8dab76dbc6e5fd25d39589a0f1fe6548bfa2b" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/2f/e1/78ee7a023dac597a5825441ebd17170785a9dab23de95d2c7508ade94e0e/markupsafe-3.0.3-cp312-cp312-win32.whl", hash = "sha256:d88b440e37a16e651bda4c7c2b930eb586fd15ca7406cb39e211fcff3bf3017d" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/aa/5b/bec5aa9bbbb2c946ca2733ef9c4ca91c91b6a24580193e891b5f7dbe8e1e/markupsafe-3.0.3-cp312-cp312-win_amd64.whl", hash = "sha256:26a5784ded40c9e318cfc2bdb30fe164bdb8665ded9cd64d500a34fb42067b1c" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/e5/f1/216fc1bbfd74011693a4fd837e7026152e89c4bcf3e77b6692fba9923123/markupsafe-3.0.3-cp312-cp312-win_arm64.whl", hash = "sha256:35add3b638a5d900e807944a078b51922212fb3dedb01633a8defc4b01a3c85f" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/38/2f/907b9c7bbba283e68f20259574b13d005c121a0fa4c175f9bed27c4597ff/markupsafe-3.0.3-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:e1cf1972137e83c5d4c136c43ced9ac51d0e124706ee1c8aa8532c1287fa8795" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/9c/d9/5f7756922cdd676869eca1c4e3c0cd0df60ed30199ffd775e319089cb3ed/markupsafe-3.0.3-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:116bb52f642a37c115f517494ea5feb03889e04df47eeff5b130b1808ce7c219" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/00/07/575a68c754943058c78f30db02ee03a64b3c638586fba6a6dd56830b30a3/markupsafe-3.0.3-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:133a43e73a802c5562be9bbcd03d090aa5a1fe899db609c29e8c8d815c5f6de6" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/a9/21/9b05698b46f218fc0e118e1f8168395c65c8a2c750ae2bab54fc4bd4e0e8/markupsafe-3.0.3-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:ccfcd093f13f0f0b7fdd0f198b90053bf7b2f02a3927a30e63f3ccc9df56b676" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/7f/71/544260864f893f18b6827315b988c146b559391e6e7e8f7252839b1b846a/markupsafe-3.0.3-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:509fa21c6deb7a7a273d629cf5ec029bc209d1a51178615ddf718f5918992ab9" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/c2/28/b50fc2f74d1ad761af2f5dcce7492648b983d00a65b8c0e0cb457c82ebbe/markupsafe-3.0.3-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:a4afe79fb3de0b7097d81da19090f4df4f8d3a2b3adaa8764138aac2e44f3af1" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/ed/76/104b2aa106a208da8b17a2fb72e033a5a9d7073c68f7e508b94916ed47a9/markupsafe-3.0.3-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:795e7751525cae078558e679d646ae45574b47ed6e7771863fcc079a6171a0fc" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/b5/99/16a5eb2d140087ebd97180d95249b00a03aa87e29cc224056274f2e45fd6/markupsafe-3.0.3-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:8485f406a96febb5140bfeca44a73e3ce5116b2501ac54fe953e488fb1d03b12" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/19/bc/e7140ed90c5d61d77cea142eed9f9c303f4c4806f60a1044c13e3f1471d0/markupsafe-3.0.3-cp313-cp313-win32.whl", hash = "sha256:bdd37121970bfd8be76c5fb069c7751683bdf373db1ed6c010162b2a130248ed" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/05/73/c4abe620b841b6b791f2edc248f556900667a5a1cf023a6646967ae98335/markupsafe-3.0.3-cp313-cp313-win_amd64.whl", hash = "sha256:9a1abfdc021a164803f4d485104931fb8f8c1efd55bc6b748d2f5774e78b62c5" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/f0/3a/fa34a0f7cfef23cf9500d68cb7c32dd64ffd58a12b09225fb03dd37d5b80/markupsafe-3.0.3-cp313-cp313-win_arm64.whl", hash = "sha256:7e68f88e5b8799aa49c85cd116c932a1ac15caaa3f5db09087854d218359e485" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/e4/d7/e05cd7efe43a88a17a37b3ae96e79a19e846f3f456fe79c57ca61356ef01/markupsafe-3.0.3-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:218551f6df4868a8d527e3062d0fb968682fe92054e89978594c28e642c43a73" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/99/9e/e412117548182ce2148bdeacdda3bb494260c0b0184360fe0d56389b523b/markupsafe-3.0.3-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:3524b778fe5cfb3452a09d31e7b5adefeea8c5be1d43c4f810ba09f2ceb29d37" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/bc/e6/fa0ffcda717ef64a5108eaa7b4f5ed28d56122c9a6d70ab8b72f9f715c80/markupsafe-3.0.3-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:4e885a3d1efa2eadc93c894a21770e4bc67899e3543680313b09f139e149ab19" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/96/ec/2102e881fe9d25fc16cb4b25d5f5cde50970967ffa5dddafdb771237062d/markupsafe-3.0.3-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:8709b08f4a89aa7586de0aadc8da56180242ee0ada3999749b183aa23df95025" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/4b/30/6f2fce1f1f205fc9323255b216ca8a235b15860c34b6798f810f05828e32/markupsafe-3.0.3-cp313-cp313t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:b8512a91625c9b3da6f127803b166b629725e68af71f8184ae7e7d54686a56d6" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/58/47/4a0ccea4ab9f5dcb6f79c0236d954acb382202721e704223a8aafa38b5c8/markupsafe-3.0.3-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:9b79b7a16f7fedff2495d684f2b59b0457c3b493778c9eed31111be64d58279f" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/6a/70/3780e9b72180b6fecb83a4814d84c3bf4b4ae4bf0b19c27196104149734c/markupsafe-3.0.3-cp313-cp313t-musllinux_1_2_riscv64.whl", hash = "sha256:12c63dfb4a98206f045aa9563db46507995f7ef6d83b2f68eda65c307c6829eb" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/98/c5/c03c7f4125180fc215220c035beac6b9cb684bc7a067c84fc69414d315f5/markupsafe-3.0.3-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:8f71bc33915be5186016f675cd83a1e08523649b0e33efdb898db577ef5bb009" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/80/d6/2d1b89f6ca4bff1036499b1e29a1d02d282259f3681540e16563f27ebc23/markupsafe-3.0.3-cp313-cp313t-win32.whl", hash = "sha256:69c0b73548bc525c8cb9a251cddf1931d1db4d2258e9599c28c07ef3580ef354" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/2b/98/e48a4bfba0a0ffcf9925fe2d69240bfaa19c6f7507b8cd09c70684a53c1e/markupsafe-3.0.3-cp313-cp313t-win_amd64.whl", hash = "sha256:1b4b79e8ebf6b55351f0d91fe80f893b4743f104bff22e90697db1590e47a218" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/0e/72/e3cc540f351f316e9ed0f092757459afbc595824ca724cbc5a5d4263713f/markupsafe-3.0.3-cp313-cp313t-win_arm64.whl", hash = "sha256:ad2cf8aa28b8c020ab2fc8287b0f823d0a7d8630784c31e9ee5edea20f406287" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/33/8a/8e42d4838cd89b7dde187011e97fe6c3af66d8c044997d2183fbd6d31352/markupsafe-3.0.3-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:eaa9599de571d72e2daf60164784109f19978b327a3910d3e9de8c97b5b70cfe" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/b5/64/7660f8a4a8e53c924d0fa05dc3a55c9cee10bbd82b11c5afb27d44b096ce/markupsafe-3.0.3-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:c47a551199eb8eb2121d4f0f15ae0f923d31350ab9280078d1e5f12b249e0026" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/da/ef/e648bfd021127bef5fa12e1720ffed0c6cbb8310c8d9bea7266337ff06de/markupsafe-3.0.3-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f34c41761022dd093b4b6896d4810782ffbabe30f2d443ff5f083e0cbbb8c737" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/41/3c/a36c2450754618e62008bf7435ccb0f88053e07592e6028a34776213d877/markupsafe-3.0.3-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:457a69a9577064c05a97c41f4e65148652db078a3a509039e64d3467b9e7ef97" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/bc/20/b7fdf89a8456b099837cd1dc21974632a02a999ec9bf7ca3e490aacd98e7/markupsafe-3.0.3-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:e8afc3f2ccfa24215f8cb28dcf43f0113ac3c37c2f0f0806d8c70e4228c5cf4d" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/9a/a7/591f592afdc734f47db08a75793a55d7fbcc6902a723ae4cfbab61010cc5/markupsafe-3.0.3-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:ec15a59cf5af7be74194f7ab02d0f59a62bdcf1a537677ce67a2537c9b87fcda" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/7d/33/45b24e4f44195b26521bc6f1a82197118f74df348556594bd2262bda1038/markupsafe-3.0.3-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:0eb9ff8191e8498cca014656ae6b8d61f39da5f95b488805da4bb029cccbfbaf" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/ff/0e/53dfaca23a69fbfbbf17a4b64072090e70717344c52eaaaa9c5ddff1e5f0/markupsafe-3.0.3-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:2713baf880df847f2bece4230d4d094280f4e67b1e813eec43b4c0e144a34ffe" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/46/11/f333a06fc16236d5238bfe74daccbca41459dcd8d1fa952e8fbd5dccfb70/markupsafe-3.0.3-cp314-cp314-win32.whl", hash = "sha256:729586769a26dbceff69f7a7dbbf59ab6572b99d94576a5592625d5b411576b9" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/28/52/182836104b33b444e400b14f797212f720cbc9ed6ba34c800639d154e821/markupsafe-3.0.3-cp314-cp314-win_amd64.whl", hash = "sha256:bdc919ead48f234740ad807933cdf545180bfbe9342c2bb451556db2ed958581" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/6f/18/acf23e91bd94fd7b3031558b1f013adfa21a8e407a3fdb32745538730382/markupsafe-3.0.3-cp314-cp314-win_arm64.whl", hash = "sha256:5a7d5dc5140555cf21a6fefbdbf8723f06fcd2f63ef108f2854de715e4422cb4" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/3c/f0/57689aa4076e1b43b15fdfa646b04653969d50cf30c32a102762be2485da/markupsafe-3.0.3-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:1353ef0c1b138e1907ae78e2f6c63ff67501122006b0f9abad68fda5f4ffc6ab" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/89/c3/2e67a7ca217c6912985ec766c6393b636fb0c2344443ff9d91404dc4c79f/markupsafe-3.0.3-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:1085e7fbddd3be5f89cc898938f42c0b3c711fdcb37d75221de2666af647c175" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/f0/00/be561dce4e6ca66b15276e184ce4b8aec61fe83662cce2f7d72bd3249d28/markupsafe-3.0.3-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:1b52b4fb9df4eb9ae465f8d0c228a00624de2334f216f178a995ccdcf82c4634" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/50/09/c419f6f5a92e5fadde27efd190eca90f05e1261b10dbd8cbcb39cd8ea1dc/markupsafe-3.0.3-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:fed51ac40f757d41b7c48425901843666a6677e3e8eb0abcff09e4ba6e664f50" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/22/44/a0681611106e0b2921b3033fc19bc53323e0b50bc70cffdd19f7d679bb66/markupsafe-3.0.3-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:f190daf01f13c72eac4efd5c430a8de82489d9cff23c364c3ea822545032993e" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/5f/57/1b0b3f100259dc9fffe780cfb60d4be71375510e435efec3d116b6436d43/markupsafe-3.0.3-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:e56b7d45a839a697b5eb268c82a71bd8c7f6c94d6fd50c3d577fa39a9f1409f5" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/26/6a/4bf6d0c97c4920f1597cc14dd720705eca0bf7c787aebc6bb4d1bead5388/markupsafe-3.0.3-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:f3e98bb3798ead92273dc0e5fd0f31ade220f59a266ffd8a4f6065e0a3ce0523" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/14/c7/ca723101509b518797fedc2fdf79ba57f886b4aca8a7d31857ba3ee8281f/markupsafe-3.0.3-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:5678211cb9333a6468fb8d8be0305520aa073f50d17f089b5b4b477ea6e67fdc" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/fb/df/5bd7a48c256faecd1d36edc13133e51397e41b73bb77e1a69deab746ebac/markupsafe-3.0.3-cp314-cp314t-win32.whl", hash = "sha256:915c04ba3851909ce68ccc2b8e2cd691618c4dc4c4232fb7982bca3f41fd8c3d" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/1a/8a/0402ba61a2f16038b48b39bccca271134be00c5c9f0f623208399333c448/markupsafe-3.0.3-cp314-cp314t-win_amd64.whl", hash = "sha256:4faffd047e07c38848ce017e8725090413cd80cbc23d86e55c587bf979e579c9" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/70/bc/6f1c2f612465f5fa89b95bead1f44dcb607670fd42891d8fdcd5d039f4f4/markupsafe-3.0.3-cp314-cp314t-win_arm64.whl", hash = "sha256:32001d6a8fc98c8cb5c947787c5d08b0a50663d139f1305bac5885d98d9b40fa" },
-]
-
 [[package]]
 name = "math-verify"
 version = "0.8.0"
@@ -4432,7 +4249,8 @@ builder = [
 model-service = [
     { name = "alibabacloud-cr20181201" },
     { name = "fastapi" },
-    { name = "litellm" },
+    { name = "httpx" },
+    { name = "openai" },
     { name = "psutil" },
     { name = "swebench" },
     { name = "uvicorn" },
@@ -4495,11 +4313,12 @@ requires-dist = [
     { name = "gem-llm", marker = "extra == 'rocklet'", specifier = ">=0.1.0" },
     { name = "gem-llm", marker = "extra == 'sandbox-actor'", specifier = ">=0.1.0" },
     { name = "httpx" },
+    { name = "httpx", marker = "extra == 'model-service'" },
     { name = "kubernetes", marker = "extra == 'admin'", specifier = ">=35.0.0" },
-    { name = "litellm", marker = "extra == 'model-service'", specifier = ">=1.50.0" },
     { name = "nacos-sdk-python", marker = "extra == 'admin'", specifier = ">=0.1.14" },
     { name = "nacos-sdk-python", marker = "extra == 'sandbox-actor'", specifier = ">=0.1.14" },
     { name = "numpy", marker = "extra == 'rocklet'", specifier = "<=2.2.6" },
+    { name = "openai", marker = "extra == 'model-service'", specifier = ">=1.50.0" },
     { name = "opentelemetry-api" },
     { name = "opentelemetry-exporter-otlp" },
     { name = "opentelemetry-exporter-prometheus" },
@@ -4949,94 +4768,6 @@ wheels = [
     { url = "https://mirrors.aliyun.com/pypi/packages/e5/30/643397144bfbfec6f6ef821f36f33e57d35946c44a2352d3c9f0ae847619/tenacity-9.1.2-py3-none-any.whl", hash = "sha256:f77bf36710d8b73a50b2dd155c97b870017ad21afe6ab300326b0371b3b05138" },
 ]
 
-[[package]]
-name = "tiktoken"
-version = "0.12.0"
-source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
-dependencies = [
-    { name = "regex" },
-    { name = "requests" },
-]
-sdist = { url = "https://mirrors.aliyun.com/pypi/packages/7d/ab/4d017d0f76ec3171d469d80fc03dfbb4e48a4bcaddaa831b31d526f05edc/tiktoken-0.12.0.tar.gz", hash = "sha256:b18ba7ee2b093863978fcb14f74b3707cdc8d4d4d3836853ce7ec60772139931" }
-wheels = [
-    { url = "https://mirrors.aliyun.com/pypi/packages/89/b3/2cb7c17b6c4cf8ca983204255d3f1d95eda7213e247e6947a0ee2c747a2c/tiktoken-0.12.0-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:3de02f5a491cfd179aec916eddb70331814bd6bf764075d39e21d5862e533970" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/27/0f/df139f1df5f6167194ee5ab24634582ba9a1b62c6b996472b0277ec80f66/tiktoken-0.12.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:b6cfb6d9b7b54d20af21a912bfe63a2727d9cfa8fbda642fd8322c70340aad16" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/ef/5d/26a691f28ab220d5edc09b9b787399b130f24327ef824de15e5d85ef21aa/tiktoken-0.12.0-cp310-cp310-manylinux_2_28_aarch64.whl", hash = "sha256:cde24cdb1b8a08368f709124f15b36ab5524aac5fa830cc3fdce9c03d4fb8030" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/b2/94/443fab3d4e5ebecac895712abd3849b8da93b7b7dec61c7db5c9c7ebe40c/tiktoken-0.12.0-cp310-cp310-manylinux_2_28_x86_64.whl", hash = "sha256:6de0da39f605992649b9cfa6f84071e3f9ef2cec458d08c5feb1b6f0ff62e134" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/54/35/388f941251b2521c70dd4c5958e598ea6d2c88e28445d2fb8189eecc1dfc/tiktoken-0.12.0-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:6faa0534e0eefbcafaccb75927a4a380463a2eaa7e26000f0173b920e98b720a" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/f8/00/c6681c7f833dd410576183715a530437a9873fa910265817081f65f9105f/tiktoken-0.12.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:82991e04fc860afb933efb63957affc7ad54f83e2216fe7d319007dab1ba5892" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/5f/d2/82e795a6a9bafa034bf26a58e68fe9a89eeaaa610d51dbeb22106ba04f0a/tiktoken-0.12.0-cp310-cp310-win_amd64.whl", hash = "sha256:6fb2995b487c2e31acf0a9e17647e3b242235a20832642bb7a9d1a181c0c1bb1" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/de/46/21ea696b21f1d6d1efec8639c204bdf20fde8bafb351e1355c72c5d7de52/tiktoken-0.12.0-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:6e227c7f96925003487c33b1b32265fad2fbcec2b7cf4817afb76d416f40f6bb" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/c9/d9/35c5d2d9e22bb2a5f74ba48266fb56c63d76ae6f66e02feb628671c0283e/tiktoken-0.12.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:c06cf0fcc24c2cb2adb5e185c7082a82cba29c17575e828518c2f11a01f445aa" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/01/84/961106c37b8e49b9fdcf33fe007bb3a8fdcc380c528b20cc7fbba80578b8/tiktoken-0.12.0-cp311-cp311-manylinux_2_28_aarch64.whl", hash = "sha256:f18f249b041851954217e9fd8e5c00b024ab2315ffda5ed77665a05fa91f42dc" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/6a/d0/3d9275198e067f8b65076a68894bb52fd253875f3644f0a321a720277b8a/tiktoken-0.12.0-cp311-cp311-manylinux_2_28_x86_64.whl", hash = "sha256:47a5bc270b8c3db00bb46ece01ef34ad050e364b51d406b6f9730b64ac28eded" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/78/db/a58e09687c1698a7c592e1038e01c206569b86a0377828d51635561f8ebf/tiktoken-0.12.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:508fa71810c0efdcd1b898fda574889ee62852989f7c1667414736bcb2b9a4bd" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/9e/1b/a9e4d2bf91d515c0f74afc526fd773a812232dd6cda33ebea7f531202325/tiktoken-0.12.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:a1af81a6c44f008cba48494089dd98cccb8b313f55e961a52f5b222d1e507967" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/9d/15/963819345f1b1fb0809070a79e9dd96938d4ca41297367d471733e79c76c/tiktoken-0.12.0-cp311-cp311-win_amd64.whl", hash = "sha256:3e68e3e593637b53e56f7237be560f7a394451cb8c11079755e80ae64b9e6def" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/a4/85/be65d39d6b647c79800fd9d29241d081d4eeb06271f383bb87200d74cf76/tiktoken-0.12.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:b97f74aca0d78a1ff21b8cd9e9925714c15a9236d6ceacf5c7327c117e6e21e8" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/4a/42/6573e9129bc55c9bf7300b3a35bef2c6b9117018acca0dc760ac2d93dffe/tiktoken-0.12.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:2b90f5ad190a4bb7c3eb30c5fa32e1e182ca1ca79f05e49b448438c3e225a49b" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/66/c5/ed88504d2f4a5fd6856990b230b56d85a777feab84e6129af0822f5d0f70/tiktoken-0.12.0-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:65b26c7a780e2139e73acc193e5c63ac754021f160df919add909c1492c0fb37" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/f4/90/3dae6cc5436137ebd38944d396b5849e167896fc2073da643a49f372dc4f/tiktoken-0.12.0-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:edde1ec917dfd21c1f2f8046b86348b0f54a2c0547f68149d8600859598769ad" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/a3/fe/26df24ce53ffde419a42f5f53d755b995c9318908288c17ec3f3448313a3/tiktoken-0.12.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:35a2f8ddd3824608b3d650a000c1ef71f730d0c56486845705a8248da00f9fe5" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/20/cc/b064cae1a0e9fac84b0d2c46b89f4e57051a5f41324e385d10225a984c24/tiktoken-0.12.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:83d16643edb7fa2c99eff2ab7733508aae1eebb03d5dfc46f5565862810f24e3" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/81/10/b8523105c590c5b8349f2587e2fdfe51a69544bd5a76295fc20f2374f470/tiktoken-0.12.0-cp312-cp312-win_amd64.whl", hash = "sha256:ffc5288f34a8bc02e1ea7047b8d041104791d2ddbf42d1e5fa07822cbffe16bd" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/00/61/441588ee21e6b5cdf59d6870f86beb9789e532ee9718c251b391b70c68d6/tiktoken-0.12.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:775c2c55de2310cc1bc9a3ad8826761cbdc87770e586fd7b6da7d4589e13dab3" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/1f/05/dcf94486d5c5c8d34496abe271ac76c5b785507c8eae71b3708f1ad9b45a/tiktoken-0.12.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:a01b12f69052fbe4b080a2cfb867c4de12c704b56178edf1d1d7b273561db160" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/a0/70/5163fe5359b943f8db9946b62f19be2305de8c3d78a16f629d4165e2f40e/tiktoken-0.12.0-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:01d99484dc93b129cd0964f9d34eee953f2737301f18b3c7257bf368d7615baa" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/0c/da/c028aa0babf77315e1cef357d4d768800c5f8a6de04d0eac0f377cb619fa/tiktoken-0.12.0-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:4a1a4fcd021f022bfc81904a911d3df0f6543b9e7627b51411da75ff2fe7a1be" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/a0/5a/886b108b766aa53e295f7216b509be95eb7d60b166049ce2c58416b25f2a/tiktoken-0.12.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:981a81e39812d57031efdc9ec59fa32b2a5a5524d20d4776574c4b4bd2e9014a" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/f4/f8/4db272048397636ac7a078d22773dd2795b1becee7bc4922fe6207288d57/tiktoken-0.12.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:9baf52f84a3f42eef3ff4e754a0db79a13a27921b457ca9832cf944c6be4f8f3" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/8e/32/45d02e2e0ea2be3a9ed22afc47d93741247e75018aac967b713b2941f8ea/tiktoken-0.12.0-cp313-cp313-win_amd64.whl", hash = "sha256:b8a0cd0c789a61f31bf44851defbd609e8dd1e2c8589c614cc1060940ef1f697" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/ce/76/994fc868f88e016e6d05b0da5ac24582a14c47893f4474c3e9744283f1d5/tiktoken-0.12.0-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:d5f89ea5680066b68bcb797ae85219c72916c922ef0fcdd3480c7d2315ffff16" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/f6/b8/57ef1456504c43a849821920d582a738a461b76a047f352f18c0b26c6516/tiktoken-0.12.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:b4e7ed1c6a7a8a60a3230965bdedba8cc58f68926b835e519341413370e0399a" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/72/90/13da56f664286ffbae9dbcfadcc625439142675845baa62715e49b87b68b/tiktoken-0.12.0-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:fc530a28591a2d74bce821d10b418b26a094bf33839e69042a6e86ddb7a7fb27" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/05/df/4f80030d44682235bdaecd7346c90f67ae87ec8f3df4a3442cb53834f7e4/tiktoken-0.12.0-cp313-cp313t-manylinux_2_28_x86_64.whl", hash = "sha256:06a9f4f49884139013b138920a4c393aa6556b2f8f536345f11819389c703ebb" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/22/1f/ae535223a8c4ef4c0c1192e3f9b82da660be9eb66b9279e95c99288e9dab/tiktoken-0.12.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:04f0e6a985d95913cabc96a741c5ffec525a2c72e9df086ff17ebe35985c800e" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/78/a7/f8ead382fce0243cb625c4f266e66c27f65ae65ee9e77f59ea1653b6d730/tiktoken-0.12.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:0ee8f9ae00c41770b5f9b0bb1235474768884ae157de3beb5439ca0fd70f3e25" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/93/e0/6cc82a562bc6365785a3ff0af27a2a092d57c47d7a81d9e2295d8c36f011/tiktoken-0.12.0-cp313-cp313t-win_amd64.whl", hash = "sha256:dc2dd125a62cb2b3d858484d6c614d136b5b848976794edfb63688d539b8b93f" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/72/05/3abc1db5d2c9aadc4d2c76fa5640134e475e58d9fbb82b5c535dc0de9b01/tiktoken-0.12.0-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:a90388128df3b3abeb2bfd1895b0681412a8d7dc644142519e6f0a97c2111646" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/e3/7b/50c2f060412202d6c95f32b20755c7a6273543b125c0985d6fa9465105af/tiktoken-0.12.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:da900aa0ad52247d8794e307d6446bd3cdea8e192769b56276695d34d2c9aa88" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/14/27/bf795595a2b897e271771cd31cb847d479073497344c637966bdf2853da1/tiktoken-0.12.0-cp314-cp314-manylinux_2_28_aarch64.whl", hash = "sha256:285ba9d73ea0d6171e7f9407039a290ca77efcdb026be7769dccc01d2c8d7fff" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/f5/de/9341a6d7a8f1b448573bbf3425fa57669ac58258a667eb48a25dfe916d70/tiktoken-0.12.0-cp314-cp314-manylinux_2_28_x86_64.whl", hash = "sha256:d186a5c60c6a0213f04a7a802264083dea1bbde92a2d4c7069e1a56630aef830" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/75/0d/881866647b8d1be4d67cb24e50d0c26f9f807f994aa1510cb9ba2fe5f612/tiktoken-0.12.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:604831189bd05480f2b885ecd2d1986dc7686f609de48208ebbbddeea071fc0b" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/b3/1e/b651ec3059474dab649b8d5b69f5c65cd8fcd8918568c1935bd4136c9392/tiktoken-0.12.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:8f317e8530bb3a222547b85a58583238c8f74fd7a7408305f9f63246d1a0958b" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/80/57/ce64fd16ac390fafde001268c364d559447ba09b509181b2808622420eec/tiktoken-0.12.0-cp314-cp314-win_amd64.whl", hash = "sha256:399c3dd672a6406719d84442299a490420b458c44d3ae65516302a99675888f3" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/ac/a4/72eed53e8976a099539cdd5eb36f241987212c29629d0a52c305173e0a68/tiktoken-0.12.0-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:c2c714c72bc00a38ca969dae79e8266ddec999c7ceccd603cc4f0d04ccd76365" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/e6/d7/0110b8f54c008466b19672c615f2168896b83706a6611ba6e47313dbc6e9/tiktoken-0.12.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:cbb9a3ba275165a2cb0f9a83f5d7025afe6b9d0ab01a22b50f0e74fee2ad253e" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/5f/77/4f268c41a3957c418b084dd576ea2fad2e95da0d8e1ab705372892c2ca22/tiktoken-0.12.0-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:dfdfaa5ffff8993a3af94d1125870b1d27aed7cb97aa7eb8c1cefdbc87dbee63" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/4e/2b/fc46c90fe5028bd094cd6ee25a7db321cb91d45dc87531e2bdbb26b4867a/tiktoken-0.12.0-cp314-cp314t-manylinux_2_28_x86_64.whl", hash = "sha256:584c3ad3d0c74f5269906eb8a659c8bfc6144a52895d9261cdaf90a0ae5f4de0" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/28/c0/3c7a39ff68022ddfd7d93f3337ad90389a342f761c4d71de99a3ccc57857/tiktoken-0.12.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:54c891b416a0e36b8e2045b12b33dd66fb34a4fe7965565f1b482da50da3e86a" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/ab/0d/c1ad6f4016a3968c048545f5d9b8ffebf577774b2ede3e2e352553b685fe/tiktoken-0.12.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:5edb8743b88d5be814b1a8a8854494719080c28faaa1ccbef02e87354fe71ef0" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/af/df/c7891ef9d2712ad774777271d39fdef63941ffba0a9d59b7ad1fd2765e57/tiktoken-0.12.0-cp314-cp314t-win_amd64.whl", hash = "sha256:f61c0aea5565ac82e2ec50a05e02a6c44734e91b51c10510b084ea1b8e633a71" },
-]
-
-[[package]]
-name = "tokenizers"
-version = "0.23.1"
-source = { registry = "https://mirrors.aliyun.com/pypi/simple/" }
-dependencies = [
-    { name = "huggingface-hub" },
-]
-sdist = { url = "https://mirrors.aliyun.com/pypi/packages/c1/60/21f715d9faba5f5407ff759472ade058ec4a507ad62bcea47cb847239a73/tokenizers-0.23.1.tar.gz", hash = "sha256:1feeeadf865a7915adc25445dea30e9933e593c31bb96c277cee36de227c8bfa" }
-wheels = [
-    { url = "https://mirrors.aliyun.com/pypi/packages/87/39/b87a87d5bb9470610b80a2d31df42fcffeaf35118b8b97952b2aff598cc7/tokenizers-0.23.1-cp310-abi3-macosx_10_12_x86_64.whl", hash = "sha256:e03d6ffcbe0d56ee9c1ccd070e70a13fa750727c0277e138152acbc0252c2224" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/e2/6a/068ed9f6e444c9d7e9d55ce134181325700f3d7f30410721bdc8f848d727/tokenizers-0.23.1-cp310-abi3-macosx_11_0_arm64.whl", hash = "sha256:e0948bbb1ac1d7cdfc9fb6d62c596e3b7550036ad60ecd654a66ad273326324e" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/6c/36/e006edf031154cba92b8416057d92c3abe3635e4c4b0aa0b5b9bb39dde70/tokenizers-0.23.1-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1bf13402aff9bc533c89cb849ec3b412dc3fbeacc9744840e423d7bf3f7dc0e3" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/a2/ef/7735d226f9c7f874a6bee5e3f27fb25ecabdf207d37b8cf45286d0795893/tokenizers-0.23.1-cp310-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:f836ca703b89ae07919a309f9651f7a88fd5a33d5f718ba5ad0870ec0256bad6" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/b9/d9/24827036f6e21297bfffda0768e58eb6096a4f411e932964a01707857931/tokenizers-0.23.1-cp310-abi3-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ae848657742035523fdf261773630cb819a26995fcd3d9ecae0c1daf6e5a4959" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/0c/9a/22f3582b3a4f49358293a5206e25317621ee4526bfe9cdaa0f07a12e770e/tokenizers-0.23.1-cp310-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:53b09e85775d5187941e7bab30e941b4134ab4a7dd8c68e783d231fb7ca27c51" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/7e/65/b8f8814eef95800f20721384136d9a1d22241d50b2874357cb70542c392f/tokenizers-0.23.1-cp310-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:ea5a0ce170074329faaa8ea3f6400ecde604b6678192688533af80980daae71a" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/0d/d5/1353e5f677ec27c2494fb6a6725e82d56c985f53e90ec511369e7e4f02c6/tokenizers-0.23.1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5075b405006415ea148a992d093699c66eb01952bf59f4d5727089a98bda45a4" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/71/89/39b6b8fc073fb6d413d0147aa333dc7eff7be65639ac9d19930a0b21bf33/tokenizers-0.23.1-cp310-abi3-manylinux_2_31_riscv64.whl", hash = "sha256:56f3a77de629917652f876294dc9fe6bad4a0c43bc229dc72e59bb23a0f4729a" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/0f/80/127c854da64827e5b79264ce524993a90dddcb320e5cd42412c5c02f9e8a/tokenizers-0.23.1-cp310-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:9d10a6d957ef01896dc274e890eee27d41bd0e74ef31e60616f0fc311345184e" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/fe/ba/44c2502feb1a058f096ddfb4e0996ef3225a01a388e1a9b094e91689fe93/tokenizers-0.23.1-cp310-abi3-musllinux_1_2_armv7l.whl", hash = "sha256:1974288a609c343774f1b897c8b482c791ab17b75ab5c8c2b1737565c1d82288" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/9e/c1/464019a9fb059870bfe4eebb4ba12208f3042035e258bf5e782906bd3847/tokenizers-0.23.1-cp310-abi3-musllinux_1_2_i686.whl", hash = "sha256:120468fb4c24faf0543c835a4fabafa4deb3f20a035c9b6e83d0b553a97615d4" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/79/94/3ac1432bda31626071e9b6a12709b97ae05131c804b94c8f3ac622c5da32/tokenizers-0.23.1-cp310-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:e3d8f40ea6268047de7046906326abed5134f27d4e8447b23763afe5808c8a96" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/6a/dd/631b21433c771b1382535326f0eca80b9c9cee2e64961dd993bc9ac4669e/tokenizers-0.23.1-cp310-abi3-win32.whl", hash = "sha256:93120a930b919416da7cd10a2f606ac9919cc69cacae7980fa2140e277660948" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/97/c9/2553f72aaf65a2797d4229e37fa7fbe38ffbf3e32912d31bdd78b3323e59/tokenizers-0.23.1-cp310-abi3-win_amd64.whl", hash = "sha256:e7bfaf995c1bdbbd21d13539decb6650967013759318627d85daeb7881af16b7" },
-    { url = "https://mirrors.aliyun.com/pypi/packages/cd/2b/2be299bab55fc595e3d38567edb1a87f86e594842968fa9515a07bdcf422/tokenizers-0.23.1-cp310-abi3-win_arm64.whl", hash = "sha256:a26197957d8e4425dfba746315f3c425ea00cfa8367c5fbc4ec73447893dcea9" },
-]
-
 [[package]]
 name = "toml"
 version = "0.10.2"

From 164e4272870735fb2028325d67c901d037a9f1aa Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 06:08:27 +0000
Subject: [PATCH 05/25] refactor: use openai sdk

---
 rock/sdk/model/server/api/proxy.py     |  64 ++-------
 rock/sdk/model/server/config.py        |   3 -
 rock/sdk/model/server/main.py          |  23 ++--
 rock/sdk/model/server/sse_utils.py     |  92 +++++++++++++
 tests/unit/sdk/model/test_proxy.py     |   4 -
 tests/unit/sdk/model/test_sse_utils.py | 172 +++++++++++++++++++++++++
 6 files changed, 287 insertions(+), 71 deletions(-)
 create mode 100644 rock/sdk/model/server/sse_utils.py
 create mode 100644 tests/unit/sdk/model/test_sse_utils.py

diff --git a/rock/sdk/model/server/api/proxy.py b/rock/sdk/model/server/api/proxy.py
index 8161c4adc0..bd000cd80d 100644
--- a/rock/sdk/model/server/api/proxy.py
+++ b/rock/sdk/model/server/api/proxy.py
@@ -21,7 +21,6 @@
 
 import json
 import time
-import uuid
 from collections.abc import AsyncIterator
 from typing import Any
 
@@ -35,6 +34,12 @@
 from rock.sdk.model.server.config import ModelServiceConfig
 from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
 from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor, TrajectoryExhausted
+from rock.sdk.model.server.sse_utils import (
+    SSE_DONE,
+    completion_to_chunk_dict,
+    encode_sse_event,
+    parse_sse_data_chunks,
+)
 
 logger = init_logger(__name__)
 
@@ -84,56 +89,10 @@ def _filter_headers(headers) -> dict[str, str]:
     return out
 
 
-def _completion_to_chunk(response: dict, *, model: str) -> dict:
-    """Convert a recorded ``chat.completion`` response into a single
-    ``chat.completion.chunk`` shape (move ``message`` → ``delta``). Used only by
-    the replay streaming path."""
-    choices_in = response.get("choices") or []
-    choices_out = []
-    for choice in choices_in:
-        delta = dict(choice.get("message") or {})
-        choices_out.append(
-            {
-                "index": choice.get("index", 0),
-                "delta": delta,
-                "finish_reason": choice.get("finish_reason"),
-                "logprobs": choice.get("logprobs"),
-            }
-        )
-    return {
-        "id": response.get("id") or f"chatcmpl-{uuid.uuid4()}",
-        "object": "chat.completion.chunk",
-        "created": response.get("created") or int(time.time()),
-        "model": response.get("model") or model,
-        "choices": choices_out,
-    }
-
-
 async def _replay_sse_iter(response: dict, *, model: str) -> AsyncIterator[bytes]:
     """Emit a recorded response as one SSE chunk + ``[DONE]``."""
-    chunk = _completion_to_chunk(response, model=model)
-    yield f"data: {json.dumps(chunk, ensure_ascii=False)}\n\n".encode()
-    yield b"data: [DONE]\n\n"
-
-
-def _parse_sse_chunks_into_state(buffer: bytes, state: ChatCompletionStreamState) -> bytes:
-    """Pull complete SSE events out of ``buffer`` and feed each ``data:`` line
-    (other than ``[DONE]``) to the openai stream-state aggregator. Returns the
-    leftover bytes that did not yet form a complete event."""
-    while b"\n\n" in buffer:
-        event, buffer = buffer.split(b"\n\n", 1)
-        for raw_line in event.split(b"\n"):
-            line = raw_line.decode("utf-8", errors="replace").strip()
-            if not line.startswith("data:"):
-                continue
-            payload = line[len("data:") :].strip()
-            if not payload or payload == "[DONE]":
-                continue
-            try:
-                state.handle_chunk(ChatCompletionChunk.model_validate(json.loads(payload)))
-            except Exception as exc:  # parser error: forward continues, traj will be partial
-                logger.debug(f"[record] chunk parse failed (forward continues): {exc}")
-    return buffer
+    yield encode_sse_event(completion_to_chunk_dict(response, model=model))
+    yield SSE_DONE
 
 
 async def _forward_stream_and_record(
@@ -158,7 +117,12 @@ async def _forward_stream_and_record(
                 upstream_status = r.status_code
                 async for chunk in r.aiter_bytes():
                     yield chunk
-                    parse_buffer = _parse_sse_chunks_into_state(parse_buffer + chunk, state)
+                    chunk_dicts, parse_buffer = parse_sse_data_chunks(parse_buffer + chunk)
+                    for chunk_dict in chunk_dicts:
+                        try:
+                            state.handle_chunk(ChatCompletionChunk.model_validate(chunk_dict))
+                        except Exception as exc:  # parser error: forward continues, traj will be partial
+                            logger.debug(f"[record] chunk parse failed (forward continues): {exc}")
     except httpx.RequestError as exc:
         # Connection died mid-stream. The bytes already sent reach the client;
         # we still try to record what we got.
diff --git a/rock/sdk/model/server/config.py b/rock/sdk/model/server/config.py
index 8c992fb4b3..3923d43c6a 100644
--- a/rock/sdk/model/server/config.py
+++ b/rock/sdk/model/server/config.py
@@ -54,9 +54,6 @@ class ModelServiceConfig(BaseModel):
     num_retries: int = Field(default=6)
     """Number of retries for retryable failures (passed through to litellm)."""
 
-    traj_enabled: bool = Field(default=True)
-    """When True, write each chat/completions call as a JSONL trajectory line."""
-
     traj_file: str | None = Field(default=None)
     """Override default trajectory file path. None → uses TRAJ_FILE (LOG_DIR/LLMTraj.jsonl)."""
 
diff --git a/rock/sdk/model/server/main.py b/rock/sdk/model/server/main.py
index 31b4918e55..1ba0b54a7b 100644
--- a/rock/sdk/model/server/main.py
+++ b/rock/sdk/model/server/main.py
@@ -56,14 +56,11 @@ def _configure_proxy_integrations(app: FastAPI, config: ModelServiceConfig) -> N
     """Wire up record/replay integrations and attach them to ``app.state``.
 
     - Replay mode (``replay_traj_path`` set): load the trajectory into a
-      ``SequentialCursor`` and stash it as ``app.state.replay_cursor``.
-    - Forward/record mode (default): if ``traj_enabled`` is True, attach a
-      ``TrajectoryRecorder`` instance as ``app.state.recorder``. The proxy
-      handler invokes it explicitly after each forwarded call.
-
-    Replay and record are mutually exclusive — in replay mode we don't record,
-    since replayed responses round-tripping back into the source file would
-    inflate metrics and corrupt the trajectory.
+      ``SequentialCursor`` and stash it as ``app.state.replay_cursor``. No
+      recorder is attached — replaying back into the source file would corrupt it.
+    - Forward mode (default): attach a ``TrajectoryRecorder`` instance as
+      ``app.state.recorder``. The proxy handler invokes it explicitly after
+      each forwarded call.
     """
     if config.replay_traj_path:
         from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
@@ -72,12 +69,11 @@ def _configure_proxy_integrations(app: FastAPI, config: ModelServiceConfig) -> N
         logger.info(f"replay cursor loaded, traj_path={config.replay_traj_path}")
         return
 
-    if config.traj_enabled:
-        from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
+    from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
 
-        traj_path = config.traj_file or TRAJ_FILE
-        app.state.recorder = TrajectoryRecorder(traj_file=traj_path)
-        logger.info(f"trajectory recorder attached, traj_file={traj_path}")
+    traj_path = config.traj_file or TRAJ_FILE
+    app.state.recorder = TrajectoryRecorder(traj_file=traj_path)
+    logger.info(f"trajectory recorder attached, traj_file={traj_path}")
 
 
 def main(
@@ -134,7 +130,6 @@ def create_config_from_args(args) -> ModelServiceConfig:
         logger.info(f"num_retries set from command line: {args.num_retries}")
     if getattr(args, "traj_file", None):
         config.replay_traj_path = args.traj_file
-        config.traj_enabled = False
         logger.info(f"replay mode enabled via --traj-file: {args.traj_file}")
 
     return config
diff --git a/rock/sdk/model/server/sse_utils.py b/rock/sdk/model/server/sse_utils.py
new file mode 100644
index 0000000000..1cbe27298f
--- /dev/null
+++ b/rock/sdk/model/server/sse_utils.py
@@ -0,0 +1,92 @@
+"""SSE codec utilities for the chat/completions proxy.
+
+Three pure helpers, no openai/litellm dependencies:
+
+- :func:`parse_sse_data_chunks` — incremental SSE byte stream → list of decoded
+  ``data:`` payload dicts (used by the forward path to feed chunks into the
+  stream-state aggregator while bytes pass through verbatim to the client).
+- :func:`completion_to_chunk_dict` — convert a non-streaming ``chat.completion``
+  response into a single ``chat.completion.chunk`` dict, by renaming
+  ``message`` → ``delta``. Used by the replay path's streaming output.
+- :func:`encode_sse_event` — encode a payload dict as ``data: <json>\\n\\n``
+  bytes (one SSE event).
+"""
+
+from __future__ import annotations
+
+import json
+import time
+import uuid
+from typing import Final
+
+# Terminal SSE event sent at the end of a chat/completions stream.
+SSE_DONE: Final[bytes] = b"data: [DONE]\n\n"
+
+
+def parse_sse_data_chunks(buffer: bytes) -> tuple[list[dict], bytes]:
+    """Extract complete SSE events from a (possibly partial) byte buffer.
+
+    Returns ``(chunks, leftover)``: the parsed ``data:`` JSON payload dicts and
+    the bytes that did not yet form a complete event (``\\n\\n``-terminated).
+
+    - ``data: [DONE]`` is skipped (terminal marker, has no JSON payload).
+    - Lines that don't start with ``data:`` (``event:`` / ``id:`` / blank)
+      are ignored.
+    - Malformed JSON in a ``data:`` line is silently skipped — caller logs at
+      its own discretion (typically ``debug``).
+
+    Caller pattern::
+
+        chunks, buffer = parse_sse_data_chunks(buffer + new_bytes)
+        for chunk_dict in chunks:
+            ... feed to aggregator, etc ...
+    """
+    chunks: list[dict] = []
+    while b"\n\n" in buffer:
+        event, buffer = buffer.split(b"\n\n", 1)
+        for raw_line in event.split(b"\n"):
+            line = raw_line.decode("utf-8", errors="replace").strip()
+            if not line.startswith("data:"):
+                continue
+            payload = line[len("data:") :].strip()
+            if not payload or payload == "[DONE]":
+                continue
+            try:
+                chunks.append(json.loads(payload))
+            except json.JSONDecodeError:
+                continue
+    return chunks, buffer
+
+
+def completion_to_chunk_dict(response: dict, *, model: str) -> dict:
+    """Convert a recorded ``chat.completion`` dict into a single
+    ``chat.completion.chunk`` dict, suitable for re-streaming.
+
+    Only ``message`` → ``delta`` is renamed; every other field (including
+    provider-specific extras like ``reasoning_content`` inside the message)
+    flows through unchanged. ``id`` / ``created`` are synthesized when missing.
+    """
+    choices_in = response.get("choices") or []
+    choices_out = []
+    for choice in choices_in:
+        delta = dict(choice.get("message") or {})
+        choices_out.append(
+            {
+                "index": choice.get("index", 0),
+                "delta": delta,
+                "finish_reason": choice.get("finish_reason"),
+                "logprobs": choice.get("logprobs"),
+            }
+        )
+    return {
+        "id": response.get("id") or f"chatcmpl-{uuid.uuid4()}",
+        "object": "chat.completion.chunk",
+        "created": response.get("created") or int(time.time()),
+        "model": response.get("model") or model,
+        "choices": choices_out,
+    }
+
+
+def encode_sse_event(data: dict) -> bytes:
+    """Encode a JSON payload as one SSE ``data:`` event (terminated by ``\\n\\n``)."""
+    return f"data: {json.dumps(data, ensure_ascii=False)}\n\n".encode()
diff --git a/tests/unit/sdk/model/test_proxy.py b/tests/unit/sdk/model/test_proxy.py
index 88c61edcb3..fe5634fab1 100644
--- a/tests/unit/sdk/model/test_proxy.py
+++ b/tests/unit/sdk/model/test_proxy.py
@@ -439,7 +439,6 @@ def test_config_default_host_and_port():
 
 def test_config_default_traj_and_replay():
     config = ModelServiceConfig()
-    assert config.traj_enabled is True
     assert config.traj_file is None
     assert config.replay_traj_path is None
 
@@ -450,14 +449,12 @@ async def test_config_loads_traj_and_replay_from_file(tmp_path):
     conf_file.write_text(
         yaml.dump(
             {
-                "traj_enabled": False,
                 "traj_file": "/tmp/my-traj.jsonl",
                 "replay_traj_path": "/tmp/in.jsonl",
             }
         )
     )
     config = ModelServiceConfig.from_file(str(conf_file))
-    assert config.traj_enabled is False
     assert config.traj_file == "/tmp/my-traj.jsonl"
     assert config.replay_traj_path == "/tmp/in.jsonl"
 
@@ -504,7 +501,6 @@ def test_cli_traj_file_enables_replay():
     )
     config = create_config_from_args(args)
     assert config.replay_traj_path == "/tmp/in.jsonl"
-    assert config.traj_enabled is False
 
 
 # ---------- Metrics singleton + legacy record_traj (still used by local mode) ----------
diff --git a/tests/unit/sdk/model/test_sse_utils.py b/tests/unit/sdk/model/test_sse_utils.py
new file mode 100644
index 0000000000..6c9318a510
--- /dev/null
+++ b/tests/unit/sdk/model/test_sse_utils.py
@@ -0,0 +1,172 @@
+"""Tests for the pure SSE codec utilities (no openai/litellm dependencies)."""
+
+import json
+
+from rock.sdk.model.server.sse_utils import (
+    SSE_DONE,
+    completion_to_chunk_dict,
+    encode_sse_event,
+    parse_sse_data_chunks,
+)
+
+# ---------- parse_sse_data_chunks ----------
+
+
+def test_parse_returns_complete_events_and_leftover_buffer():
+    raw = b'data: {"a": 1}\n\ndata: {"a": 2}\n\ndata: {"a": 3}'  # 3rd event is incomplete
+    chunks, leftover = parse_sse_data_chunks(raw)
+
+    assert chunks == [{"a": 1}, {"a": 2}]
+    assert leftover == b'data: {"a": 3}'
+
+
+def test_parse_skips_done_marker():
+    raw = b'data: {"x": 1}\n\ndata: [DONE]\n\n'
+    chunks, leftover = parse_sse_data_chunks(raw)
+
+    assert chunks == [{"x": 1}]
+    assert leftover == b""
+
+
+def test_parse_skips_non_data_lines():
+    raw = b'event: progress\ndata: {"y": 2}\nid: abc\n\n'
+    chunks, leftover = parse_sse_data_chunks(raw)
+
+    assert chunks == [{"y": 2}]
+    assert leftover == b""
+
+
+def test_parse_silently_skips_malformed_json():
+    raw = b'data: not-json-at-all\n\ndata: {"ok": true}\n\n'
+    chunks, leftover = parse_sse_data_chunks(raw)
+
+    assert chunks == [{"ok": True}]
+    assert leftover == b""
+
+
+def test_parse_handles_empty_buffer():
+    chunks, leftover = parse_sse_data_chunks(b"")
+    assert chunks == []
+    assert leftover == b""
+
+
+def test_parse_incremental_streaming_pattern():
+    """Simulates feeding bytes in arbitrary chunks; final concatenation == all events."""
+    full_stream = b'data: {"i": 0}\n\ndata: {"i": 1}\n\ndata: {"i": 2}\n\ndata: [DONE]\n\n'
+    fragments = [full_stream[i : i + 5] for i in range(0, len(full_stream), 5)]
+
+    buffer = b""
+    collected: list[dict] = []
+    for frag in fragments:
+        new_chunks, buffer = parse_sse_data_chunks(buffer + frag)
+        collected.extend(new_chunks)
+
+    assert collected == [{"i": 0}, {"i": 1}, {"i": 2}]
+    assert buffer == b""
+
+
+def test_parse_handles_unicode_payload():
+    raw = b'data: {"content": "\xe4\xbd\xa0\xe5\xa5\xbd"}\n\n'  # "你好" UTF-8
+    chunks, _ = parse_sse_data_chunks(raw)
+    assert chunks == [{"content": "你好"}]
+
+
+# ---------- completion_to_chunk_dict ----------
+
+
+def test_completion_to_chunk_renames_message_to_delta():
+    response = {
+        "id": "rec-1",
+        "object": "chat.completion",
+        "created": 100,
+        "model": "gpt-4",
+        "choices": [
+            {
+                "index": 0,
+                "message": {"role": "assistant", "content": "hi"},
+                "finish_reason": "stop",
+            }
+        ],
+    }
+    chunk = completion_to_chunk_dict(response, model="gpt-4")
+
+    assert chunk["object"] == "chat.completion.chunk"
+    assert chunk["id"] == "rec-1"
+    assert chunk["created"] == 100
+    assert chunk["model"] == "gpt-4"
+    assert chunk["choices"][0]["delta"] == {"role": "assistant", "content": "hi"}
+    assert chunk["choices"][0]["finish_reason"] == "stop"
+    assert chunk["choices"][0]["index"] == 0
+    assert "message" not in chunk["choices"][0]
+
+
+def test_completion_to_chunk_preserves_provider_specific_message_fields():
+    """reasoning_content / tool_calls / etc inside message are kept verbatim in delta."""
+    response = {
+        "choices": [
+            {
+                "index": 0,
+                "message": {
+                    "role": "assistant",
+                    "content": "answer",
+                    "reasoning_content": "step-by-step thinking",
+                    "tool_calls": [{"id": "t1", "type": "function"}],
+                },
+                "finish_reason": "tool_calls",
+            }
+        ],
+    }
+    chunk = completion_to_chunk_dict(response, model="glm-5")
+
+    assert chunk["choices"][0]["delta"]["reasoning_content"] == "step-by-step thinking"
+    assert chunk["choices"][0]["delta"]["tool_calls"] == [{"id": "t1", "type": "function"}]
+    assert chunk["choices"][0]["finish_reason"] == "tool_calls"
+
+
+def test_completion_to_chunk_synthesizes_id_and_created_when_missing():
+    chunk = completion_to_chunk_dict(
+        {"choices": [{"index": 0, "message": {"role": "assistant"}, "finish_reason": "stop"}]},
+        model="any",
+    )
+    assert chunk["id"].startswith("chatcmpl-")
+    assert isinstance(chunk["created"], int) and chunk["created"] > 0
+    assert chunk["model"] == "any"
+
+
+def test_completion_to_chunk_handles_empty_choices():
+    chunk = completion_to_chunk_dict({"choices": []}, model="m")
+    assert chunk["choices"] == []
+
+
+# ---------- encode_sse_event ----------
+
+
+def test_encode_sse_event_appends_double_newline_terminator():
+    out = encode_sse_event({"k": "v"})
+    assert out.endswith(b"\n\n")
+    assert out.startswith(b"data: ")
+    body = out[len(b"data: ") : -len(b"\n\n")]
+    assert json.loads(body) == {"k": "v"}
+
+
+def test_encode_sse_event_preserves_unicode_without_escapes():
+    out = encode_sse_event({"content": "你好"})
+    # ensure_ascii=False is critical so Chinese stays readable in the wire format
+    assert "你好".encode() in out
+
+
+def test_sse_done_constant():
+    assert SSE_DONE == b"data: [DONE]\n\n"
+
+
+# ---------- round-trip ----------
+
+
+def test_roundtrip_encode_then_parse():
+    """encode → parse must round-trip a payload dict."""
+    payloads = [{"i": 0, "text": "alpha"}, {"i": 1, "text": "beta 中文"}]
+    wire = b"".join(encode_sse_event(p) for p in payloads) + SSE_DONE
+    chunks, leftover = parse_sse_data_chunks(wire)
+
+    assert chunks == payloads
+    assert leftover == b""

From a3459c9977950335eff05d708bbabe4451af7089 Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 06:14:22 +0000
Subject: [PATCH 06/25] chore(model-service): remove litellm remnants
 (num_retries, stale comments)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 rock/sdk/model/server/config.py |  5 +----
 rock/sdk/model/server/main.py   | 11 +----------
 rock/sdk/model/server/utils.py  | 15 ++-------------
 3 files changed, 4 insertions(+), 27 deletions(-)

diff --git a/rock/sdk/model/server/config.py b/rock/sdk/model/server/config.py
index 3923d43c6a..bd19edf902 100644
--- a/rock/sdk/model/server/config.py
+++ b/rock/sdk/model/server/config.py
@@ -51,14 +51,11 @@ class ModelServiceConfig(BaseModel):
     request_timeout: int = Field(default=120)
     """Request timeout in seconds."""
 
-    num_retries: int = Field(default=6)
-    """Number of retries for retryable failures (passed through to litellm)."""
-
     traj_file: str | None = Field(default=None)
     """Override default trajectory file path. None → uses TRAJ_FILE (LOG_DIR/LLMTraj.jsonl)."""
 
     replay_traj_path: str | None = Field(default=None)
-    """Path to a .jsonl trajectory file or a directory of .jsonl files for replay mode.
+    """Path to a .jsonl trajectory file for replay mode.
     When set, requests are served from recorded responses instead of a real upstream."""
 
     @classmethod
diff --git a/rock/sdk/model/server/main.py b/rock/sdk/model/server/main.py
index 1ba0b54a7b..83aec58f56 100644
--- a/rock/sdk/model/server/main.py
+++ b/rock/sdk/model/server/main.py
@@ -125,9 +125,6 @@ def create_config_from_args(args) -> ModelServiceConfig:
     if args.request_timeout:
         config.request_timeout = args.request_timeout
         logger.info(f"request_timeout set from command line: {args.request_timeout}s")
-    if getattr(args, "num_retries", None) is not None:
-        config.num_retries = args.num_retries
-        logger.info(f"num_retries set from command line: {args.num_retries}")
     if getattr(args, "traj_file", None):
         config.replay_traj_path = args.traj_file
         logger.info(f"replay mode enabled via --traj-file: {args.traj_file}")
@@ -173,17 +170,11 @@ def create_config_from_args(args) -> ModelServiceConfig:
     parser.add_argument(
         "--request-timeout", type=int, default=None, help="Request timeout in seconds. Overrides config file."
     )
-    parser.add_argument(
-        "--num-retries",
-        type=int,
-        default=None,
-        help="Number of retries for retryable failures (passed through to litellm). Overrides config file.",
-    )
     parser.add_argument(
         "--traj-file",
         type=str,
         default=None,
-        help="Replay mode: path to a recorded .jsonl traj file or directory. Disables real LLM upstreams.",
+        help="Replay mode: path to a recorded .jsonl traj file. Disables real LLM upstreams.",
     )
     args = parser.parse_args()
 
diff --git a/rock/sdk/model/server/utils.py b/rock/sdk/model/server/utils.py
index 86b7414e29..639ca3995b 100644
--- a/rock/sdk/model/server/utils.py
+++ b/rock/sdk/model/server/utils.py
@@ -26,13 +26,7 @@ def _get_or_create_metrics_monitor() -> MetricsMonitor:
 
 
 def _write_traj(data: dict):
-    """Write traj data to file in JSONL format.
-
-    Used by the legacy ``@record_traj`` decorator on the ``local`` model-service
-    flow. The proxy flow now persists trajectories via
-    :class:`rock.sdk.model.server.integrations.traj_recorder.TrajectoryRecorder`
-    instead, which uses litellm's StandardLoggingPayload schema.
-    """
+    """Write traj data to file in JSONL format."""
     from rock import env_vars
 
     append = env_vars.ROCK_MODEL_SERVICE_TRAJ_APPEND_MODE
@@ -44,12 +38,7 @@ def _write_traj(data: dict):
 
 
 def record_traj(func: Callable):
-    """Decorator to record chat completions input/output as traj.
-
-    Kept for the ``local`` model-service mode (rock/sdk/model/server/api/local.py).
-    The ``proxy`` mode no longer uses this decorator — it relies on the
-    TrajectoryRecorder litellm callback for richer payloads.
-    """
+    """Decorator to record chat completions input/output as traj (local mode only)."""
 
     @wraps(func)
     async def wrapper(*args, **kwargs):

From f125d335eb469ef288a99e318a7551edb91de281 Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 06:38:11 +0000
Subject: [PATCH 07/25] refactor(model-service): split proxy handler into
 _ReplayBackend / _ForwardBackend

Strategy pattern eliminates the replay/forward branch inside chat_completions.
The backend is selected once at startup (_configure_proxy_integrations) and
attached to app.state.backend; the endpoint just parses the request and
dispatches. Each backend keeps the stream/non-stream branch local to itself.

A union type alias _CompletionBackend documents the closed set of backends
and a typed _get_backend(request) accessor wraps the app.state read.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 rock/sdk/model/server/api/proxy.py | 213 +++++++++++++++++------------
 rock/sdk/model/server/main.py      |  26 ++--
 tests/unit/sdk/model/test_proxy.py |  19 +--
 3 files changed, 149 insertions(+), 109 deletions(-)

diff --git a/rock/sdk/model/server/api/proxy.py b/rock/sdk/model/server/api/proxy.py
index bd000cd80d..0ad5d698a2 100644
--- a/rock/sdk/model/server/api/proxy.py
+++ b/rock/sdk/model/server/api/proxy.py
@@ -1,8 +1,8 @@
 """OpenAI-compatible chat/completions proxy with trajectory record/replay.
 
-Two paths share this handler:
+Two backends share the ``/v1/chat/completions`` route:
 
-1. **Forward / record mode** (default) — body bytes are POSTed verbatim to the
+1. **_ForwardBackend** (default) — body bytes are POSTed verbatim to the
    configured upstream via plain ``httpx``. The upstream response is forwarded
    byte-for-byte back to the client (raw JSON for non-stream, raw SSE bytes
    for stream). On the side we run a parser (``ChatCompletionChunk`` +
@@ -12,9 +12,10 @@
    returns (provider-specific ``reasoning_content``, ``citations``, ...) is
    passed through untouched.
 
-2. **Replay mode** (``replay_traj_path`` set) — the request is served directly
-   from the next record in ``app.state.replay_cursor`` without any upstream
-   call. Streaming emits the recorded response as one SSE chunk + ``[DONE]``.
+2. **_ReplayBackend** (``replay_traj_path`` set) — the request is served
+   directly from the next record in the ``SequentialCursor`` without any
+   upstream call. Streaming emits the recorded response as one SSE chunk +
+   ``[DONE]``.
 """
 
 from __future__ import annotations
@@ -158,32 +159,15 @@ async def _forward_stream_and_record(
     )
 
 
-@proxy_router.post("/v1/chat/completions")
-async def chat_completions(request: Request):
-    """OpenAI-compatible chat completions proxy endpoint.
-
-    Reads the body as raw bytes (no parsing on the forward path) and either
-    serves it from the replay cursor or forwards it to the configured upstream.
-    """
-    config: ModelServiceConfig = request.app.state.model_service_config
-    recorder: TrajectoryRecorder | None = getattr(request.app.state, "recorder", None)
-
-    body_bytes = await request.body()
-    try:
-        request_dict = json.loads(body_bytes) if body_bytes else {}
-    except json.JSONDecodeError:
-        raise HTTPException(status_code=400, detail="Request body is not valid JSON.")
-    if not isinstance(request_dict, dict):
-        raise HTTPException(status_code=400, detail="Request body must be a JSON object.")
+class _ReplayBackend:
+    """Serves requests from a pre-recorded trajectory; no upstream calls made."""
 
-    model_name = request_dict.get("model", "")
-    is_stream = bool(request_dict.get("stream"))
+    def __init__(self, cursor: SequentialCursor) -> None:
+        self._cursor = cursor
 
-    # ---- Replay mode: short-circuit, no upstream call ----
-    if config.replay_traj_path:
-        cursor: SequentialCursor = request.app.state.replay_cursor
+    async def serve(self, *, model_name: str, is_stream: bool, **_: Any) -> Response:
         try:
-            record = await cursor.next(expected_model=model_name)
+            record = await self._cursor.next(expected_model=model_name)
         except TrajectoryExhausted as exc:
             raise HTTPException(status_code=404, detail=str(exc))
 
@@ -191,9 +175,9 @@ async def chat_completions(request: Request):
         if not isinstance(response_dict, dict):
             raise HTTPException(
                 status_code=500,
-                detail=f"replay record at step {cursor.position - 1} has no usable response dict",
+                detail=f"replay record at step {self._cursor.position - 1} has no usable response dict",
             )
-        logger.info(f"[replay] step {cursor.position}/{cursor.total} served for model={model_name!r}")
+        logger.info(f"[replay] step {self._cursor.position}/{self._cursor.total} served for model={model_name!r}")
 
         if is_stream:
             return StreamingResponse(
@@ -202,71 +186,124 @@ async def chat_completions(request: Request):
             )
         return JSONResponse(status_code=200, content=response_dict)
 
-    # ---- Forward / record mode: byte-passthrough via httpx ----
-    upstream_url = f"{get_base_url(model_name, config)}/chat/completions"
-    fwd_headers = _filter_headers(request.headers)
-    logger.info(f"Routing model {model_name!r} to {upstream_url}")
-
-    if is_stream:
-        return StreamingResponse(
-            _forward_stream_and_record(
-                upstream_url=upstream_url,
-                body_bytes=body_bytes,
-                fwd_headers=fwd_headers,
-                timeout=config.request_timeout,
-                request_dict=request_dict,
-                recorder=recorder,
-            ),
-            media_type="text/event-stream",
-        )
 
-    # Non-stream: single POST, return upstream's status + body verbatim, record on the side.
-    start = time.time()
-    try:
-        async with httpx.AsyncClient(timeout=config.request_timeout) as client:
-            r = await client.post(upstream_url, content=body_bytes, headers=fwd_headers)
-    except httpx.TimeoutException as exc:
-        if recorder is not None:
-            await recorder.record(
-                request=request_dict,
-                response=None,
-                status="failure",
-                start_time=start,
-                end_time=time.time(),
-                error=f"timeout: {exc}",
+class _ForwardBackend:
+    """Forwards requests byte-for-byte to the upstream and optionally records the trajectory."""
+
+    def __init__(self, config: ModelServiceConfig, recorder: TrajectoryRecorder | None = None) -> None:
+        self._config = config
+        self._recorder = recorder
+
+    async def serve(
+        self,
+        *,
+        model_name: str,
+        is_stream: bool,
+        body_bytes: bytes,
+        fwd_headers: dict[str, str],
+        request_dict: dict[str, Any],
+        **_: Any,
+    ) -> Response:
+        upstream_url = f"{get_base_url(model_name, self._config)}/chat/completions"
+        logger.info(f"Routing model {model_name!r} to {upstream_url}")
+
+        if is_stream:
+            return StreamingResponse(
+                _forward_stream_and_record(
+                    upstream_url=upstream_url,
+                    body_bytes=body_bytes,
+                    fwd_headers=fwd_headers,
+                    timeout=self._config.request_timeout,
+                    request_dict=request_dict,
+                    recorder=self._recorder,
+                ),
+                media_type="text/event-stream",
             )
-        raise HTTPException(status_code=504, detail=f"Upstream timed out: {exc}")
-    except httpx.RequestError as exc:
-        if recorder is not None:
-            await recorder.record(
+
+        # Non-stream: single POST, return upstream's status + body verbatim, record on the side.
+        start = time.time()
+        try:
+            async with httpx.AsyncClient(timeout=self._config.request_timeout) as client:
+                r = await client.post(upstream_url, content=body_bytes, headers=fwd_headers)
+        except httpx.TimeoutException as exc:
+            if self._recorder is not None:
+                await self._recorder.record(
+                    request=request_dict,
+                    response=None,
+                    status="failure",
+                    start_time=start,
+                    end_time=time.time(),
+                    error=f"timeout: {exc}",
+                )
+            raise HTTPException(status_code=504, detail=f"Upstream timed out: {exc}")
+        except httpx.RequestError as exc:
+            if self._recorder is not None:
+                await self._recorder.record(
+                    request=request_dict,
+                    response=None,
+                    status="failure",
+                    start_time=start,
+                    end_time=time.time(),
+                    error=f"{type(exc).__name__}: {exc}",
+                )
+            raise HTTPException(status_code=502, detail=f"Upstream request failed: {exc}")
+
+        response_text = r.text  # bytes already read by httpx; .text decodes once
+        response_dict: dict | None = None
+        try:
+            parsed = json.loads(response_text) if response_text else None
+            if isinstance(parsed, dict):
+                response_dict = parsed
+        except json.JSONDecodeError:
+            pass
+
+        if self._recorder is not None:
+            await self._recorder.record(
                 request=request_dict,
-                response=None,
-                status="failure",
+                response=response_dict,
+                status="success" if r.status_code < 400 else "failure",
                 start_time=start,
                 end_time=time.time(),
-                error=f"{type(exc).__name__}: {exc}",
+                error=None if r.status_code < 400 else f"upstream_status={r.status_code}",
             )
-        raise HTTPException(status_code=502, detail=f"Upstream request failed: {exc}")
 
-    response_text = r.text  # bytes already read by httpx; .text decodes once
-    response_dict: dict | None = None
+        # Forward bytes verbatim — preserves any provider-specific fields untouched.
+        media_type = r.headers.get("content-type", "application/json")
+        return Response(content=response_text, status_code=r.status_code, media_type=media_type)
+
+
+_CompletionBackend = _ReplayBackend | _ForwardBackend
+
+
+def _get_backend(request: Request) -> _CompletionBackend:
+    """Typed accessor for the backend attached at startup by ``_configure_proxy_integrations``."""
+    return request.app.state.backend
+
+
+@proxy_router.post("/v1/chat/completions")
+async def chat_completions(request: Request):
+    """OpenAI-compatible chat completions proxy endpoint.
+
+    Reads the body as raw bytes (no parsing on the forward path) and delegates
+    to the backend attached at startup (replay or forward).
+    """
+    body_bytes = await request.body()
     try:
-        parsed = json.loads(response_text) if response_text else None
-        if isinstance(parsed, dict):
-            response_dict = parsed
+        request_dict = json.loads(body_bytes) if body_bytes else {}
     except json.JSONDecodeError:
-        pass
-
-    if recorder is not None:
-        await recorder.record(
-            request=request_dict,
-            response=response_dict,
-            status="success" if r.status_code < 400 else "failure",
-            start_time=start,
-            end_time=time.time(),
-            error=None if r.status_code < 400 else f"upstream_status={r.status_code}",
-        )
+        raise HTTPException(status_code=400, detail="Request body is not valid JSON.")
+    if not isinstance(request_dict, dict):
+        raise HTTPException(status_code=400, detail="Request body must be a JSON object.")
 
-    # Forward bytes verbatim — preserves any provider-specific fields untouched.
-    media_type = r.headers.get("content-type", "application/json")
-    return Response(content=response_text, status_code=r.status_code, media_type=media_type)
+    model_name = request_dict.get("model", "")
+    is_stream = bool(request_dict.get("stream"))
+    fwd_headers = _filter_headers(request.headers)
+
+    backend = _get_backend(request)
+    return await backend.serve(
+        model_name=model_name,
+        is_stream=is_stream,
+        body_bytes=body_bytes,
+        fwd_headers=fwd_headers,
+        request_dict=request_dict,
+    )
diff --git a/rock/sdk/model/server/main.py b/rock/sdk/model/server/main.py
index 83aec58f56..da2ee10a36 100644
--- a/rock/sdk/model/server/main.py
+++ b/rock/sdk/model/server/main.py
@@ -53,27 +53,29 @@ async def global_exception_handler(request, exc):
 
 
 def _configure_proxy_integrations(app: FastAPI, config: ModelServiceConfig) -> None:
-    """Wire up record/replay integrations and attach them to ``app.state``.
-
-    - Replay mode (``replay_traj_path`` set): load the trajectory into a
-      ``SequentialCursor`` and stash it as ``app.state.replay_cursor``. No
-      recorder is attached — replaying back into the source file would corrupt it.
-    - Forward mode (default): attach a ``TrajectoryRecorder`` instance as
-      ``app.state.recorder``. The proxy handler invokes it explicitly after
-      each forwarded call.
+    """Attach the appropriate backend to ``app.state.backend``.
+
+    - Replay mode (``replay_traj_path`` set): ``_ReplayBackend`` wrapping a
+      ``SequentialCursor``; no recorder — replaying back into the source file
+      would corrupt it.
+    - Forward mode (default): ``_ForwardBackend`` with a ``TrajectoryRecorder``.
     """
+    from rock.sdk.model.server.api.proxy import _ForwardBackend, _ReplayBackend
+
     if config.replay_traj_path:
         from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
 
-        app.state.replay_cursor = SequentialCursor.load(config.replay_traj_path)
-        logger.info(f"replay cursor loaded, traj_path={config.replay_traj_path}")
+        cursor = SequentialCursor.load(config.replay_traj_path)
+        app.state.backend = _ReplayBackend(cursor)
+        logger.info(f"replay backend attached, traj_path={config.replay_traj_path}")
         return
 
     from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
 
     traj_path = config.traj_file or TRAJ_FILE
-    app.state.recorder = TrajectoryRecorder(traj_file=traj_path)
-    logger.info(f"trajectory recorder attached, traj_file={traj_path}")
+    recorder = TrajectoryRecorder(traj_file=traj_path)
+    app.state.backend = _ForwardBackend(config, recorder=recorder)
+    logger.info(f"forward backend attached, traj_file={traj_path}")
 
 
 def main(
diff --git a/tests/unit/sdk/model/test_proxy.py b/tests/unit/sdk/model/test_proxy.py
index fe5634fab1..b3e9d5b1ed 100644
--- a/tests/unit/sdk/model/test_proxy.py
+++ b/tests/unit/sdk/model/test_proxy.py
@@ -27,12 +27,16 @@
 )
 
 
-def _build_app(config: ModelServiceConfig, *, replay_cursor=None) -> FastAPI:
+def _build_app(config: ModelServiceConfig, *, replay_cursor=None, recorder=None) -> FastAPI:
     """Build a FastAPI app with the proxy router and the given config attached."""
+    from rock.sdk.model.server.api.proxy import _ForwardBackend, _ReplayBackend
+
     app = FastAPI()
     app.state.model_service_config = config
     if replay_cursor is not None:
-        app.state.replay_cursor = replay_cursor
+        app.state.backend = _ReplayBackend(replay_cursor)
+    else:
+        app.state.backend = _ForwardBackend(config, recorder=recorder)
     app.include_router(proxy_router)
     return app
 
@@ -264,19 +268,16 @@ def handler(request: httpx.Request) -> httpx.Response:
 
 @pytest.mark.asyncio
 async def test_forward_invokes_recorder_on_success(tmp_path):
-    """When app.state.recorder is set, success calls write a JSONL line with the
-    request and the upstream response verbatim."""
+    """When a recorder is attached to the backend, success calls write a JSONL line."""
     from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
 
     upstream_payload = _success_response_json(content="recorded reply")
+    traj_file = tmp_path / "traj.jsonl"
 
     def handler(request: httpx.Request) -> httpx.Response:
         return httpx.Response(200, json=upstream_payload)
 
     config = ModelServiceConfig()
-    app = _build_app(config)
-    traj_file = tmp_path / "traj.jsonl"
-    app.state.recorder = TrajectoryRecorder(traj_file=traj_file)
 
     with (
         _patch_httpx_with_handler(handler),
@@ -284,8 +285,8 @@ def handler(request: httpx.Request) -> httpx.Response:
             "rock.sdk.model.server.integrations.traj_recorder._get_or_create_metrics_monitor", return_value=MagicMock()
         ),
     ):
-        # Re-create the recorder so it picks up the patched monitor.
-        app.state.recorder = TrajectoryRecorder(traj_file=traj_file)
+        recorder = TrajectoryRecorder(traj_file=traj_file)
+        app = _build_app(config, recorder=recorder)
         transport = ASGITransport(app=app)
         async with AsyncClient(transport=transport, base_url="http://test") as ac:
             await ac.post(

From 4b7a35abbf266ec6b717541e59b78291461b30a0 Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 06:41:52 +0000
Subject: [PATCH 08/25] =?UTF-8?q?refactor(model-service):=20drop=20=5F=20p?=
 =?UTF-8?q?refix=20from=20public=20Backend=20classes,=20rename=20replay=5F?=
 =?UTF-8?q?traj=5Fpath=20=E2=86=92=20replay=5Ftraj=5Ffile?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The Backend classes are imported by main.py and tests, so the leading
underscore mis-signalled them as module-internal. Rename the config field
to align with traj_file naming. Also drop the defensive getattr() for
args.traj_file — argparse always sets it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 rock/sdk/model/server/api/proxy.py | 12 ++++++------
 rock/sdk/model/server/config.py    |  2 +-
 rock/sdk/model/server/main.py      | 20 ++++++++++----------
 tests/unit/sdk/model/test_proxy.py | 20 ++++++++++----------
 4 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/rock/sdk/model/server/api/proxy.py b/rock/sdk/model/server/api/proxy.py
index 0ad5d698a2..2aa911771a 100644
--- a/rock/sdk/model/server/api/proxy.py
+++ b/rock/sdk/model/server/api/proxy.py
@@ -2,7 +2,7 @@
 
 Two backends share the ``/v1/chat/completions`` route:
 
-1. **_ForwardBackend** (default) — body bytes are POSTed verbatim to the
+1. **ForwardBackend** (default) — body bytes are POSTed verbatim to the
    configured upstream via plain ``httpx``. The upstream response is forwarded
    byte-for-byte back to the client (raw JSON for non-stream, raw SSE bytes
    for stream). On the side we run a parser (``ChatCompletionChunk`` +
@@ -12,7 +12,7 @@
    returns (provider-specific ``reasoning_content``, ``citations``, ...) is
    passed through untouched.
 
-2. **_ReplayBackend** (``replay_traj_path`` set) — the request is served
+2. **ReplayBackend** (``replay_traj_file`` set) — the request is served
    directly from the next record in the ``SequentialCursor`` without any
    upstream call. Streaming emits the recorded response as one SSE chunk +
    ``[DONE]``.
@@ -159,7 +159,7 @@ async def _forward_stream_and_record(
     )
 
 
-class _ReplayBackend:
+class ReplayBackend:
     """Serves requests from a pre-recorded trajectory; no upstream calls made."""
 
     def __init__(self, cursor: SequentialCursor) -> None:
@@ -187,7 +187,7 @@ async def serve(self, *, model_name: str, is_stream: bool, **_: Any) -> Response
         return JSONResponse(status_code=200, content=response_dict)
 
 
-class _ForwardBackend:
+class ForwardBackend:
     """Forwards requests byte-for-byte to the upstream and optionally records the trajectory."""
 
     def __init__(self, config: ModelServiceConfig, recorder: TrajectoryRecorder | None = None) -> None:
@@ -272,10 +272,10 @@ async def serve(
         return Response(content=response_text, status_code=r.status_code, media_type=media_type)
 
 
-_CompletionBackend = _ReplayBackend | _ForwardBackend
+CompletionBackend = ReplayBackend | ForwardBackend
 
 
-def _get_backend(request: Request) -> _CompletionBackend:
+def _get_backend(request: Request) -> CompletionBackend:
     """Typed accessor for the backend attached at startup by ``_configure_proxy_integrations``."""
     return request.app.state.backend
 
diff --git a/rock/sdk/model/server/config.py b/rock/sdk/model/server/config.py
index bd19edf902..76e080305c 100644
--- a/rock/sdk/model/server/config.py
+++ b/rock/sdk/model/server/config.py
@@ -54,7 +54,7 @@ class ModelServiceConfig(BaseModel):
     traj_file: str | None = Field(default=None)
     """Override default trajectory file path. None → uses TRAJ_FILE (LOG_DIR/LLMTraj.jsonl)."""
 
-    replay_traj_path: str | None = Field(default=None)
+    replay_traj_file: str | None = Field(default=None)
     """Path to a .jsonl trajectory file for replay mode.
     When set, requests are served from recorded responses instead of a real upstream."""
 
diff --git a/rock/sdk/model/server/main.py b/rock/sdk/model/server/main.py
index da2ee10a36..951e0d5dff 100644
--- a/rock/sdk/model/server/main.py
+++ b/rock/sdk/model/server/main.py
@@ -55,26 +55,26 @@ async def global_exception_handler(request, exc):
 def _configure_proxy_integrations(app: FastAPI, config: ModelServiceConfig) -> None:
     """Attach the appropriate backend to ``app.state.backend``.
 
-    - Replay mode (``replay_traj_path`` set): ``_ReplayBackend`` wrapping a
+    - Replay mode (``replay_traj_file`` set): ``ReplayBackend`` wrapping a
       ``SequentialCursor``; no recorder — replaying back into the source file
       would corrupt it.
-    - Forward mode (default): ``_ForwardBackend`` with a ``TrajectoryRecorder``.
+    - Forward mode (default): ``ForwardBackend`` with a ``TrajectoryRecorder``.
     """
-    from rock.sdk.model.server.api.proxy import _ForwardBackend, _ReplayBackend
+    from rock.sdk.model.server.api.proxy import ForwardBackend, ReplayBackend
 
-    if config.replay_traj_path:
+    if config.replay_traj_file:
         from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
 
-        cursor = SequentialCursor.load(config.replay_traj_path)
-        app.state.backend = _ReplayBackend(cursor)
-        logger.info(f"replay backend attached, traj_path={config.replay_traj_path}")
+        cursor = SequentialCursor.load(config.replay_traj_file)
+        app.state.backend = ReplayBackend(cursor)
+        logger.info(f"replay backend attached, traj_path={config.replay_traj_file}")
         return
 
     from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
 
     traj_path = config.traj_file or TRAJ_FILE
     recorder = TrajectoryRecorder(traj_file=traj_path)
-    app.state.backend = _ForwardBackend(config, recorder=recorder)
+    app.state.backend = ForwardBackend(config, recorder=recorder)
     logger.info(f"forward backend attached, traj_file={traj_path}")
 
 
@@ -127,8 +127,8 @@ def create_config_from_args(args) -> ModelServiceConfig:
     if args.request_timeout:
         config.request_timeout = args.request_timeout
         logger.info(f"request_timeout set from command line: {args.request_timeout}s")
-    if getattr(args, "traj_file", None):
-        config.replay_traj_path = args.traj_file
+    if args.traj_file:
+        config.replay_traj_file = args.traj_file
         logger.info(f"replay mode enabled via --traj-file: {args.traj_file}")
 
     return config
diff --git a/tests/unit/sdk/model/test_proxy.py b/tests/unit/sdk/model/test_proxy.py
index b3e9d5b1ed..caa5fb2254 100644
--- a/tests/unit/sdk/model/test_proxy.py
+++ b/tests/unit/sdk/model/test_proxy.py
@@ -29,14 +29,14 @@
 
 def _build_app(config: ModelServiceConfig, *, replay_cursor=None, recorder=None) -> FastAPI:
     """Build a FastAPI app with the proxy router and the given config attached."""
-    from rock.sdk.model.server.api.proxy import _ForwardBackend, _ReplayBackend
+    from rock.sdk.model.server.api.proxy import ForwardBackend, ReplayBackend
 
     app = FastAPI()
     app.state.model_service_config = config
     if replay_cursor is not None:
-        app.state.backend = _ReplayBackend(replay_cursor)
+        app.state.backend = ReplayBackend(replay_cursor)
     else:
-        app.state.backend = _ForwardBackend(config, recorder=recorder)
+        app.state.backend = ForwardBackend(config, recorder=recorder)
     app.include_router(proxy_router)
     return app
 
@@ -327,7 +327,7 @@ async def test_replay_returns_recorded_response_no_upstream_call(tmp_path):
     traj.write_text(json.dumps(record) + "\n", encoding="utf-8")
 
     config = ModelServiceConfig()
-    config.replay_traj_path = str(traj)
+    config.replay_traj_file = str(traj)
     app = _build_app(config, replay_cursor=SequentialCursor.load(traj))
 
     transport = ASGITransport(app=app)
@@ -362,7 +362,7 @@ async def test_replay_streaming_emits_recorded_response_as_sse(tmp_path):
     traj.write_text(json.dumps(record) + "\n", encoding="utf-8")
 
     config = ModelServiceConfig()
-    config.replay_traj_path = str(traj)
+    config.replay_traj_file = str(traj)
     app = _build_app(config, replay_cursor=SequentialCursor.load(traj))
 
     transport = ASGITransport(app=app)
@@ -392,7 +392,7 @@ async def test_replay_returns_404_when_cursor_exhausted(tmp_path):
     traj.write_text(json.dumps(record) + "\n", encoding="utf-8")
 
     config = ModelServiceConfig()
-    config.replay_traj_path = str(traj)
+    config.replay_traj_file = str(traj)
     app = _build_app(config, replay_cursor=SequentialCursor.load(traj))
 
     transport = ASGITransport(app=app)
@@ -441,7 +441,7 @@ def test_config_default_host_and_port():
 def test_config_default_traj_and_replay():
     config = ModelServiceConfig()
     assert config.traj_file is None
-    assert config.replay_traj_path is None
+    assert config.replay_traj_file is None
 
 
 @pytest.mark.asyncio
@@ -451,13 +451,13 @@ async def test_config_loads_traj_and_replay_from_file(tmp_path):
         yaml.dump(
             {
                 "traj_file": "/tmp/my-traj.jsonl",
-                "replay_traj_path": "/tmp/in.jsonl",
+                "replay_traj_file": "/tmp/in.jsonl",
             }
         )
     )
     config = ModelServiceConfig.from_file(str(conf_file))
     assert config.traj_file == "/tmp/my-traj.jsonl"
-    assert config.replay_traj_path == "/tmp/in.jsonl"
+    assert config.replay_traj_file == "/tmp/in.jsonl"
 
 
 def test_cli_args_override_config_file(tmp_path):
@@ -501,7 +501,7 @@ def test_cli_traj_file_enables_replay():
         traj_file="/tmp/in.jsonl",
     )
     config = create_config_from_args(args)
-    assert config.replay_traj_path == "/tmp/in.jsonl"
+    assert config.replay_traj_file == "/tmp/in.jsonl"
 
 
 # ---------- Metrics singleton + legacy record_traj (still used by local mode) ----------

From 0e5cb2093ce2d5ef6a2740ce09bf9b32fdc3471c Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 06:56:11 +0000
Subject: [PATCH 09/25] feat(model-service): restore retry on
 retryable_status_codes + connection errors
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Retry was lost when litellm was dropped (it provided num_retries internally).
Restored using a unified _send_with_retry helper that:
  - Always opens upstream with stream=True so the same code path serves both
    stream and non-stream callers (non-stream just await resp.aread()).
  - Retries on httpx.TimeoutException, httpx.ConnectError, and HTTP statuses
    in config.retryable_status_codes (default [429, 500]).
  - Defaults: 6 attempts, exponential backoff 2s→32s with jitter — matches
    the original perform_llm_request behavior.
  - For stream: retry happens before any byte is yielded; mid-stream drops
    are not retried (would corrupt downstream).

Module-level retry constants are read at call time so tests can monkeypatch
them. Added 4 tests covering: success after retry, exhausted retries returning
last response, non-whitelisted status not retried, and stream retry path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 rock/sdk/model/server/api/proxy.py | 211 +++++++++++++++++++++--------
 tests/unit/sdk/model/test_proxy.py | 119 +++++++++++++++-
 2 files changed, 273 insertions(+), 57 deletions(-)

diff --git a/rock/sdk/model/server/api/proxy.py b/rock/sdk/model/server/api/proxy.py
index 2aa911771a..55eec129cf 100644
--- a/rock/sdk/model/server/api/proxy.py
+++ b/rock/sdk/model/server/api/proxy.py
@@ -20,7 +20,9 @@
 
 from __future__ import annotations
 
+import asyncio
 import json
+import random
 import time
 from collections.abc import AsyncIterator
 from typing import Any
@@ -53,6 +55,64 @@
 #   - transfer-encoding / connection: RFC 7230 hop-by-hop, scoped to one connection
 _HEADERS_NOT_TO_FORWARD = frozenset({"host", "content-length", "transfer-encoding", "connection"})
 
+# Retry knobs for upstream POST. Read at call-time so tests can monkeypatch them.
+# Default: up to 6 attempts with exponential backoff (2s → 4s → 8s → 16s → 32s, jittered).
+_RETRY_MAX_ATTEMPTS = 6
+_RETRY_DELAY_SECONDS = 2.0
+_RETRY_BACKOFF = 2.0
+_RETRY_EXCEPTIONS: tuple[type[Exception], ...] = (
+    httpx.TimeoutException,
+    httpx.ConnectError,
+    httpx.HTTPStatusError,
+)
+
+
+async def _send_with_retry(
+    client: httpx.AsyncClient,
+    url: str,
+    *,
+    body_bytes: bytes,
+    headers: dict[str, str],
+    retryable_codes: list[int],
+) -> httpx.Response:
+    """POST with retry on connection errors and whitelisted statuses, returning
+    an open streaming response.
+
+    Always uses ``stream=True`` so the same path serves both stream and non-stream
+    callers — non-stream just calls ``await resp.aread()`` to materialize the body.
+    Assumes a failed upstream returns its error body before any byte is yielded
+    to downstream (so retry can still discard it cleanly).
+
+    Caller MUST ``await resp.aclose()`` after consuming.
+    """
+    last_exc: Exception | None = None
+    delay = _RETRY_DELAY_SECONDS
+    for attempt in range(1, _RETRY_MAX_ATTEMPTS + 1):
+        try:
+            resp = await client.send(
+                client.build_request("POST", url, content=body_bytes, headers=headers),
+                stream=True,
+            )
+        except (httpx.TimeoutException, httpx.ConnectError) as exc:
+            last_exc = exc
+            if attempt >= _RETRY_MAX_ATTEMPTS:
+                raise
+            logger.warning(f"connect failed (attempt {attempt}/{_RETRY_MAX_ATTEMPTS}): {exc}")
+            await asyncio.sleep(random.uniform(0, delay * 2))
+            delay *= _RETRY_BACKOFF
+            continue
+
+        if resp.status_code in retryable_codes and attempt < _RETRY_MAX_ATTEMPTS:
+            await resp.aclose()
+            logger.warning(f"upstream status {resp.status_code}, retry {attempt}/{_RETRY_MAX_ATTEMPTS}")
+            await asyncio.sleep(random.uniform(0, delay * 2))
+            delay *= _RETRY_BACKOFF
+            continue
+
+        return resp
+
+    raise last_exc  # pragma: no cover  # unreachable
+
 
 def get_base_url(model_name: str, config: ModelServiceConfig) -> str:
     """Pick the upstream base URL by model name.
@@ -104,39 +164,65 @@ async def _forward_stream_and_record(
     timeout: float,
     request_dict: dict[str, Any],
     recorder: TrajectoryRecorder | None,
+    retryable_codes: list[int],
 ) -> AsyncIterator[bytes]:
     """SSE bytes are forwarded verbatim; chunks are parsed in parallel and
-    aggregated into the final ChatCompletion that the recorder writes to JSONL."""
+    aggregated into the final ChatCompletion that the recorder writes to JSONL.
+
+    Retry on connection errors and whitelisted statuses happens BEFORE any byte
+    is yielded; mid-stream connection drops are not retried (would corrupt the
+    client transmission)."""
     state = ChatCompletionStreamState()
     start = time.time()
     parse_buffer = b""
     upstream_status = 0
 
-    try:
-        async with httpx.AsyncClient(timeout=timeout) as client:
-            async with client.stream("POST", upstream_url, content=body_bytes, headers=fwd_headers) as r:
-                upstream_status = r.status_code
-                async for chunk in r.aiter_bytes():
-                    yield chunk
-                    chunk_dicts, parse_buffer = parse_sse_data_chunks(parse_buffer + chunk)
-                    for chunk_dict in chunk_dicts:
-                        try:
-                            state.handle_chunk(ChatCompletionChunk.model_validate(chunk_dict))
-                        except Exception as exc:  # parser error: forward continues, traj will be partial
-                            logger.debug(f"[record] chunk parse failed (forward continues): {exc}")
-    except httpx.RequestError as exc:
-        # Connection died mid-stream. The bytes already sent reach the client;
-        # we still try to record what we got.
-        if recorder is not None:
-            await recorder.record(
-                request=request_dict,
-                response=None,
-                status="failure",
-                start_time=start,
-                end_time=time.time(),
-                error=f"{type(exc).__name__}: {exc}",
+    async with httpx.AsyncClient(timeout=timeout) as client:
+        try:
+            resp = await _send_with_retry(
+                client,
+                upstream_url,
+                body_bytes=body_bytes,
+                headers=fwd_headers,
+                retryable_codes=retryable_codes,
             )
-        return
+        except (httpx.TimeoutException, httpx.ConnectError) as exc:
+            if recorder is not None:
+                await recorder.record(
+                    request=request_dict,
+                    response=None,
+                    status="failure",
+                    start_time=start,
+                    end_time=time.time(),
+                    error=f"{type(exc).__name__}: {exc}",
+                )
+            return
+
+        try:
+            upstream_status = resp.status_code
+            async for chunk in resp.aiter_bytes():
+                yield chunk
+                chunk_dicts, parse_buffer = parse_sse_data_chunks(parse_buffer + chunk)
+                for chunk_dict in chunk_dicts:
+                    try:
+                        state.handle_chunk(ChatCompletionChunk.model_validate(chunk_dict))
+                    except Exception as exc:  # parser error: forward continues, traj will be partial
+                        logger.debug(f"[record] chunk parse failed (forward continues): {exc}")
+        except httpx.RequestError as exc:
+            # Connection died mid-stream — bytes already sent reach the client;
+            # record what we got and return.
+            if recorder is not None:
+                await recorder.record(
+                    request=request_dict,
+                    response=None,
+                    status="failure",
+                    start_time=start,
+                    end_time=time.time(),
+                    error=f"{type(exc).__name__}: {exc}",
+                )
+            return
+        finally:
+            await resp.aclose()
 
     if recorder is None:
         return
@@ -216,39 +302,53 @@ async def serve(
                     timeout=self._config.request_timeout,
                     request_dict=request_dict,
                     recorder=self._recorder,
+                    retryable_codes=self._config.retryable_status_codes,
                 ),
                 media_type="text/event-stream",
             )
 
-        # Non-stream: single POST, return upstream's status + body verbatim, record on the side.
+        # Non-stream: same retry path as stream (open with stream=True), then aread() the body.
         start = time.time()
-        try:
-            async with httpx.AsyncClient(timeout=self._config.request_timeout) as client:
-                r = await client.post(upstream_url, content=body_bytes, headers=fwd_headers)
-        except httpx.TimeoutException as exc:
-            if self._recorder is not None:
-                await self._recorder.record(
-                    request=request_dict,
-                    response=None,
-                    status="failure",
-                    start_time=start,
-                    end_time=time.time(),
-                    error=f"timeout: {exc}",
-                )
-            raise HTTPException(status_code=504, detail=f"Upstream timed out: {exc}")
-        except httpx.RequestError as exc:
-            if self._recorder is not None:
-                await self._recorder.record(
-                    request=request_dict,
-                    response=None,
-                    status="failure",
-                    start_time=start,
-                    end_time=time.time(),
-                    error=f"{type(exc).__name__}: {exc}",
+        async with httpx.AsyncClient(timeout=self._config.request_timeout) as client:
+            try:
+                resp = await _send_with_retry(
+                    client,
+                    upstream_url,
+                    body_bytes=body_bytes,
+                    headers=fwd_headers,
+                    retryable_codes=self._config.retryable_status_codes,
                 )
-            raise HTTPException(status_code=502, detail=f"Upstream request failed: {exc}")
-
-        response_text = r.text  # bytes already read by httpx; .text decodes once
+            except httpx.TimeoutException as exc:
+                if self._recorder is not None:
+                    await self._recorder.record(
+                        request=request_dict,
+                        response=None,
+                        status="failure",
+                        start_time=start,
+                        end_time=time.time(),
+                        error=f"timeout: {exc}",
+                    )
+                raise HTTPException(status_code=504, detail=f"Upstream timed out: {exc}")
+            except httpx.RequestError as exc:
+                if self._recorder is not None:
+                    await self._recorder.record(
+                        request=request_dict,
+                        response=None,
+                        status="failure",
+                        start_time=start,
+                        end_time=time.time(),
+                        error=f"{type(exc).__name__}: {exc}",
+                    )
+                raise HTTPException(status_code=502, detail=f"Upstream request failed: {exc}")
+
+            try:
+                response_bytes = await resp.aread()
+                status_code = resp.status_code
+                content_type = resp.headers.get("content-type", "application/json")
+            finally:
+                await resp.aclose()
+
+        response_text = response_bytes.decode("utf-8", errors="replace")
         response_dict: dict | None = None
         try:
             parsed = json.loads(response_text) if response_text else None
@@ -261,15 +361,14 @@ async def serve(
             await self._recorder.record(
                 request=request_dict,
                 response=response_dict,
-                status="success" if r.status_code < 400 else "failure",
+                status="success" if status_code < 400 else "failure",
                 start_time=start,
                 end_time=time.time(),
-                error=None if r.status_code < 400 else f"upstream_status={r.status_code}",
+                error=None if status_code < 400 else f"upstream_status={status_code}",
             )
 
         # Forward bytes verbatim — preserves any provider-specific fields untouched.
-        media_type = r.headers.get("content-type", "application/json")
-        return Response(content=response_text, status_code=r.status_code, media_type=media_type)
+        return Response(content=response_bytes, status_code=status_code, media_type=content_type)
 
 
 CompletionBackend = ReplayBackend | ForwardBackend
diff --git a/tests/unit/sdk/model/test_proxy.py b/tests/unit/sdk/model/test_proxy.py
index caa5fb2254..f47813ac8b 100644
--- a/tests/unit/sdk/model/test_proxy.py
+++ b/tests/unit/sdk/model/test_proxy.py
@@ -247,7 +247,10 @@ def handler(request: httpx.Request) -> httpx.Response:
 
 
 @pytest.mark.asyncio
-async def test_forward_502_on_upstream_connection_failure():
+async def test_forward_502_on_upstream_connection_failure(monkeypatch):
+    """ConnectError → 502. Retry disabled here to keep the test fast."""
+    monkeypatch.setattr("rock.sdk.model.server.api.proxy._RETRY_MAX_ATTEMPTS", 1)
+
     def handler(request: httpx.Request) -> httpx.Response:
         raise httpx.ConnectError("upstream is down")
 
@@ -263,6 +266,120 @@ def handler(request: httpx.Request) -> httpx.Response:
     assert r.status_code == 502
 
 
+# ---------- Forward path: retry ----------
+
+
+@pytest.mark.asyncio
+async def test_forward_retries_on_retryable_status_then_succeeds(monkeypatch):
+    """A 429 is retried; the next attempt's 200 is returned to the client."""
+    monkeypatch.setattr("rock.sdk.model.server.api.proxy._RETRY_DELAY_SECONDS", 0.0)
+
+    attempts = []
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        attempts.append(1)
+        if len(attempts) < 3:
+            return httpx.Response(429, json={"error": "rate limited"})
+        return httpx.Response(200, json=_success_response_json(content="finally"))
+
+    app = _build_app(ModelServiceConfig())  # default retryable_status_codes = [429, 500]
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
+
+    assert r.status_code == 200
+    assert r.json()["choices"][0]["message"]["content"] == "finally"
+    assert len(attempts) == 3
+
+
+@pytest.mark.asyncio
+async def test_forward_returns_last_response_when_retries_exhausted(monkeypatch):
+    """All attempts return 429 → the final 429 body+status is forwarded verbatim."""
+    monkeypatch.setattr("rock.sdk.model.server.api.proxy._RETRY_MAX_ATTEMPTS", 3)
+    monkeypatch.setattr("rock.sdk.model.server.api.proxy._RETRY_DELAY_SECONDS", 0.0)
+
+    attempts = []
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        attempts.append(1)
+        return httpx.Response(429, json={"error": "still rate limited"})
+
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
+
+    assert r.status_code == 429
+    assert r.json() == {"error": "still rate limited"}
+    assert len(attempts) == 3
+
+
+@pytest.mark.asyncio
+async def test_forward_does_not_retry_non_whitelisted_status(monkeypatch):
+    """400 is not in retryable_status_codes → forwarded immediately, no retry."""
+    monkeypatch.setattr("rock.sdk.model.server.api.proxy._RETRY_DELAY_SECONDS", 0.0)
+
+    attempts = []
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        attempts.append(1)
+        return httpx.Response(400, json={"error": "bad request"})
+
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
+            )
+
+    assert r.status_code == 400
+    assert len(attempts) == 1
+
+
+@pytest.mark.asyncio
+async def test_forward_stream_retries_on_retryable_status_then_succeeds(monkeypatch):
+    """Streaming: 500 on first attempt, then 200 SSE on second — client sees only the 200 body."""
+    monkeypatch.setattr("rock.sdk.model.server.api.proxy._RETRY_DELAY_SECONDS", 0.0)
+
+    attempts = []
+    sse_body = (
+        b'data: {"id":"x","object":"chat.completion.chunk","choices":[{"index":0,'
+        b'"delta":{"content":"hello"},"finish_reason":null}]}\n\n'
+        b"data: [DONE]\n\n"
+    )
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        attempts.append(1)
+        if len(attempts) < 2:
+            return httpx.Response(500, json={"error": "internal"})
+        return httpx.Response(200, content=sse_body, headers={"content-type": "text/event-stream"})
+
+    app = _build_app(ModelServiceConfig())
+    with _patch_httpx_with_handler(handler):
+        transport = ASGITransport(app=app)
+        async with AsyncClient(transport=transport, base_url="http://test") as ac:
+            r = await ac.post(
+                "/v1/chat/completions",
+                json={"model": "gpt-3.5-turbo", "stream": True, "messages": [{"role": "user", "content": "hi"}]},
+            )
+
+    body = r.text
+    assert "hello" in body
+    assert "[DONE]" in body
+    assert "internal" not in body  # the 500 attempt is not leaked to client
+    assert len(attempts) == 2
+
+
 # ---------- Forward path: recording ----------
 
 

From 1b20a37f423f7f068b84ed7b82a6d9bf42616362 Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 07:41:41 +0000
Subject: [PATCH 10/25] test(model-service): add e2e tests against an in-thread
 uvicorn mock upstream
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A tiny FastAPI mock app runs in a background thread via uvicorn; the proxy
calls it over real TCP through its own httpx.AsyncClient — production code
path, no transport injection or patching. Three scenarios:
  - non-stream forward: vendor field round-trips, recorder writes JSONL
  - stream forward: SSE chunks reach the client, recorder gets aggregated
    final completion
  - record-then-replay: replay phase uses a bogus base_url to prove the
    upstream isn't called

Tests use FastAPI's TestClient (sync) so the test bodies read top-down with
no async noise; the async wiring lives inside MockUpstreamServer.

Drive-by cleanups in proxy.py: localize the openai SDK imports inside the
streaming aggregator (only needed there), and drop the now-unused
_RETRY_EXCEPTIONS constant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 rock/sdk/model/server/api/proxy.py     |  12 +-
 tests/unit/sdk/model/test_proxy_e2e.py | 222 +++++++++++++++++++++++++
 2 files changed, 227 insertions(+), 7 deletions(-)
 create mode 100644 tests/unit/sdk/model/test_proxy_e2e.py

diff --git a/rock/sdk/model/server/api/proxy.py b/rock/sdk/model/server/api/proxy.py
index 55eec129cf..5fa750e5d2 100644
--- a/rock/sdk/model/server/api/proxy.py
+++ b/rock/sdk/model/server/api/proxy.py
@@ -30,8 +30,6 @@
 import httpx
 from fastapi import APIRouter, HTTPException, Request
 from fastapi.responses import JSONResponse, Response, StreamingResponse
-from openai.lib.streaming.chat import ChatCompletionStreamState
-from openai.types.chat import ChatCompletionChunk
 
 from rock.logger import init_logger
 from rock.sdk.model.server.config import ModelServiceConfig
@@ -60,11 +58,6 @@
 _RETRY_MAX_ATTEMPTS = 6
 _RETRY_DELAY_SECONDS = 2.0
 _RETRY_BACKOFF = 2.0
-_RETRY_EXCEPTIONS: tuple[type[Exception], ...] = (
-    httpx.TimeoutException,
-    httpx.ConnectError,
-    httpx.HTTPStatusError,
-)
 
 
 async def _send_with_retry(
@@ -172,6 +165,11 @@ async def _forward_stream_and_record(
     Retry on connection errors and whitelisted statuses happens BEFORE any byte
     is yielded; mid-stream connection drops are not retried (would corrupt the
     client transmission)."""
+    # openai SDK is used purely as a stream-aggregation parser — keep the import
+    # local so module load doesn't pull it in for callers that never stream.
+    from openai.lib.streaming.chat import ChatCompletionStreamState
+    from openai.types.chat import ChatCompletionChunk
+
     state = ChatCompletionStreamState()
     start = time.time()
     parse_buffer = b""
diff --git a/tests/unit/sdk/model/test_proxy_e2e.py b/tests/unit/sdk/model/test_proxy_e2e.py
new file mode 100644
index 0000000000..c028832b8c
--- /dev/null
+++ b/tests/unit/sdk/model/test_proxy_e2e.py
@@ -0,0 +1,222 @@
+"""End-to-end: real in-process TCP mock upstream + real proxy router + recorder.
+
+The mock upstream is a tiny FastAPI app served by uvicorn in a background thread
+(real TCP). The proxy stays in-process and is hit via FastAPI's ``TestClient``;
+its outbound ``httpx.AsyncClient`` makes a real TCP call to the mock — production
+code path, no transport injection, no patching.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+import threading
+import time
+from collections.abc import Iterator
+from pathlib import Path
+
+import pytest
+import uvicorn
+from fastapi import FastAPI, Request
+from fastapi.responses import JSONResponse, StreamingResponse
+from fastapi.testclient import TestClient
+
+from rock.sdk.model.server.api.proxy import ForwardBackend, ReplayBackend, proxy_router
+from rock.sdk.model.server.config import ModelServiceConfig
+from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
+from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
+from rock.utils.system import find_free_port
+
+# ---------- Mock upstream: a tiny FastAPI app behind a real TCP uvicorn ----------
+
+
+def _build_mock_upstream() -> FastAPI:
+    """One stream + one non-stream reply, with a vendor field to verify byte-passthrough."""
+    app = FastAPI()
+
+    def completion(model: str) -> dict:
+        return {
+            "id": "chatcmpl-mock-1",
+            "object": "chat.completion",
+            "created": 0,
+            "model": model,
+            "choices": [
+                {
+                    "index": 0,
+                    "message": {
+                        "role": "assistant",
+                        "content": "Hello, ROCK!",
+                        "reasoning_content": "thinking step-by-step",
+                    },
+                    "finish_reason": "stop",
+                }
+            ],
+            "usage": {"prompt_tokens": 5, "completion_tokens": 4, "total_tokens": 9},
+        }
+
+    async def stream_gen(model: str):
+        base = {"id": "chatcmpl-mock-1", "object": "chat.completion.chunk", "created": 0, "model": model}
+        for piece in ["Hello", ", ", "ROCK", "!"]:
+            chunk = {
+                **base,
+                "choices": [{"index": 0, "delta": {"role": "assistant", "content": piece}, "finish_reason": None}],
+            }
+            yield f"data: {json.dumps(chunk, ensure_ascii=False)}\n\n".encode()
+            await asyncio.sleep(0.005)
+        final = {**base, "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
+        yield f"data: {json.dumps(final)}\n\n".encode()
+        yield b"data: [DONE]\n\n"
+
+    @app.post("/v1/chat/completions")
+    async def chat_completions(request: Request):
+        body = await request.json()
+        model = body.get("model", "mock")
+        if body.get("stream"):
+            return StreamingResponse(stream_gen(model), media_type="text/event-stream")
+        return JSONResponse(status_code=200, content=completion(model))
+
+    return app
+
+
+class MockUpstreamServer:
+    """Runs ``_build_mock_upstream()`` behind a real TCP uvicorn in a background thread.
+
+    Use as ``with MockUpstreamServer() as base_url: ...``. ``Server.run()``
+    spins up its own asyncio loop inside the thread; we poll ``server.started``
+    to know when it's accepting connections.
+    """
+
+    def __init__(self) -> None:
+        port = asyncio.run(find_free_port())
+        config = uvicorn.Config(
+            _build_mock_upstream(), host="127.0.0.1", port=port, log_level="warning", access_log=False
+        )
+        self._server = uvicorn.Server(config)
+        self._thread = threading.Thread(target=self._server.run, daemon=True)
+        self.base_url = f"http://127.0.0.1:{port}/v1"
+
+    def __enter__(self) -> str:
+        self._thread.start()
+        deadline = time.time() + 5.0
+        while not self._server.started:
+            if time.time() > deadline:
+                raise RuntimeError("mock upstream did not start within 5s")
+            time.sleep(0.02)
+        return self.base_url
+
+    def __exit__(self, *_exc) -> None:
+        self._server.should_exit = True
+        self._thread.join(timeout=5)
+
+
+@pytest.fixture
+def mock_upstream() -> Iterator[str]:
+    with MockUpstreamServer() as base_url:
+        yield base_url
+
+
+# ---------- Proxy app builder + tests ----------
+
+
+def _build_proxy_app(*, mock_url: str, traj_file: Path | None = None, replay_cursor=None) -> FastAPI:
+    config = ModelServiceConfig()
+    config.proxy_base_url = mock_url
+
+    app = FastAPI()
+    app.state.model_service_config = config
+    if replay_cursor is not None:
+        app.state.backend = ReplayBackend(replay_cursor)
+    else:
+        recorder = TrajectoryRecorder(traj_file=traj_file) if traj_file is not None else None
+        app.state.backend = ForwardBackend(config, recorder=recorder)
+    app.include_router(proxy_router)
+    return app
+
+
+def test_e2e_non_stream_forwards_and_records(mock_upstream, tmp_path):
+    """Vendor field reaches the client; recorder writes a JSONL line with the full response."""
+    traj_file = tmp_path / "traj.jsonl"
+    proxy_app = _build_proxy_app(mock_url=mock_upstream, traj_file=traj_file)
+
+    with TestClient(proxy_app) as client:
+        r = client.post(
+            "/v1/chat/completions",
+            json={"model": "mock-model", "messages": [{"role": "user", "content": "hi"}]},
+            headers={"Authorization": "Bearer test-key"},
+        )
+
+    assert r.status_code == 200
+    msg = r.json()["choices"][0]["message"]
+    assert msg["content"] == "Hello, ROCK!"
+    assert msg["reasoning_content"] == "thinking step-by-step"
+
+    rec = json.loads(traj_file.read_text(encoding="utf-8").strip())
+    assert rec["status"] == "success"
+    assert rec["stream"] is False
+    assert rec["response"]["choices"][0]["message"]["reasoning_content"] == "thinking step-by-step"
+
+
+def test_e2e_stream_forwards_chunks_and_records_aggregated(mock_upstream, tmp_path):
+    """Each upstream SSE chunk reaches the client; recorder gets the aggregated final completion."""
+    traj_file = tmp_path / "traj.jsonl"
+    proxy_app = _build_proxy_app(mock_url=mock_upstream, traj_file=traj_file)
+
+    with TestClient(proxy_app) as client:
+        with client.stream(
+            "POST",
+            "/v1/chat/completions",
+            json={"model": "mock-model", "stream": True, "messages": [{"role": "user", "content": "hi"}]},
+            headers={"Authorization": "Bearer test-key"},
+        ) as r:
+            body = b"".join(r.iter_bytes()).decode("utf-8")
+
+    for piece in ["Hello", "ROCK"]:
+        assert f'"content": "{piece}"' in body
+    assert '"finish_reason": "stop"' in body
+    assert body.rstrip().endswith("data: [DONE]")
+
+    rec = json.loads(traj_file.read_text(encoding="utf-8").strip())
+    assert rec["status"] == "success"
+    assert rec["stream"] is True
+    assert rec["response"]["choices"][0]["message"]["content"] == "Hello, ROCK!"
+
+
+def test_e2e_record_then_replay_returns_same_content(mock_upstream, tmp_path):
+    """Record one non-stream + one stream call, then replay them without touching the upstream."""
+    traj_file = tmp_path / "traj.jsonl"
+
+    # ---- record phase ----
+    proxy_record = _build_proxy_app(mock_url=mock_upstream, traj_file=traj_file)
+    with TestClient(proxy_record) as client:
+        r1 = client.post(
+            "/v1/chat/completions",
+            json={"model": "mock-model", "messages": [{"role": "user", "content": "hi"}]},
+        )
+        with client.stream(
+            "POST",
+            "/v1/chat/completions",
+            json={"model": "mock-model", "stream": True, "messages": [{"role": "user", "content": "hi"}]},
+        ) as st:
+            for _ in st.iter_bytes():
+                pass
+    recorded = r1.json()
+
+    # ---- replay phase: bogus base_url proves the upstream isn't called ----
+    cursor = SequentialCursor.load(traj_file)
+    proxy_replay = _build_proxy_app(mock_url="http://invalid.local:1/v1", replay_cursor=cursor)
+    with TestClient(proxy_replay) as client:
+        ns2 = client.post(
+            "/v1/chat/completions",
+            json={"model": "mock-model", "messages": [{"role": "user", "content": "different"}]},
+        )
+        with client.stream(
+            "POST",
+            "/v1/chat/completions",
+            json={"model": "mock-model", "stream": True, "messages": [{"role": "user", "content": "different"}]},
+        ) as st:
+            st_body = b"".join(st.iter_bytes()).decode("utf-8")
+
+    assert ns2.status_code == 200
+    assert ns2.json()["choices"][0]["message"]["content"] == recorded["choices"][0]["message"]["content"]
+    assert "Hello, ROCK!" in st_body
+    assert st_body.rstrip().endswith("data: [DONE]")

From 6a36702e5ecc0f04cc3781acd6f810ce679ab23b Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 08:07:44 +0000
Subject: [PATCH 11/25] fix(model-service): inject positional index into
 replay-stream tool_calls deltas

A recorded non-stream message.tool_calls carries no 'index' field, but the
OpenAI streaming spec requires it on chunk deltas. Without it, downstream
clients using the openai SDK reject the replay-stream chunk with a pydantic
ValidationError ('Field required: index'). completion_to_chunk_dict now
injects a positional index when missing (existing 'index' is preserved).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 rock/sdk/model/server/sse_utils.py     |  7 ++++
 tests/unit/sdk/model/test_sse_utils.py | 55 +++++++++++++++++++++++++-
 2 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/rock/sdk/model/server/sse_utils.py b/rock/sdk/model/server/sse_utils.py
index 1cbe27298f..f1cca034e6 100644
--- a/rock/sdk/model/server/sse_utils.py
+++ b/rock/sdk/model/server/sse_utils.py
@@ -65,11 +65,18 @@ def completion_to_chunk_dict(response: dict, *, model: str) -> dict:
     Only ``message`` → ``delta`` is renamed; every other field (including
     provider-specific extras like ``reasoning_content`` inside the message)
     flows through unchanged. ``id`` / ``created`` are synthesized when missing.
+
+    ``tool_calls`` items get a positional ``index`` injected if missing — the
+    OpenAI streaming spec requires it on chunk deltas (a recorded non-stream
+    ``message.tool_calls`` carries no ``index``, but downstream stream parsers
+    e.g. the openai SDK will reject the chunk without one).
     """
     choices_in = response.get("choices") or []
     choices_out = []
     for choice in choices_in:
         delta = dict(choice.get("message") or {})
+        if "tool_calls" in delta and delta["tool_calls"]:
+            delta["tool_calls"] = [{"index": tc.get("index", i), **tc} for i, tc in enumerate(delta["tool_calls"])]
         choices_out.append(
             {
                 "index": choice.get("index", 0),
diff --git a/tests/unit/sdk/model/test_sse_utils.py b/tests/unit/sdk/model/test_sse_utils.py
index 6c9318a510..f660d2751b 100644
--- a/tests/unit/sdk/model/test_sse_utils.py
+++ b/tests/unit/sdk/model/test_sse_utils.py
@@ -101,7 +101,8 @@ def test_completion_to_chunk_renames_message_to_delta():
 
 
 def test_completion_to_chunk_preserves_provider_specific_message_fields():
-    """reasoning_content / tool_calls / etc inside message are kept verbatim in delta."""
+    """reasoning_content kept verbatim; tool_calls get a positional index injected
+    (required by the OpenAI streaming spec — see test below)."""
     response = {
         "choices": [
             {
@@ -119,10 +120,60 @@ def test_completion_to_chunk_preserves_provider_specific_message_fields():
     chunk = completion_to_chunk_dict(response, model="glm-5")
 
     assert chunk["choices"][0]["delta"]["reasoning_content"] == "step-by-step thinking"
-    assert chunk["choices"][0]["delta"]["tool_calls"] == [{"id": "t1", "type": "function"}]
+    assert chunk["choices"][0]["delta"]["tool_calls"] == [{"index": 0, "id": "t1", "type": "function"}]
     assert chunk["choices"][0]["finish_reason"] == "tool_calls"
 
 
+def test_completion_to_chunk_injects_tool_call_index_for_openai_sdk_compat():
+    """A recorded non-stream message has tool_calls without 'index'; the OpenAI
+    streaming spec requires it on chunk deltas, and the openai SDK's
+    ChatCompletionChunk.model_validate() rejects the chunk otherwise. We inject
+    a positional index so replay-stream output is parseable by strict clients."""
+    response = {
+        "choices": [
+            {
+                "index": 0,
+                "message": {
+                    "role": "assistant",
+                    "tool_calls": [
+                        {"id": "a", "type": "function", "function": {"name": "f1", "arguments": "{}"}},
+                        {"id": "b", "type": "function", "function": {"name": "f2", "arguments": "{}"}},
+                    ],
+                },
+                "finish_reason": "tool_calls",
+            }
+        ],
+    }
+    chunk = completion_to_chunk_dict(response, model="m")
+    tcs = chunk["choices"][0]["delta"]["tool_calls"]
+    assert [tc["index"] for tc in tcs] == [0, 1]
+
+    # End-to-end: openai SDK accepts the chunk
+    from openai.types.chat import ChatCompletionChunk
+
+    ChatCompletionChunk.model_validate(chunk)  # must not raise
+
+
+def test_completion_to_chunk_preserves_explicit_tool_call_index():
+    """If the recorded tool_calls already have 'index', we don't overwrite it."""
+    response = {
+        "choices": [
+            {
+                "index": 0,
+                "message": {
+                    "role": "assistant",
+                    "tool_calls": [
+                        {"index": 5, "id": "a", "type": "function", "function": {"name": "f", "arguments": "{}"}},
+                    ],
+                },
+                "finish_reason": "tool_calls",
+            }
+        ],
+    }
+    chunk = completion_to_chunk_dict(response, model="m")
+    assert chunk["choices"][0]["delta"]["tool_calls"][0]["index"] == 5
+
+
 def test_completion_to_chunk_synthesizes_id_and_created_when_missing():
     chunk = completion_to_chunk_dict(
         {"choices": [{"index": 0, "message": {"role": "assistant"}, "finish_reason": "stop"}]},

From 79c868e37519249b054728ae114d7db79ecc9153 Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 08:10:16 +0000
Subject: [PATCH 12/25] test(model-service): refactor proxy e2e into
 MockUpstream + TestProxyRecordReplay
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Rename test_proxy_e2e.py → test_proxy_record_replay.py to make the file
purpose explicit (the suite revolves around the record→replay capability).

Refactor the test surface:
  - MockUpstream class encapsulates the FastAPI app, server lifecycle, the
    canonical reply, and an assert_message() helper. Test data and the
    handler stay in sync because they share the same constants.
  - TestProxyRecordReplay class groups the three tests with shorter names:
      test_forward_non_stream
      test_forward_stream
      test_replay  (parametrized over record × replay stream/non-stream)
  - _call_chat_completions helper unifies stream/non-stream call sites.

Expand coverage to 2 parallel tool_calls (get_weather + get_time) — exercises
the openai SDK aggregator's multi-index tool_call assembly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 tests/unit/sdk/model/test_proxy_e2e.py        | 222 ------------
 .../sdk/model/test_proxy_record_replay.py     | 333 ++++++++++++++++++
 2 files changed, 333 insertions(+), 222 deletions(-)
 delete mode 100644 tests/unit/sdk/model/test_proxy_e2e.py
 create mode 100644 tests/unit/sdk/model/test_proxy_record_replay.py

diff --git a/tests/unit/sdk/model/test_proxy_e2e.py b/tests/unit/sdk/model/test_proxy_e2e.py
deleted file mode 100644
index c028832b8c..0000000000
--- a/tests/unit/sdk/model/test_proxy_e2e.py
+++ /dev/null
@@ -1,222 +0,0 @@
-"""End-to-end: real in-process TCP mock upstream + real proxy router + recorder.
-
-The mock upstream is a tiny FastAPI app served by uvicorn in a background thread
-(real TCP). The proxy stays in-process and is hit via FastAPI's ``TestClient``;
-its outbound ``httpx.AsyncClient`` makes a real TCP call to the mock — production
-code path, no transport injection, no patching.
-"""
-
-from __future__ import annotations
-
-import asyncio
-import json
-import threading
-import time
-from collections.abc import Iterator
-from pathlib import Path
-
-import pytest
-import uvicorn
-from fastapi import FastAPI, Request
-from fastapi.responses import JSONResponse, StreamingResponse
-from fastapi.testclient import TestClient
-
-from rock.sdk.model.server.api.proxy import ForwardBackend, ReplayBackend, proxy_router
-from rock.sdk.model.server.config import ModelServiceConfig
-from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
-from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
-from rock.utils.system import find_free_port
-
-# ---------- Mock upstream: a tiny FastAPI app behind a real TCP uvicorn ----------
-
-
-def _build_mock_upstream() -> FastAPI:
-    """One stream + one non-stream reply, with a vendor field to verify byte-passthrough."""
-    app = FastAPI()
-
-    def completion(model: str) -> dict:
-        return {
-            "id": "chatcmpl-mock-1",
-            "object": "chat.completion",
-            "created": 0,
-            "model": model,
-            "choices": [
-                {
-                    "index": 0,
-                    "message": {
-                        "role": "assistant",
-                        "content": "Hello, ROCK!",
-                        "reasoning_content": "thinking step-by-step",
-                    },
-                    "finish_reason": "stop",
-                }
-            ],
-            "usage": {"prompt_tokens": 5, "completion_tokens": 4, "total_tokens": 9},
-        }
-
-    async def stream_gen(model: str):
-        base = {"id": "chatcmpl-mock-1", "object": "chat.completion.chunk", "created": 0, "model": model}
-        for piece in ["Hello", ", ", "ROCK", "!"]:
-            chunk = {
-                **base,
-                "choices": [{"index": 0, "delta": {"role": "assistant", "content": piece}, "finish_reason": None}],
-            }
-            yield f"data: {json.dumps(chunk, ensure_ascii=False)}\n\n".encode()
-            await asyncio.sleep(0.005)
-        final = {**base, "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
-        yield f"data: {json.dumps(final)}\n\n".encode()
-        yield b"data: [DONE]\n\n"
-
-    @app.post("/v1/chat/completions")
-    async def chat_completions(request: Request):
-        body = await request.json()
-        model = body.get("model", "mock")
-        if body.get("stream"):
-            return StreamingResponse(stream_gen(model), media_type="text/event-stream")
-        return JSONResponse(status_code=200, content=completion(model))
-
-    return app
-
-
-class MockUpstreamServer:
-    """Runs ``_build_mock_upstream()`` behind a real TCP uvicorn in a background thread.
-
-    Use as ``with MockUpstreamServer() as base_url: ...``. ``Server.run()``
-    spins up its own asyncio loop inside the thread; we poll ``server.started``
-    to know when it's accepting connections.
-    """
-
-    def __init__(self) -> None:
-        port = asyncio.run(find_free_port())
-        config = uvicorn.Config(
-            _build_mock_upstream(), host="127.0.0.1", port=port, log_level="warning", access_log=False
-        )
-        self._server = uvicorn.Server(config)
-        self._thread = threading.Thread(target=self._server.run, daemon=True)
-        self.base_url = f"http://127.0.0.1:{port}/v1"
-
-    def __enter__(self) -> str:
-        self._thread.start()
-        deadline = time.time() + 5.0
-        while not self._server.started:
-            if time.time() > deadline:
-                raise RuntimeError("mock upstream did not start within 5s")
-            time.sleep(0.02)
-        return self.base_url
-
-    def __exit__(self, *_exc) -> None:
-        self._server.should_exit = True
-        self._thread.join(timeout=5)
-
-
-@pytest.fixture
-def mock_upstream() -> Iterator[str]:
-    with MockUpstreamServer() as base_url:
-        yield base_url
-
-
-# ---------- Proxy app builder + tests ----------
-
-
-def _build_proxy_app(*, mock_url: str, traj_file: Path | None = None, replay_cursor=None) -> FastAPI:
-    config = ModelServiceConfig()
-    config.proxy_base_url = mock_url
-
-    app = FastAPI()
-    app.state.model_service_config = config
-    if replay_cursor is not None:
-        app.state.backend = ReplayBackend(replay_cursor)
-    else:
-        recorder = TrajectoryRecorder(traj_file=traj_file) if traj_file is not None else None
-        app.state.backend = ForwardBackend(config, recorder=recorder)
-    app.include_router(proxy_router)
-    return app
-
-
-def test_e2e_non_stream_forwards_and_records(mock_upstream, tmp_path):
-    """Vendor field reaches the client; recorder writes a JSONL line with the full response."""
-    traj_file = tmp_path / "traj.jsonl"
-    proxy_app = _build_proxy_app(mock_url=mock_upstream, traj_file=traj_file)
-
-    with TestClient(proxy_app) as client:
-        r = client.post(
-            "/v1/chat/completions",
-            json={"model": "mock-model", "messages": [{"role": "user", "content": "hi"}]},
-            headers={"Authorization": "Bearer test-key"},
-        )
-
-    assert r.status_code == 200
-    msg = r.json()["choices"][0]["message"]
-    assert msg["content"] == "Hello, ROCK!"
-    assert msg["reasoning_content"] == "thinking step-by-step"
-
-    rec = json.loads(traj_file.read_text(encoding="utf-8").strip())
-    assert rec["status"] == "success"
-    assert rec["stream"] is False
-    assert rec["response"]["choices"][0]["message"]["reasoning_content"] == "thinking step-by-step"
-
-
-def test_e2e_stream_forwards_chunks_and_records_aggregated(mock_upstream, tmp_path):
-    """Each upstream SSE chunk reaches the client; recorder gets the aggregated final completion."""
-    traj_file = tmp_path / "traj.jsonl"
-    proxy_app = _build_proxy_app(mock_url=mock_upstream, traj_file=traj_file)
-
-    with TestClient(proxy_app) as client:
-        with client.stream(
-            "POST",
-            "/v1/chat/completions",
-            json={"model": "mock-model", "stream": True, "messages": [{"role": "user", "content": "hi"}]},
-            headers={"Authorization": "Bearer test-key"},
-        ) as r:
-            body = b"".join(r.iter_bytes()).decode("utf-8")
-
-    for piece in ["Hello", "ROCK"]:
-        assert f'"content": "{piece}"' in body
-    assert '"finish_reason": "stop"' in body
-    assert body.rstrip().endswith("data: [DONE]")
-
-    rec = json.loads(traj_file.read_text(encoding="utf-8").strip())
-    assert rec["status"] == "success"
-    assert rec["stream"] is True
-    assert rec["response"]["choices"][0]["message"]["content"] == "Hello, ROCK!"
-
-
-def test_e2e_record_then_replay_returns_same_content(mock_upstream, tmp_path):
-    """Record one non-stream + one stream call, then replay them without touching the upstream."""
-    traj_file = tmp_path / "traj.jsonl"
-
-    # ---- record phase ----
-    proxy_record = _build_proxy_app(mock_url=mock_upstream, traj_file=traj_file)
-    with TestClient(proxy_record) as client:
-        r1 = client.post(
-            "/v1/chat/completions",
-            json={"model": "mock-model", "messages": [{"role": "user", "content": "hi"}]},
-        )
-        with client.stream(
-            "POST",
-            "/v1/chat/completions",
-            json={"model": "mock-model", "stream": True, "messages": [{"role": "user", "content": "hi"}]},
-        ) as st:
-            for _ in st.iter_bytes():
-                pass
-    recorded = r1.json()
-
-    # ---- replay phase: bogus base_url proves the upstream isn't called ----
-    cursor = SequentialCursor.load(traj_file)
-    proxy_replay = _build_proxy_app(mock_url="http://invalid.local:1/v1", replay_cursor=cursor)
-    with TestClient(proxy_replay) as client:
-        ns2 = client.post(
-            "/v1/chat/completions",
-            json={"model": "mock-model", "messages": [{"role": "user", "content": "different"}]},
-        )
-        with client.stream(
-            "POST",
-            "/v1/chat/completions",
-            json={"model": "mock-model", "stream": True, "messages": [{"role": "user", "content": "different"}]},
-        ) as st:
-            st_body = b"".join(st.iter_bytes()).decode("utf-8")
-
-    assert ns2.status_code == 200
-    assert ns2.json()["choices"][0]["message"]["content"] == recorded["choices"][0]["message"]["content"]
-    assert "Hello, ROCK!" in st_body
-    assert st_body.rstrip().endswith("data: [DONE]")
diff --git a/tests/unit/sdk/model/test_proxy_record_replay.py b/tests/unit/sdk/model/test_proxy_record_replay.py
new file mode 100644
index 0000000000..1bb9832784
--- /dev/null
+++ b/tests/unit/sdk/model/test_proxy_record_replay.py
@@ -0,0 +1,333 @@
+"""End-to-end: real in-process TCP mock upstream + real proxy router + recorder.
+
+The mock upstream is a tiny FastAPI app served by uvicorn in a background thread
+(real TCP). The proxy stays in-process and is hit via FastAPI's ``TestClient``;
+its outbound ``httpx.AsyncClient`` makes a real TCP call to the mock — production
+code path, no transport injection, no patching.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+import threading
+import time
+from collections.abc import Iterator
+from pathlib import Path
+
+import pytest
+import uvicorn
+from fastapi import FastAPI, Request
+from fastapi.responses import JSONResponse, StreamingResponse
+from fastapi.testclient import TestClient
+
+from rock.sdk.model.server.api.proxy import ForwardBackend, ReplayBackend, proxy_router
+from rock.sdk.model.server.config import ModelServiceConfig
+from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
+from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
+from rock.sdk.model.server.sse_utils import parse_sse_data_chunks
+from rock.utils.system import find_free_port
+
+# ---------------------------------------------------------------------------
+# Mock upstream — a tiny FastAPI app behind a real TCP uvicorn in a thread.
+# Owns both the canned reply AND the assertion helper, so the response shape
+# and the expectations stay in sync if either is edited.
+# ---------------------------------------------------------------------------
+
+
+class MockUpstream:
+    """Mock OpenAI-compatible upstream.
+
+    Single canonical reply (returned for both stream and non-stream requests)
+    contains three fields the proxy must preserve end-to-end:
+      - ``content``            (plain text)
+      - ``reasoning_content``  (vendor-specific thinking)
+      - ``tool_calls``         (a function call)
+    The streaming variant splits each field into multiple deltas so the
+    recorder also exercises the openai SDK's stream-state aggregator.
+
+    Use as ``with MockUpstream() as mock: ...``; ``mock.base_url`` points at
+    the running server. ``mock.assert_message(msg)`` checks any received
+    assistant message matches the canonical reply.
+    """
+
+    # Canonical reply values — change here, both the handler and the assertion
+    # helper pick them up automatically. Two parallel tool_calls cover the
+    # multi-tool-call case (modern LLMs commonly emit several at once).
+    EXPECTED_CONTENT = "Checking weather and time for you."
+    EXPECTED_REASONING = "User wants weather + time; calling both tools in parallel."
+    EXPECTED_TOOL_CALLS = [
+        {
+            "id": "call_weather",
+            "type": "function",
+            "function": {"name": "get_weather", "arguments": '{"city":"Tokyo","unit":"celsius"}'},
+        },
+        {
+            "id": "call_time",
+            "type": "function",
+            "function": {"name": "get_time", "arguments": '{"city":"Tokyo"}'},
+        },
+    ]
+
+    def __init__(self) -> None:
+        port = asyncio.run(find_free_port())
+        config = uvicorn.Config(self._build_app(), host="127.0.0.1", port=port, log_level="warning", access_log=False)
+        self._server = uvicorn.Server(config)
+        self._thread = threading.Thread(target=self._server.run, daemon=True)
+        self.base_url = f"http://127.0.0.1:{port}/v1"
+
+    # ---- lifecycle ----
+
+    def __enter__(self) -> MockUpstream:
+        self._thread.start()
+        deadline = time.time() + 5.0
+        while not self._server.started:
+            if time.time() > deadline:
+                raise RuntimeError("mock upstream did not start within 5s")
+            time.sleep(0.02)
+        return self
+
+    def __exit__(self, *_exc) -> None:
+        self._server.should_exit = True
+        self._thread.join(timeout=5)
+
+    # ---- assertion helper ----
+
+    def assert_message(self, msg: dict) -> None:
+        """Assert ``msg`` is the canonical full message (content + reasoning + 2 parallel tool_calls)."""
+        assert msg["content"] == self.EXPECTED_CONTENT
+        assert msg["reasoning_content"] == self.EXPECTED_REASONING
+        tcs = msg["tool_calls"]
+        assert len(tcs) == len(self.EXPECTED_TOOL_CALLS)
+        for actual, expected in zip(tcs, self.EXPECTED_TOOL_CALLS, strict=True):
+            assert actual["id"] == expected["id"]
+            assert actual["type"] == expected["type"]
+            assert actual["function"]["name"] == expected["function"]["name"]
+            assert json.loads(actual["function"]["arguments"]) == json.loads(expected["function"]["arguments"])
+
+    # ---- internal: FastAPI app + handlers ----
+
+    def _build_app(self) -> FastAPI:
+        app = FastAPI()
+
+        @app.post("/v1/chat/completions")
+        async def chat_completions(request: Request):
+            body = await request.json()
+            model = body.get("model", "mock")
+            if body.get("stream"):
+                return StreamingResponse(self._stream_gen(model), media_type="text/event-stream")
+            return JSONResponse(status_code=200, content=self._completion_json(model))
+
+        return app
+
+    def _completion_json(self, model: str) -> dict:
+        return {
+            "id": "chatcmpl-mock-1",
+            "object": "chat.completion",
+            "created": 0,
+            "model": model,
+            "choices": [
+                {
+                    "index": 0,
+                    "message": {
+                        "role": "assistant",
+                        "content": self.EXPECTED_CONTENT,
+                        "reasoning_content": self.EXPECTED_REASONING,
+                        "tool_calls": self.EXPECTED_TOOL_CALLS,
+                    },
+                    "finish_reason": "tool_calls",
+                }
+            ],
+            "usage": {"prompt_tokens": 12, "completion_tokens": 24, "total_tokens": 36},
+        }
+
+    async def _stream_gen(self, model: str):
+        base = {"id": "chatcmpl-mock-1", "object": "chat.completion.chunk", "created": 0, "model": model}
+
+        def emit(delta: dict, finish_reason=None) -> bytes:
+            payload = {**base, "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}]}
+            return f"data: {json.dumps(payload, ensure_ascii=False)}\n\n".encode()
+
+        # 1-2. Reasoning split in two deltas
+        yield emit({"role": "assistant", "reasoning_content": "User wants weather + time; "})
+        await asyncio.sleep(0.005)
+        yield emit({"reasoning_content": "calling both tools in parallel."})
+        await asyncio.sleep(0.005)
+        # 3-4. Content split in two deltas
+        yield emit({"content": "Checking weather"})
+        await asyncio.sleep(0.005)
+        yield emit({"content": " and time for you."})
+        await asyncio.sleep(0.005)
+
+        # 5-7. tool_call[0] (get_weather): announce, then arguments in two pieces
+        yield emit(
+            {
+                "tool_calls": [
+                    {
+                        "index": 0,
+                        "id": "call_weather",
+                        "type": "function",
+                        "function": {"name": "get_weather", "arguments": ""},
+                    }
+                ]
+            }
+        )
+        await asyncio.sleep(0.005)
+        yield emit({"tool_calls": [{"index": 0, "function": {"arguments": '{"city":"Tokyo",'}}]})
+        await asyncio.sleep(0.005)
+        yield emit({"tool_calls": [{"index": 0, "function": {"arguments": '"unit":"celsius"}'}}]})
+        await asyncio.sleep(0.005)
+
+        # 8-9. tool_call[1] (get_time): announce + arguments in one piece
+        yield emit(
+            {
+                "tool_calls": [
+                    {
+                        "index": 1,
+                        "id": "call_time",
+                        "type": "function",
+                        "function": {"name": "get_time", "arguments": ""},
+                    }
+                ]
+            }
+        )
+        await asyncio.sleep(0.005)
+        yield emit({"tool_calls": [{"index": 1, "function": {"arguments": '{"city":"Tokyo"}'}}]})
+        await asyncio.sleep(0.005)
+
+        # 10. Finish
+        yield emit({}, finish_reason="tool_calls")
+        yield b"data: [DONE]\n\n"
+
+
+@pytest.fixture
+def mock_upstream() -> Iterator[MockUpstream]:
+    with MockUpstream() as m:
+        yield m
+
+
+# ---------------------------------------------------------------------------
+# Proxy app builder + request helper (module-level, generic)
+# ---------------------------------------------------------------------------
+
+
+def _build_proxy_app(*, mock_url: str | None = None, traj_file: Path | None = None, replay_cursor=None) -> FastAPI:
+    config = ModelServiceConfig()
+    # ReplayBackend never calls upstream, so mock_url is only relevant for forward mode.
+    if mock_url is not None:
+        config.proxy_base_url = mock_url
+
+    app = FastAPI()
+    app.state.model_service_config = config
+    if replay_cursor is not None:
+        app.state.backend = ReplayBackend(replay_cursor)
+    else:
+        recorder = TrajectoryRecorder(traj_file=traj_file) if traj_file is not None else None
+        app.state.backend = ForwardBackend(config, recorder=recorder)
+    app.include_router(proxy_router)
+    return app
+
+
+def _call_chat_completions(client: TestClient, *, stream: bool) -> dict:
+    """One chat.completions call. Returns the assistant message dict.
+
+    - non-stream: just unwraps ``choices[0].message``.
+    - stream: replay always emits exactly one chunk + ``[DONE]`` (see
+      ``completion_to_chunk_dict``), so the chunk's ``delta`` IS the full
+      message — no aggregation needed.
+    """
+    payload = {"model": "mock-model", "messages": [{"role": "user", "content": "hi"}]}
+    if not stream:
+        r = client.post("/v1/chat/completions", json=payload)
+        assert r.status_code == 200
+        return r.json()["choices"][0]["message"]
+
+    with client.stream("POST", "/v1/chat/completions", json={**payload, "stream": True}) as r:
+        assert r.status_code == 200
+        body_bytes = b"".join(r.iter_bytes())
+    chunks, _ = parse_sse_data_chunks(body_bytes)
+    return chunks[0]["choices"][0]["delta"]
+
+
+# ---------------------------------------------------------------------------
+# Tests
+# ---------------------------------------------------------------------------
+
+
+class TestProxyRecordReplay:
+    """End-to-end: real TCP mock upstream <-> real proxy router + recorder/replayer."""
+
+    def test_forward_non_stream(self, mock_upstream: MockUpstream, tmp_path):
+        """Vendor field reaches the client; recorder writes a JSONL line with the full response."""
+        traj_file = tmp_path / "traj.jsonl"
+        proxy_app = _build_proxy_app(mock_url=mock_upstream.base_url, traj_file=traj_file)
+
+        with TestClient(proxy_app) as client:
+            r = client.post(
+                "/v1/chat/completions",
+                json={"model": "mock-model", "messages": [{"role": "user", "content": "hi"}]},
+                headers={"Authorization": "Bearer test-key"},
+            )
+
+        assert r.status_code == 200
+        body = r.json()
+        assert body["choices"][0]["finish_reason"] == "tool_calls"
+        mock_upstream.assert_message(body["choices"][0]["message"])
+
+        rec = json.loads(traj_file.read_text(encoding="utf-8").strip())
+        assert rec["status"] == "success"
+        assert rec["stream"] is False
+        assert rec["response"]["choices"][0]["finish_reason"] == "tool_calls"
+        mock_upstream.assert_message(rec["response"]["choices"][0]["message"])
+
+    def test_forward_stream(self, mock_upstream: MockUpstream, tmp_path):
+        """Each upstream SSE chunk reaches the client; recorder gets the aggregated final completion
+        with reasoning_content concatenated and tool_calls.arguments assembled from deltas."""
+        traj_file = tmp_path / "traj.jsonl"
+        proxy_app = _build_proxy_app(mock_url=mock_upstream.base_url, traj_file=traj_file)
+
+        with TestClient(proxy_app) as client:
+            with client.stream(
+                "POST",
+                "/v1/chat/completions",
+                json={"model": "mock-model", "stream": True, "messages": [{"role": "user", "content": "hi"}]},
+                headers={"Authorization": "Bearer test-key"},
+            ) as r:
+                body = b"".join(r.iter_bytes()).decode("utf-8")
+
+        # Raw chunks make it to the client untouched
+        assert '"reasoning_content": "User wants weather + time; "' in body
+        assert '"reasoning_content": "calling both tools in parallel."' in body
+        assert '"content": "Checking weather"' in body
+        assert '"content": " and time for you."' in body
+        assert '"name": "get_weather"' in body
+        assert '"name": "get_time"' in body
+        assert '"finish_reason": "tool_calls"' in body
+        assert body.rstrip().endswith("data: [DONE]")
+
+        # Recorder's aggregated message matches the canonical reply
+        rec = json.loads(traj_file.read_text(encoding="utf-8").strip())
+        assert rec["status"] == "success"
+        assert rec["stream"] is True
+        assert rec["response"]["choices"][0]["finish_reason"] == "tool_calls"
+        mock_upstream.assert_message(rec["response"]["choices"][0]["message"])
+
+    @pytest.mark.parametrize("replay_stream", [False, True], ids=["replay_nonstream", "replay_stream"])
+    @pytest.mark.parametrize("record_stream", [False, True], ids=["record_nonstream", "record_stream"])
+    def test_replay(self, mock_upstream: MockUpstream, tmp_path, record_stream: bool, replay_stream: bool):
+        """Recorded mode and replayed mode are orthogonal — all 4 combinations of
+        (stream/non-stream) on each side must yield the same full message."""
+        traj_file = tmp_path / "traj.jsonl"
+
+        # ---- record phase ----
+        proxy_record = _build_proxy_app(mock_url=mock_upstream.base_url, traj_file=traj_file)
+        with TestClient(proxy_record) as client:
+            _call_chat_completions(client, stream=record_stream)
+
+        # ---- replay phase: no upstream URL needed — ReplayBackend never calls upstream ----
+        cursor = SequentialCursor.load(traj_file)
+        proxy_replay = _build_proxy_app(replay_cursor=cursor)
+        with TestClient(proxy_replay) as client:
+            msg = _call_chat_completions(client, stream=replay_stream)
+
+        mock_upstream.assert_message(msg)

From c7bbd5a458adae985792d03ba5935e33a454218d Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 08:12:57 +0000
Subject: [PATCH 13/25] chore: remove useless comment in pyproject.toml

---
 pyproject.toml | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/pyproject.toml b/pyproject.toml
index ac87c14c41..d7d7a591b0 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -86,11 +86,6 @@ model-service = [
     "psutil",
     "swebench",
     "alibabacloud_cr20181201==2.0.5",
-    # openai SDK is used as a TYPE/parser library only — for ChatCompletionChunk
-    # validation and ChatCompletionStreamState (the official stream chunk aggregator).
-    # We do NOT use AsyncOpenAI as an HTTP client; transport is plain httpx so the
-    # proxy can forward upstream bytes verbatim, including any provider-specific
-    # fields (reasoning_content, citations, ...) without re-encoding OpenAI protocol.
     "openai>=1.50.0",
     "httpx",
 ]

From 56a644cd04a01675b09656e5b800992ed7a807f4 Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 08:14:33 +0000
Subject: [PATCH 14/25] chore: remove uesless dev docs

---
 docs/dev/litellm_proxy_refactor.md | 535 -----------------------------
 1 file changed, 535 deletions(-)
 delete mode 100644 docs/dev/litellm_proxy_refactor.md

diff --git a/docs/dev/litellm_proxy_refactor.md b/docs/dev/litellm_proxy_refactor.md
deleted file mode 100644
index be30dbb4b2..0000000000
--- a/docs/dev/litellm_proxy_refactor.md
+++ /dev/null
@@ -1,535 +0,0 @@
-# LiteLLM 重构 model-service proxy + 加 record/replay —— Handoff 文档
-
-> 这份文档是给"接手者"(可能是另一个 Claude session 或人)看的,目的是让接手者**完全不看上一段对话**也能从我离开的地方继续往下做。文档放在 `docs/dev/litellm_proxy_refactor.md`。
-
----
-
-## 0. TL;DR
-
-**目标**:把 `rock model-service --type proxy` 的自写 httpx forward + retry 替换为基于 `litellm` SDK 的实现;同时把 chat/completions 轨迹的"录制 + 顺序回放"作为一等公民能力做进来,服务 SWE-agent / mini-swe-agent / OpenHands 类 deterministic agent 的"无 LLM 成本"调试。
-
-**当前状态**:**代码改动、单元测试、lint 全部完成通过**。下一步是集成验证(实际起 proxy + curl)和写 PR。
-
-**完成清单**:
-- ✅ `pyproject.toml` `model-service` extras 加 `litellm>=1.50.0`
-- ✅ `ModelServiceConfig` 加 `traj_enabled / traj_file / traj_append / replay_enabled / replay_traj_path / num_retries` 6 个字段
-- ✅ 新模块 `rock/sdk/model/server/integrations/{__init__.py, traj_recorder.py, traj_replayer.py}`
-- ✅ `rock/sdk/model/server/api/proxy.py` 整文件重写为 litellm SDK 调用
-- ✅ `rock/sdk/model/server/main.py` 加 `_configure_litellm_for_proxy()` + 新 CLI flags(`--num-retries / --traj-file / --no-traj / --replay-traj`)
-- ✅ `rock/sdk/model/server/utils.py` 保留 `record_traj` 装饰器(给 local 模式继续用),proxy 模式不再用
-- ✅ `tests/unit/sdk/model/test_proxy.py` 改造完成(把 `patch perform_llm_request` 改为 `patch litellm.acompletion`)
-- ✅ 新测试 `tests/unit/sdk/model/test_traj_recorder.py` + `test_traj_replayer.py`
-- ✅ `examples/model_service/config_record.yaml` + `config_replay.yaml`
-- ✅ **单元测试全部通过**(`uv run pytest tests/unit/sdk/model/` → 47 passed)
-- ✅ **Lint/format 全部干净**(`ruff check` + `ruff format --check`，修了一个 `Optional[str]` → `str | None` 的 UP045)
-
-**未完成 / 阻塞**:
-- ⏳ **集成验证**(实际起 proxy + curl + agent 端到端,见第 4.4 节)
-- ⏳ **PR 描述里的 breaking change 提示**(见第 5 节)
-
-**原始 plan 文件**(更详细的设计推演):`/home/xinshi/.claude/plans/litellm-chat-completions-traj-replay-ser-lucky-rainbow.md`(在主 Claude 配置目录,不在 rock 仓内)。
-
----
-
-## 1. 背景与目标
-
-### 起因
-
-用户问:"litellm 能支持把 chat/completions 接口的轨迹落盘吗,然后我想看看能否支持根据 traj 文件做一个 replay server, 比如给一些其他的 agent (swe-agent, openhands) 等用来做 traj 回放"。
-
-### 需求方向的几次迭代(避免接手者重走弯路)
-
-1. **第一版方向**:做一个独立 Python 项目 `litellm-traj`,里面定义 `CustomLogger` 子类(record)和 `CustomLLM` 子类(replay),通过 dotted-path 注册到 litellm proxy 的 `config.yaml`。**已废弃**。
-2. **第二版方向**:在 rock 仓内把这个能力做进 `rock/sdk/model/server/api/proxy.py`(rock 已有 model-service)。但用户进一步要求:**重构掉 rock 自写的 proxy 实现,改为基于 litellm**。
-3. **最终方向(本次)**:用 **litellm SDK** 替换 `proxy.py` 内手写的 httpx forward + `retry_async`;record 接 `CustomLogger`,replay 接 `CustomLLM` provider。`rock model-service` CLI、`local` 模式、FastAPI app/health/metrics 全部保留不动 —— 只动 proxy 模式。
-
-### 为什么是 litellm SDK 而不是 litellm proxy
-
-我们已经有 rock 自己的 FastAPI app + CLI + auth/metrics middleware,只需要一个"OpenAI 兼容上游调用 + 错误归一化 + 流式聚合 + record/replay 接入点"。**litellm SDK 是这层能力的最小外加**,不需要把 litellm proxy 整套生命周期/配置体系拽进来。litellm proxy 适合"完全没有 server 的人"用,我们已经有 server。
-
-### 用户最终拍板的 4 个关键设计选择
-
-| 维度 | 选择 | 理由 |
-|---|---|---|
-| 集成模式 | **litellm SDK** | 改动面最小,保留 rock 既有 FastAPI/CLI/metrics |
-| traj schema | **`StandardLoggingPayload`(litellm 原生)** | 字段最全(messages/response/usage/timing/error_information/trace_id),与 litellm 生态互通 |
-| 是否本期做 replay | **是,record + replay 一起** | 用户原始诉求就是回放;基础设施一次性铺好 |
-| 流式 | **顺便解禁** | litellm 自动聚合,record/replay 走流式不增加复杂度 |
-
----
-
-## 2. 改动清单(按文件)
-
-### 2.1 `pyproject.toml` —— 修改
-
-`[project.optional-dependencies]` 的 `model-service` 数组追加一项 `"litellm>=1.50.0"`。其它 extras 不动。
-
-```toml
-model-service = [
-    "fastapi",
-    "uvicorn",
-    "psutil",
-    "swebench",
-    "alibabacloud_cr20181201==2.0.5",
-    "litellm>=1.50.0",   # ← 这一行新加
-]
-```
-
-为什么是 `>=1.50.0`:这个版本之后 `StandardLoggingPayload`、`CustomLogger.async_log_success_event` 接口、`async_mock_completion_streaming_obj` 都已稳定。本仓现有 model-service 测试集没装过 litellm,所以全新引入,不存在升级冲突。
-
-### 2.2 `rock/sdk/model/server/config.py` —— 修改
-
-在 `ModelServiceConfig` 末尾新增 6 个字段(注意顺序、类型、默认值):
-
-```python
-num_retries: int = Field(default=6)
-
-traj_enabled: bool = Field(default=True)
-traj_file: str | None = Field(default=None)
-traj_append: bool = Field(default=True)   # 注意:旧默认是 False(覆盖),这里翻成 True
-
-replay_enabled: bool = Field(default=False)
-replay_traj_path: str | None = Field(default=None)
-```
-
-每个字段的语义和取值范围都写在 docstring 里。`traj_append=True` 是这次的**默认行为变更**(旧的 `_write_traj` 默认覆盖,被认为是 bug)。`TRAJ_FILE`、`LOG_FILE`、`LOG_DIR` 模块级常量保留不动。
-
-### 2.3 `rock/sdk/model/server/integrations/__init__.py` —— 新增(空文件)
-
-只为了让 `integrations` 成为一个包,内容为空。
-
-### 2.4 `rock/sdk/model/server/integrations/traj_recorder.py` —— 新增
-
-`TrajectoryRecorder(CustomLogger)`,实现两个钩子:`async_log_success_event` 和 `async_log_failure_event`。每次调用从 `kwargs["standard_logging_object"]` 取出 `StandardLoggingPayload`(dict 形态),append 一行 JSON 到 `traj_file`,同时上报 OTLP `model_service.request.{rt,count}` metrics。
-
-关键设计点(展开见第 3.1 节):
-- streaming 不分支(litellm 已在 callback 触发前把 chunks 聚合写入 `payload.response`)
-- `asyncio.Lock` per recorder + `asyncio.to_thread` 包同步写,避免在 event loop 阻塞
-- `append=False` 模式只在**首次写**时截断(避免每次调用覆盖)
-- metrics 复用 `rock.sdk.model.server.utils._get_or_create_metrics_monitor`,`MODEL_SERVICE_REQUEST_RT/COUNT` 常量
-
-### 2.5 `rock/sdk/model/server/integrations/traj_replayer.py` —— 新增
-
-包含两个类 + 两个 helper:
-
-- `SequentialCursor`:从 jsonl 文件或目录加载 records,`async next()` 返回下一条并推进游标,越界 raise `CustomLLMError(404)`。带 `asyncio.Lock` 防并发推进。`reset()` 用于回到起点。
-- `_record_to_model_response(record)` / `_extract_assistant_text(record)`:把 record 还原成 `litellm.types.utils.ModelResponse` 或抽出 assistant text(给 streaming 用)。
-- `TrajectoryReplayer(CustomLLM)`:实现 `acompletion` 和 `astreaming`。流式拆分直接调 `litellm.utils.async_mock_completion_streaming_obj`,不自己造轮子。
-
-`acompletion`/`astreaming` 的签名是 `(self, model, messages, *args, **kwargs)`。litellm 调 CustomLLM 时**全部用关键字参数**(litellm/main.py:4302-4319 实测),所以 `kwargs.get("model_response")` 能可靠拿到流式拆分需要的目标对象。
-
-### 2.6 `rock/sdk/model/server/utils.py` —— 修改(保留 + 注释更新)
-
-**关键决定**:不删 `record_traj` / `_write_traj`。原因:`local.py` 仍在用 `@record_traj`,plan 阶段说过"local 模式不动";所以 record_traj 保留,docstring 加一段说明"proxy 不再用,只给 local 用",新引导走 `TrajectoryRecorder`。
-
-`_get_or_create_metrics_monitor` / `MODEL_SERVICE_REQUEST_RT` / `MODEL_SERVICE_REQUEST_COUNT` 不动 —— `traj_recorder.py` 复用之。
-
-### 2.7 `rock/sdk/model/server/api/proxy.py` —— 整文件重写
-
-旧实现:
-- `httpx.AsyncClient` 全局 + `@retry_async` 6 次指数退避
-- `perform_llm_request(url, body, headers, config)` 自管 retry
-- `@record_traj` 挂在 handler 上同步落盘 + metrics
-- 强制 `stream=False`(MVP 限制)
-
-新实现:
-- `litellm.acompletion(model, api_base, extra_headers, timeout, num_retries, **body)`
-- 错误归一化:catch `RateLimitError / APIError / BadRequestError / AuthenticationError / Timeout` → `_format_error_response()` 回退到 `{error:{message,type,code}}` schema(agent 端关键字检测兼容)
-- 流式开放:`stream=True` 走 `StreamingResponse(_sse_iter(...))`
-- 不再有装饰器 —— record 落盘改由 `main.py` 在启动时挂的 `litellm.callbacks` 完成
-
-`get_base_url()` 路由优先级**完全保留**(`proxy_base_url` > `proxy_rules[model]` > `proxy_rules["default"]`)。`_filter_headers()` 把 hop-by-hop headers(host/content-length/content-type/transfer-encoding/connection)滤掉,Authorization 等保留。
-
-replay 模式下:`litellm_model = f"traj-replay/{model_name}"`,`api_base=None`。litellm 看到 `traj-replay/` 前缀会查 `litellm.custom_provider_map`,找到 `TrajectoryReplayer` 实例并调它的 `acompletion`/`astreaming`。
-
-### 2.8 `rock/sdk/model/server/main.py` —— 修改
-
-新增私有函数 `_configure_litellm_for_proxy(config)`,在 `main()` 进入 proxy 分支时(`include_router(proxy_router)` 之前)调用一次。两个分支:
-
-```python
-if config.replay_enabled:
-    # 注册 TrajectoryReplayer 到 litellm.custom_provider_map
-    ...
-elif config.traj_enabled:
-    # 把 TrajectoryRecorder 加到 litellm.callbacks
-    ...
-```
-
-**注意**:replay 和 record 互斥(replay 不要再录,否则录回放结果会污染 source-of-truth)。
-
-`create_config_from_args()` 新增 4 个 CLI override:`--num-retries / --traj-file / --no-traj / --replay-traj`。所有用 `getattr(args, "<name>", default)` 的方式取,这样老的调用方(传不带这些字段的 Namespace)不会炸。
-
-`from rock.sdk.model.server.config import TRAJ_FILE, ModelServiceConfig` —— 新增 `TRAJ_FILE` 导入,因为 `_configure_litellm_for_proxy` 在 `traj_file` 未指定时回退到 `TRAJ_FILE`。
-
-### 2.9 `tests/unit/sdk/model/test_proxy.py` —— 重写
-
-- 删除:`test_perform_llm_request_*`(4 个,perform_llm_request 已不存在)
-- 改造:`test_chat_completions_routing_*`、`test_proxy_base_url_overrides_proxy_rules` —— `patch_path` 从 `proxy.perform_llm_request` 改为 `proxy.litellm.acompletion`
-- 改造:断言从"perform_llm_request 第一个位置参数 == URL"改为"litellm.acompletion kwargs 中 `api_base == 期望值`,`model == 'openai/<name>'`"
-- 新增:`test_chat_completions_passes_num_retries_and_timeout` / `test_chat_completions_litellm_error_returns_proxy_schema` / `test_chat_completions_replay_mode_uses_traj_replay_provider` / `test_chat_completions_strips_hop_by_hop_headers` / `test_config_default_traj_and_replay` / `test_config_loads_traj_and_replay_from_file` / `test_cli_replay_traj_enables_replay`
-- 保留:所有 lifespan / config-load / metrics-monitor / record_traj 测试(record_traj 在 utils.py 还在,给 local 用)
-
-mock 返回的 ModelResponse:用 `SimpleNamespace(model_dump=lambda: payload)` 假装一个 pydantic 对象 —— 因为 handler 只调 `.model_dump()`,不需要真 import 整个 ModelResponse。
-
-### 2.10 `tests/unit/sdk/model/test_traj_recorder.py` —— 新增
-
-7 个测试:JSONL append / `append=False` 首次截断 / metrics + sandbox_id / failure 落盘 / 缺 standard_logging_object 跳过 / 自动建父目录 / `response_time` 缺失时回退到 `endTime - startTime`。
-
-mock 思路:`patch("rock.sdk.model.server.integrations.traj_recorder._get_or_create_metrics_monitor", return_value=mock_monitor)` —— recorder 内部 import 了这个函数,mock 它的引用。
-
-### 2.11 `tests/unit/sdk/model/test_traj_replayer.py` —— 新增
-
-11 个测试:cursor 加载单文件/目录(按文件名 sort)/空行/缺失文件 raise / `next()` 顺序返回 / 越界 raise / `reset()` 回到起点 / model mismatch 只 warn / Replayer.acompletion 命中 record / cursor 推进 / streaming chunk 拼回 == 原文 / 越界 raise CustomLLMError。
-
-streaming 测试构造一个 `SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(role=None, content=None), index=0)])` 当 model_response,因为 `async_mock_completion_streaming_obj` 内部会写 `model_response.choices[0].delta.content = ...`。
-
-### 2.12 `examples/model_service/config_record.yaml` 和 `config_replay.yaml` —— 新增
-
-两份开箱即用的 yaml,带详细注释。`config_record.yaml` 默认开 `traj_enabled: true / traj_append: true`,关 replay。`config_replay.yaml` 默认关 traj_enabled / 开 replay,`replay_traj_path: "/data/logs/LLMTraj.jsonl"` 占位 —— 实际部署时根据 traj 位置改。
-
-### 2.13 `/mnt/xinshi/github/litellm-traj/` —— 已删除
-
-第一版独立项目骨架(`pyproject.toml / src/litellm_traj/cursor.py / .gitignore / LICENSE`)在方向变更时已 `rm -rf`。所有有效内容都迁回了 rock 的 integrations/ 模块。
-
----
-
-## 3. 关键代码细节(踩坑点 + "为什么这么写")
-
-下文展开几个最容易让接手者迷失的设计选择。每一项都标了 litellm 仓内的源码定位(litellm 主仓在 `/mnt/xinshi/github/litellm/`),便于交叉验证。
-
-### 3.1 Streaming 聚合在 litellm 内部完成,Recorder 不需要分支
-
-`StandardLoggingPayload.response` 字段在 `success_handler` 触发前**已经是聚合完整的 OpenAI shape dict**。流式与非流式走同一条路径:litellm 在 streaming 结束时调用 `stream_chunk_builder` 拼出 `complete_streaming_response`(litellm 仓 `litellm/litellm_core_utils/litellm_logging.py:1930-1955`),然后写入 `standard_logging_object.response`。
-
-实际后果:`TrajectoryRecorder.async_log_success_event` 拿到的 payload 永远含完整 response,我**不需要写 `async_log_stream_event`**。这也是为什么 stream 解禁几乎"零成本" —— 录制端无任何额外代码。
-
-### 3.2 `model: "openai/<name>"` 前缀的含义
-
-litellm 把"provider"前缀作为路由依据。`openai/gpt-3.5-turbo` 表示"上游是 OpenAI 兼容协议的服务,模型名叫 gpt-3.5-turbo"。配合 `api_base="https://api.modelscope.cn/v1"` 这种第三方 OpenAI 兼容 endpoint 也能用 —— 这正是 rock 现有 `proxy_rules` 里的 ModelScope/OpenAI 等场景。
-
-`traj-replay/<name>` 是我们注册的自定义 provider。litellm 看到这个前缀会查 `litellm.custom_provider_map`,匹配到 `provider == "traj-replay"` 的项,把 `custom_handler.acompletion`/`astreaming` 当上游调(litellm 仓 `litellm/main.py:4280-4326`)。
-
-### 3.3 错误归一化:为什么 catch 那 5 个 exception
-
-`proxy.py` catch 顺序:`RateLimitError, APIError, BadRequestError, AuthenticationError, Timeout`。这五个在 `litellm/exceptions.py` 全部继承自 `openai.OpenAIError` 派生类,**都带 `.status_code` 属性**。`_format_error_response` 用 `getattr(exc, "status_code", None) or 502` 提取上游真实状态码;message 走 `str(exc)` —— litellm 异常的 `__str__` 已经包含"上游原始 error message",所以 agent 端的关键字检测(如 `"context length exceeded"` / `"content violation"`)继续工作。
-
-`type` 字段用 `type(exc).__name__`(`"BadRequestError"` 等),不再是旧的固定 `"proxy_retry_failed"`。这是 schema 的语义变化:同一个 `error.type` 字段,旧版本返回固定字符串,新版本返回 exception 类名。如果有下游消费 `error.type` 做分支,需要适配。
-
-兜底 `except Exception` 走 `HTTPException(500)`,会被 `main.py` 里的 `global_exception_handler` 接住,返回 `{error:{message,type:"internal_error",code:"internal_error"}}` —— 这条路径与重构前完全一致。
-
-### 3.4 retry 行为:从 `retry_async` 切到 `litellm.num_retries`
-
-旧实现:`@retry_async(max_attempts=6, delay_seconds=2.0, backoff=2.0, jitter=True, exceptions=(TimeoutException, ConnectError, HTTPStatusError))`。仅在 `status_code in retryable_status_codes` 时 raise,这样 401 不会触发 retry,而 429/500 会。
-
-新实现:`config.num_retries`(默认 6) 直接传给 `litellm.acompletion(num_retries=...)`。litellm 内部对 `RateLimitError / APIError / Timeout / ServiceUnavailableError` 自动重试,**不暴露 `retryable_status_codes` 维度**。我保留 `retryable_status_codes` 字段在 config 里,但当前**handler 没用它**(向后兼容旧 yaml,不会因为多了字段而 reject)。
-
-如果将来有人投诉"自定义重试码列表失效",这是已知的语义差异。fallback 方案:在 handler 里手写 `for attempt in range(config.num_retries):` 包一层,根据 status code 做白名单。本期不做,因为 litellm 默认行为已经覆盖最常见的 429/500。
-
-### 3.5 `_filter_headers` 黑名单 vs 白名单
-
-我用黑名单:`host / content-length / content-type / transfer-encoding / connection` 不转发,其余全部透传给 litellm 的 `extra_headers`。这与旧实现保持一致(旧的也是去掉前 4 个,新增 connection 是为了更标准)。Authorization/X-* 等都自动通过。
-
-注意:`extra_headers` 在 litellm 里被合并到上游 HTTP 请求里(litellm 自己的 OpenAI client),不会覆盖 litellm 自己生成的 `Authorization: Bearer <api_key>`。如果 rock 不主动设 `OPENAI_API_KEY`,而 client 又传了 Authorization header,litellm 会用 client 的;反之 litellm 会用环境变量。这一层逻辑全在 litellm 自己。
-
-### 3.6 `traj_append=False` 的"首次截断"行为
-
-旧 `_write_traj` 在 `append=False` 时**每次调用都 `mode="w"`**,导致 jsonl 永远只有最后一行 —— 这是个 bug。
-
-新 `TrajectoryRecorder` 的修复:维护一个 `self._truncated` 实例标志;`append=False` 时,**第一次写**用 `mode="w"`(覆盖上一进程留下的旧 traj),**后续写**用 `mode="a"`(本进程内 append)。所以:
-- 进程启动时:旧 traj 文件清空(如果存在)
-- 进程运行中:每次调用 append 一行
-- 进程重启:再次清空,从头记
-
-效果上等于"per-run 一份完整 traj"。我把这个语义在 docstring 里讲清楚了,因为这是和旧默认行为最不同的一点。
-
-`traj_append=True`(新默认)就是纯 append-only,不管旧文件。
-
-### 3.7 SequentialCursor 的并发模型
-
-`async next()` 用 `asyncio.Lock` 保护索引 + 自增。**单进程多并发请求场景下** cursor 推进是原子的,但**含义是"按到达顺序消费"**,所以多个 agent 并发打过来会被串成一个伪顺序 —— 这是 v1 的已知约束(plan 里明确列出),约定"单 agent 串行回放"。
-
-**model mismatch 只 warn 不 raise**:expected_model 来自调用方传入,recorded model 来自 record 内的 `model` 字段。两者不一致只打 warning,record 仍然返回。理由:agent 端可能切换了 base_url 但没改 model 名(常见调试场景),不该硬阻塞。
-
-### 3.8 CustomLLM 的调用约定 —— `*args, **kwargs` 收尾很重要
-
-`litellm/main.py:4302-4319` 实测调用方式是**全关键字参数**:
-```python
-response = handler_fn(
-    model=model, messages=messages, headers=headers,
-    model_response=model_response, print_verbose=...,
-    api_key=..., api_base=..., acompletion=..., logging_obj=...,
-    optional_params=..., litellm_params=..., logger_fn=...,
-    timeout=..., custom_prompt_dict=..., client=..., encoding=...,
-)
-```
-但 litellm 各小版本会不会增减字段不确定。`TrajectoryReplayer.acompletion(self, model, messages, *args, **kwargs)` 这种"显式 model+messages,其余吞掉"的签名,既能 PEP-484 注解,又对 litellm 后续加字段免疫。
-
-**不要改成 `def acompletion(self, model, messages, *, optional_params, ...)`** 否则 litellm 加新字段时会 TypeError。
-
-### 3.9 `LITELLM_TRAJ_FILE` env vs `traj_file` 字段
-
-我没引入新 env var。`config.traj_file` 在 `main.py:_configure_litellm_for_proxy` 里通过 `config.traj_file or TRAJ_FILE` 取值,而 `TRAJ_FILE` 来自 `config.py:13`,= `LOG_DIR + "/LLMTraj.jsonl"`,`LOG_DIR = env_vars.ROCK_MODEL_SERVICE_DATA_DIR`(默认 `/data/logs`)。
-
-所以路径优先级:`--traj-file CLI` > `traj_file: yaml` > `LOG_DIR/LLMTraj.jsonl`(LOG_DIR 受 `ROCK_MODEL_SERVICE_DATA_DIR` env 控制)。和旧体系一致。
-
-### 3.10 `record_traj` 装饰器为什么保留
-
-`local.py:75` 仍然用 `@record_traj` 装饰它的 chat_completions handler。local 模式不调 litellm,FileHandler 直接通过文件 marker 跟 Roll 通信 —— 没有 litellm callback 触发的窗口。所以为了保留 local 模式的"调用次数 + RT 上报",我把 `record_traj` 留在 `utils.py`,让 local 继续用,docstring 写明"proxy 模式不再用,改走 TrajectoryRecorder"。
-
-代价:local 模式录的 traj schema 是旧的 `{request, response}`,proxy 模式是 `StandardLoggingPayload`。两种 schema 共存于同一个 `LLMTraj.jsonl` 文件路径上(因为 `TRAJ_FILE` 是同一个常量)。**实际部署时 local 和 proxy 用同一个进程的概率为 0**(`--type` 互斥),所以同一个 traj 文件不会混合两种 schema。但如果有人定时切换 `--type` 跑 + `traj_append=true` 不轮换文件,会出现混合。文档建议:**replay 时只读 proxy 模式录的 traj**(StandardLoggingPayload 格式),local 模式的 traj 仅用于 local 调试。
-
----
-
-## 4. 跑测试 / 验证步骤(接手者从这里继续)
-
-### 4.1 准备 Python 环境
-
-**已验证**:`uv sync` 后 litellm 已正常安装。使用 `uv run` 执行,不需要手动激活 venv。
-
-```bash
-cd /mnt/xinshi/github/Self-ROCK
-uv sync --extra model-service --group test
-```
-
-验证依赖(已通过):
-
-```bash
-uv run python -c "from litellm.integrations.custom_logger import CustomLogger; print('ok')"
-uv run python -c "from litellm.llms.custom_llm import CustomLLM, CustomLLMError; print('ok')"
-uv run python -c "from litellm.utils import async_mock_completion_streaming_obj; print('ok')"
-```
-
-### 4.2 静态检查 / lint
-
-```bash
-uv run ruff check rock/sdk/model/server/ tests/unit/sdk/model/
-uv run ruff format --check rock/sdk/model/server/ tests/unit/sdk/model/
-```
-
-如果 ruff format 报 diff,直接 `uv run ruff format rock/sdk/model/server/ tests/unit/sdk/model/` 修。代码写的时候我没跑 ruff,可能有 line-length / import 排序之类的小问题。
-
-### 4.3 单测(已全部通过)
-
-```bash
-uv run pytest tests/unit/sdk/model/ -v
-# → 47 passed in ~4s
-```
-
-**已验证通过的测试集**:
-- `test_proxy.py` (27 个):routing/error/replay/header/cli/config/metrics
-- `test_traj_recorder.py` (7 个):JSONL append/truncate/metrics/failure/missing payload/mkdir/rt fallback
-- `test_traj_replayer.py` (11 个):cursor 加载/顺序/越界/reset/model mismatch/acompletion/streaming/exhaustion
-- `test_model_client.py` (2 个):原有测试保留通过
-
-**已知但不影响测试的边界情况**(生产注意):
-- tool_calls 场景下 `_extract_assistant_text` 返回 `""`,replay 流式会返回空流(已知限制,不在本期范围)
-- `litellm.callbacks` 是全局 list,测试隔离靠 patch,生产只起一次 server 无问题
-
-### 4.4 集成验证(测试通过后)
-
-#### Record 模式
-
-```bash
-# 终端 1
-export OPENAI_API_KEY="sk-..."
-export ROCK_MODEL_SERVICE_DATA_DIR=/tmp/rock-traj
-mkdir -p /tmp/rock-traj
-uv run python -m rock.sdk.model.server.main \
-    --type proxy \
-    --config-file examples/model_service/config_record.yaml \
-    --port 8080
-
-# 终端 2
-curl -X POST http://127.0.0.1:8080/v1/chat/completions \
-    -H "Authorization: Bearer $OPENAI_API_KEY" \
-    -H "Content-Type: application/json" \
-    -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"say hi"}]}'
-
-# 验证 traj
-cat /tmp/rock-traj/LLMTraj.jsonl | jq '.id, .model, .response.choices[0].message.content'
-# 应该看到 chatcmpl-xxx / gpt-3.5-turbo / "..."
-```
-
-#### Replay 模式
-
-```bash
-# 终端 1
-uv run python -m rock.sdk.model.server.main \
-    --type proxy \
-    --replay-traj /tmp/rock-traj/LLMTraj.jsonl \
-    --port 8081
-
-# 终端 2 - 同样的 curl 打 8081
-curl -X POST http://127.0.0.1:8081/v1/chat/completions \
-    -H "Content-Type: application/json" \
-    -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"anything (replay ignores msgs)"}]}'
-
-# 应该返回与录制时同样的 response.choices[0].message.content
-# 第二次 curl 会 404(traj exhausted),证明 cursor 在工作
-```
-
-#### Streaming 验证
-
-```bash
-curl -N -X POST http://127.0.0.1:8080/v1/chat/completions \
-    -H "Authorization: Bearer $OPENAI_API_KEY" \
-    -H "Content-Type: application/json" \
-    -d '{"model":"gpt-3.5-turbo","stream":true,"messages":[{"role":"user","content":"count to 5"}]}'
-# 应该看到 SSE chunks: data: {...}\n\n ... data: [DONE]\n\n
-# traj 文件里那一行的 .stream == true,.response 是聚合后的完整 dict
-```
-
-#### Agent 端到端(最终验证)
-
-`mini-swe-agent` 跑一个 SWE-bench 实例,base_url 指向 8080(record),完了用同 instance 接 8081(replay),期望 agent 最终生成的 patch 与录制时一致。这是最强 check,但跑起来麻烦,可以在 PR review 阶段再做。
-
----
-
-## 5. Breaking Changes(PR 描述里必须写清楚)
-
-### 5.1 traj 文件 schema 改变
-
-`LLMTraj.jsonl` 每行从 `{"request": {...}, "response": {...}}` 变成 `StandardLoggingPayload`(几十个字段:`id/trace_id/model/messages/response/model_parameters/usage/startTime/endTime/status/...`)。
-
-如果有下游消费者依赖旧的两字段 schema(脚本、UI、统计),会破坏。本期不提供"双格式输出"或"旧→新转换"工具,如有需要可单独写 `scripts/convert_traj.py`。
-
-### 5.2 `traj_append` 默认值翻转
-
-旧的 `ROCK_MODEL_SERVICE_TRAJ_APPEND_MODE` 默认 `"false"` → `_write_traj` 用 `mode="w"`,实际表现是"每次调用覆盖,文件只剩最后一条"。新的 `ModelServiceConfig.traj_append` 默认 `True`(append-only)。
-
-如果有人**之前依赖每次都覆盖来获取"最近一次调用"**(很罕见但可能),需要在 yaml 显式设 `traj_append: false`。
-
-### 5.3 `error.type` 字段语义变化
-
-旧值:固定字符串 `"proxy_retry_failed"`(retry 用尽)或 `"internal_error"`(其他)。
-新值:litellm 异常类名,如 `"BadRequestError" / "RateLimitError" / "Timeout" / "AuthenticationError" / "APIError"`。
-
-`error.message` 仍以 `"LLM backend error: ..."` 开头,关键字检测兼容。
-
-### 5.4 `retryable_status_codes` 字段不再生效
-
-旧版本根据 `retryable_status_codes` 白名单决定哪些状态码触发 retry(如 401 不 retry,429/500 retry)。新版本由 litellm 内部决定(对 `RateLimitError / APIError / Timeout / ServiceUnavailableError` 自动 retry,4xx 一般不 retry)。
-
-字段保留在 yaml 不报错,但 handler 不读它。如果将来需要恢复白名单,见 3.4 节"fallback 方案"。
-
-### 5.5 `stream=true` 不再被强制拒绝
-
-旧版本对 `stream=true` 返回 400 + `"Streaming requests (stream=True) are not supported"`。新版本正常处理,返回 SSE。
-
-如果有 client 之前**依赖** 400 来探测"是否启用流式",会破坏。但这种用法很反常,基本不会有。
-
-### 5.6 `perform_llm_request` 函数已删除
-
-下游不应该 import 这个 —— 它本来就是 proxy.py 内的 helper。如果有 test/script 直接 import 它,需要适配。`tests/unit/sdk/model/test_proxy.py` 我已改完。
-
-### 5.7 新的依赖
-
-`pip install rl-rock[model-service]` 会多装 litellm(及其依赖链:`openai>=1.x / tiktoken / aiohttp / tokenizers / ...`)。包大小 +~50MB。
-
----
-
-## 6. 已知坑 / 接手时的注意事项
-
-### 6.1 `local.py` 仍在 import `record_traj`
-
-我**没改 local.py**(plan 明确"local 不动")。`local.py:12` 的 `from rock.sdk.model.server.utils import record_traj` 仍然成立,因为 utils.py 保留了 record_traj。如果接手者看到这个 import 想清理,**不要清理** —— 那会破坏 local 模式。
-
-### 6.2 `litellm.callbacks` 是全局 list
-
-`main.py:_configure_litellm_for_proxy` 用 `litellm.callbacks.append(recorder)`。如果同一进程多次启动(测试场景),会注册多次,导致每次调用落多份 traj。生产部署只跑一次没问题。**如果要写"重复初始化也安全"的逻辑**,可以改成 `if not any(isinstance(cb, TrajectoryRecorder) for cb in litellm.callbacks): litellm.callbacks.append(recorder)`。我没做,因为生产路径是"启动一次"。
-
-同理 `litellm.custom_provider_map = [...]` 是赋值不是 append,所以 replay 重复初始化是幂等的。
-
-### 6.3 SequentialCursor 在测试里要小心 cursor 跨用例
-
-`SequentialCursor` 是实例属性 `self._idx`,每个测试自己 `SequentialCursor.load(p)` 都是新实例,不会跨用例污染。但如果有人写"模块级单例 replayer + 多个测试调它"的 fixture,会撞 idx。当前测试都是 per-test 实例,OK。
-
-### 6.4 `litellm` import 较慢
-
-litellm import 时会加载几个 OpenAI/HuggingFace 客户端,首次 import 可能 1-2 秒。`main.py` 把 `import litellm` 放在 `_configure_litellm_for_proxy()` 内部(函数级延迟 import),只在 proxy 模式启动时触发。`proxy.py` 是模块顶级 `import litellm`,handler 文件首次加载就触发 —— 这是 fastapi 路由注册时的开销,不影响请求路径性能。
-
-### 6.5 `pyproject.toml` 的 `tzdata` 依赖
-
-我看到 pyproject.toml 里 ide_diagnostics 报 `httpx/uuid/anyio/tzdata/...` 未安装 —— 这是 ide 当前 Python 环境没装 rock 主仓依赖,与本次改动无关。`uv sync` 后这些 hint 自动消失。
-
-### 6.6 `__pycache__` 残留
-
-旧 `proxy.py` 有 `__pycache__/proxy.cpython-310.pyc`。重写后第一次 import 会重新生成,**正常情况下没问题**。如果跑测试时报 `ImportError: cannot import name 'perform_llm_request'`,先 `find rock -name __pycache__ -exec rm -rf {} +` 清掉缓存。
-
-### 6.7 别忘了 `extra_headers` 可能含敏感信息
-
-`_filter_headers` 把所有非 hop-by-hop header 透传给上游,包括 client 传的 `Authorization`。这是**故意的** —— 让 client 自己带 API key 是 rock 现有约定。但意味着 traj 录的 `StandardLoggingPayload.metadata.headers`(如果有) 可能含 Bearer token。litellm 自己有 `turn_off_message_logging` / `redact_user_api_key_info` 等开关,**目前没启用**。如果将来 traj 文件要分发,需要先脱敏。
-
----
-
-## 7. 不在本次范围 / 后续扩展(v2)
-
-### 不在范围(明确不做)
-
-- local 模式(`--type local`)的任何改动
-- DB 持久化(traj 只走 JSONL)
-- 旧 `{request, response}` traj 的兼容读取(replay 只接受新 schema)
-- SWE-agent / OpenHands 原生 traj 格式互转
-- replay 时 streaming 的细粒度时序还原(只保证 chunk 序列正确)
-- tool_calls 的增量流式拆分(本期 streaming replay 只到 message-level chunk)
-
-### 后续扩展(留了接口)
-
-- **基于 messages hash 的乱序匹配**:`SequentialCursor` 旁加 `HashMatcher`,通过 `replay_mode: sequential | hash` 切换。当 agent 内部不严格按录制顺序调 LLM(分支/retry)时用。
-- **多并发回放**:用请求 metadata 中的 `run_id` 路由到不同 cursor;`SequentialCursor` 改成 `dict[run_id, Cursor]`。
-- **passthrough on miss**:cursor 用尽时回落到真 LLM(`import litellm; await litellm.acompletion(...)`)。用于"录到一半 traj 不够长"的调试场景。
-- **`/admin/reset` HTTP 端点**:不重启 proxy 即可把 cursor 归零。
-- **`scripts/convert_traj.py`**:把 SWE-agent `.traj` 或 OpenHands event log 转成 StandardLoggingPayload,反向也行。
-- **traj 脱敏 hook**:写盘前过 `redact_keys: list[str]` 把指定字段抹掉。
-
----
-
-## 8. 关键路径速查
-
-### Rock 仓内(本次改动的)
-
-| 路径 | 角色 |
-|---|---|
-| `pyproject.toml` | model-service extras 加 litellm |
-| `rock/sdk/model/server/config.py` | ModelServiceConfig 新字段 |
-| `rock/sdk/model/server/api/proxy.py` | 重写为 litellm SDK |
-| `rock/sdk/model/server/main.py` | `_configure_litellm_for_proxy` + 新 CLI flags |
-| `rock/sdk/model/server/utils.py` | 保留 record_traj 给 local |
-| `rock/sdk/model/server/integrations/__init__.py` | 空,只为成包 |
-| `rock/sdk/model/server/integrations/traj_recorder.py` | TrajectoryRecorder(CustomLogger) |
-| `rock/sdk/model/server/integrations/traj_replayer.py` | SequentialCursor + TrajectoryReplayer(CustomLLM) |
-| `rock/sdk/model/server/api/local.py` | **没改**(仍用 record_traj) |
-| `tests/unit/sdk/model/test_proxy.py` | 改造完 |
-| `tests/unit/sdk/model/test_traj_recorder.py` | 新 |
-| `tests/unit/sdk/model/test_traj_replayer.py` | 新 |
-| `examples/model_service/config_record.yaml` | 新 |
-| `examples/model_service/config_replay.yaml` | 新 |
-
-### litellm 仓(交叉验证用,在 `/mnt/xinshi/github/litellm/`)
-
-| 关注点 | 路径 |
-|---|---|
-| CustomLogger 接口(基类) | `litellm/integrations/custom_logger.py:67` |
-| CustomLLM 接口(基类) | `litellm/llms/custom_llm.py:47` |
-| StandardLoggingPayload schema | `litellm/types/utils.py:2764` |
-| streaming 聚合写入 payload | `litellm/litellm_core_utils/litellm_logging.py:1930-1955` |
-| async_mock_completion_streaming_obj | `litellm/utils.py:6831` |
-| custom_provider_map 加载流程(实际是怎么调 acompletion 的) | `litellm/main.py:4280-4326` |
-| LiteLLM 异常基类(status_code 来源) | `litellm/exceptions.py` |
-
-### 历史 / 对话产物
-
-- 原始 plan 文件(详细设计推演): `/home/xinshi/.claude/plans/litellm-chat-completions-traj-replay-ser-lucky-rainbow.md`
-- 已废弃的独立项目骨架: `/mnt/xinshi/github/litellm-traj/`(已 `rm -rf`)
-
----
-
-## 9. 给接手者的 1 分钟上手
-
-1. `cd /mnt/xinshi/github/Self-ROCK && uv sync --extra model-service --group test`
-2. `uv run pytest tests/unit/sdk/model/ -v` → 应得 **47 passed**(已验证)
-3. 跑集成验证(第 4.4 节)
-4. 写 PR 描述,**重点说第 5 节的 breaking changes**
-5. PR 评审里如果有人问"为什么不沿用 retry_async 的 status code 白名单",答:见第 3.4 节(litellm 默认 retry 已覆盖最常见场景,白名单后续可选加)
-
-如果想了解整个项目背景而不只是这次 refactor,看顶层 `CLAUDE.md`。如果想知道 litellm 内部细节,看 `/mnt/xinshi/github/litellm/CLAUDE.md`(litellm 主仓的)。

From 0679d3271085761812264e3ac831ccadf8a518c3 Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 08:36:09 +0000
Subject: [PATCH 15/25] =?UTF-8?q?refactor(model-service):=20flatten=20layo?=
 =?UTF-8?q?ut=20=E2=80=94=20drop=20integrations/,=20rename=20sse=5Futils?=
 =?UTF-8?q?=E2=86=92sse,=20merge=20traj=5F*=E2=86=92traj?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The integrations/ directory only ever held two files (traj_recorder, traj_replayer)
and the litellm CustomLogger angle that justified the name is long gone. Both
modules share one JSONL schema, so collapsing them into a single traj.py keeps
the schema and its read/write halves visible together.

sse_utils.py → sse.py: the codec is the module's whole purpose, the _utils
suffix added nothing.

Drop traj_recorder.now() — a one-line wrapper around time.time() with no
callers.

Also remove a stray _get_or_create_metrics_monitor patch in
test_forward_invokes_recorder_on_success: OTLP create-time failure only logs
a warning, so the patch was protecting against nothing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 rock/sdk/model/server/api/proxy.py            |   5 +-
 .../sdk/model/server/integrations/__init__.py |   0
 .../server/integrations/traj_recorder.py      |  90 ----------
 .../server/integrations/traj_replayer.py      |  82 ---------
 rock/sdk/model/server/main.py                 |   4 +-
 .../sdk/model/server/{sse_utils.py => sse.py} |   0
 rock/sdk/model/server/traj.py                 | 156 ++++++++++++++++++
 tests/unit/sdk/model/test_proxy.py            |  11 +-
 .../sdk/model/test_proxy_record_replay.py     |   5 +-
 .../model/{test_sse_utils.py => test_sse.py}  |   2 +-
 tests/unit/sdk/model/test_traj_recorder.py    |   4 +-
 tests/unit/sdk/model/test_traj_replayer.py    |   2 +-
 12 files changed, 169 insertions(+), 192 deletions(-)
 delete mode 100644 rock/sdk/model/server/integrations/__init__.py
 delete mode 100644 rock/sdk/model/server/integrations/traj_recorder.py
 delete mode 100644 rock/sdk/model/server/integrations/traj_replayer.py
 rename rock/sdk/model/server/{sse_utils.py => sse.py} (100%)
 create mode 100644 rock/sdk/model/server/traj.py
 rename tests/unit/sdk/model/{test_sse_utils.py => test_sse.py} (99%)

diff --git a/rock/sdk/model/server/api/proxy.py b/rock/sdk/model/server/api/proxy.py
index 5fa750e5d2..462788fe90 100644
--- a/rock/sdk/model/server/api/proxy.py
+++ b/rock/sdk/model/server/api/proxy.py
@@ -33,14 +33,13 @@
 
 from rock.logger import init_logger
 from rock.sdk.model.server.config import ModelServiceConfig
-from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
-from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor, TrajectoryExhausted
-from rock.sdk.model.server.sse_utils import (
+from rock.sdk.model.server.sse import (
     SSE_DONE,
     completion_to_chunk_dict,
     encode_sse_event,
     parse_sse_data_chunks,
 )
+from rock.sdk.model.server.traj import SequentialCursor, TrajectoryExhausted, TrajectoryRecorder
 
 logger = init_logger(__name__)
 
diff --git a/rock/sdk/model/server/integrations/__init__.py b/rock/sdk/model/server/integrations/__init__.py
deleted file mode 100644
index e69de29bb2..0000000000
diff --git a/rock/sdk/model/server/integrations/traj_recorder.py b/rock/sdk/model/server/integrations/traj_recorder.py
deleted file mode 100644
index a0c7e08fc7..0000000000
--- a/rock/sdk/model/server/integrations/traj_recorder.py
+++ /dev/null
@@ -1,90 +0,0 @@
-"""Append chat/completions trajectories as JSONL.
-
-The recorder is invoked **explicitly** from ``proxy.py`` after each forwarded
-call (success or failure). It is no longer a litellm CustomLogger — we removed
-the litellm SDK dependency in favor of httpx-based byte forwarding, and call
-this object directly so writes stay deterministic and locally testable.
-
-Schema per line: a small dict with ``request`` / ``response`` / ``status`` /
-``response_time`` / ``model`` / ``stream``. Faithful enough to drive the
-sequential replayer; not a full StandardLoggingPayload.
-"""
-
-from __future__ import annotations
-
-import asyncio
-import json
-import os
-import time
-from pathlib import Path
-from typing import Any
-
-from rock.logger import init_logger
-from rock.sdk.model.server.utils import (
-    MODEL_SERVICE_REQUEST_COUNT,
-    MODEL_SERVICE_REQUEST_RT,
-    _get_or_create_metrics_monitor,
-)
-
-logger = init_logger(__name__)
-
-
-class TrajectoryRecorder:
-    """Appends one JSONL line per chat/completions call and reports OTLP metrics."""
-
-    def __init__(self, traj_file: str | os.PathLike) -> None:
-        self.traj_file = Path(traj_file)
-        self.traj_file.parent.mkdir(parents=True, exist_ok=True)
-        self._lock = asyncio.Lock()
-        self._monitor = _get_or_create_metrics_monitor()
-
-    async def record(
-        self,
-        *,
-        request: dict[str, Any],
-        response: dict[str, Any] | None,
-        status: str,
-        start_time: float,
-        end_time: float,
-        error: str | None = None,
-    ) -> None:
-        """Persist one call to the JSONL file and report RT/count metrics.
-
-        ``request`` / ``response`` are stored verbatim (whatever the upstream
-        returned, including provider-specific fields like ``reasoning_content``).
-        For streaming calls, ``response`` is the aggregated final ChatCompletion
-        produced by ``ChatCompletionStreamState.get_final_completion().model_dump()``.
-        """
-        rt_seconds = end_time - start_time
-        payload = {
-            "model": request.get("model"),
-            "stream": bool(request.get("stream")),
-            "status": status,
-            "response_time": rt_seconds,
-            "start_time": start_time,
-            "end_time": end_time,
-            "request": request,
-            "response": response,
-            "error": error,
-        }
-
-        line = json.dumps(payload, ensure_ascii=False, default=str) + "\n"
-        async with self._lock:
-            await asyncio.to_thread(self._write_line, line)
-
-        attrs = {
-            "type": "chat_completions",
-            "status": status,
-            "sandbox_id": os.getenv("ROCK_SANDBOX_ID", "unknown"),
-        }
-        self._monitor.record_gauge_by_name(MODEL_SERVICE_REQUEST_RT, rt_seconds * 1000.0, attributes=attrs)
-        self._monitor.record_counter_by_name(MODEL_SERVICE_REQUEST_COUNT, 1, attributes=attrs)
-
-    def _write_line(self, line: str) -> None:
-        with self.traj_file.open("a", encoding="utf-8") as f:
-            f.write(line)
-
-
-def now() -> float:
-    """Wall-clock seconds (single shim so callers don't import time directly)."""
-    return time.time()
diff --git a/rock/sdk/model/server/integrations/traj_replayer.py b/rock/sdk/model/server/integrations/traj_replayer.py
deleted file mode 100644
index af2fdd6bb4..0000000000
--- a/rock/sdk/model/server/integrations/traj_replayer.py
+++ /dev/null
@@ -1,82 +0,0 @@
-"""Sequential cursor over a recorded JSONL trajectory.
-
-Loaded once at startup; ``await cursor.next(expected_model=...)`` hands out the
-next record (full StandardLoggingPayload dict) and advances. Going past the end
-raises :class:`TrajectoryExhausted` so the proxy can return a clean 404 without
-involving litellm — that's the whole point: replay does NOT need to go through
-litellm's CustomLLM machinery, the proxy serves recorded responses directly.
-"""
-
-from __future__ import annotations
-
-import asyncio
-import json
-import os
-from pathlib import Path
-
-from rock.logger import init_logger
-
-logger = init_logger(__name__)
-
-
-class TrajectoryExhausted(Exception):
-    """Raised by ``SequentialCursor.next`` when all recorded steps have been served."""
-
-    def __init__(self, position: int, total: int) -> None:
-        super().__init__(f"trajectory exhausted at step {position} (total recorded steps={total})")
-        self.position = position
-        self.total = total
-
-
-class SequentialCursor:
-    """Hands out trajectory records one at a time, in recorded order."""
-
-    def __init__(self, records: list[dict]) -> None:
-        self.records = records
-        self._idx = 0
-        self._lock = asyncio.Lock()
-
-    @classmethod
-    def load(cls, path: str | os.PathLike) -> SequentialCursor:
-        path = Path(path)
-        if not path.is_file():
-            raise FileNotFoundError(f"traj file not found: {path}")
-
-        records: list[dict] = []
-        with path.open("r", encoding="utf-8") as fp:
-            for line in fp:
-                line = line.strip()
-                if not line:
-                    continue
-                records.append(json.loads(line))
-
-        logger.info(f"[traj-replay] loaded {len(records)} record(s) from {path}")
-        return cls(records)
-
-    async def next(self, expected_model: str | None = None) -> dict:
-        async with self._lock:
-            if self._idx >= len(self.records):
-                raise TrajectoryExhausted(position=self._idx, total=len(self.records))
-            record = self.records[self._idx]
-            self._idx += 1
-            current_idx = self._idx - 1
-
-        if expected_model:
-            recorded_model = record.get("model")
-            if recorded_model and recorded_model != expected_model:
-                logger.warning(
-                    f"[traj-replay] step {current_idx} model mismatch: "
-                    f"recorded={recorded_model!r} requested={expected_model!r}"
-                )
-        return record
-
-    def reset(self) -> None:
-        self._idx = 0
-
-    @property
-    def position(self) -> int:
-        return self._idx
-
-    @property
-    def total(self) -> int:
-        return len(self.records)
diff --git a/rock/sdk/model/server/main.py b/rock/sdk/model/server/main.py
index 951e0d5dff..063d4fa31f 100644
--- a/rock/sdk/model/server/main.py
+++ b/rock/sdk/model/server/main.py
@@ -63,14 +63,14 @@ def _configure_proxy_integrations(app: FastAPI, config: ModelServiceConfig) -> N
     from rock.sdk.model.server.api.proxy import ForwardBackend, ReplayBackend
 
     if config.replay_traj_file:
-        from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
+        from rock.sdk.model.server.traj import SequentialCursor
 
         cursor = SequentialCursor.load(config.replay_traj_file)
         app.state.backend = ReplayBackend(cursor)
         logger.info(f"replay backend attached, traj_path={config.replay_traj_file}")
         return
 
-    from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
+    from rock.sdk.model.server.traj import TrajectoryRecorder
 
     traj_path = config.traj_file or TRAJ_FILE
     recorder = TrajectoryRecorder(traj_file=traj_path)
diff --git a/rock/sdk/model/server/sse_utils.py b/rock/sdk/model/server/sse.py
similarity index 100%
rename from rock/sdk/model/server/sse_utils.py
rename to rock/sdk/model/server/sse.py
diff --git a/rock/sdk/model/server/traj.py b/rock/sdk/model/server/traj.py
new file mode 100644
index 0000000000..e12c229c7f
--- /dev/null
+++ b/rock/sdk/model/server/traj.py
@@ -0,0 +1,156 @@
+"""Trajectory record + replay for the chat/completions proxy.
+
+Two halves around the same JSONL schema (one record per line):
+
+- :class:`TrajectoryRecorder` — invoked by the forward path after each upstream
+  call (success or failure). Appends a small dict with
+  ``request`` / ``response`` / ``status`` / ``response_time`` / ``model`` /
+  ``stream``, and reports OTLP RT/count metrics. Stores responses verbatim
+  (provider-specific fields like ``reasoning_content`` survive); for streaming
+  calls ``response`` is the aggregated final ChatCompletion produced by
+  ``ChatCompletionStreamState.get_final_completion().model_dump()``.
+
+- :class:`SequentialCursor` — loads a JSONL trajectory once at startup;
+  ``await cursor.next(expected_model=...)`` hands out the next record (full
+  payload dict) and advances. Going past the end raises
+  :class:`TrajectoryExhausted` so the proxy can return a clean 404.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+import os
+from pathlib import Path
+from typing import Any
+
+from rock.logger import init_logger
+from rock.sdk.model.server.utils import (
+    MODEL_SERVICE_REQUEST_COUNT,
+    MODEL_SERVICE_REQUEST_RT,
+    _get_or_create_metrics_monitor,
+)
+
+logger = init_logger(__name__)
+
+
+# ---------------------------------------------------------------------------
+# Recorder
+# ---------------------------------------------------------------------------
+
+
+class TrajectoryRecorder:
+    """Appends one JSONL line per chat/completions call and reports OTLP metrics."""
+
+    def __init__(self, traj_file: str | os.PathLike) -> None:
+        self.traj_file = Path(traj_file)
+        self.traj_file.parent.mkdir(parents=True, exist_ok=True)
+        self._lock = asyncio.Lock()
+        self._monitor = _get_or_create_metrics_monitor()
+
+    async def record(
+        self,
+        *,
+        request: dict[str, Any],
+        response: dict[str, Any] | None,
+        status: str,
+        start_time: float,
+        end_time: float,
+        error: str | None = None,
+    ) -> None:
+        rt_seconds = end_time - start_time
+        payload = {
+            "model": request.get("model"),
+            "stream": bool(request.get("stream")),
+            "status": status,
+            "response_time": rt_seconds,
+            "start_time": start_time,
+            "end_time": end_time,
+            "request": request,
+            "response": response,
+            "error": error,
+        }
+
+        line = json.dumps(payload, ensure_ascii=False, default=str) + "\n"
+        async with self._lock:
+            await asyncio.to_thread(self._write_line, line)
+
+        attrs = {
+            "type": "chat_completions",
+            "status": status,
+            "sandbox_id": os.getenv("ROCK_SANDBOX_ID", "unknown"),
+        }
+        self._monitor.record_gauge_by_name(MODEL_SERVICE_REQUEST_RT, rt_seconds * 1000.0, attributes=attrs)
+        self._monitor.record_counter_by_name(MODEL_SERVICE_REQUEST_COUNT, 1, attributes=attrs)
+
+    def _write_line(self, line: str) -> None:
+        with self.traj_file.open("a", encoding="utf-8") as f:
+            f.write(line)
+
+
+# ---------------------------------------------------------------------------
+# Replay cursor
+# ---------------------------------------------------------------------------
+
+
+class TrajectoryExhausted(Exception):
+    """Raised by ``SequentialCursor.next`` when all recorded steps have been served."""
+
+    def __init__(self, position: int, total: int) -> None:
+        super().__init__(f"trajectory exhausted at step {position} (total recorded steps={total})")
+        self.position = position
+        self.total = total
+
+
+class SequentialCursor:
+    """Hands out trajectory records one at a time, in recorded order."""
+
+    def __init__(self, records: list[dict]) -> None:
+        self.records = records
+        self._idx = 0
+        self._lock = asyncio.Lock()
+
+    @classmethod
+    def load(cls, path: str | os.PathLike) -> SequentialCursor:
+        path = Path(path)
+        if not path.is_file():
+            raise FileNotFoundError(f"traj file not found: {path}")
+
+        records: list[dict] = []
+        with path.open("r", encoding="utf-8") as fp:
+            for line in fp:
+                line = line.strip()
+                if not line:
+                    continue
+                records.append(json.loads(line))
+
+        logger.info(f"[traj-replay] loaded {len(records)} record(s) from {path}")
+        return cls(records)
+
+    async def next(self, expected_model: str | None = None) -> dict:
+        async with self._lock:
+            if self._idx >= len(self.records):
+                raise TrajectoryExhausted(position=self._idx, total=len(self.records))
+            record = self.records[self._idx]
+            self._idx += 1
+            current_idx = self._idx - 1
+
+        if expected_model:
+            recorded_model = record.get("model")
+            if recorded_model and recorded_model != expected_model:
+                logger.warning(
+                    f"[traj-replay] step {current_idx} model mismatch: "
+                    f"recorded={recorded_model!r} requested={expected_model!r}"
+                )
+        return record
+
+    def reset(self) -> None:
+        self._idx = 0
+
+    @property
+    def position(self) -> int:
+        return self._idx
+
+    @property
+    def total(self) -> int:
+        return len(self.records)
diff --git a/tests/unit/sdk/model/test_proxy.py b/tests/unit/sdk/model/test_proxy.py
index f47813ac8b..cd262df2e0 100644
--- a/tests/unit/sdk/model/test_proxy.py
+++ b/tests/unit/sdk/model/test_proxy.py
@@ -17,8 +17,8 @@
 
 from rock.sdk.model.server.api.proxy import proxy_router
 from rock.sdk.model.server.config import ModelServiceConfig
-from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
 from rock.sdk.model.server.main import create_config_from_args, lifespan
+from rock.sdk.model.server.traj import SequentialCursor
 from rock.sdk.model.server.utils import (
     MODEL_SERVICE_REQUEST_COUNT,
     MODEL_SERVICE_REQUEST_RT,
@@ -386,7 +386,7 @@ def handler(request: httpx.Request) -> httpx.Response:
 @pytest.mark.asyncio
 async def test_forward_invokes_recorder_on_success(tmp_path):
     """When a recorder is attached to the backend, success calls write a JSONL line."""
-    from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
+    from rock.sdk.model.server.traj import TrajectoryRecorder
 
     upstream_payload = _success_response_json(content="recorded reply")
     traj_file = tmp_path / "traj.jsonl"
@@ -396,12 +396,7 @@ def handler(request: httpx.Request) -> httpx.Response:
 
     config = ModelServiceConfig()
 
-    with (
-        _patch_httpx_with_handler(handler),
-        patch(
-            "rock.sdk.model.server.integrations.traj_recorder._get_or_create_metrics_monitor", return_value=MagicMock()
-        ),
-    ):
+    with _patch_httpx_with_handler(handler):
         recorder = TrajectoryRecorder(traj_file=traj_file)
         app = _build_app(config, recorder=recorder)
         transport = ASGITransport(app=app)
diff --git a/tests/unit/sdk/model/test_proxy_record_replay.py b/tests/unit/sdk/model/test_proxy_record_replay.py
index 1bb9832784..0b70ed0cf8 100644
--- a/tests/unit/sdk/model/test_proxy_record_replay.py
+++ b/tests/unit/sdk/model/test_proxy_record_replay.py
@@ -23,9 +23,8 @@
 
 from rock.sdk.model.server.api.proxy import ForwardBackend, ReplayBackend, proxy_router
 from rock.sdk.model.server.config import ModelServiceConfig
-from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
-from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor
-from rock.sdk.model.server.sse_utils import parse_sse_data_chunks
+from rock.sdk.model.server.sse import parse_sse_data_chunks
+from rock.sdk.model.server.traj import SequentialCursor, TrajectoryRecorder
 from rock.utils.system import find_free_port
 
 # ---------------------------------------------------------------------------
diff --git a/tests/unit/sdk/model/test_sse_utils.py b/tests/unit/sdk/model/test_sse.py
similarity index 99%
rename from tests/unit/sdk/model/test_sse_utils.py
rename to tests/unit/sdk/model/test_sse.py
index f660d2751b..251016a0a8 100644
--- a/tests/unit/sdk/model/test_sse_utils.py
+++ b/tests/unit/sdk/model/test_sse.py
@@ -2,7 +2,7 @@
 
 import json
 
-from rock.sdk.model.server.sse_utils import (
+from rock.sdk.model.server.sse import (
     SSE_DONE,
     completion_to_chunk_dict,
     encode_sse_event,
diff --git a/tests/unit/sdk/model/test_traj_recorder.py b/tests/unit/sdk/model/test_traj_recorder.py
index 6eb3b49571..3f06481639 100644
--- a/tests/unit/sdk/model/test_traj_recorder.py
+++ b/tests/unit/sdk/model/test_traj_recorder.py
@@ -5,14 +5,14 @@
 
 import pytest
 
-from rock.sdk.model.server.integrations.traj_recorder import TrajectoryRecorder
+from rock.sdk.model.server.traj import TrajectoryRecorder
 
 
 @pytest.fixture
 def mock_monitor():
     monitor = MagicMock()
     with patch(
-        "rock.sdk.model.server.integrations.traj_recorder._get_or_create_metrics_monitor",
+        "rock.sdk.model.server.traj._get_or_create_metrics_monitor",
         return_value=monitor,
     ):
         yield monitor
diff --git a/tests/unit/sdk/model/test_traj_replayer.py b/tests/unit/sdk/model/test_traj_replayer.py
index e4a379bd0d..ffcc5c4011 100644
--- a/tests/unit/sdk/model/test_traj_replayer.py
+++ b/tests/unit/sdk/model/test_traj_replayer.py
@@ -9,7 +9,7 @@
 
 import pytest
 
-from rock.sdk.model.server.integrations.traj_replayer import SequentialCursor, TrajectoryExhausted
+from rock.sdk.model.server.traj import SequentialCursor, TrajectoryExhausted
 
 
 def _record(*, msg: str, model: str = "gpt-3.5-turbo", call_id: str = "x") -> dict:

From dcd7905ff6249ca25c6d42ab03185df9396ae2bd Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 08:42:49 +0000
Subject: [PATCH 16/25] =?UTF-8?q?refactor(model-service):=20rename=20traj?=
 =?UTF-8?q?=5Ffile=E2=86=92recording=5Ffile,=20replay=5Ftraj=5Ffile?=
 =?UTF-8?q?=E2=86=92replay=5Ffile?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The old names were ambiguous: "traj_file" alone gave no hint of write vs read,
and the CLI flag --traj-file was actually wired to config.replay_traj_file —
same word pointing in opposite directions depending on context.

New names mirror the backend pair (ForwardBackend = recording, ReplayBackend =
replay) so the role is obvious at the field. CLI is split into two independent
flags accordingly.

Recorder constructor still takes traj_file= since it names the JSONL file type,
not its role; only the config-field / CLI surface changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 rock/sdk/model/server/api/proxy.py |  2 +-
 rock/sdk/model/server/config.py    | 11 +++++-----
 rock/sdk/model/server/main.py      | 34 +++++++++++++++++++-----------
 tests/unit/sdk/model/test_proxy.py | 34 +++++++++++++++---------------
 4 files changed, 46 insertions(+), 35 deletions(-)

diff --git a/rock/sdk/model/server/api/proxy.py b/rock/sdk/model/server/api/proxy.py
index 462788fe90..6364d4a131 100644
--- a/rock/sdk/model/server/api/proxy.py
+++ b/rock/sdk/model/server/api/proxy.py
@@ -12,7 +12,7 @@
    returns (provider-specific ``reasoning_content``, ``citations``, ...) is
    passed through untouched.
 
-2. **ReplayBackend** (``replay_traj_file`` set) — the request is served
+2. **ReplayBackend** (``replay_file`` set) — the request is served
    directly from the next record in the ``SequentialCursor`` without any
    upstream call. Streaming emits the recorded response as one SSE chunk +
    ``[DONE]``.
diff --git a/rock/sdk/model/server/config.py b/rock/sdk/model/server/config.py
index 76e080305c..0f6ed69b66 100644
--- a/rock/sdk/model/server/config.py
+++ b/rock/sdk/model/server/config.py
@@ -51,12 +51,13 @@ class ModelServiceConfig(BaseModel):
     request_timeout: int = Field(default=120)
     """Request timeout in seconds."""
 
-    traj_file: str | None = Field(default=None)
-    """Override default trajectory file path. None → uses TRAJ_FILE (LOG_DIR/LLMTraj.jsonl)."""
+    recording_file: str | None = Field(default=None)
+    """Recording mode output: where ForwardBackend writes the trajectory JSONL.
+    None → uses TRAJ_FILE (LOG_DIR/LLMTraj.jsonl)."""
 
-    replay_traj_file: str | None = Field(default=None)
-    """Path to a .jsonl trajectory file for replay mode.
-    When set, requests are served from recorded responses instead of a real upstream."""
+    replay_file: str | None = Field(default=None)
+    """Replay mode input: a .jsonl trajectory file. When set, ReplayBackend serves
+    requests from recorded responses instead of calling a real upstream."""
 
     @classmethod
     def from_file(cls, config_path: str | None = None):
diff --git a/rock/sdk/model/server/main.py b/rock/sdk/model/server/main.py
index 063d4fa31f..89e87ac0f9 100644
--- a/rock/sdk/model/server/main.py
+++ b/rock/sdk/model/server/main.py
@@ -55,27 +55,28 @@ async def global_exception_handler(request, exc):
 def _configure_proxy_integrations(app: FastAPI, config: ModelServiceConfig) -> None:
     """Attach the appropriate backend to ``app.state.backend``.
 
-    - Replay mode (``replay_traj_file`` set): ``ReplayBackend`` wrapping a
+    - Replay mode (``replay_file`` set): ``ReplayBackend`` wrapping a
       ``SequentialCursor``; no recorder — replaying back into the source file
       would corrupt it.
-    - Forward mode (default): ``ForwardBackend`` with a ``TrajectoryRecorder``.
+    - Forward mode (default): ``ForwardBackend`` with a ``TrajectoryRecorder``
+      writing to ``recording_file`` (or ``TRAJ_FILE`` if unset).
     """
     from rock.sdk.model.server.api.proxy import ForwardBackend, ReplayBackend
 
-    if config.replay_traj_file:
+    if config.replay_file:
         from rock.sdk.model.server.traj import SequentialCursor
 
-        cursor = SequentialCursor.load(config.replay_traj_file)
+        cursor = SequentialCursor.load(config.replay_file)
         app.state.backend = ReplayBackend(cursor)
-        logger.info(f"replay backend attached, traj_path={config.replay_traj_file}")
+        logger.info(f"replay backend attached, replay_file={config.replay_file}")
         return
 
     from rock.sdk.model.server.traj import TrajectoryRecorder
 
-    traj_path = config.traj_file or TRAJ_FILE
-    recorder = TrajectoryRecorder(traj_file=traj_path)
+    recording_path = config.recording_file or TRAJ_FILE
+    recorder = TrajectoryRecorder(traj_file=recording_path)
     app.state.backend = ForwardBackend(config, recorder=recorder)
-    logger.info(f"forward backend attached, traj_file={traj_path}")
+    logger.info(f"forward backend attached, recording_file={recording_path}")
 
 
 def main(
@@ -127,9 +128,12 @@ def create_config_from_args(args) -> ModelServiceConfig:
     if args.request_timeout:
         config.request_timeout = args.request_timeout
         logger.info(f"request_timeout set from command line: {args.request_timeout}s")
-    if args.traj_file:
-        config.replay_traj_file = args.traj_file
-        logger.info(f"replay mode enabled via --traj-file: {args.traj_file}")
+    if args.recording_file:
+        config.recording_file = args.recording_file
+        logger.info(f"recording_file set from command line: {args.recording_file}")
+    if args.replay_file:
+        config.replay_file = args.replay_file
+        logger.info(f"replay mode enabled via --replay-file: {args.replay_file}")
 
     return config
 
@@ -173,7 +177,13 @@ def create_config_from_args(args) -> ModelServiceConfig:
         "--request-timeout", type=int, default=None, help="Request timeout in seconds. Overrides config file."
     )
     parser.add_argument(
-        "--traj-file",
+        "--recording-file",
+        type=str,
+        default=None,
+        help="Forward mode: where to write the trajectory JSONL. Defaults to TRAJ_FILE.",
+    )
+    parser.add_argument(
+        "--replay-file",
         type=str,
         default=None,
         help="Replay mode: path to a recorded .jsonl traj file. Disables real LLM upstreams.",
diff --git a/tests/unit/sdk/model/test_proxy.py b/tests/unit/sdk/model/test_proxy.py
index cd262df2e0..b658b9343e 100644
--- a/tests/unit/sdk/model/test_proxy.py
+++ b/tests/unit/sdk/model/test_proxy.py
@@ -439,7 +439,7 @@ async def test_replay_returns_recorded_response_no_upstream_call(tmp_path):
     traj.write_text(json.dumps(record) + "\n", encoding="utf-8")
 
     config = ModelServiceConfig()
-    config.replay_traj_file = str(traj)
+    config.replay_file = str(traj)
     app = _build_app(config, replay_cursor=SequentialCursor.load(traj))
 
     transport = ASGITransport(app=app)
@@ -474,7 +474,7 @@ async def test_replay_streaming_emits_recorded_response_as_sse(tmp_path):
     traj.write_text(json.dumps(record) + "\n", encoding="utf-8")
 
     config = ModelServiceConfig()
-    config.replay_traj_file = str(traj)
+    config.replay_file = str(traj)
     app = _build_app(config, replay_cursor=SequentialCursor.load(traj))
 
     transport = ASGITransport(app=app)
@@ -504,7 +504,7 @@ async def test_replay_returns_404_when_cursor_exhausted(tmp_path):
     traj.write_text(json.dumps(record) + "\n", encoding="utf-8")
 
     config = ModelServiceConfig()
-    config.replay_traj_file = str(traj)
+    config.replay_file = str(traj)
     app = _build_app(config, replay_cursor=SequentialCursor.load(traj))
 
     transport = ASGITransport(app=app)
@@ -550,26 +550,26 @@ def test_config_default_host_and_port():
     assert config.port == 8080
 
 
-def test_config_default_traj_and_replay():
+def test_config_default_recording_and_replay():
     config = ModelServiceConfig()
-    assert config.traj_file is None
-    assert config.replay_traj_file is None
+    assert config.recording_file is None
+    assert config.replay_file is None
 
 
 @pytest.mark.asyncio
-async def test_config_loads_traj_and_replay_from_file(tmp_path):
+async def test_config_loads_recording_and_replay_from_file(tmp_path):
     conf_file = tmp_path / "proxy.yml"
     conf_file.write_text(
         yaml.dump(
             {
-                "traj_file": "/tmp/my-traj.jsonl",
-                "replay_traj_file": "/tmp/in.jsonl",
+                "recording_file": "/tmp/my-traj.jsonl",
+                "replay_file": "/tmp/in.jsonl",
             }
         )
     )
     config = ModelServiceConfig.from_file(str(conf_file))
-    assert config.traj_file == "/tmp/my-traj.jsonl"
-    assert config.replay_traj_file == "/tmp/in.jsonl"
+    assert config.recording_file == "/tmp/my-traj.jsonl"
+    assert config.replay_file == "/tmp/in.jsonl"
 
 
 def test_cli_args_override_config_file(tmp_path):
@@ -591,8 +591,8 @@ def test_cli_args_override_config_file(tmp_path):
         proxy_base_url="https://cli-url.example.com/v1",
         retryable_status_codes=None,
         request_timeout=30,
-        num_retries=None,
-        traj_file=None,
+        recording_file=None,
+        replay_file=None,
     )
     config = create_config_from_args(args)
     assert config.host == "0.0.0.0"
@@ -601,7 +601,7 @@ def test_cli_args_override_config_file(tmp_path):
     assert config.request_timeout == 30
 
 
-def test_cli_traj_file_enables_replay():
+def test_cli_replay_file_enables_replay():
     args = argparse.Namespace(
         config_file=None,
         host=None,
@@ -609,11 +609,11 @@ def test_cli_traj_file_enables_replay():
         proxy_base_url=None,
         retryable_status_codes=None,
         request_timeout=None,
-        num_retries=None,
-        traj_file="/tmp/in.jsonl",
+        recording_file=None,
+        replay_file="/tmp/in.jsonl",
     )
     config = create_config_from_args(args)
-    assert config.replay_traj_file == "/tmp/in.jsonl"
+    assert config.replay_file == "/tmp/in.jsonl"
 
 
 # ---------- Metrics singleton + legacy record_traj (still used by local mode) ----------

From b87b61d52e1eeb380b59f4c4d1a92ec730cf6d7d Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 08:47:40 +0000
Subject: [PATCH 17/25] feat(model-service): enforce recording_file/replay_file
 mutex via model_validator

Setting both at once was silently resolved in favor of replay (the backend
factory checks replay_file first), masking what is really a configuration
error. A Pydantic model_validator now rejects the combination at construction
time; validate_assignment=True extends the check to CLI-style field-by-field
overrides applied after a yaml load.

Three tests added: construction-time mutex, assignment-time mutex, and the
existing yaml-load test split into one-side-only variants since the original
deliberately set both fields.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 rock/sdk/model/server/config.py    | 15 +++++++++++++-
 tests/unit/sdk/model/test_proxy.py | 33 ++++++++++++++++++++++--------
 2 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/rock/sdk/model/server/config.py b/rock/sdk/model/server/config.py
index 0f6ed69b66..e734c29878 100644
--- a/rock/sdk/model/server/config.py
+++ b/rock/sdk/model/server/config.py
@@ -1,7 +1,7 @@
 from pathlib import Path
 
 import yaml
-from pydantic import BaseModel, Field
+from pydantic import BaseModel, ConfigDict, Field, model_validator
 
 from rock import env_vars
 
@@ -27,6 +27,10 @@
 class ModelServiceConfig(BaseModel):
     """Configuration for the LLM Model Service."""
 
+    # validate_assignment=True so the recording/replay mutex below also fires when
+    # CLI overrides are applied field-by-field (not only at construction time).
+    model_config = ConfigDict(validate_assignment=True)
+
     host: str = "0.0.0.0"
     """Server host address."""
 
@@ -59,6 +63,15 @@ class ModelServiceConfig(BaseModel):
     """Replay mode input: a .jsonl trajectory file. When set, ReplayBackend serves
     requests from recorded responses instead of calling a real upstream."""
 
+    @model_validator(mode="after")
+    def _recording_replay_mutually_exclusive(self):
+        if self.recording_file and self.replay_file:
+            raise ValueError(
+                "recording_file and replay_file are mutually exclusive — "
+                "set one (recording mode) or the other (replay mode), not both."
+            )
+        return self
+
     @classmethod
     def from_file(cls, config_path: str | None = None):
         """
diff --git a/tests/unit/sdk/model/test_proxy.py b/tests/unit/sdk/model/test_proxy.py
index b658b9343e..345ea31775 100644
--- a/tests/unit/sdk/model/test_proxy.py
+++ b/tests/unit/sdk/model/test_proxy.py
@@ -557,19 +557,34 @@ def test_config_default_recording_and_replay():
 
 
 @pytest.mark.asyncio
-async def test_config_loads_recording_and_replay_from_file(tmp_path):
+async def test_config_loads_recording_file_from_yaml(tmp_path):
     conf_file = tmp_path / "proxy.yml"
-    conf_file.write_text(
-        yaml.dump(
-            {
-                "recording_file": "/tmp/my-traj.jsonl",
-                "replay_file": "/tmp/in.jsonl",
-            }
-        )
-    )
+    conf_file.write_text(yaml.dump({"recording_file": "/tmp/my-traj.jsonl"}))
     config = ModelServiceConfig.from_file(str(conf_file))
     assert config.recording_file == "/tmp/my-traj.jsonl"
+    assert config.replay_file is None
+
+
+@pytest.mark.asyncio
+async def test_config_loads_replay_file_from_yaml(tmp_path):
+    conf_file = tmp_path / "proxy.yml"
+    conf_file.write_text(yaml.dump({"replay_file": "/tmp/in.jsonl"}))
+    config = ModelServiceConfig.from_file(str(conf_file))
     assert config.replay_file == "/tmp/in.jsonl"
+    assert config.recording_file is None
+
+
+def test_config_recording_and_replay_are_mutually_exclusive():
+    """Setting both at construction time fails Pydantic validation."""
+    with pytest.raises(ValueError, match="mutually exclusive"):
+        ModelServiceConfig(recording_file="/tmp/a.jsonl", replay_file="/tmp/b.jsonl")
+
+
+def test_config_recording_replay_mutex_fires_on_assignment():
+    """validate_assignment=True so CLI-style field-by-field overrides also trip the mutex."""
+    config = ModelServiceConfig(recording_file="/tmp/a.jsonl")
+    with pytest.raises(ValueError, match="mutually exclusive"):
+        config.replay_file = "/tmp/b.jsonl"
 
 
 def test_cli_args_override_config_file(tmp_path):

From 8448ef24cda380831ff9768b1d60ae0f0d682766 Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 08:50:48 +0000
Subject: [PATCH 18/25] refactor(model-service): move _replay_sse_iter into
 ReplayBackend as a staticmethod

Module-level function with a single call site inside ReplayBackend; the SSE
chunk-emit shape is purely a replay-mode implementation detail. Moving it
inside the class also makes the pairing with the JSON branch in serve()
visible at a glance.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 rock/sdk/model/server/api/proxy.py | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/rock/sdk/model/server/api/proxy.py b/rock/sdk/model/server/api/proxy.py
index 6364d4a131..3ada8615c8 100644
--- a/rock/sdk/model/server/api/proxy.py
+++ b/rock/sdk/model/server/api/proxy.py
@@ -142,12 +142,6 @@ def _filter_headers(headers) -> dict[str, str]:
     return out
 
 
-async def _replay_sse_iter(response: dict, *, model: str) -> AsyncIterator[bytes]:
-    """Emit a recorded response as one SSE chunk + ``[DONE]``."""
-    yield encode_sse_event(completion_to_chunk_dict(response, model=model))
-    yield SSE_DONE
-
-
 async def _forward_stream_and_record(
     *,
     upstream_url: str,
@@ -264,11 +258,17 @@ async def serve(self, *, model_name: str, is_stream: bool, **_: Any) -> Response
 
         if is_stream:
             return StreamingResponse(
-                _replay_sse_iter(response_dict, model=model_name),
+                self._sse_iter(response_dict, model=model_name),
                 media_type="text/event-stream",
             )
         return JSONResponse(status_code=200, content=response_dict)
 
+    @staticmethod
+    async def _sse_iter(response: dict, *, model: str) -> AsyncIterator[bytes]:
+        """Emit a recorded response as one SSE chunk + ``[DONE]``."""
+        yield encode_sse_event(completion_to_chunk_dict(response, model=model))
+        yield SSE_DONE
+
 
 class ForwardBackend:
     """Forwards requests byte-for-byte to the upstream and optionally records the trajectory."""

From d169ebe004beb54ac8177ba4bba79e2e325f147f Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 08:52:26 +0000
Subject: [PATCH 19/25] refactor(model-service): move get_base_url into
 ForwardBackend as _resolve_base_url
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Single call site inside ForwardBackend.serve, and the function reads only
self._config — drop the redundant config parameter and rename to
_resolve_base_url to make the multi-source fallback (proxy_base_url →
proxy_rules[model] → proxy_rules['default']) explicit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 rock/sdk/model/server/api/proxy.py | 49 +++++++++++++++---------------
 1 file changed, 24 insertions(+), 25 deletions(-)

diff --git a/rock/sdk/model/server/api/proxy.py b/rock/sdk/model/server/api/proxy.py
index 3ada8615c8..b9f9910edd 100644
--- a/rock/sdk/model/server/api/proxy.py
+++ b/rock/sdk/model/server/api/proxy.py
@@ -106,30 +106,6 @@ async def _send_with_retry(
     raise last_exc  # pragma: no cover  # unreachable
 
 
-def get_base_url(model_name: str, config: ModelServiceConfig) -> str:
-    """Pick the upstream base URL by model name.
-
-    ``proxy_base_url`` takes precedence; falls back to ``proxy_rules[model]`` and
-    then ``proxy_rules["default"]``. Trailing slashes are stripped so the caller
-    can append ``/chat/completions`` directly.
-    """
-    if config.proxy_base_url:
-        return config.proxy_base_url.rstrip("/")
-
-    if not model_name:
-        raise HTTPException(status_code=400, detail="Model name is required for routing.")
-
-    rules = config.proxy_rules
-    base_url = rules.get(model_name) or rules.get("default")
-    if not base_url:
-        raise HTTPException(
-            status_code=400,
-            detail=f"Model '{model_name}' is not configured and no 'default' rule found.",
-        )
-
-    return base_url.rstrip("/")
-
-
 def _filter_headers(headers) -> dict[str, str]:
     """Drop headers that are scoped to the client↔proxy hop or rebuilt by httpx.
     ``Authorization`` is forwarded verbatim — proxy stays stateless about which
@@ -277,6 +253,29 @@ def __init__(self, config: ModelServiceConfig, recorder: TrajectoryRecorder | No
         self._config = config
         self._recorder = recorder
 
+    def _resolve_base_url(self, model_name: str) -> str:
+        """Pick the upstream base URL by model name.
+
+        ``proxy_base_url`` takes precedence; falls back to ``proxy_rules[model]`` and
+        then ``proxy_rules["default"]``. Trailing slashes are stripped so the caller
+        can append ``/chat/completions`` directly.
+        """
+        if self._config.proxy_base_url:
+            return self._config.proxy_base_url.rstrip("/")
+
+        if not model_name:
+            raise HTTPException(status_code=400, detail="Model name is required for routing.")
+
+        rules = self._config.proxy_rules
+        base_url = rules.get(model_name) or rules.get("default")
+        if not base_url:
+            raise HTTPException(
+                status_code=400,
+                detail=f"Model '{model_name}' is not configured and no 'default' rule found.",
+            )
+
+        return base_url.rstrip("/")
+
     async def serve(
         self,
         *,
@@ -287,7 +286,7 @@ async def serve(
         request_dict: dict[str, Any],
         **_: Any,
     ) -> Response:
-        upstream_url = f"{get_base_url(model_name, self._config)}/chat/completions"
+        upstream_url = f"{self._resolve_base_url(model_name)}/chat/completions"
         logger.info(f"Routing model {model_name!r} to {upstream_url}")
 
         if is_stream:

From 1d6b2d1b60dd09dd5142823805a8b000de63edba Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 08:59:43 +0000
Subject: [PATCH 20/25] refactor(model-service): move
 _forward_stream_and_record into ForwardBackend as _stream_and_record
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Same single-call-site argument as the previous moves: 3 of the 7 kwargs were
just relaying self._config / self._recorder. As an instance method the
parameter list drops to 4 and the streaming path mirrors the structure of
ReplayBackend._sse_iter.

_send_with_retry stays at module scope — it's a pure helper bound to the
httpx.AsyncClient lifecycle, not to any backend state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 rock/sdk/model/server/api/proxy.py | 190 ++++++++++++++---------------
 1 file changed, 92 insertions(+), 98 deletions(-)

diff --git a/rock/sdk/model/server/api/proxy.py b/rock/sdk/model/server/api/proxy.py
index b9f9910edd..73f74e3f62 100644
--- a/rock/sdk/model/server/api/proxy.py
+++ b/rock/sdk/model/server/api/proxy.py
@@ -118,100 +118,6 @@ def _filter_headers(headers) -> dict[str, str]:
     return out
 
 
-async def _forward_stream_and_record(
-    *,
-    upstream_url: str,
-    body_bytes: bytes,
-    fwd_headers: dict[str, str],
-    timeout: float,
-    request_dict: dict[str, Any],
-    recorder: TrajectoryRecorder | None,
-    retryable_codes: list[int],
-) -> AsyncIterator[bytes]:
-    """SSE bytes are forwarded verbatim; chunks are parsed in parallel and
-    aggregated into the final ChatCompletion that the recorder writes to JSONL.
-
-    Retry on connection errors and whitelisted statuses happens BEFORE any byte
-    is yielded; mid-stream connection drops are not retried (would corrupt the
-    client transmission)."""
-    # openai SDK is used purely as a stream-aggregation parser — keep the import
-    # local so module load doesn't pull it in for callers that never stream.
-    from openai.lib.streaming.chat import ChatCompletionStreamState
-    from openai.types.chat import ChatCompletionChunk
-
-    state = ChatCompletionStreamState()
-    start = time.time()
-    parse_buffer = b""
-    upstream_status = 0
-
-    async with httpx.AsyncClient(timeout=timeout) as client:
-        try:
-            resp = await _send_with_retry(
-                client,
-                upstream_url,
-                body_bytes=body_bytes,
-                headers=fwd_headers,
-                retryable_codes=retryable_codes,
-            )
-        except (httpx.TimeoutException, httpx.ConnectError) as exc:
-            if recorder is not None:
-                await recorder.record(
-                    request=request_dict,
-                    response=None,
-                    status="failure",
-                    start_time=start,
-                    end_time=time.time(),
-                    error=f"{type(exc).__name__}: {exc}",
-                )
-            return
-
-        try:
-            upstream_status = resp.status_code
-            async for chunk in resp.aiter_bytes():
-                yield chunk
-                chunk_dicts, parse_buffer = parse_sse_data_chunks(parse_buffer + chunk)
-                for chunk_dict in chunk_dicts:
-                    try:
-                        state.handle_chunk(ChatCompletionChunk.model_validate(chunk_dict))
-                    except Exception as exc:  # parser error: forward continues, traj will be partial
-                        logger.debug(f"[record] chunk parse failed (forward continues): {exc}")
-        except httpx.RequestError as exc:
-            # Connection died mid-stream — bytes already sent reach the client;
-            # record what we got and return.
-            if recorder is not None:
-                await recorder.record(
-                    request=request_dict,
-                    response=None,
-                    status="failure",
-                    start_time=start,
-                    end_time=time.time(),
-                    error=f"{type(exc).__name__}: {exc}",
-                )
-            return
-        finally:
-            await resp.aclose()
-
-    if recorder is None:
-        return
-
-    status = "success" if upstream_status < 400 else "failure"
-    final_dict: dict | None = None
-    if status == "success":
-        try:
-            final_dict = state.get_final_completion().model_dump()
-        except Exception as exc:
-            logger.warning(f"[record] stream aggregation failed: {exc}")
-
-    await recorder.record(
-        request=request_dict,
-        response=final_dict,
-        status=status,
-        start_time=start,
-        end_time=time.time(),
-        error=None if status == "success" else f"upstream_status={upstream_status}",
-    )
-
-
 class ReplayBackend:
     """Serves requests from a pre-recorded trajectory; no upstream calls made."""
 
@@ -291,14 +197,11 @@ async def serve(
 
         if is_stream:
             return StreamingResponse(
-                _forward_stream_and_record(
+                self._stream_and_record(
                     upstream_url=upstream_url,
                     body_bytes=body_bytes,
                     fwd_headers=fwd_headers,
-                    timeout=self._config.request_timeout,
                     request_dict=request_dict,
-                    recorder=self._recorder,
-                    retryable_codes=self._config.retryable_status_codes,
                 ),
                 media_type="text/event-stream",
             )
@@ -366,6 +269,97 @@ async def serve(
         # Forward bytes verbatim — preserves any provider-specific fields untouched.
         return Response(content=response_bytes, status_code=status_code, media_type=content_type)
 
+    async def _stream_and_record(
+        self,
+        *,
+        upstream_url: str,
+        body_bytes: bytes,
+        fwd_headers: dict[str, str],
+        request_dict: dict[str, Any],
+    ) -> AsyncIterator[bytes]:
+        """SSE bytes are forwarded verbatim; chunks are parsed in parallel and
+        aggregated into the final ChatCompletion that the recorder writes to JSONL.
+
+        Retry on connection errors and whitelisted statuses happens BEFORE any byte
+        is yielded; mid-stream connection drops are not retried (would corrupt the
+        client transmission)."""
+        # openai SDK is used purely as a stream-aggregation parser — keep the import
+        # local so module load doesn't pull it in for callers that never stream.
+        from openai.lib.streaming.chat import ChatCompletionStreamState
+        from openai.types.chat import ChatCompletionChunk
+
+        state = ChatCompletionStreamState()
+        start = time.time()
+        parse_buffer = b""
+        upstream_status = 0
+
+        async with httpx.AsyncClient(timeout=self._config.request_timeout) as client:
+            try:
+                resp = await _send_with_retry(
+                    client,
+                    upstream_url,
+                    body_bytes=body_bytes,
+                    headers=fwd_headers,
+                    retryable_codes=self._config.retryable_status_codes,
+                )
+            except (httpx.TimeoutException, httpx.ConnectError) as exc:
+                if self._recorder is not None:
+                    await self._recorder.record(
+                        request=request_dict,
+                        response=None,
+                        status="failure",
+                        start_time=start,
+                        end_time=time.time(),
+                        error=f"{type(exc).__name__}: {exc}",
+                    )
+                return
+
+            try:
+                upstream_status = resp.status_code
+                async for chunk in resp.aiter_bytes():
+                    yield chunk
+                    chunk_dicts, parse_buffer = parse_sse_data_chunks(parse_buffer + chunk)
+                    for chunk_dict in chunk_dicts:
+                        try:
+                            state.handle_chunk(ChatCompletionChunk.model_validate(chunk_dict))
+                        except Exception as exc:  # parser error: forward continues, traj will be partial
+                            logger.debug(f"[record] chunk parse failed (forward continues): {exc}")
+            except httpx.RequestError as exc:
+                # Connection died mid-stream — bytes already sent reach the client;
+                # record what we got and return.
+                if self._recorder is not None:
+                    await self._recorder.record(
+                        request=request_dict,
+                        response=None,
+                        status="failure",
+                        start_time=start,
+                        end_time=time.time(),
+                        error=f"{type(exc).__name__}: {exc}",
+                    )
+                return
+            finally:
+                await resp.aclose()
+
+        if self._recorder is None:
+            return
+
+        status = "success" if upstream_status < 400 else "failure"
+        final_dict: dict | None = None
+        if status == "success":
+            try:
+                final_dict = state.get_final_completion().model_dump()
+            except Exception as exc:
+                logger.warning(f"[record] stream aggregation failed: {exc}")
+
+        await self._recorder.record(
+            request=request_dict,
+            response=final_dict,
+            status=status,
+            start_time=start,
+            end_time=time.time(),
+            error=None if status == "success" else f"upstream_status={upstream_status}",
+        )
+
 
 CompletionBackend = ReplayBackend | ForwardBackend
 

From 55475496b83652596a6f4bc78508b263b2291ae5 Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 09:03:27 +0000
Subject: [PATCH 21/25] docs(model-service): move + rewrite proxy README under
 docs/dev/model-service/

Old examples/model_service/README.md was stale: still mentioned litellm,
StandardLoggingPayload, --num-retries, and the conflated --traj-file flag.

Rewritten to reflect current shape: ForwardBackend / ReplayBackend pair,
recording_file / replay_file mutex, retry-on-status-code with the documented
attempt budget, openai SDK only used as the stream-state aggregator behind the
forwarding path. Also calls out that the rock model-service start subcommand
hasn't been wired up with the new flags yet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 docs/dev/model-service/proxy.md  | 155 +++++++++++++++++++++++++++++++
 examples/model_service/README.md |  90 ------------------
 2 files changed, 155 insertions(+), 90 deletions(-)
 create mode 100644 docs/dev/model-service/proxy.md
 delete mode 100644 examples/model_service/README.md

diff --git a/docs/dev/model-service/proxy.md b/docs/dev/model-service/proxy.md
new file mode 100644
index 0000000000..bcf8a64266
--- /dev/null
+++ b/docs/dev/model-service/proxy.md
@@ -0,0 +1,155 @@
+# model-service `proxy` 模式
+
+`rock model-service` 的 proxy 模式在 `/v1/chat/completions` 上提供一个 OpenAI 兼容的转发层，
+两种工作模式互斥：
+
+| 模式      | 触发条件                              | 上游调用 | 写盘                 |
+|-----------|---------------------------------------|----------|----------------------|
+| Recording | 默认                                  | 真实调用 | append 到 JSONL traj |
+| Replay    | `--replay-file` / `replay_file` 设置  | 不调用   | 不写                 |
+
+设计目标是让 SWE-agent / mini-swe-agent / OpenHands 等 agent 框架在录制 → 回放之间无感切换：
+agent 不变，只换 base URL。
+
+下文所有命令以 `python -m rock.sdk.model.server.main` 启动。注意 `rock model-service start`
+子命令目前**还没**对外暴露 `--recording-file` / `--replay-file`（CLI argparse 在
+[rock/cli/command/model_service.py](../../../rock/cli/command/model_service.py) 单独定义），
+所以涉及录制/回放的场景必须走 `python -m` 入口。
+
+---
+
+## 1. Recording（默认）
+
+转发到单个上游，每次调用 append 一行 JSONL 到 `recording_file`（缺省 `LOG_DIR/LLMTraj.jsonl`，
+其中 `LOG_DIR = $ROCK_MODEL_SERVICE_DATA_DIR`）：
+
+```bash
+export OPENAI_API_KEY="sk-..."
+export ROCK_MODEL_SERVICE_DATA_DIR=/tmp/rock-traj
+
+python -m rock.sdk.model.server.main \
+    --type proxy \
+    --proxy-base-url https://api.openai.com/v1 \
+    --port 8080
+```
+
+调用：
+
+```bash
+curl -X POST http://127.0.0.1:8080/v1/chat/completions \
+    -H "Authorization: Bearer $OPENAI_API_KEY" \
+    -H "Content-Type: application/json" \
+    -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"hi"}]}'
+
+cat /tmp/rock-traj/LLMTraj.jsonl | jq '.model, .response.choices[0].message.content'
+```
+
+流式同样支持，上游字节原样转给客户端，recorder 在后台聚合最终的 `ChatCompletion` 写盘
+（用 openai SDK 的 `ChatCompletionStreamState`，所以 `tool_calls.function.arguments` 等
+跨 chunk 拼接的字段会被还原成完整形态）：
+
+```bash
+curl -N -X POST http://127.0.0.1:8080/v1/chat/completions \
+    -H "Authorization: Bearer $OPENAI_API_KEY" \
+    -H "Content-Type: application/json" \
+    -d '{"model":"gpt-3.5-turbo","stream":true,"messages":[{"role":"user","content":"count to 5"}]}'
+```
+
+显式指定写到别的路径：
+
+```bash
+python -m rock.sdk.model.server.main \
+    --type proxy \
+    --proxy-base-url https://api.openai.com/v1 \
+    --recording-file /tmp/my-session.jsonl \
+    --port 8080
+```
+
+---
+
+## 2. Replay
+
+把 `--replay-file` 指到一个录好的 jsonl，proxy 不再访问真实 LLM，按录制顺序返回响应；
+agent 把 base URL 换成 `http://127.0.0.1:8081/v1` 即可重放：
+
+```bash
+python -m rock.sdk.model.server.main \
+    --type proxy \
+    --replay-file /tmp/rock-traj/LLMTraj.jsonl \
+    --port 8081
+```
+
+行为细节：
+
+- cursor 单调推进，每次请求消耗一条记录；用尽后返回 **404**。
+- 流式请求会拿录制的 `ChatCompletion` 重新发一帧 SSE chunk + `[DONE]`。
+  `tool_calls` 的 `index` 字段会被自动注入（OpenAI 的流式协议要求 chunk delta 上有 `index`，
+  但录制态的 `message.tool_calls` 没有）。
+- request 里的 `model` 会跟录制的 `model` 比对，不一致只打 warning，不阻断。
+
+`recording_file` 和 `replay_file` 是**互斥**的——同时配置（无论是 CLI 还是 YAML）会在启动时
+被 Pydantic `model_validator` 拦下并报 `ValidationError`，避免"录到一半把源文件覆盖"这类隐性 bug。
+
+---
+
+## 3. 重试和超时
+
+- 默认对 connection error / timeout 和 `retryable_status_codes`（默认 `[429, 500]`）触发重试，
+  最多 6 次，指数退避 2s 起步 ×2 + 抖动；最后一次仍失败时把上游响应原样转给客户端
+  （**不**包装成 502/504，让 agent 自己看到真实状态码）。
+- 对**流式**请求，重试只发生在第一个字节抵达客户端**之前**——一旦字节流开始转发，
+  连接中断不会重试（已发出去的字节无法收回）。
+
+```bash
+python -m rock.sdk.model.server.main \
+    --type proxy \
+    --proxy-base-url https://api.openai.com/v1 \
+    --retryable-status-codes 429,500,502,503 \
+    --request-timeout 60 \
+    --port 8080
+```
+
+---
+
+## 4. 多模型路由（YAML）
+
+按 model name 分流到不同上游需要 YAML（CLI 只暴露单一 `--proxy-base-url`）。新建 `routes.yaml`：
+
+```yaml
+proxy_rules:
+  gpt-3.5-turbo: "https://api.openai.com/v1"
+  gpt-4o:        "https://api.openai.com/v1"
+  default:       "https://api-inference.modelscope.cn/v1"
+
+retryable_status_codes: [429, 500, 502]
+request_timeout: 60
+recording_file: /tmp/rock-traj/multi.jsonl
+```
+
+启动：
+
+```bash
+python -m rock.sdk.model.server.main \
+    --type proxy \
+    --config-file routes.yaml \
+    --port 8080
+```
+
+CLI flag（`--proxy-base-url` / `--port` / `--retryable-status-codes` / ...）覆盖 YAML 同名字段。
+路由解析顺序：`proxy_base_url` → `proxy_rules[model]` → `proxy_rules["default"]`，都没有则 400。
+
+---
+
+## 5. 实现要点（仅供参考）
+
+- `chat_completions` endpoint 把请求分发给 `app.state.backend`，后者要么是 `ForwardBackend`
+  要么是 `ReplayBackend`，由启动时的 `_configure_proxy_integrations` 根据 `replay_file`
+  是否设置二选一注入。
+- `ForwardBackend` 走 httpx 字节透传：non-stream 是 `await resp.aread()`，stream 是
+  `resp.aiter_bytes()` 直接 yield 给客户端，**不**经过任何 SDK 的反序列化/再序列化，所以上游
+  返回的 `reasoning_content` / `provider_specific_fields` 等任意 vendor 字段都不会被吃掉。
+  recorder 在另一条独立路径上把字节流喂给 openai SDK 的 stream-state aggregator，仅用于写盘。
+- `ReplayBackend` 完全本地，不持有 httpx client。
+
+更深入的代码导览看 [rock/sdk/model/server/api/proxy.py](../../../rock/sdk/model/server/api/proxy.py)
+顶部的 module docstring。
diff --git a/examples/model_service/README.md b/examples/model_service/README.md
deleted file mode 100644
index 7a169764fe..0000000000
--- a/examples/model_service/README.md
+++ /dev/null
@@ -1,90 +0,0 @@
-# model-service proxy 用法示例
-
-`rock model-service` 的 `proxy` 模式把 `/v1/chat/completions` 转发到上游 LLM，并把每次调用以
-`StandardLoggingPayload` 格式 append 到 JSONL traj 文件。配合 `--traj-file` 可以让相同 base URL 的
-agent（SWE-agent / mini-swe-agent / OpenHands）从录制的 traj 回放，实现"无 LLM 成本"调试。
-
-下面所有命令都用 `python -m rock.sdk.model.server.main` 启动，等价于 `rock model-service start`。
-
-## 1. Record 模式（默认）
-
-转发到单个上游，每次调用 append 到 `LOG_DIR/LLMTraj.jsonl`：
-
-```bash
-export OPENAI_API_KEY="sk-..."
-export ROCK_MODEL_SERVICE_DATA_DIR=/tmp/rock-traj   # traj 文件落盘根目录
-
-python -m rock.sdk.model.server.main \
-    --type proxy \
-    --proxy-base-url https://api.openai.com/v1 \
-    --port 8080
-```
-
-调用：
-
-```bash
-curl -X POST http://127.0.0.1:8080/v1/chat/completions \
-    -H "Authorization: Bearer $OPENAI_API_KEY" \
-    -H "Content-Type: application/json" \
-    -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"hi"}]}'
-
-# 查看 traj
-cat /tmp/rock-traj/LLMTraj.jsonl | jq '.id, .model, .response.choices[0].message.content'
-```
-
-支持流式（litellm 自动聚合写入 traj）：
-
-```bash
-curl -N -X POST http://127.0.0.1:8080/v1/chat/completions \
-    -H "Authorization: Bearer $OPENAI_API_KEY" \
-    -H "Content-Type: application/json" \
-    -d '{"model":"gpt-3.5-turbo","stream":true,"messages":[{"role":"user","content":"count to 5"}]}'
-```
-
-## 2. Replay 模式
-
-把 `--traj-file` 指到一个录好的 jsonl，proxy 不再访问真实 LLM，按录制顺序返回响应：
-
-```bash
-python -m rock.sdk.model.server.main \
-    --type proxy \
-    --traj-file /tmp/rock-traj/LLMTraj.jsonl \
-    --port 8081
-```
-
-agent 把 base URL 换成 `http://127.0.0.1:8081/v1` 即可重放，cursor 用尽后返回 404。
-`--traj-file` 必须是单个 jsonl 文件路径。
-
-## 3. 调整重试和超时
-
-```bash
-python -m rock.sdk.model.server.main \
-    --type proxy \
-    --proxy-base-url https://api.openai.com/v1 \
-    --num-retries 3 \
-    --request-timeout 60 \
-    --port 8080
-```
-
-## 4. 多模型路由（需要 YAML）
-
-只有在按 model name 分流到不同上游时才需要 YAML（CLI 只暴露单一 `--proxy-base-url`）。新建
-`routes.yaml`：
-
-```yaml
-proxy_rules:
-  gpt-3.5-turbo: "https://api.openai.com/v1"
-  gpt-4o: "https://api.openai.com/v1"
-  default: "https://api-inference.modelscope.cn/v1"
-```
-
-启动时配合 CLI：
-
-```bash
-python -m rock.sdk.model.server.main \
-    --type proxy \
-    --config-file routes.yaml \
-    --port 8080
-```
-
-CLI 上指定的 `--proxy-base-url` / `--port` / `--num-retries` 等仍会覆盖 YAML 的同名字段。

From d65269c8a511a2167162daed516389ff7337c0b1 Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 09:09:40 +0000
Subject: [PATCH 22/25] feat(model-service): expose --recording-file /
 --replay-file on rock model-service start
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The two flags previously existed only on the python -m rock.sdk.model.server.main
entry; rock model-service start (which subprocess-spawns that same module) had
no way to thread them through, forcing users to bypass the CLI for any
record/replay scenario.

Wire them through ModelServiceCommand argparse → ModelService.start →
start_sandbox_service → subprocess argv. Add three tests around the argv
construction (default omits both flags, recording_file forwarded,
replay_file forwarded).

Doc updated: drop the python -m caveat and switch all example commands to
rock model-service start.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 docs/dev/model-service/proxy.md      | 17 ++++++-------
 rock/cli/command/model_service.py    | 14 ++++++++++
 rock/sdk/model/service.py            | 10 ++++++++
 tests/unit/sdk/model/test_service.py | 38 ++++++++++++++++++++++++++++
 4 files changed, 70 insertions(+), 9 deletions(-)
 create mode 100644 tests/unit/sdk/model/test_service.py

diff --git a/docs/dev/model-service/proxy.md b/docs/dev/model-service/proxy.md
index bcf8a64266..b0f77da0f3 100644
--- a/docs/dev/model-service/proxy.md
+++ b/docs/dev/model-service/proxy.md
@@ -11,10 +11,9 @@
 设计目标是让 SWE-agent / mini-swe-agent / OpenHands 等 agent 框架在录制 → 回放之间无感切换：
 agent 不变，只换 base URL。
 
-下文所有命令以 `python -m rock.sdk.model.server.main` 启动。注意 `rock model-service start`
-子命令目前**还没**对外暴露 `--recording-file` / `--replay-file`（CLI argparse 在
-[rock/cli/command/model_service.py](../../../rock/cli/command/model_service.py) 单独定义），
-所以涉及录制/回放的场景必须走 `python -m` 入口。
+下文所有命令以 `rock model-service start` 启动；该子命令最终会 `subprocess` 拉起
+`rock.sdk.model.server.main`，两者支持的 flag 一致。直接调试时也可以用
+`python -m rock.sdk.model.server.main` 跳过 PID 文件管理。
 
 ---
 
@@ -27,7 +26,7 @@ agent 不变，只换 base URL。
 export OPENAI_API_KEY="sk-..."
 export ROCK_MODEL_SERVICE_DATA_DIR=/tmp/rock-traj
 
-python -m rock.sdk.model.server.main \
+rock model-service start \
     --type proxy \
     --proxy-base-url https://api.openai.com/v1 \
     --port 8080
@@ -58,7 +57,7 @@ curl -N -X POST http://127.0.0.1:8080/v1/chat/completions \
 显式指定写到别的路径：
 
 ```bash
-python -m rock.sdk.model.server.main \
+rock model-service start \
     --type proxy \
     --proxy-base-url https://api.openai.com/v1 \
     --recording-file /tmp/my-session.jsonl \
@@ -73,7 +72,7 @@ python -m rock.sdk.model.server.main \
 agent 把 base URL 换成 `http://127.0.0.1:8081/v1` 即可重放：
 
 ```bash
-python -m rock.sdk.model.server.main \
+rock model-service start \
     --type proxy \
     --replay-file /tmp/rock-traj/LLMTraj.jsonl \
     --port 8081
@@ -101,7 +100,7 @@ python -m rock.sdk.model.server.main \
   连接中断不会重试（已发出去的字节无法收回）。
 
 ```bash
-python -m rock.sdk.model.server.main \
+rock model-service start \
     --type proxy \
     --proxy-base-url https://api.openai.com/v1 \
     --retryable-status-codes 429,500,502,503 \
@@ -129,7 +128,7 @@ recording_file: /tmp/rock-traj/multi.jsonl
 启动：
 
 ```bash
-python -m rock.sdk.model.server.main \
+rock model-service start \
     --type proxy \
     --config-file routes.yaml \
     --port 8080
diff --git a/rock/cli/command/model_service.py b/rock/cli/command/model_service.py
index 87e6ca60e6..03cc59582d 100644
--- a/rock/cli/command/model_service.py
+++ b/rock/cli/command/model_service.py
@@ -82,6 +82,8 @@ async def arun(self, args: argparse.Namespace):
                 proxy_base_url=args.proxy_base_url,
                 retryable_status_codes=args.retryable_status_codes,
                 request_timeout=args.request_timeout,
+                recording_file=args.recording_file,
+                replay_file=args.replay_file,
             )
             logger.info(f"model service started, pid: {pid}")
             with open(self.DEFAULT_MODEL_SERVICE_PID_FILE, "w") as f:
@@ -178,6 +180,18 @@ async def add_parser_to(subparsers: argparse._SubParsersAction):
             default=None,
             help="Request timeout in seconds. Overrides config file.",
         )
+        start_parser.add_argument(
+            "--recording-file",
+            type=str,
+            default=None,
+            help="Proxy mode only: where to write the trajectory JSONL. Defaults to LOG_DIR/LLMTraj.jsonl.",
+        )
+        start_parser.add_argument(
+            "--replay-file",
+            type=str,
+            default=None,
+            help="Proxy mode only: replay from a recorded .jsonl traj file. Mutually exclusive with --recording-file.",
+        )
 
         watch_agent_parser = model_service_subparsers.add_parser(
             "watch-agent",
diff --git a/rock/sdk/model/service.py b/rock/sdk/model/service.py
index b1b523ed27..24cd7ede38 100644
--- a/rock/sdk/model/service.py
+++ b/rock/sdk/model/service.py
@@ -17,6 +17,8 @@ def start_sandbox_service(
         proxy_base_url: str | None = None,
         retryable_status_codes: str | None = None,
         request_timeout: int | None = None,
+        recording_file: str | None = None,
+        replay_file: str | None = None,
     ) -> subprocess.Popen:
         """start sandbox service"""
         current_file = Path(__file__).resolve()
@@ -38,6 +40,10 @@ def start_sandbox_service(
             cmd.extend(["--retryable-status-codes", retryable_status_codes])
         if request_timeout:
             cmd.extend(["--request-timeout", str(request_timeout)])
+        if recording_file:
+            cmd.extend(["--recording-file", recording_file])
+        if replay_file:
+            cmd.extend(["--replay-file", replay_file])
         process = subprocess.Popen(cmd, cwd=str(service_dir))
         return process
 
@@ -51,6 +57,8 @@ async def start(
         proxy_base_url: str | None = None,
         retryable_status_codes: str | None = None,
         request_timeout: int | None = None,
+        recording_file: str | None = None,
+        replay_file: str | None = None,
     ) -> str:
         process = self.start_sandbox_service(
             model_service_type=model_service_type,
@@ -60,6 +68,8 @@ async def start(
             proxy_base_url=proxy_base_url,
             retryable_status_codes=retryable_status_codes,
             request_timeout=request_timeout,
+            recording_file=recording_file,
+            replay_file=replay_file,
         )
         pid = process.pid
 
diff --git a/tests/unit/sdk/model/test_service.py b/tests/unit/sdk/model/test_service.py
new file mode 100644
index 0000000000..61176173bd
--- /dev/null
+++ b/tests/unit/sdk/model/test_service.py
@@ -0,0 +1,38 @@
+"""Tests for ModelService.start_sandbox_service subprocess command construction.
+
+Covers the CLI flag wiring without actually spawning a subprocess: mock Popen
+and inspect the argv it would have been called with.
+"""
+
+from unittest.mock import patch
+
+from rock.sdk.model.service import ModelService
+
+
+def _captured_argv(**start_kwargs) -> list[str]:
+    with patch("rock.sdk.model.service.subprocess.Popen") as mock_popen:
+        ModelService().start_sandbox_service(**start_kwargs)
+    return mock_popen.call_args[0][0]
+
+
+def test_start_sandbox_service_omits_recording_and_replay_flags_by_default():
+    argv = _captured_argv(model_service_type="proxy", proxy_base_url="https://api.openai.com/v1", port=8080)
+    assert argv[1:5] == ["-m", "main", "--type", "proxy"]
+    assert "--proxy-base-url" in argv and "https://api.openai.com/v1" in argv
+    assert "--port" in argv and "8080" in argv
+    assert "--recording-file" not in argv
+    assert "--replay-file" not in argv
+
+
+def test_start_sandbox_service_passes_recording_file():
+    argv = _captured_argv(model_service_type="proxy", recording_file="/tmp/my-traj.jsonl")
+    idx = argv.index("--recording-file")
+    assert argv[idx + 1] == "/tmp/my-traj.jsonl"
+    assert "--replay-file" not in argv
+
+
+def test_start_sandbox_service_passes_replay_file():
+    argv = _captured_argv(model_service_type="proxy", replay_file="/tmp/in.jsonl")
+    idx = argv.index("--replay-file")
+    assert argv[idx + 1] == "/tmp/in.jsonl"
+    assert "--recording-file" not in argv

From 5cf33bd2ca504ac1f2911b651376637d56b6f367 Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 09:12:06 +0000
Subject: [PATCH 23/25] test(model-service): add CLI-layer coverage for
 --recording-file / --replay-file wiring
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The existing tests/unit/sdk/model/test_service.py covers the subprocess argv
construction (catches cmd-string typos like --recording_file vs --recording-file)
but mocks nothing above the SDK layer, so a missing kwarg in
ModelServiceCommand.arun would slip through.

Add tests/unit/cli/command/test_model_service.py mirroring the test_job.py
pattern: drive the real argparse sub-parser end-to-end and mock
ModelService.start to assert the kwargs it receives. Covers the new flags
both in isolation and in their default (omitted) state.

Two layers, two bug surfaces — together they cover the full path from CLI
argv to subprocess argv.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 tests/unit/cli/command/test_model_service.py | 120 +++++++++++++++++++
 1 file changed, 120 insertions(+)
 create mode 100644 tests/unit/cli/command/test_model_service.py

diff --git a/tests/unit/cli/command/test_model_service.py b/tests/unit/cli/command/test_model_service.py
new file mode 100644
index 0000000000..86849c718b
--- /dev/null
+++ b/tests/unit/cli/command/test_model_service.py
@@ -0,0 +1,120 @@
+"""Unit tests for rock.cli.command.model_service.ModelServiceCommand.
+
+Drive the sub-parser end-to-end with argparse so the surface that users
+actually type at the terminal is what we exercise. ``ModelService.start`` is
+mocked — these tests assert wiring (argparse → handler → SDK call), not the
+subprocess command construction (covered separately in
+tests/unit/sdk/model/test_service.py).
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+from unittest.mock import AsyncMock
+
+import pytest
+
+from rock.cli.command.model_service import ModelServiceCommand
+
+
+def _build_parser() -> argparse.ArgumentParser:
+    """Top-level parser with `model-service` subcommand wired in, same as the CLI."""
+    top = argparse.ArgumentParser(prog="rock")
+    subparsers = top.add_subparsers(dest="command")
+    asyncio.run(ModelServiceCommand.add_parser_to(subparsers))
+    return top
+
+
+@pytest.fixture
+def isolate_pid_file(monkeypatch, tmp_path):
+    """Redirect PID dir/file into tmp so arun() doesn't touch ./data/cli/model."""
+    monkeypatch.setattr(ModelServiceCommand, "DEFAULT_MODEL_SERVICE_DIR", str(tmp_path))
+    monkeypatch.setattr(ModelServiceCommand, "DEFAULT_MODEL_SERVICE_PID_FILE", str(tmp_path / "pid.txt"))
+
+
+@pytest.fixture
+def fake_start(monkeypatch):
+    """Replace ModelService.start with an AsyncMock returning a fixed pid."""
+    mock = AsyncMock(return_value="12345")
+    monkeypatch.setattr("rock.cli.command.model_service.ModelService.start", mock)
+    return mock
+
+
+# ---------- argparse: the new flags must parse ----------
+
+
+def test_recording_file_flag_parses():
+    parser = _build_parser()
+    ns = parser.parse_args(["model-service", "start", "--type", "proxy", "--recording-file", "/tmp/out.jsonl"])
+    assert ns.recording_file == "/tmp/out.jsonl"
+    assert ns.replay_file is None
+
+
+def test_replay_file_flag_parses():
+    parser = _build_parser()
+    ns = parser.parse_args(["model-service", "start", "--type", "proxy", "--replay-file", "/tmp/in.jsonl"])
+    assert ns.replay_file == "/tmp/in.jsonl"
+    assert ns.recording_file is None
+
+
+def test_neither_flag_defaults_to_none():
+    parser = _build_parser()
+    ns = parser.parse_args(["model-service", "start", "--type", "proxy"])
+    assert ns.recording_file is None
+    assert ns.replay_file is None
+
+
+# ---------- handler: passes parsed args through to ModelService.start ----------
+
+
+def test_start_handler_forwards_recording_file(isolate_pid_file, fake_start):
+    parser = _build_parser()
+    ns = parser.parse_args(
+        [
+            "model-service",
+            "start",
+            "--type",
+            "proxy",
+            "--proxy-base-url",
+            "https://api.openai.com/v1",
+            "--recording-file",
+            "/tmp/out.jsonl",
+        ]
+    )
+    asyncio.run(ModelServiceCommand().arun(ns))
+
+    kwargs = fake_start.call_args.kwargs
+    assert kwargs["recording_file"] == "/tmp/out.jsonl"
+    assert kwargs["replay_file"] is None
+    assert kwargs["proxy_base_url"] == "https://api.openai.com/v1"
+    assert kwargs["model_service_type"] == "proxy"
+
+
+def test_start_handler_forwards_replay_file(isolate_pid_file, fake_start):
+    parser = _build_parser()
+    ns = parser.parse_args(
+        [
+            "model-service",
+            "start",
+            "--type",
+            "proxy",
+            "--replay-file",
+            "/tmp/in.jsonl",
+        ]
+    )
+    asyncio.run(ModelServiceCommand().arun(ns))
+
+    kwargs = fake_start.call_args.kwargs
+    assert kwargs["replay_file"] == "/tmp/in.jsonl"
+    assert kwargs["recording_file"] is None
+
+
+def test_start_handler_omits_both_when_unset(isolate_pid_file, fake_start):
+    parser = _build_parser()
+    ns = parser.parse_args(["model-service", "start", "--type", "proxy"])
+    asyncio.run(ModelServiceCommand().arun(ns))
+
+    kwargs = fake_start.call_args.kwargs
+    assert kwargs["recording_file"] is None
+    assert kwargs["replay_file"] is None

From dd35b6dafb5d42e5e17bbdde91a2868d1c82ce8a Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 09:28:21 +0000
Subject: [PATCH 24/25] =?UTF-8?q?test(model-service):=20rename=20test=5Fse?=
 =?UTF-8?q?rvice.py=20=E2=86=92=20test=5Fservice=5Fsubprocess.py=20to=20fi?=
 =?UTF-8?q?x=20CI=20collision?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

tests/integration/envhub/test_service.py shares the same basename, and pytest's
default importmode (prepend) collapses both into a single 'test_service' module
in sys.modules, so collection fails as soon as both files are picked up:

    import file mismatch:
    imported module 'test_service' has this __file__ attribute: .../envhub/test_service.py
    which is not the same as the test file we want to collect: .../sdk/model/test_service.py

Renaming the new file is the smallest fix and keeps importmode=prepend behavior
unchanged for the rest of the suite. The new name also describes the file
better (it tests how start_sandbox_service builds the subprocess argv).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .../sdk/model/{test_service.py => test_service_subprocess.py}     | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename tests/unit/sdk/model/{test_service.py => test_service_subprocess.py} (100%)

diff --git a/tests/unit/sdk/model/test_service.py b/tests/unit/sdk/model/test_service_subprocess.py
similarity index 100%
rename from tests/unit/sdk/model/test_service.py
rename to tests/unit/sdk/model/test_service_subprocess.py

From bc52c25c46116f3be89c8293d7ac05760be9f178 Mon Sep 17 00:00:00 2001
From: "pengshixin.psx" <pengshixin.psx@alibaba-inc.com>
Date: Tue, 12 May 2026 09:31:48 +0000
Subject: [PATCH 25/25] =?UTF-8?q?test(model-service):=20rename=20test=5Fpr?=
 =?UTF-8?q?oxy=5Frecord=5Freplay.py=20=E2=86=92=20...=5Fe2e.py?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The file is the only one in the model-service unit suite that boots a real
uvicorn upstream in a background thread and drives the proxy through real
HTTP — append _e2e to make that scope obvious in the file name. It stays
under tests/unit/ because the project's integration/ tier is reserved for
tests requiring out-of-process services (Docker, Ray, admin), which this one
doesn't.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 ...est_proxy_record_replay.py => test_proxy_record_replay_e2e.py} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename tests/unit/sdk/model/{test_proxy_record_replay.py => test_proxy_record_replay_e2e.py} (100%)

diff --git a/tests/unit/sdk/model/test_proxy_record_replay.py b/tests/unit/sdk/model/test_proxy_record_replay_e2e.py
similarity index 100%
rename from tests/unit/sdk/model/test_proxy_record_replay.py
rename to tests/unit/sdk/model/test_proxy_record_replay_e2e.py