feat(model-service): proxy supports stream + replay, byte passthrough, ForwardBackend/ReplayBackend by BCeZn · Pull Request #935 · alibaba/ROCK

BCeZn · 2026-05-12T09:17:15Z

Summary

Refit rock model-service proxy to cover three previously missing capabilities while keeping the existing forward path drop-in compatible.

Stream support — resp.aiter_bytes() yielded byte-for-byte to the client; recorder runs on an independent path that feeds chunks into openai SDK's ChatCompletionStreamState to aggregate the final ChatCompletion for the trajectory.
Replay mode — ReplayBackend serves recorded responses directly from a SequentialCursor without any upstream call. Stream replay re-emits the recorded message as one SSE chunk + [DONE]. Cursor exhaustion → 404.
Byte passthrough — proxy never reshapes the upstream body, so vendor fields (reasoning_content, provider_specific_fields, …) survive intact in both forward and replay paths.

Strategy split: ForwardBackend and ReplayBackend are attached to app.state.backend at startup; the endpoint just dispatches.

What changed for users

New CLI flags on rock model-service start: --recording-file PATH / --replay-file PATH (mutually exclusive). The same fields land in YAML as recording_file / replay_file.
Old traj_file / replay_traj_file config fields were renamed to make write-vs-read direction explicit; the CLI flag --traj-file (which previously aliased to replay_traj_file) is gone.
Pydantic model_validator rejects setting both fields at once (also fires on field-level reassignment via validate_assignment=True), preventing the silent "record into the file you're replaying from" bug.
Retry behavior preserved: connection errors + retryable_status_codes (default [429, 500]), 6 attempts, 2s × 2 backoff with jitter; for streaming, retries only fire before the first byte reaches the client.

File layout

rock/sdk/model/server/integrations/{traj_recorder,traj_replayer}.py → merged into rock/sdk/model/server/traj.py; the integrations/ subdir is removed.
rock/sdk/model/server/sse_utils.py → renamed to sse.py (the codec is the module's whole purpose).

Tests

Three layers, all in tests/unit/:

Pure codec (test_sse.py) — incremental SSE parsing, completion_to_chunk_dict round-trips, the OpenAI-spec-required index injection on replay-stream tool_calls (regression for a real ChatCompletionChunk.model_validate failure).
End-to-end via real TCP (test_proxy_record_replay_e2e.py) — in-thread uvicorn mock upstream, no patches anywhere. MockUpstream owns the canonical reply and assertion helper; TestProxyRecordReplay covers forward stream/non-stream + the 2×2 record/replay matrix with parallel tool_calls.
CLI wiring two layers — test_model_service.py drives the real argparse sub-parser and asserts kwargs the ModelServiceCommand handler passes to ModelService.start; test_service_subprocess.py mocks subprocess.Popen and asserts the argv start_sandbox_service builds. Together they catch both handler-omits-a-kwarg and cmd-string typos.

108 unit tests pass in <3s.

Test plan

uv run pytest tests/unit/sdk/model/ tests/unit/cli/command/ -q → 108 passed
uv run ruff check . → clean (pre-commit hooks gate every commit)
Manual smoke: rock model-service start --type proxy --proxy-base-url https://dashscope.aliyuncs.com/compatible-mode/v1 --recording-file /tmp/r.jsonl → recorded qwen-plus non-stream + glm-5 non-stream + glm-5 stream; restarted with --replay-file and re-issued the three calls — replayed responses match the recorded ones byte for content / reasoning_content (368 chars on the non-stream call, 1787 chars aggregated on the stream call), stream replay emits one chunk + [DONE]
Manual smoke: forward-mode stream request with glm-5 — every SSE chunk delivered to curl carries delta.reasoning_content (and delta.content once it starts), upstream 61.7 KB stream forwarded byte-for-byte to the client; recorder writes one aggregated trajectory line at the end

fixes #934

🤖 Generated with Claude Code

…d/replay 替换 model-service `proxy` 模式手写的 httpx forward + retry_async,改为基于 litellm SDK 调用,同时新增 chat/completions 轨迹的录制与顺序回放能力,服务 SWE-agent / mini-swe-agent / OpenHands 等 deterministic agent 的无 LLM 成本调试。主要改动: - proxy.py 改用 litellm.acompletion(num_retries / extra_headers / streaming) - 新增 TrajectoryRecorder(CustomLogger) 录制 StandardLoggingPayload 到 JSONL - 新增 TrajectoryReplayer(CustomLLM) + SequentialCursor 顺序回放单个 jsonl 文件 - ModelServiceConfig 新增 num_retries / traj_enabled / traj_file / replay_traj_path - CLI 新增 --num-retries / --traj-file(同时承担 replay 入口) - local 模式保留旧 record_traj 装饰器,不受影响 - 删除 examples 旧 YAML,改 README 主推纯 CLI 启动方式 - docs/dev/litellm_proxy_refactor.md 写明设计与 breaking change Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…s cost calc Three fixes to proxy.py uncovered while testing against DashScope (glm-5): - Extract Bearer token from incoming Authorization header and pass as litellm api_key kwarg; setting it via extra_headers does not work because litellm always regenerates Authorization from api_key. Authorization is now stripped from the forwarded header set. - Switch upstream prefix from openai/<model> to custom_openai/<model>. This is litellm's standard pattern for OpenAI-compatible third-party endpoints (DashScope, ModelScope, Groq, Mistral, ...) and avoids "model isn't mapped" on arbitrary upstream model names. - Pass input_cost_per_token=0 / output_cost_per_token=0 so litellm's cost calculator does not raise "model isn't mapped" on unknown models and pollute StandardLoggingPayload.response_cost_failure_debug_information. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ursor The replay path no longer goes through litellm. We have a complete OpenAI-shape response on disk, so routing it through CustomLLM/CustomStreamWrapper just to translate formats was pure overhead — and the source of every replay-side bug (cursor-exhausted retried 6× and wrapped as APIConnectionError, GenericStreamingChunk type gymnastics, finish_reason hardcoded to "stop", reasoning_content dropped on streaming, tool_calls reconstruction left as TODO). Changes: - traj_replayer.py: delete TrajectoryReplayer(CustomLLM) and helpers; keep just SequentialCursor. Cursor exhaustion now raises a plain TrajectoryExhausted. - proxy.py: in replay mode, fetch from app.state.replay_cursor and emit either the raw response dict (non-stream) or one SSE chunk + [DONE] (stream). The stream path renames message → delta and preserves all fields verbatim (finish_reason, tool_calls, reasoning_content, ...). - main.py: rename _configure_litellm_for_proxy → _configure_proxy_integrations. Replay branch now just attaches a SequentialCursor to app.state; no litellm.custom_provider_map registration. - Tests: drop the CustomLLM-based replayer tests; keep cursor tests; add three end-to-end proxy replay tests covering non-stream / stream / cursor exhausted. 43 passed. Direct curl against DashScope glm-5: record + replay (both modes) verified end-to-end. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…penai SDK as parser Replaces litellm.acompletion with raw httpx forwarding. The proxy no longer parses or rewrites the OpenAI protocol on the forward path — body bytes go upstream as-is, response bytes come back as-is. The openai SDK is kept solely as a parser library for the recording side: ChatCompletionChunk + the official ChatCompletionStreamState aggregate streaming chunks into a final ChatCompletion that the recorder writes to JSONL. This restores the proxy's original "transparent forward" intent and eliminates several litellm-specific pain points encountered during testing: - No "model isn't mapped yet" cost-calc exception (no calc happens at all). - No need for the input_cost_per_token=0 / custom_openai prefix workarounds. - Authorization header passes through verbatim (no api_key extraction kludge). - Provider-specific fields (reasoning_content, citations, ...) are preserved byte-for-byte going to the client AND auto-aggregated in the recorded traj (openai SDK uses extra="allow" pydantic mode). - Cursor exhaustion in replay returns 404 directly, never gets retried. Changes: - pyproject.toml: drop litellm>=1.50.0, add openai>=1.50.0 and httpx - proxy.py: rewrite forward path with httpx; record streams via dual-purpose byte forwarding + parallel SSE parsing into ChatCompletionStreamState - traj_recorder.py: drop CustomLogger inheritance; expose explicit recorder.record(request, response, status, ...) API called from proxy.py - main.py: attach recorder/cursor to app.state instead of registering with litellm.callbacks / litellm.custom_provider_map - test_proxy.py: rewritten to use httpx.MockTransport for upstream mocking; cover byte passthrough, provider-specific field preservation, error forwarding, recorder invocation, replay paths - test_traj_recorder.py: rewritten for the explicit-call API 36 passed. End-to-end verified against DashScope glm-5: streaming record, non-stream replay, streaming replay, cursor exhausted -> 404 all work. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ments) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…orwardBackend Strategy pattern eliminates the replay/forward branch inside chat_completions. The backend is selected once at startup (_configure_proxy_integrations) and attached to app.state.backend; the endpoint just parses the request and dispatches. Each backend keeps the stream/non-stream branch local to itself. A union type alias _CompletionBackend documents the closed set of backends and a typed _get_backend(request) accessor wraps the app.state read. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ename replay_traj_path → replay_traj_file The Backend classes are imported by main.py and tests, so the leading underscore mis-signalled them as module-internal. Rename the config field to align with traj_file naming. Also drop the defensive getattr() for args.traj_file — argparse always sets it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tion errors Retry was lost when litellm was dropped (it provided num_retries internally). Restored using a unified _send_with_retry helper that: - Always opens upstream with stream=True so the same code path serves both stream and non-stream callers (non-stream just await resp.aread()). - Retries on httpx.TimeoutException, httpx.ConnectError, and HTTP statuses in config.retryable_status_codes (default [429, 500]). - Defaults: 6 attempts, exponential backoff 2s→32s with jitter — matches the original perform_llm_request behavior. - For stream: retry happens before any byte is yielded; mid-stream drops are not retried (would corrupt downstream). Module-level retry constants are read at call time so tests can monkeypatch them. Added 4 tests covering: success after retry, exhausted retries returning last response, non-whitelisted status not retried, and stream retry path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…upstream A tiny FastAPI mock app runs in a background thread via uvicorn; the proxy calls it over real TCP through its own httpx.AsyncClient — production code path, no transport injection or patching. Three scenarios: - non-stream forward: vendor field round-trips, recorder writes JSONL - stream forward: SSE chunks reach the client, recorder gets aggregated final completion - record-then-replay: replay phase uses a bogus base_url to prove the upstream isn't called Tests use FastAPI's TestClient (sync) so the test bodies read top-down with no async noise; the async wiring lives inside MockUpstreamServer. Drive-by cleanups in proxy.py: localize the openai SDK imports inside the streaming aggregator (only needed there), and drop the now-unused _RETRY_EXCEPTIONS constant. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…alls deltas A recorded non-stream message.tool_calls carries no 'index' field, but the OpenAI streaming spec requires it on chunk deltas. Without it, downstream clients using the openai SDK reject the replay-stream chunk with a pydantic ValidationError ('Field required: index'). completion_to_chunk_dict now injects a positional index when missing (existing 'index' is preserved). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…RecordReplay Rename test_proxy_e2e.py → test_proxy_record_replay.py to make the file purpose explicit (the suite revolves around the record→replay capability). Refactor the test surface: - MockUpstream class encapsulates the FastAPI app, server lifecycle, the canonical reply, and an assert_message() helper. Test data and the handler stay in sync because they share the same constants. - TestProxyRecordReplay class groups the three tests with shorter names: test_forward_non_stream test_forward_stream test_replay (parametrized over record × replay stream/non-stream) - _call_chat_completions helper unifies stream/non-stream call sites. Expand coverage to 2 parallel tool_calls (get_weather + get_time) — exercises the openai SDK aggregator's multi-index tool_call assembly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…sse_utils→sse, merge traj_*→traj The integrations/ directory only ever held two files (traj_recorder, traj_replayer) and the litellm CustomLogger angle that justified the name is long gone. Both modules share one JSONL schema, so collapsing them into a single traj.py keeps the schema and its read/write halves visible together. sse_utils.py → sse.py: the codec is the module's whole purpose, the _utils suffix added nothing. Drop traj_recorder.now() — a one-line wrapper around time.time() with no callers. Also remove a stray _get_or_create_metrics_monitor patch in test_forward_invokes_recorder_on_success: OTLP create-time failure only logs a warning, so the patch was protecting against nothing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…_file→replay_file The old names were ambiguous: "traj_file" alone gave no hint of write vs read, and the CLI flag --traj-file was actually wired to config.replay_traj_file — same word pointing in opposite directions depending on context. New names mirror the backend pair (ForwardBackend = recording, ReplayBackend = replay) so the role is obvious at the field. CLI is split into two independent flags accordingly. Recorder constructor still takes traj_file= since it names the JSONL file type, not its role; only the config-field / CLI surface changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…el_validator Setting both at once was silently resolved in favor of replay (the backend factory checks replay_file first), masking what is really a configuration error. A Pydantic model_validator now rejects the combination at construction time; validate_assignment=True extends the check to CLI-style field-by-field overrides applied after a yaml load. Three tests added: construction-time mutex, assignment-time mutex, and the existing yaml-load test split into one-side-only variants since the original deliberately set both fields. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…a staticmethod Module-level function with a single call site inside ReplayBackend; the SSE chunk-emit shape is purely a replay-mode implementation detail. Moving it inside the class also makes the pairing with the JSON branch in serve() visible at a glance. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…solve_base_url Single call site inside ForwardBackend.serve, and the function reads only self._config — drop the redundant config parameter and rename to _resolve_base_url to make the multi-source fallback (proxy_base_url → proxy_rules[model] → proxy_rules['default']) explicit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…Backend as _stream_and_record Same single-call-site argument as the previous moves: 3 of the 7 kwargs were just relaying self._config / self._recorder. As an instance method the parameter list drops to 4 and the streaming path mirrors the structure of ReplayBackend._sse_iter. _send_with_retry stays at module scope — it's a pure helper bound to the httpx.AsyncClient lifecycle, not to any backend state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…-service/ Old examples/model_service/README.md was stale: still mentioned litellm, StandardLoggingPayload, --num-retries, and the conflated --traj-file flag. Rewritten to reflect current shape: ForwardBackend / ReplayBackend pair, recording_file / replay_file mutex, retry-on-status-code with the documented attempt budget, openai SDK only used as the stream-state aggregator behind the forwarding path. Also calls out that the rock model-service start subcommand hasn't been wired up with the new flags yet. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…model-service start The two flags previously existed only on the python -m rock.sdk.model.server.main entry; rock model-service start (which subprocess-spawns that same module) had no way to thread them through, forcing users to bypass the CLI for any record/replay scenario. Wire them through ModelServiceCommand argparse → ModelService.start → start_sandbox_service → subprocess argv. Add three tests around the argv construction (default omits both flags, recording_file forwarded, replay_file forwarded). Doc updated: drop the python -m caveat and switch all example commands to rock model-service start. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…replay-file wiring The existing tests/unit/sdk/model/test_service.py covers the subprocess argv construction (catches cmd-string typos like --recording_file vs --recording-file) but mocks nothing above the SDK layer, so a missing kwarg in ModelServiceCommand.arun would slip through. Add tests/unit/cli/command/test_model_service.py mirroring the test_job.py pattern: drive the real argparse sub-parser end-to-end and mock ModelService.start to assert the kwargs it receives. Covers the new flags both in isolation and in their default (omitted) state. Two layers, two bug surfaces — together they cover the full path from CLI argv to subprocess argv. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

….py to fix CI collision tests/integration/envhub/test_service.py shares the same basename, and pytest's default importmode (prepend) collapses both into a single 'test_service' module in sys.modules, so collection fails as soon as both files are picked up: import file mismatch: imported module 'test_service' has this __file__ attribute: .../envhub/test_service.py which is not the same as the test file we want to collect: .../sdk/model/test_service.py Renaming the new file is the smallest fix and keeps importmode=prepend behavior unchanged for the rest of the suite. The new name also describes the file better (it tests how start_sandbox_service builds the subprocess argv). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The file is the only one in the model-service unit suite that boots a real uvicorn upstream in a background thread and drives the proxy through real HTTP — append _e2e to make that scope obvious in the file name. It stays under tests/unit/ because the project's integration/ tier is reserved for tests requiring out-of-process services (Docker, Ray, admin), which this one doesn't. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

StephenRi

LGTM

…, ForwardBackend/ReplayBackend (alibaba#935) * refactor(model-service): rebuild proxy on litellm SDK with traj record/replay 替换 model-service `proxy` 模式手写的 httpx forward + retry_async,改为基于 litellm SDK 调用,同时新增 chat/completions 轨迹的录制与顺序回放能力,服务 SWE-agent / mini-swe-agent / OpenHands 等 deterministic agent 的无 LLM 成本调试。主要改动: - proxy.py 改用 litellm.acompletion(num_retries / extra_headers / streaming) - 新增 TrajectoryRecorder(CustomLogger) 录制 StandardLoggingPayload 到 JSONL - 新增 TrajectoryReplayer(CustomLLM) + SequentialCursor 顺序回放单个 jsonl 文件 - ModelServiceConfig 新增 num_retries / traj_enabled / traj_file / replay_traj_path - CLI 新增 --num-retries / --traj-file(同时承担 replay 入口) - local 模式保留旧 record_traj 装饰器,不受影响 - 删除 examples 旧 YAML,改 README 主推纯 CLI 启动方式 - docs/dev/litellm_proxy_refactor.md 写明设计与 breaking change Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(model-service): pass api_key + use custom_openai prefix + suppress cost calc Three fixes to proxy.py uncovered while testing against DashScope (glm-5): - Extract Bearer token from incoming Authorization header and pass as litellm api_key kwarg; setting it via extra_headers does not work because litellm always regenerates Authorization from api_key. Authorization is now stripped from the forwarded header set. - Switch upstream prefix from openai/<model> to custom_openai/<model>. This is litellm's standard pattern for OpenAI-compatible third-party endpoints (DashScope, ModelScope, Groq, Mistral, ...) and avoids "model isn't mapped" on arbitrary upstream model names. - Pass input_cost_per_token=0 / output_cost_per_token=0 so litellm's cost calculator does not raise "model isn't mapped" on unknown models and pollute StandardLoggingPayload.response_cost_failure_debug_information. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor(model-service): drop CustomLLM, serve replay directly from cursor The replay path no longer goes through litellm. We have a complete OpenAI-shape response on disk, so routing it through CustomLLM/CustomStreamWrapper just to translate formats was pure overhead — and the source of every replay-side bug (cursor-exhausted retried 6× and wrapped as APIConnectionError, GenericStreamingChunk type gymnastics, finish_reason hardcoded to "stop", reasoning_content dropped on streaming, tool_calls reconstruction left as TODO). Changes: - traj_replayer.py: delete TrajectoryReplayer(CustomLLM) and helpers; keep just SequentialCursor. Cursor exhaustion now raises a plain TrajectoryExhausted. - proxy.py: in replay mode, fetch from app.state.replay_cursor and emit either the raw response dict (non-stream) or one SSE chunk + [DONE] (stream). The stream path renames message → delta and preserves all fields verbatim (finish_reason, tool_calls, reasoning_content, ...). - main.py: rename _configure_litellm_for_proxy → _configure_proxy_integrations. Replay branch now just attaches a SequentialCursor to app.state; no litellm.custom_provider_map registration. - Tests: drop the CustomLLM-based replayer tests; keep cursor tests; add three end-to-end proxy replay tests covering non-stream / stream / cursor exhausted. 43 passed. Direct curl against DashScope glm-5: record + replay (both modes) verified end-to-end. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor(model-service): drop litellm, use httpx byte-passthrough + openai SDK as parser Replaces litellm.acompletion with raw httpx forwarding. The proxy no longer parses or rewrites the OpenAI protocol on the forward path — body bytes go upstream as-is, response bytes come back as-is. The openai SDK is kept solely as a parser library for the recording side: ChatCompletionChunk + the official ChatCompletionStreamState aggregate streaming chunks into a final ChatCompletion that the recorder writes to JSONL. This restores the proxy's original "transparent forward" intent and eliminates several litellm-specific pain points encountered during testing: - No "model isn't mapped yet" cost-calc exception (no calc happens at all). - No need for the input_cost_per_token=0 / custom_openai prefix workarounds. - Authorization header passes through verbatim (no api_key extraction kludge). - Provider-specific fields (reasoning_content, citations, ...) are preserved byte-for-byte going to the client AND auto-aggregated in the recorded traj (openai SDK uses extra="allow" pydantic mode). - Cursor exhaustion in replay returns 404 directly, never gets retried. Changes: - pyproject.toml: drop litellm>=1.50.0, add openai>=1.50.0 and httpx - proxy.py: rewrite forward path with httpx; record streams via dual-purpose byte forwarding + parallel SSE parsing into ChatCompletionStreamState - traj_recorder.py: drop CustomLogger inheritance; expose explicit recorder.record(request, response, status, ...) API called from proxy.py - main.py: attach recorder/cursor to app.state instead of registering with litellm.callbacks / litellm.custom_provider_map - test_proxy.py: rewritten to use httpx.MockTransport for upstream mocking; cover byte passthrough, provider-specific field preservation, error forwarding, recorder invocation, replay paths - test_traj_recorder.py: rewritten for the explicit-call API 36 passed. End-to-end verified against DashScope glm-5: streaming record, non-stream replay, streaming replay, cursor exhausted -> 404 all work. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor: use openai sdk * chore(model-service): remove litellm remnants (num_retries, stale comments) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor(model-service): split proxy handler into _ReplayBackend / _ForwardBackend Strategy pattern eliminates the replay/forward branch inside chat_completions. The backend is selected once at startup (_configure_proxy_integrations) and attached to app.state.backend; the endpoint just parses the request and dispatches. Each backend keeps the stream/non-stream branch local to itself. A union type alias _CompletionBackend documents the closed set of backends and a typed _get_backend(request) accessor wraps the app.state read. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor(model-service): drop _ prefix from public Backend classes, rename replay_traj_path → replay_traj_file The Backend classes are imported by main.py and tests, so the leading underscore mis-signalled them as module-internal. Rename the config field to align with traj_file naming. Also drop the defensive getattr() for args.traj_file — argparse always sets it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(model-service): restore retry on retryable_status_codes + connection errors Retry was lost when litellm was dropped (it provided num_retries internally). Restored using a unified _send_with_retry helper that: - Always opens upstream with stream=True so the same code path serves both stream and non-stream callers (non-stream just await resp.aread()). - Retries on httpx.TimeoutException, httpx.ConnectError, and HTTP statuses in config.retryable_status_codes (default [429, 500]). - Defaults: 6 attempts, exponential backoff 2s→32s with jitter — matches the original perform_llm_request behavior. - For stream: retry happens before any byte is yielded; mid-stream drops are not retried (would corrupt downstream). Module-level retry constants are read at call time so tests can monkeypatch them. Added 4 tests covering: success after retry, exhausted retries returning last response, non-whitelisted status not retried, and stream retry path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * test(model-service): add e2e tests against an in-thread uvicorn mock upstream A tiny FastAPI mock app runs in a background thread via uvicorn; the proxy calls it over real TCP through its own httpx.AsyncClient — production code path, no transport injection or patching. Three scenarios: - non-stream forward: vendor field round-trips, recorder writes JSONL - stream forward: SSE chunks reach the client, recorder gets aggregated final completion - record-then-replay: replay phase uses a bogus base_url to prove the upstream isn't called Tests use FastAPI's TestClient (sync) so the test bodies read top-down with no async noise; the async wiring lives inside MockUpstreamServer. Drive-by cleanups in proxy.py: localize the openai SDK imports inside the streaming aggregator (only needed there), and drop the now-unused _RETRY_EXCEPTIONS constant. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(model-service): inject positional index into replay-stream tool_calls deltas A recorded non-stream message.tool_calls carries no 'index' field, but the OpenAI streaming spec requires it on chunk deltas. Without it, downstream clients using the openai SDK reject the replay-stream chunk with a pydantic ValidationError ('Field required: index'). completion_to_chunk_dict now injects a positional index when missing (existing 'index' is preserved). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(model-service): refactor proxy e2e into MockUpstream + TestProxyRecordReplay Rename test_proxy_e2e.py → test_proxy_record_replay.py to make the file purpose explicit (the suite revolves around the record→replay capability). Refactor the test surface: - MockUpstream class encapsulates the FastAPI app, server lifecycle, the canonical reply, and an assert_message() helper. Test data and the handler stay in sync because they share the same constants. - TestProxyRecordReplay class groups the three tests with shorter names: test_forward_non_stream test_forward_stream test_replay (parametrized over record × replay stream/non-stream) - _call_chat_completions helper unifies stream/non-stream call sites. Expand coverage to 2 parallel tool_calls (get_weather + get_time) — exercises the openai SDK aggregator's multi-index tool_call assembly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore: remove useless comment in pyproject.toml * chore: remove uesless dev docs * refactor(model-service): flatten layout — drop integrations/, rename sse_utils→sse, merge traj_*→traj The integrations/ directory only ever held two files (traj_recorder, traj_replayer) and the litellm CustomLogger angle that justified the name is long gone. Both modules share one JSONL schema, so collapsing them into a single traj.py keeps the schema and its read/write halves visible together. sse_utils.py → sse.py: the codec is the module's whole purpose, the _utils suffix added nothing. Drop traj_recorder.now() — a one-line wrapper around time.time() with no callers. Also remove a stray _get_or_create_metrics_monitor patch in test_forward_invokes_recorder_on_success: OTLP create-time failure only logs a warning, so the patch was protecting against nothing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(model-service): rename traj_file→recording_file, replay_traj_file→replay_file The old names were ambiguous: "traj_file" alone gave no hint of write vs read, and the CLI flag --traj-file was actually wired to config.replay_traj_file — same word pointing in opposite directions depending on context. New names mirror the backend pair (ForwardBackend = recording, ReplayBackend = replay) so the role is obvious at the field. CLI is split into two independent flags accordingly. Recorder constructor still takes traj_file= since it names the JSONL file type, not its role; only the config-field / CLI surface changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-service): enforce recording_file/replay_file mutex via model_validator Setting both at once was silently resolved in favor of replay (the backend factory checks replay_file first), masking what is really a configuration error. A Pydantic model_validator now rejects the combination at construction time; validate_assignment=True extends the check to CLI-style field-by-field overrides applied after a yaml load. Three tests added: construction-time mutex, assignment-time mutex, and the existing yaml-load test split into one-side-only variants since the original deliberately set both fields. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(model-service): move _replay_sse_iter into ReplayBackend as a staticmethod Module-level function with a single call site inside ReplayBackend; the SSE chunk-emit shape is purely a replay-mode implementation detail. Moving it inside the class also makes the pairing with the JSON branch in serve() visible at a glance. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(model-service): move get_base_url into ForwardBackend as _resolve_base_url Single call site inside ForwardBackend.serve, and the function reads only self._config — drop the redundant config parameter and rename to _resolve_base_url to make the multi-source fallback (proxy_base_url → proxy_rules[model] → proxy_rules['default']) explicit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(model-service): move _forward_stream_and_record into ForwardBackend as _stream_and_record Same single-call-site argument as the previous moves: 3 of the 7 kwargs were just relaying self._config / self._recorder. As an instance method the parameter list drops to 4 and the streaming path mirrors the structure of ReplayBackend._sse_iter. _send_with_retry stays at module scope — it's a pure helper bound to the httpx.AsyncClient lifecycle, not to any backend state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(model-service): move + rewrite proxy README under docs/dev/model-service/ Old examples/model_service/README.md was stale: still mentioned litellm, StandardLoggingPayload, --num-retries, and the conflated --traj-file flag. Rewritten to reflect current shape: ForwardBackend / ReplayBackend pair, recording_file / replay_file mutex, retry-on-status-code with the documented attempt budget, openai SDK only used as the stream-state aggregator behind the forwarding path. Also calls out that the rock model-service start subcommand hasn't been wired up with the new flags yet. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-service): expose --recording-file / --replay-file on rock model-service start The two flags previously existed only on the python -m rock.sdk.model.server.main entry; rock model-service start (which subprocess-spawns that same module) had no way to thread them through, forcing users to bypass the CLI for any record/replay scenario. Wire them through ModelServiceCommand argparse → ModelService.start → start_sandbox_service → subprocess argv. Add three tests around the argv construction (default omits both flags, recording_file forwarded, replay_file forwarded). Doc updated: drop the python -m caveat and switch all example commands to rock model-service start. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(model-service): add CLI-layer coverage for --recording-file / --replay-file wiring The existing tests/unit/sdk/model/test_service.py covers the subprocess argv construction (catches cmd-string typos like --recording_file vs --recording-file) but mocks nothing above the SDK layer, so a missing kwarg in ModelServiceCommand.arun would slip through. Add tests/unit/cli/command/test_model_service.py mirroring the test_job.py pattern: drive the real argparse sub-parser end-to-end and mock ModelService.start to assert the kwargs it receives. Covers the new flags both in isolation and in their default (omitted) state. Two layers, two bug surfaces — together they cover the full path from CLI argv to subprocess argv. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(model-service): rename test_service.py → test_service_subprocess.py to fix CI collision tests/integration/envhub/test_service.py shares the same basename, and pytest's default importmode (prepend) collapses both into a single 'test_service' module in sys.modules, so collection fails as soon as both files are picked up: import file mismatch: imported module 'test_service' has this __file__ attribute: .../envhub/test_service.py which is not the same as the test file we want to collect: .../sdk/model/test_service.py Renaming the new file is the smallest fix and keeps importmode=prepend behavior unchanged for the rest of the suite. The new name also describes the file better (it tests how start_sandbox_service builds the subprocess argv). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(model-service): rename test_proxy_record_replay.py → ..._e2e.py The file is the only one in the model-service unit suite that boots a real uvicorn upstream in a background thread and drives the proxy through real HTTP — append _e2e to make that scope obvious in the file name. It stays under tests/unit/ because the project's integration/ tier is reserved for tests requiring out-of-process services (Docker, Ray, admin), which this one doesn't. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

BCeZn and others added 25 commits May 12, 2026 02:52

refactor: use openai sdk

164e427

chore(model-service): remove litellm remnants (num_retries, stale com…

a3459c9

…ments) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: remove useless comment in pyproject.toml

c7bbd5a

chore: remove uesless dev docs

56a644c

StephenRi approved these changes May 13, 2026

View reviewed changes

BCeZn merged commit a72f7d1 into alibaba:master May 13, 2026
7 of 8 checks passed

BCeZn mentioned this pull request May 13, 2026

[Feature] Integrate in-sandbox model-service proxy into Job layer for record/replay #937

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(model-service): proxy supports stream + replay, byte passthrough, ForwardBackend/ReplayBackend#935

feat(model-service): proxy supports stream + replay, byte passthrough, ForwardBackend/ReplayBackend#935
BCeZn merged 25 commits into
alibaba:masterfrom
BCeZn:feautre/xinshi/litellm-proxy

BCeZn commented May 12, 2026 •

edited

Loading

Uh oh!

StephenRi left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

BCeZn commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed for users

File layout

Tests

Test plan

Uh oh!

StephenRi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

BCeZn commented May 12, 2026 •

edited

Loading