Skip to content

feat(model-service): proxy supports stream + replay, byte passthrough, ForwardBackend/ReplayBackend#935

Merged
BCeZn merged 25 commits into
alibaba:masterfrom
BCeZn:feautre/xinshi/litellm-proxy
May 13, 2026
Merged

feat(model-service): proxy supports stream + replay, byte passthrough, ForwardBackend/ReplayBackend#935
BCeZn merged 25 commits into
alibaba:masterfrom
BCeZn:feautre/xinshi/litellm-proxy

Conversation

@BCeZn

@BCeZn BCeZn commented May 12, 2026

Copy link
Copy Markdown
Collaborator

Summary

Refit rock model-service proxy to cover three previously missing capabilities while keeping the existing forward path drop-in compatible.

  • Stream supportresp.aiter_bytes() yielded byte-for-byte to the client; recorder runs on an independent path that feeds chunks into openai SDK's ChatCompletionStreamState to aggregate the final ChatCompletion for the trajectory.
  • Replay modeReplayBackend serves recorded responses directly from a SequentialCursor without any upstream call. Stream replay re-emits the recorded message as one SSE chunk + [DONE]. Cursor exhaustion → 404.
  • Byte passthrough — proxy never reshapes the upstream body, so vendor fields (reasoning_content, provider_specific_fields, …) survive intact in both forward and replay paths.

Strategy split: ForwardBackend and ReplayBackend are attached to app.state.backend at startup; the endpoint just dispatches.

What changed for users

  • New CLI flags on rock model-service start: --recording-file PATH / --replay-file PATH (mutually exclusive). The same fields land in YAML as recording_file / replay_file.
  • Old traj_file / replay_traj_file config fields were renamed to make write-vs-read direction explicit; the CLI flag --traj-file (which previously aliased to replay_traj_file) is gone.
  • Pydantic model_validator rejects setting both fields at once (also fires on field-level reassignment via validate_assignment=True), preventing the silent "record into the file you're replaying from" bug.
  • Retry behavior preserved: connection errors + retryable_status_codes (default [429, 500]), 6 attempts, 2s × 2 backoff with jitter; for streaming, retries only fire before the first byte reaches the client.

File layout

  • rock/sdk/model/server/integrations/{traj_recorder,traj_replayer}.py → merged into rock/sdk/model/server/traj.py; the integrations/ subdir is removed.
  • rock/sdk/model/server/sse_utils.py → renamed to sse.py (the codec is the module's whole purpose).

Tests

Three layers, all in tests/unit/:

  • Pure codec (test_sse.py) — incremental SSE parsing, completion_to_chunk_dict round-trips, the OpenAI-spec-required index injection on replay-stream tool_calls (regression for a real ChatCompletionChunk.model_validate failure).
  • End-to-end via real TCP (test_proxy_record_replay_e2e.py) — in-thread uvicorn mock upstream, no patches anywhere. MockUpstream owns the canonical reply and assertion helper; TestProxyRecordReplay covers forward stream/non-stream + the 2×2 record/replay matrix with parallel tool_calls.
  • CLI wiring two layerstest_model_service.py drives the real argparse sub-parser and asserts kwargs the ModelServiceCommand handler passes to ModelService.start; test_service_subprocess.py mocks subprocess.Popen and asserts the argv start_sandbox_service builds. Together they catch both handler-omits-a-kwarg and cmd-string typos.

108 unit tests pass in <3s.

Test plan

  • uv run pytest tests/unit/sdk/model/ tests/unit/cli/command/ -q → 108 passed
  • uv run ruff check . → clean (pre-commit hooks gate every commit)
  • Manual smoke: rock model-service start --type proxy --proxy-base-url https://dashscope.aliyuncs.com/compatible-mode/v1 --recording-file /tmp/r.jsonl → recorded qwen-plus non-stream + glm-5 non-stream + glm-5 stream; restarted with --replay-file and re-issued the three calls — replayed responses match the recorded ones byte for content / reasoning_content (368 chars on the non-stream call, 1787 chars aggregated on the stream call), stream replay emits one chunk + [DONE]
  • Manual smoke: forward-mode stream request with glm-5 — every SSE chunk delivered to curl carries delta.reasoning_content (and delta.content once it starts), upstream 61.7 KB stream forwarded byte-for-byte to the client; recorder writes one aggregated trajectory line at the end

fixes #934

🤖 Generated with Claude Code

BCeZn and others added 25 commits May 12, 2026 02:52
…d/replay

替换 model-service `proxy` 模式手写的 httpx forward + retry_async,改为基于
litellm SDK 调用,同时新增 chat/completions 轨迹的录制与顺序回放能力,服务
SWE-agent / mini-swe-agent / OpenHands 等 deterministic agent 的无 LLM 成本调试。

主要改动:
- proxy.py 改用 litellm.acompletion(num_retries / extra_headers / streaming)
- 新增 TrajectoryRecorder(CustomLogger) 录制 StandardLoggingPayload 到 JSONL
- 新增 TrajectoryReplayer(CustomLLM) + SequentialCursor 顺序回放单个 jsonl 文件
- ModelServiceConfig 新增 num_retries / traj_enabled / traj_file / replay_traj_path
- CLI 新增 --num-retries / --traj-file(同时承担 replay 入口)
- local 模式保留旧 record_traj 装饰器,不受影响
- 删除 examples 旧 YAML,改 README 主推纯 CLI 启动方式
- docs/dev/litellm_proxy_refactor.md 写明设计与 breaking change

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s cost calc

Three fixes to proxy.py uncovered while testing against DashScope (glm-5):

- Extract Bearer token from incoming Authorization header and pass as litellm
  api_key kwarg; setting it via extra_headers does not work because litellm
  always regenerates Authorization from api_key. Authorization is now stripped
  from the forwarded header set.
- Switch upstream prefix from openai/<model> to custom_openai/<model>. This is
  litellm's standard pattern for OpenAI-compatible third-party endpoints
  (DashScope, ModelScope, Groq, Mistral, ...) and avoids "model isn't mapped"
  on arbitrary upstream model names.
- Pass input_cost_per_token=0 / output_cost_per_token=0 so litellm's cost
  calculator does not raise "model isn't mapped" on unknown models and pollute
  StandardLoggingPayload.response_cost_failure_debug_information.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ursor

The replay path no longer goes through litellm. We have a complete OpenAI-shape
response on disk, so routing it through CustomLLM/CustomStreamWrapper just to
translate formats was pure overhead — and the source of every replay-side bug
(cursor-exhausted retried 6× and wrapped as APIConnectionError, GenericStreamingChunk
type gymnastics, finish_reason hardcoded to "stop", reasoning_content dropped on
streaming, tool_calls reconstruction left as TODO).

Changes:
- traj_replayer.py: delete TrajectoryReplayer(CustomLLM) and helpers; keep just
  SequentialCursor. Cursor exhaustion now raises a plain TrajectoryExhausted.
- proxy.py: in replay mode, fetch from app.state.replay_cursor and emit either
  the raw response dict (non-stream) or one SSE chunk + [DONE] (stream). The
  stream path renames message → delta and preserves all fields verbatim
  (finish_reason, tool_calls, reasoning_content, ...).
- main.py: rename _configure_litellm_for_proxy → _configure_proxy_integrations.
  Replay branch now just attaches a SequentialCursor to app.state; no
  litellm.custom_provider_map registration.
- Tests: drop the CustomLLM-based replayer tests; keep cursor tests; add three
  end-to-end proxy replay tests covering non-stream / stream / cursor exhausted.

43 passed. Direct curl against DashScope glm-5: record + replay (both modes)
verified end-to-end.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…penai SDK as parser

Replaces litellm.acompletion with raw httpx forwarding. The proxy no longer
parses or rewrites the OpenAI protocol on the forward path — body bytes go
upstream as-is, response bytes come back as-is. The openai SDK is kept solely
as a parser library for the recording side: ChatCompletionChunk + the official
ChatCompletionStreamState aggregate streaming chunks into a final ChatCompletion
that the recorder writes to JSONL.

This restores the proxy's original "transparent forward" intent and eliminates
several litellm-specific pain points encountered during testing:
- No "model isn't mapped yet" cost-calc exception (no calc happens at all).
- No need for the input_cost_per_token=0 / custom_openai prefix workarounds.
- Authorization header passes through verbatim (no api_key extraction kludge).
- Provider-specific fields (reasoning_content, citations, ...) are preserved
  byte-for-byte going to the client AND auto-aggregated in the recorded traj
  (openai SDK uses extra="allow" pydantic mode).
- Cursor exhaustion in replay returns 404 directly, never gets retried.

Changes:
- pyproject.toml: drop litellm>=1.50.0, add openai>=1.50.0 and httpx
- proxy.py: rewrite forward path with httpx; record streams via dual-purpose
  byte forwarding + parallel SSE parsing into ChatCompletionStreamState
- traj_recorder.py: drop CustomLogger inheritance; expose explicit
  recorder.record(request, response, status, ...) API called from proxy.py
- main.py: attach recorder/cursor to app.state instead of registering with
  litellm.callbacks / litellm.custom_provider_map
- test_proxy.py: rewritten to use httpx.MockTransport for upstream mocking;
  cover byte passthrough, provider-specific field preservation, error
  forwarding, recorder invocation, replay paths
- test_traj_recorder.py: rewritten for the explicit-call API

36 passed. End-to-end verified against DashScope glm-5: streaming record,
non-stream replay, streaming replay, cursor exhausted -> 404 all work.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ments)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…orwardBackend

Strategy pattern eliminates the replay/forward branch inside chat_completions.
The backend is selected once at startup (_configure_proxy_integrations) and
attached to app.state.backend; the endpoint just parses the request and
dispatches. Each backend keeps the stream/non-stream branch local to itself.

A union type alias _CompletionBackend documents the closed set of backends
and a typed _get_backend(request) accessor wraps the app.state read.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ename replay_traj_path → replay_traj_file

The Backend classes are imported by main.py and tests, so the leading
underscore mis-signalled them as module-internal. Rename the config field
to align with traj_file naming. Also drop the defensive getattr() for
args.traj_file — argparse always sets it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion errors

Retry was lost when litellm was dropped (it provided num_retries internally).
Restored using a unified _send_with_retry helper that:
  - Always opens upstream with stream=True so the same code path serves both
    stream and non-stream callers (non-stream just await resp.aread()).
  - Retries on httpx.TimeoutException, httpx.ConnectError, and HTTP statuses
    in config.retryable_status_codes (default [429, 500]).
  - Defaults: 6 attempts, exponential backoff 2s→32s with jitter — matches
    the original perform_llm_request behavior.
  - For stream: retry happens before any byte is yielded; mid-stream drops
    are not retried (would corrupt downstream).

Module-level retry constants are read at call time so tests can monkeypatch
them. Added 4 tests covering: success after retry, exhausted retries returning
last response, non-whitelisted status not retried, and stream retry path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…upstream

A tiny FastAPI mock app runs in a background thread via uvicorn; the proxy
calls it over real TCP through its own httpx.AsyncClient — production code
path, no transport injection or patching. Three scenarios:
  - non-stream forward: vendor field round-trips, recorder writes JSONL
  - stream forward: SSE chunks reach the client, recorder gets aggregated
    final completion
  - record-then-replay: replay phase uses a bogus base_url to prove the
    upstream isn't called

Tests use FastAPI's TestClient (sync) so the test bodies read top-down with
no async noise; the async wiring lives inside MockUpstreamServer.

Drive-by cleanups in proxy.py: localize the openai SDK imports inside the
streaming aggregator (only needed there), and drop the now-unused
_RETRY_EXCEPTIONS constant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…alls deltas

A recorded non-stream message.tool_calls carries no 'index' field, but the
OpenAI streaming spec requires it on chunk deltas. Without it, downstream
clients using the openai SDK reject the replay-stream chunk with a pydantic
ValidationError ('Field required: index'). completion_to_chunk_dict now
injects a positional index when missing (existing 'index' is preserved).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…RecordReplay

Rename test_proxy_e2e.py → test_proxy_record_replay.py to make the file
purpose explicit (the suite revolves around the record→replay capability).

Refactor the test surface:
  - MockUpstream class encapsulates the FastAPI app, server lifecycle, the
    canonical reply, and an assert_message() helper. Test data and the
    handler stay in sync because they share the same constants.
  - TestProxyRecordReplay class groups the three tests with shorter names:
      test_forward_non_stream
      test_forward_stream
      test_replay  (parametrized over record × replay stream/non-stream)
  - _call_chat_completions helper unifies stream/non-stream call sites.

Expand coverage to 2 parallel tool_calls (get_weather + get_time) — exercises
the openai SDK aggregator's multi-index tool_call assembly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…sse_utils→sse, merge traj_*→traj

The integrations/ directory only ever held two files (traj_recorder, traj_replayer)
and the litellm CustomLogger angle that justified the name is long gone. Both
modules share one JSONL schema, so collapsing them into a single traj.py keeps
the schema and its read/write halves visible together.

sse_utils.py → sse.py: the codec is the module's whole purpose, the _utils
suffix added nothing.

Drop traj_recorder.now() — a one-line wrapper around time.time() with no
callers.

Also remove a stray _get_or_create_metrics_monitor patch in
test_forward_invokes_recorder_on_success: OTLP create-time failure only logs
a warning, so the patch was protecting against nothing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…_file→replay_file

The old names were ambiguous: "traj_file" alone gave no hint of write vs read,
and the CLI flag --traj-file was actually wired to config.replay_traj_file —
same word pointing in opposite directions depending on context.

New names mirror the backend pair (ForwardBackend = recording, ReplayBackend =
replay) so the role is obvious at the field. CLI is split into two independent
flags accordingly.

Recorder constructor still takes traj_file= since it names the JSONL file type,
not its role; only the config-field / CLI surface changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…el_validator

Setting both at once was silently resolved in favor of replay (the backend
factory checks replay_file first), masking what is really a configuration
error. A Pydantic model_validator now rejects the combination at construction
time; validate_assignment=True extends the check to CLI-style field-by-field
overrides applied after a yaml load.

Three tests added: construction-time mutex, assignment-time mutex, and the
existing yaml-load test split into one-side-only variants since the original
deliberately set both fields.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…a staticmethod

Module-level function with a single call site inside ReplayBackend; the SSE
chunk-emit shape is purely a replay-mode implementation detail. Moving it
inside the class also makes the pairing with the JSON branch in serve()
visible at a glance.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…solve_base_url

Single call site inside ForwardBackend.serve, and the function reads only
self._config — drop the redundant config parameter and rename to
_resolve_base_url to make the multi-source fallback (proxy_base_url →
proxy_rules[model] → proxy_rules['default']) explicit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Backend as _stream_and_record

Same single-call-site argument as the previous moves: 3 of the 7 kwargs were
just relaying self._config / self._recorder. As an instance method the
parameter list drops to 4 and the streaming path mirrors the structure of
ReplayBackend._sse_iter.

_send_with_retry stays at module scope — it's a pure helper bound to the
httpx.AsyncClient lifecycle, not to any backend state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-service/

Old examples/model_service/README.md was stale: still mentioned litellm,
StandardLoggingPayload, --num-retries, and the conflated --traj-file flag.

Rewritten to reflect current shape: ForwardBackend / ReplayBackend pair,
recording_file / replay_file mutex, retry-on-status-code with the documented
attempt budget, openai SDK only used as the stream-state aggregator behind the
forwarding path. Also calls out that the rock model-service start subcommand
hasn't been wired up with the new flags yet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…model-service start

The two flags previously existed only on the python -m rock.sdk.model.server.main
entry; rock model-service start (which subprocess-spawns that same module) had
no way to thread them through, forcing users to bypass the CLI for any
record/replay scenario.

Wire them through ModelServiceCommand argparse → ModelService.start →
start_sandbox_service → subprocess argv. Add three tests around the argv
construction (default omits both flags, recording_file forwarded,
replay_file forwarded).

Doc updated: drop the python -m caveat and switch all example commands to
rock model-service start.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…replay-file wiring

The existing tests/unit/sdk/model/test_service.py covers the subprocess argv
construction (catches cmd-string typos like --recording_file vs --recording-file)
but mocks nothing above the SDK layer, so a missing kwarg in
ModelServiceCommand.arun would slip through.

Add tests/unit/cli/command/test_model_service.py mirroring the test_job.py
pattern: drive the real argparse sub-parser end-to-end and mock
ModelService.start to assert the kwargs it receives. Covers the new flags
both in isolation and in their default (omitted) state.

Two layers, two bug surfaces — together they cover the full path from CLI
argv to subprocess argv.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
….py to fix CI collision

tests/integration/envhub/test_service.py shares the same basename, and pytest's
default importmode (prepend) collapses both into a single 'test_service' module
in sys.modules, so collection fails as soon as both files are picked up:

    import file mismatch:
    imported module 'test_service' has this __file__ attribute: .../envhub/test_service.py
    which is not the same as the test file we want to collect: .../sdk/model/test_service.py

Renaming the new file is the smallest fix and keeps importmode=prepend behavior
unchanged for the rest of the suite. The new name also describes the file
better (it tests how start_sandbox_service builds the subprocess argv).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The file is the only one in the model-service unit suite that boots a real
uvicorn upstream in a background thread and drives the proxy through real
HTTP — append _e2e to make that scope obvious in the file name. It stays
under tests/unit/ because the project's integration/ tier is reserved for
tests requiring out-of-process services (Docker, Ray, admin), which this one
doesn't.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@StephenRi StephenRi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@BCeZn BCeZn merged commit a72f7d1 into alibaba:master May 13, 2026
7 of 8 checks passed
zhongwen666 pushed a commit to zhongwen666/ROCK that referenced this pull request May 17, 2026
…, ForwardBackend/ReplayBackend (alibaba#935)

* refactor(model-service): rebuild proxy on litellm SDK with traj record/replay

替换 model-service `proxy` 模式手写的 httpx forward + retry_async,改为基于
litellm SDK 调用,同时新增 chat/completions 轨迹的录制与顺序回放能力,服务
SWE-agent / mini-swe-agent / OpenHands 等 deterministic agent 的无 LLM 成本调试。

主要改动:
- proxy.py 改用 litellm.acompletion(num_retries / extra_headers / streaming)
- 新增 TrajectoryRecorder(CustomLogger) 录制 StandardLoggingPayload 到 JSONL
- 新增 TrajectoryReplayer(CustomLLM) + SequentialCursor 顺序回放单个 jsonl 文件
- ModelServiceConfig 新增 num_retries / traj_enabled / traj_file / replay_traj_path
- CLI 新增 --num-retries / --traj-file(同时承担 replay 入口)
- local 模式保留旧 record_traj 装饰器,不受影响
- 删除 examples 旧 YAML,改 README 主推纯 CLI 启动方式
- docs/dev/litellm_proxy_refactor.md 写明设计与 breaking change

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(model-service): pass api_key + use custom_openai prefix + suppress cost calc

Three fixes to proxy.py uncovered while testing against DashScope (glm-5):

- Extract Bearer token from incoming Authorization header and pass as litellm
  api_key kwarg; setting it via extra_headers does not work because litellm
  always regenerates Authorization from api_key. Authorization is now stripped
  from the forwarded header set.
- Switch upstream prefix from openai/<model> to custom_openai/<model>. This is
  litellm's standard pattern for OpenAI-compatible third-party endpoints
  (DashScope, ModelScope, Groq, Mistral, ...) and avoids "model isn't mapped"
  on arbitrary upstream model names.
- Pass input_cost_per_token=0 / output_cost_per_token=0 so litellm's cost
  calculator does not raise "model isn't mapped" on unknown models and pollute
  StandardLoggingPayload.response_cost_failure_debug_information.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(model-service): drop CustomLLM, serve replay directly from cursor

The replay path no longer goes through litellm. We have a complete OpenAI-shape
response on disk, so routing it through CustomLLM/CustomStreamWrapper just to
translate formats was pure overhead — and the source of every replay-side bug
(cursor-exhausted retried 6× and wrapped as APIConnectionError, GenericStreamingChunk
type gymnastics, finish_reason hardcoded to "stop", reasoning_content dropped on
streaming, tool_calls reconstruction left as TODO).

Changes:
- traj_replayer.py: delete TrajectoryReplayer(CustomLLM) and helpers; keep just
  SequentialCursor. Cursor exhaustion now raises a plain TrajectoryExhausted.
- proxy.py: in replay mode, fetch from app.state.replay_cursor and emit either
  the raw response dict (non-stream) or one SSE chunk + [DONE] (stream). The
  stream path renames message → delta and preserves all fields verbatim
  (finish_reason, tool_calls, reasoning_content, ...).
- main.py: rename _configure_litellm_for_proxy → _configure_proxy_integrations.
  Replay branch now just attaches a SequentialCursor to app.state; no
  litellm.custom_provider_map registration.
- Tests: drop the CustomLLM-based replayer tests; keep cursor tests; add three
  end-to-end proxy replay tests covering non-stream / stream / cursor exhausted.

43 passed. Direct curl against DashScope glm-5: record + replay (both modes)
verified end-to-end.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(model-service): drop litellm, use httpx byte-passthrough + openai SDK as parser

Replaces litellm.acompletion with raw httpx forwarding. The proxy no longer
parses or rewrites the OpenAI protocol on the forward path — body bytes go
upstream as-is, response bytes come back as-is. The openai SDK is kept solely
as a parser library for the recording side: ChatCompletionChunk + the official
ChatCompletionStreamState aggregate streaming chunks into a final ChatCompletion
that the recorder writes to JSONL.

This restores the proxy's original "transparent forward" intent and eliminates
several litellm-specific pain points encountered during testing:
- No "model isn't mapped yet" cost-calc exception (no calc happens at all).
- No need for the input_cost_per_token=0 / custom_openai prefix workarounds.
- Authorization header passes through verbatim (no api_key extraction kludge).
- Provider-specific fields (reasoning_content, citations, ...) are preserved
  byte-for-byte going to the client AND auto-aggregated in the recorded traj
  (openai SDK uses extra="allow" pydantic mode).
- Cursor exhaustion in replay returns 404 directly, never gets retried.

Changes:
- pyproject.toml: drop litellm>=1.50.0, add openai>=1.50.0 and httpx
- proxy.py: rewrite forward path with httpx; record streams via dual-purpose
  byte forwarding + parallel SSE parsing into ChatCompletionStreamState
- traj_recorder.py: drop CustomLogger inheritance; expose explicit
  recorder.record(request, response, status, ...) API called from proxy.py
- main.py: attach recorder/cursor to app.state instead of registering with
  litellm.callbacks / litellm.custom_provider_map
- test_proxy.py: rewritten to use httpx.MockTransport for upstream mocking;
  cover byte passthrough, provider-specific field preservation, error
  forwarding, recorder invocation, replay paths
- test_traj_recorder.py: rewritten for the explicit-call API

36 passed. End-to-end verified against DashScope glm-5: streaming record,
non-stream replay, streaming replay, cursor exhausted -> 404 all work.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: use openai sdk

* chore(model-service): remove litellm remnants (num_retries, stale comments)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(model-service): split proxy handler into _ReplayBackend / _ForwardBackend

Strategy pattern eliminates the replay/forward branch inside chat_completions.
The backend is selected once at startup (_configure_proxy_integrations) and
attached to app.state.backend; the endpoint just parses the request and
dispatches. Each backend keeps the stream/non-stream branch local to itself.

A union type alias _CompletionBackend documents the closed set of backends
and a typed _get_backend(request) accessor wraps the app.state read.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(model-service): drop _ prefix from public Backend classes, rename replay_traj_path → replay_traj_file

The Backend classes are imported by main.py and tests, so the leading
underscore mis-signalled them as module-internal. Rename the config field
to align with traj_file naming. Also drop the defensive getattr() for
args.traj_file — argparse always sets it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(model-service): restore retry on retryable_status_codes + connection errors

Retry was lost when litellm was dropped (it provided num_retries internally).
Restored using a unified _send_with_retry helper that:
  - Always opens upstream with stream=True so the same code path serves both
    stream and non-stream callers (non-stream just await resp.aread()).
  - Retries on httpx.TimeoutException, httpx.ConnectError, and HTTP statuses
    in config.retryable_status_codes (default [429, 500]).
  - Defaults: 6 attempts, exponential backoff 2s→32s with jitter — matches
    the original perform_llm_request behavior.
  - For stream: retry happens before any byte is yielded; mid-stream drops
    are not retried (would corrupt downstream).

Module-level retry constants are read at call time so tests can monkeypatch
them. Added 4 tests covering: success after retry, exhausted retries returning
last response, non-whitelisted status not retried, and stream retry path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(model-service): add e2e tests against an in-thread uvicorn mock upstream

A tiny FastAPI mock app runs in a background thread via uvicorn; the proxy
calls it over real TCP through its own httpx.AsyncClient — production code
path, no transport injection or patching. Three scenarios:
  - non-stream forward: vendor field round-trips, recorder writes JSONL
  - stream forward: SSE chunks reach the client, recorder gets aggregated
    final completion
  - record-then-replay: replay phase uses a bogus base_url to prove the
    upstream isn't called

Tests use FastAPI's TestClient (sync) so the test bodies read top-down with
no async noise; the async wiring lives inside MockUpstreamServer.

Drive-by cleanups in proxy.py: localize the openai SDK imports inside the
streaming aggregator (only needed there), and drop the now-unused
_RETRY_EXCEPTIONS constant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(model-service): inject positional index into replay-stream tool_calls deltas

A recorded non-stream message.tool_calls carries no 'index' field, but the
OpenAI streaming spec requires it on chunk deltas. Without it, downstream
clients using the openai SDK reject the replay-stream chunk with a pydantic
ValidationError ('Field required: index'). completion_to_chunk_dict now
injects a positional index when missing (existing 'index' is preserved).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(model-service): refactor proxy e2e into MockUpstream + TestProxyRecordReplay

Rename test_proxy_e2e.py → test_proxy_record_replay.py to make the file
purpose explicit (the suite revolves around the record→replay capability).

Refactor the test surface:
  - MockUpstream class encapsulates the FastAPI app, server lifecycle, the
    canonical reply, and an assert_message() helper. Test data and the
    handler stay in sync because they share the same constants.
  - TestProxyRecordReplay class groups the three tests with shorter names:
      test_forward_non_stream
      test_forward_stream
      test_replay  (parametrized over record × replay stream/non-stream)
  - _call_chat_completions helper unifies stream/non-stream call sites.

Expand coverage to 2 parallel tool_calls (get_weather + get_time) — exercises
the openai SDK aggregator's multi-index tool_call assembly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: remove useless comment in pyproject.toml

* chore: remove uesless dev docs

* refactor(model-service): flatten layout — drop integrations/, rename sse_utils→sse, merge traj_*→traj

The integrations/ directory only ever held two files (traj_recorder, traj_replayer)
and the litellm CustomLogger angle that justified the name is long gone. Both
modules share one JSONL schema, so collapsing them into a single traj.py keeps
the schema and its read/write halves visible together.

sse_utils.py → sse.py: the codec is the module's whole purpose, the _utils
suffix added nothing.

Drop traj_recorder.now() — a one-line wrapper around time.time() with no
callers.

Also remove a stray _get_or_create_metrics_monitor patch in
test_forward_invokes_recorder_on_success: OTLP create-time failure only logs
a warning, so the patch was protecting against nothing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(model-service): rename traj_file→recording_file, replay_traj_file→replay_file

The old names were ambiguous: "traj_file" alone gave no hint of write vs read,
and the CLI flag --traj-file was actually wired to config.replay_traj_file —
same word pointing in opposite directions depending on context.

New names mirror the backend pair (ForwardBackend = recording, ReplayBackend =
replay) so the role is obvious at the field. CLI is split into two independent
flags accordingly.

Recorder constructor still takes traj_file= since it names the JSONL file type,
not its role; only the config-field / CLI surface changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(model-service): enforce recording_file/replay_file mutex via model_validator

Setting both at once was silently resolved in favor of replay (the backend
factory checks replay_file first), masking what is really a configuration
error. A Pydantic model_validator now rejects the combination at construction
time; validate_assignment=True extends the check to CLI-style field-by-field
overrides applied after a yaml load.

Three tests added: construction-time mutex, assignment-time mutex, and the
existing yaml-load test split into one-side-only variants since the original
deliberately set both fields.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(model-service): move _replay_sse_iter into ReplayBackend as a staticmethod

Module-level function with a single call site inside ReplayBackend; the SSE
chunk-emit shape is purely a replay-mode implementation detail. Moving it
inside the class also makes the pairing with the JSON branch in serve()
visible at a glance.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(model-service): move get_base_url into ForwardBackend as _resolve_base_url

Single call site inside ForwardBackend.serve, and the function reads only
self._config — drop the redundant config parameter and rename to
_resolve_base_url to make the multi-source fallback (proxy_base_url →
proxy_rules[model] → proxy_rules['default']) explicit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(model-service): move _forward_stream_and_record into ForwardBackend as _stream_and_record

Same single-call-site argument as the previous moves: 3 of the 7 kwargs were
just relaying self._config / self._recorder. As an instance method the
parameter list drops to 4 and the streaming path mirrors the structure of
ReplayBackend._sse_iter.

_send_with_retry stays at module scope — it's a pure helper bound to the
httpx.AsyncClient lifecycle, not to any backend state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(model-service): move + rewrite proxy README under docs/dev/model-service/

Old examples/model_service/README.md was stale: still mentioned litellm,
StandardLoggingPayload, --num-retries, and the conflated --traj-file flag.

Rewritten to reflect current shape: ForwardBackend / ReplayBackend pair,
recording_file / replay_file mutex, retry-on-status-code with the documented
attempt budget, openai SDK only used as the stream-state aggregator behind the
forwarding path. Also calls out that the rock model-service start subcommand
hasn't been wired up with the new flags yet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(model-service): expose --recording-file / --replay-file on rock model-service start

The two flags previously existed only on the python -m rock.sdk.model.server.main
entry; rock model-service start (which subprocess-spawns that same module) had
no way to thread them through, forcing users to bypass the CLI for any
record/replay scenario.

Wire them through ModelServiceCommand argparse → ModelService.start →
start_sandbox_service → subprocess argv. Add three tests around the argv
construction (default omits both flags, recording_file forwarded,
replay_file forwarded).

Doc updated: drop the python -m caveat and switch all example commands to
rock model-service start.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(model-service): add CLI-layer coverage for --recording-file / --replay-file wiring

The existing tests/unit/sdk/model/test_service.py covers the subprocess argv
construction (catches cmd-string typos like --recording_file vs --recording-file)
but mocks nothing above the SDK layer, so a missing kwarg in
ModelServiceCommand.arun would slip through.

Add tests/unit/cli/command/test_model_service.py mirroring the test_job.py
pattern: drive the real argparse sub-parser end-to-end and mock
ModelService.start to assert the kwargs it receives. Covers the new flags
both in isolation and in their default (omitted) state.

Two layers, two bug surfaces — together they cover the full path from CLI
argv to subprocess argv.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(model-service): rename test_service.py → test_service_subprocess.py to fix CI collision

tests/integration/envhub/test_service.py shares the same basename, and pytest's
default importmode (prepend) collapses both into a single 'test_service' module
in sys.modules, so collection fails as soon as both files are picked up:

    import file mismatch:
    imported module 'test_service' has this __file__ attribute: .../envhub/test_service.py
    which is not the same as the test file we want to collect: .../sdk/model/test_service.py

Renaming the new file is the smallest fix and keeps importmode=prepend behavior
unchanged for the rest of the suite. The new name also describes the file
better (it tests how start_sandbox_service builds the subprocess argv).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(model-service): rename test_proxy_record_replay.py → ..._e2e.py

The file is the only one in the model-service unit suite that boots a real
uvicorn upstream in a background thread and drives the proxy through real
HTTP — append _e2e to make that scope obvious in the file name. It stays
under tests/unit/ because the project's integration/ tier is reserved for
tests requiring out-of-process services (Docker, Ray, admin), which this one
doesn't.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

drop litellm, rebuild model-service proxy on httpx + openai SDK

2 participants