fix(llm): stabilize JSON-object generation (force tool_choice + native response_format + robust planner parsing) #78

Merged
ZhiXiao-Lin merged 2 commits into main from fix/structured-json-stability
Jun 23, 2026

@ZhiXiao-Lin

Contributor

Problem

Users reported that a3s-code's JSON-object generation is unstable. A root-cause audit found two gaps:

  1. Structured output never used provider-native guarantees. The structured engine injected a synthetic emit_<schema> tool but only offered it (no tool_choice), and Strict/Json modes collapsed to Tool universally — so the model could reply with prose or malformed args. Pure parse-and-pray.
  2. The planner / pre-analysis JSON path was fragile. llm_planner::extract_json was a naive first-{/last-} slice: no markdown-fence handling, no balanced extraction, no repair, hard-erroring on any fenced/prosey/brace-in-string output.
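
To illustrate the second gap, here is a minimal sketch of balanced extraction, as opposed to a first-`{`/last-`}` slice. The function name `extract_first_json_object` is hypothetical and this is not the shared extractor from the PR. It only assumes that the goal is to return the first balanced top-level object while ignoring braces inside string literals, markdown fences, and surrounding prose.

```rust
/// Hypothetical helper (not the PR's extractor): return the first balanced
/// top-level `{ ... }` span in `text`, ignoring braces inside JSON strings.
fn extract_first_json_object(text: &str) -> Option<&str> {
    let mut depth = 0usize; // current brace nesting level
    let mut start = None; // byte index of the opening `{`
    let mut in_string = false; // inside a JSON string literal
    let mut escaped = false; // previous char was a backslash

    for (i, c) in text.char_indices() {
        if in_string {
            // Inside a string: only track escapes and the closing quote.
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }
        match c {
            // Quotes in surrounding prose are ignored; they only matter
            // once we are inside an object.
            '"' if depth > 0 => in_string = true,
            '{' => {
                if depth == 0 {
                    start = Some(i);
                }
                depth += 1;
            }
            '}' if depth > 0 => {
                depth -= 1;
                if depth == 0 {
                    // `}` is one byte, so the inclusive range ends on a
                    // valid char boundary.
                    return start.map(|s| &text[s..=i]);
                }
            }
            _ => {}
        }
    }
    None // no balanced object found
}
```

The sketch does not validate the returned span. A stray `{` in the prose ahead of the object would still start extraction, so a production extractor would likely parse each candidate and move on to the next one on failure.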

Fix (Tier 1 + Tier 2)

  • LlmClient trait (non-breaking): additive native_structured_support(), complete_structured(), complete_streaming_structured() — all with default impls that reproduce current behavior, so every existing client/mock keeps working.
  • structured.rs: capability-aware resolve_mode + StructuredDirective. Force tool_choice for Tool/Auto; request native response_format (Strict→json_schema+strict, Json→json_object) only on providers that support it, falling back to forced Tool mode otherwise (never silent degradation).
  • Providers: Anthropic forces tool_choice; OpenAI sets tool_choice + response_format; Zhipu delegates to its inner OpenAI client. Request-building was extracted from request-execution so the structured path shares the exact HTTP/retry/parse code (complete_streaming left untouched).
  • llm_planner.rs: the four parse paths now reuse the robust shared extractor, and pre_analyze gets one repair-retry (re-prompt for strict JSON) before falling back.
  • generate_object.rs: stop pre-collapsing Strict/Json to Tool — the engine resolves per capability.
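
The routing rule in the structured.rs bullet can be sketched as a single match. The types `Mode`, `NativeSupport`, and `Directive` and the function `resolve_directive` are hypothetical stand-ins. The PR names `resolve_mode` and `StructuredDirective` but does not show their shapes, so treat this as an illustration of the rule rather than the actual code.

```rust
/// Hypothetical requested mode, mirroring the Auto/Tool/Json/Strict names in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Auto,
    Tool,
    Json,
    Strict,
}

/// Hypothetical capability flags a provider might report.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NativeSupport {
    json_object: bool,
    json_schema: bool,
}

/// Hypothetical directive handed to the provider.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Directive {
    ForceTool { tool_name: String },
    NativeJsonObject,
    NativeJsonSchema { strict: bool },
}

/// Pick a native response_format only when the provider supports it;
/// everything else becomes a forced tool call, never an unforced offer.
fn resolve_directive(mode: Mode, support: NativeSupport, tool_name: &str) -> Directive {
    match mode {
        Mode::Strict if support.json_schema => Directive::NativeJsonSchema { strict: true },
        Mode::Json if support.json_object => Directive::NativeJsonObject,
        // Auto, Tool, and any unsupported native mode fall back to forced Tool.
        _ => Directive::ForceTool {
            tool_name: tool_name.to_string(),
        },
    }
}
```

Under this rule, Auto and Tool never produce a response_format directive, which matches the note that Strict is opt-in only.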

Tests

  • Deep adversarial unit tests: capability/directive routing (5), provider wire-format tool_choice/response_format (Anthropic + OpenAI), adversarial extraction (CRLF/uppercase fences, prose+brace-in-string, single-quote rejection, tool-returns-text fallback), planner fence/prose/brace + repair-retry. 1811 lib tests green, cargo fmt + clippy clean.
  • New #[ignore] real-LLM integration test (tests/test_structured_json_real_llm.rs) driven by .a3s/config.acl.
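
For readers unfamiliar with what the wire-format tests check, the sketch below builds the request fragments as documented in the public Anthropic Messages and OpenAI Chat Completions APIs. The function names are hypothetical and the actual request-building code in a3s-code is not shown in this PR. The sketch also assumes the tool and schema names need no JSON escaping.

```rust
/// Anthropic Messages API: `tool_choice` value that forces one named tool.
fn anthropic_tool_choice(tool_name: &str) -> String {
    format!(r#"{{"type":"tool","name":"{}"}}"#, tool_name)
}

/// OpenAI Chat Completions: `tool_choice` value that forces one named function.
fn openai_tool_choice(tool_name: &str) -> String {
    format!(
        r#"{{"type":"function","function":{{"name":"{}"}}}}"#,
        tool_name
    )
}

/// OpenAI Chat Completions: `response_format` value for JSON mode.
fn openai_response_format_json_object() -> String {
    r#"{"type":"json_object"}"#.to_string()
}

/// OpenAI Chat Completions: `response_format` value for strict structured
/// outputs. `schema_json` must already be a serialized JSON Schema.
fn openai_response_format_json_schema(schema_name: &str, schema_json: &str) -> String {
    format!(
        r#"{{"type":"json_schema","json_schema":{{"name":"{}","strict":true,"schema":{}}}}}"#,
        schema_name, schema_json
    )
}
```

Strict mode accepts only a subset of JSON Schema, which is why a schema that works with a forced tool call can still be rejected when sent as a strict `response_format`.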

Real-LLM validation (against gpt-4o via the configured gateway)

| Case | Result |
|---|---|
| forced tool_choice ×5 (stability) | 5/5 valid objects, 0 repairs |
| native json_object | ✅ valid object |
| pre_analyze (planner JSON) | ✅ parsed, original request preserved |
| native json_schema (strict) | ✅ handled cleanly (gateway rejected → graceful) |

Run: A3S_CONFIG_FILE=/abs/.a3s/config.acl A3S_TEST_MODEL=openai/<tool-capable-model> cargo test -p a3s-code-core --test test_structured_json_real_llm -- --ignored --nocapture

Notes / follow-ups

  • Strict (json_schema strict) is opt-in only; Auto/Tool never send response_format (avoids the OpenAI strict-subset 400 footgun).
  • Follow-up: override complete_streaming_structured on the providers so the streaming structured path also forces the directive — today it uses the non-forcing default (the emit tool is still present, just not forced). Not a correctness bug; a reliability optimization for the streaming path.

claude added 2 commits June 23, 2026 12:19
Users reported unstable JSON-object generation. Two root causes:

1. Structured output never used provider-native guarantees: the synthetic
   `emit_<schema>` tool was only *offered* (no `tool_choice`), and Strict/Json
   modes collapsed to Tool universally — so the model could emit prose or
   malformed args ("parse-and-pray").
2. The planner / pre-analysis JSON path used a naive first-`{`/last-`}` slice
   with no fence handling and no repair, hard-erroring on fenced/prosey output.

Fix (Tier 1 + Tier 2):
- LlmClient: additive `native_structured_support()`, `complete_structured()`,
  `complete_streaming_structured()` with default impls (non-breaking — existing
  clients/mocks keep working).
- structured.rs: capability-aware `resolve_mode` + `StructuredDirective`. Force
  `tool_choice` (Tool/Auto), and request native `response_format`
  (Strict→json_schema+strict, Json→json_object) on capable providers, falling
  back to forced Tool mode otherwise.
- anthropic/openai/zhipu: honor the directive (Anthropic forced tool_choice;
  OpenAI tool_choice + response_format; Zhipu delegates to its inner client).
- llm_planner: reuse the robust shared extractor + add one repair retry in
  `pre_analyze`.
- generate_object: stop pre-collapsing Strict/Json to Tool (engine resolves it).
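
Why the trait change is non-breaking can be shown with a small sketch. The trait `LlmClientSketch` and its types are hypothetical, and the methods are synchronous here for brevity, whereas the real `LlmClient` may well be async with different signatures. The point is only that new methods with default bodies leave existing implementors compiling and behaving as before.

```rust
/// Hypothetical capability flags; the default reports no native support.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
struct NativeSupport {
    json_object: bool,
    json_schema: bool,
}

/// Hypothetical directive passed down to the provider.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Directive {
    ForceTool(String),
    NativeJsonObject,
}

trait LlmClientSketch {
    /// The pre-existing required method.
    fn complete(&self, prompt: &str) -> String;

    /// New, additive: defaults to "no native support".
    fn native_structured_support(&self) -> NativeSupport {
        NativeSupport::default()
    }

    /// New, additive: the default ignores the directive and reproduces the
    /// old behavior by delegating to `complete`.
    fn complete_structured(&self, prompt: &str, _directive: &Directive) -> String {
        self.complete(prompt)
    }
}

/// An existing mock that only implements the old method still compiles.
struct LegacyMock;

impl LlmClientSketch for LegacyMock {
    fn complete(&self, prompt: &str) -> String {
        format!("echo: {}", prompt)
    }
}

/// A provider that opts in by overriding the new methods.
struct ForcingClient;

impl LlmClientSketch for ForcingClient {
    fn complete(&self, prompt: &str) -> String {
        format!("free-form: {}", prompt)
    }

    fn native_structured_support(&self) -> NativeSupport {
        NativeSupport { json_object: true, json_schema: false }
    }

    fn complete_structured(&self, prompt: &str, directive: &Directive) -> String {
        match directive {
            Directive::ForceTool(name) => format!("forced {}: {}", name, prompt),
            Directive::NativeJsonObject => format!("json_object: {}", prompt),
        }
    }
}
```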

Tests:
- Deep adversarial unit tests: capability/directive routing, provider
  wire-format (tool_choice/response_format), adversarial JSON extraction,
  planner fence/prose/brace-in-string + repair-retry. 1811 lib tests green,
  fmt + clippy clean.
- New `#[ignore]` real-LLM integration test (tests/test_structured_json_real_llm.rs).
  Validated end-to-end against gpt-4o via the real gateway in .a3s/config.acl:
  forced tool_choice 5/5 stable (0 repairs), json_object ok, pre_analyze ok.

Notes:
- Strict json_schema is opt-in only; Auto/Tool never send response_format.
- Follow-up: override complete_streaming_structured on providers so the
  streaming structured path also forces the directive (today it uses the
  non-forcing default).

Follow-up to the blocking-path fix: providers now override
`complete_streaming_structured` so streaming structured generation also forces
`tool_choice` / sets native `response_format`, instead of falling back to the
non-forcing default.

- anthropic/openai: extract `send_streaming` from `complete_streaming`; the
  trait methods become thin wrappers, and `complete_streaming_structured`
  applies the directive before executing. The large streaming parsers are
  unchanged.
- zhipu: already delegates to its inner client (no change).
- tests: RecordingClient records the streaming directive; new unit test asserts
  `generate_streaming` forces the tool; new `#[ignore]` real-LLM streaming case.
  Validated against gpt-4o: streaming forced tool_choice -> 8 partials, valid
  object (5/5 integration cases pass).
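
The refactor described in this commit (one shared execution path, two thin wrappers) might look roughly like the sketch below. `SketchProvider`, `StreamRequest`, and `build_request` are hypothetical names. `send_streaming` is named in the commit, but its real signature is not shown, and this stand-in performs no network I/O.

```rust
/// Hypothetical request shape; only the field relevant to forcing is modeled.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
struct StreamRequest {
    prompt: String,
    tool_choice: Option<String>,
}

struct SketchProvider;

impl SketchProvider {
    /// Request building, kept separate from request execution.
    fn build_request(&self, prompt: &str) -> StreamRequest {
        StreamRequest {
            prompt: prompt.to_string(),
            tool_choice: None,
        }
    }

    /// Shared execution path. A real provider would do HTTP, retries, and
    /// stream parsing here; this stand-in returns the request it was given
    /// so the effect of each wrapper can be inspected.
    fn send_streaming(&self, request: StreamRequest) -> StreamRequest {
        request
    }

    /// Thin wrapper: plain streaming, nothing forced.
    fn complete_streaming(&self, prompt: &str) -> StreamRequest {
        self.send_streaming(self.build_request(prompt))
    }

    /// Thin wrapper: apply the directive, then reuse the same execution path.
    fn complete_streaming_structured(&self, prompt: &str, forced_tool: &str) -> StreamRequest {
        let mut request = self.build_request(prompt);
        request.tool_choice = Some(forced_tool.to_string());
        self.send_streaming(request)
    }
}
```

Because both wrappers end in the same `send_streaming` call, the streaming parser does not need to change when the structured variant is added.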
@ZhiXiao-Lin

Contributor Author

Addressed the streaming follow-up in dfb7acc: providers now override complete_streaming_structured so the streaming structured path also forces tool_choice / native response_format (previously it used the non-forcing default). Validated against gpt-4o via the real gateway — streaming forced tool_choice yielded 8 partial-object callbacks and a valid final object; all 5 real-LLM integration cases pass. 1812 lib tests + fmt + clippy green.

@ZhiXiao-Lin ZhiXiao-Lin merged commit e7d01cc into main Jun 23, 2026
1 check passed