fix(llm): stabilize JSON-object generation (force tool_choice + native response_format + robust planner parsing) by ZhiXiao-Lin · Pull Request #78 · AI45Lab/Code

ZhiXiao-Lin · 2026-06-23T04:20:12Z

Problem

Users reported a3s-code's JSON-object generation is unstable. Root-cause audit found two gaps:

1. Structured output never used provider-native guarantees. The structured engine injected a synthetic `emit_<schema>` tool but only offered it (no `tool_choice`), and Strict/Json modes collapsed to Tool universally, so the model could reply with prose or malformed args. Pure parse-and-pray.
2. The planner / pre-analysis JSON path was fragile. `llm_planner::extract_json` was a naive first-`{`/last-`}` slice: no markdown-fence handling, no balanced extraction, no repair, hard-erroring on any fenced/prosey/brace-in-string output.
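To make the second gap concrete, here is a minimal sketch of balanced extraction that ignores braces inside string literals. It is not the project's shared extractor (its code is not shown here); the function name `extract_json_object` is hypothetical, and the sketch neither validates nor repairs the JSON it returns.

```rust
/// Return the first balanced top-level `{...}` span in `text`, if any.
/// Braces inside JSON string literals (including after escaped quotes)
/// are ignored. Hypothetical helper, not the project's extractor.
fn extract_json_object(text: &str) -> Option<&str> {
    let mut start: Option<usize> = None; // byte index of the opening brace
    let mut depth = 0usize;
    let mut in_string = false;
    let mut escaped = false;

    // Scanning bytes is safe: `{`, `}`, `"` and `\` are ASCII, and UTF-8
    // continuation bytes never collide with ASCII values.
    for (i, &b) in text.as_bytes().iter().enumerate() {
        if start.is_none() {
            // Skip prose and fence markers until the first opening brace.
            if b == b'{' {
                start = Some(i);
                depth = 1;
            }
            continue;
        }
        if in_string {
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
            continue;
        }
        match b {
            b'"' => in_string = true,
            b'{' => depth += 1,
            b'}' => {
                depth -= 1;
                if depth == 0 {
                    return Some(&text[start?..=i]);
                }
            }
            _ => {}
        }
    }
    None // no opening brace, or the object never closed
}

fn main() {
    // The fence marker is built at runtime so this snippet can itself
    // sit inside a fenced block.
    let fence = "`".repeat(3);
    let json = r#"{"goal": "close the } brace", "steps": [{"id": 1}]}"#;
    let reply = format!("Sure, here is the plan:\n{fence}json\n{json}\n{fence}\nLet me know.");
    println!("{:?}", extract_json_object(&reply));
}
```

A first-`{`/last-`}` slice returns the right span for this particular reply, but it breaks as soon as the trailing prose contains a `}`; the balanced scan stops at the matching close instead. One known limitation of the sketch: a stray `{` in the leading prose would be taken as the start of the object.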

Fix (Tier 1 + Tier 2)

- `LlmClient` trait (non-breaking): additive `native_structured_support()`, `complete_structured()`, `complete_streaming_structured()`, all with default impls that reproduce current behavior, so every existing client/mock keeps working.
- `structured.rs`: capability-aware `resolve_mode` + `StructuredDirective`. Force `tool_choice` for Tool/Auto; request native `response_format` (Strict→json_schema+strict, Json→json_object) only on providers that support it, falling back to forced Tool mode otherwise (never silent degradation).
- Providers: Anthropic forces `tool_choice`; OpenAI sets `tool_choice` + `response_format`; Zhipu delegates to its inner OpenAI client. Request building was extracted from request execution so the structured path shares the exact HTTP/retry/parse code (`complete_streaming` left untouched).
- `llm_planner.rs`: the four parse paths now reuse the robust shared extractor, and `pre_analyze` gets one repair-retry (re-prompt for strict JSON) before falling back.
- `generate_object.rs`: stop pre-collapsing Strict/Json to Tool; the engine resolves the mode per capability.
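The capability-aware routing described above can be sketched roughly as follows. The type and function names (`Mode`, `NativeSupport`, `Directive`, `resolve`) are illustrative assumptions; the actual shapes of `resolve_mode` and `StructuredDirective` are not shown in this PR description.

```rust
// Hypothetical types for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Auto,
    Tool,
    Json,
    Strict,
}

/// What the provider can guarantee natively (assumed shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NativeSupport {
    json_object: bool,
    json_schema: bool,
}

/// What the provider client should put on the wire (assumed shape).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Directive {
    ForceTool { name: String },
    NativeJsonObject,
    NativeJsonSchema { strict: bool },
}

/// Pick the strongest guarantee the provider offers for the requested mode.
/// Anything unsupported falls back to a forced tool call, never to an
/// unforced "offer".
fn resolve(mode: Mode, support: NativeSupport, emit_tool: &str) -> Directive {
    match mode {
        Mode::Strict if support.json_schema => Directive::NativeJsonSchema { strict: true },
        Mode::Json if support.json_object => Directive::NativeJsonObject,
        // Auto, Tool, and every unsupported native mode end up here.
        _ => Directive::ForceTool { name: emit_tool.to_string() },
    }
}

fn main() {
    let full = NativeSupport { json_object: true, json_schema: true };
    let none = NativeSupport { json_object: false, json_schema: false };
    for mode in [Mode::Auto, Mode::Tool, Mode::Json, Mode::Strict] {
        println!("{:?}: capable={:?} incapable={:?}",
            mode,
            resolve(mode, full, "emit_plan"),
            resolve(mode, none, "emit_plan"));
    }
}
```

In this sketch Auto and Tool never produce a native directive, which matches the note that those modes never send `response_format`.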

Tests

- Deep adversarial unit tests: capability/directive routing (5), provider wire-format `tool_choice`/`response_format` (Anthropic + OpenAI), adversarial extraction (CRLF/uppercase fences, prose+brace-in-string, single-quote rejection, tool-returns-text fallback), planner fence/prose/brace + repair-retry. 1811 lib tests green, cargo fmt + clippy clean.
- New `#[ignore]` real-LLM integration test (`tests/test_structured_json_real_llm.rs`) driven by `.a3s/config.acl`.
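The single repair-retry that the planner tests exercise can be sketched as below. This is a simplified, synchronous illustration: `parse_with_repair`, the closure-based "model", and the repair wording are all assumptions, not the code in `llm_planner.rs`.

```rust
/// Ask once; if the reply does not parse, re-prompt exactly once for
/// strict JSON. `None` means the caller should use its fallback.
/// Hypothetical helper for illustration.
fn parse_with_repair<T>(
    prompt: &str,
    mut ask: impl FnMut(&str) -> String,
    parse: impl Fn(&str) -> Option<T>,
) -> Option<T> {
    let first = ask(prompt);
    if let Some(value) = parse(&first) {
        return Some(value);
    }
    // One repair attempt only: restate the request and demand bare JSON.
    let repair = format!(
        "{prompt}\n\nYour previous reply was not valid JSON. \
         Reply with a single JSON object only, no prose and no code fences."
    );
    parse(&ask(&repair))
}

fn main() {
    let mut calls = 0;
    // Stand-in model: fails the first time, complies on the retry.
    let model = |_prompt: &str| {
        calls += 1;
        if calls == 1 {
            "I think the plan is...".to_string()
        } else {
            "{\"ok\":true}".to_string()
        }
    };
    // Stand-in parser: accepts anything that looks like an object.
    let parse = |s: &str| {
        let t = s.trim();
        (t.starts_with('{') && t.ends_with('}')).then(|| t.to_string())
    };
    let result = parse_with_repair("Plan the task", model, parse);
    println!("{:?} after {} call(s)", result, calls);
}
```

Capping the retry at one attempt keeps the added latency bounded; the description above says `pre_analyze` falls back after that single retry.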

Real-LLM validation (against `gpt-4o` via the configured gateway)

| Case | Result |
| --- | --- |
| forced `tool_choice` ×5 (stability) | ✅ 5/5 valid objects, 0 repairs |
| native `json_object` | ✅ valid object |
| `pre_analyze` (planner JSON) | ✅ parsed, original request preserved |
| native `json_schema` (strict) | ✅ handled cleanly (gateway rejected → graceful) |

Run: `A3S_CONFIG_FILE=/abs/.a3s/config.acl A3S_TEST_MODEL=openai/<tool-capable-model> cargo test -p a3s-code-core --test test_structured_json_real_llm -- --ignored --nocapture`

Notes / follow-ups

- Strict (`json_schema` strict) is opt-in only; Auto/Tool never send `response_format` (avoids the OpenAI strict-subset 400 footgun).
- Follow-up: override `complete_streaming_structured` on the providers so the streaming structured path also forces the directive; today it uses the non-forcing default (the emit tool is still present, just not forced). Not a correctness bug; a reliability optimization for the streaming path.
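For reference, the request fragments involved look roughly like this. The JSON shapes follow the public OpenAI Chat Completions and Anthropic Messages APIs; the Rust helpers and the `Mode` enum are hypothetical, build the JSON as plain strings for brevity, and assume tool and schema names are simple identifiers that need no escaping.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Auto,
    Tool,
    Json,
    Strict,
}

/// OpenAI-style `tool_choice` value forcing one named function tool.
fn openai_tool_choice(tool: &str) -> String {
    format!(r#"{{"type":"function","function":{{"name":"{tool}"}}}}"#)
}

/// Anthropic-style `tool_choice` value forcing one named tool.
fn anthropic_tool_choice(tool: &str) -> String {
    format!(r#"{{"type":"tool","name":"{tool}"}}"#)
}

/// OpenAI-style `response_format` value. Auto/Tool return `None`, so the
/// field is simply omitted from the request for those modes.
fn openai_response_format(mode: Mode, schema_name: &str, schema_json: &str) -> Option<String> {
    match mode {
        Mode::Auto | Mode::Tool => None,
        Mode::Json => Some(r#"{"type":"json_object"}"#.to_string()),
        Mode::Strict => Some(format!(
            r#"{{"type":"json_schema","json_schema":{{"name":"{schema_name}","strict":true,"schema":{schema_json}}}}}"#
        )),
    }
}

fn main() {
    let schema = r#"{"type":"object","properties":{},"additionalProperties":false}"#;
    println!("openai tool_choice:    {}", openai_tool_choice("emit_plan"));
    println!("anthropic tool_choice: {}", anthropic_tool_choice("emit_plan"));
    for mode in [Mode::Auto, Mode::Tool, Mode::Json, Mode::Strict] {
        println!("{:?} response_format: {:?}", mode, openai_response_format(mode, "plan", schema));
    }
}
```

Keeping `response_format` out of Auto/Tool requests means a schema that falls outside the strict subset can only trigger a rejection when the caller explicitly opted into Strict.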

Users reported unstable JSON-object generation. Two root causes:

1. Structured output never used provider-native guarantees: the synthetic `emit_<schema>` tool was only *offered* (no `tool_choice`), and Strict/Json modes collapsed to Tool universally, so the model could emit prose or malformed args ("parse-and-pray").
2. The planner / pre-analysis JSON path used a naive first-`{`/last-`}` slice with no fence handling and no repair, hard-erroring on fenced/prosey output.

Fix (Tier 1 + Tier 2):

- LlmClient: additive `native_structured_support()`, `complete_structured()`, `complete_streaming_structured()` with default impls (non-breaking: existing clients/mocks keep working).
- structured.rs: capability-aware `resolve_mode` + `StructuredDirective`. Force `tool_choice` (Tool/Auto), and request native `response_format` (Strict→json_schema+strict, Json→json_object) on capable providers, falling back to forced Tool mode otherwise.
- anthropic/openai/zhipu: honor the directive (Anthropic forced tool_choice; OpenAI tool_choice + response_format; Zhipu delegates to its inner client).
- llm_planner: reuse the robust shared extractor + add one repair retry in `pre_analyze`.
- generate_object: stop pre-collapsing Strict/Json to Tool (engine resolves it).

Tests:

- Deep adversarial unit tests: capability/directive routing, provider wire-format (tool_choice/response_format), adversarial JSON extraction, planner fence/prose/brace-in-string + repair-retry. 1811 lib tests green, fmt + clippy clean.
- New `#[ignore]` real-LLM integration test (tests/test_structured_json_real_llm.rs). Validated end-to-end against gpt-4o via the real gateway in .a3s/config.acl: forced tool_choice 5/5 stable (0 repairs), json_object ok, pre_analyze ok.

Notes:

- Strict json_schema is opt-in only; Auto/Tool never send response_format.
- Follow-up: override complete_streaming_structured on providers so the streaming structured path also forces the directive (today it uses the non-forcing default).

Follow-up to the blocking-path fix: providers now override `complete_streaming_structured` so streaming structured generation also forces `tool_choice` / sets native `response_format`, instead of falling back to the non-forcing default.

- anthropic/openai: extract `send_streaming` from `complete_streaming`; the trait methods become thin wrappers, and `complete_streaming_structured` applies the directive before executing. The large streaming parsers are unchanged.
- zhipu: already delegates to its inner client (no change).
- tests: RecordingClient records the streaming directive; new unit test asserts `generate_streaming` forces the tool; new `#[ignore]` real-LLM streaming case.

Validated against gpt-4o: streaming forced tool_choice -> 8 partials, valid object (5/5 integration cases pass).
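The pattern in this follow-up (a non-forcing default method on the trait, plus providers that override it through a shared executor) can be sketched as below. This is a heavily simplified, synchronous illustration: the method signatures, the `Directive` enum, and both client types are assumptions; only the method names `complete_streaming`, `complete_streaming_structured`, and `send_streaming` come from the description above.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Directive {
    None,
    ForceTool(String),
}

trait LlmClient {
    /// Existing streaming entry point (returns collected chunks here).
    fn complete_streaming(&self, prompt: &str) -> Vec<String>;

    /// Additive method with a default that ignores the directive, so
    /// clients and mocks written before it existed keep compiling and
    /// behave as they did.
    fn complete_streaming_structured(&self, prompt: &str, _directive: &Directive) -> Vec<String> {
        self.complete_streaming(prompt)
    }
}

/// A client that predates the new method and relies on the default.
struct LegacyClient;

impl LlmClient for LegacyClient {
    fn complete_streaming(&self, prompt: &str) -> Vec<String> {
        vec![format!("free-form:{prompt}")]
    }
}

/// A provider that overrides the structured path.
struct ForcingClient;

impl ForcingClient {
    /// Shared executor: both trait methods are thin wrappers around it,
    /// so the streaming/parsing logic lives in one place.
    fn send_streaming(&self, prompt: &str, directive: &Directive) -> Vec<String> {
        match directive {
            Directive::None => vec![format!("free-form:{prompt}")],
            Directive::ForceTool(name) => vec![format!("tool:{name}:{prompt}")],
        }
    }
}

impl LlmClient for ForcingClient {
    fn complete_streaming(&self, prompt: &str) -> Vec<String> {
        self.send_streaming(prompt, &Directive::None)
    }

    fn complete_streaming_structured(&self, prompt: &str, directive: &Directive) -> Vec<String> {
        self.send_streaming(prompt, directive)
    }
}

fn main() {
    let directive = Directive::ForceTool("emit_plan".to_string());
    println!("{:?}", LegacyClient.complete_streaming_structured("hi", &directive));
    println!("{:?}", ForcingClient.complete_streaming_structured("hi", &directive));
}
```

The legacy client silently drops the directive, which is the "non-forcing default" behaviour the first commit left in place for streaming; the overriding client applies it.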

ZhiXiao-Lin · 2026-06-23T06:25:42Z

Addressed the streaming follow-up in dfb7acc: providers now override complete_streaming_structured so the streaming structured path also forces tool_choice / native response_format (previously it used the non-forcing default). Validated against gpt-4o via the real gateway — streaming forced tool_choice yielded 8 partial-object callbacks and a valid final object; all 5 real-LLM integration cases pass. 1812 lib tests + fmt + clippy green.

claude added 2 commits June 23, 2026 12:19

ZhiXiao-Lin merged commit e7d01cc into main Jun 23, 2026
1 check passed

ZhiXiao-Lin mentioned this pull request Jun 23, 2026

chore(release): v4.2.0 #79

Merged

ZhiXiao-Lin merged 2 commits into main from fix/structured-json-stability
