fix(genai): preserve non-ASCII in tool-call arguments JSON #1804

Open
Humphrey (HumphreySun98) wants to merge 1 commit into langchain-ai:main from HumphreySun98:fix/genai-tool-call-args-unicode-escape

@HumphreySun98

Description

_parse_response_candidate in libs/genai/langchain_google_genai/chat_models.py serialized additional_kwargs["function_call"]["arguments"] with the default json.dumps(...). Python's json defaults to ensure_ascii=True, so every non-ASCII character is escaped to \uXXXX. CJK text, accented characters, and emoji in tool-call arguments became unreadable when written to JSON columns / log files.
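
A minimal, standard-library-only sketch of the behavior described above. It uses the sample text from the linked issue and does not reproduce the actual code in chat_models.py; the variable names are illustrative.

```python
import json

# Sample tool-call arguments containing Korean text (taken from the issue).
args = {"text": "안녕하세요"}

# Default behavior: ensure_ascii=True escapes every non-ASCII character.
escaped = json.dumps(args)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(args, ensure_ascii=False)

print(escaped)   # {"text": "\uc548\ub155\ud558\uc138\uc694"}
print(readable)  # {"text": "안녕하세요"}

# Both strings are valid JSON and decode to the same dict,
# so only consumers that read the raw string see a difference.
assert json.loads(escaped) == json.loads(readable) == args
```

Both forms are valid JSON; the difference only matters when the serialized string itself is stored or displayed without being parsed first.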

The same arguments stay correct in tool_calls[i]["args"] (a clean dict, because parse_tool_calls round-trips through json.loads), so consumers see different content depending on which field they read. From #1789:

msg.tool_calls[0]["args"]                              # → {'text': '안녕하세요'}        ✅
msg.additional_kwargs["function_call"]["arguments"]    # → '{"text": "\\uc548\\ub155\\ud558\\uc138\\uc694"}'  ❌

langchain-openai already passes ensure_ascii=False at the analogous call site in langchain_openai/chat_models/base.py, and langchain-core follows the same convention across its json.dumps sites that touch message content. This change makes langchain-google-genai match — one keyword argument.

Relevant issues

Fixes #1789

Type

🐛 Bug Fix

Changes

  • libs/genai/langchain_google_genai/chat_models.py: pass ensure_ascii=False to the json.dumps that produces function_call["arguments"].
  • libs/genai/tests/unit_tests/test_chat_models.py: add test_parse_response_candidate_preserves_non_ascii_in_function_call_arguments asserting on the raw string (not the json.loads round-trip — the existing parametrized tests round-trip and so were blind to the encoding difference).
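
To illustrate why a round-trip assertion cannot catch this bug while a raw-string assertion can, here is a rough sketch. `serialize_arguments` is a hypothetical stand-in for the `json.dumps` call in `_parse_response_candidate`, not the function under test, and the assertions are not copied from the PR's test.

```python
import json


def serialize_arguments(args: dict, *, ensure_ascii: bool) -> str:
    """Hypothetical stand-in for the json.dumps call that builds the arguments string."""
    return json.dumps(args, ensure_ascii=ensure_ascii)


args = {"text": "안녕하세요"}
before_fix = serialize_arguments(args, ensure_ascii=True)
after_fix = serialize_arguments(args, ensure_ascii=False)

# Round-trip comparison: passes for both outputs, so it cannot detect the bug.
assert json.loads(before_fix) == args
assert json.loads(after_fix) == args

# Raw-string comparison: only the fixed output contains the original text.
assert "안녕하세요" in after_fix
assert "안녕하세요" not in before_fix
assert "\\uc548" in before_fix  # literal backslash-u escape in the old output
```

Asserting on the raw string ties the test to the serialization format itself, which is the property this change is about.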

Testing

$ uv run --group test pytest tests/unit_tests/test_chat_models.py -k parse_response_candidate -v
...
test_parse_response_candidate_preserves_non_ascii_in_function_call_arguments PASSED
============= 17 passed, 203 deselected in 5.18s =============

Reverting only the one-line chat_models.py change while keeping the new test makes that test fail with the exact '\\uc548\\ub155...' escape sequence from the issue, confirming that the test catches the bug.

ruff check and ruff format --check pass on both files.

Note

Per CLAUDE.md PR guidelines: this fix was prepared with the assistance of an AI agent (Claude Code). All code and test changes were reviewed by the author before submission.

`_parse_response_candidate` in `chat_models.py` serialized
`additional_kwargs["function_call"]["arguments"]` with the default
`json.dumps(...)`, which escapes every non-ASCII character to `\uXXXX`.
CJK text, accented characters, and emoji that the model returns in tool
call arguments became unreadable when persisted to JSON columns / log
files. The same arguments stay correct in `tool_calls[i]["args"]` (a
clean dict, because `parse_tool_calls` round-trips through `json.loads`),
so consumers see different content depending on which field they read.

`langchain-openai` already passes `ensure_ascii=False` at the analogous
site (`langchain_openai/chat_models/base.py`), and `langchain-core`
follows the same convention across its `json.dumps` call sites that
touch message content. This change makes `langchain-google-genai` match.

The existing parametrized `test_parse_response_candidate` cases round
arguments through `json.loads` for equality, so they were blind to the
encoding difference. The new regression test asserts on the raw string.

Fixes langchain-ai#1789
@HumphreySun98

Author

Hi Mason Daugherty (@mdrxy) — friendly nudge when you get a chance. This is a one-keyword fix in genai's _parse_response_candidate (ensure_ascii=False in json.dumps), matching the convention already used by langchain-openai's chat model and langchain-core's messages/utils.py:1810. Regression test fails before and passes after; CI green across all 5 Python versions. Happy to adjust placement / naming if you'd prefer.

Humphrey (HumphreySun98) added a commit to HumphreySun98/langchain-google that referenced this pull request Jun 8, 2026
…ation

`BigQueryCallbackHandler` (and the langgraph/async variants) build the
content for BigQuery JSON columns with bare `json.dumps(...)`. Python's
default `ensure_ascii=True` escapes every non-ASCII character to
`\uXXXX`, so CJK / emoji / accented text from chain inputs, outputs,
documents, tool calls, agent actions, and langgraph attributes lands in
storage as escape sequences and is unreadable when inspecting the
BigQuery row directly.

Pass `ensure_ascii=False` at every `json.dumps` site in
`callbacks/bigquery_callback.py` and add unit-test coverage on
`_prepare_arrow_batch` asserting CJK and emoji round-trip into the
resulting `pa.RecordBatch`.
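
As a rough, standard-library-only illustration of the effect on nested payloads: `to_json_column` below is a hypothetical helper, not the handler's actual code, and the payload shape is made up for the example. Note that with the default setting an emoji outside the Basic Multilingual Plane is written as a surrogate pair.

```python
import json


def to_json_column(payload: object) -> str:
    """Hypothetical helper: serialize a payload for a JSON column, keeping non-ASCII text readable."""
    return json.dumps(payload, ensure_ascii=False)


# Made-up payload mixing CJK text and an emoji in nested structures.
payload = {"input": "東京 😀", "docs": [{"page_content": "東京"}]}

default_text = json.dumps(payload)     # ensure_ascii=True (the default)
stored_text = to_json_column(payload)  # ensure_ascii=False

# The default output is pure ASCII; the emoji becomes a surrogate pair.
assert "\\ud83d\\ude00" in default_text
assert "\\u6771\\u4eac" in default_text

# The readable output keeps the original characters at every nesting level.
assert "東京 😀" in stored_text
assert json.loads(stored_text) == json.loads(default_text) == payload
```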

The convention matches what `langchain-openai`, `langchain-core`
(`messages/utils.py:1810`), and our just-shipped genai/vertexai
`_parse_response_candidate` fixes (langchain-ai#1804, langchain-ai#1823) already use.
Development

Successfully merging this pull request may close these issues.

ChatGoogleGenerativeAI: tool-call arguments in additional_kwargs.function_call.arguments are emitted as \uXXXX-escaped strings (CJK / non-ASCII)

1 participant