feat: add CJK leak sanitization for non-Chinese reports #681
Open
ardha27 wants to merge 1 commit into
When the LLM generates persona quotes (BI economists, ministry officials, Reddit commenters, etc.) in non-Chinese locales, it can occasionally slip Chinese characters into otherwise fluent English/Latin prose. The system prompt language instruction reduces but doesn't eliminate this: the LLM sometimes reaches back to its Chinese training data for fluent-sounding speech, producing output like:

    "BI economist said: Purbaya过于倾向财政扩张..."

This adds a post-processing step that:

1. Detects runs of CJK Unified Ideographs (U+4E00..U+9FFF) and CJK Symbols/Punctuation (U+3000..U+303F) ≥ 2 chars in length
2. Batch-translates them via the configured LLM endpoint (reusing the same LLM_API_KEY/LLM_BASE_URL as the rest of MiroFish)
3. Replaces each run in-place, injecting spaces at ASCII boundaries so the result reads naturally in surrounding text
4. Iterates up to 3 passes to catch fragments the LLM leaves in pass 1

Behavior:

- Auto-enabled for non-Chinese locales (en, es, fr, pt, ru, de, id)
- Skipped for zh / zh-CN / zh-TW (legitimate CJK content)
- No-op when LLM_API_KEY is not configured (warns and returns original)
- Graceful fallback: any LLM failure returns original text unchanged
- Idempotent: re-running on already-sanitized text is a no-op

Configuration (all optional, set in .env):

    CJK_SANITIZE_ENABLED=0      # force off (default: auto for non-zh)
    CJK_SANITIZE_LANGS=ja,ko    # override target locale set
    CJK_SANITIZE_MAX_PASSES=3   # default 3

Files added:

- backend/app/utils/cjk_sanitize.py (~250 lines, the module)
- backend/scripts/test_cjk_sanitize.py (23 unit + integration tests)

Files modified:

- backend/app/services/report_agent.py (wire-in after assemble_full_report)
- README.md (document config env vars)

Tested: 23/23 unit tests pass; live Purbaya/USD-IDR report (24 unique CJK runs in 14kB markdown) reduced to 0 in 3.4s with real DeepSeek API.
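The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code in cjk_sanitize.py: the names `find_cjk_runs`, `replace_runs`, and `sanitize` are hypothetical, and the LLM call is replaced by an injected `translate` callable so the sketch runs offline. The spacing rule (pad only next to ASCII letters or digits) is likewise an assumption about what "injecting spaces at ASCII boundaries" means.

```python
import re
from typing import Callable, Dict, List

# Runs of 2+ characters from CJK Symbols/Punctuation (U+3000..U+303F)
# or CJK Unified Ideographs (U+4E00..U+9FFF).
CJK_RUN = re.compile(r"[\u3000-\u303f\u4e00-\u9fff]{2,}")


def find_cjk_runs(text: str) -> List[str]:
    """Return the unique CJK runs in order of first appearance."""
    return list(dict.fromkeys(CJK_RUN.findall(text)))


def replace_runs(text: str, translations: Dict[str, str]) -> str:
    """Swap each translated run in place, padding with a space where the
    neighbouring character is an ASCII letter or digit (assumed rule)."""
    def _sub(m: re.Match) -> str:
        new = translations.get(m.group(0))
        if not new:
            return m.group(0)  # no translation supplied: leave the run alone
        before = text[m.start() - 1] if m.start() > 0 else ""
        after = text[m.end()] if m.end() < len(text) else ""
        left = " " if before.isascii() and before.isalnum() else ""
        right = " " if after.isascii() and after.isalnum() else ""
        return f"{left}{new}{right}"
    return CJK_RUN.sub(_sub, text)


def sanitize(text: str,
             translate: Callable[[List[str]], Dict[str, str]],
             max_passes: int = 3) -> str:
    """Translate and replace CJK runs for up to max_passes passes.
    Any exception from translate returns the original text unchanged."""
    original = text
    for _ in range(max_passes):
        runs = find_cjk_runs(text)
        if not runs:
            break  # nothing left to translate
        try:
            text = replace_runs(text, translate(runs))
        except Exception:
            return original  # graceful fallback
    return text


# Hand-written stub standing in for the batched LLM translation call.
def stub_translate(runs: List[str]) -> Dict[str, str]:
    return {"过于倾向财政扩张": "leans too far toward fiscal expansion"}


print(sanitize("BI economist said: Purbaya过于倾向财政扩张...", stub_translate))
# BI economist said: Purbaya leans too far toward fiscal expansion...
```

Passing the translator in as a callable keeps the detection and replacement logic testable without network access. Running `sanitize` on its own output finds no runs and returns immediately, which is the idempotence property listed under Behavior.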
What
Adds a post-processing step in ReportAgent.generate_report() that catches CJK (Chinese) characters leaking into non-Chinese reports. The LLM occasionally slips Mandarin into persona quotes despite the system-prompt language instruction, producing output like:

    "BI economist said: Purbaya过于倾向财政扩张..."

Why
The system prompt's get_language_instruction() reduces but doesn't eliminate the leak. Persona quote generation can pull from Chinese training data for fluent-sounding speech, even when the surrounding language is English, Indonesian, etc. The CJK font is registered so the text renders fine, but readers see mixed-language quotes.

This is a hard guarantee at the post-processing layer, independent of LLM behavior.
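How
1. Detects runs of CJK Unified Ideographs (U+4E00..U+9FFF) and CJK Symbols/Punctuation (U+3000..U+303F) ≥ 2 chars in length
2. Batch-translates them via the configured LLM endpoint (reusing the same LLM_API_KEY/LLM_BASE_URL as the rest of MiroFish)
3. Replaces each run in-place, injecting spaces at ASCII boundaries so the result reads naturally in surrounding text
4. Iterates up to 3 passes to catch fragments the LLM leaves in pass 1

Behavior
- Auto-enabled for non-Chinese locales (en, es, fr, pt, ru, de, id)
- Skipped for zh / zh-CN / zh-TW (legitimate CJK content)
- No-op when LLM_API_KEY is not configured (logs warning, returns original)

Configuration (all optional, in .env)

    CJK_SANITIZE_ENABLED=0      # force off (default: auto for non-zh)
    CJK_SANITIZE_LANGS=ja,ko    # override target locale set
    CJK_SANITIZE_MAX_PASSES=3   # default 3

Files
Added:
- backend/app/utils/cjk_sanitize.py (~250 lines, the module: is_enabled() + sanitize_cjk_in_text())
- backend/scripts/test_cjk_sanitize.py (23 unit + integration tests, all passing)

Modified:
- backend/app/services/report_agent.py (4-line import + 30-line wire-in after assemble_full_report; also saves the sanitized version to full_report.md on disk)
- README.md (one section documenting the new env vars)

Testing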
23/23 unit tests pass, including locale checks (zh skipped, en/id/etc. enabled, env override).

Live test on a real report: the Purbaya/USD-IDR 15,000 simulation (14 kB markdown, 24 unique CJK runs) was reduced to 0 CJK runs in 3.4 seconds with the real DeepSeek API.
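The locale gating those tests exercise might look something like the sketch below. It is an illustration under stated assumptions, not the module's actual is_enabled(): the function name `should_sanitize`, its signature, and the choice to always skip zh locales even when CJK_SANITIZE_LANGS is set are all hypothetical. Only the value CJK_SANITIZE_ENABLED=0 is handled, since that is the only value the PR documents.

```python
import os
from typing import Mapping, Optional

# Default target locales listed in the PR description.
DEFAULT_TARGET_LANGS = ("en", "es", "fr", "pt", "ru", "de", "id")


def should_sanitize(locale: str,
                    env: Optional[Mapping[str, str]] = None) -> bool:
    """Decide whether CJK sanitization applies to a report locale.

    Hypothetical helper: env defaults to os.environ but can be injected
    so the gating is testable without touching the real environment.
    """
    env = os.environ if env is None else env

    # Documented kill switch: CJK_SANITIZE_ENABLED=0 forces the step off.
    if env.get("CJK_SANITIZE_ENABLED", "").strip() == "0":
        return False

    lang = locale.strip().lower().replace("_", "-")

    # Chinese locales carry legitimate CJK content (assumed to win over
    # any CJK_SANITIZE_LANGS override).
    if lang == "zh" or lang.startswith("zh-"):
        return False

    # Optional override of the target locale set, e.g. "ja,ko".
    override = env.get("CJK_SANITIZE_LANGS", "").strip()
    if override:
        targets = tuple(s.strip().lower() for s in override.split(",") if s.strip())
    else:
        targets = DEFAULT_TARGET_LANGS

    # Compare on the primary subtag so "en-US" matches "en".
    return lang.split("-")[0] in targets


print(should_sanitize("id", {}))                               # True
print(should_sanitize("zh-CN", {}))                            # False
print(should_sanitize("en", {"CJK_SANITIZE_ENABLED": "0"}))    # False
print(should_sanitize("ja", {"CJK_SANITIZE_LANGS": "ja,ko"}))  # True
```

Injecting the environment mapping rather than reading os.environ directly is a design choice for the sketch: each case (default locales, zh skip, env override) can be checked with a plain dict.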
Risk
Low. The module can be disabled globally with CJK_SANITIZE_ENABLED=0.

Checklist
backend/scripts/