feat: add CJK leak sanitization for non-Chinese reports by ardha27 · Pull Request #681 · 666ghj/MiroFish

ardha27 · 2026-06-07T05:54:51Z

What

Adds a post-processing step in ReportAgent.generate_report() that catches CJK (Chinese) characters leaking into non-Chinese reports. The LLM occasionally slips Mandarin into persona quotes despite the system-prompt language instruction, producing output like:

"BI economist said: Purbaya过于倾向财政扩张..."

Why

The system prompt's get_language_instruction() reduces but doesn't eliminate the leak. Persona quote generation can pull from Chinese training data for fluent-sounding speech, even when the surrounding language is English/Indonesian/etc. The CJK font is registered so it renders fine, but readers see mixed-language quotes.

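This adds a safeguard at the post-processing layer that does not depend on the LLM obeying the language instruction. It is best-effort rather than absolute: if the translation call itself fails, the original text is returned unchanged (see Behavior).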

How

Detect runs of CJK Unified Ideographs (U+4E00..U+9FFF) and CJK Symbols/Punctuation (U+3000..U+303F) ≥ 2 chars in length
Batch-translate them via the configured LLM endpoint (reuses the same LLM_API_KEY / LLM_BASE_URL as the rest of MiroFish)
Replace each run in-place, injecting spaces at ASCII boundaries so the result reads naturally in surrounding English/Latin text
Iterate up to 3 passes to catch fragments the LLM leaves in pass 1
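
A minimal sketch of the detection and boundary-aware replacement steps above, for reviewers who want the idea without opening the module. The function names `find_cjk_runs` and `replace_runs` and the hand-written translation are illustrative assumptions, not necessarily what `cjk_sanitize.py` uses, and "ASCII boundary" is interpreted here as an adjacent ASCII letter or digit.

```python
import re

# Runs of 2+ characters from CJK Unified Ideographs (U+4E00..U+9FFF)
# and CJK Symbols/Punctuation (U+3000..U+303F), as listed above.
CJK_RUN = re.compile(r"[\u3000-\u303f\u4e00-\u9fff]{2,}")


def find_cjk_runs(text):
    """Return the unique CJK runs in order of first appearance (hypothetical helper)."""
    seen = {}
    for match in CJK_RUN.finditer(text):
        seen.setdefault(match.group(), None)
    return list(seen)


def replace_runs(text, translations):
    """Swap each run for its translation, adding a space wherever the
    neighbouring character is an ASCII letter or digit (hypothetical helper)."""

    def substitute(match):
        replacement = translations.get(match.group())
        if replacement is None:
            return match.group()  # no translation supplied: leave the run alone
        start, end = match.span()
        left = text[start - 1] if start > 0 else ""
        right = text[end] if end < len(text) else ""
        if left.isascii() and left.isalnum():
            replacement = " " + replacement
        if right.isascii() and right.isalnum():
            replacement = replacement + " "
        return replacement

    return CJK_RUN.sub(substitute, text)


leaked = "BI economist said: Purbaya过于倾向财政扩张..."
runs = find_cjk_runs(leaked)
# Hand-written stand-in for the LLM batch translation.
fixed = replace_runs(leaked, {runs[0]: "leans too far toward fiscal expansion"})
print(fixed)
```

A single stray ideograph such as the one in "a中b" is ignored by this pattern because of the 2-character minimum.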

Behavior

Auto-enabled for non-Chinese locales (en, es, fr, pt, ru, de, id)
Skipped for zh / zh-CN / zh-TW (legitimate CJK content)
No-op when LLM_API_KEY is not configured (logs warning, returns original)
Graceful fallback: any LLM failure returns original text unchanged
Idempotent: re-running on already-sanitized text is a no-op (0 LLM calls)
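
The fallback, multi-pass, and idempotence behavior above can be sketched roughly as follows. This is an assumption-laden outline, not the module's code: `sanitize` and the injected `translate_batch` callable (list of runs in, dict of translations out) are hypothetical names, and boundary spacing is omitted for brevity.

```python
import re

CJK_RUN = re.compile(r"[\u3000-\u303f\u4e00-\u9fff]{2,}")


def sanitize(text, translate_batch, max_passes=3):
    """Translate CJK runs in up to max_passes rounds.

    translate_batch is a hypothetical callable: list[str] -> dict[str, str].
    """
    original = text
    for _ in range(max_passes):
        runs = list(dict.fromkeys(CJK_RUN.findall(text)))
        if not runs:
            break  # clean text: no translation call is made
        try:
            translations = translate_batch(runs)
        except Exception:
            return original  # any failure: return the untouched report
        # Runs the translator did not cover are left in place for the next pass.
        text = CJK_RUN.sub(lambda m: translations.get(m.group(), m.group()), text)
    return text


calls = []


def fake_translate(runs):
    """Stand-in for the LLM that leaves a CJK fragment behind on the first pass."""
    calls.append(list(runs))
    table = {"过于倾向": "overly 倾向", "倾向": "inclined"}
    return {run: table[run] for run in runs}


result = sanitize("x 过于倾向 y", fake_translate)
print(result, len(calls))
```

Because the loop exits as soon as no runs remain, running it again on its own output makes no translation calls, which is the idempotence property listed above.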

Configuration (all optional, in `.env`)

CJK_SANITIZE_ENABLED=0     # force off (default: auto for non-zh locales)
CJK_SANITIZE_LANGS=ja,ko   # override target locale set (default: en,es,fr,pt,ru,de,id)
CJK_SANITIZE_MAX_PASSES=3  # max retry passes (default 3)
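
One way these variables could be interpreted, shown as a hedged sketch. The real `is_enabled()` signature is not shown in this description, so the `locale` and `env` parameters, the `max_passes` helper, and the handling of malformed values are all assumptions.

```python
import os

DEFAULT_LANGS = ("en", "es", "fr", "pt", "ru", "de", "id")


def is_enabled(locale, env=None):
    """Decide whether to sanitize a report in the given locale (assumed signature)."""
    env = os.environ if env is None else env
    if env.get("CJK_SANITIZE_ENABLED", "").strip() == "0":
        return False  # explicit global opt-out
    raw = env.get("CJK_SANITIZE_LANGS", "")
    langs = [part.strip().lower() for part in raw.split(",") if part.strip()]
    if not langs:
        langs = list(DEFAULT_LANGS)
    # Compare on the primary subtag, so zh-CN and zh_TW both reduce to "zh".
    primary = locale.lower().replace("_", "-").split("-")[0]
    return primary in langs


def max_passes(env=None):
    """Read CJK_SANITIZE_MAX_PASSES, falling back to 3 on missing or bad input."""
    env = os.environ if env is None else env
    try:
        return int(env.get("CJK_SANITIZE_MAX_PASSES", "3"))
    except ValueError:
        return 3


print(is_enabled("en", {}), is_enabled("zh-CN", {}))
```

In this sketch zh is excluded simply because it is absent from the default set; whether the module also hard-blocks zh when it is listed in CJK_SANITIZE_LANGS is not stated here.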

Files

Added:

backend/app/utils/cjk_sanitize.py (~250 lines, the module — is_enabled() + sanitize_cjk_in_text())
backend/scripts/test_cjk_sanitize.py (23 unit + integration tests, all passing)

Modified:

backend/app/services/report_agent.py (4-line import + 30-line wire-in after assemble_full_report; saves sanitized version to full_report.md on disk too)
README.md (one section documenting the new env vars)

Testing

23/23 unit tests pass:

CJK run detection (basic, dedup, min-length, order)
Boundary-aware replacement (ASCII left/right, fullwidth parens, number boundaries)
Locale gating (zh skipped, en/id/etc enabled, env override)
Integration (empty, no-CJK, no-API-key, mocked LLM translate, multi-pass, idempotent)
Failure modes (LLM error falls back to original text)

Live test on a real report: the Purbaya/USD-IDR 15,000 simulation (14 kB of Markdown, 24 unique CJK runs) was reduced to 0 CJK runs in 3.4 seconds using the real DeepSeek API.

Risk

Low. The module is:

Additive: no existing behavior changes when not enabled
Bounded: skipped entirely for Chinese reports, no impact on the primary zh audience
Opt-out: CJK_SANITIZE_ENABLED=0 disables globally
Fail-safe: any LLM failure returns original text — sanitization can never corrupt a report

Checklist

Tested locally (23 unit + 1 live integration test)
Documentation in README
Tests in backend/scripts/
No changes to existing zh-user behavior

When the LLM generates persona quotes (BI economists, ministry officials, Reddit commenters, etc.) in non-Chinese locales, it can occasionally slip Chinese characters into otherwise fluent English/Latin prose. The system prompt language instruction reduces but doesn't eliminate this: the LLM sometimes reaches back to its Chinese training data for fluent-sounding speech, producing output like:

"BI economist said: Purbaya过于倾向财政扩张..."

This adds a post-processing step that:

1. Detects runs of CJK Unified Ideographs (U+4E00..U+9FFF) and CJK Symbols/Punctuation (U+3000..U+303F) ≥ 2 chars in length
2. Batch-translates them via the configured LLM endpoint (reusing the same LLM_API_KEY/LLM_BASE_URL as the rest of MiroFish)
3. Replaces each run in-place, injecting spaces at ASCII boundaries so the result reads naturally in surrounding text
4. Iterates up to 3 passes to catch fragments the LLM leaves in pass 1

Behavior:

- Auto-enabled for non-Chinese locales (en, es, fr, pt, ru, de, id)
- Skipped for zh / zh-CN / zh-TW (legitimate CJK content)
- No-op when LLM_API_KEY is not configured (warns and returns original)
- Graceful fallback: any LLM failure returns original text unchanged
- Idempotent: re-running on already-sanitized text is a no-op

Configuration (all optional, set in .env):

CJK_SANITIZE_ENABLED=0     # force off (default: auto for non-zh)
CJK_SANITIZE_LANGS=ja,ko   # override target locale set
CJK_SANITIZE_MAX_PASSES=3  # default 3

Files added:

- backend/app/utils/cjk_sanitize.py (~250 lines, the module)
- backend/scripts/test_cjk_sanitize.py (23 unit + integration tests)

Files modified:

- backend/app/services/report_agent.py (wire-in after assemble_full_report)
- README.md (document config env vars)

Tested: 23/23 unit tests pass; live Purbaya/USD-IDR report (24 unique CJK runs in 14kB markdown) reduced to 0 in 3.4s with real DeepSeek API.

ardha27 wants to merge 1 commit into 666ghj:main from ardha27:feat/cjk-sanitize-middleware
