feat: add CJK leak sanitization for non-Chinese reports #681

Open
ardha27 wants to merge 1 commit into 666ghj:main from ardha27:feat/cjk-sanitize-middleware

Conversation

ardha27 commented Jun 7, 2026

What

Adds a post-processing step in ReportAgent.generate_report() that catches CJK (Chinese) characters leaking into non-Chinese reports. The LLM occasionally slips Mandarin into persona quotes despite the system-prompt language instruction, producing output like:

"BI economist said: Purbaya过于倾向财政扩张..."

Why

The system prompt's get_language_instruction() reduces but doesn't eliminate the leak. Persona quote generation can pull from Chinese training data for fluent-sounding speech, even when the surrounding language is English/Indonesian/etc. The CJK font is registered so it renders fine, but readers see mixed-language quotes.

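This adds a safeguard at the post-processing layer that applies whether or not the generation step obeyed the language instruction. It is best-effort rather than absolute: translation still goes through the LLM, and a failed call leaves the original text in place.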

How

  1. Detect runs of at least 2 characters drawn from CJK Unified Ideographs (U+4E00..U+9FFF) and CJK Symbols and Punctuation (U+3000..U+303F)
  2. Batch-translate them via the configured LLM endpoint (reuses the same LLM_API_KEY / LLM_BASE_URL as the rest of MiroFish)
  3. Replace each run in-place, injecting spaces at ASCII boundaries so the result reads naturally in surrounding English/Latin text
  4. Iterate up to 3 passes (configurable via CJK_SANITIZE_MAX_PASSES) to catch fragments the LLM leaves untranslated in an earlier pass
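
Steps 1 and 3 above can be sketched as follows. This is a minimal illustration, not the code in cjk_sanitize.py: the names `find_cjk_runs` and `replace_runs` are hypothetical, and the rule of padding only next to ASCII letters and digits is an assumption about what "ASCII boundaries" means.

```python
import re

# Runs of two or more characters from the two Unicode blocks named in step 1.
CJK_RUN = re.compile(r"[\u3000-\u303f\u4e00-\u9fff]{2,}")


def find_cjk_runs(text: str) -> list[str]:
    """Return the unique CJK runs in order of first appearance."""
    seen: dict[str, None] = {}
    for match in CJK_RUN.finditer(text):
        seen.setdefault(match.group(0), None)
    return list(seen)


def replace_runs(text: str, translations: dict[str, str]) -> str:
    """Swap each translated run in place, adding a space wherever the
    neighbouring character is an ASCII letter or digit (assumed rule)."""

    def _sub(match: re.Match) -> str:
        run = match.group(0)
        replacement = translations.get(run)
        if replacement is None:
            return run  # runs without a translation are left untouched
        start, end = match.span()
        before = text[start - 1] if start > 0 else ""
        after = text[end] if end < len(text) else ""
        if before.isascii() and before.isalnum():
            replacement = " " + replacement
        if after.isascii() and after.isalnum():
            replacement = replacement + " "
        return replacement

    return CJK_RUN.sub(_sub, text)
```

With the example quote from this description, `find_cjk_runs` yields the single run after "Purbaya", and `replace_runs` inserts a space between "Purbaya" and the translated phrase while leaving the trailing "..." attached.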

Behavior

  • Auto-enabled for non-Chinese locales (en, es, fr, pt, ru, de, id)
  • Skipped for zh / zh-CN / zh-TW (legitimate CJK content)
  • No-op when LLM_API_KEY is not configured (logs warning, returns original)
  • Graceful fallback: any LLM failure returns original text unchanged
  • Idempotent: re-running on already-sanitized text is a no-op (0 LLM calls)
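
The fallback, multi-pass, and idempotence behavior listed above could look roughly like the driver below. It is a sketch under stated assumptions: the translator is passed in as a plain callable (hypothetical `Translator` type) instead of calling the real LLM endpoint, boundary spacing is omitted for brevity, and "returns original" is read as returning the text exactly as it was before the first pass.

```python
import re
from typing import Callable

CJK_RUN = re.compile(r"[\u3000-\u303f\u4e00-\u9fff]{2,}")

# Hypothetical translator interface: takes the unique runs, returns run -> translation.
Translator = Callable[[list[str]], dict[str, str]]


def sanitize(text: str, translate: Translator, max_passes: int = 3) -> str:
    """Translate leaked CJK runs, retrying on leftovers, never raising."""
    current = text
    for _ in range(max_passes):
        runs = list(dict.fromkeys(CJK_RUN.findall(current)))
        if not runs:
            break  # clean text: the translator is never called
        try:
            mapping = translate(runs)
        except Exception:
            return text  # assumed fail-safe: hand back the untouched original
        updated = CJK_RUN.sub(
            lambda m: mapping.get(m.group(0), m.group(0)), current
        )
        if updated == current:
            break  # no progress, so further passes would only repeat the call
        current = updated
    return current
```

Because the loop exits before any translator call when no run is found, re-running it on sanitized output costs nothing, which matches the "0 LLM calls" claim above.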

Configuration (all optional, in .env)

CJK_SANITIZE_ENABLED=0     # force off (default: auto for non-zh locales)
CJK_SANITIZE_LANGS=ja,ko   # override target locale set (default: en,es,fr,pt,ru,de,id)
CJK_SANITIZE_MAX_PASSES=3  # max retry passes (default 3)
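
One plausible way to read these variables is sketched below. The function name `is_enabled` comes from this PR, but its signature and body here are assumptions, as are the helper `max_passes`, the choice to treat any value other than `0` as "auto", and the choice to skip zh locales even when CJK_SANITIZE_LANGS lists them.

```python
import os
from typing import Mapping, Optional

DEFAULT_LANGS = ("en", "es", "fr", "pt", "ru", "de", "id")


def is_enabled(locale: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Decide whether sanitization should run for this report locale."""
    env = os.environ if env is None else env
    if env.get("CJK_SANITIZE_ENABLED", "").strip() == "0":
        return False  # explicit global opt-out
    # Normalise "zh_TW" / "zh-CN" / "EN" down to the primary language tag.
    primary = locale.strip().lower().replace("_", "-").split("-")[0]
    if primary == "zh":
        return False  # Chinese reports legitimately contain CJK
    raw = env.get("CJK_SANITIZE_LANGS", "")
    langs = {item.strip().lower() for item in raw.split(",") if item.strip()}
    return primary in (langs or set(DEFAULT_LANGS))


def max_passes(env: Optional[Mapping[str, str]] = None) -> int:
    """Read CJK_SANITIZE_MAX_PASSES, falling back to 3 on bad input."""
    env = os.environ if env is None else env
    try:
        return max(1, int(env.get("CJK_SANITIZE_MAX_PASSES", "3")))
    except ValueError:
        return 3
```

Passing the environment as a mapping keeps the gating logic testable without touching the real process environment.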

Files

Added:

  • backend/app/utils/cjk_sanitize.py (~250 lines, the module — is_enabled() + sanitize_cjk_in_text())
  • backend/scripts/test_cjk_sanitize.py (23 unit + integration tests, all passing)

Modified:

  • backend/app/services/report_agent.py (4-line import + 30-line wire-in after assemble_full_report; saves sanitized version to full_report.md on disk too)
  • README.md (one section documenting the new env vars)
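
The wire-in in report_agent.py is not reproduced in this description; the sketch below only illustrates the general shape such a step might take. `finalize_report` and its parameters are hypothetical, and the gate and sanitizer are injected as callables because the real signatures of is_enabled() and sanitize_cjk_in_text() are not shown here.

```python
from pathlib import Path
from typing import Callable


def finalize_report(
    report_md: str,
    locale: str,
    report_dir: Path,
    is_enabled: Callable[[str], bool],
    sanitize: Callable[[str], str],
) -> str:
    """Sanitize an assembled report when the locale calls for it, then save it."""
    final_md = report_md
    if is_enabled(locale):
        try:
            final_md = sanitize(report_md)
        except Exception:
            final_md = report_md  # sanitization must not block report delivery
    report_dir.mkdir(parents=True, exist_ok=True)
    # Persist whichever version was chosen so the file on disk matches the return value.
    (report_dir / "full_report.md").write_text(final_md, encoding="utf-8")
    return final_md
```

Writing the file after the gate means the on-disk full_report.md and the returned text cannot drift apart.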

Testing

23/23 tests pass (unit and integration):

  • CJK run detection (basic, dedup, min-length, order)
  • Boundary-aware replacement (ASCII left/right, fullwidth parens, number boundaries)
  • Locale gating (zh skipped, en/id/etc enabled, env override)
  • Integration (empty, no-CJK, no-API-key, mocked LLM translate, multi-pass, idempotent)
  • Failure modes (LLM error falls back to original text)

Live test on a real report: the Purbaya/USD-IDR 15,000 simulation (14 kB of Markdown, 24 unique CJK runs) was reduced to 0 CJK runs in 3.4 seconds using the real DeepSeek API.

Risk

Low. The module is:

  • Additive: no existing behavior changes when not enabled
  • Bounded: skipped entirely for Chinese reports, so there is no impact on the primary zh audience
  • Opt-out: CJK_SANITIZE_ENABLED=0 disables globally
  • Fail-safe: any LLM failure returns the original text, so a failed sanitization pass cannot corrupt a report

Checklist

  • Tested locally (23 unit + 1 live integration test)
  • Documentation in README
  • Tests in backend/scripts/
  • No changes to existing zh-user behavior

When the LLM generates persona quotes (BI economists, ministry officials,
Reddit commenters, etc.) in non-Chinese locales, it can occasionally slip
Chinese characters into otherwise fluent English/Latin prose. The system
prompt language instruction reduces but doesn't eliminate this — the LLM
sometimes reaches back to its Chinese training data for fluent-sounding
speech, producing output like:
  "BI economist said: Purbaya过于倾向财政扩张..."

This adds a post-processing step that:
1. Detects runs of CJK Unified Ideographs (U+4E00..U+9FFF) and CJK
   Symbols/Punctuation (U+3000..U+303F) ≥ 2 chars in length
2. Batch-translates them via the configured LLM endpoint (reusing the
   same LLM_API_KEY/LLM_BASE_URL as the rest of MiroFish)
3. Replaces each run in-place, injecting spaces at ASCII boundaries so
   the result reads naturally in surrounding text
4. Iterates up to 3 passes to catch fragments the LLM leaves in pass 1

Behavior:
- Auto-enabled for non-Chinese locales (en, es, fr, pt, ru, de, id)
- Skipped for zh / zh-CN / zh-TW (legitimate CJK content)
- No-op when LLM_API_KEY is not configured (warns and returns original)
- Graceful fallback: any LLM failure returns original text unchanged
- Idempotent: re-running on already-sanitized text is a no-op

Configuration (all optional, set in .env):
  CJK_SANITIZE_ENABLED=0   # force off (default: auto for non-zh)
  CJK_SANITIZE_LANGS=ja,ko # override target locale set
  CJK_SANITIZE_MAX_PASSES=3 # default 3

Files added:
- backend/app/utils/cjk_sanitize.py     (~250 lines, the module)
- backend/scripts/test_cjk_sanitize.py  (23 unit + integration tests)

Files modified:
- backend/app/services/report_agent.py  (wire-in after assemble_full_report)
- README.md                             (document config env vars)

Tested: 23/23 unit tests pass; live Purbaya/USD-IDR report (24 unique CJK
runs in 14kB markdown) reduced to 0 in 3.4s with real DeepSeek API.