refactor: unified translation pipeline with OpenRouter, retry, and coverage dashboard #83

Open
xiaoyu2er wants to merge 246 commits into main from feat/unified-translate-pipeline

@xiaoyu2er (Owner)

Summary

  • Bug fix: heading hash collision (## Title and ### Title now produce distinct MD5 hashes; previously they were identical, causing wrong heading levels in cached translations)
  • Bug fix: cache.save() preserves src field — source location tracking no longer lost on save
  • Bug fix: validator node count guard — if the LLM merges/splits paragraphs, cache update is skipped to prevent silent corruption
  • Feature: OpenRouter integration — default API backend, configured via .env
  • Feature: retry with exponential backoff — 3 attempts (2s/4s/8s) for rate limits, timeouts, and provider outages
  • Feature: truncation detection — checks finish_reason=length to prevent caching half-translated content
  • Feature: unified CLI: --status for coverage report, --lang for translation, --dry-run, --concurrency, --max-tokens
  • Feature: coverage dashboard — per-language translation progress (cached/missing/coverage%)
  • Cleanup: deleted 10 legacy files (~1,800 lines) — removed dead main.ts/openai.ts path
  • Cleanup: removed unused deps — commander, cosmiconfig, gray-matter, micromatch, @anthropic-ai/sdk
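
The heading hash collision from the first bullet can be illustrated with a minimal sketch. The `HeadingNode` shape and both function names are hypothetical, since the parser's real types are not shown in this PR; the sketch only assumes that the old key was derived from the heading text alone.

```typescript
import { createHash } from "node:crypto";

// Hypothetical node shape; the real parser's types are not shown in this PR.
interface HeadingNode {
  depth: number; // 2 for "##", 3 for "###"
  text: string; // heading text without the leading "#" markers
}

// Hashing only the text: "## Title" and "### Title" share one cache key.
function hashTextOnly(node: HeadingNode): string {
  return createHash("md5").update(node.text).digest("hex");
}

// Hashing the markers together with the text gives each level its own key.
function hashWithLevel(node: HeadingNode): string {
  const source = `${"#".repeat(node.depth)} ${node.text}`;
  return createHash("md5").update(source).digest("hex");
}

const h2: HeadingNode = { depth: 2, text: "Title" };
const h3: HeadingNode = { depth: 3, text: "Title" };

console.log(hashTextOnly(h2) === hashTextOnly(h3)); // same key for both levels
console.log(hashWithLevel(h2) === hashWithLevel(h3)); // distinct keys
```

With a shared key, whichever level was cached first would be served for both headings, which matches the wrong-heading-level symptom described above.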

Test Coverage

  • 76 tests pass (8 test files, 0 failures)
  • Updated parser test: heading levels produce different hashes
  • Added cache test: src field round-trip through save/load
  • Added validator test: node count mismatch guard
  • Added translator test: stripThinkingBlock
  • Deleted 4 legacy test files (chunk, config, logger, usage)
  • Fixed integration test paths: apps/docs/content/en → content/en

Usage

# Check translation coverage
bun run packages/translate/src/batch.pipeline.ts --status --docs-root content-v15/en

# Translate v15 to zh-hans
bun run packages/translate/src/batch.pipeline.ts --docs-root content-v15/en --lang zh-hans --output-dir content-v15

# Dry run
bun run packages/translate/src/batch.pipeline.ts --docs-root content-v15/en --lang zh-hans --dry-run
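
What `--status` reports (cached/missing/coverage% per language) might be computed roughly as follows. This is a sketch under assumptions: `CoverageRow`, `coverageFor`, and the idea that coverage is counted per node hash are illustrative and not taken from the pipeline's code.

```typescript
// Hypothetical row shape for the coverage report.
interface CoverageRow {
  lang: string;
  cached: number;
  missing: number;
  coveragePct: number;
}

// Count how many source node hashes already have a cached translation.
function coverageFor(
  lang: string,
  sourceHashes: string[],
  cachedHashes: Set<string>,
): CoverageRow {
  const total = sourceHashes.length;
  const cached = sourceHashes.filter((h) => cachedHashes.has(h)).length;
  const missing = total - cached;
  // Round to one decimal place; an empty source counts as fully covered.
  const coveragePct = total === 0 ? 100 : Math.round((cached / total) * 1000) / 10;
  return { lang, cached, missing, coveragePct };
}

const row = coverageFor("zh-hans", ["a", "b", "c", "d"], new Set(["a", "b", "c"]));
console.log(`${row.lang}: ${row.cached} cached, ${row.missing} missing, ${row.coveragePct}%`);
```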

TODOS

  • Phase 2: translate v15 to remaining 7 languages (after zh-hans quality validated)
  • Update translation config docsContext reference

🤖 Generated with Claude Code

xiaoyu2er and others added 2 commits March 19, 2026 11:02
…verage dashboard

Bug fixes:
- Heading hash collision: different heading levels now produce distinct MD5 hashes
- cache.save() preserves src field for source location tracking
- Validator guards against node count mismatch (LLM merges/splits nodes)
- Default docs-root fixed to content/en
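
A minimal sketch of the node count guard, assuming nodes are blank-line-separated blocks (a simplification: real Markdown parsing is more involved, and `splitNodes` and `validateNodeCount` are hypothetical names):

```typescript
interface ValidationResult {
  ok: boolean;
  reason?: string;
}

// Naive node split on blank lines; assumed here purely for illustration.
function splitNodes(markdown: string): string[] {
  return markdown
    .split(/\n{2,}/)
    .map((block) => block.trim())
    .filter((block) => block.length > 0);
}

// Reject the translation when the LLM merged or split nodes, so the
// caller can skip the cache update instead of pairing the wrong nodes.
function validateNodeCount(source: string, translated: string): ValidationResult {
  const sourceCount = splitNodes(source).length;
  const translatedCount = splitNodes(translated).length;
  if (sourceCount !== translatedCount) {
    return {
      ok: false,
      reason: `node count mismatch: source=${sourceCount}, translated=${translatedCount}`,
    };
  }
  return { ok: true };
}

console.log(validateNodeCount("Intro\n\nDetails", "简介\n\n详情").ok);
console.log(validateNodeCount("Intro\n\nDetails", "简介和详情").reason);
```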

Features:
- OpenRouter integration as default API (configurable via .env)
- Retry with exponential backoff (3 attempts, 2s/4s/8s)
- Truncation detection via finish_reason check
- Unified CLI: --status, --lang, --dry-run, --concurrency, --max-tokens
- Coverage dashboard: per-language translation progress report
- .env / .env.example for API key management
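
The retry and truncation features could look roughly like this. Several details are assumptions: the set of retryable status codes, the `Completion` shape, and the reading of "3 attempts, 2s/4s/8s" as up to three retries after the first call. None of these names come from the pipeline itself.

```typescript
// Hypothetical completion shape, modeled on OpenAI-compatible responses.
interface Completion {
  text: string;
  finishReason: string;
}

// Status codes treated as transient. This list is an assumption; the PR
// only names rate limits, timeouts, and provider outages.
const RETRYABLE_STATUS = new Set<number>([408, 429, 500, 502, 503, 504]);

function isRetryable(err: unknown): boolean {
  if (typeof err !== "object" || err === null || !("status" in err)) return false;
  const status = (err as { status: unknown }).status;
  return typeof status === "number" && RETRYABLE_STATUS.has(status);
}

// Retry index 0, 1, 2 maps to 2000, 4000, 8000 ms.
function backoffDelayMs(retryIndex: number, baseMs = 2000): number {
  return baseMs * 2 ** retryIndex;
}

async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let retryIndex = 0; ; retryIndex++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on non-transient errors or when retries are exhausted.
      if (!isRetryable(err) || retryIndex >= maxRetries) throw err;
      await sleep(backoffDelayMs(retryIndex));
    }
  }
}

// Refuse to cache output that the provider cut off at the token limit.
function assertNotTruncated(completion: Completion): Completion {
  if (completion.finishReason === "length") {
    throw new Error("translation truncated (finish_reason=length); not caching");
  }
  return completion;
}
```

Passing `sleep` as a parameter keeps the helper testable without real delays.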

Cleanup:
- Deleted 10 legacy files (~1,800 lines): main.ts, openai.ts, utils.ts,
  config.ts, index.ts, chunk.ts, pipeline-demo.ts, pipeline.ts, usage.ts, logger.ts
- Deleted 4 legacy test files
- Removed unused dependencies: commander, cosmiconfig, gray-matter, micromatch,
  @anthropic-ai/sdk

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cloudflare-workers-and-pages Bot commented Mar 19, 2026

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Updated (UTC) |
| --- | --- | --- | --- |
| ❌ Deployment failed (View logs) | nextjs-docs-latest | 41d7475 | Mar 21 2026, 10:08 PM |

When the LLM merges/splits/drops nodes during translation, the validator
now uses cached translations as anchor points to align source and output
nodes instead of skipping all cache updates.
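
One way anchor-based alignment could work is sketched below. The commit message does not describe the actual algorithm, so everything here is an assumption: an anchor is an output node whose text equals a cached translation, and the nodes between two anchors are paired only when both sides have the same count. `alignByAnchors` and `Pair` are hypothetical names.

```typescript
interface Pair {
  sourceIndex: number;
  outputIndex: number;
}

// Returns pairings for non-anchor nodes that could be aligned safely.
function alignByAnchors(
  sourceHashes: string[],
  outputNodes: string[],
  cache: Map<string, string>,
): Pair[] {
  // 1. Find anchors: cached translations that reappear verbatim, in order.
  const anchors: Pair[] = [];
  let searchFrom = 0;
  sourceHashes.forEach((hash, sourceIndex) => {
    const cached = cache.get(hash);
    if (cached === undefined) return;
    const outputIndex = outputNodes.indexOf(cached, searchFrom);
    if (outputIndex === -1) return;
    anchors.push({ sourceIndex, outputIndex });
    searchFrom = outputIndex + 1;
  });

  // 2. Add sentinels so the regions before the first and after the last
  //    anchor are handled like any other gap.
  const bounds: Pair[] = [
    { sourceIndex: -1, outputIndex: -1 },
    ...anchors,
    { sourceIndex: sourceHashes.length, outputIndex: outputNodes.length },
  ];

  // 3. Pair nodes one to one inside gaps whose sizes match; leave
  //    mismatched gaps unaligned so no wrong pairing reaches the cache.
  const pairs: Pair[] = [];
  for (let k = 0; k + 1 < bounds.length; k++) {
    const left = bounds[k];
    const right = bounds[k + 1];
    const sourceGap = right.sourceIndex - left.sourceIndex - 1;
    const outputGap = right.outputIndex - left.outputIndex - 1;
    if (sourceGap !== outputGap) continue;
    for (let d = 1; d <= sourceGap; d++) {
      pairs.push({ sourceIndex: left.sourceIndex + d, outputIndex: left.outputIndex + d });
    }
  }
  return pairs;
}
```

Compared with skipping every cache update on a count mismatch, this limits the damage of a merge or split to the gap in which it happened.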

Also adds 405 to retryable errors (OpenRouter free models).
Includes 42 translated zh-hans files from prior runs.

Previously --max 10 took the first 10 files alphabetically, even if
all 10 were already cached. Now it scans all files, skips cached ones,
and limits only the number of files sent to the API.
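
The new `--max` behavior can be sketched as a filter that counts only uncached files toward the limit. `selectFilesToTranslate` and the `isFullyCached` callback are hypothetical names used for illustration.

```typescript
// Scan every file in alphabetical order, skip the ones already cached,
// and stop once `max` files have been queued for the API.
function selectFilesToTranslate(
  files: string[],
  isFullyCached: (file: string) => boolean,
  max: number,
): string[] {
  const pending: string[] = [];
  for (const file of [...files].sort()) {
    if (pending.length >= max) break;
    if (isFullyCached(file)) continue;
    pending.push(file);
  }
  return pending;
}

const cachedFiles = new Set(["a.mdx", "b.mdx"]);
const selected = selectFilesToTranslate(
  ["d.mdx", "a.mdx", "c.mdx", "b.mdx", "e.mdx"],
  (file) => cachedFiles.has(file),
  2,
);
console.log(selected); // the first two uncached files in alphabetical order
```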
Injects <!-- md5:hash --> before each translatable node in output files.
Skips frontmatter to avoid breaking YAML parsing.
Enables searching MD5 hashes directly in translated files.
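
A sketch of the hash comment injection, assuming the frontmatter is a leading `---` block and that each node already carries its cache hash. `OutputNode`, `splitFrontmatter`, and `renderWithHashComments` are hypothetical names.

```typescript
// Hypothetical output node: the hash is the node's cache key.
interface OutputNode {
  hash: string;
  text: string;
}

// Separate a leading YAML frontmatter block from the body, if present.
function splitFrontmatter(markdown: string): { frontmatter: string; body: string } {
  const match = /^---\r?\n[\s\S]*?\r?\n---\r?\n/.exec(markdown);
  if (!match) return { frontmatter: "", body: markdown };
  return { frontmatter: match[0], body: markdown.slice(match[0].length) };
}

// Emit the frontmatter untouched, then each node preceded by its comment.
function renderWithHashComments(frontmatter: string, nodes: OutputNode[]): string {
  const body = nodes
    .map((node) => `<!-- md5:${node.hash} -->\n${node.text}`)
    .join("\n\n");
  return frontmatter === "" ? `${body}\n` : `${frontmatter}\n${body}\n`;
}

const parts = splitFrontmatter("---\ntitle: Hi\n---\n# 标题\n\n正文\n");
const rendered = renderWithHashComments(parts.frontmatter, [
  { hash: "abc", text: "# 标题" },
  { hash: "def", text: "正文" },
]);
console.log(rendered);
```

Keeping the comment out of the frontmatter matters because an HTML comment inside the `---` block would not be valid YAML.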
… node alignment

Instead of writing the LLM's raw output (which may have extra blank lines
or split nodes), we now:
1. Translate → validate → update cache
2. Re-assemble from EN source + updated cache → write file

This guarantees output structure matches English source exactly.

When all nodes are cached after translation → use re-assembled output
(structure matches EN exactly).

When some nodes are uncached (anchor couldn't align them) → use LLM's
original output to avoid English text in translated file. Logs a warning.
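
The re-assembly step and its fallback might look like the following sketch. `SourceNode`, `assembleOutput`, and the blank-line join are assumptions made for illustration rather than the pipeline's actual code.

```typescript
// Hypothetical source node; non-translatable nodes (e.g. code) pass through.
interface SourceNode {
  hash: string;
  text: string;
  translatable: boolean;
}

function assembleOutput(
  sourceNodes: SourceNode[],
  cache: Map<string, string>,
  llmOutput: string,
  warn: (message: string) => void = console.warn,
): string {
  const uncached = sourceNodes.filter((n) => n.translatable && !cache.has(n.hash));
  if (uncached.length > 0) {
    // Re-assembling now would leave English text in the translated file,
    // so fall back to the LLM's raw output and report it.
    warn(`${uncached.length} node(s) uncached; using raw LLM output`);
    return llmOutput;
  }
  // Every node is cached: rebuild from the English structure so the
  // output has exactly the same node sequence as the source.
  const blocks = sourceNodes.map((n) =>
    n.translatable ? cache.get(n.hash) ?? n.text : n.text,
  );
  return `${blocks.join("\n\n")}\n`;
}
```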
Key additions:
- Preserve blank lines exactly (main cause of mismatch)
- Never remove blank line between paragraph and code block
- Never merge paragraphs
- Paragraph count must match input