Feat: Enhance translation quality features and refactor related components#13
Conversation
- Added new options for context overlap, scan budget, review, and speaker attribution. - Improved LLM calls for consistent translation quality with refined context handling. - Introduced UI elements for configuring translation quality settings. - Enabled more accurate speaker attribution to fix gender-specific translation errors. - Implemented a conservative review process for first-pass translations to ensure fidelity.
- Adds options for context overlap, scan budget, and review to boost translation accuracy. - Enhances consistency by refining speaker attribution and glossary handling. - Improves gender-specific translation with conservative review of first-pass outputs. - Supports better handling of context between batches, reducing translation errors.
- Introduces new flags for scan budget and context overlap. - Allows disabling review and speaker attribution for performance. - Enhances translation quality with default settings. - Provides flexibility for metered cloud and tight-context local models.
- Extracted translation defaults and tuning constants to a separate file for better organization. - Enhanced attribution user message construction for improved scene handling. - Enabled a flexible translation prompt system, facilitating iterative improvements. - Optimized the translation process by restructuring imports and refining functions.
- Centralizes LLM-facing prompts for better management and iteration. - Improves translation and review user messages with builder functions. - Consolidates constants for cleaner code and shared access. - Optimizes retry logic by aligning attempt constants. - Enhances context handling, improving translation accuracy and quality.
- Simplifies navbar by removing sticky positioning.
There was a problem hiding this comment.
Pull request overview
This PR enhances translation quality and configurability by adding richer prepass-derived context (scenes, participants, speaker attribution), an optional post-edit review pass, and centralized defaults/constants, with corresponding updates to both the CLI and web UI.
Changes:
- Added configurable scan budget, context overlap, review pass, and speaker-attribution refinement across CLI + web.
- Extended the context prepass to parse scene guidance and optionally refine per-block speaker attribution.
- Refactored prompt building / constants into dedicated modules and expanded test coverage for context parsing and review behavior.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| web/src/app/core/translation.service.ts | Wires new quality options into the web translation pipeline (scan budget, overlap context, review pass, attribution refinement). |
| web/src/app/core/translation-prompt.ts | Centralizes web prompt strings and user-message builders (scan/review/attribution). |
| web/src/app/core/context-pass.ts | Extends web prepass parsing to include scenes + attribution helpers and improved scan serialization. |
| web/src/app/core/constants.ts | Introduces shared web defaults/tuning constants used across modules. |
| web/src/app/app.component.ts | Exposes new advanced quality settings in web state and passes them into translation calls. |
| web/src/app/app.component.scss | Styles new advanced settings toggles/fields. |
| web/src/app/app.component.html | Adds advanced UI controls for scan budget, context overlap, review pass, and attribution refinement. |
| cli/translora.py | Adds CLI flags for scan budget/context overlap/review/attribution refinement and forwards them into config. |
| cli/tests/test_review_pass.py | Adds tests covering review pass behavior (mismatch, API error, no-glossary skip). |
| cli/tests/test_context_pass.py | Expands tests for scene parsing/enrichment/serialization and attribution triggering. |
| cli/core/translator.py | Wires prepass attribution refinement and previous-batch tail context into batch translation flow. |
| cli/core/prompt.py | Centralizes CLI prompt strings and user-message builders (translate/review/scan/attribution). |
| cli/core/context_pass.py | Extends CLI prepass to parse scenes, enrich participants from block text, and refine speaker attribution. |
| cli/core/constants.py | Centralizes CLI defaults/tuning constants. |
| cli/core/config.py | Moves config defaults to use centralized constants; adds new config toggles (overlap/review/attribution). |
| cli/core/batch_runner.py | Adds review pass integration and previous-context support; refactors to use shared constants/prompts. |
| README.md | Documents the new CLI flags and tuning guidance. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| haystack, needle = text.casefold(), word.casefold() | ||
| nlen = len(needle) | ||
| i = 0 | ||
| while i <= len(haystack) - nlen: | ||
| j = haystack.find(needle, i) | ||
| if j < 0: | ||
| return -1 | ||
| before = text[j - 1] if j > 0 else "" | ||
| after = text[j + nlen] if j + nlen < len(text) else "" | ||
| # isalnum is Unicode-aware. | ||
| if not (before.isalnum() or before == "_") and not (after.isalnum() or after == "_"): | ||
| return j | ||
| i = j + 1 | ||
| return -1 |
There was a problem hiding this comment.
_find_word() computes j on text.casefold()/word.casefold() but then uses j and nlen to index into the original text for boundary checks. Casefolding can change string length for some Unicode characters (e.g., ß→ss, İ→i̇), so j/nlen may not line up with indices in text, leading to incorrect boundary validation. Consider doing the search on the original string via re with re.IGNORECASE (and explicit boundary classes) so match offsets are in the same coordinate space as the boundary checks.
| haystack, needle = text.casefold(), word.casefold() | |
| nlen = len(needle) | |
| i = 0 | |
| while i <= len(haystack) - nlen: | |
| j = haystack.find(needle, i) | |
| if j < 0: | |
| return -1 | |
| before = text[j - 1] if j > 0 else "" | |
| after = text[j + nlen] if j + nlen < len(text) else "" | |
| # isalnum is Unicode-aware. | |
| if not (before.isalnum() or before == "_") and not (after.isalnum() or after == "_"): | |
| return j | |
| i = j + 1 | |
| return -1 | |
| # Search on the original text so the returned offsets stay in the same | |
| # coordinate space as boundary validation. This avoids mismatches caused | |
| # by Unicode case folding changing string length (for example, ß -> ss). | |
| pattern = rf"(?<!\w){re.escape(word)}(?!\w)" | |
| match = re.search(pattern, text, re.IGNORECASE) | |
| return match.start() if match else -1 |
| client, batch_idx, left, cfg, file_context, left_path, | ||
| prev_tail=prev_tail, | ||
| ) | ||
| right_prev = left[-cfg.context_overlap:] if cfg.context_overlap else [] |
There was a problem hiding this comment.
right_prev = left[-cfg.context_overlap:] if cfg.context_overlap else [] treats negative context_overlap as truthy and will slice from the start of left (because -(-n)), leaking almost the entire previous batch as “prev context”. Consider clamping context_overlap to >= 0 (at config/argparse level) and using an explicit > 0 check when slicing prev-tail context.
| right_prev = left[-cfg.context_overlap:] if cfg.context_overlap else [] | |
| overlap = max(cfg.context_overlap, 0) | |
| right_prev = left[-overlap:] if overlap > 0 else [] |
| p.add_argument("--scan-budget", type=int, default=24_000, metavar="CHARS", | ||
| help="Character budget for the prepass scan (default: 24000). " | ||
| "Tuned for best-quality scans on typical TV episodes; " | ||
| "lower on tight-context local models (~8k window), " | ||
| "raise on large-context cloud models for full-file scans.") | ||
| p.add_argument("--context-overlap", type=int, default=2, metavar="N", | ||
| help="Source blocks from the previous batch shown as read-only " | ||
| "context (default: 2). Helps maintain speaker continuity " | ||
| "across batch boundaries. Set to 0 to disable.") |
There was a problem hiding this comment.
The new defaults are centralized in core/constants.py, but these flags still hard-code numeric defaults (24_000 and 2). Import and use the shared constants (e.g., DEFAULT_SCAN_CHAR_BUDGET, DEFAULT_CONTEXT_OVERLAP) so CLI help text and runtime behavior can’t drift from the canonical defaults.
| function findWord(text: string, word: string): number { | ||
| if (!text || !word) return -1; | ||
| const haystack = text.toLowerCase(); | ||
| const needle = word.toLowerCase(); | ||
| const nlen = needle.length; | ||
| let i = 0; | ||
| while (i <= haystack.length - nlen) { | ||
| const j = haystack.indexOf(needle, i); | ||
| if (j < 0) return -1; | ||
| const before = j > 0 ? text[j - 1] : ''; | ||
| const after = j + nlen < text.length ? text[j + nlen] : ''; | ||
| if (!extendsWord(before) && !extendsWord(after)) return j; | ||
| i = j + 1; | ||
| } | ||
| return -1; |
There was a problem hiding this comment.
findWord() lowercases text and uses the match index from the lowercased string (haystack.indexOf(...)) to index into the original text for boundary checks. For characters whose case mapping changes string length (e.g., Turkish dotted İ), this can produce incorrect indices and wrong boundary decisions. Consider using a Unicode-aware regex with i+u flags (and explicit letter/number/underscore boundaries) so indices are computed against the original string.
| function findWord(text: string, word: string): number { | |
| if (!text || !word) return -1; | |
| const haystack = text.toLowerCase(); | |
| const needle = word.toLowerCase(); | |
| const nlen = needle.length; | |
| let i = 0; | |
| while (i <= haystack.length - nlen) { | |
| const j = haystack.indexOf(needle, i); | |
| if (j < 0) return -1; | |
| const before = j > 0 ? text[j - 1] : ''; | |
| const after = j + nlen < text.length ? text[j + nlen] : ''; | |
| if (!extendsWord(before) && !extendsWord(after)) return j; | |
| i = j + 1; | |
| } | |
| return -1; | |
| function escapeRegExp(text: string): string { | |
| return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); | |
| } | |
| function findWord(text: string, word: string): number { | |
| if (!text || !word) return -1; | |
| const pattern = new RegExp(`(^|[^\\p{L}\\p{N}_])(${escapeRegExp(word)})(?=$|[^\\p{L}\\p{N}_])`, 'iu'); | |
| const match = pattern.exec(text); | |
| if (!match) return -1; | |
| return match.index + match[1].length; |
This pull request introduces several important improvements to the translation pipeline, focusing on increased translation quality, configurability, and better context handling. The main changes include the addition of new configurable options for context overlap and scan budget, a new scene-aware glossary and speaker attribution system, and improved handling of translation review passes to enhance consistency and reduce gender/number errors. Additionally, configuration defaults are centralized, and the codebase is refactored for clarity and maintainability.
New Features and Configurability:
--scan-budgetand--context-overlapCLI flags, along with--no-reviewand--no-refine-attribution, allowing users to tune translation quality, context window size, and LLM call cost. These options are documented inREADME.mdand wired intoTranslationConfig. [1] [2]cli/core/constants.pyfor easier maintenance and consistent usage across modules.Scene and Context Handling Improvements:
Translation and Review Pipeline Enhancements:
Refactoring and Cleanup:
constants.pyand cleaned up imports and legacy variables for clarity. [1] [2]These changes collectively provide more robust context handling, improved translation quality, and greater flexibility for users running on different LLM providers and models.