Skip to content

Feat: Enhance translation quality features and refactor related components#13

Merged
SloMR merged 6 commits into
mainfrom
feat/EnhanceTranlationQuality
Apr 25, 2026
Merged

Feat: Enhance translation quality features and refactor related components#13
SloMR merged 6 commits into
mainfrom
feat/EnhanceTranlationQuality

Conversation

@SloMR
Copy link
Copy Markdown
Owner

@SloMR SloMR commented Apr 25, 2026

This pull request introduces several important improvements to the translation pipeline, focusing on increased translation quality, configurability, and better context handling. The main changes include the addition of new configurable options for context overlap and scan budget, a new scene-aware glossary and speaker attribution system, and improved handling of translation review passes to enhance consistency and reduce gender/number errors. Additionally, configuration defaults are centralized, and the codebase is refactored for clarity and maintainability.

New Features and Configurability:

  • Added --scan-budget and --context-overlap CLI flags, along with --no-review and --no-refine-attribution, allowing users to tune translation quality, context window size, and LLM call cost. These options are documented in README.md and wired into TranslationConfig. [1] [2]
  • Centralized all default values and tuning constants in the new cli/core/constants.py for easier maintenance and consistent usage across modules.

Scene and Context Handling Improvements:

  • Extended the glossary system to include scene-level guidance and per-block speaker attribution, with new parsing logic for scenes and improved batch context rendering. This enables better handling of speaker gender and continuity across batch boundaries. [1] [2] [3]
  • Improved the scan pass to stride-sample large files and annotate blocks with numbers, ensuring late-appearing characters are included in the glossary and context.

Translation and Review Pipeline Enhancements:

  • Refactored the translation and review pipeline to use the new context and scene information, including an improved review pass that checks translation consistency against the glossary and optionally disables review or attribution for cost savings. [1] [2]
  • Added support for passing previous-batch tail blocks as context to improve speaker continuity across batch splits. [1] [2]

Refactoring and Cleanup:

  • Moved HTTP and retry constants to constants.py and cleaned up imports and legacy variables for clarity. [1] [2]

These changes collectively provide more robust context handling, improved translation quality, and greater flexibility for users running on different LLM providers and models.

SloMR added 6 commits April 25, 2026 23:12
- Added new options for context overlap, scan budget, review, and speaker attribution.
- Improved LLM calls for consistent translation quality with refined context handling.
- Introduced UI elements for configuring translation quality settings.
- Enabled more accurate speaker attribution to fix gender-specific translation errors.
- Implemented a conservative review process for first-pass translations to ensure fidelity.
- Adds options for context overlap, scan budget, and review to boost translation accuracy.
- Enhances consistency by refining speaker attribution and glossary handling.
- Improves gender-specific translation with conservative review of first-pass outputs.
- Supports better handling of context between batches, reducing translation errors.
- Introduces new flags for scan budget and context overlap.
- Allows disabling review and speaker attribution for performance.
- Enhances translation quality with default settings.
- Provides flexibility for metered cloud and tight-context local models.
- Extracted translation defaults and tuning constants to a separate file for better organization.
- Enhanced attribution user message construction for improved scene handling.
- Enabled a flexible translation prompt system, facilitating iterative improvements.
- Optimized the translation process by restructuring imports and refining functions.
- Centralizes LLM-facing prompts for better management and iteration.
- Improves translation and review user messages with builder functions.
- Consolidates constants for cleaner code and shared access.
- Optimizes retry logic by aligning attempt constants.
- Enhances context handling, improving translation accuracy and quality.
- Simplifies navbar by removing sticky positioning.
@SloMR SloMR self-assigned this Apr 25, 2026
Copilot AI review requested due to automatic review settings April 25, 2026 20:42
@SloMR SloMR merged commit fd35e0d into main Apr 25, 2026
5 checks passed
@SloMR SloMR deleted the feat/EnhanceTranlationQuality branch April 25, 2026 20:43
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances translation quality and configurability by adding richer prepass-derived context (scenes, participants, speaker attribution), an optional post-edit review pass, and centralized defaults/constants, with corresponding updates to both the CLI and web UI.

Changes:

  • Added configurable scan budget, context overlap, review pass, and speaker-attribution refinement across CLI + web.
  • Extended the context prepass to parse scene guidance and optionally refine per-block speaker attribution.
  • Refactored prompt building / constants into dedicated modules and expanded test coverage for context parsing and review behavior.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
web/src/app/core/translation.service.ts Wires new quality options into the web translation pipeline (scan budget, overlap context, review pass, attribution refinement).
web/src/app/core/translation-prompt.ts Centralizes web prompt strings and user-message builders (scan/review/attribution).
web/src/app/core/context-pass.ts Extends web prepass parsing to include scenes + attribution helpers and improved scan serialization.
web/src/app/core/constants.ts Introduces shared web defaults/tuning constants used across modules.
web/src/app/app.component.ts Exposes new advanced quality settings in web state and passes them into translation calls.
web/src/app/app.component.scss Styles new advanced settings toggles/fields.
web/src/app/app.component.html Adds advanced UI controls for scan budget, context overlap, review pass, and attribution refinement.
cli/translora.py Adds CLI flags for scan budget/context overlap/review/attribution refinement and forwards them into config.
cli/tests/test_review_pass.py Adds tests covering review pass behavior (mismatch, API error, no-glossary skip).
cli/tests/test_context_pass.py Expands tests for scene parsing/enrichment/serialization and attribution triggering.
cli/core/translator.py Wires prepass attribution refinement and previous-batch tail context into batch translation flow.
cli/core/prompt.py Centralizes CLI prompt strings and user-message builders (translate/review/scan/attribution).
cli/core/context_pass.py Extends CLI prepass to parse scenes, enrich participants from block text, and refine speaker attribution.
cli/core/constants.py Centralizes CLI defaults/tuning constants.
cli/core/config.py Moves config defaults to use centralized constants; adds new config toggles (overlap/review/attribution).
cli/core/batch_runner.py Adds review pass integration and previous-context support; refactors to use shared constants/prompts.
README.md Documents the new CLI flags and tuning guidance.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cli/core/context_pass.py
Comment on lines +139 to +152
haystack, needle = text.casefold(), word.casefold()
nlen = len(needle)
i = 0
while i <= len(haystack) - nlen:
j = haystack.find(needle, i)
if j < 0:
return -1
before = text[j - 1] if j > 0 else ""
after = text[j + nlen] if j + nlen < len(text) else ""
# isalnum is Unicode-aware.
if not (before.isalnum() or before == "_") and not (after.isalnum() or after == "_"):
return j
i = j + 1
return -1
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_find_word() computes j on text.casefold()/word.casefold() but then uses j and nlen to index into the original text for boundary checks. Casefolding can change string length for some Unicode characters (e.g., ß→ss, İ→i̇), so j/nlen may not line up with indices in text, leading to incorrect boundary validation. Consider doing the search on the original string via re with re.IGNORECASE (and explicit boundary classes) so match offsets are in the same coordinate space as the boundary checks.

Suggested change
haystack, needle = text.casefold(), word.casefold()
nlen = len(needle)
i = 0
while i <= len(haystack) - nlen:
j = haystack.find(needle, i)
if j < 0:
return -1
before = text[j - 1] if j > 0 else ""
after = text[j + nlen] if j + nlen < len(text) else ""
# isalnum is Unicode-aware.
if not (before.isalnum() or before == "_") and not (after.isalnum() or after == "_"):
return j
i = j + 1
return -1
# Search on the original text so the returned offsets stay in the same
# coordinate space as boundary validation. This avoids mismatches caused
# by Unicode case folding changing string length (for example, ß -> ss).
pattern = rf"(?<!\w){re.escape(word)}(?!\w)"
match = re.search(pattern, text, re.IGNORECASE)
return match.start() if match else -1

Copilot uses AI. Check for mistakes.
Comment thread cli/core/batch_runner.py
client, batch_idx, left, cfg, file_context, left_path,
prev_tail=prev_tail,
)
right_prev = left[-cfg.context_overlap:] if cfg.context_overlap else []
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right_prev = left[-cfg.context_overlap:] if cfg.context_overlap else [] treats negative context_overlap as truthy and will slice from the start of left (because -(-n)), leaking almost the entire previous batch as “prev context”. Consider clamping context_overlap to >= 0 (at config/argparse level) and using an explicit > 0 check when slicing prev-tail context.

Suggested change
right_prev = left[-cfg.context_overlap:] if cfg.context_overlap else []
overlap = max(cfg.context_overlap, 0)
right_prev = left[-overlap:] if overlap > 0 else []

Copilot uses AI. Check for mistakes.
Comment thread cli/translora.py
Comment on lines +80 to +88
p.add_argument("--scan-budget", type=int, default=24_000, metavar="CHARS",
help="Character budget for the prepass scan (default: 24000). "
"Tuned for best-quality scans on typical TV episodes; "
"lower on tight-context local models (~8k window), "
"raise on large-context cloud models for full-file scans.")
p.add_argument("--context-overlap", type=int, default=2, metavar="N",
help="Source blocks from the previous batch shown as read-only "
"context (default: 2). Helps maintain speaker continuity "
"across batch boundaries. Set to 0 to disable.")
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new defaults are centralized in core/constants.py, but these flags still hard-code numeric defaults (24_000 and 2). Import and use the shared constants (e.g., DEFAULT_SCAN_CHAR_BUDGET, DEFAULT_CONTEXT_OVERLAP) so CLI help text and runtime behavior can’t drift from the canonical defaults.

Copilot uses AI. Check for mistakes.
Comment on lines +114 to +128
function findWord(text: string, word: string): number {
if (!text || !word) return -1;
const haystack = text.toLowerCase();
const needle = word.toLowerCase();
const nlen = needle.length;
let i = 0;
while (i <= haystack.length - nlen) {
const j = haystack.indexOf(needle, i);
if (j < 0) return -1;
const before = j > 0 ? text[j - 1] : '';
const after = j + nlen < text.length ? text[j + nlen] : '';
if (!extendsWord(before) && !extendsWord(after)) return j;
i = j + 1;
}
return -1;
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

findWord() lowercases text and uses the match index from the lowercased string (haystack.indexOf(...)) to index into the original text for boundary checks. For characters whose case mapping changes string length (e.g., Turkish dotted İ), this can produce incorrect indices and wrong boundary decisions. Consider using a Unicode-aware regex with i+u flags (and explicit letter/number/underscore boundaries) so indices are computed against the original string.

Suggested change
function findWord(text: string, word: string): number {
if (!text || !word) return -1;
const haystack = text.toLowerCase();
const needle = word.toLowerCase();
const nlen = needle.length;
let i = 0;
while (i <= haystack.length - nlen) {
const j = haystack.indexOf(needle, i);
if (j < 0) return -1;
const before = j > 0 ? text[j - 1] : '';
const after = j + nlen < text.length ? text[j + nlen] : '';
if (!extendsWord(before) && !extendsWord(after)) return j;
i = j + 1;
}
return -1;
function escapeRegExp(text: string): string {
return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}
function findWord(text: string, word: string): number {
if (!text || !word) return -1;
const pattern = new RegExp(`(^|[^\\p{L}\\p{N}_])(${escapeRegExp(word)})(?=$|[^\\p{L}\\p{N}_])`, 'iu');
const match = pattern.exec(text);
if (!match) return -1;
return match.index + match[1].length;

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants