feat: add Rust postprocess kernel #37
Pull request overview
This PR introduces a deterministic, side-effect-free transcript post-processing “oracle” in Python and a matching Rust kernel (exposed via the existing providers.kernel_bridge) so RUST_KERNEL_MODE=required runs can hard-require native post-processing while keeping the default Python execution path.
Changes:
- Added pure Python post-processing helpers for segment assembly, display-name disambiguation, conservative text-only merging, and JSON-safe word normalization.
- Added a Rust post-processing implementation plus PyO3 bindings and Python bridge validation/entrypoint.
- Extended unit tests, docs/changelog, environment/config documentation, and Docker heavy-gate smoke to cover the new Rust-backed path.
Reviewed changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/unit/test_postprocess_segments_kernel.py | New golden tests for the Python postprocess oracle behaviors (normalization, merging, display names, segment assembly). |
| tests/unit/test_pipeline_runner.py | Adds coverage ensuring artifacts provider selects Rust postprocess when RUST_KERNEL_MODE=required. |
| tests/unit/test_pipeline_alignment.py | Formatting-only updates to existing assertions. |
| tests/unit/test_kernel_bridge.py | Updates expected core version and adds bridge tests for postprocess_segments validation. |
| doc/configuration.zh.md | Documents that RUST_KERNEL_MODE now also selects result post-processing. |
| doc/configuration.en.md | Same as above (English). |
| doc/changelog.zh.md | Changelog entry for Rust-backed result post-processing + related coverage updates. |
| doc/changelog.en.md | Same as above (English). |
| crates/voscript_core/src/postprocess.rs | New Rust implementation for merging/normalizing aligned segments and building result segments. |
| crates/voscript_core/src/lib.rs | Exposes Rust postprocess via PyO3 (postprocess_segments) and parsing helpers. |
| crates/voscript_core/Cargo.toml | Bumps voscript_core version to 0.8.2. |
| Cargo.lock | Locks the updated voscript_core version. |
| app/providers/kernel_bridge/runtime.py | Adds Rust postprocess response validation and Python bridge entrypoint postprocess_segments. |
| app/providers/kernel_bridge/__init__.py | Re-exports postprocess_segments. |
| app/providers/artifacts/default.py | Switches segment assembly to Python oracle by default and Rust kernel when required. |
| app/postprocess/segments.py | New Python oracle module for post-processing logic used by both pipeline and Rust equivalence tests. |
| app/postprocess/__init__.py | Exposes the postprocess helpers as a package API. |
| app/pipeline/stages/diarization/alignment.py | Reuses the shared normalize_words helper from the new postprocess module. |
| .github/workflows/rust-foundation-heavy.yml | Extends Docker heavy-gate smoke to exercise postprocess_segments under required mode. |
| .env.example | Notes that required Rust mode now covers result post-processing in addition to voiceprint scoring. |
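
The backend selection described in the table (Python oracle by default, Rust kernel when `RUST_KERNEL_MODE=required`) could be sketched roughly as follows. This is a minimal illustration, not the code in `app/providers/artifacts/default.py`: the function name `select_postprocess`, its parameters, and the error raised are all hypothetical, and any mode other than `required` is simply assumed to fall back to Python.

```python
import os
from typing import Callable, Optional

Segments = list[dict]
Postprocess = Callable[[Segments], Segments]


def select_postprocess(
    python_impl: Postprocess,
    rust_impl: Optional[Postprocess],
    mode: Optional[str] = None,
) -> Postprocess:
    """Pick a post-processing backend (hypothetical helper, for illustration).

    `mode` defaults to the RUST_KERNEL_MODE environment variable.
    """
    if mode is None:
        mode = os.environ.get("RUST_KERNEL_MODE", "")
    if mode.strip().lower() == "required":
        # Required mode must not silently fall back to Python.
        if rust_impl is None:
            raise RuntimeError(
                "RUST_KERNEL_MODE=required but the Rust kernel is unavailable"
            )
        return rust_impl
    # Assumption: every other mode keeps the default Python path.
    return python_impl
```

Failing loudly when the native kernel is missing is what makes a "required" mode meaningful: a silent fallback would hide a broken build.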
    let word = match dict.get_item("word")? {
        Some(value) if !value.is_none() => value.str()?.to_string(),
        _ => String::new(),
    };
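
For comparison, the Rust snippet above treats a missing or `None` `word` entry as an empty string and stringifies anything else. A Python-side equivalent might look like the sketch below; the helper name `word_text` is hypothetical and is not taken from the PR's `normalize_words` implementation.

```python
def word_text(entry: dict) -> str:
    """Return the entry's word as text; missing or None becomes ''.

    Hypothetical helper mirroring the Rust match arms shown above.
    """
    value = entry.get("word")
    # str(value) corresponds to `value.str()?.to_string()` on the Rust side.
    return "" if value is None else str(value)
```

Keeping both sides on the same rule matters for equivalence tests: a non-string word such as an integer is stringified rather than rejected.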
    @@ -24,60 +26,22 @@ def _build_display_names(
         speaker_labels: list[str],
         speaker_map: dict[str, dict],
     ) -> dict[str, str]:
    -    labels_by_name: dict[str, list[str]] = {}
    -
    -    for speaker_label in speaker_labels:
    -        match = speaker_map.get(speaker_label, {})
    -        speaker_name = str(match.get("matched_name") or speaker_label)
    -        labels_by_name.setdefault(speaker_name, []).append(speaker_label)
    -
    -    display_names: dict[str, str] = {}
    -    for speaker_name, labels in labels_by_name.items():
    -        for index, speaker_label in enumerate(labels, start=1):
    -            display_names[speaker_label] = (
    -                speaker_name if index == 1 else f"{speaker_name} ({index})"
    -            )
    -    return display_names
    +    return build_display_names(speaker_labels, speaker_map)
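
To make the disambiguation rule in this hunk concrete, here is a self-contained sketch based on the inline logic shown in the diff. It assumes the shared `build_display_names` helper behaves the same way as that inline code, which the diff itself does not confirm; the speaker labels and names in the usage lines are made up.

```python
def build_display_names(
    speaker_labels: list[str],
    speaker_map: dict[str, dict],
) -> dict[str, str]:
    """Map each cluster label to a unique display name (illustrative sketch)."""
    # Group labels by resolved name, preserving first-seen order.
    labels_by_name: dict[str, list[str]] = {}
    for speaker_label in speaker_labels:
        match = speaker_map.get(speaker_label, {})
        speaker_name = str(match.get("matched_name") or speaker_label)
        labels_by_name.setdefault(speaker_name, []).append(speaker_label)

    # The first label keeps the plain name; later ones get a numeric suffix.
    display_names: dict[str, str] = {}
    for speaker_name, labels in labels_by_name.items():
        for index, speaker_label in enumerate(labels, start=1):
            display_names[speaker_label] = (
                speaker_name if index == 1 else f"{speaker_name} ({index})"
            )
    return display_names


names = build_display_names(
    ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
    {
        "SPEAKER_00": {"matched_name": "Alice"},
        "SPEAKER_01": {"matched_name": "Alice"},
    },
)
print(names)
```

Two clusters that match the same enrolled name stay separate entries, keyed by their cluster label, and an unmatched cluster falls back to its own label.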
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #37 +/- ##
==========================================
- Coverage 91.59% 91.22% -0.38%
==========================================
Files 79 81 +2
Lines 3333 3464 +131
==========================================
+ Hits 3053 3160 +107
- Misses 280 304 +24
Summary
RUST_KERNEL_MODE=required, while keeping the default Python path.

Validation
- `PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 python -m pytest tests/unit/ tests/test_security.py tests/test_voiceprint_db.py tests/test_job_service.py -q --tb=short`
- `cargo fmt --manifest-path crates/voscript_core/Cargo.toml -- --check`
- `cargo test --manifest-path crates/voscript_core/Cargo.toml`
- `cargo clippy --manifest-path crates/voscript_core/Cargo.toml --features python-bindings --all-targets -- -D warnings`
- `ruff format --check app/ tests/unit/test_kernel_bridge.py tests/unit/test_pipeline_alignment.py tests/unit/test_pipeline_runner.py tests/unit/test_postprocess_segments_kernel.py`
- `ruff check app/ tests/unit/test_kernel_bridge.py tests/unit/test_pipeline_alignment.py tests/unit/test_pipeline_runner.py tests/unit/test_postprocess_segments_kernel.py --ignore E501`
- `python voscript-api/scripts/public_release_scan.py --root <repo>`
- `git diff --check`

Notes
- `speaker_label` remains the stable cluster key; duplicate display names are disambiguated, not merged.
- `segments[].words` remains optional.