feat: optimize OSW export and merge workflows#213
Open
singjc wants to merge 3 commits into
Open
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR aims to improve performance and robustness of PyProphet OSW export/merge workflows for large datasets, including new CLI options to control behavior (fresh merges, excluding variance columns) and better handling of DuckDB extension availability in containerized environments.
Changes:
- Added
merge osw --freshand extended merge implementation with batching and resume/progress tracking. - Added
export parquet --exclude_feature_varplus related config plumbing, and introduced additional export performance work (SQLite indices, streamingUNION ALLexport). - Improved DuckDB
sqlite_scannerextension handling messaging for offline/container environments.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
pyprophet/util.py |
Adds --fresh, batching, and resume/progress tracking to OSW merge workflows; adjusts pragmas/logging; removes VACUUM for post-scored merges. |
pyprophet/io/util.py |
Updates DuckDB sqlite_scanner loading logic and messaging for offline/container cases. |
pyprophet/io/export/osw.py |
Adds exclude_feature_var hook, creates SQLite indices to speed joins, streams single-parquet export via UNION ALL, changes peptide mapping to chunked processing, and refines alignment export query. |
pyprophet/cli/merge.py |
Adds --fresh flag to merge osw CLI and forwards it into merge implementation. |
pyprophet/cli/export.py |
Adds --exclude_feature_var option to parquet export CLI and passes through to config. |
pyprophet/_config.py |
Adds exclude_feature_var to ExportIOConfig. |
.dockerignore |
Adjusts ignore rules to include specific Rust tool data files. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+394
to
+415
| # Check if it's a network/download error (e.g., in containers) | ||
| if ("Failed to download extension" in error_msg or | ||
| "Connection timed out" in error_msg or | ||
| "Network unreachable" in error_msg): | ||
| from loguru import logger | ||
| logger.warning( | ||
| f"Cannot download sqlite_scanner extension (likely in container without internet). " | ||
| f"Attempting to load from local cache or using fallback method.\n" | ||
| f"To fix: Set DUCKDB_EXTENSION_DIRECTORY environment variable to a directory with pre-downloaded extensions.\n" | ||
| f"Details: {error_msg}" | ||
| ) | ||
| # Try to load from local cache | ||
| try: | ||
| conn.execute("LOAD sqlite_scanner") | ||
| except Exception: | ||
| # If it still fails, the export will fall back to direct sqlite3 if available | ||
| logger.error( | ||
| "Could not load sqlite_scanner. Export may use slower fallback method. " | ||
| "For better performance, pre-download extensions: " | ||
| "python3 -c 'import duckdb; duckdb.connect(\":memory:\").execute(\"LOAD sqlite_scanner\")'" | ||
| ) | ||
|
|
Comment on lines
+1524
to
+1537
| try: | ||
| load_sqlite_scanner(conn) | ||
| except Exception as scanner_error: | ||
| # If sqlite_scanner fails to load (e.g., in containers without internet), | ||
| # provide helpful guidance but continue with fallback | ||
| if "Failed to download extension" in str(scanner_error) or "Connection timed out" in str(scanner_error): | ||
| click.echo( | ||
| "Warning: sqlite_scanner extension could not be loaded (likely in container without internet access).\n" | ||
| "To fix: Set DUCKDB_EXTENSION_DIRECTORY environment variable to a directory with pre-downloaded extensions.\n" | ||
| "Or pre-download extensions on your host with: " | ||
| "python3 -c 'import duckdb; duckdb.connect(\":memory:\").execute(\"LOAD sqlite_scanner\")'\n" | ||
| "Continuing with alternative method...", | ||
| err=True | ||
| ) |
Comment on lines
+1557
to
+1570
| try: | ||
| load_sqlite_scanner(conn) | ||
| except Exception as scanner_error: | ||
| # If sqlite_scanner fails to load (e.g., in containers without internet), | ||
| # provide helpful guidance but continue with fallback | ||
| if "Failed to download extension" in str(scanner_error) or "Connection timed out" in str(scanner_error): | ||
| click.echo( | ||
| "Warning: sqlite_scanner extension could not be loaded (likely in container without internet access).\n" | ||
| "To fix: Set DUCKDB_EXTENSION_DIRECTORY environment variable to a directory with pre-downloaded extensions.\n" | ||
| "Or pre-download extensions on your host with: " | ||
| "python3 -c 'import duckdb; duckdb.connect(\":memory:\").execute(\"LOAD sqlite_scanner\")'\n" | ||
| "Continuing with alternative method...", | ||
| err=True | ||
| ) |
Comment on lines
+1901
to
+1907
| # Merge on codename | ||
| merged_chunk = pd.merge( | ||
| unimod_chunk, | ||
| codename_chunk, | ||
| on="codename", | ||
| how="outer", | ||
| ) |
Comment on lines
+425
to
+427
| # Only create empty tables if not resuming | ||
| # OR if resuming but MERGE_PROGRESS was just created (old partial merge) - need to clear feature tables | ||
| need_to_recreate_feature_tables = is_resume and all(v == 0 for v in progress.values()) |
Comment on lines
+1169
to
+1170
| ## Skip VACUUM for now (it's slow) - SQLite will auto-optimize on next use | ||
| click.echo("\nInfo: All Post-Scored OSWS files were merged successfully.") |
Comment on lines
58
to
+62
| "--merged_post_scored_runs", | ||
| is_flag=True, | ||
| help="Merge OSW output files that have already been scored.", | ||
| ) | ||
| @click.option( |
Comment on lines
+267
to
+270
| # Skip if exclude_feature_var is enabled | ||
| if self.config.exclude_feature_var: | ||
| return "" | ||
|
|
Comment on lines
+970
to
+972
| conn = sqlite3.connect(outfile) | ||
| c = conn.cursor() | ||
|
|
Comment on lines
265
to
+270
| def _build_score_sql(self, con): | ||
| """Build SQL fragment for score columns in unscored files.""" | ||
| # Skip if exclude_feature_var is enabled | ||
| if self.config.exclude_feature_var: | ||
| return "" | ||
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request introduces several performance improvements and new options for exporting and merging data in PyProphet, especially targeting large datasets and containerized environments. The updates focus on memory optimization, export speed, and robustness when required extensions are unavailable.
Export and Merge CLI Enhancements:
--exclude_feature_varoption to theexport parquetCLI, allowing users to exclude feature variance columns (VAR_*) fromFEATURE_MS1andFEATURE_MS2tables. This can significantly speed up exports and reduce file size. [1] [2] [3] [4]--freshoption to themerge oswCLI, which allows users to start from scratch and ignore any existing merged output file. [1] [2]Performance and Robustness Improvements:
UNION ALLquery, eliminating the need for intermediate temp tables and reducing memory footprint.Container and Extension Handling:
sqlite_scannerextension cannot be downloaded (e.g., in containers without internet access). The code now logs clear warnings, suggests solutions, and falls back gracefully to alternative export methods. [1] [2] [3]Other Notable Changes:
.dockerignoreto allow inclusion of specific data files used by the Rust-basedosw_to_parquettool.ROW_NUMBERwindow function to select the best score per feature, improving correctness and efficiency. [1] [2]These changes collectively improve export/merge workflows for large-scale and containerized analyses, with greater configurability and reliability.