feat: optimize OSW export and merge workflows by singjc · Pull Request #213 · PyProphet/pyprophet

singjc · 2026-06-17T23:32:05Z

This pull request introduces several performance improvements and new options for exporting and merging data in PyProphet, especially targeting large datasets and containerized environments. The updates focus on memory optimization, export speed, and robustness when required extensions are unavailable.

Export and Merge CLI Enhancements:

Added a new --exclude_feature_var option to the export parquet CLI, allowing users to exclude feature variance columns (VAR_*) from FEATURE_MS1 and FEATURE_MS2 tables. This can significantly speed up exports and reduce file size. [1] [2] [3] [4]
Added a --fresh option to the merge osw CLI, which allows users to start from scratch and ignore any existing merged output file. [1] [2]
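
As a rough illustration of what excluding the `VAR_*` columns amounts to, the sketch below reads the column list of a feature table and drops every name prefixed with `VAR_`. The helper `select_feature_columns` and the sample columns are hypothetical and are not taken from the PyProphet code base.

```python
import sqlite3


def select_feature_columns(conn, table, exclude_feature_var=False):
    """Return the column names of `table`, optionally dropping VAR_* columns."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if exclude_feature_var:
        columns = [name for name in columns if not name.startswith("VAR_")]
    return columns


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE FEATURE_MS2 (FEATURE_ID INTEGER, AREA_INTENSITY REAL, "
    "VAR_XCORR_SHAPE REAL, VAR_LOG_SN_SCORE REAL)"
)

all_columns = select_feature_columns(conn, "FEATURE_MS2")
slim_columns = select_feature_columns(conn, "FEATURE_MS2", exclude_feature_var=True)
print(slim_columns)
```

Selecting fewer columns means less data is read, joined, and written, which is where the speed and file-size gains described above would come from.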

Performance and Robustness Improvements:

The creation of the peptide unimod-to-codename mapping table now processes peptides in chunks, greatly reducing memory usage for large datasets. Progress is logged, and the final count of mappings is reported. [1] [2]
During export, indices are now created on key tables and columns in SQLite databases to optimize join performance, improving export speed for large files.
The export to a single parquet file now streams data directly using a UNION ALL query, eliminating the need for intermediate temp tables and reducing memory footprint.
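
The index and streaming ideas above can be sketched with the standard `sqlite3` module alone. This is a simplified stand-in, not the PR's implementation: the real export writes parquet, whereas this sketch only reads the `UNION ALL` result in batches. The table layout, index names, and the `stream_rows` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE FEATURE (ID INTEGER, PRECURSOR_ID INTEGER);
    CREATE TABLE FEATURE_MS1 (FEATURE_ID INTEGER, AREA_INTENSITY REAL);
    CREATE TABLE FEATURE_MS2 (FEATURE_ID INTEGER, AREA_INTENSITY REAL);
    INSERT INTO FEATURE VALUES (1, 10), (2, 20);
    INSERT INTO FEATURE_MS1 VALUES (1, 100.0), (2, 200.0);
    INSERT INTO FEATURE_MS2 VALUES (1, 150.0), (2, 250.0);
    -- Index the join keys so each join can look rows up instead of scanning.
    CREATE INDEX IF NOT EXISTS idx_feature_ms1_feature_id ON FEATURE_MS1 (FEATURE_ID);
    CREATE INDEX IF NOT EXISTS idx_feature_ms2_feature_id ON FEATURE_MS2 (FEATURE_ID);
    """
)

# One query produces both levels, so no intermediate temp table is needed.
query = """
    SELECT 'MS1' AS LEVEL, f.ID AS FEATURE_ID, m.AREA_INTENSITY
    FROM FEATURE f JOIN FEATURE_MS1 m ON m.FEATURE_ID = f.ID
    UNION ALL
    SELECT 'MS2' AS LEVEL, f.ID AS FEATURE_ID, m.AREA_INTENSITY
    FROM FEATURE f JOIN FEATURE_MS2 m ON m.FEATURE_ID = f.ID
"""


def stream_rows(connection, sql, batch_size=2):
    """Yield result rows in batches so the full result never sits in memory."""
    cursor = connection.execute(sql)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


batches = list(stream_rows(conn, query))
total_rows = sum(len(batch) for batch in batches)
print(total_rows)
```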

Container and Extension Handling:

Improved handling when the DuckDB sqlite_scanner extension cannot be downloaded (e.g., in containers without internet access). The code now logs clear warnings, suggests solutions, and falls back gracefully to alternative export methods. [1] [2] [3]
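
A minimal sketch of this fallback pattern follows. It assumes the extension loader is passed in as a callable, so the example runs without DuckDB installed. The function `choose_export_backend` and the backend labels are hypothetical; only the three error substrings come from the diff in this PR.

```python
import logging

logger = logging.getLogger("export_sketch")

# Substrings that the PR treats as signs of a network or download failure.
NETWORK_ERROR_HINTS = (
    "Failed to download extension",
    "Connection timed out",
    "Network unreachable",
)


def choose_export_backend(load_extension):
    """Return "duckdb" if the extension loads, or "sqlite3" on a network error."""
    try:
        load_extension()
        return "duckdb"
    except Exception as exc:
        message = str(exc)
        if any(hint in message for hint in NETWORK_ERROR_HINTS):
            logger.warning(
                "sqlite_scanner unavailable, using sqlite3 fallback: %s", message
            )
            return "sqlite3"
        # Anything else is unexpected and should not be silently swallowed.
        raise


def offline_loader():
    """Simulate a container without internet access."""
    raise RuntimeError("Failed to download extension sqlite_scanner")


backend = choose_export_backend(offline_loader)
print(backend)
```

Re-raising errors that do not match a known network message keeps genuine bugs visible instead of masking them behind the slower path.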

Other Notable Changes:

Updated .dockerignore to allow inclusion of specific data files used by the Rust-based osw_to_parquet tool.
Alignment data export now uses a ROW_NUMBER window function to select the best score per feature, improving correctness and efficiency. [1] [2]
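
To show how a `ROW_NUMBER` window function picks one best row per feature, here is a small self-contained sketch. The table `ALIGNMENT_SCORES`, its columns, and the assumption that a higher score is better are illustrative only; the actual query in the PR may differ. Window functions require SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ALIGNMENT_SCORES (
        FEATURE_ID INTEGER, ALIGNED_FEATURE_ID INTEGER, SCORE REAL
    );
    INSERT INTO ALIGNMENT_SCORES VALUES
        (1, 11, 0.40), (1, 12, 0.90), (2, 21, 0.70), (2, 22, 0.30);
    """
)

# Rank rows within each feature by descending score and keep only rank 1.
best_rows = conn.execute(
    """
    SELECT FEATURE_ID, ALIGNED_FEATURE_ID, SCORE
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY FEATURE_ID ORDER BY SCORE DESC
               ) AS rn
        FROM ALIGNMENT_SCORES
    )
    WHERE rn = 1
    ORDER BY FEATURE_ID
    """
).fetchall()
print(best_rows)
```

Unlike a `GROUP BY` with `MAX(SCORE)`, this form returns the whole winning row, and it always yields exactly one row per feature even when scores tie.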

These changes collectively improve export/merge workflows for large-scale and containerized analyses, with greater configurability and reliability.

Copilot

Pull request overview

This PR aims to improve performance and robustness of PyProphet OSW export/merge workflows for large datasets, including new CLI options to control behavior (fresh merges, excluding variance columns) and better handling of DuckDB extension availability in containerized environments.

Changes:

Added merge osw --fresh and extended merge implementation with batching and resume/progress tracking.
Added export parquet --exclude_feature_var plus related config plumbing, and introduced additional export performance work (SQLite indices, streaming UNION ALL export).
Improved DuckDB sqlite_scanner extension handling messaging for offline/container environments.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 13 comments.


File	Description
`pyprophet/util.py`	Adds `--fresh`, batching, and resume/progress tracking to OSW merge workflows; adjusts pragmas/logging; removes VACUUM for post-scored merges.
`pyprophet/io/util.py`	Updates DuckDB `sqlite_scanner` loading logic and messaging for offline/container cases.
`pyprophet/io/export/osw.py`	Adds `exclude_feature_var` hook, creates SQLite indices to speed joins, streams single-parquet export via `UNION ALL`, changes peptide mapping to chunked processing, and refines alignment export query.
`pyprophet/cli/merge.py`	Adds `--fresh` flag to `merge osw` CLI and forwards it into merge implementation.
`pyprophet/cli/export.py`	Adds `--exclude_feature_var` option to parquet export CLI and passes through to config.
`pyprophet/_config.py`	Adds `exclude_feature_var` to `ExportIOConfig`.
`.dockerignore`	Adjusts ignore rules to include specific Rust tool data files.


+        # Check if it's a network/download error (e.g., in containers)
+        if ("Failed to download extension" in error_msg or 
+            "Connection timed out" in error_msg or
+            "Network unreachable" in error_msg):
+            from loguru import logger
+            logger.warning(
+                f"Cannot download sqlite_scanner extension (likely in container without internet). "
+                f"Attempting to load from local cache or using fallback method.\n"
+                f"To fix: Set DUCKDB_EXTENSION_DIRECTORY environment variable to a directory with pre-downloaded extensions.\n"
+                f"Details: {error_msg}"
+            )
+            # Try to load from local cache
+            try:
+                conn.execute("LOAD sqlite_scanner")
+            except Exception:
+                # If it still fails, the export will fall back to direct sqlite3 if available
+                logger.error(
+                    "Could not load sqlite_scanner. Export may use slower fallback method. "
+                    "For better performance, pre-download extensions: "
+                    "python3 -c 'import duckdb; duckdb.connect(\":memory:\").execute(\"LOAD sqlite_scanner\")'"
+                )
+


+        try:
+            load_sqlite_scanner(conn)
+        except Exception as scanner_error:
+            # If sqlite_scanner fails to load (e.g., in containers without internet),
+            # provide helpful guidance but continue with fallback
+            if "Failed to download extension" in str(scanner_error) or "Connection timed out" in str(scanner_error):
+                click.echo(
+                    "Warning: sqlite_scanner extension could not be loaded (likely in container without internet access).\n"
+                    "To fix: Set DUCKDB_EXTENSION_DIRECTORY environment variable to a directory with pre-downloaded extensions.\n"
+                    "Or pre-download extensions on your host with: "
+                    "python3 -c 'import duckdb; duckdb.connect(\":memory:\").execute(\"LOAD sqlite_scanner\")'\n"
+                    "Continuing with alternative method...",
+                    err=True
+                )



+                # Merge on codename
+                merged_chunk = pd.merge(
+                    unimod_chunk,
+                    codename_chunk,
+                    on="codename",
+                    how="outer",
+                )
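
The chunked processing that surrounds this merge can be sketched in plain Python. This is an assumption-laden illustration, not the PR's code: `iter_chunks` and `build_mapping` are hypothetical helpers, and the upper-casing step is a placeholder for the real unimod-to-codename logic, which is not shown in this excerpt.

```python
import logging
from itertools import islice

logger = logging.getLogger("mapping_sketch")


def iter_chunks(items, chunk_size):
    """Yield successive lists of at most `chunk_size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def build_mapping(peptides, chunk_size=2):
    """Build a peptide mapping one chunk at a time, logging progress."""
    mapping = {}
    for index, chunk in enumerate(iter_chunks(peptides, chunk_size), start=1):
        # Placeholder transformation standing in for the real mapping logic.
        mapping.update({peptide: peptide.upper() for peptide in chunk})
        logger.info("Processed chunk %d (%d mappings so far)", index, len(mapping))
    return mapping


mapping = build_mapping(["pepa", "pepb", "pepc", "pepd", "pepe"])
print(len(mapping))
```

Only one chunk of intermediate data is alive at any time, which is what bounds peak memory for large peptide tables.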


+    # Only create empty tables if not resuming
+    # OR if resuming but MERGE_PROGRESS was just created (old partial merge) - need to clear feature tables
+    need_to_recreate_feature_tables = is_resume and all(v == 0 for v in progress.values())


+    ## Skip VACUUM for now (it's slow) - SQLite will auto-optimize on next use
+    click.echo("\nInfo: All Post-Scored OSWS files were merged successfully.")


    "--merged_post_scored_runs",
    is_flag=True,
    help="Merge OSW output files that have already been scored.",
 )
+@click.option(


+        # Skip if exclude_feature_var is enabled
+        if self.config.exclude_feature_var:
+            return ""
+


+            conn = sqlite3.connect(outfile)
+            c = conn.cursor()
+


    def _build_score_sql(self, con):
        """Build SQL fragment for score columns in unscored files."""
+        # Skip if exclude_feature_var is enabled
+        if self.config.exclude_feature_var:
+            return ""
+


feat: optimize OSW export and merge workflows

89b9eba

Copilot AI review requested due to automatic review settings June 17, 2026 23:32

Merge branch 'master' into split/osw-export-merge

e8d1add

singjc enabled auto-merge June 17, 2026 23:32

Copilot started reviewing on behalf of singjc June 17, 2026 23:32

Copilot AI reviewed Jun 17, 2026


Merge branch 'master' into split/osw-export-merge

3e9008e

singjc wants to merge 3 commits into PyProphet:master from singjc:split/osw-export-merge
