Skip to content

feat: optimize OSW export and merge workflows#213

Open
singjc wants to merge 3 commits into
PyProphet:masterfrom
singjc:split/osw-export-merge
Open

feat: optimize OSW export and merge workflows#213
singjc wants to merge 3 commits into
PyProphet:masterfrom
singjc:split/osw-export-merge

Conversation

@singjc

@singjc singjc commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

This pull request introduces several performance improvements and new options for exporting and merging data in PyProphet, especially targeting large datasets and containerized environments. The updates focus on memory optimization, export speed, and robustness when required extensions are unavailable.

Export and Merge CLI Enhancements:

  • Added a new --exclude_feature_var option to the export parquet CLI, allowing users to exclude feature variance columns (VAR_*) from FEATURE_MS1 and FEATURE_MS2 tables. This can significantly speed up exports and reduce file size. [1] [2] [3] [4]
  • Added a --fresh option to the merge osw CLI, which allows users to start from scratch and ignore any existing merged output file. [1] [2]

Performance and Robustness Improvements:

  • The creation of the peptide unimod-to-codename mapping table now processes peptides in chunks, greatly reducing memory usage for large datasets. Progress is logged, and the final count of mappings is reported. [1] [2]
  • During export, indices are now created on key tables and columns in SQLite databases to optimize join performance, improving export speed for large files.
  • The export to a single parquet file now streams data directly using a UNION ALL query, eliminating the need for intermediate temp tables and reducing memory footprint.

Container and Extension Handling:

  • Improved handling when the DuckDB sqlite_scanner extension cannot be downloaded (e.g., in containers without internet access). The code now logs clear warnings, suggests solutions, and falls back gracefully to alternative export methods. [1] [2] [3]

Other Notable Changes:

  • Updated .dockerignore to allow inclusion of specific data files used by the Rust-based osw_to_parquet tool.
  • Alignment data export now uses a ROW_NUMBER window function to select the best score per feature, improving correctness and efficiency. [1] [2]

These changes collectively improve export/merge workflows for large-scale and containerized analyses, with greater configurability and reliability.

Copilot AI review requested due to automatic review settings June 17, 2026 23:32

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to improve performance and robustness of PyProphet OSW export/merge workflows for large datasets, including new CLI options to control behavior (fresh merges, excluding variance columns) and better handling of DuckDB extension availability in containerized environments.

Changes:

  • Added merge osw --fresh and extended merge implementation with batching and resume/progress tracking.
  • Added export parquet --exclude_feature_var plus related config plumbing, and introduced additional export performance work (SQLite indices, streaming UNION ALL export).
  • Improved DuckDB sqlite_scanner extension handling messaging for offline/container environments.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
pyprophet/util.py Adds --fresh, batching, and resume/progress tracking to OSW merge workflows; adjusts pragmas/logging; removes VACUUM for post-scored merges.
pyprophet/io/util.py Updates DuckDB sqlite_scanner loading logic and messaging for offline/container cases.
pyprophet/io/export/osw.py Adds exclude_feature_var hook, creates SQLite indices to speed joins, streams single-parquet export via UNION ALL, changes peptide mapping to chunked processing, and refines alignment export query.
pyprophet/cli/merge.py Adds --fresh flag to merge osw CLI and forwards it into merge implementation.
pyprophet/cli/export.py Adds --exclude_feature_var option to parquet export CLI and passes through to config.
pyprophet/_config.py Adds exclude_feature_var to ExportIOConfig.
.dockerignore Adjusts ignore rules to include specific Rust tool data files.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pyprophet/io/util.py
Comment on lines +394 to +415
# Check if it's a network/download error (e.g., in containers)
if ("Failed to download extension" in error_msg or
"Connection timed out" in error_msg or
"Network unreachable" in error_msg):
from loguru import logger
logger.warning(
f"Cannot download sqlite_scanner extension (likely in container without internet). "
f"Attempting to load from local cache or using fallback method.\n"
f"To fix: Set DUCKDB_EXTENSION_DIRECTORY environment variable to a directory with pre-downloaded extensions.\n"
f"Details: {error_msg}"
)
# Try to load from local cache
try:
conn.execute("LOAD sqlite_scanner")
except Exception:
# If it still fails, the export will fall back to direct sqlite3 if available
logger.error(
"Could not load sqlite_scanner. Export may use slower fallback method. "
"For better performance, pre-download extensions: "
"python3 -c 'import duckdb; duckdb.connect(\":memory:\").execute(\"LOAD sqlite_scanner\")'"
)

Comment on lines +1524 to +1537
try:
load_sqlite_scanner(conn)
except Exception as scanner_error:
# If sqlite_scanner fails to load (e.g., in containers without internet),
# provide helpful guidance but continue with fallback
if "Failed to download extension" in str(scanner_error) or "Connection timed out" in str(scanner_error):
click.echo(
"Warning: sqlite_scanner extension could not be loaded (likely in container without internet access).\n"
"To fix: Set DUCKDB_EXTENSION_DIRECTORY environment variable to a directory with pre-downloaded extensions.\n"
"Or pre-download extensions on your host with: "
"python3 -c 'import duckdb; duckdb.connect(\":memory:\").execute(\"LOAD sqlite_scanner\")'\n"
"Continuing with alternative method...",
err=True
)
Comment on lines +1557 to +1570
try:
load_sqlite_scanner(conn)
except Exception as scanner_error:
# If sqlite_scanner fails to load (e.g., in containers without internet),
# provide helpful guidance but continue with fallback
if "Failed to download extension" in str(scanner_error) or "Connection timed out" in str(scanner_error):
click.echo(
"Warning: sqlite_scanner extension could not be loaded (likely in container without internet access).\n"
"To fix: Set DUCKDB_EXTENSION_DIRECTORY environment variable to a directory with pre-downloaded extensions.\n"
"Or pre-download extensions on your host with: "
"python3 -c 'import duckdb; duckdb.connect(\":memory:\").execute(\"LOAD sqlite_scanner\")'\n"
"Continuing with alternative method...",
err=True
)
Comment on lines +1901 to +1907
# Merge on codename
merged_chunk = pd.merge(
unimod_chunk,
codename_chunk,
on="codename",
how="outer",
)
Comment thread pyprophet/util.py
Comment on lines +425 to +427
# Only create empty tables if not resuming
# OR if resuming but MERGE_PROGRESS was just created (old partial merge) - need to clear feature tables
need_to_recreate_feature_tables = is_resume and all(v == 0 for v in progress.values())
Comment thread pyprophet/util.py
Comment on lines +1169 to +1170
## Skip VACUUM for now (it's slow) - SQLite will auto-optimize on next use
click.echo("\nInfo: All Post-Scored OSWS files were merged successfully.")
Comment thread pyprophet/cli/merge.py
Comment on lines 58 to +62
"--merged_post_scored_runs",
is_flag=True,
help="Merge OSW output files that have already been scored.",
)
@click.option(
Comment on lines +267 to +270
# Skip if exclude_feature_var is enabled
if self.config.exclude_feature_var:
return ""

Comment thread pyprophet/util.py
Comment on lines +970 to +972
conn = sqlite3.connect(outfile)
c = conn.cursor()

Comment on lines 265 to +270
def _build_score_sql(self, con):
"""Build SQL fragment for score columns in unscored files."""
# Skip if exclude_feature_var is enabled
if self.config.exclude_feature_var:
return ""

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants