feat: improve scoring workflows for large OSW datasets #212

Open
singjc wants to merge 10 commits into PyProphet:master from singjc:split/scoring-large-osw

singjc commented Jun 17, 2026

Contributor

This pull request introduces several new configuration options and command-line arguments to enhance the flexibility and control of the scoring process in pyprophet. The main themes are the addition of experimental features for transition scoring and training, improved report generation controls, and better batch processing options. The changes are reflected throughout the configuration, CLI, and reporting code.

Key changes:

Transition scoring and training enhancements

  • Added new experimental options to control transition scoring features: `transition_score_use_mapping_cardinality`, `transition_score_use_unique_mapping`, and `transition_score_use_phospho_loss`. These expose additional features for transition scoring.
  • Added options to restrict which transitions are used for semi-supervised training: `transition_training_require_unique_mapping`, `transition_training_require_phospho_loss`, `transition_training_max_isotope_overlap`, and `transition_training_min_log_sn`.
  • All new options are exposed via the CLI and integrated into the configuration and argument parsing logic.
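
As a rough illustration of how the four training filters could combine, here is a minimal sketch. The field names mirror the option names above, but the defaults, the row keys (`unique_mapping`, `phospho_loss`, `isotope_overlap`, `log_sn`), and the helper `keep_for_training` are assumptions made for this example and are not taken from the pyprophet code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransitionTrainingFilters:
    # Field names follow the option names; the defaults are assumptions.
    require_unique_mapping: bool = False
    require_phospho_loss: bool = False
    max_isotope_overlap: Optional[float] = None
    min_log_sn: Optional[float] = None


def keep_for_training(row: dict, filters: TransitionTrainingFilters) -> bool:
    """Return True if a transition row passes every enabled filter.

    The row keys used here are hypothetical column names.
    """
    if filters.require_unique_mapping and not row["unique_mapping"]:
        return False
    if filters.require_phospho_loss and not row["phospho_loss"]:
        return False
    if (
        filters.max_isotope_overlap is not None
        and row["isotope_overlap"] > filters.max_isotope_overlap
    ):
        return False
    if filters.min_log_sn is not None and row["log_sn"] < filters.min_log_sn:
        return False
    return True


rows = [
    {"id": 1, "unique_mapping": True, "phospho_loss": False,
     "isotope_overlap": 0.05, "log_sn": 2.0},
    {"id": 2, "unique_mapping": False, "phospho_loss": False,
     "isotope_overlap": 0.05, "log_sn": 2.0},
    {"id": 3, "unique_mapping": True, "phospho_loss": False,
     "isotope_overlap": 0.40, "log_sn": 0.1},
]
filters = TransitionTrainingFilters(
    require_unique_mapping=True, max_isotope_overlap=0.2, min_log_sn=1.0
)
training_ids = [row["id"] for row in rows if keep_for_training(row, filters)]
```

With every filter left at its default, all rows pass, which matches the idea that the options are opt-in restrictions.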

Report generation improvements

  • Introduced a `report_mode` option (choices: `auto`, `full`, `main`, `none`) to control the scope of the PDF report. `auto` selects `main` for large experiments and `full` otherwise; `none` disables report generation.
  • The CLI and config now handle `report_mode`, and the report-writing logic respects this setting, skipping report generation if set to `none`.
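
The `auto` behaviour described above can be sketched as a small resolver. The PR text does not say what counts as a "large" experiment, so the run-count criterion, the threshold value, and the function names below are assumptions for illustration only.

```python
VALID_REPORT_MODES = ("auto", "full", "main", "none")

# Hypothetical cut-off: the description does not state how "large" is decided.
LARGE_EXPERIMENT_RUNS = 100


def resolve_report_mode(report_mode: str, n_runs: int,
                        large_threshold: int = LARGE_EXPERIMENT_RUNS) -> str:
    """Map 'auto' to a concrete mode; pass explicit modes through unchanged."""
    if report_mode not in VALID_REPORT_MODES:
        raise ValueError(f"Unknown report_mode: {report_mode!r}")
    if report_mode != "auto":
        return report_mode
    return "main" if n_runs >= large_threshold else "full"


def should_write_report(resolved_mode: str) -> bool:
    """Report generation is skipped entirely when the mode is 'none'."""
    return resolved_mode != "none"
```

Resolving `auto` once, up front, keeps the downstream report-writing code limited to three concrete cases.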

Batch processing and filtering

  • Added `apply_weights_run_batch_size` to control how many runs are processed together when applying weights, with CLI and config support.
  • Added `run_id_filter` to `RunnerIOConfig` to allow filtering by run ID, and integrated it into the argument parsing and config serialization.
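
To make the batching idea concrete, the sketch below groups run IDs into fixed-size batches, optionally restricted to a subset of runs. The helper name `iter_run_batches` and its signature are hypothetical; only the two option names come from the PR description.

```python
from typing import Collection, Iterator, List, Optional, Sequence


def iter_run_batches(
    run_ids: Sequence[int],
    batch_size: int,
    run_id_filter: Optional[Collection[int]] = None,
) -> Iterator[List[int]]:
    """Yield lists of at most `batch_size` run IDs, preserving input order.

    If `run_id_filter` is given, only run IDs contained in it are kept.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    selected = [
        run_id for run_id in run_ids
        if run_id_filter is None or run_id in run_id_filter
    ]
    for start in range(0, len(selected), batch_size):
        yield selected[start:start + batch_size]


batches = list(iter_run_batches([1, 2, 3, 4, 5], batch_size=2))
filtered = list(iter_run_batches([1, 2, 3, 4, 5], batch_size=2, run_id_filter={2, 5}))
```

A smaller batch size bounds how many runs are held in memory at once, at the cost of more passes over the input file.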

Miscellaneous

  • Improved string representations (`__str__`, `__repr__`) of the config classes to include all new options for easier debugging and logging.
  • Added a save_scorer method stub to the IO base class for future extensibility.
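
One way to picture the `save_scorer` stub is a no-op hook on a base class that concrete writers may override. The class names, the method signature, and the use of `pickle` below are assumptions for this sketch; the actual pyprophet classes may look different.

```python
import pickle


class BaseWriter:
    """Hypothetical stand-in for an IO base class."""

    def save_scorer(self, scorer) -> None:
        # No-op by default: formats that cannot persist a scorer ignore the call.
        return None


class PickleScorerWriter(BaseWriter):
    """Hypothetical subclass that persists the scorer to a file with pickle."""

    def __init__(self, path: str) -> None:
        self.path = path

    def save_scorer(self, scorer) -> None:
        with open(self.path, "wb") as handle:
            pickle.dump(scorer, handle)
```

Because the base implementation does nothing, callers can invoke `save_scorer` unconditionally without checking which writer they hold.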

These changes provide more granular control over transition scoring and training, allow for more efficient processing of large experiments, and give users flexibility in report generation and batch processing.

Copilot AI review requested due to automatic review settings June 17, 2026 23:31

Copilot AI left a comment

Contributor

Pull request overview

This PR adds new scoring/training configuration knobs aimed at scaling pyprophet score to large OSW datasets, including persisted-scorer streaming apply, run-level filtering, and report generation controls.

Changes:

  • Added persisted scorer support for OSW workflows and a streamed apply-weights path to reduce memory usage on large multi-run OSW files.
  • Introduced report_mode (auto|full|main|none) and apply_weights_run_batch_size to control report scope and streamed apply batching.
  • Added experimental transition scoring/training feature flags and run_id_filter support across config, readers, and tests.
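
For readers unfamiliar with run-scoped reads, the following sketch shows one way to restrict a SQLite query to selected runs using a parameterized `IN` clause. The two-column `FEATURE` table and the helper name are simplifying assumptions for the example; the real OSW schema and the reader's queries are more involved.

```python
import sqlite3
from typing import List, Sequence, Tuple


def read_features_for_runs(
    con: sqlite3.Connection, run_ids: Sequence[int]
) -> List[Tuple[int, int]]:
    """Return (ID, RUN_ID) rows for the requested runs only."""
    if not run_ids:
        return []
    # One "?" placeholder per run ID keeps the query parameterized.
    placeholders = ",".join("?" for _ in run_ids)
    query = (
        "SELECT ID, RUN_ID FROM FEATURE "
        f"WHERE RUN_ID IN ({placeholders}) ORDER BY ID"
    )
    return con.execute(query, list(run_ids)).fetchall()


# Toy in-memory database standing in for an OSW file.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE FEATURE (ID INTEGER PRIMARY KEY, RUN_ID INTEGER)")
con.executemany(
    "INSERT INTO FEATURE VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 20), (4, 30)],
)
subset = read_features_for_runs(con, [10, 30])
```

Filtering in SQL rather than after loading means rows from unselected runs never reach Python, which is the point of a run-scoped read on a large file.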

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/test_pyprophet_score.py | Adds an integration test for persisted-scorer streaming apply with report_mode=main. |
| tests/test_io_scoring.py | Adds tests validating run_id_filter behavior for OSW reader subsets. |
| pyprophet/scoring/semi_supervised.py | Adds transition-training target filters and refactors score-alias handling. |
| pyprophet/scoring/runner.py | Implements streamed OSW apply using a persisted scorer; defers reader loading for apply path. |
| pyprophet/scoring/pyprophet.py | Adds compact error-stat lookup for persisted scorers and adjusts pickling payload. |
| pyprophet/scoring/data_handling.py | Introduces get_score_alias_columns, preserves meta_* columns, and updates feature-matrix selection. |
| pyprophet/report.py | Adds report_mode support to skip report generation or omit downstream pages. |
| pyprophet/io/scoring/tsv.py | Respects report_mode when deciding whether to generate a PDF report. |
| pyprophet/io/scoring/split_parquet.py | Adds optional transition metadata/features for mapping/phospho-loss and training filters. |
| pyprophet/io/scoring/parquet.py | Adds optional transition metadata/features for mapping/phospho-loss and training filters. |
| pyprophet/io/scoring/osw.py | Adds run_id_filter support and transition metadata/features; adds incremental score writing and scorer persistence. |
| pyprophet/io/_base.py | Adds a no-op save_scorer hook and ensures writers respect report_mode=none. |
| pyprophet/cli/score.py | Adds CLI options for new transition flags, report_mode, and streamed apply batching; implements auto report selection for large experiments. |
| pyprophet/_config.py | Extends config dataclasses/serialization to include new flags, report_mode, batch size, and run_id_filter. |

Comment thread pyprophet/io/scoring/osw.py (outdated)
Comment on lines +44 to +50

```python
self._create_indexes()
if getattr(self.config, "run_id_filter", None) is not None:
    logger.info(
        "Using SQLite read path for run-scoped OSW access."
    )
    con = sqlite3.connect(self.infile)
    return self._read_using_sqlite(con)
```

Comment thread pyprophet/scoring/runner.py (outdated)
Comment on lines 382 to 386

```python
con = sqlite3.connect(apply_weights)
if self.classifier in ("LDA", "SVM"):
    try:
        con = sqlite3.connect(apply_weights)

        if not check_sqlite_table(con, "PYPROPHET_WEIGHTS"):
            raise click.ClickException(
```

singjc and others added 8 commits June 17, 2026 20:16
…ests

- Updated expected output values in regression tests for multi-split parquet and TSV formats to reflect recent changes in scoring calculations.
- Adjusted the float stabilization logic in the `_stabilize_regtest_float` function to round values greater than or equal to 1 to three decimal places, ensuring consistent results across different environments while maintaining higher precision for sub-unit scores.
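
The rounding rule in the second bullet can be pictured with a short sketch. This is not the project's `_stabilize_regtest_float`; the function name and the six-decimal precision used for sub-unit values are assumptions, since the commit message only says that those values keep "higher precision".

```python
def stabilize_float(value: float, coarse_digits: int = 3, fine_digits: int = 6) -> float:
    """Round values >= 1 coarsely and smaller values more finely.

    `fine_digits` is an assumed value chosen for illustration.
    """
    if value >= 1:
        return round(value, coarse_digits)
    return round(value, fine_digits)


large_score = stabilize_float(12.34567)
small_score = stabilize_float(0.123456789)
```

Rounding larger values more aggressively hides small platform-dependent differences in the last digits, while sub-unit scores such as q-values keep enough digits to remain distinguishable.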