feat: improve scoring workflows for large OSW datasets #212
Open
singjc wants to merge 10 commits into
Pull request overview
This PR adds new scoring/training configuration knobs aimed at scaling pyprophet score to large OSW datasets, including persisted-scorer streaming apply, run-level filtering, and report generation controls.
Changes:
- Added persisted scorer support for OSW workflows and a streamed apply-weights path to reduce memory usage on large multi-run OSW files.
- Introduced `report_mode` (auto|full|main|none) and `apply_weights_run_batch_size` to control report scope and streamed apply batching.
- Added experimental transition scoring/training feature flags and `run_id_filter` support across config, readers, and tests.
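The streamed apply path is described as processing runs in batches sized by `apply_weights_run_batch_size`. A minimal sketch of that batching idea follows, assuming the run IDs are already available as a list. `iter_run_batches` is a hypothetical helper written for illustration and is not taken from the PR.

```python
from typing import Iterator, List, Sequence


def iter_run_batches(run_ids: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of run_ids holding at most batch_size entries.

    Hypothetical helper: the PR's actual batching code is not shown in the source.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(run_ids), batch_size):
        yield list(run_ids[start:start + batch_size])


# Five runs with a batch size of two give two full batches and one partial batch.
batches = list(iter_run_batches([101, 102, 103, 104, 105], batch_size=2))
print(batches)  # [[101, 102], [103, 104], [105]]
```

Only one batch of runs needs to be held in memory at a time, which is the property the streamed apply relies on for large multi-run files.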
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_pyprophet_score.py | Adds an integration test for persisted-scorer streaming apply with report_mode=main. |
| tests/test_io_scoring.py | Adds tests validating run_id_filter behavior for OSW reader subsets. |
| pyprophet/scoring/semi_supervised.py | Adds transition-training target filters and refactors score-alias handling. |
| pyprophet/scoring/runner.py | Implements streamed OSW apply using a persisted scorer; defers reader loading for apply path. |
| pyprophet/scoring/pyprophet.py | Adds compact error-stat lookup for persisted scorers and adjusts pickling payload. |
| pyprophet/scoring/data_handling.py | Introduces get_score_alias_columns, preserves meta_* columns, and updates feature-matrix selection. |
| pyprophet/report.py | Adds report_mode support to skip report generation or omit downstream pages. |
| pyprophet/io/scoring/tsv.py | Respects report_mode when deciding whether to generate a PDF report. |
| pyprophet/io/scoring/split_parquet.py | Adds optional transition metadata/features for mapping/phospho-loss and training filters. |
| pyprophet/io/scoring/parquet.py | Adds optional transition metadata/features for mapping/phospho-loss and training filters. |
| pyprophet/io/scoring/osw.py | Adds run_id_filter support and transition metadata/features; adds incremental score writing and scorer persistence. |
| pyprophet/io/_base.py | Adds a no-op save_scorer hook and ensures writers respect report_mode=none. |
| pyprophet/cli/score.py | Adds CLI options for new transition flags, report_mode, and streamed apply batching; implements auto report selection for large experiments. |
| pyprophet/_config.py | Extends config dataclasses/serialization to include new flags, report_mode, batch size, and run_id_filter. |
Comment on lines +44 to +50
    self._create_indexes()
    if getattr(self.config, "run_id_filter", None) is not None:
        logger.info(
            "Using SQLite read path for run-scoped OSW access."
        )
        con = sqlite3.connect(self.infile)
        return self._read_using_sqlite(con)
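The excerpt above switches to a SQLite read path when `run_id_filter` is set. A minimal sketch of what a run-scoped query could look like, using only the standard `sqlite3` module, is shown here. The table `FEATURE_DEMO`, its columns, and the function `read_features_for_runs` are assumptions made for illustration; they do not reproduce the PR's `_read_using_sqlite` implementation or the real OSW schema.

```python
import sqlite3
from typing import List, Sequence, Tuple


def read_features_for_runs(
    con: sqlite3.Connection, run_ids: Sequence[int]
) -> List[Tuple[int, int]]:
    """Return (feature_id, run_id) rows restricted to the requested runs.

    Hypothetical helper operating on a toy table, not the real OSW layout.
    """
    if not run_ids:
        return []
    # One "?" placeholder per run ID keeps the query parameterized.
    placeholders = ",".join("?" for _ in run_ids)
    query = (
        "SELECT ID, RUN_ID FROM FEATURE_DEMO "
        f"WHERE RUN_ID IN ({placeholders}) ORDER BY ID"
    )
    return con.execute(query, list(run_ids)).fetchall()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE FEATURE_DEMO (ID INTEGER PRIMARY KEY, RUN_ID INTEGER)")
con.executemany(
    "INSERT INTO FEATURE_DEMO VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 10), (4, 30)],
)
rows = read_features_for_runs(con, [10, 30])
con.close()
print(rows)  # [(1, 10), (3, 10), (4, 30)]
```

Filtering inside the query means rows from other runs are never materialized in Python.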
Comment on lines 382 to 386
    con = sqlite3.connect(apply_weights)
    if self.classifier in ("LDA", "SVM"):
        try:
            con = sqlite3.connect(apply_weights)

            if not check_sqlite_table(con, "PYPROPHET_WEIGHTS"):
                raise click.ClickException(
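The excerpt above opens a weights database and checks for a `PYPROPHET_WEIGHTS` table before continuing. A minimal, self-contained sketch of that pattern follows. `table_exists` and `open_weights_db` are hypothetical stand-ins (the source does not show how `check_sqlite_table` is implemented), `RuntimeError` is used in place of `click.ClickException` to keep the example standard-library only, and the demo table columns are invented.

```python
import sqlite3


def table_exists(con: sqlite3.Connection, name: str) -> bool:
    """Return True if a table called `name` exists in the connected database."""
    row = con.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def open_weights_db(path: str, table: str = "PYPROPHET_WEIGHTS") -> sqlite3.Connection:
    """Open the database once and fail early if the expected table is missing."""
    con = sqlite3.connect(path)
    try:
        if not table_exists(con, table):
            raise RuntimeError(f"Table {table} not found in {path}")
    except Exception:
        con.close()  # do not leak the connection on failure
        raise
    return con


demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE PYPROPHET_WEIGHTS (FEATURE TEXT, WEIGHT REAL)")
print(table_exists(demo, "PYPROPHET_WEIGHTS"))  # True
print(table_exists(demo, "MISSING_TABLE"))      # False
demo.close()
```

In this sketch the connection is opened in exactly one place and closed if validation fails, so no handle is left dangling.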
…xclusion of generated docs
… into split/scoring-large-osw
… into split/scoring-large-osw
…ests
- Updated expected output values in regression tests for multi-split parquet and TSV formats to reflect recent changes in scoring calculations.
- Adjusted the float stabilization logic in the `_stabilize_regtest_float` function to clamp values greater than or equal to 1 to three decimal places, ensuring consistent results across different environments while maintaining higher precision for sub-unit scores.
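The float stabilization rule in the commit message can be sketched as below. This is an illustration under stated assumptions, not the body of `_stabilize_regtest_float`: the function name `stabilize_float_sketch` is hypothetical, the six-decimal precision for sub-unit values is an assumed placeholder (the source only says "higher precision"), and the treatment of negative values is not specified in the source.

```python
def stabilize_float_sketch(
    value: float, large_decimals: int = 3, small_decimals: int = 6
) -> float:
    """Round values >= 1 to `large_decimals` places and keep more digits below 1.

    Assumption: small_decimals=6 is a placeholder for the unspecified
    "higher precision" applied to sub-unit scores.
    """
    if value >= 1:
        return round(value, large_decimals)
    return round(value, small_decimals)


print(stabilize_float_sketch(12.3456789))   # 12.346
print(stabilize_float_sketch(0.123456789))  # 0.123457
```

Coarser rounding for larger values hides small platform-dependent differences in the trailing digits, while sub-unit scores keep enough digits to remain distinguishable.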
This pull request introduces several new configuration options and command-line arguments to enhance the flexibility and control of the scoring process in `pyprophet`. The main themes are the addition of experimental features for transition scoring and training, improved report generation controls, and better batch processing options. The changes are reflected throughout the configuration, CLI, and reporting code.

Key changes:

Transition scoring and training enhancements
- Added experimental transition scoring flags `transition_score_use_mapping_cardinality`, `transition_score_use_unique_mapping`, and `transition_score_use_phospho_loss`. These allow exposing additional features for transition scoring.
- Added experimental transition training options `transition_training_require_unique_mapping`, `transition_training_require_phospho_loss`, `transition_training_max_isotope_overlap`, and `transition_training_min_log_sn`.

Report generation improvements
- Added a `report_mode` option (choices: `'auto'`, `'full'`, `'main'`, `'none'`) to control the scope of the PDF report. `'auto'` selects `'main'` for large experiments and `'full'` otherwise. `'none'` disables report generation.
- The configuration now carries `report_mode`, and the report-writing logic respects this setting, skipping report generation if it is set to `'none'`.

Batch processing and filtering
- Added `apply_weights_run_batch_size` to control how many runs are processed together when applying weights, with CLI and config support.
- Added `run_id_filter` to `RunnerIOConfig` to allow filtering by run ID, and integrated it into the argument parsing and config serialization.

Miscellaneous
- Updated the string representations (`__str__`, `__repr__`) of the config classes to include all new options for easier debugging and logging.
- Added a `save_scorer` method stub to the IO base class for future extensibility.

These changes provide more granular control over transition scoring and training, allow for more efficient processing of large experiments, and give users flexibility in report generation and batch processing.
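The `report_mode` behavior described above (`auto` resolving to `main` for large experiments and `full` otherwise, `none` disabling the report) can be sketched as follows. The functions `resolve_report_mode` and `should_write_report` are hypothetical, and so is the way "large" is defined here: the source does not state the criterion, so a run-count threshold of 100 is assumed purely for illustration.

```python
VALID_REPORT_MODES = ("auto", "full", "main", "none")


def resolve_report_mode(
    report_mode: str, n_runs: int, large_run_threshold: int = 100
) -> str:
    """Turn 'auto' into a concrete mode; pass explicit modes through unchanged.

    Assumption: "large experiment" means n_runs >= large_run_threshold.
    The real criterion and threshold are not given in the source.
    """
    if report_mode not in VALID_REPORT_MODES:
        raise ValueError(f"Unknown report_mode: {report_mode!r}")
    if report_mode != "auto":
        return report_mode
    return "main" if n_runs >= large_run_threshold else "full"


def should_write_report(resolved_mode: str) -> bool:
    """Report generation is skipped entirely when the mode is 'none'."""
    return resolved_mode != "none"


print(resolve_report_mode("auto", n_runs=500))  # main
print(resolve_report_mode("auto", n_runs=3))    # full
print(should_write_report(resolve_report_mode("none", n_runs=500)))  # False
```

Resolving `auto` once, up front, lets downstream writers branch on a concrete mode instead of re-deriving the experiment size.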