Skip to content

Use susie_ser for sumstat PIP screening#498

Merged
danielnachun merged 8 commits into
StatFunGen:mainfrom
xueweic:main
May 25, 2026
Merged

Use susie_ser for sumstat PIP screening#498
danielnachun merged 8 commits into
StatFunGen:mainfrom
xueweic:main

Conversation

@xueweic

@xueweic xueweic commented May 25, 2026

Copy link
Copy Markdown
Contributor

two minor changes

  • Use susie_ser for sumstat PIP screening
  • Handle empty RSS tabix regions

xueweic and others added 8 commits May 24, 2026 11:18
Final Report

1. Interface Separation

Rationale:
Direct ColocBoost users and xQTL protocol users need different entrypoints. Mixing both workflows in one function made the API confusing and encouraged duplicated QC logic.

Key changes:
- Added/standardized colocboost_analysis() as the direct user-facing wrapper.
  - Accepts the same core inputs as colocboost() through ....
  - Adds optional QC/imputation parameters.
  - If no QC is requested, it behaves as a minimal call-through to colocboost().
  - If QC is requested, it prepares cleaned inputs first, then calls colocboost().
- Added/standardized colocboost_pipeline() as the protocol-oriented wrapper.
  - Starts from region_data produced by load_multitask_regional_data().
  - Designed for xQTL-protocol workflows.
  - Runs QC once, then reuses cleaned inputs across xQTL-only, joint-GWAS, and separate-GWAS analyses.

2. RegionalData Conversion Layer

Rationale:
Loaded regional data should be converted into reusable analysis inputs without reloading files or duplicating conversion logic.

Key changes:
- Added reusable adapters:
  - region_data_to_rss_input() for summary-stat/RSS workflows.
  - region_data_to_ind_input() for individual-level workflows.
  - region_data_to_colocboost_input() for ColocBoost-specific assembly.
- Preserved shared heavy objects such as X, LD, and X_ref.
- Used dictionaries such as dict_YX and dict_sumstatLD to avoid duplicated matrix storage.

3. Shared QC Refactor

Rationale:
QC logic should be reusable across ColocBoost, SuSiE RSS, and individual-level SuSiE workflows instead of being embedded separately in each pipeline.

Key changes:
- Added qc_individual_data() for individual-level QC.
- Expanded summary_stats_qc() to support combined RSS/ColocBoost summary-stat QC while preserving its historical LD-mismatch QC behavior.
- Clarified qc_method behavior:
  - NULL, "none", and "rss_qc" mean basic summary-stat harmonization only.
  - "slalom" and "dentist" add optional LD-mismatch outlier QC.
- Added QC progress messages for stage tracking, retained counts, skips, and warnings.

4. X_ref-Based Imputation

Rationale:
When genotype reference X_ref is available, imputation should use the genotype reference directly rather than deriving LD blocks, because LD block partitioning can alter imputation behavior.

Key changes:
- Final behavior uses whole-region scaled X_ref with RAISS genotype/SVD imputation.
- The temporary scaled matrix is used only inside imputation.
- The original X_ref remains the reference passed into the final colocboost() call.
- The implementation avoids full LD construction for X_ref imputation.
- Added tests showing:
  - raw X_ref SVD can differ from LD-based RAISS;
  - scaled X_ref SVD matches LD scale in a controlled single-matrix case;
  - colocboost_analysis(..., X_ref=..., impute=TRUE) uses scaled genotype/SVD and does not call compute_LD().

Important report observation:
- Old LD-block RAISS retained fewer variants for ENSG00000155304: 166 -> 7 and 167 -> 7.
- Whole-region scaled X_ref SVD retained more: 166 -> 16 and 167 -> 13.
- This difference is expected and should be highlighted as evidence that LD block partitioning changes imputation behavior.

5. Protocol Compatibility and Minimal Loader Fixes

Rationale:
The implementation should support the uploaded chr21 test data and xQTL-protocol use case while keeping loader changes minimal.

Key changes:
- Kept all changes within pecotmr; no changes were made to the core colocboost() function.
- Made only minimal loader-related fixes required by smoke testing:
  - multi-genotype-source loading via match_geno_pheno;
  - header handling for tabix-backed summary-stat files;
  - whitespace-padded covariate numeric parsing.
- Updated protocol-facing naming to use colocboost_pipeline().

6. Documentation and Tests

Rationale:
The new interfaces, QC behavior, and imputation decision need clear user-facing documentation and regression coverage so future changes do not reintroduce ambiguity.

Key changes:
- Updated documentation/man pages and vignette references for the new ColocBoost interfaces.
- Added chr21 smoke-test deliverable notes emphasizing the X_ref imputation decision.
- Added focused regression tests for:
  - colocboost_analysis() direct-input behavior;
  - colocboost_pipeline() protocol-level behavior;
  - reusable summary-stat QC behavior;
  - RAISS raw/scaled genotype-reference behavior;
  - X_ref-backed imputation avoiding compute_LD().
- Focused tests passed:
  - test_raiss.R
  - test_sumstats_qc.R
  - test_colocboost_pipeline.R
- Chr21 smoke tests completed on two genes across six scenarios, covering no-AD xQTL-only, basic QC, imputation, PIP screening, SLALOM QC, and combined modes.
Remove the legacy rss_qc alias so summary_stats_qc only accepts none, slalom, and dentist.
# Conflicts:
#	NAMESPACE
#	R/sumstats_qc.R
#	R/univariate_pipeline.R
#	man/summary_stats_qc.Rd
#	tests/testthat/test_sumstats_qc.R
@danielnachun danielnachun merged commit fb8eef2 into StatFunGen:main May 25, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants