From 4667533595b90eccb9d2a82d82e7c625467cabf9 Mon Sep 17 00:00:00 2001 From: xuewei cao <36172337+xueweic@users.noreply.github.com> Date: Sun, 24 May 2026 11:18:15 -0500 Subject: [PATCH 1/5] [Important update] Design Shared pecotmr QC and ColocBoost Analysis Work Final Report 1. Interface Separation Rationale: Direct ColocBoost users and xQTL protocol users need different entrypoints. Mixing both workflows in one function made the API confusing and encouraged duplicated QC logic. Key changes: - Added/standardized colocboost_analysis() as the direct user-facing wrapper. - Accepts the same core inputs as colocboost() through .... - Adds optional QC/imputation parameters. - If no QC is requested, it behaves as a minimal call-through to colocboost(). - If QC is requested, it prepares cleaned inputs first, then calls colocboost(). - Added/standardized colocboost_pipeline() as the protocol-oriented wrapper. - Starts from region_data produced by load_multitask_regional_data(). - Designed for xQTL-protocol workflows. - Runs QC once, then reuses cleaned inputs across xQTL-only, joint-GWAS, and separate-GWAS analyses. 2. RegionalData Conversion Layer Rationale: Loaded regional data should be converted into reusable analysis inputs without reloading files or duplicating conversion logic. Key changes: - Added reusable adapters: - region_data_to_rss_input() for summary-stat/RSS workflows. - region_data_to_ind_input() for individual-level workflows. - region_data_to_colocboost_input() for ColocBoost-specific assembly. - Preserved shared heavy objects such as X, LD, and X_ref. - Used dictionaries such as dict_YX and dict_sumstatLD to avoid duplicated matrix storage. 3. Shared QC Refactor Rationale: QC logic should be reusable across ColocBoost, SuSiE RSS, and individual-level SuSiE workflows instead of being embedded separately in each pipeline. Key changes: - Added qc_individual_data() for individual-level QC. - Expanded summary_stats_qc() to support combined RSS/ColocBoost summary-stat QC while preserving its historical LD-mismatch QC behavior. - Clarified qc_method behavior: - NULL, "none", and "rss_qc" mean basic summary-stat harmonization only. - "slalom" and "dentist" add optional LD-mismatch outlier QC. - Added QC progress messages for stage tracking, retained counts, skips, and warnings. 4. X_ref-Based Imputation Rationale: When genotype reference X_ref is available, imputation should use the genotype reference directly rather than deriving LD blocks, because LD block partitioning can alter imputation behavior. Key changes: - Final behavior uses whole-region scaled X_ref with RAISS genotype/SVD imputation. - The temporary scaled matrix is used only inside imputation. - The original X_ref remains the reference passed into the final colocboost() call. - The implementation avoids full LD construction for X_ref imputation. - Added tests showing: - raw X_ref SVD can differ from LD-based RAISS; - scaled X_ref SVD matches LD scale in a controlled single-matrix case; - colocboost_analysis(..., X_ref=..., impute=TRUE) uses scaled genotype/SVD and does not call compute_LD(). Important report observation: - Old LD-block RAISS retained fewer variants for ENSG00000155304: 166 -> 7 and 167 -> 7. - Whole-region scaled X_ref SVD retained more: 166 -> 16 and 167 -> 13. - This difference is expected and should be highlighted as evidence that LD block partitioning changes imputation behavior. 5. Protocol Compatibility and Minimal Loader Fixes Rationale: The implementation should support the uploaded chr21 test data and xQTL-protocol use case while keeping loader changes minimal. Key changes: - Kept all changes within pecotmr; no changes were made to the core colocboost() function. - Made only minimal loader-related fixes required by smoke testing: - multi-genotype-source loading via match_geno_pheno; - header handling for tabix-backed summary-stat files; - whitespace-padded covariate numeric parsing. - Updated protocol-facing naming to use colocboost_pipeline(). 6. Documentation and Tests Rationale: The new interfaces, QC behavior, and imputation decision need clear user-facing documentation and regression coverage so future changes do not reintroduce ambiguity. Key changes: - Updated documentation/man pages and vignette references for the new ColocBoost interfaces. - Added chr21 smoke-test deliverable notes emphasizing the X_ref imputation decision. - Added focused regression tests for: - colocboost_analysis() direct-input behavior; - colocboost_pipeline() protocol-level behavior; - reusable summary-stat QC behavior; - RAISS raw/scaled genotype-reference behavior; - X_ref-backed imputation avoiding compute_LD(). - Focused tests passed: - test_raiss.R - test_sumstats_qc.R - test_colocboost_pipeline.R - Chr21 smoke tests completed on two genes across six scenarios, covering no-AD xQTL-only, basic QC, imputation, PIP screening, SLALOM QC, and combined modes. --- NAMESPACE | 7 +- R/colocboost_pipeline.R | 1560 ++++++++++++----- R/file_utils.R | 271 ++- R/sumstats_qc.R | 437 ++++- R/univariate_pipeline.R | 112 +- man/colocboost_analysis.Rd | 114 ++ ...sis_pipeline.Rd => colocboost_pipeline.Rd} | 32 +- man/qc_individual_data.Rd | 40 + man/region_data_to_colocboost_input.Rd | 18 + man/region_data_to_ind_input.Rd | 18 + man/region_data_to_rss_input.Rd | 17 + man/rss_analysis_pipeline.Rd | 9 +- man/rss_basic_qc.Rd | 13 +- man/summary_stats_qc.Rd | 84 +- tests/testthat/test_colocboost_pipeline.R | 1341 ++++++++++++-- tests/testthat/test_file_utils.R | 161 +- tests/testthat/test_raiss.R | 64 +- tests/testthat/test_sumstats_qc.R | 117 ++ tests/testthat/test_univariate_pipeline.R | 22 +- vignettes/colocboost-pipeline.Rmd | 14 +- 20 files changed, 3760 insertions(+), 691 deletions(-) create mode 100644 man/colocboost_analysis.Rd rename man/{colocboost_analysis_pipeline.Rd => colocboost_pipeline.Rd} (60%) create mode 100644 man/qc_individual_data.Rd create mode 100644 man/region_data_to_colocboost_input.Rd create mode 100644 man/region_data_to_ind_input.Rd create mode 100644 man/region_data_to_rss_input.Rd diff --git a/NAMESPACE b/NAMESPACE index 369bf50a..513ba822 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -22,7 +22,8 @@ export(check_ld) export(clean_context_names) export(coloc_post_processor) export(coloc_wrapper) -export(colocboost_analysis_pipeline) +export(colocboost_analysis) +export(colocboost_pipeline) export(compute_LD) export(compute_qtl_enrichment) export(compute_sldsc_M_ref) @@ -107,9 +108,13 @@ export(parse_variant_id) export(postprocess_finemapping_fits) export(prs_cs) export(prs_cs_weights) +export(qc_individual_data) export(raiss) export(read_afreq) export(read_sldsc_trait) +export(region_data_to_colocboost_input) +export(region_data_to_ind_input) +export(region_data_to_rss_input) export(region_to_df) export(rss_analysis_pipeline) export(rss_basic_qc) diff --git a/R/colocboost_pipeline.R b/R/colocboost_pipeline.R index 9d921a7d..61c4874d 100644 --- a/R/colocboost_pipeline.R +++ b/R/colocboost_pipeline.R @@ -19,7 +19,7 @@ build_ld_args <- function(ld_list, subset = NULL) { .run_colocboost <- function(label, ...) { t1 <- Sys.time() res <- tryCatch( - colocboost(...), + colocboost_analysis(...), error = function(e) { message(label, " failed: ", conditionMessage(e)) NULL @@ -28,9 +28,313 @@ build_ld_args <- function(ld_list, subset = NULL) { list(result = res, time = Sys.time() - t1) } -#' Multi-trait colocalization analysis pipeline +.cb_call_colocboost <- function(args, dots) { + if (!requireNamespace("colocboost", quietly = TRUE)) { + stop("The colocboost package is required for colocboost_analysis().") + } + do.call(colocboost::colocboost, c(args, dots)) +} + +#' Convert loaded regional data to ColocBoost inputs +#' +#' @param region_data A list returned by \code{load_multitask_regional_data()}. +#' @return A structured list containing \code{colocboost_input}, +#' \code{qc_input}, and \code{source_info}. +#' @export +region_data_to_colocboost_input <- function(region_data) { + ind_records_from_input <- function(input) { + X <- .cb_as_named_list(input$X, "individual") + Y <- .cb_as_named_list(input$Y, "individual") + contexts <- intersect(names(X), names(Y)) + records <- list() + for (context in contexts) { + if (is.null(X[[context]]) || .cb_y_ncol(Y[[context]]) == 0) next + records[[context]] <- list( + X = X[[context]], + Y = Y[[context]], + maf = .cb_list_value(input$maf, context), + X_variance = .cb_list_value(input$X_variance, context) + ) + } + records + } + + ind_input <- region_data_to_ind_input(region_data) + rss_input <- region_data_to_rss_input(region_data) + + ind_records <- ind_records_from_input(ind_input) + ind_args <- .cb_format_individual(ind_records) + + sumstat_records <- lapply(names(rss_input$rss_input), function(study) { + list(rss_input = rss_input$rss_input[[study]], + LD_matrix = rss_input$LD_data[[study]]$LD_matrix) + }) + names(sumstat_records) <- names(rss_input$rss_input) + sumstat_args <- .cb_format_sumstat(sumstat_records) + + outcome_names <- c(ind_args$outcome_names, names(sumstat_args$sumstat)) + ind_args$outcome_names <- NULL + colocboost_input <- .cb_merge_args(ind_args, sumstat_args) + if (length(outcome_names) > 0) colocboost_input$outcome_names <- outcome_names + + list( + colocboost_input = Filter(Negate(is.null), colocboost_input), + qc_input = list( + individual = ind_input[c("X", "Y", "maf", "X_variance")], + sumstat = rss_input[c("rss_input", "LD_data")] + ), + source_info = list(individual = ind_input$source_info, + sumstat = rss_input$source_info) + ) +} + +#' ColocBoost analysis with optional pipeline QC +#' +#' This wrapper keeps the direct \code{colocboost()} argument surface. All +#' ColocBoost inputs and model parameters are supplied through \code{...}. When +#' no QC options are requested, the call is passed directly to +#' \code{colocboost::colocboost()}. When QC options are requested, the wrapper +#' inspects named \code{X}/\code{Y} and/or \code{sumstat}/\code{LD}/\code{X_ref} +#' arguments in \code{...}, runs the relevant reusable QC step, and then calls +#' ColocBoost on the cleaned inputs. If the required named inputs are not +#' available, QC is skipped with a warning and the original ColocBoost call is +#' used. +#' +#' @details +#' Use \code{colocboost_analysis()} the same way you would use +#' \code{colocboost::colocboost()}: pass the native ColocBoost arguments by +#' name, for example \code{X}, \code{Y}, \code{sumstat}, \code{LD}, +#' \code{X_ref}, \code{dict_YX}, \code{dict_sumstatLD}, +#' \code{outcome_names}, \code{focal_outcome_idx}, \code{effect_est}, +#' \code{effect_se}, \code{effect_n}, \code{M}, and other ColocBoost model or +#' post-processing options. These arguments are forwarded unchanged unless one +#' or more QC controls are requested. +#' +#' Individual-level QC is only attempted when at least one individual QC control +#' is non-\code{NULL} and named \code{X} and \code{Y} inputs are available in +#' \code{...}. Summary-statistic QC is only attempted when \code{qc_method}, +#' \code{pip_cutoff_to_skip_sumstat}, \code{impute = TRUE}, or +#' \code{LD_reference_info} is supplied and named \code{sumstat} plus either +#' \code{LD}, \code{X_ref}, or \code{LD_reference_info} are available. +#' \code{qc_method = "none"} and \code{qc_method = "rss_qc"} mean run basic +#' allele/variant harmonization only; they do not run SLALOM/DENTIST +#' LD-mismatch QC. RAISS imputation is controlled separately by +#' \code{impute = TRUE}. +#' +#' If no QC controls are supplied, this function is a thin direct call to +#' \code{colocboost::colocboost(...)}. +#' When QC removes outcomes, \code{outcome_names} and \code{focal_outcome_idx} +#' are updated to match the post-QC outcome order. If the requested focal outcome +#' is removed by QC, \code{focal_outcome_idx} is set to \code{NULL} with a +#' warning. #' -#' This function perform a multi-trait colocalization using ColocBoost +#' @param ... Arguments passed to \code{colocboost::colocboost()}, including +#' data inputs such as \code{X}, \code{Y}, \code{sumstat}, \code{LD}, +#' \code{X_ref}, \code{dict_YX}, \code{dict_sumstatLD}, +#' \code{outcome_names}, and all ColocBoost model/post-processing options. +#' QC can only inspect inputs that are supplied by name. +#' @param missing_rate_thresh,maf_cutoff,xvar_cutoff,ld_reference_meta_file,pip_cutoff_to_skip_ind +#' Individual-level QC controls. If all are \code{NULL}, individual-level QC +#' is not run. +#' @param keep_indel,pip_cutoff_to_skip_sumstat,qc_method,impute,impute_opts +#' Summary-statistic QC controls. \code{qc_method = "none"} and +#' \code{qc_method = "rss_qc"} run basic allele harmonization without +#' LD-mismatch outlier detection. Imputation is only run when +#' \code{impute = TRUE}. +#' @param LD_reference_info Optional LD reference information for +#' summary-statistic QC. This is only needed when the native \code{LD} matrix +#' row/column names or \code{X_ref} column names are missing or are not +#' parseable genomic variant IDs. It can be a .bim/.pvar/.pvar.zst file path, +#' a data.frame with variant metadata, or a \code{load_LD_matrix()} result. +#' This is a QC-only argument and is not passed to +#' \code{colocboost::colocboost()}. +#' @param variant_convention Allele order used by native ColocBoost-style +#' \code{sumstat$variant} and LD/X_ref names when deriving QC inputs: +#' \code{"A2_A1"} for pecotmr canonical \code{chr:pos:A2:A1}, or +#' \code{"A1_A2"} for \code{chr:pos:A1:A2}. +#' @return The object returned by \code{colocboost::colocboost()}. +#' @examples +#' \dontrun{ +#' # Direct ColocBoost call without QC. +#' fit <- colocboost_analysis(X = X, Y = Y, M = 500) +#' +#' # Summary-statistic input with basic allele/variant harmonization only. +#' fit <- colocboost_analysis(sumstat = sumstat, LD = LD, +#' qc_method = "none", M = 500) +#' +#' # Summary-statistic input with LD-mismatch QC and RAISS imputation. +#' fit <- colocboost_analysis(sumstat = sumstat, LD = LD, +#' qc_method = "slalom", impute = TRUE) +#' +#' # Use richer LD metadata from load_LD_matrix() for QC, while still passing +#' # ColocBoost's native LD input. +#' ld_data <- load_LD_matrix(ld_meta_file, region) +#' fit <- colocboost_analysis(sumstat = sumstat, LD = ld_data$LD_matrix, +#' LD_reference_info = ld_data, qc_method = "none") +#' +#' # Individual-level input with explicit genotype QC thresholds. +#' fit <- colocboost_analysis(X = X, Y = Y, +#' missing_rate_thresh = 0.1, +#' maf_cutoff = 0.0005) +#' } +#' @export +colocboost_analysis <- function(..., + # individual QC + missing_rate_thresh = NULL, + maf_cutoff = NULL, + xvar_cutoff = NULL, + ld_reference_meta_file = NULL, + pip_cutoff_to_skip_ind = NULL, + # sumstat QC + keep_indel = TRUE, + pip_cutoff_to_skip_sumstat = NULL, + qc_method = NULL, + impute = FALSE, + impute_opts = list(rcond = 0.01, R2_threshold = 0.6, + minimum_ld = 5, lamb = 0.01), + LD_reference_info = NULL, + variant_convention = c("A2_A1", "A1_A2")) { + variant_convention <- match.arg(variant_convention) + direct_args <- list(...) + pre_qc_data_outcomes <- .cb_colocboost_outcome_names(direct_args, prefer_supplied = FALSE) + pre_qc_display_outcomes <- .cb_colocboost_outcome_names(direct_args, prefer_supplied = TRUE) + if (!is.null(qc_method)) qc_method <- .resolve_summary_qc_method(qc_method) + + individual_qc_requested <- !is.null(missing_rate_thresh) || + !is.null(maf_cutoff) || !is.null(xvar_cutoff) || + !is.null(ld_reference_meta_file) || !is.null(pip_cutoff_to_skip_ind) + sumstat_qc_requested <- !is.null(qc_method) || isTRUE(impute) || + !is.null(pip_cutoff_to_skip_sumstat) || !is.null(LD_reference_info) + qc_requested <- individual_qc_requested || sumstat_qc_requested + if (!qc_requested) { + return(.cb_call_colocboost(direct_args, list())) + } + + X <- direct_args$X + Y <- direct_args$Y + sumstat <- direct_args$sumstat + LD <- direct_args$LD + X_ref <- direct_args$X_ref + dict_YX <- direct_args$dict_YX + dict_sumstatLD <- direct_args$dict_sumstatLD + + qc_skip_messages <- character() + individual_qc_input <- NULL + if (individual_qc_requested) { + if (!is.null(X) && !is.null(Y)) { + individual_qc_input <- .cb_individual_qc_input_from_colocboost(X, Y, dict_YX) + } else { + qc_skip_messages <- c(qc_skip_messages, + "Individual-level QC requested but named X and Y were not both supplied.") + } + } + + sumstat_qc_input <- NULL + if (sumstat_qc_requested) { + if (!is.null(sumstat) && (!is.null(LD) || !is.null(X_ref) || !is.null(LD_reference_info))) { + sumstat_qc_input <- tryCatch( + .cb_sumstat_qc_input_from_colocboost( + sumstat, LD, X_ref, dict_sumstatLD, + LD_reference_info = LD_reference_info, + variant_convention = variant_convention + ), + error = function(e) { + qc_skip_messages <<- c( + qc_skip_messages, + paste("Summary-statistic QC input could not be prepared:", conditionMessage(e)) + ) + NULL + } + ) + if (!is.null(sumstat_qc_input)) { + if (length(sumstat_qc_input$skip_reasons) > 0) { + qc_skip_messages <- c(qc_skip_messages, sumstat_qc_input$skip_reasons) + } + if (length(sumstat_qc_input$rss_input) == 0) { + sumstat_qc_input <- NULL + } + } + } else { + qc_skip_messages <- c(qc_skip_messages, + "Summary-statistic QC requested but named sumstat plus LD, X_ref, or LD_reference_info were not supplied.") + } + } + if (is.null(individual_qc_input) && is.null(sumstat_qc_input)) { + warning("QC requested but required QC inputs are unavailable. Calling colocboost() directly. ", + paste(qc_skip_messages, collapse = " ")) + return(.cb_call_colocboost(direct_args, list())) + } + if (length(qc_skip_messages) > 0) { + warning(paste(qc_skip_messages, collapse = " "), " Skipping unavailable QC branch.") + } + + qc_args <- tryCatch({ + args <- list() + if (!is.null(individual_qc_input)) { + message("QC track: processing individual-level inputs before ColocBoost.") + ind <- qc_individual_data( + X = individual_qc_input$X, + Y = individual_qc_input$Y, + maf = individual_qc_input$maf, + X_variance = individual_qc_input$X_variance, + missing_rate_thresh = missing_rate_thresh, + maf_cutoff = maf_cutoff, + xvar_cutoff = .cb_default(xvar_cutoff, 0), + ld_reference_meta_file = ld_reference_meta_file, + keep_indel = keep_indel, + pip_cutoff_to_skip = .cb_default(pip_cutoff_to_skip_ind, 0) + ) + args <- .cb_merge_args(args, .cb_format_individual(ind)) + } + if (!is.null(sumstat_qc_input) && length(sumstat_qc_input$rss_input) > 0) { + message("QC track: processing summary-statistic inputs before ColocBoost.") + sumstat_qc <- summary_stats_qc( + rss_input = sumstat_qc_input$rss_input, + LD_data = sumstat_qc_input$LD_data, + keep_indel = keep_indel, + pip_cutoff_to_skip = .cb_default(pip_cutoff_to_skip_sumstat, 0), + qc_method = if (is.null(qc_method)) "none" else qc_method, + impute = impute, + impute_opts = impute_opts + ) + args <- .cb_merge_args(args, .cb_format_sumstat(sumstat_qc)) + } + args + }, error = function(e) { + warning("QC requested but skipped: ", conditionMessage(e), + ". Calling colocboost() directly.") + NULL + }) + + if (is.null(qc_args) || length(qc_args) == 0) { + return(.cb_call_colocboost(direct_args, list())) + } + merged_args <- .cb_merge_args(direct_args, qc_args) + if (!is.null(qc_args$LD)) merged_args$X_ref <- NULL + if (!is.null(qc_args$X_ref)) merged_args$LD <- NULL + post_qc_data_outcomes <- .cb_colocboost_outcome_names(merged_args, prefer_supplied = FALSE) + if (length(post_qc_data_outcomes) > 0) { + merged_args$outcome_names <- .cb_resolve_qc_outcome_names( + pre_qc_data_outcomes, + pre_qc_display_outcomes, + post_qc_data_outcomes + ) + merged_args$focal_outcome_idx <- .cb_remap_focal_outcome_idx( + focal_outcome_idx = direct_args$focal_outcome_idx, + pre_qc_data_outcomes = pre_qc_data_outcomes, + pre_qc_display_outcomes = pre_qc_display_outcomes, + post_qc_data_outcomes = post_qc_data_outcomes, + post_qc_display_outcomes = merged_args$outcome_names + ) + } + .cb_call_colocboost(Filter(Negate(is.null), merged_args), list()) +} + +#' Multi-trait colocalization analysis protocol pipeline +#' +#' This function performs protocol-level multi-trait colocalization using +#' ColocBoost. It accepts loaded regional data, performs QC once, then runs the +#' requested xQTL-only, joint GWAS, and separate GWAS analyses. #' #' @param region_data A region data loaded from \code{load_regional_data}. #' @param focal_trait Name of trait if perform focaled ColocBoost @@ -38,7 +342,9 @@ build_ld_args <- function(ld_list, subset = NULL) { #' @param maf_cutoff A scalar to remove variants with maf < maf_cutoff, dafault is 0.005. #' @param pip_cutoff_to_skip_ind A vector of cutoff values for skipping analysis based on PIP values for each context. Default is 0. #' @param pip_cutoff_to_skip_sumstat A vector of cutoff values for skipping analysis based on PIP values for each sumstat Default is 0. -#' @param qc_method Quality control method to use. Options are "slalom" or "dentist" (default: "slalom"). +#' @param qc_method Quality control method to use. Options are "none", +#' "rss_qc", "slalom", or "dentist". \code{NULL} and \code{"rss_qc"} +#' are treated as \code{"none"} for basic-only summary-stat preprocessing. #' @param impute Logical; if TRUE, performs imputation for outliers identified in the analysis (default: TRUE). #' @param impute_opts A list of imputation options including rcond, R2_threshold, and minimum_ld (default: list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5)). #' @@ -57,23 +363,28 @@ build_ld_args <- function(ld_list, subset = NULL) { #' #' @importFrom susieR susie_rss #' @export -colocboost_analysis_pipeline <- function(region_data, - focal_trait = NULL, - event_filters = NULL, - # - analysis - xqtl_coloc = TRUE, - joint_gwas = FALSE, - separate_gwas = FALSE, - # - individual QC - maf_cutoff = 0.0005, - pip_cutoff_to_skip_ind = 0, - # - sumstat QC - keep_indel = TRUE, - pip_cutoff_to_skip_sumstat = 0, - qc_method = c("slalom", "dentist", "none"), - impute = TRUE, - impute_opts = list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5, lamb = 0.01), - ...) { +colocboost_pipeline <- function( + region_data, + focal_trait = NULL, + event_filters = NULL, + # - analysis + xqtl_coloc = TRUE, + joint_gwas = FALSE, + separate_gwas = FALSE, + # - individual QC + maf_cutoff = 0.0005, + pip_cutoff_to_skip_ind = 0, + # - sumstat QC + keep_indel = TRUE, + pip_cutoff_to_skip_sumstat = 0, + qc_method = NULL, + impute = TRUE, + impute_opts = list( + rcond = 0.01, R2_threshold = 0.6, + minimum_ld = 5, lamb = 0.01 + ), + ... +) { # - internal function by filtering events based on event_filters filter_events <- function(events, filters, condition) { # filters is a list of filter specifications @@ -173,14 +484,23 @@ colocboost_analysis_pipeline <- function(region_data, } } } else { - message(if (is.null(phenotypes_init)) "No sumstat data in this region!" else "No sumstat data pass QC.") + if (is.null(phenotypes_init)) { + message("No sumstat data in this region!") + } else if (!is.null(phenotypes_init$sumstat_studies)) { + message(paste( + "Skipping follow-up analysis for sumstat studies", + paste(phenotypes_init$sumstat_studies, collapse = ";"), "after QC." + )) + } else { + message("No sumstat data pass QC.") + } } return(phenotypes) } ####### ========= resolve defaults ======== ####### - qc_method <- match.arg(qc_method) + qc_method <- .resolve_summary_qc_method(qc_method) ####### ========= initial output results before QC ======== ####### analysis_results <- list("xqtl_coloc" = NULL, "joint_gwas" = NULL, "separate_gwas" = NULL) @@ -234,78 +554,22 @@ colocboost_analysis_pipeline <- function(region_data, impute_opts = impute_opts ) phenotypes_QC <- extract_contexts_studies(region_data, phenotypes_init = phenotypes_init) + if (!is.null(phenotypes_init$sumstat_studies) && + is.null(phenotypes_QC$sumstat_studies)) { + message("No valid summary statistic studies remain after validation.") + } t02 <- Sys.time() analysis_results$computing_time$QC <- t02 - t01 - ####### ========= organize individual level data ======== ######## - individual_data <- region_data$individual_data - if (!is.null(individual_data)) { - X <- individual_data$X - Y <- individual_data$Y - null_Y <- which(sapply(Y, is.null)) - if (length(null_Y) != 0 & length(null_Y) != length(Y)) { - X <- X[-null_Y] - Y <- Y[-null_Y] - } else if (length(null_Y) == length(Y)) { - X <- NULL - Y <- NULL - } - if (!is.null(Y)) { - Y_split <- purrr::imap(Y, function(y, i) { - purrr::map(seq_len(ncol(y)), function(j) setNames(y[, j, drop = FALSE], colnames(y)[j])) - }) - dict_YX <- cbind(seq_along(unlist(Y_split, recursive = FALSE)), rep(seq_along(Y_split), purrr::map_int(Y_split, length))) - Y <- unlist(Y_split, recursive = FALSE) - names(Y) <- sapply(Y, colnames) - } - } else { - X <- Y <- dict_YX <- NULL - } - - ####### ========= organize summary statistics ======== ######## - sumstat_data <- region_data$sumstat_data - if (!is.null(sumstat_data$sumstats)) { - sumstats <- lapply(sumstat_data$sumstats, function(ss) { - z <- ss$sumstats$z - # Normalize variant IDs to canonical format (with chr prefix) - variant <- normalize_variant_id(as.character(ss$sumstats$variant_id)) - n <- ss$n - - # Filter out NA values from z-scores and corresponding variants - na_mask <- !is.na(z) - if (sum(na_mask) == 0) { - message("Warning: All z-scores are NA for this summary statistic dataset") - return(data.frame("z" = numeric(0), "n" = numeric(0), "variant" = character(0))) - } - - data.frame("z" = z[na_mask], "n" = n, "variant" = variant[na_mask]) - }) - names(sumstats) <- names(sumstat_data$sumstats) - LD_mat <- lapply(sumstat_data$LD_mat, function(ld) { - # Normalize LD dimnames to canonical format (with chr prefix) - if (!is.null(colnames(ld))) { - colnames(ld) <- normalize_variant_id(as.character(colnames(ld))) - } - if (!is.null(rownames(ld))) { - rownames(ld) <- normalize_variant_id(as.character(rownames(ld))) - } - return(ld) - }) - LD_match <- sumstat_data$LD_match - dict_sumstatLD <- cbind(seq_along(sumstats), match(LD_match, names(sumstat_data$LD_mat))) - # Validate and filter sumstat entries before analysis - filtered <- filter_valid_sumstats(sumstats, LD_mat, LD_match) - if (is.null(filtered)) { - sumstats <- LD_mat <- dict_sumstatLD <- NULL - } else { - sumstats <- filtered$sumstats - LD_mat <- filtered$LD_mat - LD_match <- filtered$LD_match - dict_sumstatLD <- filtered$dict_sumstatLD - } - } else { - sumstats <- LD_mat <- dict_sumstatLD <- NULL - } + ####### ========= convert QC'd regional data to ColocBoost input ======== ######## + colocboost_input <- region_data_to_colocboost_input(region_data)$colocboost_input + X <- colocboost_input$X + Y <- colocboost_input$Y + dict_YX <- colocboost_input$dict_YX + sumstats <- colocboost_input$sumstat + dict_sumstatLD <- colocboost_input$dict_sumstatLD + LD_mat <- colocboost_input$LD + if (is.null(LD_mat)) LD_mat <- colocboost_input$X_ref ####### ========= streamline three types of analyses ======== ######## @@ -323,7 +587,7 @@ colocboost_analysis_pipeline <- function(region_data, outcome_names = traits, focal_outcome_idx = focal_outcome_idx, output_level = 2, ... ) - analysis_results$xqtl_coloc <- cb_res$result + analysis_results["xqtl_coloc"] <- list(cb_res$result) analysis_results$computing_time$Analysis$xqtl_coloc <- cb_res$time } # - run joint GWAS no focaled version of ColocBoost @@ -336,7 +600,7 @@ colocboost_analysis_pipeline <- function(region_data, dict_YX = dict_YX, dict_sumstatLD = dict_sumstatLD, outcome_names = traits, focal_outcome_idx = NULL, output_level = 2), ld_args, list(...))) - analysis_results$joint_gwas <- cb_res$result + analysis_results["joint_gwas"] <- list(cb_res$result) analysis_results$computing_time$Analysis$joint_gwas <- cb_res$time } # - run focaled version of ColocBoost for each GWAS @@ -344,9 +608,9 @@ colocboost_analysis_pipeline <- function(region_data, t31 <- Sys.time() res_gwas_separate <- analysis_results$separate_gwas for (i_gwas in 1:nrow(dict_sumstatLD)) { - current_study <- names(sumstats)[i_gwas] - message(paste("====== Performing focaled version GWAS-xQTL ColocBoost on", length(Y), "contexts and ", current_study, "GWAS. =====")) dict <- dict_sumstatLD[i_gwas, ] + current_study <- names(sumstats)[dict[1]] + message(paste("====== Performing focaled version GWAS-xQTL ColocBoost on", length(Y), "contexts and ", current_study, "GWAS. =====")) traits <- c(names(Y), current_study) ld_args_sep <- build_ld_args(LD_mat, subset = dict[2]) cb_res <- do.call(.run_colocboost, c( @@ -355,7 +619,7 @@ colocboost_analysis_pipeline <- function(region_data, dict_YX = dict_YX, outcome_names = traits, focal_outcome_idx = length(traits), output_level = 2), ld_args_sep, list(...))) - res_gwas_separate[[current_study]] <- cb_res$result + res_gwas_separate[current_study] <- list(cb_res$result) } t32 <- Sys.time() analysis_results$separate_gwas <- res_gwas_separate @@ -365,86 +629,22 @@ colocboost_analysis_pipeline <- function(region_data, return(analysis_results) } - -#' Validate a single summary statistics entry -#' -#' Checks whether a sumstat data frame (with columns z, n, variant) has -#' sufficient data for colocboost analysis. -#' -#' @param ss_df A data.frame with columns "z", "n", and "variant" as produced -#' by the sumstat processing block in \code{colocboost_analysis_pipeline}. -#' @param min_variants Minimum number of non-NA z-score variants required. -#' Default is 2. -#' @return TRUE if the entry is valid; FALSE otherwise. -#' @noRd -is_valid_sumstat_entry <- function(ss_df, min_variants = 2) { - if (is.null(ss_df) || !is.data.frame(ss_df)) return(FALSE) - if (nrow(ss_df) < min_variants) return(FALSE) - if (all(is.na(ss_df$z))) return(FALSE) - if (all(ss_df$n <= 0 | is.na(ss_df$n))) return(FALSE) - return(TRUE) -} - - -#' Filter summary statistics to retain only valid entries +#' Initial QC for the region data loaded from \code{load_regional_data} #' -#' Applies \code{is_valid_sumstat_entry} to each element of the sumstats list -#' and removes invalid entries. Also updates the LD match mapping and -#' dict_sumstatLD accordingly. +#' This compatibility wrapper converts loaded regional data to reusable individual +#' and RSS inputs, runs the shared QC helpers once, and returns the historical +#' post-QC structure consumed by \code{colocboost_pipeline()}. #' -#' @param sumstats Named list of summary statistic data frames. -#' @param LD_mat List of LD matrices. -#' @param LD_match Character vector mapping each sumstat to an LD matrix name. -#' @param min_variants Minimum variant count passed to -#' \code{is_valid_sumstat_entry}. -#' @return A list with filtered \code{sumstats}, \code{LD_mat}, \code{LD_match}, -#' and \code{dict_sumstatLD}, or NULL if no valid entries remain. -#' @noRd -filter_valid_sumstats <- function(sumstats, LD_mat, LD_match, min_variants = 2) { - valid_idx <- vapply(sumstats, is_valid_sumstat_entry, logical(1), - min_variants = min_variants) - if (!any(valid_idx)) { - message("No valid summary statistic studies remain after validation.") - return(NULL) - } - removed <- names(sumstats)[!valid_idx] - if (length(removed) > 0) { - message("Removed invalid sumstat studies: ", paste(removed, collapse = ", ")) - } - sumstats <- sumstats[valid_idx] - LD_match <- LD_match[valid_idx] - dict_sumstatLD <- cbind(seq_along(sumstats), match(LD_match, names(LD_mat))) - list(sumstats = sumstats, LD_mat = LD_mat, LD_match = LD_match, - dict_sumstatLD = dict_sumstatLD) -} - - -#' Initial QC for the region data loading from \code{load_regional_data} -#' -#' This function do the initial QC including: check PIP; check maf for individual_data; check QC and impute for sumstat_data -#' -#' @section Loading individual level data from multiple corhorts #' @param region_data A region data loaded from \code{load_regional_data}. -#' @param maf_cutoff A scalar to remove variants with maf < maf_cutoff, dafault is 0.005. -#' @param pip_cutoff_to_skip_ind A vector of cutoff values for skipping analysis based on PIP values for each context. Default is 0. -#' @param pip_cutoff_to_skip_sumstat A vector of cutoff values for skipping analysis based on PIP values for each sumstat Default is 0. -#' @param qc_method Quality control method to use. Options are "slalom" or "dentist" (default: "slalom"). -#' @param impute Logical; if TRUE, performs imputation for outliers identified in the analysis (default: TRUE). -#' @param impute_opts A list of imputation options including rcond, R2_threshold, and minimum_ld (default: list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5)). -#' -#' -#' @return A list containing the individual_data and sumstat_data after QC: -#' individual_data contains the following components if exist -#' \itemize{ -#' \item Y: A list of residualized phenotype values for all tasks. -#' \item X: A list of residualized genotype matrices all tasks. -#' } -#' sumstat_data contains the following components if exist -#' \itemize{ -#' \item sumstats: A list of summary statistics for the matched LD_info, each sublist contains sumstats, n, var_y from \code{load_rss_data}. -#' \item LD_info: A list of LD information, each sublist contains LD_variants, LD_matrix, ref_panel \code{load_LD_matrix}. -#' } -#' +#' @param maf_cutoff A scalar to remove variants with maf < maf_cutoff. +#' @param pip_cutoff_to_skip_ind A vector of cutoff values for skipping individual contexts. +#' @param pip_cutoff_to_skip_sumstat A vector of cutoff values for skipping summary-stat studies. +#' @param qc_method Quality control method to use. Options are "none", +#' "rss_qc", "slalom", or "dentist". \code{NULL} and \code{"rss_qc"} +#' are treated as \code{"none"} for basic-only summary-stat preprocessing. +#' @param impute Logical; if TRUE, performs imputation when required metadata are available. +#' @param impute_opts A list of imputation options. +#' @return A list containing post-QC \code{individual_data} and \code{sumstat_data}. #' @noRd qc_regional_data <- function(region_data, # - individual @@ -453,285 +653,815 @@ qc_regional_data <- function(region_data, # - sumstat keep_indel = TRUE, pip_cutoff_to_skip_sumstat = 0, - qc_method = c("slalom", "dentist", "none"), + qc_method = NULL, impute = TRUE, impute_opts = list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5, lamb = 0.01)) { - qc_method <- match.arg(qc_method) - - # Validate and recycle pip_cutoff_to_skip_ind: scalar -> named vector for all contexts - if (!is.null(region_data$individual_data)) { - ind_context_names <- names(region_data$individual_data$residual_Y) - n_ind_contexts <- length(ind_context_names) - if (length(pip_cutoff_to_skip_ind) == 1 && is.null(names(pip_cutoff_to_skip_ind))) { - pip_cutoff_to_skip_ind <- setNames(rep(pip_cutoff_to_skip_ind, n_ind_contexts), ind_context_names) - } else if (!is.null(names(pip_cutoff_to_skip_ind))) { - # Named vector: fill missing contexts with 0 - missing <- setdiff(ind_context_names, names(pip_cutoff_to_skip_ind)) - if (length(missing) > 0) { - pip_cutoff_to_skip_ind <- c(pip_cutoff_to_skip_ind, setNames(rep(0, length(missing)), missing)) + qc_method <- .resolve_summary_qc_method(qc_method) + qced_individual_to_region_data <- function(ind_qc) { + if (is.null(ind_qc) || length(ind_qc) == 0) return(NULL) + list( + X = lapply(ind_qc, `[[`, "X"), + Y = lapply(ind_qc, `[[`, "Y") + ) + } + qced_sumstat_to_region_data <- function(sumstat_qc) { + if (is.null(sumstat_qc) || length(sumstat_qc) == 0) return(NULL) + if (!is.null(sumstat_qc$rss_input) && !is.null(sumstat_qc$LD_matrix)) { + sumstat_qc <- list(study1 = sumstat_qc) + } + sumstats <- lapply(sumstat_qc, `[[`, "rss_input") + LD_mat <- list() + LD_match <- character() + ld_variant_index <- list() + for (study in names(sumstat_qc)) { + ld <- sumstat_qc[[study]]$LD_matrix + variant_key <- paste(colnames(ld), collapse = ",") + if (variant_key %in% names(ld_variant_index)) { + LD_match <- c(LD_match, ld_variant_index[[variant_key]]) + } else { + LD_mat[[study]] <- ld + ld_variant_index[[variant_key]] <- study + LD_match <- c(LD_match, study) } - } else if (length(pip_cutoff_to_skip_ind) == n_ind_contexts) { - names(pip_cutoff_to_skip_ind) <- ind_context_names - } else { - stop("pip_cutoff_to_skip_ind must be a scalar, a named vector, or a vector with length equal to the number of individual contexts (", n_ind_contexts, ").") } + list(sumstats = sumstats, LD_mat = LD_mat, LD_match = LD_match) } - # Validate pip_cutoff_to_skip_sumstat: scalar -> named vector for all studies - if (!is.null(region_data$sumstat_data)) { - all_study_names <- unlist(lapply(region_data$sumstat_data$sumstats, names)) - if (length(pip_cutoff_to_skip_sumstat) == 1 && is.null(names(pip_cutoff_to_skip_sumstat))) { - pip_cutoff_to_skip_sumstat <- setNames(rep(pip_cutoff_to_skip_sumstat, length(all_study_names)), all_study_names) - } else if (!is.null(names(pip_cutoff_to_skip_sumstat))) { - # Named vector: fill missing studies with 0 - missing <- setdiff(all_study_names, names(pip_cutoff_to_skip_sumstat)) - if (length(missing) > 0) { - pip_cutoff_to_skip_sumstat <- c(pip_cutoff_to_skip_sumstat, setNames(rep(0, length(missing)), missing)) - } - } + individual_data <- NULL + ind_input <- region_data_to_ind_input(region_data) + if (isTRUE(ind_input$source_info$has_individual)) { + ind_qc <- qc_individual_data( + X = ind_input$X, + Y = ind_input$Y, + maf = ind_input$maf, + X_variance = ind_input$X_variance, + maf_cutoff = maf_cutoff, + pip_cutoff_to_skip = pip_cutoff_to_skip_ind + ) + individual_data <- qced_individual_to_region_data(ind_qc) } - #### related internal functions - # Add context names to colname of Y if missing - add_context_to_Y <- function(res_Y) { - res <- lapply(seq_along(res_Y), function(iy) { - y <- res_Y[[iy]] - if (is.null(y)) { - return(NULL) - } - if (is.null(colnames(y))) { - colnames(y) <- names(res_Y)[iy] - } else { - colnames(y) <- paste0(names(res_Y)[iy], "_", colnames(y)) - } - return(y) - }) - names(res) <- names(res_Y) - return(res) - } - - # Initial PIP check for individual level data - filter_resY_pip <- function(res_X, res_Y, pip_cutoff = 0, context = NULL) { - # Initial PIP check - if (pip_cutoff != 0) { - if (pip_cutoff < 0) { - # automatically determine the cutoff to use - pip_cutoff <- 3 * 1 / ncol(res_X) + sumstat_data <- NULL + rss_input <- region_data_to_rss_input(region_data) + if (isTRUE(rss_input$source_info$has_sumstat)) { + sumstat_qc <- summary_stats_qc( + rss_input = rss_input$rss_input, + LD_data = rss_input$LD_data, + keep_indel = keep_indel, + pip_cutoff_to_skip = pip_cutoff_to_skip_sumstat, + qc_method = qc_method, + impute = impute, + impute_opts = impute_opts + ) + sumstat_data <- qced_sumstat_to_region_data(sumstat_qc) + } + + list(individual_data = individual_data, sumstat_data = sumstat_data) +} + +#' Run reusable individual-level QC +#' +#' @param X Genotype matrix or named list of genotype matrices. +#' @param Y Phenotype vector/matrix or named list of phenotype matrices. +#' @param maf Optional MAF vector or named list. +#' @param X_variance Optional variant variance vector or named list. +#' @param missing_rate_thresh Maximum missing genotype rate. +#' @param maf_cutoff Minimum MAF cutoff. +#' @param xvar_cutoff Minimum genotype variance cutoff. +#' @param ld_reference_meta_file Optional LD reference metadata file. +#' @param keep_indel Whether indel variants are kept during LD-reference +#' filtering. +#' @param pip_cutoff_to_skip Optional single-effect PIP cutoff. +#' @return A named list of cleaned context-level \code{X}/\code{Y} records, or +#' one cleaned record for matrix inputs. +#' @export +qc_individual_data <- function(X, Y, maf = NULL, X_variance = NULL, + missing_rate_thresh = NULL, + maf_cutoff = 0.0005, + xvar_cutoff = 0, + ld_reference_meta_file = NULL, + keep_indel = TRUE, + pip_cutoff_to_skip = 0, + context = NULL) { + qc_one <- function(X, Y, maf = NULL, X_variance = NULL, context = NULL, + pip_cutoff_to_skip = 0) { + if (is.null(X) || is.null(Y)) return(NULL) + if (is.null(colnames(X))) stop("X must have variant colnames for individual-level QC.") + if (is.vector(Y)) Y <- matrix(Y, ncol = 1) + if (is.null(colnames(Y))) colnames(Y) <- .cb_default(context, paste0("outcome", seq_len(ncol(Y)))) + if (!is.null(context)) colnames(Y) <- paste0(context, "_", colnames(Y)) + + message("QC track: starting individual-level QC for ", .cb_default(context, "individual data"), ".") + original_variants <- colnames(X) + if (!is.null(names(maf)) || length(maf) == length(original_variants)) { + if (is.null(names(maf))) names(maf) <- original_variants + } + if (!is.null(names(X_variance)) || length(X_variance) == length(original_variants)) { + if (is.null(names(X_variance))) names(X_variance) <- original_variants + } + if (!is.null(ld_reference_meta_file)) { + reference_filter <- filter_variants_by_ld_reference(original_variants, ld_reference_meta_file, + keep_indel = keep_indel) + X <- X[, reference_filter$data, drop = FALSE] + if (!is.null(names(maf))) maf <- maf[colnames(X)] + if (!is.null(names(X_variance))) X_variance <- X_variance[colnames(X)] + } + X <- filter_X(X, missing_rate_thresh = missing_rate_thresh, + maf_thresh = maf_cutoff, var_thresh = xvar_cutoff, + maf = maf, X_variance = X_variance) + if (!is.null(names(maf))) maf <- maf[colnames(X)] + if (!is.null(names(X_variance))) X_variance <- X_variance[colnames(X)] + + if (!is.null(pip_cutoff_to_skip) && pip_cutoff_to_skip != 0) { + cutoff <- pip_cutoff_to_skip + if (cutoff < 0) cutoff <- 3 / ncol(X) + keep_y <- logical(ncol(Y)) + for (j in seq_len(ncol(Y))) { + observed <- !is.na(Y[, j]) + if (sum(observed) < 2) next + pip <- susieR::susie(X[observed, , drop = FALSE], Y[observed, j], + L = 1, max_iter = 100)$pip + keep_y[j] <- any(pip > cutoff) } - top_model_pip <- lapply(1:ncol(res_Y), function(i) susieR::susie(res_X, res_Y[, i], L = 1)$pip) - check_model_pip <- sapply(top_model_pip, function(pip) any(pip > pip_cutoff)) - include_idx <- which(check_model_pip) - if (length(include_idx) == 0) { - message(paste( - "Skipping follow-up analysis for individual-context", context, - ". No signals above PIP threshold", pip_cutoff, "in initial model screening." - )) + if (!any(keep_y)) { + message("QC track: skipping individual context ", context, + ". No outcomes passed PIP threshold ", cutoff, ".") return(NULL) - } else if (length(include_idx) == ncol(res_Y)) { - message(paste("Keep all individual-phenotypes in context", context, ".")) - } else { - exclude_idx <- setdiff(1:ncol(res_Y), include_idx) - exclude_pheno <- paste(colnames(res_Y)[exclude_idx], collapse = ";") - message(paste( - "Skipping follow-up analysis for individual-phenotypes", exclude_pheno, "in context", context, - ". No signals above PIP threshold", pip_cutoff, "in initial model screening." - )) - res_Y <- res_Y[, include_idx, drop = FALSE] %>% .[, order(colnames(.)), drop = FALSE] } + Y <- Y[, keep_y, drop = FALSE] + } + message("QC track: retained ", ncol(X), " variants and ", ncol(Y), + " outcome(s) for individual context ", .cb_default(context, "input"), ".") + list(X = X, Y = Y, maf = maf, X_variance = X_variance) + } + + if (is.list(X) && !is.matrix(X) && !is.data.frame(X)) { + X <- .cb_as_named_list(X, "individual") + Y <- .cb_as_named_list(Y, "individual") + contexts <- intersect(names(X), names(Y)) + if (length(contexts) == 0) stop("No matched X/Y contexts for individual-level QC.") + cutoffs <- .cb_named_cutoff(pip_cutoff_to_skip, contexts, "pip_cutoff_to_skip_ind") + out <- list() + for (context in contexts) { + res <- qc_one( + X[[context]], Y[[context]], + maf = .cb_list_value(maf, context), + X_variance = .cb_list_value(X_variance, context), + context = context, + pip_cutoff_to_skip = cutoffs[[context]] + ) + if (!is.null(res)) out[[context]] <- res } - return(res_Y) - } - - # Initial check for all contexts with individual-level data - data_initial_screen_individual <- function(X, - Y, - MAF, - maf_cutoff = 0.0005, - pip_cutoff_to_skip_ind = 0) { - # - add context to colname of Y - Y <- add_context_to_Y(Y) - results <- purrr::imap(X, function(resX, context) { - resY <- Y[[context]] - maf <- MAF[[context]] - pip_cutoff <- if (context %in% names(pip_cutoff_to_skip_ind)) { - pip_cutoff_to_skip_ind[[context]] + return(out) + } + qc_one( + X = X, Y = Y, maf = maf, X_variance = X_variance, + pip_cutoff_to_skip = pip_cutoff_to_skip, + context = context + ) +} + + +##### Generic ColocBoost helper functions ##### + +.cb_default <- function(x, y) if (is.null(x)) y else x + +.cb_merge_args <- function(x, y) { + for (nm in names(y)) { + if (!is.null(y[[nm]])) x[[nm]] <- y[[nm]] + } + x +} + +.cb_as_named_list <- function(x, default_name) { + if (is.null(x)) return(NULL) + if (is.list(x) && !is.matrix(x) && !is.data.frame(x)) return(x) + stats::setNames(list(x), default_name) +} + +.cb_list_value <- function(x, name, default = NULL) { + if (is.null(x)) return(default) + if (is.list(x) && !is.matrix(x) && !is.data.frame(x)) { + if (name %in% names(x)) return(x[[name]]) + return(default) + } + x +} + +.cb_named_cutoff <- function(x, names_to_fill, arg_name) { + if (length(x) == 1 && is.null(names(x))) { + return(stats::setNames(rep(x, length(names_to_fill)), names_to_fill)) + } + if (!is.null(names(x))) { + missing <- setdiff(names_to_fill, names(x)) + if (length(missing) > 0) x <- c(x, stats::setNames(rep(0, length(missing)), missing)) + return(x[names_to_fill]) + } + if (length(x) == length(names_to_fill)) { + return(stats::setNames(x, names_to_fill)) + } + stop(arg_name, " must be a scalar, named vector, or match the number of inputs.") +} + +.cb_y_ncol <- function(y) { + if (is.null(y)) return(0L) + if (is.null(dim(y))) return(as.integer(length(y) > 0)) + ncol(y) +} + +##### Outcome helper functions ##### + +.cb_colocboost_outcome_names <- function(args, prefer_supplied = TRUE) { + if (isTRUE(prefer_supplied) && !is.null(args$outcome_names)) { + return(as.character(args$outcome_names)) + } + Y <- args$Y + y_outcomes <- character() + if (!is.null(Y)) { + if (is.data.frame(Y)) Y <- as.matrix(Y) + if (is.matrix(Y)) { + y_outcomes <- colnames(Y) + if (is.null(y_outcomes)) y_outcomes <- paste0("Y", seq_len(ncol(Y))) + } else if (is.atomic(Y) && !is.list(Y)) { + y_outcomes <- "Y1" + } else { + y_names <- names(Y) + y_cols <- unlist(Map(function(y, nm) { + if (is.null(y)) return(character()) + if (is.data.frame(y)) y <- as.matrix(y) + if (!is.null(dim(y))) { + cn <- colnames(y) + if (is.null(cn) || any(is.na(cn) | cn == "")) { + if (!is.null(nm) && !is.na(nm) && nm != "" && ncol(y) == 1) return(nm) + return(paste0(.cb_default(nm, "Y"), "_", seq_len(ncol(y)))) + } + return(as.character(cn)) + } + if (!is.null(nm) && !is.na(nm) && nm != "") return(nm) + "Y" + }, Y, .cb_default(y_names, rep("", length(Y)))), use.names = FALSE) + if (length(y_cols) != length(Y) || anyDuplicated(y_cols) == 0) { + y_outcomes <- y_cols } else { - 0 + if (is.null(y_names) || any(is.na(y_names) | y_names == "")) { + y_names <- paste0("Y", seq_along(Y)) + } + y_outcomes <- y_names } - if (is.null(resY)) return(NULL) - resX <- filter_X(resX, missing_rate_thresh = NULL, maf_thresh = maf_cutoff, maf = maf) - resY <- filter_resY_pip(resX, resY, pip_cutoff = pip_cutoff, context = context) - if (is.null(resY)) return(NULL) - list(X = resX, Y = resY) - }) %>% purrr::compact() - - if (length(results) == 0) { - message("Skipping follow-up analysis for all contexts.") - return(NULL) } - keep_contexts <- names(results) - message(paste("Region includes the following contexts after inital screening:", paste(keep_contexts, collapse = ";"), ".")) - list( - X = purrr::map(results, "X"), - Y = purrr::map(results, "Y") - ) } - # - individual level data QC - individual_data <- region_data$individual_data - if (!is.null(individual_data)) { - X <- individual_data$residual_X - Y <- individual_data$residual_Y - MAF <- individual_data$maf - # 1. remove maf < maf_cutoff - # 2. initial check PIP - individual_data <- data_initial_screen_individual( - X = X, Y = Y, MAF = MAF, - maf_cutoff = maf_cutoff, - pip_cutoff_to_skip_ind = pip_cutoff_to_skip_ind + sumstat <- args$sumstat + effect_est <- args$effect_est + sumstat_outcomes <- character() + if (!is.null(sumstat)) { + if (is.data.frame(sumstat)) { + sumstat_outcomes <- "sumstat1" + } else { + ss_names <- names(sumstat) + if (is.null(ss_names) || any(is.na(ss_names) | ss_names == "")) { + ss_names <- paste0("sumstat", seq_along(sumstat)) + } + sumstat_outcomes <- as.character(ss_names) + } + } else if (!is.null(effect_est)) { + effect_est <- as.matrix(effect_est) + effect_names <- colnames(effect_est) + if (is.null(effect_names)) effect_names <- paste0("sumstat", seq_len(ncol(effect_est))) + sumstat_outcomes <- as.character(effect_names) + } + c(as.character(y_outcomes), sumstat_outcomes) +} + +.cb_resolve_qc_outcome_names <- function(pre_qc_data_outcomes, + pre_qc_display_outcomes, + post_qc_data_outcomes) { + if (length(post_qc_data_outcomes) == 0) return(character()) + if (length(pre_qc_data_outcomes) == length(pre_qc_display_outcomes) && + length(pre_qc_data_outcomes) > 0 && + !anyDuplicated(pre_qc_data_outcomes)) { + idx <- match(post_qc_data_outcomes, pre_qc_data_outcomes) + if (all(!is.na(idx))) { + return(pre_qc_display_outcomes[idx]) + } + } + post_qc_data_outcomes +} + +.cb_remap_focal_outcome_idx <- function(focal_outcome_idx, + pre_qc_data_outcomes, + pre_qc_display_outcomes, + post_qc_data_outcomes, + post_qc_display_outcomes) { + if (is.null(focal_outcome_idx)) return(NULL) + if (length(focal_outcome_idx) != 1 || is.na(focal_outcome_idx)) { + warning("focal_outcome_idx must be a single non-missing index. Passing it through unchanged.") + return(focal_outcome_idx) + } + focal_outcome_idx <- as.integer(focal_outcome_idx) + if (length(post_qc_data_outcomes) == 0) return(NULL) + if (focal_outcome_idx < 1 || + focal_outcome_idx > max(length(pre_qc_display_outcomes), length(pre_qc_data_outcomes))) { + warning("focal_outcome_idx is outside the pre-QC outcome range. Passing it through unchanged.") + return(focal_outcome_idx) + } + + focal_display <- if (focal_outcome_idx <= length(pre_qc_display_outcomes)) { + pre_qc_display_outcomes[[focal_outcome_idx]] + } else { + NULL + } + focal_data <- if (focal_outcome_idx <= length(pre_qc_data_outcomes)) { + pre_qc_data_outcomes[[focal_outcome_idx]] + } else { + NULL + } + + candidates <- integer() + for (needle in unique(c(focal_display, focal_data))) { + if (is.null(needle) || is.na(needle) || needle == "") next + candidates <- c( + candidates, + which(post_qc_display_outcomes == needle), + which(post_qc_data_outcomes == needle) ) + suffix_pattern <- paste0("_", needle) + candidates <- c( + candidates, + which(endsWith(post_qc_display_outcomes, suffix_pattern)), + which(endsWith(post_qc_data_outcomes, suffix_pattern)) + ) + } + candidates <- unique(candidates) + if (length(candidates) > 0) { + return(candidates[[1]]) + } + if (length(pre_qc_data_outcomes) == length(post_qc_data_outcomes) || + length(pre_qc_display_outcomes) == length(post_qc_display_outcomes)) { + return(focal_outcome_idx) } + focal_label <- .cb_default(focal_display, .cb_default(focal_data, focal_outcome_idx)) + warning("focal_outcome_idx refers to outcome ", focal_label, + ", which is not present after QC. Setting focal_outcome_idx to NULL.") + NULL +} - # sumstat_data QC, imputation using raiss - # return A list of sumstat_data after initial checking and QC: - # \itemize{ - # \item sumstats: A list of summary statistics and ready to do analysis. - # \item LD_mat: A list of LD matrix and ready to do analysis. - # \item LD_match: A vector of strings to indicating sumstats and LD matching (save space since multiple sumstats may link to the same LD matrix). - # } - summary_stats_qc_multitask <- function(sumstat_data, - keep_indel = TRUE, - pip_cutoff_to_skip_sumstat = 0, - qc_method = c("slalom", "dentist", "none"), - impute = TRUE, - impute_opts = list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5, lamb = 0.01)) { - n_LD <- length(sumstat_data$LD_info) - # Collect results into lists and flatten at the end - collected_sumstats <- list() - collected_LD <- list() - collected_LD_match <- character() - # Track LD matrices by variant signature for O(1) deduplication - ld_variant_index <- list() +##### Individual-level ColocBoost helper functions ##### - for (i in 1:n_LD) { - LD_data <- sumstat_data$LD_info[[i]] - sumstats <- sumstat_data$sumstats[[i]] - has_genotype <- isTRUE(LD_data$is_genotype) - - # When source is genotype X, derive R only where needed (QC, imputation). - # Keep X as primary for colocboost X_ref. - LD_data_for_qc <- LD_data - if (has_genotype) { - LD_data_for_qc$LD_matrix <- compute_LD(LD_data$LD_matrix, method = "sample") - LD_data_for_qc$is_genotype <- FALSE +.cb_individual_qc_input_from_colocboost <- function(X, Y, dict_YX = NULL) { + bind_y_for_qc <- function(Y_list, y_names = NULL) { + if (is.null(y_names)) y_names <- names(Y_list) + if (is.null(y_names)) y_names <- rep("", length(Y_list)) + mats <- Map(function(y, nm) { + if (is.null(dim(y))) y <- matrix(y, ncol = 1) + if (is.null(colnames(y))) { + colnames(y) <- if (!is.null(nm) && !is.na(nm) && nm != "") nm else paste0("Y", seq_len(ncol(y))) } + y + }, Y_list, y_names) + do.call(cbind, mats) + } - # Pre-compute LD partition once per block (shared across all GWAS studies) - if (impute) { - LD_matrix_partitioned <- partition_LD_matrix(LD_data_for_qc) + X_list <- .cb_as_named_list(X, "X1") + Y_list <- .cb_as_named_list(Y, "Y1") + matched <- intersect(names(X_list), names(Y_list)) + if (length(matched) > 0 && is.null(dict_YX)) { + return(list(X = X_list[matched], Y = Y_list[matched])) + } + + if (!is.null(dict_YX)) { + dict <- as.matrix(dict_YX) + if (ncol(dict) < 2) stop("dict_YX must have at least two columns.") + X_qc <- list() + Y_qc <- list() + for (x_idx in unique(dict[, 2])) { + if (is.na(x_idx) || x_idx < 1 || x_idx > length(X_list)) next + y_idx <- dict[dict[, 2] == x_idx, 1] + y_idx <- y_idx[!is.na(y_idx) & y_idx >= 1 & y_idx <= length(Y_list)] + if (length(y_idx) == 0) next + context <- names(X_list)[x_idx] + if (is.null(context) || is.na(context) || context == "") context <- paste0("X", x_idx) + X_qc[[context]] <- X_list[[x_idx]] + Y_qc[[context]] <- bind_y_for_qc(Y_list[y_idx], names(Y_list)[y_idx]) + } + if (length(X_qc) > 0) return(list(X = X_qc, Y = Y_qc)) + } + + if (length(X_list) == 1 && length(Y_list) > 0) { + context <- names(X_list)[1] + if (is.null(context) || is.na(context) || context == "") context <- "X1" + return(list( + X = stats::setNames(list(X_list[[1]]), context), + Y = stats::setNames(list(bind_y_for_qc(Y_list, names(Y_list))), context) + )) + } + + if (length(X_list) == length(Y_list)) { + if (is.null(names(X_list))) names(X_list) <- paste0("X", seq_along(X_list)) + names(Y_list) <- names(X_list) + return(list(X = X_list, Y = Y_list)) + } + + list(X = X_list, Y = Y_list) +} + +.cb_format_individual <- function(ind) { + if (length(ind) == 0) return(list()) + if (!is.null(ind$X) && !is.null(ind$Y)) { + ind <- list(individual = ind) + } + ind <- Filter(function(x) { + !is.null(x$X) && .cb_y_ncol(x$Y) > 0 + }, ind) + if (length(ind) == 0) return(list()) + X <- lapply(ind, `[[`, "X") + Y <- lapply(ind, `[[`, "Y") + unique_X <- list() + X_match <- integer(length(X)) + for (i in seq_along(X)) { + matched <- names(unique_X)[vapply(unique_X, identical, logical(1), X[[i]])] + if (length(matched) > 0) { + X_match[[i]] <- match(matched[[1]], names(unique_X)) + } else { + unique_X[[names(ind)[i]]] <- X[[i]] + X_match[[i]] <- length(unique_X) + } + } + Y_split <- list() + dict <- matrix(nrow = 0, ncol = 2) + y_index <- 0L + all_outcome_names <- unlist(lapply(Y, colnames), use.names = FALSE) + duplicated_outcome_names <- unique(all_outcome_names[ + duplicated(all_outcome_names) | duplicated(all_outcome_names, fromLast = TRUE) + ]) + for (i in seq_along(Y)) { + y <- Y[[i]] + if (is.null(dim(y))) y <- matrix(y, ncol = 1) + for (j in seq_len(ncol(y))) { + y_index <- y_index + 1L + outcome <- colnames(y)[j] + if (is.null(outcome) || is.na(outcome) || outcome == "") { + outcome <- paste0("outcome", y_index) + } + context <- names(ind)[i] + if (outcome %in% duplicated_outcome_names && + !is.null(context) && !is.na(context) && context != "") { + outcome <- paste0(context, "_", outcome) } + if (outcome %in% names(Y_split)) { + outcome <- make.unique(c(names(Y_split), outcome))[length(Y_split) + 1] + } + Y_split[[outcome]] <- y[, j, drop = FALSE] + dict <- rbind(dict, c(y_index, X_match[[i]])) + } + } + colnames(dict) <- c("Y", "X") + list(X = unique_X, Y = Y_split, dict_YX = dict, outcome_names = names(Y_split)) +} - for (ii in seq_along(sumstats)) { - sumstat <- sumstats[[ii]] - if (nrow(sumstat$sumstats) == 0) next - n <- sumstat$n - conditions_sumstat <- names(sumstats)[ii] - pip_cutoff_to_skip_ld <- if (conditions_sumstat %in% names(pip_cutoff_to_skip_sumstat)) { - as.numeric(pip_cutoff_to_skip_sumstat[conditions_sumstat]) - } else { - 0 - } +##### Summary-statistic ColocBoost helper functions ##### - # Preprocess: allele QC + variant subsetting (needs R for [variants, variants] indexing) - preprocess_results <- rss_basic_qc(sumstat$sumstats, LD_data_for_qc, keep_indel = keep_indel) - sumstat$sumstats <- preprocess_results$sumstats - R_mat <- preprocess_results$LD_mat - - # Initial PIP checking (uses X when available, R otherwise) - if (pip_cutoff_to_skip_ld != 0) { - pip_vars <- sumstat$sumstats$variant_id - if (has_genotype) { - pip <- susie_rss(z = sumstat$sumstats$z, - X = LD_data$LD_matrix[, pip_vars, drop = FALSE], - L = 1, L_greedy = NULL, max_iter = 1, n = n)$pip - } else { - pip <- susie_rss(z = sumstat$sumstats$z, R = R_mat, L = 1, L_greedy = NULL, max_iter = 1, n = n)$pip - } - if (pip_cutoff_to_skip_ld < 0) { - pip_cutoff_to_skip_ld <- 3 * 1 / length(pip_vars) - } - if (!any(pip > pip_cutoff_to_skip_ld)) { - message(paste( - "Skipping follow-up analysis for sumstat study", conditions_sumstat, - ". No signals above PIP threshold", pip_cutoff_to_skip_ld, "in initial model screening." - )) - next - } else { - message(paste("Keep summary study", conditions_sumstat, ".")) - } - } +.cb_sumstat_qc_input_from_colocboost <- function(sumstat, LD, X_ref, dict_sumstatLD, + LD_reference_info = NULL, + variant_convention = c("A2_A1", "A1_A2")) { + is_ld_data <- function(x) { + is.list(x) && !is.null(x$LD_matrix) + } + as_reference_info_list <- function(x) { + if (is.null(x)) return(NULL) + if (is_ld_data(x) || is.data.frame(x) || is.character(x)) { + return(list(LD = x)) + } + if (is.list(x) && length(x) > 0) return(x) + stop("LD_reference_info must be a .bim/.pvar path, data.frame, load_LD_matrix() result, or a list of these.") + } + reference_info_for_index <- function(reference_info, index) { + if (is.null(reference_info)) return(NULL) + if (length(reference_info) == 1) return(reference_info[[1]]) + reference_info[[min(index, length(reference_info))]] + } + validate_sumstat_for_qc <- function(sumstat, study) { + if (is.null(sumstat) || !is.data.frame(sumstat)) { + return(paste0("Summary-statistic QC for study ", study, + " requires each sumstat input to be a data.frame.")) + } + missing_cols <- setdiff(c("variant", "z"), colnames(sumstat)) + if (length(missing_cols) > 0) { + return(paste0("Summary-statistic QC for study ", study, + " requires sumstat columns: ", paste(missing_cols, collapse = ", "), ".")) + } + NULL + } + ld_variant_names <- function(ld) { + if (!is.matrix(ld)) return(NULL) + if (nrow(ld) == ncol(ld)) rownames(ld) else colnames(ld) + } - # Quality control - remove outlier variants (needs R) - if (!is.null(qc_method) && qc_method != "none") { - qc_results <- summary_stats_qc(sumstat$sumstats, LD_data_for_qc, n = n, method = qc_method) - sumstat$sumstats <- qc_results$sumstats - R_mat <- qc_results$LD_mat - } - # Imputation (needs R via partitioned LD) - if (impute) { - impute_results <- raiss(LD_data_for_qc$ref_panel, sumstat$sumstats, LD_matrix_partitioned, - rcond = impute_opts$rcond, - R2_threshold = impute_opts$R2_threshold, minimum_ld = impute_opts$minimum_ld, lamb = impute_opts$lamb - ) - sumstat$sumstats <- impute_results$result_filter - R_mat <- impute_results$LD_mat - } + variant_convention <- match.arg(variant_convention) + sumstat <- .cb_as_named_list(sumstat, "sumstat") + using_x_ref <- is.null(LD) && !is.null(X_ref) + ld_source <- .cb_as_named_list(if (!is.null(LD)) LD else X_ref, "LD") + reference_info <- as_reference_info_list(LD_reference_info) + if (is.null(dict_sumstatLD)) { + dict_sumstatLD <- cbind(seq_along(sumstat), pmin(seq_along(sumstat), length(ld_source))) + } + rss_input <- list() + LD_data <- list() + skip_reasons <- character() + for (i in seq_along(sumstat)) { + study <- names(sumstat)[i] + ld_index <- dict_sumstatLD[i, 2] + validation_reason <- validate_sumstat_for_qc(sumstat[[i]], study) + if (!is.null(validation_reason)) { + skip_reasons <- c(skip_reasons, validation_reason) + next + } + ld_mat <- ld_source[[ld_index]] + ref_info <- reference_info_for_index(reference_info, ld_index) + if (is.null(ref_info)) { + message("QC track: checking LD/X_ref variant names for summary-stat study ", study, ".") + ref_ids <- ld_variant_names(ld_mat) + if (is.null(ref_ids)) { + skip_reasons <- c( + skip_reasons, + paste0("Summary-statistic QC for study ", study, + " requires LD row/column names or X_ref column names parseable as genomic variant IDs; ", + "provide LD_reference_info for QC. Skipping summary-statistic QC for this study.") + ) + next + } + ld_data <- .cb_make_ld_data( + ld_mat, + is_genotype = using_x_ref, + variant_convention = variant_convention + ) + if (is.null(ld_data)) { + skip_reasons <- c( + skip_reasons, + paste0("Summary-statistic QC for study ", study, + " could not parse LD/X_ref names as genomic variant IDs; ", + "provide LD_reference_info for QC. Skipping summary-statistic QC for this study.") + ) + next + } + message("QC track: LD/X_ref names are parseable for summary-stat study ", study, ".") + } else if (is_ld_data(ref_info)) { + message("QC track: using supplied LD_reference_info LD data for summary-stat study ", study, ".") + ld_data <- .normalize_ld_data_for_qc(ref_info) + } else { + message("QC track: using supplied LD_reference_info variant metadata for summary-stat study ", study, ".") + ld_data <- .cb_make_ld_data( + ld_mat, + is_genotype = using_x_ref, + reference_info = ref_info, + variant_convention = variant_convention + ) + } + parsed <- tryCatch( + .cb_parse_variants_for_qc(sumstat[[i]]$variant, variant_convention), + error = function(e) { + skip_reasons <<- c( + skip_reasons, + paste0("Summary-statistic QC for study ", study, + " could not parse sumstat$variant as genomic variant IDs: ", + conditionMessage(e), " Skipping summary-statistic QC for this study.") + ) + NULL + } + ) + if (is.null(parsed)) next + variant_id <- format_variant_id(parsed$chrom, parsed$pos, parsed$A2, parsed$A1) + ss <- data.frame(parsed, z = sumstat[[i]]$z, + variant_id = variant_id, + stringsAsFactors = FALSE) + n <- if ("n" %in% colnames(sumstat[[i]])) unique(sumstat[[i]]$n)[1] else NULL + var_y <- if ("var_y" %in% colnames(sumstat[[i]])) unique(sumstat[[i]]$var_y)[1] else 1 + rss_input[[study]] <- list(sumstats = ss, n = n, var_y = var_y) + LD_data[[study]] <- ld_data + } + list(rss_input = rss_input, LD_data = LD_data, skip_reasons = skip_reasons) +} - # Store: X subset if genotype source, R otherwise - final_vars <- sumstat$sumstats$variant_id - if (has_genotype) { - missing <- setdiff(final_vars, colnames(LD_data$LD_matrix)) - if (length(missing) > 0) { - stop("BUG: ", length(missing), " QC'd variants not found in genotype matrix X. ", - "First few: ", paste(head(missing, 3), collapse = ", ")) - } - mat_to_store <- LD_data$LD_matrix[, final_vars, drop = FALSE] +.cb_format_sumstat <- function(sumstat_qc) { + valid_sumstat_entry <- function(ss_df, min_variants = 2) { + if (is.null(ss_df) || !is.data.frame(ss_df)) return(FALSE) + if (nrow(ss_df) < min_variants) return(FALSE) + if (all(is.na(ss_df$z))) return(FALSE) + if (all(ss_df$n <= 0 | is.na(ss_df$n))) return(FALSE) + TRUE + } + filter_valid_sumstats <- function(sumstats, LD_mat, min_variants = 2) { + dedupe_ld <- function(LD_mat, studies) { + unique_ld <- list() + LD_match <- character() + for (study in studies) { + ld <- LD_mat[[study]] + matched <- names(unique_ld)[vapply(unique_ld, identical, logical(1), ld)] + if (length(matched) > 0) { + LD_match <- c(LD_match, matched[[1]]) } else { - mat_to_store <- R_mat + unique_ld[[study]] <- ld + LD_match <- c(LD_match, study) } + } + list(LD_mat = unique_ld, LD_match = LD_match) + } + + valid_idx <- vapply(sumstats, valid_sumstat_entry, logical(1), + min_variants = min_variants) + if (!any(valid_idx)) { + message("No valid summary statistic studies remain after validation.") + return(NULL) + } + removed <- names(sumstats)[!valid_idx] + if (length(removed) > 0) { + message("Removed invalid sumstat studies: ", paste(removed, collapse = ", ")) + } + sumstats <- sumstats[valid_idx] + LD_mat <- LD_mat[valid_idx] + deduped <- dedupe_ld(LD_mat, names(sumstats)) + LD_mat <- deduped$LD_mat + LD_match <- deduped$LD_match + dict_sumstatLD <- cbind(seq_along(sumstats), match(LD_match, names(LD_mat))) + list(sumstats = sumstats, LD_mat = LD_mat, LD_match = LD_match, + dict_sumstatLD = dict_sumstatLD) + } + if (length(sumstat_qc) == 0) return(list()) + if (!is.null(sumstat_qc$rss_input) && !is.null(sumstat_qc$LD_matrix)) { + sumstat_qc <- list(sumstat = sumstat_qc) + } + sumstat <- lapply(sumstat_qc, function(x) { + ss <- x$rss_input$sumstats + variant_id <- if ("variant_id" %in% colnames(ss)) { + ss$variant_id + } else { + format_variant_id(ss$chrom, ss$pos, ss$A2, ss$A1) + } + data.frame(z = ss$z, n = x$rss_input$n, + variant = normalize_variant_id(variant_id), + stringsAsFactors = FALSE) + }) + LD_mat <- lapply(sumstat_qc, `[[`, "LD_matrix") + filtered <- filter_valid_sumstats(sumstat, LD_mat) + if (is.null(filtered)) return(list()) + c( + list( + sumstat = filtered$sumstats, + dict_sumstatLD = filtered$dict_sumstatLD + ), + build_ld_args(filtered$LD_mat) + ) +} - # Collect sumstat - collected_sumstats[[conditions_sumstat]] <- sumstat - # Deduplicate LD using variant signature hash - variant_key <- paste(colnames(mat_to_store), collapse = ",") - if (variant_key %in% names(ld_variant_index)) { - collected_LD_match <- c(collected_LD_match, ld_variant_index[[variant_key]]) - } else { - collected_LD[[conditions_sumstat]] <- mat_to_store - ld_variant_index[[variant_key]] <- conditions_sumstat - collected_LD_match <- c(collected_LD_match, conditions_sumstat) - } +##### LD/reference QC helper functions ##### + +.cb_make_ld_data <- function(ld, is_genotype = NULL, + reference_info = NULL, + variant_convention = c("A2_A1", "A1_A2")) { + reference_info_to_ref_panel <- function(reference_info) { + if (is.character(reference_info) && length(reference_info) == 1) { + if (!file.exists(reference_info)) { + stop("LD_reference_info file does not exist: ", reference_info) } + reference_info <- read_variant_metadata(reference_info) + } + if (!is.data.frame(reference_info)) { + stop("LD_reference_info must be a .bim/.pvar path or data.frame when it is not a load_LD_matrix() result.") } - return(list(sumstats = collected_sumstats, LD_mat = collected_LD, LD_match = collected_LD_match)) + info <- as.data.frame(reference_info, stringsAsFactors = FALSE) + names(info) <- sub("^#", "", names(info)) + names(info) <- sub("^chr$", "chrom", names(info), ignore.case = TRUE) + names(info) <- sub("^CHROM$", "chrom", names(info)) + names(info) <- sub("^POS$", "pos", names(info)) + names(info) <- sub("^ID$", "id", names(info)) + names(info) <- sub("^REF$", "A2", names(info)) + names(info) <- sub("^ALT$", "A1", names(info)) + names(info) <- sub("^a0$", "A2", names(info)) + names(info) <- sub("^a1$", "A1", names(info)) + names(info) <- sub("^rsid$", "id", names(info), ignore.case = TRUE) + + if ("variants" %in% names(info) && !"variant_id" %in% names(info)) { + info$variant_id <- normalize_variant_id(as.character(info$variants)) + } + if (!"variant_id" %in% names(info)) { + missing_cols <- setdiff(c("chrom", "pos", "A2", "A1"), names(info)) + if (length(missing_cols) > 0) { + stop("LD_reference_info must contain variant_id or columns chrom, pos, A2, A1. Missing: ", + paste(missing_cols, collapse = ", "), ".") + } + info$variant_id <- normalize_variant_id(format_variant_id(info$chrom, info$pos, info$A2, info$A1)) + } else { + info$variant_id <- normalize_variant_id(as.character(info$variant_id)) + } + if (!all(c("chrom", "pos", "A2", "A1") %in% names(info))) { + parsed <- parse_variant_id(info$variant_id) + info$chrom <- parsed$chrom + info$pos <- parsed$pos + info$A2 <- parsed$A2 + info$A1 <- parsed$A1 + } + keep_cols <- intersect(c("chrom", "id", "pos", "A2", "A1", "variant_id", + "allele_freq", "variance", "n_nomiss"), names(info)) + info[, keep_cols, drop = FALSE] } + align_ref_panel_to_ld <- function(ref_panel, ld_names, n_variants) { + if (!is.null(ld_names)) { + match_idx <- rep(NA_integer_, length(ld_names)) + if ("id" %in% names(ref_panel)) { + match_idx <- match(ld_names, ref_panel$id) + } + if (any(is.na(match_idx))) { + variant_match <- match(normalize_variant_id(ld_names), ref_panel$variant_id) + match_idx[is.na(match_idx)] <- variant_match[is.na(match_idx)] + } + if (all(!is.na(match_idx))) { + ref_panel <- ref_panel[match_idx, , drop = FALSE] + rownames(ref_panel) <- NULL + return(ref_panel) + } + if (length(ld_names) == nrow(ref_panel)) { + message("QC track: LD_reference_info could not be matched by LD names; using LD_reference_info row order.") + return(ref_panel[seq_along(ld_names), , drop = FALSE]) + } + stop("LD_reference_info could not be matched to LD/X_ref names. ", + "Provide an id/variant_id column matching LD names, or provide rows in LD matrix order.") + } + if (nrow(ref_panel) != n_variants) { + stop("LD_reference_info has ", nrow(ref_panel), " variants but LD/X_ref has ", + n_variants, " columns. Provide LD_reference_info in LD matrix order or with matching LD names.") + } + message("QC track: LD/X_ref names are missing; using LD_reference_info row order.") + ref_panel + } - # - summary statistics QC - sumstat_data <- region_data$sumstat_data - if (!is.null(sumstat_data)) { - # - initial check PIP, qc or impute - sumstat_data <- summary_stats_qc_multitask(sumstat_data, - keep_indel = keep_indel, - pip_cutoff_to_skip_sumstat = pip_cutoff_to_skip_sumstat, - qc_method = qc_method, - impute = impute, impute_opts = impute_opts + variant_convention <- match.arg(variant_convention) + if (is.null(is_genotype)) is_genotype <- is.matrix(ld) && nrow(ld) != ncol(ld) + ref_ids <- if (is.matrix(ld) && nrow(ld) == ncol(ld)) rownames(ld) else colnames(ld) + + if (!is.null(reference_info)) { + ref_panel <- reference_info_to_ref_panel(reference_info) + ref_panel <- align_ref_panel_to_ld(ref_panel, ref_ids, ncol(ld)) + variant_ids <- ref_panel$variant_id + if (is.matrix(ld) && nrow(ld) == ncol(ld)) { + rownames(ld) <- colnames(ld) <- variant_ids + } else if (is.matrix(ld)) { + colnames(ld) <- variant_ids + } + return(list( + LD_matrix = ld, + LD_variants = variant_ids, + ref_panel = ref_panel, + block_metadata = if (!isTRUE(is_genotype)) .infer_single_ld_block_metadata(ref_panel) else NULL, + is_genotype = isTRUE(is_genotype) + )) + } + + if (is.null(ref_ids)) { + return(NULL) + } + + parsed <- if (!is.null(ref_ids)) { + tryCatch( + .cb_parse_variants_for_qc(ref_ids, variant_convention), + error = function(e) NULL ) + } else { + NULL + } + if (is.null(parsed)) return(NULL) + variant_ids <- if (!is.null(parsed)) format_variant_id(parsed$chrom, parsed$pos, parsed$A2, parsed$A1) else NULL + if (!is.null(variant_ids)) { + if (is.matrix(ld) && nrow(ld) == ncol(ld)) { + rownames(ld) <- colnames(ld) <- variant_ids + } else if (is.matrix(ld)) { + colnames(ld) <- variant_ids + } + parsed$variant_id <- variant_ids + } + list( + LD_matrix = ld, + LD_variants = variant_ids, + ref_panel = parsed, + block_metadata = if (!isTRUE(is_genotype) && !is.null(parsed)) .infer_single_ld_block_metadata(parsed) else NULL, + is_genotype = isTRUE(is_genotype) + ) +} + +.cb_parse_variants_for_qc <- function(ids, variant_convention = c("A2_A1", "A1_A2")) { + variant_convention <- match.arg(variant_convention) + parsed <- parse_variant_id(ids) + if (any(is.na(parsed$chrom)) || any(is.na(parsed$pos)) || + any(is.na(parsed$A1)) || any(is.na(parsed$A2))) { + stop("QC requires variant IDs parseable as chr:pos:A2:A1 by default. ", + "If the input uses chr:pos:A1:A2, set variant_convention = 'A1_A2'.") + } + if (identical(variant_convention, "A1_A2")) { + parsed <- data.frame(chrom = parsed$chrom, pos = parsed$pos, + A2 = parsed$A1, A1 = parsed$A2, + stringsAsFactors = FALSE) } - return(list( - individual_data = individual_data, - sumstat_data = sumstat_data - )) + parsed } diff --git a/R/file_utils.R b/R/file_utils.R index e8a9df85..6583695d 100644 --- a/R/file_utils.R +++ b/R/file_utils.R @@ -858,8 +858,19 @@ load_genotype_region <- function(genotype, region = NULL, keep_indel = TRUE, #' @importFrom magrittr %>% #' @noRd read_single_covariate <- function(path) { - df <- read_delim(path, "\t", col_types = cols()) %>% select(-1) - non_numeric <- names(df)[!sapply(df, is.numeric)] + raw_df <- read_delim(path, "\t", col_types = cols(.default = "c")) %>% select(-1) + df <- raw_df + non_numeric <- character() + for (nm in names(df)) { + values <- trimws(as.character(df[[nm]])) + converted <- suppressWarnings(as.numeric(values)) + bad <- !is.na(values) & values != "" & is.na(converted) + if (any(bad)) { + non_numeric <- c(non_numeric, nm) + } else { + df[[nm]] <- converted + } + } if (length(non_numeric) > 0) { stop("Non-numeric columns found in covariate file ", path, ": ", paste(non_numeric, collapse = ", "), @@ -1766,16 +1777,34 @@ load_multitask_regional_data <- function(region, # a string of chr:start-end for # - if exist individual level data individual_data <- NULL if (!is.null(genotype_list)) { + if (length(phenotype_list) != length(covariate_list)) { + stop("Data load error. 'phenotype_list' and 'covariate_list' must have the same length.") + } + if (is.null(conditions_list_individual)) { + conditions_list_individual <- paste0("condition", seq_along(phenotype_list)) + warning("Data load warning. 'conditions_list_individual' is not supplied; using default condition names. ", + "Provide 'conditions_list_individual' to preserve cohort or cell-type labels.") + } + if (length(conditions_list_individual) != length(phenotype_list)) { + stop("Data load error. 'conditions_list_individual' must have the same length as 'phenotype_list'.") + } #### FIXME: later if we have mulitple genotype list if (length(genotype_list) != 1 & is.null(match_geno_pheno)) { stop("Data load error. Please make sure 'match_geno_pheno' exists if you load data from multiple individual-level data.") } else if (length(genotype_list) == 1 & is.null(match_geno_pheno)) { match_geno_pheno <- rep(1, length(phenotype_list)) } + if (length(match_geno_pheno) != length(phenotype_list)) { + stop("Data load error. 'match_geno_pheno' must have the same length as 'phenotype_list'.") + } + if (any(is.na(match_geno_pheno)) || + any(match_geno_pheno < 1 | match_geno_pheno > length(genotype_list))) { + stop("Data load error. 'match_geno_pheno' must contain valid indices into 'genotype_list'.") + } # - load individual data from multiple datasets - n_dataset <- unique(match_geno_pheno) ### FIXME - for (i_data in 1:n_dataset) { + n_dataset <- unique(match_geno_pheno) + for (i_data in n_dataset) { # extract genotype file name genotype <- genotype_list[i_data] # extract phenotype and covariate file names @@ -1783,6 +1812,10 @@ load_multitask_regional_data <- function(region, # a string of chr:start-end for phenotype <- phenotype_list[pos] covariate <- covariate_list[pos] conditions <- conditions_list_individual[pos] + extract_region_name_i <- extract_region_name + if (is.list(extract_region_name) && length(extract_region_name) == length(phenotype_list)) { + extract_region_name_i <- extract_region_name[pos] + } dat <- load_regional_univariate_data( genotype = genotype, phenotype = phenotype, covariate = covariate, @@ -1792,7 +1825,7 @@ load_multitask_regional_data <- function(region, # a string of chr:start-end for maf_cutoff = maf_cutoff, mac_cutoff = mac_cutoff, imiss_cutoff = imiss_cutoff, keep_indel = keep_indel, keep_samples = keep_samples, keep_variants = keep_variants, - extract_region_name = extract_region_name, + extract_region_name = extract_region_name_i, phenotype_header = phenotype_header, region_name_col = region_name_col, scale_residuals = scale_residuals @@ -1800,9 +1833,9 @@ load_multitask_regional_data <- function(region, # a string of chr:start-end for if (is.null(individual_data)) { individual_data <- dat } else { - individual_data <- lapply(names(dat), function(k) { + individual_data <- stats::setNames(lapply(names(dat), function(k) { c(individual_data[[k]], dat[[k]]) - }) + }), names(dat)) individual_data$chrom <- dat$chrom individual_data$grange <- dat$grange } @@ -1866,6 +1899,219 @@ load_multitask_regional_data <- function(region, # a string of chr:start-end for )) } +#' Convert loaded regional data to individual-level inputs +#' +#' @param region_data A list returned by \code{load_multitask_regional_data()}. +#' @return A list containing \code{X}, \code{Y}, \code{maf}, +#' \code{X_variance}, and source information. +#' @export +region_data_to_ind_input <- function(region_data) { + first_non_null <- function(...) { + values <- list(...) + for (value in values) { + if (!is.null(value)) return(value) + } + NULL + } + + align_individual_contexts <- function(X, Y) { + cbind_y <- function(Y, fallback_names) { + keep <- !vapply(Y, is.null, logical(1)) + if (!any(keep)) return(NULL) + Y <- Y[keep] + fallback_names <- fallback_names[keep] + mats <- Map(function(y, nm) { + if (is.null(dim(y))) y <- matrix(y, ncol = 1) + if (is.null(colnames(y))) colnames(y) <- nm + y + }, Y, fallback_names) + do.call(cbind, mats) + } + + if (!is.list(X) || is.matrix(X) || is.data.frame(X) || + !is.list(Y) || is.matrix(Y) || is.data.frame(Y) || + is.null(names(X)) || is.null(names(Y)) || + length(intersect(names(X), names(Y))) > 0) { + return(list(X = X, Y = Y)) + } + x_names <- names(X) + y_names <- names(Y) + grouped <- list() + for (context in x_names) { + matched <- y_names[y_names == context | startsWith(y_names, paste0(context, "_"))] + if (length(matched) > 0) { + y_group <- cbind_y(Y[matched], matched) + if (!is.null(y_group)) grouped[[context]] <- y_group + } + } + if (length(grouped) == 0 && length(X) == 1 && length(Y) > 0) { + y_group <- cbind_y(Y, y_names) + if (!is.null(y_group)) grouped[[x_names[[1]]]] <- y_group + } + if (length(grouped) == 0) { + return(list(X = X, Y = Y)) + } + list(X = X[names(grouped)], Y = grouped) + } + + individual_data <- region_data$individual_data + if (is.null(individual_data)) { + return(list(X = NULL, Y = NULL, maf = NULL, X_variance = NULL, + source_info = list(has_individual = FALSE, contexts = character()))) + } + + X <- first_non_null(individual_data$residual_X, individual_data$X) + Y <- first_non_null(individual_data$residual_Y, individual_data$Y) + if (is.list(X) && !is.matrix(X) && !is.data.frame(X) && + is.null(names(X)) && !is.null(names(Y)) && length(X) == length(Y)) { + names(X) <- names(Y) + } + if (is.list(Y) && !is.matrix(Y) && !is.data.frame(Y) && + is.null(names(Y)) && !is.null(names(X)) && length(Y) == length(X)) { + names(Y) <- names(X) + } + if (is.matrix(X) && is.list(Y) && !is.null(names(Y))) { + X <- stats::setNames(rep(list(X), length(Y)), names(Y)) + } + aligned <- align_individual_contexts(X, Y) + X <- aligned$X + Y <- aligned$Y + maf <- individual_data$maf + X_variance <- individual_data$X_variance + if (is.list(maf) && is.null(names(maf)) && !is.null(names(X)) && length(maf) == length(X)) { + names(maf) <- names(X) + } + if (is.list(X_variance) && is.null(names(X_variance)) && !is.null(names(X)) && + length(X_variance) == length(X)) { + names(X_variance) <- names(X) + } + contexts <- unique(c(names(X), names(Y))) + list( + X = X, + Y = Y, + maf = maf, + X_variance = X_variance, + source_info = list(has_individual = !is.null(X) && !is.null(Y), + contexts = contexts) + ) +} + +#' Convert loaded regional data to RSS inputs +#' +#' @param region_data A list returned by \code{load_multitask_regional_data()}. +#' @return A list containing named RSS inputs, matched LD data, and source +#' information. +#' @export +region_data_to_rss_input <- function(region_data) { + make_ld_data_from_matrix <- function(ld, variant_ids = NULL) { + is_genotype <- is.matrix(ld) && nrow(ld) != ncol(ld) + if (!is.null(variant_ids) && is.matrix(ld)) { + if (is.null(colnames(ld)) && length(variant_ids) == ncol(ld)) { + colnames(ld) <- variant_ids + } + if (!is_genotype && is.null(rownames(ld)) && length(variant_ids) == nrow(ld)) { + rownames(ld) <- variant_ids + } + } + ids <- if (is.matrix(ld) && !is_genotype) rownames(ld) else colnames(ld) + parsed <- NULL + if (!is.null(ids) && length(ids) > 0) { + parsed <- tryCatch(parse_variant_id(ids), error = function(e) NULL) + if (!is.null(parsed)) { + ids <- format_variant_id(parsed$chrom, parsed$pos, parsed$A2, parsed$A1) + if (!is_genotype && is.matrix(ld)) rownames(ld) <- colnames(ld) <- ids + if (is_genotype && is.matrix(ld)) colnames(ld) <- ids + parsed$variant_id <- ids + } + } + list( + LD_matrix = ld, + LD_variants = ids, + ref_panel = parsed, + block_metadata = if (!is_genotype && !is.null(parsed)) .infer_single_ld_block_metadata(parsed) else NULL, + is_genotype = isTRUE(is_genotype) + ) + } + + rss_input_from_qced_sumstat <- function(sumstat_data) { + variant_ids_from_rss <- function(rss) { + ss <- rss$sumstats + if (is.null(ss)) return(character()) + if ("variant_id" %in% colnames(ss)) return(normalize_variant_id(as.character(ss$variant_id))) + if (all(c("chrom", "pos", "A2", "A1") %in% colnames(ss))) { + return(format_variant_id(ss$chrom, ss$pos, ss$A2, ss$A1)) + } + character() + } + + rss_input <- sumstat_data$sumstats + LD_mat <- sumstat_data$LD_mat + LD_match <- sumstat_data$LD_match + studies <- names(rss_input) + LD_data <- list() + ld_group <- character() + for (i in seq_along(studies)) { + study <- studies[[i]] + ld_name <- if (!is.null(LD_match) && length(LD_match) >= i) LD_match[[i]] else study + if (is.null(ld_name) || is.na(ld_name) || !ld_name %in% names(LD_mat)) { + ld_name <- names(LD_mat)[min(i, length(LD_mat))] + } + ld <- LD_mat[[ld_name]] + rss <- rss_input[[study]] + variant_ids <- variant_ids_from_rss(rss) + LD_data[[study]] <- make_ld_data_from_matrix(ld, variant_ids) + ld_group[[study]] <- ld_name + } + list( + rss_input = rss_input, + LD_data = LD_data, + source_info = list(has_sumstat = length(rss_input) > 0, + studies = names(rss_input), + ld_group = ld_group) + ) + } + + sumstat_data <- region_data$sumstat_data + if (is.null(sumstat_data) || is.null(sumstat_data$sumstats)) { + return(list(rss_input = list(), LD_data = list(), + source_info = list(has_sumstat = FALSE, studies = character(), + ld_group = character()))) + } + if (!is.null(sumstat_data$LD_mat)) { + return(rss_input_from_qced_sumstat(sumstat_data)) + } + + rss_input <- list() + LD_data <- list() + ld_group <- character() + + for (i in seq_along(sumstat_data$sumstats)) { + studies <- sumstat_data$sumstats[[i]] + ld_index <- min(i, length(sumstat_data$LD_info)) + group_name <- names(sumstat_data$LD_info)[ld_index] + if (is.null(group_name) || is.na(group_name) || group_name == "") { + group_name <- paste0("LD", ld_index) + } + for (study in names(studies)) { + output_name <- study + if (output_name %in% names(rss_input)) { + output_name <- make.unique(c(names(rss_input), output_name))[length(rss_input) + 1] + } + rss_input[[output_name]] <- studies[[study]] + LD_data[[output_name]] <- sumstat_data$LD_info[[ld_index]] + ld_group[[output_name]] <- group_name + } + } + + list( + rss_input = rss_input, + LD_data = LD_data, + source_info = list(has_sumstat = length(rss_input) > 0, + studies = names(rss_input), + ld_group = ld_group) + ) +} + #' Load and filter tabular data with optional region subsetting #' #' This function loads summary statistics data from tabular files (TSV, TXT). @@ -1909,6 +2155,17 @@ load_tsv_region <- function(file_path, region = NULL, extract_region_name = NULL if (length(hdr) > 0) { last_hdr <- hdr[length(hdr)] col_names_vec <- strsplit(sub("^#", "", last_hdr), "\t")[[1]] + } else { + header_con <- gzfile(file_path, "rt") + first_line <- readLines(header_con, n = 1) + close(header_con) + first_fields <- strsplit(sub("^#", "", first_line), "\t")[[1]] + header_tokens <- c("chrom", "chr", "#chrom", "pos", "bp", "snp", + "variant_id", "a1", "a2", "beta", "se", "z", + "p", "pvalue") + if (any(tolower(first_fields) %in% header_tokens)) { + col_names_vec <- first_fields + } } txt <- paste(lines, collapse = "\n") diff --git a/R/sumstats_qc.R b/R/sumstats_qc.R index 22e73066..aca54aea 100644 --- a/R/sumstats_qc.R +++ b/R/sumstats_qc.R @@ -8,6 +8,10 @@ #' @param LD_data A list containing combined LD variants data that is generated by load_LD_matrix. #' @param skip_region A character vector specifying regions to be skipped in the analysis (optional). #' Each region should be in the format "chrom:start-end" (e.g., "1:1000000-2000000"). +#' @param return_LD_mat Logical; if \code{FALSE}, return only harmonized +#' summary statistics and skip LD-matrix subsetting. This is useful when the +#' reference input is genotype-backed \code{X_ref}. Defaults to \code{TRUE} +#' for backwards compatibility. #' #' @return A list containing the processed summary statistics and LD matrix. #' - sumstats: A data frame containing the processed summary statistics. @@ -18,7 +22,8 @@ #' @importFrom tidyr separate #' @importFrom magrittr %>% #' @export -rss_basic_qc <- function(sumstats, LD_data, skip_region = NULL, keep_indel = TRUE) { +rss_basic_qc <- function(sumstats, LD_data, skip_region = NULL, keep_indel = TRUE, + return_LD_mat = TRUE) { # Check if required columns are present in sumstats required_cols <- c("chrom", "pos", "A1", "A2") missing_cols <- setdiff(required_cols, colnames(sumstats)) @@ -63,6 +68,10 @@ rss_basic_qc <- function(sumstats, LD_data, skip_region = NULL, keep_indel = TRU sumstats_processed <- allele_flip$target_data_qced %>% arrange(pos) + if (!isTRUE(return_LD_mat)) { + return(list(sumstats = sumstats_processed, LD_mat = NULL)) + } + # Align and subset LD by mapping core IDs (strip trailing build suffix) to exact LD IDs ld_mat <- LD_data$LD_matrix ld_ids <- tryCatch(rownames(ld_mat), error = function(e) NULL) @@ -138,41 +147,425 @@ ld_mismatch_qc <- function(zScore, R = NULL, X = NULL, nSample = NULL, } } +.resolve_summary_qc_method <- function(qc_method) { + if (is.null(qc_method)) return("none") + qc_method <- match.arg(qc_method, c("none", "rss_qc", "slalom", "dentist")) + if (identical(qc_method, "rss_qc")) "none" else qc_method +} + #' Perform Quality Control on Summary Statistics #' -#' This function performs quality control on the processed summary statistics using the specified method. -#' It wraps \code{\link{ld_mismatch_qc}} and handles subsetting of the summary statistics and LD matrix. +#' This function performs quality control on the processed summary statistics +#' using the specified method. It wraps \code{\link{ld_mismatch_qc}} and handles +#' subsetting of the summary statistics and LD matrix. #' #' @param sumstats A data frame containing the processed summary statistics. -#' @param LD_data A list containing the combined LD variants data generated by load_LD_matrix. +#' @param LD_data A list containing the combined LD variants data generated by +#' \code{load_LD_matrix()}. #' @param n Sample size for the LD reference panel (used by dentist method). -#' @param method The quality control method to use. Options are "slalom" or "dentist" (default: "slalom"). +#' @param method The quality control method to use. Options are \code{"slalom"} +#' or \code{"dentist"} (default: \code{"slalom"}). +#' @param rss_input Optional loaded RSS input, either one \code{load_rss_data()} +#' result or a named list of them. Supplying this selects the additional +#' combined RSS/ColocBoost QC workflow. A single RSS record is detected by +#' structure, not by the name \code{sumstats}, so a multi-study list may safely +#' include a study named \code{"sumstats"}. +#' @param keep_indel,skip_region,pip_cutoff_to_skip,qc_method,impute,impute_opts,study,var_y,return_on_skip,R_finite,R_mismatch +#' Additional controls for the combined RSS/ColocBoost QC workflow. They are +#' ignored by the historical LD-mismatch-only call unless \code{rss_input} or +#' combined-QC options are supplied. #' -#' @return A list containing the quality-controlled summary statistics and updated LD matrix. -#' - sumstats: The quality-controlled summary statistics data frame. -#' - LD_mat: The updated LD matrix after quality control. -#' - outlier_number: The number of outlier variants removed. +#' @return A list containing the quality-controlled summary statistics and +#' updated LD matrix for the historical call: +#' \itemize{ +#' \item sumstats: The quality-controlled summary statistics data frame. +#' \item LD_mat: The updated LD matrix after quality control. +#' \item outlier_number: The number of outlier variants removed. +#' } +#' When \code{rss_input} or combined-QC controls are supplied, returns a +#' cleaned RSS/LD record for one RSS record, or a named list of records for a +#' list of RSS records. #' -#' @details This function applies the specified quality control method to the processed summary statistics -#' via \code{\link{ld_mismatch_qc}}, then subsets the summary statistics and LD matrix to keep -#' only non-outlier variants. +#' @details This function applies the specified quality control method to the +#' processed summary statistics via \code{\link{ld_mismatch_qc}}, then subsets +#' the summary statistics and LD matrix to keep only non-outlier variants. +#' +#' As an additional workflow for ColocBoost/RSS pipelines, callers may supply +#' \code{rss_input} or combined-QC controls. That path first runs +#' \code{rss_basic_qc()}, optional PIP screening, optional LD-mismatch QC +#' through this same function, and optional RAISS imputation. The combined +#' path normalizes one or many RSS records to the same loop internally; only +#' true single-record input is unwrapped on return. +#' \code{qc_method = NULL}, \code{"none"}, and \code{"rss_qc"} all mean +#' basic-only summary-stat preprocessing without SLALOM/DENTIST outlier QC. +#' +#' When the combined path receives genotype-backed reference data +#' (\code{X_ref}), basic harmonization avoids constructing an LD matrix. PIP +#' screening uses an LD-independent single-effect screen with +#' \code{susie_rss(R = diag(p), L = 1, max_iter = 1)}, and LD-mismatch QC +#' computes only the filtered local correlation matrix required by +#' SLALOM/DENTIST. RAISS imputation temporarily centers/scales +#' genotype-backed \code{X_ref} before using the whole-region genotype/SVD +#' path. This avoids LD-block partition artifacts while matching the LD scale +#' used by \code{compute_LD(X_ref)}. Final ColocBoost calls still keep the +#' original \code{X_ref} as the reference input. #' #' @examples #' # Perform SLALOM quality control (default) #' qc_results <- summary_stats_qc(sumstats, LD_data, method = "slalom") #' +#' # Additional combined basic-only RSS QC. +#' qc_results <- summary_stats_qc(rss_input = rss_input, LD_data = LD_data, +#' qc_method = "none") +#' #' @importFrom dplyr mutate row_number filter pull #' @export -summary_stats_qc <- function(sumstats, LD_data, n = NULL, method = c("slalom", "dentist")) { - method <- match.arg(method) - LD_extract <- LD_data$LD_matrix[sumstats$variant_id, sumstats$variant_id, drop = FALSE] +summary_stats_qc <- function(sumstats, LD_data, n = NULL, + method = c("slalom", "dentist"), + rss_input = NULL, + keep_indel = TRUE, + skip_region = NULL, + pip_cutoff_to_skip = 0, + qc_method = NULL, + impute = FALSE, + impute_opts = list(rcond = 0.01, R2_threshold = 0.6, + minimum_ld = 5, lamb = 0.01), + study = NULL, + var_y = NULL, + return_on_skip = c("null", "preprocess"), + R_finite = NULL, + R_mismatch = NULL) { + if (missing(sumstats)) sumstats <- NULL + return_on_skip <- match.arg(return_on_skip) + if (is.null(rss_input) && is.data.frame(sumstats) && is.null(qc_method) && + isFALSE(impute) && identical(pip_cutoff_to_skip, 0) && is.null(skip_region)) { + method <- match.arg(method) + LD_extract <- LD_data$LD_matrix[sumstats$variant_id, sumstats$variant_id, drop = FALSE] + qc_results <- ld_mismatch_qc(zScore = sumstats$z, R = LD_extract, + nSample = n, method = method) + keep_index <- which(!qc_results$outlier) + sumstats_qc <- sumstats[keep_index, , drop = FALSE] + LD_mat_qc <- LD_extract[sumstats_qc$variant_id, sumstats_qc$variant_id, drop = FALSE] + outlier_number <- nrow(sumstats) - nrow(sumstats_qc) + return(list(sumstats = sumstats_qc, LD_mat = LD_mat_qc, outlier_number = outlier_number)) + } + + qc_method <- .resolve_summary_qc_method(qc_method) + + if (is.null(rss_input)) { + if (is.null(sumstats)) stop("summary_stats_qc requires sumstats or rss_input.") + rss_input <- if (is.data.frame(sumstats)) { + list(sumstats = sumstats, n = n, var_y = if (is.null(var_y)) 1 else var_y) + } else { + sumstats + } + } + + is_rss_record <- function(x) { + is.list(x) && is.data.frame(x$sumstats) + } + is_ld_record <- function(x) { + is.matrix(x) || (is.list(x) && !is.null(x$LD_matrix)) + } + first_ld_record <- function(x, study_name = NULL) { + if (is_ld_record(x)) return(x) + if (!is.null(study_name) && study_name %in% names(x)) return(x[[study_name]]) + if (is.list(x) && length(x) == 1) return(x[[1]]) + x + } + + single_rss_input <- is_rss_record(rss_input) + if (single_rss_input) { + studies <- if (is.null(study)) "input" else study + rss_records <- stats::setNames(list(rss_input), studies) + LD_records <- stats::setNames(list(first_ld_record(LD_data, studies)), studies) + } else { + if (!is.list(rss_input) || length(rss_input) == 0) { + stop("No summary-statistic studies supplied for QC.") + } + rss_records <- rss_input + studies <- names(rss_records) + if (is.null(studies)) studies <- rep("", length(rss_records)) + empty_names <- is.na(studies) | studies == "" + if (any(empty_names)) { + studies[empty_names] <- paste0("study", which(empty_names)) + names(rss_records) <- studies + } + if (!all(vapply(rss_records, is_rss_record, logical(1)))) { + stop("rss_input must be one RSS record or a list of RSS records with a sumstats data frame.") + } + + if (is_ld_record(LD_data)) { + LD_records <- stats::setNames(rep(list(LD_data), length(studies)), studies) + } else if (is.list(LD_data) && length(LD_data) == 1) { + LD_records <- stats::setNames(rep(list(LD_data[[1]]), length(studies)), studies) + } else if (is.list(LD_data) && all(studies %in% names(LD_data))) { + LD_records <- LD_data[studies] + } else if (is.list(LD_data) && length(LD_data) == length(studies)) { + LD_records <- LD_data + names(LD_records) <- studies + } else { + stop("LD_data must be one LD object or a list matching rss_input studies.") + } + } + + if (length(pip_cutoff_to_skip) == 1 && is.null(names(pip_cutoff_to_skip))) { + cutoffs <- stats::setNames(rep(pip_cutoff_to_skip, length(studies)), studies) + } else if (!is.null(names(pip_cutoff_to_skip))) { + missing <- setdiff(studies, names(pip_cutoff_to_skip)) + if (length(missing) > 0) { + pip_cutoff_to_skip <- c(pip_cutoff_to_skip, stats::setNames(rep(0, length(missing)), missing)) + } + cutoffs <- pip_cutoff_to_skip[studies] + } else if (length(pip_cutoff_to_skip) == length(studies)) { + cutoffs <- stats::setNames(pip_cutoff_to_skip, studies) + } else { + stop("pip_cutoff_to_skip must be a scalar, named vector, or match the number of studies.") + } + + out <- list() + for (study_name in studies) { + result <- .summary_stats_qc_single_study( + rss_records[[study_name]], LD_records[[study_name]], + keep_indel = keep_indel, + skip_region = skip_region, + pip_cutoff_to_skip = cutoffs[[study_name]], + qc_method = qc_method, + impute = impute, + impute_opts = impute_opts, + study = study_name, + return_on_skip = return_on_skip, + R_finite = R_finite, + R_mismatch = R_mismatch + ) + if (!is.null(result)) out[[study_name]] <- result + } + if (single_rss_input) out[[studies]] else out +} + +.infer_single_ld_block_metadata <- function(ref_panel) { + n_variants <- nrow(ref_panel) + data.frame( + block_id = 1L, + chrom = as.character(ref_panel$chrom[1]), + block_start = suppressWarnings(min(ref_panel$pos, na.rm = TRUE)), + block_end = suppressWarnings(max(ref_panel$pos, na.rm = TRUE)), + size = n_variants, + start_idx = 1L, + end_idx = n_variants, + stringsAsFactors = FALSE + ) +} + +.normalize_ld_data_for_qc <- function(LD_data) { + if (is.null(LD_data)) return(NULL) + if (is.matrix(LD_data)) { + ids <- if (nrow(LD_data) == ncol(LD_data)) rownames(LD_data) else colnames(LD_data) + LD_data <- list(LD_matrix = LD_data, LD_variants = ids) + } + if (is.data.frame(LD_data$LD_variants)) { + variant_ids <- if ("variant_id" %in% colnames(LD_data$LD_variants)) { + as.character(LD_data$LD_variants$variant_id) + } else { + format_variant_id(LD_data$LD_variants$chrom, LD_data$LD_variants$pos, + LD_data$LD_variants$A2, LD_data$LD_variants$A1) + } + LD_data$LD_variants <- variant_ids + } + if (is.data.frame(LD_data$ref_panel) && !"variant_id" %in% colnames(LD_data$ref_panel)) { + LD_data$ref_panel$variant_id <- format_variant_id( + LD_data$ref_panel$chrom, LD_data$ref_panel$pos, + LD_data$ref_panel$A2, LD_data$ref_panel$A1 + ) + } + if (is.null(LD_data$LD_variants) && is.data.frame(LD_data$ref_panel) && + "variant_id" %in% colnames(LD_data$ref_panel)) { + LD_data$LD_variants <- as.character(LD_data$ref_panel$variant_id) + } + if (is.null(LD_data$ref_panel) && !is.null(LD_data$LD_variants)) { + parsed <- tryCatch(parse_variant_id(LD_data$LD_variants), error = function(e) NULL) + if (!is.null(parsed)) { + LD_data$ref_panel <- parsed + LD_data$ref_panel$variant_id <- LD_data$LD_variants + } + } + if (!is.null(LD_data$LD_variants) && is.matrix(LD_data$LD_matrix)) { + if (nrow(LD_data$LD_matrix) == ncol(LD_data$LD_matrix) && + length(LD_data$LD_variants) == nrow(LD_data$LD_matrix)) { + rownames(LD_data$LD_matrix) <- colnames(LD_data$LD_matrix) <- LD_data$LD_variants + } else if (length(LD_data$LD_variants) == ncol(LD_data$LD_matrix)) { + colnames(LD_data$LD_matrix) <- LD_data$LD_variants + } + } + LD_data +} - qc_results <- ld_mismatch_qc(zScore = sumstats$z, R = LD_extract, nSample = n, - method = method) - keep_index <- which(!qc_results$outlier) - sumstats_qc <- sumstats[keep_index, , drop = FALSE] - LD_mat_qc <- LD_extract[sumstats_qc$variant_id, sumstats_qc$variant_id, drop = FALSE] - outlier_number <- nrow(sumstats) - nrow(sumstats_qc) +.summary_stats_qc_single_study <- function(rss_input, LD_data, keep_indel, skip_region, + pip_cutoff_to_skip, qc_method, impute, + impute_opts, study, return_on_skip, + R_finite = NULL, R_mismatch = NULL) { + skipped_result <- function(sumstats, LD_mat, reason) { + if (!identical(return_on_skip, "preprocess")) return(NULL) + list( + rss_input = list(sumstats = sumstats, n = rss_input$n, var_y = rss_input$var_y), + LD_matrix = LD_mat, + preprocess = list(sumstats = sumstats, LD_mat = LD_mat), + outlier_number = 0L, + skipped = TRUE, + skip_reason = reason + ) + } - return(list(sumstats = sumstats_qc, LD_mat = LD_mat_qc, outlier_number = outlier_number)) + if (is.null(rss_input) || is.null(LD_data)) return(NULL) + message("QC track: starting basic allele harmonization for summary-stat study ", study, ".") + message("QC track: basic summary-stat QC requires sumstat$variant and LD_data$LD_variants ", + "or LD/X_ref variant names for study ", study, ".") + LD_data_for_qc <- .normalize_ld_data_for_qc(LD_data) + has_genotype <- isTRUE(LD_data_for_qc$is_genotype) || + (is.matrix(LD_data_for_qc$LD_matrix) && + nrow(LD_data_for_qc$LD_matrix) != ncol(LD_data_for_qc$LD_matrix)) + if (has_genotype && is.data.frame(LD_data_for_qc$ref_panel) && + all(c("chrom", "pos", "A2", "A1") %in% colnames(LD_data_for_qc$ref_panel))) { + canonical_ids <- format_variant_id(LD_data_for_qc$ref_panel$chrom, + LD_data_for_qc$ref_panel$pos, + LD_data_for_qc$ref_panel$A2, + LD_data_for_qc$ref_panel$A1) + LD_data_for_qc$ref_panel$variant_id <- canonical_ids + LD_data_for_qc$LD_variants <- canonical_ids + if (is.matrix(LD_data_for_qc$LD_matrix) && + length(canonical_ids) == ncol(LD_data_for_qc$LD_matrix)) { + colnames(LD_data_for_qc$LD_matrix) <- canonical_ids + } + } + X_ref <- if (has_genotype) LD_data_for_qc$LD_matrix else NULL + basic <- rss_basic_qc(rss_input$sumstats, LD_data_for_qc, + skip_region = skip_region, keep_indel = keep_indel, + return_LD_mat = !has_genotype) + sumstats <- basic$sumstats + R_mat <- basic$LD_mat + n <- rss_input$n + var_y <- rss_input$var_y + reference_for_variants <- function(variants) { + if (has_genotype) { + ref_names <- colnames(X_ref) + ref_idx <- match(variants, ref_names) + if (anyNA(ref_idx)) { + ref_idx <- match(strip_chr_prefix(strip_build_suffix(variants)), + strip_chr_prefix(strip_build_suffix(ref_names))) + } + if (anyNA(ref_idx)) { + missing_variants <- variants[is.na(ref_idx)] + stop("Genotype reference is missing ", length(missing_variants), + " post-QC variant(s): ", paste(utils::head(missing_variants, 3), collapse = ", ")) + } + X_subset <- X_ref[, ref_idx, drop = FALSE] + colnames(X_subset) <- variants + return(X_subset) + } + if (is.null(R_mat)) return(NULL) + R_mat[variants, variants, drop = FALSE] + } + ld_data_with_local_R <- function(current_sumstats) { + if (!has_genotype) return(LD_data_for_qc) + variants <- current_sumstats$variant_id + X_local <- reference_for_variants(variants) + R_local <- compute_LD(X_local, method = "sample") + rownames(R_local) <- colnames(R_local) <- variants + ref_panel <- LD_data_for_qc$ref_panel + if (is.data.frame(ref_panel) && "variant_id" %in% colnames(ref_panel)) { + ref_idx <- match(variants, ref_panel$variant_id) + if (anyNA(ref_idx)) { + ref_panel <- parse_variant_id(variants) + ref_panel$variant_id <- variants + } else { + ref_panel <- ref_panel[ref_idx, , drop = FALSE] + } + } else { + ref_panel <- parse_variant_id(variants) + ref_panel$variant_id <- variants + } + ld_data <- LD_data_for_qc + ld_data$LD_matrix <- R_local + ld_data$LD_variants <- variants + ld_data$ref_panel <- ref_panel + ld_data$block_metadata <- .infer_single_ld_block_metadata(ref_panel) + ld_data$is_genotype <- FALSE + R_mat <<- R_local + ld_data + } + message("QC track: basic harmonization retained ", nrow(sumstats), + " variants for summary-stat study ", study, ".") + if (nrow(sumstats) < 2) { + return(skipped_result(sumstats, reference_for_variants(sumstats$variant_id), + "fewer than two variants remain after basic QC")) + } + + preprocess <- list(sumstats = sumstats, LD_mat = R_mat) + if (!is.null(pip_cutoff_to_skip) && pip_cutoff_to_skip != 0) { + cutoff <- pip_cutoff_to_skip + if (cutoff < 0) cutoff <- 3 / nrow(sumstats) + message("QC track: running LD-independent single-effect initial screen for summary-stat study ", + study, ".") + pip_args <- list(z = sumstats$z, R = diag(nrow(sumstats)), + L = 1, max_iter = 1, n = n) + pip <- do.call(susie_rss, pip_args)$pip + if (!any(pip > cutoff)) { + message("Skipping follow-up analysis: No signals above PIP threshold ", cutoff) + message("QC track: skipping summary-stat study ", study, + ". No signals above PIP threshold ", cutoff, ".") + return(skipped_result(sumstats, reference_for_variants(sumstats$variant_id), + "no signals above PIP threshold")) + } + message("Follow-up on region: signals above PIP threshold ", cutoff, " detected.") + } + + outlier_number <- 0L + if (!is.null(qc_method) && !qc_method %in% c("none", "rss_qc")) { + message("QC track: running ", qc_method, " LD-mismatch QC for summary-stat study ", study, ".") + qc <- summary_stats_qc(sumstats = sumstats, LD_data = ld_data_with_local_R(sumstats), + n = n, method = qc_method) + sumstats <- qc$sumstats + R_mat <- qc$LD_mat + outlier_number <- qc$outlier_number + message("QC track: removed ", outlier_number, + " LD-mismatch outlier(s) for summary-stat study ", study, ".") + } + if (isTRUE(impute)) { + if (is.null(LD_data_for_qc$ref_panel)) { + warning("Skipping imputation for summary-stat study ", study, + ": LD_data does not include ref_panel.") + } else { + message("QC track: running imputation for summary-stat study ", study, ".") + imputed <- if (has_genotype) { + X_ref_scaled <- scale(X_ref) + X_ref_scaled[is.na(X_ref_scaled)] <- 0 + colnames(X_ref_scaled) <- colnames(X_ref) + raiss(LD_data_for_qc$ref_panel, sumstats, + genotype_matrix = X_ref_scaled, + svd_tol = if (is.null(impute_opts$svd_tol)) 1e-12 else impute_opts$svd_tol, + R2_threshold = impute_opts$R2_threshold, + minimum_ld = impute_opts$minimum_ld, + lamb = impute_opts$lamb) + } else { + if (is.null(LD_data_for_qc$block_metadata)) { + LD_data_for_qc$block_metadata <- .infer_single_ld_block_metadata(LD_data_for_qc$ref_panel) + } + raiss(LD_data_for_qc$ref_panel, sumstats, partition_LD_matrix(LD_data_for_qc), + rcond = impute_opts$rcond, + R2_threshold = impute_opts$R2_threshold, + minimum_ld = impute_opts$minimum_ld, + lamb = impute_opts$lamb) + } + sumstats <- imputed$result_filter + if (!is.null(imputed$LD_mat)) R_mat <- imputed$LD_mat + } + } + final_vars <- sumstats$variant_id + list( + rss_input = list(sumstats = sumstats, n = n, var_y = var_y), + LD_matrix = if (has_genotype) reference_for_variants(final_vars) else R_mat, + preprocess = preprocess, + outlier_number = outlier_number, + skipped = FALSE + ) } diff --git a/R/univariate_pipeline.R b/R/univariate_pipeline.R index 87391d3b..f92b70ff 100644 --- a/R/univariate_pipeline.R +++ b/R/univariate_pipeline.R @@ -193,7 +193,8 @@ load_study_LD <- function(ld_path, region) { #' @param LD_data A list from load_LD_matrix containing LD_matrix, LD_variants, #' ref_panel, block_metadata, and is_genotype flag. When is_genotype=TRUE #' (from return_genotype=TRUE), LD_matrix contains genotype X and susie_rss -#' uses the z+X interface. R is computed internally for QC/imputation. +#' uses the z+X interface. Local R is computed only for QC stages that +#' require a correlation matrix. #' @param n_sample Sample size. If 0, retrieved from the sumstat file. #' @param n_case Number of cases (for case-control studies). #' @param n_control Number of controls (for case-control studies). @@ -201,7 +202,9 @@ load_study_LD <- function(ld_path, region) { #' @param skip_region Character vector of regions to skip (format "chrom:start-end"). #' @param extract_region_name Gene/phenotype name to subset. #' @param region_name_col Column to filter for extract_region_name. -#' @param qc_method QC method: "slalom" or "dentist". +#' @param qc_method Summary-statistic QC method. \code{"slalom"} and +#' \code{"dentist"} run basic allele harmonization plus LD-mismatch QC; +#' \code{"rss_qc"} and \code{"none"} run basic allele harmonization only. #' @param finemapping_method One of "susie_rss", "single_effect", "bayesian_conditional_regression". #' @param finemapping_opts List of fine-mapping options (L, L_greedy, coverage, #' signal_cutoff, min_abs_corr). @@ -224,7 +227,7 @@ rss_analysis_pipeline <- function( sumstat_path, column_file_path, LD_data, n_sample = 0, n_case = 0, n_control = 0, region = NULL, skip_region = NULL, extract_region_name = NULL, region_name_col = NULL, - qc_method = c("slalom", "dentist"), + qc_method = c("slalom", "dentist", "rss_qc", "none"), finemapping_method = c("susie_rss", "single_effect", "bayesian_conditional_regression"), finemapping_opts = list( L = 20, L_greedy = 5, @@ -240,10 +243,15 @@ rss_analysis_pipeline <- function( use_X <- isTRUE(LD_data$is_genotype) || is_X_list if (use_X) { X_data <- LD_data$LD_matrix - # Compute R from first panel (or single panel) for QC/imputation - X_for_R <- if (is_X_list) X_data[[1]] else X_data - LD_data$LD_matrix <- compute_LD(X_for_R, method = "sample") - LD_data$is_genotype <- FALSE + LD_data$is_genotype <- TRUE + } + subset_X_data <- function(variants) { + if (!use_X) return(NULL) + if (is_X_list) { + lapply(X_data, function(Xk) Xk[, variants, drop = FALSE]) + } else { + X_data[, variants, drop = FALSE] + } } res <- list() rss_input <- load_rss_data( @@ -261,44 +269,52 @@ rss_analysis_pipeline <- function( return(list(rss_data_analyzed = sumstats)) } - # Preprocess: QC and imputation require LD_data with correlation matrix R. - # When using X path, compute R from X for QC/imputation, then pass X to susie_rss. - preprocess_results <- rss_basic_qc(sumstats, LD_data, skip_region = skip_region, keep_indel = keep_indel) - sumstats <- preprocess_results$sumstats - LD_mat <- preprocess_results$LD_mat + qc_method_arg <- if (is.null(qc_method)) NULL else match.arg(qc_method) + qc_method <- qc_method_arg + qc_record <- summary_stats_qc( + rss_input = rss_input, + LD_data = LD_data, + keep_indel = keep_indel, + skip_region = skip_region, + pip_cutoff_to_skip = pip_cutoff_to_skip, + qc_method = if (is.null(qc_method_arg)) "none" else qc_method_arg, + impute = impute, + impute_opts = impute_opts, + return_on_skip = "preprocess", + R_finite = R_finite, + R_mismatch = R_mismatch + ) + if (!is.null(qc_record$rss_input)) { + preprocess_results <- qc_record$preprocess + sumstats <- qc_record$rss_input$sumstats + LD_mat <- qc_record$LD_matrix + qc_results <- list(outlier_number = qc_record$outlier_number) + } else { + # Compatibility for tests or callers that mock the historical + # LD-mismatch-only summary_stats_qc() return shape. + preprocess_results <- list(sumstats = qc_record$sumstats, LD_mat = qc_record$LD_mat) + sumstats <- qc_record$sumstats + LD_mat <- qc_record$LD_mat + qc_results <- qc_record + if (isTRUE(impute)) { + LD_matrix <- partition_LD_matrix(LD_data) + impute_results <- raiss(LD_data$ref_panel, sumstats, LD_matrix, + rcond = impute_opts$rcond, + R2_threshold = impute_opts$R2_threshold, + minimum_ld = impute_opts$minimum_ld, + lamb = impute_opts$lamb) + sumstats <- impute_results$result_filter + LD_mat <- impute_results$LD_mat + } + qc_record$skipped <- FALSE + } if (nrow(sumstats) == 0) { message("No variants left after preprocessing. Returning empty results.") return(list(rss_data_analyzed = sumstats)) } - - # PIP screening (always uses R) - if (pip_cutoff_to_skip != 0) { - if (pip_cutoff_to_skip < 0) pip_cutoff_to_skip <- 3 / nrow(sumstats) - top_model_pip <- susie_rss(z = sumstats$z, R = LD_mat, L = 1, L_greedy = NULL, max_iter = 1, - n = n, var_y = var_y, R_finite = R_finite, R_mismatch = R_mismatch)$pip - if (!any(top_model_pip > pip_cutoff_to_skip)) { - message("Skipping follow-up analysis: No signals above PIP threshold ", pip_cutoff_to_skip) - return(list(rss_data_analyzed = sumstats)) - } - message("Follow-up on region: signals above PIP threshold ", pip_cutoff_to_skip, " detected.") - } - - # Quality control (always uses R) - if (!is.null(qc_method)) { - qc_results <- summary_stats_qc(sumstats, LD_data, n = n, method = qc_method) - sumstats <- qc_results$sumstats - LD_mat <- qc_results$LD_mat - } - - # Imputation (always uses R) - if (impute) { - LD_matrix <- partition_LD_matrix(LD_data) - impute_results <- raiss(LD_data$ref_panel, sumstats, LD_matrix, - rcond = impute_opts$rcond, R2_threshold = impute_opts$R2_threshold, - minimum_ld = impute_opts$minimum_ld, lamb = impute_opts$lamb) - sumstats <- impute_results$result_filter - LD_mat <- impute_results$LD_mat + if (isTRUE(qc_record$skipped)) { + return(list(rss_data_analyzed = sumstats)) } # Fine-mapping: use X_mat if available, otherwise R @@ -306,17 +322,7 @@ rss_analysis_pipeline <- function( pri_coverage <- finemapping_opts$coverage[1] sec_coverage <- if (length(finemapping_opts$coverage) > 1) finemapping_opts$coverage[-1] else NULL - # When using X path, subset X to QCed/imputed variants. - # For mixture panels (list), subset each panel; susie_rss accepts X=list(). - if (use_X) { - if (is_X_list) { - X_mat_sub <- lapply(X_data, function(Xk) Xk[, sumstats$variant_id, drop = FALSE]) - } else { - X_mat_sub <- X_data[, sumstats$variant_id, drop = FALSE] - } - } else { - X_mat_sub <- NULL - } + X_mat_sub <- subset_X_data(sumstats$variant_id) res <- susie_rss_pipeline(sumstats, LD_mat = if (use_X) NULL else LD_mat, @@ -347,7 +353,9 @@ rss_analysis_pipeline <- function( } .run_reanalysis <- function(sumstats, LD_mat, method, finemapping_opts, pri_coverage, sec_coverage) { - susie_rss_pipeline(sumstats, LD_mat, + susie_rss_pipeline(sumstats, + LD_mat = if (use_X) NULL else LD_mat, + X_mat = subset_X_data(sumstats$variant_id), n = n, var_y = var_y, L = finemapping_opts$L, L_greedy = finemapping_opts$L_greedy, analysis_method = method, diff --git a/man/colocboost_analysis.Rd b/man/colocboost_analysis.Rd new file mode 100644 index 00000000..e7b23584 --- /dev/null +++ b/man/colocboost_analysis.Rd @@ -0,0 +1,114 @@ +% Generated by roxygen2: do not edit by hand +\name{colocboost_analysis} +\alias{colocboost_analysis} +\title{ColocBoost analysis with optional pipeline QC} +\usage{ +colocboost_analysis( + ..., + missing_rate_thresh = NULL, + maf_cutoff = NULL, + xvar_cutoff = NULL, + ld_reference_meta_file = NULL, + pip_cutoff_to_skip_ind = NULL, + keep_indel = TRUE, + pip_cutoff_to_skip_sumstat = NULL, + qc_method = NULL, + impute = FALSE, + impute_opts = list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5, lamb = 0.01), + LD_reference_info = NULL, + variant_convention = c("A2_A1", "A1_A2") +) +} +\arguments{ +\item{...}{Arguments passed to \code{colocboost::colocboost()}, including +data inputs such as \code{X}, \code{Y}, \code{sumstat}, \code{LD}, +\code{X_ref}, \code{dict_YX}, \code{dict_sumstatLD}, +\code{outcome_names}, and all ColocBoost model/post-processing options. +QC can only inspect inputs that are supplied by name.} + +\item{missing_rate_thresh, maf_cutoff, xvar_cutoff, ld_reference_meta_file, pip_cutoff_to_skip_ind}{Individual-level QC controls. +If all are \code{NULL}, individual-level QC is not run.} + +\item{keep_indel, pip_cutoff_to_skip_sumstat, qc_method, impute, impute_opts}{Summary-statistic QC controls. +\code{qc_method = "none"} and \code{qc_method = "rss_qc"} run basic allele +harmonization without LD-mismatch outlier detection. Imputation is only run +when \code{impute = TRUE}.} + +\item{LD_reference_info}{Optional LD reference information for +summary-statistic QC. This is only needed when the native \code{LD} matrix +row/column names or \code{X_ref} column names are missing or are not parseable +genomic variant IDs. It can be a .bim/.pvar/.pvar.zst file path, a data.frame +with variant metadata, or a \code{load_LD_matrix()} result. This is a QC-only +argument and is not passed to \code{colocboost::colocboost()}.} + +\item{variant_convention}{Allele order used by native ColocBoost-style +\code{sumstat$variant} and LD/X_ref names when deriving QC inputs: +\code{"A2_A1"} for pecotmr canonical \code{chr:pos:A2:A1}, or +\code{"A1_A2"} for \code{chr:pos:A1:A2}.} +} +\value{ +The object returned by \code{colocboost::colocboost()}. +} +\description{ +This wrapper keeps the direct \code{colocboost()} argument surface. All +ColocBoost inputs and model parameters are supplied through \code{...}. When +no QC options are requested, the call is passed directly to +\code{colocboost::colocboost()}. When QC options are requested, the wrapper +inspects named \code{X}/\code{Y} and/or \code{sumstat}/\code{LD}/\code{X_ref} +arguments in \code{...}, runs the relevant reusable QC step, and then calls +ColocBoost on the cleaned inputs. If the required named inputs are not +available, QC is skipped with a warning and the original ColocBoost call is +used. +} +\details{ +Use \code{colocboost_analysis()} the same way you would use +\code{colocboost::colocboost()}: pass the native ColocBoost arguments by +name, for example \code{X}, \code{Y}, \code{sumstat}, \code{LD}, +\code{X_ref}, \code{dict_YX}, \code{dict_sumstatLD}, +\code{outcome_names}, \code{focal_outcome_idx}, \code{effect_est}, +\code{effect_se}, \code{effect_n}, \code{M}, and other ColocBoost model or +post-processing options. These arguments are forwarded unchanged unless one +or more QC controls are requested. + +Individual-level QC is only attempted when at least one individual QC control +is non-\code{NULL} and named \code{X} and \code{Y} inputs are available in +\code{...}. Summary-statistic QC is only attempted when \code{qc_method}, +\code{pip_cutoff_to_skip_sumstat}, \code{impute = TRUE}, or +\code{LD_reference_info} is supplied and named \code{sumstat} plus either +\code{LD}, \code{X_ref}, or \code{LD_reference_info} are available. +\code{qc_method = "none"} and \code{qc_method = "rss_qc"} mean run basic +allele/variant harmonization only; they do not run SLALOM/DENTIST +LD-mismatch QC. RAISS imputation is controlled separately by +\code{impute = TRUE}. + +If no QC controls are supplied, this function is a thin direct call to +\code{colocboost::colocboost(...)}. When QC removes outcomes, +\code{outcome_names} and \code{focal_outcome_idx} are updated to match the +post-QC outcome order. If the requested focal outcome is removed by QC, +\code{focal_outcome_idx} is set to \code{NULL} with a warning. +} +\examples{ +\dontrun{ +# Direct ColocBoost call without QC. +fit <- colocboost_analysis(X = X, Y = Y, M = 500) + +# Summary-statistic input with basic allele/variant harmonization only. +fit <- colocboost_analysis(sumstat = sumstat, LD = LD, + qc_method = "none", M = 500) + +# Summary-statistic input with LD-mismatch QC and RAISS imputation. +fit <- colocboost_analysis(sumstat = sumstat, LD = LD, + qc_method = "slalom", impute = TRUE) + +# Use richer LD metadata from load_LD_matrix() for QC, while still passing +# ColocBoost's native LD input. +ld_data <- load_LD_matrix(ld_meta_file, region) +fit <- colocboost_analysis(sumstat = sumstat, LD = ld_data$LD_matrix, + LD_reference_info = ld_data, qc_method = "none") + +# Individual-level input with explicit genotype QC thresholds. +fit <- colocboost_analysis(X = X, Y = Y, + missing_rate_thresh = 0.1, + maf_cutoff = 0.0005) +} +} diff --git a/man/colocboost_analysis_pipeline.Rd b/man/colocboost_pipeline.Rd similarity index 60% rename from man/colocboost_analysis_pipeline.Rd rename to man/colocboost_pipeline.Rd index 0268e2dd..a708b883 100644 --- a/man/colocboost_analysis_pipeline.Rd +++ b/man/colocboost_pipeline.Rd @@ -1,10 +1,10 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/colocboost_pipeline.R -\name{colocboost_analysis_pipeline} -\alias{colocboost_analysis_pipeline} -\title{Multi-trait colocalization analysis pipeline} +\name{colocboost_pipeline} +\alias{colocboost_pipeline} +\title{Multi-trait colocalization analysis protocol pipeline} \usage{ -colocboost_analysis_pipeline( +colocboost_pipeline( region_data, focal_trait = NULL, event_filters = NULL, @@ -15,7 +15,7 @@ colocboost_analysis_pipeline( pip_cutoff_to_skip_ind = 0, keep_indel = TRUE, pip_cutoff_to_skip_sumstat = 0, - qc_method = c("slalom", "dentist", "none"), + qc_method = NULL, impute = TRUE, impute_opts = list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5, lamb = 0.01), ... @@ -34,25 +34,21 @@ colocboost_analysis_pipeline( \item{pip_cutoff_to_skip_sumstat}{A vector of cutoff values for skipping analysis based on PIP values for each sumstat Default is 0.} -\item{qc_method}{Quality control method to use. Options are "slalom" or "dentist" (default: "slalom").} +\item{qc_method}{Quality control method to use. Options are "none", "rss_qc", "slalom", or "dentist". \code{NULL} and \code{"rss_qc"} are treated as \code{"none"} for basic-only summary-stat preprocessing.} \item{impute}{Logical; if TRUE, performs imputation for outliers identified in the analysis (default: TRUE).} \item{impute_opts}{A list of imputation options including rcond, R2_threshold, and minimum_ld (default: list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5)).} + +\item{...}{Additional arguments passed to the ColocBoost analysis calls.} } \value{ -A list containing the individual_data and sumstat_data after QC: -individual_data contains the following components if exist -\itemize{ - \item Y: A list of residualized phenotype values for all tasks. - \item X: A list of residualized genotype matrices all tasks. -} -sumstat_data contains the following components if exist -\itemize{ - \item sumstats: A list of summary statistics f or the matched LD_info, each sublist contains sumstats, n, var_y from \code{load_rss_data}. - \item LD_info: A list of LD information, each sublist contains LD_variants, LD_matrix, ref_panel \code{load_LD_matrix}. -} +A list containing \code{xqtl_coloc}, \code{joint_gwas}, \code{separate_gwas}, +and \code{computing_time}. Analyses that are not requested or cannot be run +return \code{NULL} in their corresponding result slot. } \description{ -This function perform a multi-trait colocalization using ColocBoost +This function performs protocol-level multi-trait colocalization using +ColocBoost. It accepts loaded regional data, performs QC once, then runs the +requested xQTL-only, joint GWAS, and separate GWAS analyses. } diff --git a/man/qc_individual_data.Rd b/man/qc_individual_data.Rd new file mode 100644 index 00000000..1f1e070f --- /dev/null +++ b/man/qc_individual_data.Rd @@ -0,0 +1,40 @@ +% Generated by roxygen2: do not edit by hand +\name{qc_individual_data} +\alias{qc_individual_data} +\title{Run reusable individual-level QC} +\usage{ +qc_individual_data( + X, + Y, + maf = NULL, + X_variance = NULL, + missing_rate_thresh = NULL, + maf_cutoff = 5e-04, + xvar_cutoff = 0, + ld_reference_meta_file = NULL, + keep_indel = TRUE, + pip_cutoff_to_skip = 0, + context = NULL +) +} +\arguments{ +\item{X}{Genotype matrix or named list of genotype matrices.} +\item{Y}{Phenotype vector/matrix or named list of phenotype matrices.} +\item{maf}{Optional MAF vector or named list.} +\item{X_variance}{Optional variant variance vector or named list.} +\item{missing_rate_thresh}{Maximum missing genotype rate.} +\item{maf_cutoff}{Minimum MAF cutoff.} +\item{xvar_cutoff}{Minimum genotype variance cutoff.} +\item{ld_reference_meta_file}{Optional LD reference metadata file.} +\item{keep_indel}{Whether indel variants are kept during LD-reference filtering.} +\item{pip_cutoff_to_skip}{Optional single-effect PIP cutoff.} +\item{context}{Optional context label for matrix inputs.} +} +\value{ +A named list of cleaned context-level \code{X}/\code{Y} records, or one +cleaned record for matrix inputs. +} +\description{ +Runs individual-level preprocessing by reusing existing pecotmr genotype +filtering helpers and optional single-effect screening. +} diff --git a/man/region_data_to_colocboost_input.Rd b/man/region_data_to_colocboost_input.Rd new file mode 100644 index 00000000..b49f57dc --- /dev/null +++ b/man/region_data_to_colocboost_input.Rd @@ -0,0 +1,18 @@ +% Generated by roxygen2: do not edit by hand +\name{region_data_to_colocboost_input} +\alias{region_data_to_colocboost_input} +\title{Convert loaded regional data to ColocBoost inputs} +\usage{ +region_data_to_colocboost_input(region_data) +} +\arguments{ +\item{region_data}{A list returned by \code{load_multitask_regional_data()}.} +} +\value{ +A structured list containing \code{colocboost_input}, \code{qc_input}, and +\code{source_info}. +} +\description{ +Builds direct ColocBoost input records plus QC input records from loaded +regional data. +} diff --git a/man/region_data_to_ind_input.Rd b/man/region_data_to_ind_input.Rd new file mode 100644 index 00000000..f3ee98e4 --- /dev/null +++ b/man/region_data_to_ind_input.Rd @@ -0,0 +1,18 @@ +% Generated by roxygen2: do not edit by hand +\name{region_data_to_ind_input} +\alias{region_data_to_ind_input} +\title{Convert loaded regional data to individual-level inputs} +\usage{ +region_data_to_ind_input(region_data) +} +\arguments{ +\item{region_data}{A list returned by \code{load_multitask_regional_data()}.} +} +\value{ +A list containing \code{X}, \code{Y}, \code{maf}, \code{X_variance}, and +source information. +} +\description{ +Extracts individual-level matrices from loaded regional data without running +QC or model fitting. +} diff --git a/man/region_data_to_rss_input.Rd b/man/region_data_to_rss_input.Rd new file mode 100644 index 00000000..b6006c01 --- /dev/null +++ b/man/region_data_to_rss_input.Rd @@ -0,0 +1,17 @@ +% Generated by roxygen2: do not edit by hand +\name{region_data_to_rss_input} +\alias{region_data_to_rss_input} +\title{Convert loaded regional data to RSS inputs} +\usage{ +region_data_to_rss_input(region_data) +} +\arguments{ +\item{region_data}{A list returned by \code{load_multitask_regional_data()}.} +} +\value{ +A list containing named RSS inputs, matched LD data, and source information. +} +\description{ +Extracts loaded RSS summary-statistic records and their matched LD/reference +data without running QC or model fitting. +} diff --git a/man/rss_analysis_pipeline.Rd b/man/rss_analysis_pipeline.Rd index 99f27291..df410a0e 100644 --- a/man/rss_analysis_pipeline.Rd +++ b/man/rss_analysis_pipeline.Rd @@ -15,7 +15,7 @@ rss_analysis_pipeline( skip_region = NULL, extract_region_name = NULL, region_name_col = NULL, - qc_method = c("slalom", "dentist"), + qc_method = c("slalom", "dentist", "rss_qc", "none"), finemapping_method = c("susie_rss", "single_effect", "bayesian_conditional_regression"), finemapping_opts = list(L = 20, L_greedy = 5, coverage = c(0.95, 0.7, 0.5), signal_cutoff = 0.025, min_abs_corr = 0.8), @@ -37,7 +37,8 @@ rss_analysis_pipeline( \item{LD_data}{A list from load_LD_matrix containing LD_matrix, LD_variants, ref_panel, block_metadata, and is_genotype flag. When is_genotype=TRUE (from return_genotype=TRUE), LD_matrix contains genotype X and susie_rss -uses the z+X interface. R is computed internally for QC/imputation.} +uses the z+X interface. Local R is computed only for QC stages that +require a correlation matrix.} \item{n_sample}{Sample size. If 0, retrieved from the sumstat file.} @@ -53,7 +54,9 @@ uses the z+X interface. R is computed internally for QC/imputation.} \item{region_name_col}{Column to filter for extract_region_name.} -\item{qc_method}{QC method: "slalom" or "dentist".} +\item{qc_method}{Summary-statistic QC method. \code{"slalom"} and +\code{"dentist"} run basic allele harmonization plus LD-mismatch QC; +\code{"rss_qc"} and \code{"none"} run basic allele harmonization only.} \item{finemapping_method}{One of "susie_rss", "single_effect", "bayesian_conditional_regression".} diff --git a/man/rss_basic_qc.Rd b/man/rss_basic_qc.Rd index b67620fc..020d0f03 100644 --- a/man/rss_basic_qc.Rd +++ b/man/rss_basic_qc.Rd @@ -4,7 +4,13 @@ \alias{rss_basic_qc} \title{Preprocess input data for RSS analysis} \usage{ -rss_basic_qc(sumstats, LD_data, skip_region = NULL, keep_indel = TRUE) +rss_basic_qc( + sumstats, + LD_data, + skip_region = NULL, + keep_indel = TRUE, + return_LD_mat = TRUE +) } \arguments{ \item{sumstats}{A data frame containing summary statistics with columns "chrom", "pos", "A1", and "A2".} @@ -13,6 +19,11 @@ rss_basic_qc(sumstats, LD_data, skip_region = NULL, keep_indel = TRUE) \item{skip_region}{A character vector specifying regions to be skipped in the analysis (optional). Each region should be in the format "chrom:start-end" (e.g., "1:1000000-2000000").} + +\item{return_LD_mat}{Logical; if \code{FALSE}, return only harmonized +summary statistics and skip LD-matrix subsetting. This is useful when the +reference input is genotype-backed \code{X_ref}. Defaults to \code{TRUE} +for backwards compatibility.} } \value{ A list containing the processed summary statistics and LD matrix. diff --git a/man/summary_stats_qc.Rd b/man/summary_stats_qc.Rd index 1aa065da..139c2aa7 100644 --- a/man/summary_stats_qc.Rd +++ b/man/summary_stats_qc.Rd @@ -4,34 +4,94 @@ \alias{summary_stats_qc} \title{Perform Quality Control on Summary Statistics} \usage{ -summary_stats_qc(sumstats, LD_data, n = NULL, method = c("slalom", "dentist")) +summary_stats_qc( + sumstats, + LD_data, + n = NULL, + method = c("slalom", "dentist"), + rss_input = NULL, + keep_indel = TRUE, + skip_region = NULL, + pip_cutoff_to_skip = 0, + qc_method = NULL, + impute = FALSE, + impute_opts = list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5, lamb = 0.01), + study = NULL, + var_y = NULL, + return_on_skip = c("null", "preprocess"), + R_finite = NULL, + R_mismatch = NULL +) } \arguments{ \item{sumstats}{A data frame containing the processed summary statistics.} -\item{LD_data}{A list containing the combined LD variants data generated by load_LD_matrix.} +\item{LD_data}{A list containing the combined LD variants data generated by +\code{load_LD_matrix()}.} \item{n}{Sample size for the LD reference panel (used by dentist method).} -\item{method}{The quality control method to use. Options are "slalom" or "dentist" (default: "slalom").} +\item{method}{The quality control method to use. Options are \code{"slalom"} +or \code{"dentist"} (default: \code{"slalom"}).} + +\item{rss_input}{Optional loaded RSS input, either one \code{load_rss_data()} +result or a named list of them. Supplying this selects the additional +combined RSS/ColocBoost QC workflow. A single RSS record is detected by +structure, not by the name \code{sumstats}, so a multi-study list may safely +include a study named \code{"sumstats"}.} + +\item{keep_indel, skip_region, pip_cutoff_to_skip, qc_method, impute, impute_opts, study, var_y, return_on_skip, R_finite, R_mismatch}{Additional controls for the combined RSS/ColocBoost QC workflow. +They are ignored by the historical LD-mismatch-only call unless +\code{rss_input} or combined-QC options are supplied.} } \value{ -A list containing the quality-controlled summary statistics and updated LD matrix. - - sumstats: The quality-controlled summary statistics data frame. - - LD_mat: The updated LD matrix after quality control. - - outlier_number: The number of outlier variants removed. +A list containing the quality-controlled summary statistics and updated LD +matrix for the historical call: +\itemize{ + \item sumstats: The quality-controlled summary statistics data frame. + \item LD_mat: The updated LD matrix after quality control. + \item outlier_number: The number of outlier variants removed. +} +When \code{rss_input} or combined-QC controls are supplied, returns a cleaned +RSS/LD record for one RSS record, or a named list of records for a list of RSS +records. } \description{ -This function performs quality control on the processed summary statistics using the specified method. -It wraps \code{\link{ld_mismatch_qc}} and handles subsetting of the summary statistics and LD matrix. +This function performs quality control on the processed summary statistics +using the specified method. It wraps \code{\link{ld_mismatch_qc}} and handles +subsetting of the summary statistics and LD matrix. } \details{ -This function applies the specified quality control method to the processed summary statistics - via \code{\link{ld_mismatch_qc}}, then subsets the summary statistics and LD matrix to keep - only non-outlier variants. +This function applies the specified quality control method to the processed +summary statistics via \code{\link{ld_mismatch_qc}}, then subsets the summary +statistics and LD matrix to keep only non-outlier variants. + +As an additional workflow for ColocBoost/RSS pipelines, callers may supply +\code{rss_input} or combined-QC controls. That path first runs +\code{rss_basic_qc()}, optional PIP screening, optional LD-mismatch QC through +this same function, and optional RAISS imputation. The combined path normalizes +one or many RSS records to the same loop internally; only true single-record +input is unwrapped on return. +\code{qc_method = NULL}, \code{"none"}, and \code{"rss_qc"} all mean +basic-only summary-stat preprocessing without SLALOM/DENTIST outlier QC. + +When the combined path receives genotype-backed reference data +(\code{X_ref}), basic harmonization avoids constructing an LD matrix. PIP +screening uses an LD-independent single-effect screen with +\code{susie_rss(R = diag(p), L = 1, max_iter = 1)}, and LD-mismatch QC +computes only the filtered local correlation matrix required by +SLALOM/DENTIST. RAISS imputation temporarily centers/scales genotype-backed +\code{X_ref} before using the whole-region genotype/SVD path. This avoids +LD-block partition artifacts while matching the LD scale used by +\code{compute_LD(X_ref)}. Final ColocBoost calls still keep the original +\code{X_ref} as the reference input. } \examples{ # Perform SLALOM quality control (default) qc_results <- summary_stats_qc(sumstats, LD_data, method = "slalom") +# Additional combined basic-only RSS QC. +qc_results <- summary_stats_qc(rss_input = rss_input, LD_data = LD_data, + qc_method = "none") + } diff --git a/tests/testthat/test_colocboost_pipeline.R b/tests/testthat/test_colocboost_pipeline.R index 969360c1..e9214765 100644 --- a/tests/testthat/test_colocboost_pipeline.R +++ b/tests/testthat/test_colocboost_pipeline.R @@ -82,69 +82,6 @@ test_that("pip_cutoff correct length vector works", { expect_type(result, "list") }) -# ---- is_valid_sumstat_entry ---- -test_that("is_valid_sumstat_entry returns TRUE for valid data", { - ss <- data.frame(z = c(1.5, -2.3, 0.8), n = c(1000, 1000, 1000), - variant = c("chr1:100:A:G", "chr1:200:T:C", "chr1:300:G:A")) - expect_true(pecotmr:::is_valid_sumstat_entry(ss)) -}) - -test_that("is_valid_sumstat_entry returns FALSE for too few variants", { - ss <- data.frame(z = 1.5, n = 1000, variant = "chr1:100:A:G") - expect_false(pecotmr:::is_valid_sumstat_entry(ss, min_variants = 2)) -}) - -test_that("is_valid_sumstat_entry returns FALSE for all-NA z-scores", { - ss <- data.frame(z = c(NA, NA, NA), n = c(1000, 1000, 1000), - variant = c("v1", "v2", "v3")) - expect_false(pecotmr:::is_valid_sumstat_entry(ss)) -}) - -test_that("is_valid_sumstat_entry returns FALSE for NULL input", { - expect_false(pecotmr:::is_valid_sumstat_entry(NULL)) -}) - -test_that("is_valid_sumstat_entry returns FALSE for zero sample size", { - ss <- data.frame(z = c(1.5, -2.3), n = c(0, 0), variant = c("v1", "v2")) - expect_false(pecotmr:::is_valid_sumstat_entry(ss)) -}) - -# ---- filter_valid_sumstats ---- -test_that("filter_valid_sumstats removes invalid entries", { - good <- data.frame(z = c(1.0, 2.0, 3.0), n = c(100, 100, 100), - variant = c("v1", "v2", "v3")) - bad <- data.frame(z = c(NA, NA), n = c(0, 0), variant = c("v1", "v2")) - sumstats <- list(study_a = good, study_b = bad) - LD_mat <- list(ld1 = matrix(1, 3, 3)) - LD_match <- c("ld1", "ld1") - result <- pecotmr:::filter_valid_sumstats(sumstats, LD_mat, LD_match) - expect_equal(length(result$sumstats), 1) - expect_equal(names(result$sumstats), "study_a") -}) - -test_that("filter_valid_sumstats returns NULL when all invalid", { - bad1 <- data.frame(z = NA, n = 0, variant = "v1") - bad2 <- data.frame(z = NA, n = 0, variant = "v1") - sumstats <- list(s1 = bad1, s2 = bad2) - LD_mat <- list(ld1 = matrix(1)) - LD_match <- c("ld1", "ld1") - expect_message( - result <- pecotmr:::filter_valid_sumstats(sumstats, LD_mat, LD_match), - "No valid" - ) - expect_null(result) -}) - -test_that("filter_valid_sumstats keeps all when all valid", { - make_ss <- function() data.frame(z = c(1, 2, 3), n = c(100, 100, 100), - variant = c("v1", "v2", "v3")) - sumstats <- list(s1 = make_ss(), s2 = make_ss()) - LD_mat <- list(ld1 = matrix(1, 3, 3)) - LD_match <- c("ld1", "ld1") - result <- pecotmr:::filter_valid_sumstats(sumstats, LD_mat, LD_match) - expect_equal(length(result$sumstats), 2) -}) - # =========================================================================== # Tests from test_colocboost_pipeline_comprehensive.R # =========================================================================== @@ -227,13 +164,934 @@ make_sumstat_region_data <- function(n_variants = 5, n_studies = 2) { ) } +test_that("qc_regional_data treats NULL qc_method as basic-only none", { + region_data <- make_sumstat_region_data(n_variants = 5, n_studies = 1) + captured_qc_method <- NULL + LD_mat <- diag(1) + rownames(LD_mat) <- colnames(LD_mat) <- "chr1:100:A:G" + + local_mocked_bindings( + summary_stats_qc = function(..., qc_method) { + captured_qc_method <<- qc_method + list(study1 = list( + rss_input = list(sumstats = data.frame(variant_id = "chr1:100:A:G")), + LD_matrix = LD_mat + )) + } + ) + + result <- pecotmr:::qc_regional_data(region_data, qc_method = NULL, impute = FALSE) + expect_equal(captured_qc_method, "none") + expect_type(result, "list") +}) + +test_that("colocboost_pipeline default qc_method resolves to basic-only none", { + region_data <- make_individual_region_data(n = 12, p = 5, n_contexts = 1, n_events = 1) + captured_qc_method <- NULL + + local_mocked_bindings( + qc_regional_data = function(region_data, ..., qc_method) { + captured_qc_method <<- qc_method + list( + individual_data = list( + Y = region_data$individual_data$residual_Y, + X = region_data$individual_data$residual_X + ), + sumstat_data = NULL + ) + }, + .run_colocboost = function(label, ...) { + list(result = list(label = label), time = as.difftime(0, units = "secs")) + } + ) + + result <- suppressMessages(colocboost_pipeline(region_data)) + expect_equal(captured_qc_method, "none") + expect_equal(result$xqtl_coloc$label, "xQTL-only ColocBoost") +}) + # =========================================================================== -# 1. colocboost_analysis_pipeline: no analysis flags returns empty results +# New direct-input ColocBoost helpers +# =========================================================================== +test_that("RegionalData adapters expose individual and RSS inputs", { + ind_region <- make_individual_region_data(n = 12, p = 5, n_contexts = 1, n_events = 2) + ind_input <- region_data_to_ind_input(ind_region) + expect_true(ind_input$source_info$has_individual) + expect_equal(names(ind_input$X), "ctx1") + expect_equal(names(ind_input$Y), "ctx1") + + rss_region <- make_sumstat_region_data(n_variants = 5, n_studies = 2) + rss_input <- region_data_to_rss_input(rss_region) + expect_true(rss_input$source_info$has_sumstat) + expect_equal(names(rss_input$rss_input), c("study1", "study2")) + expect_true(all(names(rss_input$rss_input) %in% names(rss_input$LD_data))) +}) + +test_that("RegionalData individual adapter restores context names from residual_Y", { + ind_region <- make_individual_region_data(n = 12, p = 5, n_contexts = 2, n_events = 1) + names(ind_region$individual_data$residual_X) <- NULL + names(ind_region$individual_data$maf) <- NULL + ind_region$individual_data$X_variance <- lapply( + ind_region$individual_data$residual_X, + function(x) matrixStats::colVars(x) + ) + + ind_input <- region_data_to_ind_input(ind_region) + expect_equal(names(ind_input$X), c("ctx1", "ctx2")) + expect_equal(names(ind_input$maf), c("ctx1", "ctx2")) + expect_equal(names(ind_input$X_variance), c("ctx1", "ctx2")) + + converted <- region_data_to_colocboost_input(ind_region) + expect_equal(length(converted$colocboost_input$X), 2) + expect_equal(nrow(converted$colocboost_input$dict_YX), 2) +}) + +test_that("RegionalData adapters handle missing data without fabricating inputs", { + empty_region <- list(individual_data = NULL, sumstat_data = NULL) + + ind_input <- region_data_to_ind_input(empty_region) + expect_false(ind_input$source_info$has_individual) + expect_null(ind_input$X) + expect_null(ind_input$Y) + + rss_input <- region_data_to_rss_input(empty_region) + expect_false(rss_input$source_info$has_sumstat) + expect_equal(rss_input$rss_input, list()) + expect_equal(rss_input$LD_data, list()) + + converted <- region_data_to_colocboost_input(empty_region) + expect_equal(converted$colocboost_input, list()) + expect_false(converted$source_info$individual$has_individual) + expect_false(converted$source_info$sumstat$has_sumstat) +}) + +test_that("region_data_to_rss_input keeps duplicate study names unique by LD group", { + base_region <- make_sumstat_region_data(n_variants = 4, n_studies = 1) + study <- base_region$sumstat_data$sumstats[[1]] + ld_info <- base_region$sumstat_data$LD_info[[1]] + region_data <- list( + individual_data = NULL, + sumstat_data = list( + sumstats = list(study, study), + LD_info = list(ldA = ld_info, ldB = ld_info) + ) + ) + + rss_input <- region_data_to_rss_input(region_data) + expect_equal(names(rss_input$rss_input), c("study1", "study1.1")) + expect_equal(names(rss_input$LD_data), c("study1", "study1.1")) + expect_equal(unname(rss_input$source_info$ld_group), c("ldA", "ldB")) +}) + +test_that("region_data_to_colocboost_input returns core and QC inputs", { + region_data <- make_individual_region_data(n = 12, p = 5, n_contexts = 1, n_events = 2) + converted <- region_data_to_colocboost_input(region_data) + expect_true("colocboost_input" %in% names(converted)) + expect_true("qc_input" %in% names(converted)) + expect_true("source_info" %in% names(converted)) + expect_equal(length(converted$colocboost_input$X), 1) + expect_equal(length(converted$colocboost_input$Y), 2) + expect_equal(nrow(converted$colocboost_input$dict_YX), 2) +}) + +test_that("region_data_to_colocboost_input preserves loaded X_ref instead of precomputing LD", { + region_data <- make_sumstat_region_data(n_variants = 5, n_studies = 1) + variants <- region_data$sumstat_data$sumstats[[1]][[1]]$sumstats$variant_id + X_ref <- matrix(rnorm(50), 10, 5) + colnames(X_ref) <- variants + region_data$sumstat_data$LD_info[[1]]$LD_matrix <- X_ref + region_data$sumstat_data$LD_info[[1]]$is_genotype <- TRUE + + converted <- region_data_to_colocboost_input(region_data) + + expect_null(converted$colocboost_input$LD) + expect_equal(length(converted$colocboost_input$X_ref), 1) + expect_equal(dim(converted$colocboost_input$X_ref[[1]]), c(10, 5)) + expect_equal(colnames(converted$colocboost_input$X_ref[[1]]), variants) +}) + +test_that("region_data_to_colocboost_input preserves duplicated outcome names across contexts", { + region_data <- make_individual_region_data(n = 12, p = 5, n_contexts = 2, n_events = 1) + converted <- region_data_to_colocboost_input(region_data) + + expect_equal(names(converted$colocboost_input$Y), c("ctx1_event1", "ctx2_event1")) + expect_equal(length(converted$colocboost_input$Y), 2) + expect_equal(nrow(converted$colocboost_input$dict_YX), 2) + expect_equal(converted$colocboost_input$dict_YX[, "X"], c(1, 2)) +}) + +test_that("region_data_to_colocboost_input deduplicates shared individual X", { + region_data <- make_individual_region_data(n = 12, p = 5, n_contexts = 2, n_events = 1) + region_data$individual_data$residual_X$ctx2 <- region_data$individual_data$residual_X$ctx1 + converted <- region_data_to_colocboost_input(region_data) + + expect_equal(length(converted$colocboost_input$X), 1) + expect_equal(length(converted$colocboost_input$Y), 2) + expect_equal(converted$colocboost_input$dict_YX[, "X"], c(1, 1)) +}) + +test_that("region_data_to_colocboost_input combines individual and RSS inputs", { + ind_region <- make_individual_region_data(n = 12, p = 5, n_contexts = 1, n_events = 2) + rss_region <- make_sumstat_region_data(n_variants = 5, n_studies = 1) + region_data <- list( + individual_data = ind_region$individual_data, + sumstat_data = rss_region$sumstat_data + ) + + converted <- region_data_to_colocboost_input(region_data) + expect_equal(length(converted$colocboost_input$X), 1) + expect_equal(length(converted$colocboost_input$Y), 2) + expect_equal(nrow(converted$colocboost_input$dict_YX), 2) + expect_equal(length(converted$colocboost_input$sumstat), 1) + expect_equal(length(converted$colocboost_input$LD), 1) + expect_equal(nrow(converted$colocboost_input$dict_sumstatLD), 1) + expect_true(converted$source_info$individual$has_individual) + expect_true(converted$source_info$sumstat$has_sumstat) +}) + +test_that("qc_individual_data uses existing genotype filtering helpers", { + region_data <- make_individual_region_data(n = 12, p = 5, n_contexts = 1, n_events = 2) + X <- region_data$individual_data$residual_X + Y <- region_data$individual_data$residual_Y + maf <- region_data$individual_data$maf + expect_message( + result <- qc_individual_data(X, Y, maf = maf, maf_cutoff = 0), + "QC track" + ) + expect_equal(names(result), "ctx1") + expect_equal(ncol(result$ctx1$Y), 2) +}) + +test_that("qc_individual_data supports direct matrix inputs and keeps context labels", { + region_data <- make_individual_region_data(n = 12, p = 5, n_contexts = 1, n_events = 2) + X <- region_data$individual_data$residual_X$ctx1 + Y <- region_data$individual_data$residual_Y$ctx1 + maf <- stats::setNames(rep(0.2, ncol(X)), colnames(X)) + maf[1] <- 0.001 + + expect_message( + result <- qc_individual_data(X, Y, maf = maf, maf_cutoff = 0.05, context = "ctx1"), + "retained" + ) + expect_false(names(maf)[1] %in% colnames(result$X)) + expect_true(all(startsWith(colnames(result$Y), "ctx1_"))) + expect_equal(ncol(result$Y), ncol(Y)) +}) + +test_that("summary_stats_qc runs combined basic harmonization when qc_method is none", { + region_data <- make_sumstat_region_data(n_variants = 5, n_studies = 1) + rss_input <- region_data_to_rss_input(region_data) + expect_message( + result <- summary_stats_qc( + rss_input = rss_input$rss_input, + LD_data = rss_input$LD_data, + qc_method = "none", + impute = FALSE + ), + "basic allele harmonization" + ) + expect_equal(names(result), "study1") + expect_true(nrow(result$study1$rss_input$sumstats) > 0) +}) + +test_that("summary_stats_qc returns one cleaned record for one RSS record", { + region_data <- make_sumstat_region_data(n_variants = 5, n_studies = 1) + rss_input <- region_data_to_rss_input(region_data) + + expect_message( + result <- summary_stats_qc( + rss_input = rss_input$rss_input$study1, + LD_data = rss_input$LD_data$study1, + qc_method = "none", + impute = FALSE + ), + "basic allele harmonization" + ) + expect_true("rss_input" %in% names(result)) + expect_true("LD_matrix" %in% names(result)) + expect_true(nrow(result$rss_input$sumstats) > 0) +}) + +test_that("summary_stats_qc treats a study named sumstats as multiple-study input", { + region_data <- make_sumstat_region_data(n_variants = 5, n_studies = 2) + rss_input <- region_data_to_rss_input(region_data) + names(rss_input$rss_input)[1] <- "sumstats" + names(rss_input$LD_data)[1] <- "sumstats" + + expect_message( + result <- summary_stats_qc( + rss_input = rss_input$rss_input, + LD_data = rss_input$LD_data, + qc_method = "none", + impute = FALSE + ), + "basic allele harmonization" + ) + expect_equal(names(result), c("sumstats", "study2")) + expect_true("rss_input" %in% names(result$sumstats)) + expect_true("rss_input" %in% names(result$study2)) +}) + +test_that("summary_stats_qc imputes when block metadata can be inferred from LD matrix", { + region_data <- make_sumstat_region_data(n_variants = 5, n_studies = 1) + rss_input <- region_data_to_rss_input(region_data) + + expect_message( + result <- summary_stats_qc( + rss_input = rss_input$rss_input, + LD_data = rss_input$LD_data, + qc_method = "none", + impute = TRUE, + impute_opts = list(rcond = 0.01, R2_threshold = -Inf, minimum_ld = -Inf, lamb = 0.01) + ), + "running imputation" + ) + expect_equal(names(result), "study1") + expect_true(nrow(result$study1$rss_input$sumstats) > 0) +}) + +test_that("colocboost_analysis directly forwards core inputs without QC", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + } + ) + X <- matrix(rnorm(20), 5, 4) + colnames(X) <- paste0("v", 1:4) + Y <- matrix(rnorm(10), 5, 2) + colnames(Y) <- c("y1", "y2") + result <- colocboost_analysis(X = X, Y = Y, M = 2) + expect_identical(result$args$X, X) + expect_identical(result$args$Y, Y) + expect_equal(result$args$M, 2) + expect_length(result$dots, 0) +}) + +test_that("colocboost_analysis runs individual QC from colocboost-style inputs", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + } + ) + region_data <- make_individual_region_data(n = 12, p = 5, n_contexts = 1, n_events = 2) + converted <- region_data_to_colocboost_input(region_data) + expect_message( + result <- do.call( + colocboost_analysis, + c(converted$colocboost_input, list(missing_rate_thresh = 1, M = 2)) + ), + "individual-level" + ) + expect_equal(length(result$args$X), 1) + expect_equal(length(result$args$Y), 2) + expect_equal(result$args$M, 2) +}) + +test_that("colocboost_analysis deduplicates shared X after individual QC", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + } + ) + X <- matrix(rnorm(60), 12, 5) + colnames(X) <- paste0("chr1:", seq_len(5) * 100, ":A:G") + Y <- list( + ctx1 = matrix(rnorm(12), 12, 1, dimnames = list(NULL, "event1")), + ctx2 = matrix(rnorm(12), 12, 1, dimnames = list(NULL, "event2")) + ) + X_list <- list(ctx1 = X, ctx2 = X) + + expect_message( + result <- colocboost_analysis(X = X_list, Y = Y, missing_rate_thresh = 1, M = 2), + "individual-level" + ) + expect_equal(length(result$args$X), 1) + expect_equal(result$args$dict_YX[, "X"], c(1, 1)) +}) + +test_that("colocboost_analysis refreshes outcome_names after combined QC", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + } + ) + ind_region <- make_individual_region_data(n = 12, p = 5, n_contexts = 1, n_events = 2) + rss_region <- make_sumstat_region_data(n_variants = 5, n_studies = 1) + region_data <- list( + individual_data = ind_region$individual_data, + sumstat_data = rss_region$sumstat_data + ) + converted <- region_data_to_colocboost_input(region_data) + + expect_message( + result <- do.call( + colocboost_analysis, + c(converted$colocboost_input, list(missing_rate_thresh = 1, qc_method = "none", M = 2)) + ), + "summary-statistic" + ) + expect_equal( + result$args$outcome_names, + c(names(result$args$Y), names(result$args$sumstat)) + ) +}) + +test_that("colocboost_analysis remaps focal_outcome_idx after QC keeps focal outcome", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + }, + qc_individual_data = function(X, Y, ...) { + list(ctx = list( + X = X$ctx, + Y = Y$ctx[, "raw2", drop = FALSE] + )) + } + ) + X <- list(ctx = matrix(rnorm(40), 10, 4)) + colnames(X$ctx) <- paste0("v", 1:4) + Y <- list(ctx = matrix(rnorm(20), 10, 2)) + colnames(Y$ctx) <- c("raw1", "raw2") + + result <- colocboost_analysis( + X = X, Y = Y, + outcome_names = c("trait1", "trait2"), + focal_outcome_idx = 2, + missing_rate_thresh = 1, + M = 2 + ) + + expect_equal(result$args$outcome_names, "trait2") + expect_equal(result$args$focal_outcome_idx, 1) +}) + +test_that("colocboost_analysis clears focal_outcome_idx when QC removes focal outcome", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + }, + qc_individual_data = function(X, Y, ...) { + list(ctx = list( + X = X$ctx, + Y = Y$ctx[, "raw2", drop = FALSE] + )) + } + ) + X <- list(ctx = matrix(rnorm(40), 10, 4)) + colnames(X$ctx) <- paste0("v", 1:4) + Y <- list(ctx = matrix(rnorm(20), 10, 2)) + colnames(Y$ctx) <- c("raw1", "raw2") + + expect_warning( + result <- colocboost_analysis( + X = X, Y = Y, + outcome_names = c("trait1", "trait2"), + focal_outcome_idx = 1, + missing_rate_thresh = 1, + M = 2 + ), + "not present after QC" + ) + + expect_equal(result$args$outcome_names, "trait2") + expect_null(result$args$focal_outcome_idx) +}) + +test_that("colocboost_analysis skips empty individual QC for sumstat-only inputs", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + } + ) + region_data <- make_sumstat_region_data(n_variants = 5, n_studies = 1) + converted <- region_data_to_colocboost_input(region_data) + + messages <- character() + withCallingHandlers( + result <- do.call( + colocboost_analysis, + c(converted$colocboost_input, list(qc_method = "none", M = 2)) + ), + message = function(m) { + messages <<- c(messages, conditionMessage(m)) + invokeRestart("muffleMessage") + } + ) + + expect_false(any(grepl("individual-level", messages))) + expect_true(any(grepl("summary-statistic", messages))) + expect_equal(names(result$args$sumstat), "study1") +}) + +test_that("colocboost_analysis falls back to direct call when QC inputs are unavailable", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + } + ) + + expect_warning( + result <- colocboost_analysis(qc_method = "none", M = 2), + "required QC inputs are unavailable" + ) + expect_equal(result$args$M, 2) + expect_length(result$dots, 0) +}) + +test_that("colocboost_analysis falls back when individual QC cannot run", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + } + ) + X <- matrix(rnorm(20), 5, 4) + Y <- matrix(rnorm(10), 5, 2) + colnames(Y) <- c("y1", "y2") + + expect_warning( + result <- colocboost_analysis(X = X, Y = Y, missing_rate_thresh = 1, M = 2), + "QC requested but skipped" + ) + expect_identical(result$args$X, X) + expect_identical(result$args$Y, Y) + expect_equal(result$args$M, 2) +}) + +test_that("colocboost_analysis derives summary QC input from ColocBoost-style inputs", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + } + ) + variants <- paste0("chr1:", seq_len(5) * 100, ":A:G") + sumstat <- data.frame( + variant = variants, + z = rnorm(5), + n = rep(1000, 5), + stringsAsFactors = FALSE + ) + LD <- diag(5) + rownames(LD) <- colnames(LD) <- variants + + expect_message( + result <- colocboost_analysis(sumstat = sumstat, LD = LD, qc_method = "none", M = 2), + "summary-statistic" + ) + expect_equal(length(result$args$sumstat), 1) + expect_equal(length(result$args$LD), 1) + expect_equal(nrow(result$args$sumstat[[1]]), 5) + expect_equal(nrow(result$args$dict_sumstatLD), 1) + expect_equal(result$args$M, 2) +}) + +test_that("colocboost_analysis keeps multiple GWAS as colocboost list input after summary QC", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + } + ) + variants <- paste0("chr1:", seq_len(5) * 100, ":A:G") + make_sumstat <- function(seed) { + set.seed(seed) + data.frame( + variant = variants, + z = rnorm(5), + n = rep(1000, 5), + stringsAsFactors = FALSE + ) + } + LD <- diag(5) + rownames(LD) <- colnames(LD) <- variants + + expect_message( + result <- colocboost_analysis( + sumstat = list(gwas1 = make_sumstat(1), gwas2 = make_sumstat(2)), + LD = LD, + qc_method = "none", + M = 2 + ), + "summary-statistic" + ) + expect_equal(names(result$args$sumstat), c("gwas1", "gwas2")) + expect_equal(length(result$args$LD), 1) + expect_equal(nrow(result$args$dict_sumstatLD), 2) + expect_equal(result$args$dict_sumstatLD[, 2], c(1, 1)) +}) + +test_that("colocboost_analysis imputes native LD input using inferred block metadata", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + }, + raiss = function(ref_panel, known_zscores, LD_matrix = NULL, genotype_matrix = NULL, ...) { + expect_null(genotype_matrix) + expect_type(LD_matrix, "list") + expect_true("ld_matrices" %in% names(LD_matrix)) + LD_mat <- diag(nrow(known_zscores)) + rownames(LD_mat) <- colnames(LD_mat) <- known_zscores$variant_id + list( + result_filter = known_zscores, + LD_mat = LD_mat + ) + } + ) + variants <- paste0("chr1:", seq_len(5) * 100, ":A:G") + sumstat <- data.frame( + variant = variants, + z = rnorm(5), + n = rep(1000, 5), + stringsAsFactors = FALSE + ) + LD <- diag(5) + rownames(LD) <- colnames(LD) <- variants + + expect_warning( + expect_message( + result <- colocboost_analysis( + sumstat = sumstat, LD = LD, + qc_method = "none", impute = TRUE, M = 2 + ), + "running imputation" + ), + NA + ) + expect_equal(nrow(result$args$sumstat[[1]]), 5) +}) + +test_that("colocboost_analysis imputes native X_ref input through scaled genotype RAISS path", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + }, + compute_LD = function(...) stop("compute_LD should not be called"), + raiss = function(ref_panel, known_zscores, LD_matrix = NULL, genotype_matrix = NULL, ...) { + expect_null(LD_matrix) + expect_equal(ncol(genotype_matrix), nrow(ref_panel)) + expect_equal(unname(colMeans(genotype_matrix)), rep(0, ncol(genotype_matrix)), + tolerance = 1e-10) + list(result_filter = known_zscores, LD_mat = NULL) + } + ) + variants <- paste0("chr1:", seq_len(5) * 100, ":A:G") + sumstat <- data.frame( + variant = variants, + z = rnorm(5), + n = rep(1000, 5), + stringsAsFactors = FALSE + ) + X_ref <- matrix(rnorm(50), 10, 5) + colnames(X_ref) <- variants + + expect_warning( + expect_message( + result <- colocboost_analysis( + sumstat = sumstat, X_ref = X_ref, + qc_method = "none", impute = TRUE, M = 2 + ), + "running imputation" + ), + NA + ) + expect_equal(length(result$args$X_ref), 1) + expect_equal(ncol(result$args$X_ref[[1]]), 5) + expect_null(result$args$LD) +}) + +test_that("colocboost_analysis keeps QC-generated X_ref mutually exclusive with original LD", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + } + ) + variants <- paste0("chr1:", seq_len(5) * 100, ":A:G") + sumstat <- data.frame( + variant = variants, + z = rnorm(5), + n = rep(1000, 5), + stringsAsFactors = FALSE + ) + LD <- diag(5) + rownames(LD) <- colnames(LD) <- variants + X_ref <- matrix(rnorm(50), 10, 5) + colnames(X_ref) <- variants + ref_panel <- parse_variant_id(variants) + ref_panel$variant_id <- variants + LD_reference_info <- list( + LD_matrix = X_ref, + LD_variants = variants, + ref_panel = ref_panel, + is_genotype = TRUE + ) + + result <- suppressMessages(colocboost_analysis( + sumstat = sumstat, + LD = LD, + LD_reference_info = LD_reference_info, + qc_method = "none", + M = 2 + )) + + expect_null(result$args$LD) + expect_equal(length(result$args$X_ref), 1) + expect_equal(dim(result$args$X_ref[[1]]), c(10, 5)) +}) + +test_that("colocboost_analysis native summary QC supports explicit A1_A2 variant convention", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + } + ) + variants_a1a2 <- paste0("chr1:", seq_len(5) * 100, ":G:A") + sumstat <- data.frame( + variant = variants_a1a2, + z = rnorm(5), + n = rep(1000, 5), + stringsAsFactors = FALSE + ) + LD <- diag(5) + rownames(LD) <- colnames(LD) <- variants_a1a2 + + result <- suppressMessages(colocboost_analysis( + sumstat = sumstat, LD = LD, + qc_method = "none", + variant_convention = "A1_A2", M = 2 + )) + expect_equal(result$args$sumstat[[1]]$variant, paste0("chr1:", seq_len(5) * 100, ":A:G")) + expect_equal(rownames(result$args$LD[[1]]), paste0("chr1:", seq_len(5) * 100, ":A:G")) +}) + +test_that("colocboost_analysis uses LD_reference_info data frame for rsid-named LD", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + } + ) + rsids <- paste0("rs", seq_len(5)) + variants <- paste0("chr1:", seq_len(5) * 100, ":A:G") + sumstat <- data.frame( + variant = variants, + z = rnorm(5), + n = rep(1000, 5), + stringsAsFactors = FALSE + ) + LD <- diag(5) + rownames(LD) <- colnames(LD) <- rsids + LD_reference_info <- data.frame( + chrom = 1, + id = rsids, + pos = seq_len(5) * 100, + A2 = "A", + A1 = "G", + stringsAsFactors = FALSE + ) + + expect_message( + result <- colocboost_analysis( + sumstat = sumstat, LD = LD, + qc_method = "none", + LD_reference_info = LD_reference_info, + M = 2 + ), + "LD_reference_info" + ) + expect_equal(rownames(result$args$LD[[1]]), variants) + expect_equal(result$args$sumstat[[1]]$variant, variants) +}) + +test_that("colocboost_analysis uses LD_reference_info row order when LD names are absent", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + } + ) + variants <- paste0("chr1:", seq_len(5) * 100, ":A:G") + sumstat <- data.frame( + variant = variants, + z = rnorm(5), + n = rep(1000, 5), + stringsAsFactors = FALSE + ) + LD <- diag(5) + LD_reference_info <- data.frame( + chrom = 1, + pos = seq_len(5) * 100, + A2 = "A", + A1 = "G", + stringsAsFactors = FALSE + ) + + expect_message( + result <- colocboost_analysis( + sumstat = sumstat, LD = LD, + qc_method = "none", + LD_reference_info = LD_reference_info, + M = 2 + ), + "row order" + ) + expect_equal(rownames(result$args$LD[[1]]), variants) + expect_equal(result$args$sumstat[[1]]$variant, variants) +}) + +test_that("colocboost_analysis reads LD_reference_info from a bim file", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + } + ) + rsids <- paste0("rs", seq_len(5)) + variants <- paste0("chr1:", seq_len(5) * 100, ":A:G") + sumstat <- data.frame( + variant = variants, + z = rnorm(5), + n = rep(1000, 5), + stringsAsFactors = FALSE + ) + LD <- diag(5) + rownames(LD) <- colnames(LD) <- rsids + bim_file <- tempfile(fileext = ".bim") + write.table( + data.frame(1, rsids, 0, seq_len(5) * 100, "G", "A"), + bim_file, + quote = FALSE, + row.names = FALSE, + col.names = FALSE + ) + + result <- suppressMessages(colocboost_analysis( + sumstat = sumstat, LD = LD, + qc_method = "none", + LD_reference_info = bim_file, + M = 2 + )) + expect_equal(rownames(result$args$LD[[1]]), variants) +}) + +test_that("colocboost_analysis reports missing LD_reference_info when LD names are not genomic", { + local_mocked_bindings( + .cb_call_colocboost = function(args, dots) { + list(args = args, dots = dots) + } + ) + variants <- paste0("chr1:", seq_len(5) * 100, ":A:G") + sumstat <- data.frame( + variant = variants, + z = rnorm(5), + n = rep(1000, 5), + stringsAsFactors = FALSE + ) + LD <- diag(5) + rownames(LD) <- colnames(LD) <- paste0("rs", seq_len(5)) + + expect_warning( + result <- colocboost_analysis(sumstat = sumstat, LD = LD, qc_method = "none", M = 2), + "LD_reference_info" + ) + expect_identical(result$args$LD, LD) + expect_equal(result$args$M, 2) +}) + +test_that("colocboost_pipeline is the protocol entry", { + set.seed(450) + X <- matrix(rnorm(50), 10, 5) + colnames(X) <- paste0("chr1:", seq_len(5) * 100, ":A:G") + Y <- matrix(rnorm(10), 10, 1) + colnames(Y) <- "gene1" + region_data <- list( + individual_data = list( + residual_Y = list(ctx1 = Y), + residual_X = list(ctx1 = X), + maf = list(ctx1 = runif(5, 0.05, 0.45)) + ), + sumstat_data = NULL + ) + + local_mocked_bindings( + qc_regional_data = function(region_data, ...) { + list( + individual_data = list(Y = list(ctx1 = Y), X = list(ctx1 = X)), + sumstat_data = NULL + ) + }, + .run_colocboost = function(label, ...) { + list(result = list(label = label, args = list(...)), + time = as.difftime(0, units = "secs")) + } + ) + + result <- suppressMessages(colocboost_pipeline(region_data)) + expect_named(result, c("xqtl_coloc", "joint_gwas", "separate_gwas", "computing_time")) + expect_equal(result$xqtl_coloc$label, "xQTL-only ColocBoost") +}) + +test_that("colocboost_pipeline preserves result fields when analyses return NULL", { + set.seed(451) + X <- matrix(rnorm(50), 10, 5) + colnames(X) <- paste0("chr1:", seq_len(5) * 100, ":A:G") + Y <- matrix(rnorm(10), 10, 1) + colnames(Y) <- "gene1" + sumstat <- data.frame( + chrom = 1, + pos = seq_len(5) * 100, + A2 = "A", + A1 = "G", + z = rnorm(5), + n = 1000, + variant_id = colnames(X), + stringsAsFactors = FALSE + ) + LD <- diag(5) + rownames(LD) <- colnames(LD) <- colnames(X) + region_data <- list( + individual_data = list( + residual_Y = list(ctx1 = Y), + residual_X = list(ctx1 = X), + maf = list(ctx1 = runif(5, 0.05, 0.45)) + ), + sumstat_data = list( + sumstats = list(chr21_ref = list(study1 = list(sumstats = sumstat, n = 1000, var_y = 1))), + LD_info = list(chr21_ref = list( + LD_variants = colnames(X), + LD_matrix = LD, + ref_panel = cbind(parse_variant_id(colnames(X)), variant_id = colnames(X)), + block_metadata = data.frame( + block_id = 1L, chrom = "1", block_start = 100, block_end = 500, + size = 5L, start_idx = 1L, end_idx = 5L + ), + is_genotype = FALSE + )) + ) + ) + + local_mocked_bindings( + .run_colocboost = function(label, ...) { + list(result = NULL, time = as.difftime(0, units = "secs")) + } + ) + + result <- suppressMessages(colocboost_pipeline( + region_data, + xqtl_coloc = TRUE, + joint_gwas = TRUE, + separate_gwas = TRUE, + qc_method = "none", + impute = FALSE + )) + expect_named(result, c("xqtl_coloc", "joint_gwas", "separate_gwas", "computing_time")) + expect_null(result$xqtl_coloc) + expect_null(result$joint_gwas) + expect_named(result$separate_gwas, "study1") + expect_null(result$separate_gwas$study1) +}) + +# =========================================================================== +# 1. colocboost_pipeline: no analysis flags returns empty results # =========================================================================== test_that("pipeline returns empty results with message when no analysis flags set", { region_data <- make_individual_region_data() expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, xqtl_coloc = FALSE, joint_gwas = FALSE, @@ -248,12 +1106,12 @@ test_that("pipeline returns empty results with message when no analysis flags se }) # =========================================================================== -# 2. colocboost_analysis_pipeline: NULL individual_data and NULL sumstat_data +# 2. colocboost_pipeline: NULL individual_data and NULL sumstat_data # =========================================================================== test_that("pipeline returns early when both data sources are NULL", { region_data <- list(individual_data = NULL, sumstat_data = NULL) expect_message( - result <- colocboost_analysis_pipeline(region_data, xqtl_coloc = TRUE), + result <- colocboost_pipeline(region_data, xqtl_coloc = TRUE), "No individual data" ) expect_type(result, "list") @@ -264,7 +1122,7 @@ test_that("pipeline returns early when both data sources are NULL", { # 3. filter_events: type_pattern, valid_pattern, exclude_pattern # =========================================================================== test_that("filter_events keeps events matching valid_pattern", { - # Access the internal function from within colocboost_analysis_pipeline's environment + # Access the internal function from within colocboost_pipeline's environment # We need to build a minimal call through the pipeline to test filter_events indirectly. # Instead, recreate the inner function for testing purposes. @@ -311,7 +1169,7 @@ test_that("filter_events keeps events matching valid_pattern", { ) result <- suppressMessages( - colocboost_analysis_pipeline( + colocboost_pipeline( region_data, event_filters = event_filters, xqtl_coloc = FALSE, @@ -349,7 +1207,7 @@ test_that("filter_events errors on missing type_pattern", { expect_error( suppressMessages( - colocboost_analysis_pipeline( + colocboost_pipeline( region_data, event_filters = bad_filter, xqtl_coloc = TRUE, @@ -382,7 +1240,7 @@ test_that("filter_events errors when only type_pattern is given (no valid or exc expect_error( suppressMessages( - colocboost_analysis_pipeline( + colocboost_pipeline( region_data, event_filters = bad_filter, xqtl_coloc = TRUE @@ -412,7 +1270,7 @@ test_that("extract_contexts_studies returns individual contexts and sumstat stud # Pipeline calls extract_contexts_studies internally. # With no analysis flags it will still run extraction and return. expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, xqtl_coloc = FALSE, joint_gwas = FALSE, @@ -466,7 +1324,7 @@ test_that("extract_contexts_studies reports after-QC when some individual data r ) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, xqtl_coloc = TRUE, joint_gwas = FALSE, @@ -492,9 +1350,12 @@ test_that("qc_regional_data handles named pip_cutoff_to_skip_sumstat vector", { LD_mat <- LD_data$LD_matrix[sumstats$variant_id, sumstats$variant_id, drop = FALSE] list(sumstats = sumstats, LD_mat = LD_mat) }, - summary_stats_qc = function(sumstats, LD_data, ...) { - LD_mat <- LD_data$LD_matrix[sumstats$variant_id, sumstats$variant_id, drop = FALSE] - list(sumstats = sumstats, LD_mat = LD_mat, outlier_number = 0) + summary_stats_qc = function(rss_input = NULL, LD_data, ...) { + stats::setNames(lapply(names(rss_input), function(study) { + ss <- rss_input[[study]]$sumstats + LD_mat <- LD_data[[study]]$LD_matrix[ss$variant_id, ss$variant_id, drop = FALSE] + list(rss_input = rss_input[[study]], LD_matrix = LD_mat, outlier_number = 0) + }), names(rss_input)) }, raiss = function(...) { list(result_filter = data.frame(z = rnorm(5)), LD_mat = diag(5)) @@ -526,9 +1387,12 @@ test_that("qc_regional_data fills missing study names with 0 for pip_cutoff_to_s LD_mat <- LD_data$LD_matrix[sumstats$variant_id, sumstats$variant_id, drop = FALSE] list(sumstats = sumstats, LD_mat = LD_mat) }, - summary_stats_qc = function(sumstats, LD_data, ...) { - LD_mat <- LD_data$LD_matrix[sumstats$variant_id, sumstats$variant_id, drop = FALSE] - list(sumstats = sumstats, LD_mat = LD_mat, outlier_number = 0) + summary_stats_qc = function(rss_input = NULL, LD_data, ...) { + stats::setNames(lapply(names(rss_input), function(study) { + ss <- rss_input[[study]]$sumstats + LD_mat <- LD_data[[study]]$LD_matrix[ss$variant_id, ss$variant_id, drop = FALSE] + list(rss_input = rss_input[[study]], LD_matrix = LD_mat, outlier_number = 0) + }), names(rss_input)) }, raiss = function(...) list(result_filter = data.frame(z = rnorm(5)), LD_mat = diag(5)), partition_LD_matrix = function(...) diag(5) @@ -537,24 +1401,32 @@ test_that("qc_regional_data fills missing study names with 0 for pip_cutoff_to_s # Named vector with only one of the two studies: the missing study should get 0 pip_partial <- c("study1" = 0.05) - result <- suppressMessages( - pecotmr:::qc_regional_data( - region_data, - pip_cutoff_to_skip_sumstat = pip_partial, - qc_method = "slalom", - impute = FALSE - ) + result <- withCallingHandlers( + suppressMessages( + pecotmr:::qc_regional_data( + region_data, + pip_cutoff_to_skip_sumstat = pip_partial, + qc_method = "slalom", + impute = FALSE + ) + ), + warning = function(w) { + if (grepl("IBSS algorithm did not converge", conditionMessage(w), fixed = TRUE)) { + invokeRestart("muffleWarning") + } + stop(w) + } ) expect_type(result, "list") }) # =========================================================================== -# 9. colocboost_analysis_pipeline: output structure verification +# 9. colocboost_pipeline: output structure verification # =========================================================================== test_that("pipeline output structure has expected top-level keys", { region_data <- list(individual_data = NULL, sumstat_data = NULL) result <- suppressMessages( - colocboost_analysis_pipeline(region_data, xqtl_coloc = FALSE, joint_gwas = FALSE, separate_gwas = FALSE) + colocboost_pipeline(region_data, xqtl_coloc = FALSE, joint_gwas = FALSE, separate_gwas = FALSE) ) expect_true("xqtl_coloc" %in% names(result)) expect_true("joint_gwas" %in% names(result)) @@ -590,7 +1462,7 @@ test_that("pipeline with individual data enters xqtl_coloc path and records timi ) result <- suppressMessages( - colocboost_analysis_pipeline( + colocboost_pipeline( region_data, xqtl_coloc = TRUE, joint_gwas = FALSE, @@ -650,7 +1522,7 @@ test_that("filter_events exclude_pattern removes matching events via pipeline", ) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, event_filters = event_filters, xqtl_coloc = TRUE @@ -693,7 +1565,7 @@ test_that("pipeline catches colocboost xqtl error and returns NULL result", { ) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, xqtl_coloc = TRUE, joint_gwas = FALSE, @@ -718,7 +1590,7 @@ test_that("pipeline returns early when all data fails QC", { ) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, xqtl_coloc = TRUE, joint_gwas = FALSE, @@ -729,6 +1601,199 @@ test_that("pipeline returns early when all data fails QC", { expect_type(result, "list") }) +make_qced_sumstat_data <- function(studies = c("study1"), n_variants = 5) { + vids <- paste0("chr1:", seq_len(n_variants) * 100, ":A:G") + LD_mat <- diag(n_variants) + rownames(LD_mat) <- colnames(LD_mat) <- vids + sumstats <- lapply(studies, function(study) { + list( + sumstats = data.frame( + z = rnorm(n_variants), + variant_id = vids, + stringsAsFactors = FALSE + ), + n = 5000, + var_y = 1 + ) + }) + names(sumstats) <- studies + list( + sumstats = sumstats, + LD_mat = stats::setNames(list(LD_mat), studies[1]), + LD_match = stats::setNames(rep(studies[1], length(studies)), studies) + ) +} + +test_that("pipeline skips xqtl branch when QC removes individual data but keeps sumstats", { + ind_region <- make_individual_region_data(n = 20, p = 8, n_contexts = 1, n_events = 2) + sumstat_region <- make_sumstat_region_data(n_variants = 5, n_studies = 1) + region_data <- list( + individual_data = ind_region$individual_data, + sumstat_data = sumstat_region$sumstat_data + ) + calls <- character() + + local_mocked_bindings( + qc_regional_data = function(region_data, ...) { + list( + individual_data = NULL, + sumstat_data = make_qced_sumstat_data("study1") + ) + }, + .run_colocboost = function(label, ...) { + calls <<- c(calls, label) + list(result = list(label = label, args = list(...)), + time = as.difftime(0, units = "secs")) + } + ) + + expect_message( + result <- colocboost_pipeline( + region_data, + xqtl_coloc = TRUE, + joint_gwas = TRUE, + separate_gwas = FALSE + ), + "No individual data pass QC" + ) + expect_false(any(grepl("xQTL-only", calls))) + expect_true(any(grepl("Joint GWAS", calls))) + expect_null(result$computing_time$Analysis$xqtl_coloc) + expect_s3_class(result$computing_time$Analysis$joint_gwas, "difftime") +}) + +test_that("pipeline skips joint_gwas when QC removes sumstats but keeps individual data", { + ind_region <- make_individual_region_data(n = 20, p = 8, n_contexts = 1, n_events = 2) + sumstat_region <- make_sumstat_region_data(n_variants = 5, n_studies = 1) + region_data <- list( + individual_data = ind_region$individual_data, + sumstat_data = sumstat_region$sumstat_data + ) + calls <- character() + + local_mocked_bindings( + qc_regional_data = function(region_data, ...) { + Y1 <- region_data$individual_data$residual_Y[[1]] + colnames(Y1) <- paste0("ctx1_", colnames(Y1)) + list( + individual_data = list( + Y = list(ctx1 = Y1), + X = list(ctx1 = region_data$individual_data$residual_X[[1]]) + ), + sumstat_data = list(sumstats = NULL) + ) + }, + .run_colocboost = function(label, ...) { + calls <<- c(calls, label) + list(result = list(label = label, args = list(...)), + time = as.difftime(0, units = "secs")) + } + ) + + expect_message( + result <- colocboost_pipeline( + region_data, + xqtl_coloc = TRUE, + joint_gwas = TRUE, + separate_gwas = FALSE + ), + "Skipping follow-up analysis for sumstat studies" + ) + expect_true(any(grepl("xQTL-only", calls))) + expect_false(any(grepl("Joint GWAS", calls))) + expect_s3_class(result$computing_time$Analysis$xqtl_coloc, "difftime") + expect_null(result$computing_time$Analysis$joint_gwas) +}) + +test_that("pipeline separate_gwas loops only over sumstats that survive QC", { + region_data <- make_sumstat_region_data(n_variants = 5, n_studies = 2) + calls <- character() + + local_mocked_bindings( + qc_regional_data = function(region_data, ...) { + list( + individual_data = NULL, + sumstat_data = make_qced_sumstat_data("study1") + ) + }, + .run_colocboost = function(label, ...) { + calls <<- c(calls, label) + list(result = list(label = label, args = list(...)), + time = as.difftime(0, units = "secs")) + } + ) + + expect_message( + result <- colocboost_pipeline( + region_data, + xqtl_coloc = FALSE, + joint_gwas = FALSE, + separate_gwas = TRUE + ), + "Skipping follow-up analysis for sumstat studies" + ) + expect_equal(length(calls), 1) + expect_true(grepl("study1", calls[[1]])) + expect_equal(result$computing_time$Analysis$separate_gwas$n_studies, 1) + expect_true("study2" %in% names(result$separate_gwas)) + expect_null(result$separate_gwas$study2) +}) + +test_that("pipeline event filters can remove all events in one context while keeping another", { + set.seed(1401) + n <- 10 + p <- 5 + X <- matrix(rnorm(n * p), n, p) + colnames(X) <- paste0("chr1:", seq_len(p) * 100, ":A:G") + Y_drop <- matrix(rnorm(n * 2), n, 2) + colnames(Y_drop) <- c("bad_event1", "bad_event2") + Y_keep <- matrix(rnorm(n * 2), n, 2) + colnames(Y_keep) <- c("keep_event", "bad_event") + region_data <- list( + individual_data = list( + residual_Y = list(ctx_drop = Y_drop, ctx_keep = Y_keep), + residual_X = list(ctx_drop = X, ctx_keep = X), + maf = list(ctx_drop = runif(p, 0.05, 0.45), ctx_keep = runif(p, 0.05, 0.45)) + ), + sumstat_data = NULL + ) + event_filters <- list(list( + type_pattern = ".*_event.*", + valid_pattern = "keep_event" + )) + calls <- character() + + local_mocked_bindings( + qc_regional_data = function(region_data, ...) { + list( + individual_data = list( + Y = region_data$individual_data$residual_Y, + X = region_data$individual_data$residual_X + ), + sumstat_data = NULL + ) + }, + .run_colocboost = function(label, ...) { + calls <<- c(calls, label) + list(result = list(label = label, args = list(...)), + time = as.difftime(0, units = "secs")) + } + ) + + expect_message( + result <- colocboost_pipeline( + region_data, + event_filters = event_filters, + xqtl_coloc = TRUE, + joint_gwas = FALSE, + separate_gwas = FALSE + ), + "Skipping follow-up analysis for individual traits ctx_drop" + ) + expect_equal(length(calls), 1) + expect_s3_class(result$computing_time$Analysis$xqtl_coloc, "difftime") +}) + # =========================================================================== # Tests from test_twas_colocboost_round3.R (colocboost-related) @@ -809,22 +1874,8 @@ make_sumstat_region_data <- function(n_variants = 5, n_studies = 2) { } -test_that("filter_valid_sumstats: all valid entries pass through unchanged", { - ss1 <- data.frame(z = c(1.0, 2.0), n = c(100, 100), variant = c("v1", "v2")) - ss2 <- data.frame(z = c(-1.0, 0.5), n = c(200, 200), variant = c("v3", "v4")) - - sumstats <- list(s1 = ss1, s2 = ss2) - LD_mat <- list(s1 = diag(2), s2 = diag(2)) - LD_match <- c("s1", "s2") - - # No message about removed entries when all are valid - result <- pecotmr:::filter_valid_sumstats(sumstats, LD_mat, LD_match) - expect_equal(length(result$sumstats), 2) - expect_equal(nrow(result$dict_sumstatLD), 2) -}) - # =========================================================================== -# SECTION C: colocboost_analysis_pipeline - filter_events valid_pattern path +# SECTION C: colocboost_pipeline - filter_events valid_pattern path # (lines 64, 67, 71-72, 74, 82, 84-85) # =========================================================================== @@ -869,7 +1920,7 @@ test_that("filter_events: valid_pattern with no matching groups returns NULL (li # The filter returns NULL for the context -> residual_Y entry becomes NULL expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, event_filters = event_filters, xqtl_coloc = TRUE, @@ -922,7 +1973,7 @@ test_that("filter_events: type_pattern matches nothing skips via next (line 64)" ) result <- suppressMessages( - colocboost_analysis_pipeline( + colocboost_pipeline( region_data, event_filters = event_filters, xqtl_coloc = TRUE, @@ -972,7 +2023,7 @@ test_that("filter_events: all events pass -> 'included in following analysis' me ) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, event_filters = event_filters, xqtl_coloc = TRUE, @@ -1026,7 +2077,7 @@ test_that("filter_events: valid_pattern with successful groups retains valid eve ) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, event_filters = event_filters, xqtl_coloc = TRUE, @@ -1063,7 +2114,7 @@ test_that("extract_contexts_studies: all individual data pass QC (line 126)", { ) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, xqtl_coloc = TRUE, joint_gwas = FALSE, @@ -1090,7 +2141,7 @@ test_that("extract_contexts_studies: all individual data fail QC (line 134-135)" ) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, xqtl_coloc = TRUE, joint_gwas = FALSE, @@ -1130,7 +2181,7 @@ test_that("extract_contexts_studies: sumstat studies extraction on initial call # With xqtl_coloc=FALSE, separate_gwas=TRUE -> will enter the sumstat code paths result <- suppressMessages( - colocboost_analysis_pipeline( + colocboost_pipeline( region_data, xqtl_coloc = FALSE, joint_gwas = FALSE, @@ -1161,7 +2212,7 @@ test_that("extract_contexts_studies: after-QC sumstat all pass (line 144)", { ) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, xqtl_coloc = FALSE, joint_gwas = FALSE, @@ -1193,7 +2244,7 @@ test_that("extract_contexts_studies: after-QC sumstat partial pass (line 146)", ) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, xqtl_coloc = FALSE, joint_gwas = FALSE, @@ -1205,7 +2256,7 @@ test_that("extract_contexts_studies: after-QC sumstat partial pass (line 146)", }) # =========================================================================== -# SECTION E: colocboost_analysis_pipeline - sumstat processing block +# SECTION E: colocboost_pipeline - sumstat processing block # (lines 244-281: organizing sumstats, normalizing variant IDs, LD normalization) # =========================================================================== @@ -1248,7 +2299,7 @@ test_that("pipeline sumstat block: normalizes variant IDs and processes LD matri ) result <- suppressMessages( - colocboost_analysis_pipeline( + colocboost_pipeline( region_data, xqtl_coloc = FALSE, joint_gwas = FALSE, @@ -1300,7 +2351,7 @@ test_that("pipeline sumstat block: single sumstat study initializes separate_gwa ) result <- suppressMessages( - colocboost_analysis_pipeline( + colocboost_pipeline( region_data, xqtl_coloc = FALSE, joint_gwas = FALSE, @@ -1354,7 +2405,7 @@ test_that("pipeline sumstat block: multiple sumstat studies initializes separate ) result <- suppressMessages( - colocboost_analysis_pipeline( + colocboost_pipeline( region_data, xqtl_coloc = FALSE, joint_gwas = FALSE, @@ -1368,11 +2419,11 @@ test_that("pipeline sumstat block: multiple sumstat studies initializes separate }) # =========================================================================== -# SECTION F: colocboost_analysis_pipeline - filter_valid_sumstats returning NULL +# SECTION F: colocboost_pipeline - no valid summary statistics after validation # (lines 275-276: pipeline with all invalid sumstats -> "No data pass QC") # =========================================================================== -test_that("pipeline: all sumstats invalid after filter_valid_sumstats returns No data pass QC (line 275-276)", { +test_that("pipeline: all sumstats invalid returns No data pass QC (line 275-276)", { set.seed(830) n_variants <- 3 vids <- paste0("chr1:", seq_len(n_variants) * 100, ":A:G") @@ -1414,7 +2465,7 @@ test_that("pipeline: all sumstats invalid after filter_valid_sumstats returns No ) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, xqtl_coloc = FALSE, joint_gwas = FALSE, @@ -1426,7 +2477,7 @@ test_that("pipeline: all sumstats invalid after filter_valid_sumstats returns No }) # =========================================================================== -# SECTION G: colocboost_analysis_pipeline - focal_trait matching (lines 300-301) +# SECTION G: colocboost_pipeline - focal_trait matching (lines 300-301) # =========================================================================== test_that("pipeline: focal_trait matches a trait name sets focal_outcome_idx (lines 300-301)", { @@ -1450,7 +2501,7 @@ test_that("pipeline: focal_trait matches a trait name sets focal_outcome_idx (li # focal_trait = "ctx1_event2" should match and set focal_outcome_idx result <- suppressMessages( - colocboost_analysis_pipeline( + colocboost_pipeline( region_data, focal_trait = "ctx1_event2", xqtl_coloc = TRUE, @@ -1483,7 +2534,7 @@ test_that("pipeline: focal_trait does NOT match leaves focal_outcome_idx NULL (l # focal_trait is specified but doesn't match any trait result <- suppressMessages( - colocboost_analysis_pipeline( + colocboost_pipeline( region_data, focal_trait = "nonexistent_trait", xqtl_coloc = TRUE, @@ -1496,7 +2547,7 @@ test_that("pipeline: focal_trait does NOT match leaves focal_outcome_idx NULL (l }) # =========================================================================== -# SECTION H: colocboost_analysis_pipeline - joint_gwas path (lines 320-323) +# SECTION H: colocboost_pipeline - joint_gwas path (lines 320-323) # =========================================================================== test_that("pipeline: joint_gwas path is entered with both individual and sumstat data (lines 320-323)", { @@ -1548,7 +2599,7 @@ test_that("pipeline: joint_gwas path is entered with both individual and sumstat # joint_gwas=TRUE should trigger the joint GWAS branch (lines 320-323) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, xqtl_coloc = FALSE, joint_gwas = TRUE, @@ -1562,7 +2613,7 @@ test_that("pipeline: joint_gwas path is entered with both individual and sumstat # =========================================================================== -# SECTION I: colocboost_analysis_pipeline - separate_gwas path (lines 341+) +# SECTION I: colocboost_pipeline - separate_gwas path (lines 341+) # =========================================================================== test_that("pipeline: separate_gwas path is entered for each GWAS study", { @@ -1611,7 +2662,7 @@ test_that("pipeline: separate_gwas path is entered for each GWAS study", { ) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, xqtl_coloc = FALSE, joint_gwas = FALSE, @@ -1624,7 +2675,7 @@ test_that("pipeline: separate_gwas path is entered for each GWAS study", { }) # =========================================================================== -# SECTION J: colocboost_analysis_pipeline - all Y NULL after organizing (lines 225-227) +# SECTION J: colocboost_pipeline - all Y NULL after organizing (lines 225-227) # =========================================================================== test_that("pipeline: all Y become NULL after organizing individual data (lines 225-227)", { @@ -1644,7 +2695,7 @@ test_that("pipeline: all Y become NULL after organizing individual data (lines 2 ) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, xqtl_coloc = TRUE, joint_gwas = FALSE, @@ -1656,7 +2707,7 @@ test_that("pipeline: all Y become NULL after organizing individual data (lines 2 }) # =========================================================================== -# SECTION K: colocboost_analysis_pipeline - sumstat all z NA (line 252-255) +# SECTION K: colocboost_pipeline - sumstat all z NA (line 252-255) # =========================================================================== test_that("pipeline: sumstat with all NA z-scores yields warning message (lines 252-255)", { @@ -1700,7 +2751,7 @@ test_that("pipeline: sumstat with all NA z-scores yields warning message (lines ) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, xqtl_coloc = FALSE, joint_gwas = FALSE, @@ -1712,7 +2763,7 @@ test_that("pipeline: sumstat with all NA z-scores yields warning message (lines }) # =========================================================================== -# SECTION L: colocboost_analysis_pipeline - no sumstat_data pass QC (line 152) +# SECTION L: colocboost_pipeline - no sumstat_data pass QC (line 152) # =========================================================================== test_that("extract_contexts_studies: no sumstat data pass QC message", { @@ -1740,13 +2791,13 @@ test_that("extract_contexts_studies: no sumstat data pass QC message", { ) expect_message( - result <- colocboost_analysis_pipeline( + result <- colocboost_pipeline( region_data, xqtl_coloc = TRUE, joint_gwas = FALSE, separate_gwas = FALSE ), - "No sumstat data pass QC" + "Skipping follow-up analysis for sumstat studies" ) expect_type(result, "list") }) @@ -1841,7 +2892,7 @@ test_that("qc_regional_data: pip_cutoff_to_skip_ind lookup works when X and Y ha }) # =========================================================================== -# SECTION U: colocboost_analysis_pipeline - sumstat with NA z filtering (line 252-258) +# SECTION U: colocboost_pipeline - sumstat with NA z filtering (line 252-258) # =========================================================================== test_that("pipeline sumstat processing handles all-NA z-scores with warning (lines 252-258)", { @@ -1886,7 +2937,7 @@ test_that("pipeline sumstat processing handles all-NA z-scores with warning (lin # Should produce message about NA z-scores or no valid studies result <- suppressMessages( - colocboost_analysis_pipeline( + colocboost_pipeline( region_data, xqtl_coloc = FALSE, joint_gwas = FALSE, @@ -1897,7 +2948,7 @@ test_that("pipeline sumstat processing handles all-NA z-scores with warning (lin }) # =========================================================================== -# SECTION V: colocboost_analysis_pipeline - LD matrix normalization (lines 263-269) +# SECTION V: colocboost_pipeline - LD matrix normalization (lines 263-269) # =========================================================================== test_that("pipeline: LD matrix dimnames are normalized to canonical format (lines 261-269)", { @@ -1945,7 +2996,7 @@ test_that("pipeline: LD matrix dimnames are normalized to canonical format (line # normalize_variant_id should add chr prefix to LD matrix dimnames result <- suppressMessages( - colocboost_analysis_pipeline( + colocboost_pipeline( region_data, xqtl_coloc = FALSE, joint_gwas = FALSE, @@ -1967,9 +3018,11 @@ test_that("qc_regional_data: with only sumstat data processes correctly", { LD_mat <- LD_data$LD_matrix list(sumstats = sumstats, LD_mat = LD_mat) }, - summary_stats_qc = function(sumstats, LD_data, ...) { - LD_mat <- LD_data$LD_matrix - list(sumstats = sumstats, LD_mat = LD_mat, outlier_number = 0) + summary_stats_qc = function(rss_input = NULL, LD_data, ...) { + stats::setNames(lapply(names(rss_input), function(study) { + list(rss_input = rss_input[[study]], LD_matrix = LD_data[[study]]$LD_matrix, + outlier_number = 0) + }), names(rss_input)) }, raiss = function(...) { list(result_filter = data.frame(z = rnorm(5)), LD_mat = diag(5)) diff --git a/tests/testthat/test_file_utils.R b/tests/testthat/test_file_utils.R index c93f36c2..541c9935 100644 --- a/tests/testthat/test_file_utils.R +++ b/tests/testthat/test_file_utils.R @@ -1543,7 +1543,8 @@ test_that("load_multitask_regional_data errors with multiple genotypes and no ma region = "chr1:1-1000", genotype_list = c("geno1.bed", "geno2.bed"), phenotype_list = c("pheno1.gz"), - covariate_list = c("covar1.gz") + covariate_list = c("covar1.gz"), + conditions_list_individual = "cond1" ), "match_geno_pheno" ) @@ -1570,6 +1571,164 @@ test_that("load_multitask_regional_data individual-level path returns expected s expect_true("chrom" %in% names(result$individual_data)) }) +test_that("load_multitask_regional_data loads and merges multiple genotype groups", { + calls <- list() + make_mock_dat <- function(conditions) { + x <- setNames(lapply(conditions, function(nm) { + matrix(1, nrow = 2, ncol = 2, + dimnames = list(c("s1", "s2"), paste0(nm, "_v", 1:2))) + }), conditions) + y <- setNames(lapply(conditions, function(nm) { + matrix(1, nrow = 2, ncol = 1, + dimnames = list(c("s1", "s2"), nm)) + }), conditions) + list( + residual_Y = y, + residual_X = x, + residual_Y_scalar = setNames(as.list(rep(1, length(conditions))), conditions), + residual_X_scalar = setNames(as.list(rep(1, length(conditions))), conditions), + dropped_sample = list(X = list(), Y = list(), covar = list()), + covar = list(), + Y = y, + X_data = x, + X = do.call(cbind, x), + maf = setNames(lapply(x, function(mat) rep(0.1, ncol(mat))), conditions), + chrom = "chr1", + grange = c("1", "100"), + Y_coordinates = list() + ) + } + + local_mocked_bindings( + load_regional_univariate_data = function(genotype, phenotype, covariate, + conditions, extract_region_name, ...) { + calls[[length(calls) + 1]] <<- list( + genotype = genotype, + phenotype = phenotype, + covariate = covariate, + conditions = conditions, + extract_region_name = extract_region_name + ) + make_mock_dat(conditions) + } + ) + + result <- expect_warning( + load_multitask_regional_data( + region = "chr1:1-100", + genotype_list = c("geno1", "geno2"), + phenotype_list = paste0("pheno", 1:4), + covariate_list = paste0("covar", 1:4), + conditions_list_individual = paste0("cond", 1:4), + match_geno_pheno = c(1, 1, 2, 2), + extract_region_name = as.list(paste0("gene", 1:4)), + region_name_col = 4 + ), + NA + ) + + expect_equal(length(calls), 2L) + expect_equal(vapply(calls, `[[`, character(1), "genotype"), c("geno1", "geno2")) + expect_equal(calls[[1]]$phenotype, paste0("pheno", 1:2)) + expect_equal(calls[[2]]$phenotype, paste0("pheno", 3:4)) + expect_equal(calls[[1]]$extract_region_name, as.list(paste0("gene", 1:2))) + expect_equal(calls[[2]]$extract_region_name, as.list(paste0("gene", 3:4))) + expect_true("residual_X" %in% names(result$individual_data)) + expect_equal(names(result$individual_data$residual_X), paste0("cond", 1:4)) +}) + +test_that("load_multitask_regional_data defaults missing individual condition names", { + calls <- list() + make_mock_dat <- function(conditions) { + x <- setNames(lapply(conditions, function(nm) { + matrix(1, nrow = 2, ncol = 2, + dimnames = list(c("s1", "s2"), paste0(nm, "_v", 1:2))) + }), conditions) + y <- setNames(lapply(conditions, function(nm) { + matrix(1, nrow = 2, ncol = 1, + dimnames = list(c("s1", "s2"), nm)) + }), conditions) + list( + residual_Y = y, + residual_X = x, + residual_Y_scalar = setNames(as.list(rep(1, length(conditions))), conditions), + residual_X_scalar = setNames(as.list(rep(1, length(conditions))), conditions), + dropped_sample = list(X = list(), Y = list(), covar = list()), + covar = list(), + Y = y, + X_data = x, + X = do.call(cbind, x), + maf = setNames(lapply(x, function(mat) rep(0.1, ncol(mat))), conditions), + chrom = "chr1", + grange = c("1", "100"), + Y_coordinates = list() + ) + } + + local_mocked_bindings( + load_regional_univariate_data = function(genotype, phenotype, covariate, + conditions, ...) { + calls[[length(calls) + 1]] <<- list( + genotype = genotype, + phenotype = phenotype, + covariate = covariate, + conditions = conditions + ) + make_mock_dat(conditions) + } + ) + + result <- expect_warning( + load_multitask_regional_data( + region = "chr1:1-100", + genotype_list = c("geno1", "geno2"), + phenotype_list = paste0("pheno", 1:4), + covariate_list = paste0("covar", 1:4), + match_geno_pheno = c(1, 1, 2, 2) + ), + "conditions_list_individual" + ) + + expect_equal(length(calls), 2L) + expect_equal(calls[[1]]$conditions, paste0("condition", 1:2)) + expect_equal(calls[[2]]$conditions, paste0("condition", 3:4)) + expect_equal(names(result$individual_data$residual_X), paste0("condition", 1:4)) +}) + +test_that("load_multitask_regional_data validates individual input vector lengths", { + expect_error( + load_multitask_regional_data( + region = "chr1:1-100", + genotype_list = "geno", + phenotype_list = paste0("pheno", 1:2), + covariate_list = "covar", + conditions_list_individual = paste0("cond", 1:2) + ), + "phenotype_list" + ) + expect_error( + load_multitask_regional_data( + region = "chr1:1-100", + genotype_list = "geno", + phenotype_list = paste0("pheno", 1:2), + covariate_list = paste0("covar", 1:2), + conditions_list_individual = "cond" + ), + "conditions_list_individual" + ) + expect_error( + load_multitask_regional_data( + region = "chr1:1-100", + genotype_list = c("geno1", "geno2"), + phenotype_list = paste0("pheno", 1:2), + covariate_list = paste0("covar", 1:2), + conditions_list_individual = paste0("cond", 1:2), + match_geno_pheno = c(1, 3) + ), + "match_geno_pheno" + ) +}) + test_that("load_multitask_regional_data sumstat path returns expected structure", { skip_if_not_installed("pgenlibr") td <- test_path("test_data") diff --git a/tests/testthat/test_raiss.R b/tests/testthat/test_raiss.R index 2315798b..085da5cb 100644 --- a/tests/testthat/test_raiss.R +++ b/tests/testthat/test_raiss.R @@ -1390,6 +1390,68 @@ test_that("X path R2 filtering matches R path", { info = "Same variants should pass R2/LD filtering") }) +test_that("raw genotype_matrix path is not equivalent to LD path used by legacy pipeline", { + set.seed(1) + n <- 80 + p <- 40 + n_known <- 20 + X_raw <- matrix(sample(0:2, n * p, replace = TRUE, + prob = c(0.35, 0.45, 0.20)), + nrow = n, ncol = p) + ref_panel <- data.frame( + chrom = rep(1, p), + pos = seq_len(p) * 100, + variant_id = paste0("rs", seq_len(p)), + A1 = rep("A", p), + A2 = rep("G", p), + stringsAsFactors = FALSE + ) + colnames(X_raw) <- ref_panel$variant_id + + known_idx <- sort(sample(seq_len(p), n_known)) + known_zscores <- data.frame( + chrom = rep(1, n_known), + pos = ref_panel$pos[known_idx], + variant_id = ref_panel$variant_id[known_idx], + A1 = ref_panel$A1[known_idx], + A2 = ref_panel$A2[known_idx], + z = rnorm(n_known), + stringsAsFactors = FALSE + ) + + R <- compute_LD(X_raw, method = "sample") + rownames(R) <- colnames(R) <- ref_panel$variant_id + X_scaled <- scale(X_raw) + X_scaled[is.na(X_scaled)] <- 0 + colnames(X_scaled) <- ref_panel$variant_id + + result_LD <- raiss(ref_panel, known_zscores, LD_matrix = R, + lamb = 0.01, rcond = 0.01, + R2_threshold = 0.6, minimum_ld = 0, + verbose = FALSE) + result_raw_X <- raiss(ref_panel, known_zscores, genotype_matrix = X_raw, + lamb = 0.01, svd_tol = 1e-8, + R2_threshold = 0.6, minimum_ld = 0, + verbose = FALSE) + result_scaled_X <- raiss(ref_panel, known_zscores, genotype_matrix = X_scaled, + lamb = 0.01, svd_tol = 1e-12, + R2_threshold = 0.6, minimum_ld = 0, + verbose = FALSE) + + ld_ids <- sort(result_LD$result_filter$variant_id) + raw_x_ids <- sort(result_raw_X$result_filter$variant_id) + scaled_x_ids <- sort(result_scaled_X$result_filter$variant_id) + + expect_false(identical(ld_ids, raw_x_ids)) + expect_gt(nrow(result_raw_X$result_filter), nrow(result_LD$result_filter)) + expect_equal(scaled_x_ids, ld_ids) + + ld_sorted <- result_LD$result_nofilter %>% arrange(variant_id) + scaled_sorted <- result_scaled_X$result_nofilter %>% arrange(variant_id) + expect_equal(ld_sorted$z, scaled_sorted$z, tolerance = 1e-10) + expect_equal(ld_sorted$raiss_R2, scaled_sorted$raiss_R2, tolerance = 1e-10) +}) + test_that("raiss rejects both LD_matrix and genotype_matrix", { data <- generate_X_test_data(n = 50, p = 20, n_known = 10, seed = 1) expect_error( @@ -1454,4 +1516,4 @@ test_that("X path with collinear variants matches R path", { expect_equal(r_sorted$z, x_sorted$z, tolerance = 1e-3, info = "Collinear case: z-scores should be close") -}) \ No newline at end of file +}) diff --git a/tests/testthat/test_sumstats_qc.R b/tests/testthat/test_sumstats_qc.R index 2159404e..366596f0 100644 --- a/tests/testthat/test_sumstats_qc.R +++ b/tests/testthat/test_sumstats_qc.R @@ -257,6 +257,19 @@ test_that("rss_basic_qc handles multiple skip regions", { expect_true(500 %in% remaining_pos) }) +test_that("rss_basic_qc can skip LD matrix subsetting for genotype references", { + td <- make_test_sumstats_ld(n_variants = 3) + X_ref <- matrix(rnorm(30), 10, 3) + colnames(X_ref) <- td$variant_ids + td$LD_data$LD_matrix <- X_ref + td$LD_data$is_genotype <- TRUE + + result <- rss_basic_qc(td$sumstats, td$LD_data, return_LD_mat = FALSE) + + expect_true(nrow(result$sumstats) > 0) + expect_null(result$LD_mat) +}) + # =========================================================================== # summary_stats_qc # =========================================================================== @@ -391,6 +404,110 @@ test_that("summary_stats_qc returns LD_mat matching filtered sumstats dimensions expect_equal(ncol(result$LD_mat), nrow(result$sumstats)) }) +test_that("summary_stats_qc basic genotype-backed path does not compute LD", { + td <- make_test_sumstats_ld(n_variants = 5) + X_ref <- matrix(rnorm(50), 10, 5) + colnames(X_ref) <- td$variant_ids + td$LD_data$LD_matrix <- X_ref + td$LD_data$is_genotype <- TRUE + rss_input <- list(sumstats = td$sumstats, n = 1000, var_y = 1) + + local_mocked_bindings( + compute_LD = function(...) stop("compute_LD should not be called") + ) + + expect_message( + result <- summary_stats_qc(rss_input = rss_input, LD_data = td$LD_data, + qc_method = "none", impute = FALSE), + "basic harmonization retained" + ) + expect_equal(nrow(result$LD_matrix), nrow(X_ref)) + expect_equal(ncol(result$LD_matrix), nrow(result$rss_input$sumstats)) +}) + +test_that("summary_stats_qc PIP screening uses LD-independent SER", { + td <- make_test_sumstats_ld(n_variants = 5) + X_ref <- matrix(rnorm(50), 10, 5) + colnames(X_ref) <- td$variant_ids + td$LD_data$LD_matrix <- X_ref + td$LD_data$is_genotype <- TRUE + rss_input <- list(sumstats = td$sumstats, n = 1000, var_y = 1) + + local_mocked_bindings( + compute_LD = function(...) stop("compute_LD should not be called"), + susie_rss = function(z, R = NULL, X = NULL, ...) { + expect_null(X) + expect_equal(R, diag(length(z))) + list(pip = rep(1, length(z))) + } + ) + + result <- suppressMessages(summary_stats_qc( + rss_input = rss_input, + LD_data = td$LD_data, + qc_method = "none", + pip_cutoff_to_skip = 0.1, + impute = FALSE + )) + expect_equal(ncol(result$LD_matrix), nrow(result$rss_input$sumstats)) +}) + +test_that("summary_stats_qc treats NULL qc_method as basic-only none", { + td <- make_test_sumstats_ld(n_variants = 5) + rss_input <- list(sumstats = td$sumstats, n = 1000, var_y = 1) + + local_mocked_bindings( + ld_mismatch_qc = function(...) stop("ld_mismatch_qc should not be called") + ) + + expect_message( + result <- summary_stats_qc( + rss_input = rss_input, + LD_data = td$LD_data, + qc_method = NULL, + impute = FALSE + ), + "basic harmonization retained" + ) + expect_equal(nrow(result$rss_input$sumstats), nrow(td$sumstats)) +}) + +test_that("summary_stats_qc LD-mismatch QC computes only filtered local LD from X_ref", { + td <- make_test_sumstats_ld(n_variants = 5) + X_ref <- matrix(rnorm(50), 10, 5) + colnames(X_ref) <- td$variant_ids + td$LD_data$LD_matrix <- X_ref + td$LD_data$is_genotype <- TRUE + rss_input <- list(sumstats = td$sumstats, n = 1000, var_y = 1) + compute_calls <- 0 + + local_mocked_bindings( + compute_LD = function(X, ...) { + compute_calls <<- compute_calls + 1 + expect_equal(ncol(X), 3) + R <- diag(ncol(X)) + rownames(R) <- colnames(R) <- colnames(X) + R + }, + ld_mismatch_qc = function(zScore, R, nSample = NULL, method = NULL, ...) { + expect_equal(nrow(R), length(zScore)) + expect_equal(ncol(R), length(zScore)) + data.frame(outlier = rep(FALSE, length(zScore))) + } + ) + + result <- suppressMessages(summary_stats_qc( + rss_input = rss_input, + LD_data = td$LD_data, + qc_method = "slalom", + skip_region = "1:150-350", + impute = FALSE + )) + expect_equal(compute_calls, 1) + expect_equal(ncol(result$LD_matrix), nrow(result$rss_input$sumstats)) + expect_equal(ncol(result$LD_matrix), 3) +}) + # =========================================================================== # ld_mismatch_qc # =========================================================================== diff --git a/tests/testthat/test_univariate_pipeline.R b/tests/testthat/test_univariate_pipeline.R index ba3b2ff7..cb32abee 100644 --- a/tests/testthat/test_univariate_pipeline.R +++ b/tests/testthat/test_univariate_pipeline.R @@ -971,7 +971,10 @@ test_that("rss: pip_cutoff_to_skip > 0, signal detected => continues", { load_rss_data = function(...) list(sumstats = ss, n = 1000, var_y = 1), rss_basic_qc = function(...) list(sumstats = ss, LD_mat = ld_mat), susie_rss = function(...) list(pip = c(0.9, 0.01, 0.01, 0.01, 0.01)), - summary_stats_qc = function(...) list(sumstats = ss, LD_mat = ld_mat, outlier_number = 0), + summary_stats_qc = function(...) { + message("Follow-up on region: signals above PIP threshold 0.5 detected.") + list(sumstats = ss, LD_mat = ld_mat, outlier_number = 0) + }, partition_LD_matrix = function(...) ld_mat, raiss = function(...) list(result_filter = ss, LD_mat = ld_mat), susie_rss_pipeline = function(...) fake_rss_result, @@ -1179,7 +1182,7 @@ test_that("rss: finemapping_method = NULL skips fine-mapping", { # SECTION 14: rss_analysis_pipeline - QC = NULL skips QC # ======================================================================== -test_that("rss: qc_method = NULL skips quality control", { +test_that("rss: qc_method = NULL uses combined basic QC without LD-mismatch method", { ss <- make_rss_sumstats(5) ld_mat <- make_rss_ld_mat(5) fake_result <- make_fake_post_result(5) @@ -1204,7 +1207,7 @@ test_that("rss: qc_method = NULL skips quality control", { finemapping_method = "susie_rss" ) - expect_false(qc_called) + expect_true(qc_called) }) # ======================================================================== @@ -1947,7 +1950,7 @@ test_that("load_study_LD with comma-separated paths builds list of X panels", { # rss_analysis_pipeline: X-as-LD path (genotype interface) # ======================================================================== -test_that("rss: is_genotype=TRUE path computes R from X and uses X for fine-mapping", { +test_that("rss: is_genotype=TRUE path does not precompute R and uses X for fine-mapping", { ss <- make_rss_sumstats(5) ld_mat <- make_rss_ld_mat(5) X_geno <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5) @@ -1961,7 +1964,7 @@ test_that("rss: is_genotype=TRUE path computes R from X and uses X for fine-mapp local_mocked_bindings( compute_LD = function(X, method = "sample") { compute_LD_called <<- TRUE - ld_mat + stop("rss_analysis_pipeline should not precompute LD from X") }, load_rss_data = function(...) list(sumstats = ss, n = 1000, var_y = 1), rss_basic_qc = function(...) list(sumstats = ss, LD_mat = ld_mat), @@ -1983,7 +1986,7 @@ test_that("rss: is_genotype=TRUE path computes R from X and uses X for fine-mapp finemapping_method = "susie_rss" ) - expect_true(compute_LD_called) + expect_false(compute_LD_called) expect_true(is.matrix(susie_X_arg)) expect_null(susie_LD_arg) expect_true(any(grepl("susie_rss", names(result)))) @@ -1998,9 +2001,13 @@ test_that("rss: mixture LD_data (list of X panels) preserves list shape into sus fake_result <- make_fake_post_result(5) susie_X_arg <- NULL + compute_LD_called <- FALSE local_mocked_bindings( - compute_LD = function(X, method = "sample") ld_mat, + compute_LD = function(X, method = "sample") { + compute_LD_called <<- TRUE + stop("rss_analysis_pipeline should not precompute LD from mixture X") + }, load_rss_data = function(...) list(sumstats = ss, n = 1000, var_y = 1), rss_basic_qc = function(...) list(sumstats = ss, LD_mat = ld_mat), summary_stats_qc = function(...) list(sumstats = ss, LD_mat = ld_mat, outlier_number = 0), @@ -2021,6 +2028,7 @@ test_that("rss: mixture LD_data (list of X panels) preserves list shape into sus ) # Mixture path => list of subset matrices passed to susie_rss_pipeline as X_mat + expect_false(compute_LD_called) expect_true(is.list(susie_X_arg)) expect_length(susie_X_arg, 2) expect_true(all(sapply(susie_X_arg, is.matrix))) diff --git a/vignettes/colocboost-pipeline.Rmd b/vignettes/colocboost-pipeline.Rmd index 86857ccb..7a4e65eb 100644 --- a/vignettes/colocboost-pipeline.Rmd +++ b/vignettes/colocboost-pipeline.Rmd @@ -25,7 +25,7 @@ This vignette demonstrates how to use the bioinformatics pipeline for ColocBoost Acknowledgment: Thanks to Kate (Kathryn) Lawrence (GitHub:@kal26) for her contributions to this vignette. -# 1. Loading Data using `colocboost_analysis_pipeline` function +# 1. Loading Data using `colocboost_pipeline` function This function harmonizes the input data and prepares it for colocalization analysis. @@ -34,7 +34,7 @@ In this section, we introduce how to load the regional data required for the Col This function loads mixed datasets for a specific region, including individual-level data (genotype, phenotype, covariate data), summary statistics (sumstats, LD), or a combination of both. It runs `load_regional_univariate_data` for each individual-level dataset and `load_rss_data` for each summary statistics dataset. The output is a list with `individual_data` and `sumstat_data` components, where `individual_data` is a list of individual-level data and `sumstat_data` is a list of summary statistics data. -This list is then passed to the `colocboost_analysis_pipeline` function for the colocalization analysis. +This list is then passed to the `colocboost_pipeline` function for the colocalization analysis. Below are the input parameters for this function for loading individual-level data: @@ -201,9 +201,9 @@ The LD metadata file is a tab-separated file with the following columns: ``` -# 2. Perform ColocBoost using `colocboost_analysis_pipeline` function +# 2. Perform ColocBoost using `colocboost_pipeline` function -In this section, we load region data for a combination of individual-level and summary statistics data, then perform the colocalization analysis using the `colocboost_analysis_pipeline` function. +In this section, we load region data for a combination of individual-level and summary statistics data, then perform the colocalization analysis using the `colocboost_pipeline` function. The colocalization analysis can be run in any one of three modes, or in a combination of these modes (names assume that individual-level data is xQTL data and summary statistics data is GWAS data): - **`xQTL-only mode`**: Only perform colocalization analysis on the individual-level data. Summary statistics data is not used. @@ -228,7 +228,7 @@ Example: for sQTL, `list(type_pattern = ".*clu_(\\d+_[+-?]).*", valid_pattern = Outputs: -- **`colocboost_results`**: List of colocboost objects (with `xqtl_coloc`, `joint_gwas`, `separate_gwas`); Output of the `colocboost_analysis_pipeline` function. If the mode is not run, the corresponding element will be `NULL`. +- **`colocboost_results`**: List of colocboost objects (with `xqtl_coloc`, `joint_gwas`, `separate_gwas`); Output of the `colocboost_pipeline` function. If the mode is not run, the corresponding element will be `NULL`. ```{r, colocboost-analysis, eval = FALSE} #### Please check the example code below #### @@ -265,7 +265,7 @@ pip_cutoff_to_skip_sumstat = rep(0, length(sumstat_path_list)) qc_method = "slalom" # run colocboost analysis -colocboost_results <- colocboost_analysis_pipeline( +colocboost_results <- colocboost_pipeline( region_data_combined, maf_cutoff = maf_cutoff, pip_cutoff_to_skip_ind = pip_cutoff_to_skip_ind, @@ -286,4 +286,4 @@ colocboost_plot(colocboost_results$joint_gwas) for (i in 1:length(colocboost_results$separate_gwas)) { colocboost_plot(colocboost_results$separate_gwas[[i]]) } -``` \ No newline at end of file +``` From 11db15185c4a637e0c74c4f32029f0430211d08a Mon Sep 17 00:00:00 2001 From: xueweic Date: Sun, 24 May 2026 16:27:11 +0000 Subject: [PATCH 2/5] Update documentation --- man/colocboost_analysis.Rd | 31 ++++++------ man/colocboost_pipeline.Rd | 20 +++++--- man/qc_individual_data.Rd | 21 +++++--- man/region_data_to_colocboost_input.Rd | 8 +-- man/region_data_to_ind_input.Rd | 8 +-- man/region_data_to_rss_input.Rd | 7 +-- man/summary_stats_qc.Rd | 68 +++++++++++++------------- 7 files changed, 92 insertions(+), 71 deletions(-) diff --git a/man/colocboost_analysis.Rd b/man/colocboost_analysis.Rd index e7b23584..c5249c57 100644 --- a/man/colocboost_analysis.Rd +++ b/man/colocboost_analysis.Rd @@ -1,4 +1,5 @@ % Generated by roxygen2: do not edit by hand +% Please edit documentation in R/colocboost_pipeline.R \name{colocboost_analysis} \alias{colocboost_analysis} \title{ColocBoost analysis with optional pipeline QC} @@ -26,20 +27,21 @@ data inputs such as \code{X}, \code{Y}, \code{sumstat}, \code{LD}, \code{outcome_names}, and all ColocBoost model/post-processing options. QC can only inspect inputs that are supplied by name.} -\item{missing_rate_thresh, maf_cutoff, xvar_cutoff, ld_reference_meta_file, pip_cutoff_to_skip_ind}{Individual-level QC controls. -If all are \code{NULL}, individual-level QC is not run.} +\item{missing_rate_thresh, maf_cutoff, xvar_cutoff, ld_reference_meta_file, pip_cutoff_to_skip_ind}{Individual-level QC controls. If all are \code{NULL}, individual-level QC +is not run.} -\item{keep_indel, pip_cutoff_to_skip_sumstat, qc_method, impute, impute_opts}{Summary-statistic QC controls. -\code{qc_method = "none"} and \code{qc_method = "rss_qc"} run basic allele -harmonization without LD-mismatch outlier detection. Imputation is only run -when \code{impute = TRUE}.} +\item{keep_indel, pip_cutoff_to_skip_sumstat, qc_method, impute, impute_opts}{Summary-statistic QC controls. \code{qc_method = "none"} and +\code{qc_method = "rss_qc"} run basic allele harmonization without +LD-mismatch outlier detection. Imputation is only run when +\code{impute = TRUE}.} \item{LD_reference_info}{Optional LD reference information for summary-statistic QC. This is only needed when the native \code{LD} matrix -row/column names or \code{X_ref} column names are missing or are not parseable -genomic variant IDs. It can be a .bim/.pvar/.pvar.zst file path, a data.frame -with variant metadata, or a \code{load_LD_matrix()} result. This is a QC-only -argument and is not passed to \code{colocboost::colocboost()}.} +row/column names or \code{X_ref} column names are missing or are not +parseable genomic variant IDs. It can be a .bim/.pvar/.pvar.zst file path, +a data.frame with variant metadata, or a \code{load_LD_matrix()} result. +This is a QC-only argument and is not passed to +\code{colocboost::colocboost()}.} \item{variant_convention}{Allele order used by native ColocBoost-style \code{sumstat$variant} and LD/X_ref names when deriving QC inputs: @@ -82,10 +84,11 @@ LD-mismatch QC. RAISS imputation is controlled separately by \code{impute = TRUE}. If no QC controls are supplied, this function is a thin direct call to -\code{colocboost::colocboost(...)}. When QC removes outcomes, -\code{outcome_names} and \code{focal_outcome_idx} are updated to match the -post-QC outcome order. If the requested focal outcome is removed by QC, -\code{focal_outcome_idx} is set to \code{NULL} with a warning. +\code{colocboost::colocboost(...)}. +When QC removes outcomes, \code{outcome_names} and \code{focal_outcome_idx} +are updated to match the post-QC outcome order. If the requested focal outcome +is removed by QC, \code{focal_outcome_idx} is set to \code{NULL} with a +warning. } \examples{ \dontrun{ diff --git a/man/colocboost_pipeline.Rd b/man/colocboost_pipeline.Rd index a708b883..ace84f3f 100644 --- a/man/colocboost_pipeline.Rd +++ b/man/colocboost_pipeline.Rd @@ -34,18 +34,26 @@ colocboost_pipeline( \item{pip_cutoff_to_skip_sumstat}{A vector of cutoff values for skipping analysis based on PIP values for each sumstat Default is 0.} -\item{qc_method}{Quality control method to use. Options are "none", "rss_qc", "slalom", or "dentist". \code{NULL} and \code{"rss_qc"} are treated as \code{"none"} for basic-only summary-stat preprocessing.} +\item{qc_method}{Quality control method to use. Options are "none", +"rss_qc", "slalom", or "dentist". \code{NULL} and \code{"rss_qc"} +are treated as \code{"none"} for basic-only summary-stat preprocessing.} \item{impute}{Logical; if TRUE, performs imputation for outliers identified in the analysis (default: TRUE).} \item{impute_opts}{A list of imputation options including rcond, R2_threshold, and minimum_ld (default: list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5)).} - -\item{...}{Additional arguments passed to the ColocBoost analysis calls.} } \value{ -A list containing \code{xqtl_coloc}, \code{joint_gwas}, \code{separate_gwas}, -and \code{computing_time}. Analyses that are not requested or cannot be run -return \code{NULL} in their corresponding result slot. +A list containing the individual_data and sumstat_data after QC: +individual_data contains the following components if exist +\itemize{ + \item Y: A list of residualized phenotype values for all tasks. + \item X: A list of residualized genotype matrices all tasks. +} +sumstat_data contains the following components if exist +\itemize{ + \item sumstats: A list of summary statistics f or the matched LD_info, each sublist contains sumstats, n, var_y from \code{load_rss_data}. + \item LD_info: A list of LD information, each sublist contains LD_variants, LD_matrix, ref_panel \code{load_LD_matrix}. +} } \description{ This function performs protocol-level multi-trait colocalization using diff --git a/man/qc_individual_data.Rd b/man/qc_individual_data.Rd index 1f1e070f..677b7ba4 100644 --- a/man/qc_individual_data.Rd +++ b/man/qc_individual_data.Rd @@ -1,4 +1,5 @@ % Generated by roxygen2: do not edit by hand +% Please edit documentation in R/colocboost_pipeline.R \name{qc_individual_data} \alias{qc_individual_data} \title{Run reusable individual-level QC} @@ -19,22 +20,30 @@ qc_individual_data( } \arguments{ \item{X}{Genotype matrix or named list of genotype matrices.} + \item{Y}{Phenotype vector/matrix or named list of phenotype matrices.} + \item{maf}{Optional MAF vector or named list.} + \item{X_variance}{Optional variant variance vector or named list.} + \item{missing_rate_thresh}{Maximum missing genotype rate.} + \item{maf_cutoff}{Minimum MAF cutoff.} + \item{xvar_cutoff}{Minimum genotype variance cutoff.} + \item{ld_reference_meta_file}{Optional LD reference metadata file.} -\item{keep_indel}{Whether indel variants are kept during LD-reference filtering.} + +\item{keep_indel}{Whether indel variants are kept during LD-reference +filtering.} + \item{pip_cutoff_to_skip}{Optional single-effect PIP cutoff.} -\item{context}{Optional context label for matrix inputs.} } \value{ -A named list of cleaned context-level \code{X}/\code{Y} records, or one -cleaned record for matrix inputs. +A named list of cleaned context-level \code{X}/\code{Y} records, or + one cleaned record for matrix inputs. } \description{ -Runs individual-level preprocessing by reusing existing pecotmr genotype -filtering helpers and optional single-effect screening. +Run reusable individual-level QC } diff --git a/man/region_data_to_colocboost_input.Rd b/man/region_data_to_colocboost_input.Rd index b49f57dc..3f2f50dc 100644 --- a/man/region_data_to_colocboost_input.Rd +++ b/man/region_data_to_colocboost_input.Rd @@ -1,4 +1,5 @@ % Generated by roxygen2: do not edit by hand +% Please edit documentation in R/colocboost_pipeline.R \name{region_data_to_colocboost_input} \alias{region_data_to_colocboost_input} \title{Convert loaded regional data to ColocBoost inputs} @@ -9,10 +10,9 @@ region_data_to_colocboost_input(region_data) \item{region_data}{A list returned by \code{load_multitask_regional_data()}.} } \value{ -A structured list containing \code{colocboost_input}, \code{qc_input}, and -\code{source_info}. +A structured list containing \code{colocboost_input}, + \code{qc_input}, and \code{source_info}. } \description{ -Builds direct ColocBoost input records plus QC input records from loaded -regional data. +Convert loaded regional data to ColocBoost inputs } diff --git a/man/region_data_to_ind_input.Rd b/man/region_data_to_ind_input.Rd index f3ee98e4..b1af764b 100644 --- a/man/region_data_to_ind_input.Rd +++ b/man/region_data_to_ind_input.Rd @@ -1,4 +1,5 @@ % Generated by roxygen2: do not edit by hand +% Please edit documentation in R/file_utils.R \name{region_data_to_ind_input} \alias{region_data_to_ind_input} \title{Convert loaded regional data to individual-level inputs} @@ -9,10 +10,9 @@ region_data_to_ind_input(region_data) \item{region_data}{A list returned by \code{load_multitask_regional_data()}.} } \value{ -A list containing \code{X}, \code{Y}, \code{maf}, \code{X_variance}, and -source information. +A list containing \code{X}, \code{Y}, \code{maf}, + \code{X_variance}, and source information. } \description{ -Extracts individual-level matrices from loaded regional data without running -QC or model fitting. +Convert loaded regional data to individual-level inputs } diff --git a/man/region_data_to_rss_input.Rd b/man/region_data_to_rss_input.Rd index b6006c01..62e17229 100644 --- a/man/region_data_to_rss_input.Rd +++ b/man/region_data_to_rss_input.Rd @@ -1,4 +1,5 @@ % Generated by roxygen2: do not edit by hand +% Please edit documentation in R/file_utils.R \name{region_data_to_rss_input} \alias{region_data_to_rss_input} \title{Convert loaded regional data to RSS inputs} @@ -9,9 +10,9 @@ region_data_to_rss_input(region_data) \item{region_data}{A list returned by \code{load_multitask_regional_data()}.} } \value{ -A list containing named RSS inputs, matched LD data, and source information. +A list containing named RSS inputs, matched LD data, and source + information. } \description{ -Extracts loaded RSS summary-statistic records and their matched LD/reference -data without running QC or model fitting. +Convert loaded regional data to RSS inputs } diff --git a/man/summary_stats_qc.Rd b/man/summary_stats_qc.Rd index 139c2aa7..4ba39714 100644 --- a/man/summary_stats_qc.Rd +++ b/man/summary_stats_qc.Rd @@ -40,21 +40,21 @@ combined RSS/ColocBoost QC workflow. A single RSS record is detected by structure, not by the name \code{sumstats}, so a multi-study list may safely include a study named \code{"sumstats"}.} -\item{keep_indel, skip_region, pip_cutoff_to_skip, qc_method, impute, impute_opts, study, var_y, return_on_skip, R_finite, R_mismatch}{Additional controls for the combined RSS/ColocBoost QC workflow. -They are ignored by the historical LD-mismatch-only call unless -\code{rss_input} or combined-QC options are supplied.} +\item{keep_indel, skip_region, pip_cutoff_to_skip, qc_method, impute, impute_opts, study, var_y, return_on_skip, R_finite, R_mismatch}{Additional controls for the combined RSS/ColocBoost QC workflow. They are +ignored by the historical LD-mismatch-only call unless \code{rss_input} or +combined-QC options are supplied.} } \value{ -A list containing the quality-controlled summary statistics and updated LD -matrix for the historical call: -\itemize{ - \item sumstats: The quality-controlled summary statistics data frame. - \item LD_mat: The updated LD matrix after quality control. - \item outlier_number: The number of outlier variants removed. -} -When \code{rss_input} or combined-QC controls are supplied, returns a cleaned -RSS/LD record for one RSS record, or a named list of records for a list of RSS -records. +A list containing the quality-controlled summary statistics and + updated LD matrix for the historical call: + \itemize{ + \item sumstats: The quality-controlled summary statistics data frame. + \item LD_mat: The updated LD matrix after quality control. + \item outlier_number: The number of outlier variants removed. + } + When \code{rss_input} or combined-QC controls are supplied, returns a + cleaned RSS/LD record for one RSS record, or a named list of records for a + list of RSS records. } \description{ This function performs quality control on the processed summary statistics @@ -62,29 +62,29 @@ using the specified method. It wraps \code{\link{ld_mismatch_qc}} and handles subsetting of the summary statistics and LD matrix. } \details{ -This function applies the specified quality control method to the processed -summary statistics via \code{\link{ld_mismatch_qc}}, then subsets the summary -statistics and LD matrix to keep only non-outlier variants. +This function applies the specified quality control method to the + processed summary statistics via \code{\link{ld_mismatch_qc}}, then subsets + the summary statistics and LD matrix to keep only non-outlier variants. -As an additional workflow for ColocBoost/RSS pipelines, callers may supply -\code{rss_input} or combined-QC controls. That path first runs -\code{rss_basic_qc()}, optional PIP screening, optional LD-mismatch QC through -this same function, and optional RAISS imputation. The combined path normalizes -one or many RSS records to the same loop internally; only true single-record -input is unwrapped on return. -\code{qc_method = NULL}, \code{"none"}, and \code{"rss_qc"} all mean -basic-only summary-stat preprocessing without SLALOM/DENTIST outlier QC. + As an additional workflow for ColocBoost/RSS pipelines, callers may supply + \code{rss_input} or combined-QC controls. That path first runs + \code{rss_basic_qc()}, optional PIP screening, optional LD-mismatch QC + through this same function, and optional RAISS imputation. The combined + path normalizes one or many RSS records to the same loop internally; only + true single-record input is unwrapped on return. + \code{qc_method = NULL}, \code{"none"}, and \code{"rss_qc"} all mean + basic-only summary-stat preprocessing without SLALOM/DENTIST outlier QC. -When the combined path receives genotype-backed reference data -(\code{X_ref}), basic harmonization avoids constructing an LD matrix. PIP -screening uses an LD-independent single-effect screen with -\code{susie_rss(R = diag(p), L = 1, max_iter = 1)}, and LD-mismatch QC -computes only the filtered local correlation matrix required by -SLALOM/DENTIST. RAISS imputation temporarily centers/scales genotype-backed -\code{X_ref} before using the whole-region genotype/SVD path. This avoids -LD-block partition artifacts while matching the LD scale used by -\code{compute_LD(X_ref)}. Final ColocBoost calls still keep the original -\code{X_ref} as the reference input. + When the combined path receives genotype-backed reference data + (\code{X_ref}), basic harmonization avoids constructing an LD matrix. PIP + screening uses an LD-independent single-effect screen with + \code{susie_rss(R = diag(p), L = 1, max_iter = 1)}, and LD-mismatch QC + computes only the filtered local correlation matrix required by + SLALOM/DENTIST. RAISS imputation temporarily centers/scales + genotype-backed \code{X_ref} before using the whole-region genotype/SVD + path. This avoids LD-block partition artifacts while matching the LD scale + used by \code{compute_LD(X_ref)}. Final ColocBoost calls still keep the + original \code{X_ref} as the reference input. } \examples{ # Perform SLALOM quality control (default) From 52e04673c436638db0de1d9ed38441f5d9378e5c Mon Sep 17 00:00:00 2001 From: xuewei cao <36172337+xueweic@users.noreply.github.com> Date: Sun, 24 May 2026 16:10:37 -0500 Subject: [PATCH 3/5] Remove rss_qc from summary-stat QC methods Remove the legacy rss_qc alias so summary_stats_qc only accepts none, slalom, and dentist. --- R/colocboost_pipeline.R | 16 ++++++++-------- R/sumstats_qc.R | 9 ++++----- R/univariate_pipeline.R | 4 ++-- man/colocboost_analysis.Rd | 12 ++++++------ man/colocboost_pipeline.Rd | 4 +--- man/rss_analysis_pipeline.Rd | 4 ++-- man/summary_stats_qc.Rd | 16 ++++++++-------- tests/testthat/test_sumstats_qc.R | 14 ++++++++++++++ 8 files changed, 45 insertions(+), 34 deletions(-) diff --git a/R/colocboost_pipeline.R b/R/colocboost_pipeline.R index 61c4874d..63300b28 100644 --- a/R/colocboost_pipeline.R +++ b/R/colocboost_pipeline.R @@ -116,8 +116,8 @@ region_data_to_colocboost_input <- function(region_data) { #' \code{pip_cutoff_to_skip_sumstat}, \code{impute = TRUE}, or #' \code{LD_reference_info} is supplied and named \code{sumstat} plus either #' \code{LD}, \code{X_ref}, or \code{LD_reference_info} are available. -#' \code{qc_method = "none"} and \code{qc_method = "rss_qc"} mean run basic -#' allele/variant harmonization only; they do not run SLALOM/DENTIST +#' \code{qc_method = "none"} means run basic allele/variant harmonization +#' only; it does not run SLALOM/DENTIST #' LD-mismatch QC. RAISS imputation is controlled separately by #' \code{impute = TRUE}. #' @@ -137,8 +137,8 @@ region_data_to_colocboost_input <- function(region_data) { #' Individual-level QC controls. If all are \code{NULL}, individual-level QC #' is not run. #' @param keep_indel,pip_cutoff_to_skip_sumstat,qc_method,impute,impute_opts -#' Summary-statistic QC controls. \code{qc_method = "none"} and -#' \code{qc_method = "rss_qc"} run basic allele harmonization without +#' Summary-statistic QC controls. \code{qc_method = "none"} runs +#' basic allele harmonization without #' LD-mismatch outlier detection. Imputation is only run when #' \code{impute = TRUE}. #' @param LD_reference_info Optional LD reference information for @@ -343,8 +343,8 @@ colocboost_analysis <- function(..., #' @param pip_cutoff_to_skip_ind A vector of cutoff values for skipping analysis based on PIP values for each context. Default is 0. #' @param pip_cutoff_to_skip_sumstat A vector of cutoff values for skipping analysis based on PIP values for each sumstat Default is 0. #' @param qc_method Quality control method to use. Options are "none", -#' "rss_qc", "slalom", or "dentist". \code{NULL} and \code{"rss_qc"} -#' are treated as \code{"none"} for basic-only summary-stat preprocessing. +#' "slalom", or "dentist". \code{NULL} is treated as \code{"none"} for +#' basic-only summary-stat preprocessing. #' @param impute Logical; if TRUE, performs imputation for outliers identified in the analysis (default: TRUE). #' @param impute_opts A list of imputation options including rcond, R2_threshold, and minimum_ld (default: list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5)). #' @@ -640,8 +640,8 @@ colocboost_pipeline <- function( #' @param pip_cutoff_to_skip_ind A vector of cutoff values for skipping individual contexts. #' @param pip_cutoff_to_skip_sumstat A vector of cutoff values for skipping summary-stat studies. #' @param qc_method Quality control method to use. Options are "none", -#' "rss_qc", "slalom", or "dentist". \code{NULL} and \code{"rss_qc"} -#' are treated as \code{"none"} for basic-only summary-stat preprocessing. +#' "slalom", or "dentist". \code{NULL} is treated as \code{"none"} for +#' basic-only summary-stat preprocessing. #' @param impute Logical; if TRUE, performs imputation when required metadata are available. #' @param impute_opts A list of imputation options. #' @return A list containing post-QC \code{individual_data} and \code{sumstat_data}. diff --git a/R/sumstats_qc.R b/R/sumstats_qc.R index aca54aea..f6d4928d 100644 --- a/R/sumstats_qc.R +++ b/R/sumstats_qc.R @@ -149,8 +149,7 @@ ld_mismatch_qc <- function(zScore, R = NULL, X = NULL, nSample = NULL, .resolve_summary_qc_method <- function(qc_method) { if (is.null(qc_method)) return("none") - qc_method <- match.arg(qc_method, c("none", "rss_qc", "slalom", "dentist")) - if (identical(qc_method, "rss_qc")) "none" else qc_method + match.arg(qc_method, c("none", "slalom", "dentist")) } #' Perform Quality Control on Summary Statistics @@ -196,8 +195,8 @@ ld_mismatch_qc <- function(zScore, R = NULL, X = NULL, nSample = NULL, #' through this same function, and optional RAISS imputation. The combined #' path normalizes one or many RSS records to the same loop internally; only #' true single-record input is unwrapped on return. -#' \code{qc_method = NULL}, \code{"none"}, and \code{"rss_qc"} all mean -#' basic-only summary-stat preprocessing without SLALOM/DENTIST outlier QC. +#' \code{qc_method = NULL} and \code{"none"} mean basic-only summary-stat +#' preprocessing without SLALOM/DENTIST outlier QC. #' #' When the combined path receives genotype-backed reference data #' (\code{X_ref}), basic harmonization avoids constructing an LD matrix. PIP @@ -520,7 +519,7 @@ summary_stats_qc <- function(sumstats, LD_data, n = NULL, } outlier_number <- 0L - if (!is.null(qc_method) && !qc_method %in% c("none", "rss_qc")) { + if (!is.null(qc_method) && !identical(qc_method, "none")) { message("QC track: running ", qc_method, " LD-mismatch QC for summary-stat study ", study, ".") qc <- summary_stats_qc(sumstats = sumstats, LD_data = ld_data_with_local_R(sumstats), n = n, method = qc_method) diff --git a/R/univariate_pipeline.R b/R/univariate_pipeline.R index f92b70ff..5bc2227b 100644 --- a/R/univariate_pipeline.R +++ b/R/univariate_pipeline.R @@ -204,7 +204,7 @@ load_study_LD <- function(ld_path, region) { #' @param region_name_col Column to filter for extract_region_name. #' @param qc_method Summary-statistic QC method. \code{"slalom"} and #' \code{"dentist"} run basic allele harmonization plus LD-mismatch QC; -#' \code{"rss_qc"} and \code{"none"} run basic allele harmonization only. +#' \code{"none"} runs basic allele harmonization only. #' @param finemapping_method One of "susie_rss", "single_effect", "bayesian_conditional_regression". #' @param finemapping_opts List of fine-mapping options (L, L_greedy, coverage, #' signal_cutoff, min_abs_corr). @@ -227,7 +227,7 @@ rss_analysis_pipeline <- function( sumstat_path, column_file_path, LD_data, n_sample = 0, n_case = 0, n_control = 0, region = NULL, skip_region = NULL, extract_region_name = NULL, region_name_col = NULL, - qc_method = c("slalom", "dentist", "rss_qc", "none"), + qc_method = c("slalom", "dentist", "none"), finemapping_method = c("susie_rss", "single_effect", "bayesian_conditional_regression"), finemapping_opts = list( L = 20, L_greedy = 5, diff --git a/man/colocboost_analysis.Rd b/man/colocboost_analysis.Rd index c5249c57..cc88741b 100644 --- a/man/colocboost_analysis.Rd +++ b/man/colocboost_analysis.Rd @@ -30,10 +30,10 @@ QC can only inspect inputs that are supplied by name.} \item{missing_rate_thresh, maf_cutoff, xvar_cutoff, ld_reference_meta_file, pip_cutoff_to_skip_ind}{Individual-level QC controls. If all are \code{NULL}, individual-level QC is not run.} -\item{keep_indel, pip_cutoff_to_skip_sumstat, qc_method, impute, impute_opts}{Summary-statistic QC controls. \code{qc_method = "none"} and -\code{qc_method = "rss_qc"} run basic allele harmonization without -LD-mismatch outlier detection. Imputation is only run when -\code{impute = TRUE}.} +\item{keep_indel, pip_cutoff_to_skip_sumstat, qc_method, impute, impute_opts}{Summary-statistic QC controls. +\code{qc_method = "none"} runs basic allele harmonization without LD-mismatch +outlier detection. Imputation is only run +when \code{impute = TRUE}.} \item{LD_reference_info}{Optional LD reference information for summary-statistic QC. This is only needed when the native \code{LD} matrix @@ -78,8 +78,8 @@ is non-\code{NULL} and named \code{X} and \code{Y} inputs are available in \code{pip_cutoff_to_skip_sumstat}, \code{impute = TRUE}, or \code{LD_reference_info} is supplied and named \code{sumstat} plus either \code{LD}, \code{X_ref}, or \code{LD_reference_info} are available. -\code{qc_method = "none"} and \code{qc_method = "rss_qc"} mean run basic -allele/variant harmonization only; they do not run SLALOM/DENTIST +\code{qc_method = "none"} means run basic allele/variant harmonization only; +it does not run SLALOM/DENTIST LD-mismatch QC. RAISS imputation is controlled separately by \code{impute = TRUE}. diff --git a/man/colocboost_pipeline.Rd b/man/colocboost_pipeline.Rd index ace84f3f..c06c0373 100644 --- a/man/colocboost_pipeline.Rd +++ b/man/colocboost_pipeline.Rd @@ -34,9 +34,7 @@ colocboost_pipeline( \item{pip_cutoff_to_skip_sumstat}{A vector of cutoff values for skipping analysis based on PIP values for each sumstat Default is 0.} -\item{qc_method}{Quality control method to use. Options are "none", -"rss_qc", "slalom", or "dentist". \code{NULL} and \code{"rss_qc"} -are treated as \code{"none"} for basic-only summary-stat preprocessing.} +\item{qc_method}{Quality control method to use. Options are "none", "slalom", or "dentist". \code{NULL} is treated as \code{"none"} for basic-only summary-stat preprocessing.} \item{impute}{Logical; if TRUE, performs imputation for outliers identified in the analysis (default: TRUE).} diff --git a/man/rss_analysis_pipeline.Rd b/man/rss_analysis_pipeline.Rd index df410a0e..d34d351e 100644 --- a/man/rss_analysis_pipeline.Rd +++ b/man/rss_analysis_pipeline.Rd @@ -15,7 +15,7 @@ rss_analysis_pipeline( skip_region = NULL, extract_region_name = NULL, region_name_col = NULL, - qc_method = c("slalom", "dentist", "rss_qc", "none"), + qc_method = c("slalom", "dentist", "none"), finemapping_method = c("susie_rss", "single_effect", "bayesian_conditional_regression"), finemapping_opts = list(L = 20, L_greedy = 5, coverage = c(0.95, 0.7, 0.5), signal_cutoff = 0.025, min_abs_corr = 0.8), @@ -56,7 +56,7 @@ require a correlation matrix.} \item{qc_method}{Summary-statistic QC method. \code{"slalom"} and \code{"dentist"} run basic allele harmonization plus LD-mismatch QC; -\code{"rss_qc"} and \code{"none"} run basic allele harmonization only.} +\code{"none"} runs basic allele harmonization only.} \item{finemapping_method}{One of "susie_rss", "single_effect", "bayesian_conditional_regression".} diff --git a/man/summary_stats_qc.Rd b/man/summary_stats_qc.Rd index 4ba39714..d43c9d22 100644 --- a/man/summary_stats_qc.Rd +++ b/man/summary_stats_qc.Rd @@ -66,14 +66,14 @@ This function applies the specified quality control method to the processed summary statistics via \code{\link{ld_mismatch_qc}}, then subsets the summary statistics and LD matrix to keep only non-outlier variants. - As an additional workflow for ColocBoost/RSS pipelines, callers may supply - \code{rss_input} or combined-QC controls. That path first runs - \code{rss_basic_qc()}, optional PIP screening, optional LD-mismatch QC - through this same function, and optional RAISS imputation. The combined - path normalizes one or many RSS records to the same loop internally; only - true single-record input is unwrapped on return. - \code{qc_method = NULL}, \code{"none"}, and \code{"rss_qc"} all mean - basic-only summary-stat preprocessing without SLALOM/DENTIST outlier QC. +As an additional workflow for ColocBoost/RSS pipelines, callers may supply +\code{rss_input} or combined-QC controls. That path first runs +\code{rss_basic_qc()}, optional PIP screening, optional LD-mismatch QC through +this same function, and optional RAISS imputation. The combined path normalizes +one or many RSS records to the same loop internally; only true single-record +input is unwrapped on return. +\code{qc_method = NULL} and \code{"none"} mean basic-only summary-stat +preprocessing without SLALOM/DENTIST outlier QC. When the combined path receives genotype-backed reference data (\code{X_ref}), basic harmonization avoids constructing an LD matrix. PIP diff --git a/tests/testthat/test_sumstats_qc.R b/tests/testthat/test_sumstats_qc.R index 366596f0..a7152f42 100644 --- a/tests/testthat/test_sumstats_qc.R +++ b/tests/testthat/test_sumstats_qc.R @@ -472,6 +472,20 @@ test_that("summary_stats_qc treats NULL qc_method as basic-only none", { expect_equal(nrow(result$rss_input$sumstats), nrow(td$sumstats)) }) +test_that("summary_stats_qc rejects invalid qc_method values", { + td <- make_test_sumstats_ld(n_variants = 5) + rss_input <- list(sumstats = td$sumstats, n = 1000, var_y = 1) + + expect_error( + summary_stats_qc( + rss_input = rss_input, + LD_data = td$LD_data, + qc_method = "bad_method" + ), + "should be one of" + ) +}) + test_that("summary_stats_qc LD-mismatch QC computes only filtered local LD from X_ref", { td <- make_test_sumstats_ld(n_variants = 5) X_ref <- matrix(rnorm(50), 10, 5) From 20464f4953351bf4b852d8dcef962048e2d2ad66 Mon Sep 17 00:00:00 2001 From: xueweic Date: Sun, 24 May 2026 21:24:31 +0000 Subject: [PATCH 4/5] Update documentation --- man/colocboost_analysis.Rd | 12 ++++++------ man/colocboost_pipeline.Rd | 4 +++- man/summary_stats_qc.Rd | 16 ++++++++-------- 3 files changed, 17 insertions(+), 15 deletions(-) diff --git a/man/colocboost_analysis.Rd b/man/colocboost_analysis.Rd index cc88741b..054a20bc 100644 --- a/man/colocboost_analysis.Rd +++ b/man/colocboost_analysis.Rd @@ -30,10 +30,10 @@ QC can only inspect inputs that are supplied by name.} \item{missing_rate_thresh, maf_cutoff, xvar_cutoff, ld_reference_meta_file, pip_cutoff_to_skip_ind}{Individual-level QC controls. If all are \code{NULL}, individual-level QC is not run.} -\item{keep_indel, pip_cutoff_to_skip_sumstat, qc_method, impute, impute_opts}{Summary-statistic QC controls. -\code{qc_method = "none"} runs basic allele harmonization without LD-mismatch -outlier detection. Imputation is only run -when \code{impute = TRUE}.} +\item{keep_indel, pip_cutoff_to_skip_sumstat, qc_method, impute, impute_opts}{Summary-statistic QC controls. \code{qc_method = "none"} runs +basic allele harmonization without +LD-mismatch outlier detection. Imputation is only run when +\code{impute = TRUE}.} \item{LD_reference_info}{Optional LD reference information for summary-statistic QC. This is only needed when the native \code{LD} matrix @@ -78,8 +78,8 @@ is non-\code{NULL} and named \code{X} and \code{Y} inputs are available in \code{pip_cutoff_to_skip_sumstat}, \code{impute = TRUE}, or \code{LD_reference_info} is supplied and named \code{sumstat} plus either \code{LD}, \code{X_ref}, or \code{LD_reference_info} are available. -\code{qc_method = "none"} means run basic allele/variant harmonization only; -it does not run SLALOM/DENTIST +\code{qc_method = "none"} means run basic allele/variant harmonization +only; it does not run SLALOM/DENTIST LD-mismatch QC. RAISS imputation is controlled separately by \code{impute = TRUE}. diff --git a/man/colocboost_pipeline.Rd b/man/colocboost_pipeline.Rd index c06c0373..2db58dd2 100644 --- a/man/colocboost_pipeline.Rd +++ b/man/colocboost_pipeline.Rd @@ -34,7 +34,9 @@ colocboost_pipeline( \item{pip_cutoff_to_skip_sumstat}{A vector of cutoff values for skipping analysis based on PIP values for each sumstat Default is 0.} -\item{qc_method}{Quality control method to use. Options are "none", "slalom", or "dentist". \code{NULL} is treated as \code{"none"} for basic-only summary-stat preprocessing.} +\item{qc_method}{Quality control method to use. Options are "none", +"slalom", or "dentist". \code{NULL} is treated as \code{"none"} for +basic-only summary-stat preprocessing.} \item{impute}{Logical; if TRUE, performs imputation for outliers identified in the analysis (default: TRUE).} diff --git a/man/summary_stats_qc.Rd b/man/summary_stats_qc.Rd index d43c9d22..b20f8cc8 100644 --- a/man/summary_stats_qc.Rd +++ b/man/summary_stats_qc.Rd @@ -66,14 +66,14 @@ This function applies the specified quality control method to the processed summary statistics via \code{\link{ld_mismatch_qc}}, then subsets the summary statistics and LD matrix to keep only non-outlier variants. -As an additional workflow for ColocBoost/RSS pipelines, callers may supply -\code{rss_input} or combined-QC controls. That path first runs -\code{rss_basic_qc()}, optional PIP screening, optional LD-mismatch QC through -this same function, and optional RAISS imputation. The combined path normalizes -one or many RSS records to the same loop internally; only true single-record -input is unwrapped on return. -\code{qc_method = NULL} and \code{"none"} mean basic-only summary-stat -preprocessing without SLALOM/DENTIST outlier QC. + As an additional workflow for ColocBoost/RSS pipelines, callers may supply + \code{rss_input} or combined-QC controls. That path first runs + \code{rss_basic_qc()}, optional PIP screening, optional LD-mismatch QC + through this same function, and optional RAISS imputation. The combined + path normalizes one or many RSS records to the same loop internally; only + true single-record input is unwrapped on return. + \code{qc_method = NULL} and \code{"none"} mean basic-only summary-stat + preprocessing without SLALOM/DENTIST outlier QC. When the combined path receives genotype-backed reference data (\code{X_ref}), basic harmonization avoids constructing an LD matrix. PIP From d86da9905a90876b0b4480db25720385b8f65f14 Mon Sep 17 00:00:00 2001 From: Daniel Nachun Date: Sun, 24 May 2026 19:47:47 -0500 Subject: [PATCH 5/5] fix unit test --- tests/testthat/test_univariate_pipeline.R | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/tests/testthat/test_univariate_pipeline.R b/tests/testthat/test_univariate_pipeline.R index cb32abee..1e3ac5ad 100644 --- a/tests/testthat/test_univariate_pipeline.R +++ b/tests/testthat/test_univariate_pipeline.R @@ -955,7 +955,8 @@ test_that("rss: pip_cutoff_to_skip > 0, no signal => early return", { sumstat_path = "/fake/sumstats.tsv", column_file_path = "/fake/columns.yml", LD_data = list(), - pip_cutoff_to_skip = 0.5 + pip_cutoff_to_skip = 0.5, + qc_method = "none" ), "Skipping follow-up" ) @@ -1012,7 +1013,8 @@ test_that("rss: negative pip_cutoff_to_skip auto-computes threshold", { sumstat_path = "/fake/sumstats.tsv", column_file_path = "/fake/columns.yml", LD_data = list(), - pip_cutoff_to_skip = -1 + pip_cutoff_to_skip = -1, + qc_method = "none" ), "Skipping follow-up" )