Radiogenomic Information-Theoretic Bounds

A statistical analysis framework for characterizing fundamental information-theoretic limits on the relationship between radiology imaging phenotypes and genomic data.

Overview

This project investigates the theoretical upper and lower bounds on how much predictive information flows between:

  • Imaging phenotypes — features derived from CT and MRI studies (radiomic descriptors, deep features, or structured reads)
  • Genomic alterations — somatic mutations, copy number variants (CNV), and germline SNPs

Rather than optimizing a specific predictor, the goal is to quantify the fundamental limits imposed by information theory: how much mutual information exists between these modalities, what rate-distortion trade-offs apply when compressing one modality to predict the other, and where the predictive ceiling lies regardless of model choice.
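For reference, the three quantities named above have standard textbook forms. The notation here is an assumption made for illustration, not the project's own: $X$ is an imaging feature vector, $G$ is a genomic profile taking values in a finite alphabet $\mathcal{G}$, $d$ is a distortion measure, and $P_e$ is the error probability of any predictor of $G$ from $X$. Entropies in the last line are in bits.

```latex
\begin{align}
I(X;G) &= \mathbb{E}_{p(x,g)}\!\left[\log \frac{p(x,g)}{p(x)\,p(g)}\right]
        = H(G) - H(G \mid X), \\
R(D)   &= \min_{p(\hat{x} \mid x)\,:\ \mathbb{E}[d(X,\hat{X})] \le D} I(X;\hat{X}), \\
P_e    &\ge \frac{H(G) - I(X;G) - 1}{\log_2 |\mathcal{G}|}.
\end{align}
```

The third line is the weak form of Fano's inequality. It is one way to read "predictive ceiling": no model, however flexible, can push the error rate below a floor set by $I(X;G)$.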

Research Questions

  1. What is the mutual information between imaging-derived features and genomic alteration profiles?
  2. What are the theoretical upper bounds on genomics → imaging (and imaging → genomics) prediction accuracy?
  3. How do data quantity, feature dimensionality, and noise affect the achievable information transfer?
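As a toy illustration of the third question, the sketch below uses a scalar Gaussian channel, where the mutual information is known in closed form, and shows how a plug-in estimate and its bootstrap interval behave as sample size and noise change. Everything in it (the function names, the additive-noise model, the Gaussian plug-in estimator) is an assumption chosen for the example and is not taken from the `rgit` package.

```python
import numpy as np


def gaussian_mi(rho):
    """MI in nats between two jointly Gaussian scalars with correlation rho."""
    return -0.5 * np.log1p(-rho ** 2)


def true_mi(noise_sd):
    """Ground-truth MI for y = x + noise with x ~ N(0, 1), noise ~ N(0, noise_sd^2)."""
    # rho^2 = 1 / (1 + noise_sd^2), hence I = 0.5 * ln(1 + 1 / noise_sd^2).
    return 0.5 * np.log1p(1.0 / noise_sd ** 2)


def simulate(n, noise_sd, seed=0):
    """Draw n samples from the toy additive-noise channel."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = x + noise_sd * rng.standard_normal(n)
    return x, y


def bootstrap_mi_ci(x, y, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Gaussian plug-in MI estimate."""
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample pairs with replacement
        r = np.corrcoef(x[idx], y[idx])[0, 1]
        estimates[b] = gaussian_mi(r)
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Compare interval width at two sample sizes for the same noise level.
for n in (100, 2000):
    x, y = simulate(n, noise_sd=1.0)
    lo, hi = bootstrap_mi_ci(x, y)
    print(f"n={n}: CI=({lo:.3f}, {hi:.3f}), truth={true_mi(1.0):.3f}")
```

The Gaussian plug-in estimator is only appropriate when the joint distribution is close to Gaussian. It is used here because it makes the ground truth exact.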

Repository Structure

.
├── data/           # Raw and processed datasets (not committed)
├── notebooks/      # Exploratory analysis and figure generation
├── rgit/           # Core Python package
├── scripts/        # Standalone analysis scripts
└── pyproject.toml  # Python project configuration

Methods

The analysis draws on:

  • Mutual information estimation — non-parametric (KSG, MINE) and parametric estimators for continuous and mixed-type variables
  • Rate-distortion theory — characterizing the minimum description length of one modality needed to predict the other at a given fidelity
  • Data processing inequality — bounding information loss through feature extraction pipelines
  • Finite-sample corrections — bias correction and bootstrap confidence intervals for MI estimates in high-dimensional settings
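To make the first item concrete, here is a minimal NumPy-only sketch of the KSG (Kraskov-Stögbauer-Grassberger, algorithm 1) estimator. It is not the `rgit` implementation: the function name, the brute-force O(N²) distance computation, and the default `k=3` are assumptions for illustration, and it assumes continuous data without tied samples.

```python
import numpy as np


def ksg_mutual_information(x, y, k=3):
    """KSG (algorithm 1) estimate of I(X;Y) in nats, brute-force version."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = x.shape[0]

    # Pairwise Chebyshev (max-norm) distances in each marginal space.
    dx = np.abs(x[:, None, :] - x[None, :, :]).max(axis=2)
    dy = np.abs(y[:, None, :] - y[None, :, :]).max(axis=2)

    # Joint-space distance is the max of the marginal distances.
    dz = np.maximum(dx, dy)
    np.fill_diagonal(dz, np.inf)  # exclude each point from its own neighbours

    # Distance from every point to its k-th nearest neighbour in joint space.
    eps = np.sort(dz, axis=1)[:, k - 1]

    # Marginal neighbours strictly closer than eps; subtract 1 for the point itself.
    # The clamp guards against tied samples, which this sketch does not handle well.
    nx = np.maximum((dx < eps[:, None]).sum(axis=1) - 1, 0)
    ny = np.maximum((dy < eps[:, None]).sum(axis=1) - 1, 0)

    # Digamma at positive integers: psi(m) = H_{m-1} - gamma.
    harmonic = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, n))))

    def psi(m):
        return harmonic[np.asarray(m) - 1] - np.euler_gamma

    return float(psi(k) + psi(n) - np.mean(psi(nx + 1) + psi(ny + 1)))


# Sanity check against a correlated Gaussian pair, where the truth is known.
rng = np.random.default_rng(0)
rho = 0.8
x = rng.standard_normal(500)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(500)
print("KSG estimate:", ksg_mutual_information(x, y))
print("closed form :", -0.5 * np.log(1 - rho ** 2))
```

For real cohorts a tree-based neighbour search would replace the dense distance matrices. The estimate can also come out slightly negative for nearly independent variables, which is expected finite-sample behaviour, not a bug.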

Setup

conda create -n rgit -y python=3.10 && conda activate rgit
pip install -e ".[processing,notebooks,dev]"

Python 3.10+ is recommended. Dependencies are declared in pyproject.toml.

Checking math

```bash
cd rgit_lean && ./check.sh
```


## Usage

Analysis notebooks are in `notebooks/`. Reusable estimation utilities live in the `rgit/` package. Batch scripts for large-scale runs are in `scripts/`.

### Reproducing the analysis without the notebook

The full recoverability pipeline that `notebooks/radiogenomic_recoverability.ipynb`
orchestrates is also packaged as importable code, so it can be reproduced after a
plain `pip install rgit` — no notebook, and (for the synthetic case) no data on disk.

With no `.h5ad` files it runs the synthetic probabilistic-CCA dataset with known
ground truth, writing `stats.json` and figures to the output directory:

```bash
# console script (installed with the package)
rgit-recoverability --output-dir out/synthetic

# real data
rgit-recoverability \
    --genomics data/tcga_kirc/genomics/mutated_genes.h5ad \
    --imaging  data/tcga_kirc/imaging/tumor_radimagenet.h5ad \
    --genomics-data-type variant \
    --output-dir out/kirc
```

Or from Python:

```python
import rgit

# synthetic ground-truth run (nothing needed on disk)
report = rgit.run_recoverability_analysis(
    rgit.RecoverabilityConfig(output_dir="out/synthetic")
)
print(report.stats["effective_identifiable_rank"]["rank"])

# real cohort
report = rgit.run_recoverability_analysis(
    genomics_h5ad="data/.../mutated_genes.h5ad",
    imaging_h5ad="data/.../tumor_radimagenet.h5ad",
    genomics_data_type="variant",
    output_dir="out/kirc",
)
```

Every option the notebook exposes is a field on `rgit.RecoverabilityConfig`, and the resolved config is written next to `stats.json` for provenance. Only the core dependencies are required: when scanpy is not installed, HVG selection falls back to variance ranking, and the RadImageNet feature extractor (torch) is loaded lazily, so neither package is needed to run the recoverability analysis itself.
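To illustrate what "known ground truth" can mean for a probabilistic-CCA dataset, the sketch below builds the population covariance of a two-view linear-Gaussian model and computes the exact mutual information two ways. It is not the generator shipped in `rgit`: the model (`x = W_x z + noise`, `y = W_y z + noise`, `z ~ N(0, I)`), the function names, and the dimensions are all assumptions for the example.

```python
import numpy as np


def pcca_population_cov(w_x, w_y, noise_x=1.0, noise_y=1.0):
    """Covariance blocks of x = W_x z + e_x and y = W_y z + e_y, z ~ N(0, I).

    noise_x and noise_y are isotropic noise variances (hypothetical defaults).
    """
    s_xx = w_x @ w_x.T + noise_x * np.eye(w_x.shape[0])
    s_yy = w_y @ w_y.T + noise_y * np.eye(w_y.shape[0])
    s_xy = w_x @ w_y.T
    return s_xx, s_yy, s_xy


def gaussian_mi_from_cov(s_xx, s_yy, s_xy):
    """Exact I(X;Y) in nats for jointly Gaussian X, Y via log-determinants."""
    joint = np.block([[s_xx, s_xy], [s_xy.T, s_yy]])

    def logdet(m):
        return np.linalg.slogdet(m)[1]

    return 0.5 * (logdet(s_xx) + logdet(s_yy) - logdet(joint))


def canonical_correlations(s_xx, s_yy, s_xy):
    """Population canonical correlations from the whitened cross-covariance."""
    l_x = np.linalg.cholesky(s_xx)
    l_y = np.linalg.cholesky(s_yy)
    # Whitened cross-covariance: L_x^{-1} S_xy L_y^{-T}.
    m = np.linalg.solve(l_x, s_xy)
    m = np.linalg.solve(l_y, m.T).T
    return np.linalg.svd(m, compute_uv=False)


# Two shared latent dimensions feeding a 6-d view and a 5-d view.
rng = np.random.default_rng(1)
w_x = rng.standard_normal((6, 2))
w_y = rng.standard_normal((5, 2))
blocks = pcca_population_cov(w_x, w_y)
rho = canonical_correlations(*blocks)

print("canonical correlations:", np.round(rho, 4))
print("MI via log-determinants:", gaussian_mi_from_cov(*blocks))
print("MI via canonical corrs :", -0.5 * np.sum(np.log1p(-rho ** 2)))
```

In this model at most as many canonical correlations are nonzero as there are shared latent dimensions, so both the mutual information and the shared rank are known exactly before any estimator is run.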

Synthetic data

papermill notebooks/radiogenomic_recoverability.ipynb notebooks/out/radiogenomic_recoverability_output_synthetic.ipynb # synthetic data

Real data (TCGA-KIRC example)

Imaging data processing

```bash
python scripts/process_imaging_tcga_kirc.py -d data/tcga_kirc/imaging

python scripts/make_imaging_matrix.py \
    -o data/tcga_kirc/imaging/organ_radiomics.h5ad \
    -m data/tcga_kirc/imaging/metadata.csv \
    --mask_col organ_mask --embedder pyradiomics

python scripts/make_imaging_matrix.py \
    -o data/tcga_kirc/imaging/tumor_radiomics.h5ad \
    -m data/tcga_kirc/imaging/metadata.csv \
    --mask_col tumor_mask --embedder pyradiomics --label 2

python scripts/make_imaging_matrix.py \
    -o data/tcga_kirc/imaging/whole_radimagenet.h5ad \
    -m data/tcga_kirc/imaging/metadata.csv \
    --embedder radimagenet \
    --model_path data/models/RadImageNet_pytorch/ResNet50.pt \
    --clip_min -200 --clip_max 300 \
    --resample_spacing 0.8,0.8,3.0 \
    --apply_mask --crop_size 625,625,200

python scripts/make_imaging_matrix.py \
    -o data/tcga_kirc/imaging/organ_radimagenet.h5ad \
    -m data/tcga_kirc/imaging/metadata.csv \
    --mask_col organ_mask --embedder radimagenet \
    --model_path data/models/RadImageNet_pytorch/ResNet50.pt \
    --clip_min -200 --clip_max 300 \
    --resample_spacing 0.8,0.8,3.0 \
    --apply_mask --crop_size 185,185,75

python scripts/make_imaging_matrix.py \
    -o data/tcga_kirc/imaging/tumor_radimagenet.h5ad \
    -m data/tcga_kirc/imaging/metadata.csv \
    --mask_col tumor_mask --embedder radimagenet \
    --model_path data/models/RadImageNet_pytorch/ResNet50.pt \
    --clip_min -200 --clip_max 300 \
    --resample_spacing 0.8,0.8,3.0 \
    --apply_mask --crop_size 185,185,75 --label 2
```

Genomics data processing

```bash
wget -O data/tcga_kirc/genomics/mc3.v0.2.8.PUBLIC.maf.gz \
    https://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc

python scripts/make_genomics_matrix.py \
    -o data/tcga_kirc/genomics/mutated_genes.h5ad \
    --dataset tcga --feature gene_symbol \
    --patient_ids data/tcga_kirc/imaging/metadata.csv \
    data/tcga_kirc/genomics/mc3.v0.2.8.PUBLIC.maf.gz

python scripts/make_genomics_matrix.py \
    -o data/tcga_kirc/genomics/mutated_pathways.h5ad \
    --dataset tcga --feature pathway \
    --patient_ids data/tcga_kirc/imaging/metadata.csv \
    data/tcga_kirc/genomics/mc3.v0.2.8.PUBLIC.maf.gz

gdc-client download \
    -m data/tcga_kirc/genomics/gene_expression_manifest.txt \
    -d data/tcga_kirc/genomics

tar -xzvf data/tcga_kirc/genomics/gene_expression.tar.gz \
    -C data/tcga_kirc/genomics/gene_expression

python scripts/make_genomics_matrix.py \
    -o data/tcga_kirc/genomics/gene_expression.h5ad \
    --dataset tcga --feature gene_expression \
    --patient_ids data/tcga_kirc/imaging/metadata.csv \
    --filename_to_patientid data/tcga_kirc/genomics/gene_expression_filename_to_patientid.csv \
    data/tcga_kirc/genomics/gene_expression
```

Run notebooks

```bash
for genomics_h5ad in data/tcga_kirc/genomics/*.h5ad; do
    for imaging_h5ad in data/tcga_kirc/imaging/*radimagenet.h5ad; do
        echo "Running recoverability analysis for genomics: $genomics_h5ad and imaging: $imaging_h5ad"
        output_notebook="notebooks/out/radiogenomic_recoverability_output_tcga_kirc_genomics$(basename ${genomics_h5ad%.*})imaging$(basename ${imaging_h5ad%.*}).ipynb"
        papermill notebooks/radiogenomic_recoverability.ipynb "$output_notebook" \
            -p GENOMICS_H5AD "$genomics_h5ad" \
            -p IMAGING_H5AD "$imaging_h5ad"
    done
done
```
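The same sweep can be driven from Python, which makes it easier to inspect the planned runs before executing anything. The sketch below assumes the loop is meant to pair every `.h5ad` genomics matrix with every RadImageNet imaging matrix. The helper names and the output-file naming scheme are hypothetical. `papermill.execute_notebook` is papermill's Python entry point, and the `GENOMICS_H5AD` and `IMAGING_H5AD` parameter names come from the command above.

```python
from pathlib import Path


def plan_runs(genomics_dir, imaging_dir, out_dir, imaging_suffix="radimagenet.h5ad"):
    """List every (genomics, imaging) pairing and its output notebook path."""
    genomics = sorted(Path(genomics_dir).glob("*.h5ad"))
    imaging = sorted(Path(imaging_dir).glob(f"*{imaging_suffix}"))
    runs = []
    for g in genomics:
        for i in imaging:
            # Hypothetical naming scheme: one output notebook per pairing.
            out = Path(out_dir) / f"recoverability_{g.stem}__{i.stem}.ipynb"
            runs.append(
                {
                    "output": str(out),
                    "parameters": {"GENOMICS_H5AD": str(g), "IMAGING_H5AD": str(i)},
                }
            )
    return runs


def execute_runs(runs, notebook="notebooks/radiogenomic_recoverability.ipynb"):
    """Execute each planned run with papermill (imported lazily)."""
    import papermill as pm  # only needed when notebooks are actually executed

    for run in runs:
        print(f"Running {run['parameters']} -> {run['output']}")
        pm.execute_notebook(notebook, run["output"], parameters=run["parameters"])
```

A typical call would be `execute_runs(plan_runs("data/tcga_kirc/genomics", "data/tcga_kirc/imaging", "notebooks/out"))`. Calling `plan_runs` on its own shows which pairings would be executed without touching papermill.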

NSCLC

```bash
wget -O data/nsclc/imaging/manifest.tcia \
    https://www.cancerimagingarchive.net/wp-content/uploads/NSCLC_Radiogenomics-6-1-21-Version-4.tcia
wget -O data/nsclc/imaging/metadata.xlsx \
    https://www.cancerimagingarchive.net/wp-content/uploads/NSCLC_Radiogenomics-6-1-21-Version-4-nbia-digest.xlsx
wget -O data/nsclc/genomics/gene_expression.txt.gz \
    https://ftp.ncbi.nlm.nih.gov/geo/series/GSE103nnn/GSE103584/suppl/GSE103584%5FR01%5FNSCLC%5FRNAseq%2Etxt%2Egz

nbia-data-retriever --cli data/nsclc/imaging/manifest.tcia -d data/nsclc/imaging/dicom -v -f

python scripts/make_imaging_matrix.py \
    -o data/nsclc/imaging/organ_radiomics.h5ad \
    -m data/nsclc/imaging/metadata.csv \
    --mask_col organ_mask --embedder pyradiomics --label 1

python scripts/make_genomics_matrix.py \
    -o data/nsclc/genomics/gene_expression.h5ad \
    --dataset nsclc --feature gene_expression \
    data/nsclc/genomics/gene_expression.txt.gz
```

ADNI

```bash
python scripts/process_imaging_adni.py

python scripts/make_genomics_matrix.py \
    -o data/adni/genomics/gene_expression.h5ad \
    --dataset adni --feature gene_expression \
    data/adni/genomics/ADNI_Gene_Expression_Profile.csv

python /home/jrich/Desktop/rgit-private/scripts/make_adni_variant_matrix.py \
    --inputs /home/jrich/Desktop/rgit-private/data/adni/genomics/WGS_Omni2.5M_20140220 \
    --reference /home/jrich/data/reference/gatk_grch37/Homo_sapiens_assembly19.fasta \
    --data-sources /home/jrich/data/reference/gatk_grch37/funcotator_dataSources.v1.8.hg19.20230908g \
    --out-maf /home/jrich/Desktop/rgit-private/data/adni/genomics/WGS_Omni2.5M_20140220/genotype.maf.gz \
    --out /home/jrich/Desktop/rgit-private/data/adni/genomics/genotype.h5ad \
    --filter-impact --groupby-gene
```

Status

Early-stage research repository. Methods and structure are under active development.
