A statistical analysis framework for characterizing fundamental information-theoretic limits on the relationship between radiology imaging phenotypes and genomic data.
This project investigates the theoretical upper and lower bounds on how much predictive information flows between:
- Imaging phenotypes — features derived from CT and MRI studies (radiomic descriptors, deep features, or structured reads)
- Genomic alterations — somatic mutations, copy number variants (CNV), and germline SNPs
Rather than optimizing a specific predictor, the goal is to quantify the fundamental limits imposed by information theory: how much mutual information exists between these modalities, what rate-distortion trade-offs apply when compressing one modality to predict the other, and where the predictive ceiling lies regardless of model choice.
- What is the mutual information between imaging-derived features and genomic alteration profiles?
- What are the theoretical upper bounds on genomics → imaging (and imaging → genomics) prediction accuracy?
- How do data quantity, feature dimensionality, and noise affect the achievable information transfer?
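One standard way to turn a mutual-information value into a model-independent accuracy ceiling is Fano's inequality, which bounds the error of any classifier of a discrete label `Y` from features `X` through `H(Y|X) = H(Y) - I(X;Y)`. The sketch below is illustrative only: `fano_accuracy_ceiling` is a hypothetical helper, not part of the `rgit` package, and it assumes a discrete label (for example, mutated vs. wild-type for one gene) with all quantities in bits.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def fano_accuracy_ceiling(mi_bits: float, label_entropy_bits: float, n_classes: int) -> float:
    """Upper bound on the accuracy of any predictor of Y from X (hypothetical helper).

    Fano's inequality: H(Y|X) <= h_b(P_e) + P_e * log2(n_classes - 1).
    We find the smallest error probability P_e compatible with it and
    return 1 - P_e.
    """
    if n_classes < 2:
        raise ValueError("n_classes must be at least 2")
    cond_entropy = label_entropy_bits - mi_bits  # H(Y|X) = H(Y) - I(X;Y)
    if cond_entropy <= 0.0:
        return 1.0  # the features determine the label; no ceiling below 1
    if cond_entropy > math.log2(n_classes) + 1e-12:
        raise ValueError("H(Y|X) cannot exceed log2(n_classes)")

    def fano_rhs(p_err: float) -> float:
        return binary_entropy(p_err) + p_err * math.log2(n_classes - 1)

    # fano_rhs is increasing on [0, 1 - 1/n_classes], so bisection finds the
    # smallest error probability whose right-hand side reaches H(Y|X).
    lo, hi = 0.0, 1.0 - 1.0 / n_classes
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if fano_rhs(mid) < cond_entropy:
            lo = mid
        else:
            hi = mid
    return 1.0 - hi


if __name__ == "__main__":
    # Balanced binary label (H(Y) = 1 bit) with 0.5 bits of mutual information.
    print(f"accuracy ceiling: {fano_accuracy_ceiling(0.5, 1.0, 2):.3f}")
```

With a balanced binary label and 0.5 bits of mutual information, the ceiling comes out at roughly 0.89; with zero mutual information it falls to chance (0.5). In practice the mutual information itself is an estimate, so the ceiling inherits its uncertainty.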
```
.
├── data/            # Raw and processed datasets (not committed)
├── notebooks/       # Exploratory analysis and figure generation
├── rgit/            # Core Python package
├── scripts/         # Standalone analysis scripts
└── pyproject.toml   # Python project configuration
```
The analysis draws on:
- Mutual information estimation — non-parametric (KSG, MINE) and parametric estimators for continuous and mixed-type variables
- Rate-distortion theory — characterizing the minimum description length of one modality needed to predict the other at a given fidelity
- Data processing inequality — bounding information loss through feature extraction pipelines
- Finite-sample corrections — bias correction and bootstrap confidence intervals for MI estimates in high-dimensional settings
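As a concrete reference point for the estimators listed above, here is a minimal NumPy sketch of the first KSG (Kraskov, Stögbauer, Grassberger) k-nearest-neighbour estimator, checked against the closed-form mutual information of a bivariate Gaussian. It is not the implementation used in `rgit`: the names `ksg_mi`, `digamma_int`, and `gaussian_mi` are hypothetical, distances are computed by brute force (fine for about a thousand samples, not for large cohorts), and the result is in nats.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma function at positive integers: psi(n) = -gamma + H_{n-1}."""
    n = np.asarray(n, dtype=int)
    harmonic = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, n.max()))])
    return -EULER_GAMMA + harmonic[n - 1]


def ksg_mi(x, y, k=3):
    """KSG estimator (algorithm 1) of I(X;Y) in nats, using max-norm distances."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    # Pairwise max-norm distances within each marginal space.
    dx = np.abs(x[:, None, :] - x[None, :, :]).max(axis=2)
    dy = np.abs(y[:, None, :] - y[None, :, :]).max(axis=2)
    np.fill_diagonal(dx, np.inf)  # exclude each point from its own neighbours
    np.fill_diagonal(dy, np.inf)
    dz = np.maximum(dx, dy)  # max-norm distance in the joint space
    # Distance from each point to its k-th nearest neighbour in the joint space.
    eps = np.partition(dz, k - 1, axis=1)[:, k - 1]
    # Number of marginal neighbours strictly closer than eps.
    nx = (dx < eps[:, None]).sum(axis=1)
    ny = (dy < eps[:, None]).sum(axis=1)
    return float(
        digamma_int(k) + digamma_int(n)
        - np.mean(digamma_int(nx + 1) + digamma_int(ny + 1))
    )


def gaussian_mi(rho):
    """Exact I(X;Y) in nats for a bivariate Gaussian with correlation rho."""
    return -0.5 * np.log(1.0 - rho**2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rho = 0.8
    sample = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=1000)
    print("KSG estimate:", ksg_mi(sample[:, 0], sample[:, 1]))
    print("closed form :", gaussian_mi(rho))
```

Comparing an estimator against a distribution with known mutual information, as done here, is a cheap way to see its finite-sample bias before trusting it on high-dimensional imaging or genomic features.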
```bash
conda create -n rgit -y python=3.10 && conda activate rgit
pip install -e ".[processing,notebooks,dev]"
```

Python 3.10+ is recommended. Dependencies are declared in `pyproject.toml`.

```bash
cd rgit_lean && ./check.sh
```
## Usage
Analysis notebooks are in `notebooks/`. Reusable estimation utilities live in the `rgit/` package. Batch scripts for large-scale runs are in `scripts/`.
### Reproducing the analysis without the notebook
The full recoverability pipeline that `notebooks/radiogenomic_recoverability.ipynb`
orchestrates is also packaged as importable code, so it can be reproduced after a
plain `pip install rgit` — no notebook, and (for the synthetic case) no data on disk.
With no `.h5ad` files it runs the synthetic probabilistic-CCA dataset with known
ground truth, writing `stats.json` and figures to the output directory:
```bash
# console script (installed with the package)
rgit-recoverability --output-dir out/synthetic

# real data
rgit-recoverability \
    --genomics data/tcga_kirc/genomics/mutated_genes.h5ad \
    --imaging data/tcga_kirc/imaging/tumor_radimagenet.h5ad \
    --genomics-data-type variant \
    --output-dir out/kirc
```

Or from Python:

```python
import rgit

# synthetic ground-truth run (nothing needed on disk)
report = rgit.run_recoverability_analysis(
    rgit.RecoverabilityConfig(output_dir="out/synthetic")
)
print(report.stats["effective_identifiable_rank"]["rank"])

# real cohort
report = rgit.run_recoverability_analysis(
    genomics_h5ad="data/.../mutated_genes.h5ad",
    imaging_h5ad="data/.../tumor_radimagenet.h5ad",
    genomics_data_type="variant",
    output_dir="out/kirc",
)
```

Every knob the notebook exposes lives on `rgit.RecoverabilityConfig`; the resolved
config is written next to `stats.json` for provenance. Only the core dependencies
are required: without scanpy, HVG selection falls back to variance ranking, and the
RadImageNet feature extractor (torch) is loaded lazily, so neither package is needed
to run the recoverability analysis itself.
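To illustrate why a synthetic probabilistic-CCA dataset comes with known ground truth, the sketch below builds the population covariances of a standard probabilistic CCA model (shared latent `z`, linear loadings, isotropic Gaussian noise) and derives the canonical correlations and the exact mutual information from them. This is an assumption about the model family only: the generator inside `rgit` may be parameterized differently, and every function name here is hypothetical.

```python
import numpy as np


def pcca_population_covariances(w_x, w_y, noise_var_x, noise_var_y):
    """Covariances implied by x = W_x z + e_x, y = W_y z + e_y with z ~ N(0, I)."""
    cov_xx = w_x @ w_x.T + noise_var_x * np.eye(w_x.shape[0])
    cov_yy = w_y @ w_y.T + noise_var_y * np.eye(w_y.shape[0])
    cov_xy = w_x @ w_y.T
    return cov_xx, cov_yy, cov_xy


def canonical_correlations(cov_xx, cov_yy, cov_xy):
    """Singular values of the whitened cross-covariance L_x^{-1} S_xy L_y^{-T}."""
    chol_x = np.linalg.cholesky(cov_xx)
    chol_y = np.linalg.cholesky(cov_yy)
    whitened = np.linalg.solve(chol_x, cov_xy)
    whitened = np.linalg.solve(chol_y, whitened.T).T
    return np.linalg.svd(whitened, compute_uv=False)


def gaussian_mi_from_correlations(rho):
    """Exact I(X;Y) in nats for jointly Gaussian X, Y: -1/2 * sum(log(1 - rho_i^2))."""
    rho = np.clip(rho, 0.0, 1.0 - 1e-12)
    return float(-0.5 * np.sum(np.log1p(-rho**2)))


def sample_pcca(w_x, w_y, noise_var_x, noise_var_y, n_samples, rng):
    """Draw paired samples (x, y) from the model above."""
    z = rng.standard_normal((n_samples, w_x.shape[1]))
    x = z @ w_x.T + np.sqrt(noise_var_x) * rng.standard_normal((n_samples, w_x.shape[0]))
    y = z @ w_y.T + np.sqrt(noise_var_y) * rng.standard_normal((n_samples, w_y.shape[0]))
    return x, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_x = rng.standard_normal((6, 2))  # 6 "imaging" features, 2 shared latents
    w_y = rng.standard_normal((5, 2))  # 5 "genomic" features, 2 shared latents
    covs = pcca_population_covariances(w_x, w_y, 1.0, 1.0)
    rho = canonical_correlations(*covs)
    print("canonical correlations:", np.round(rho, 3))
    print("ground-truth MI (nats):", gaussian_mi_from_correlations(rho))
```

Under this model at most as many canonical correlations are nonzero as there are shared latent dimensions, so both the shared rank and the mutual information are known exactly and an estimate computed from `sample_pcca` output can be scored against them.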
```bash
papermill notebooks/radiogenomic_recoverability.ipynb notebooks/out/radiogenomic_recoverability_output_synthetic.ipynb  # synthetic data
```
```bash
# TCGA-KIRC: imaging feature matrices
python scripts/process_imaging_tcga_kirc.py -d data/tcga_kirc/imaging

python scripts/make_imaging_matrix.py \
    -o data/tcga_kirc/imaging/organ_radiomics.h5ad \
    -m data/tcga_kirc/imaging/metadata.csv \
    --mask_col organ_mask --embedder pyradiomics

python scripts/make_imaging_matrix.py \
    -o data/tcga_kirc/imaging/tumor_radiomics.h5ad \
    -m data/tcga_kirc/imaging/metadata.csv \
    --mask_col tumor_mask --embedder pyradiomics --label 2

python scripts/make_imaging_matrix.py \
    -o data/tcga_kirc/imaging/whole_radimagenet.h5ad \
    -m data/tcga_kirc/imaging/metadata.csv \
    --embedder radimagenet --model_path data/models/RadImageNet_pytorch/ResNet50.pt \
    --clip_min -200 --clip_max 300 --resample_spacing 0.8,0.8,3.0 \
    --apply_mask --crop_size 625,625,200

python scripts/make_imaging_matrix.py \
    -o data/tcga_kirc/imaging/organ_radimagenet.h5ad \
    -m data/tcga_kirc/imaging/metadata.csv \
    --mask_col organ_mask \
    --embedder radimagenet --model_path data/models/RadImageNet_pytorch/ResNet50.pt \
    --clip_min -200 --clip_max 300 --resample_spacing 0.8,0.8,3.0 \
    --apply_mask --crop_size 185,185,75

python scripts/make_imaging_matrix.py \
    -o data/tcga_kirc/imaging/tumor_radimagenet.h5ad \
    -m data/tcga_kirc/imaging/metadata.csv \
    --mask_col tumor_mask \
    --embedder radimagenet --model_path data/models/RadImageNet_pytorch/ResNet50.pt \
    --clip_min -200 --clip_max 300 --resample_spacing 0.8,0.8,3.0 \
    --apply_mask --crop_size 185,185,75 --label 2
```
```bash
# TCGA-KIRC: mutation matrices from the MC3 MAF
wget -O data/tcga_kirc/genomics/mc3.v0.2.8.PUBLIC.maf.gz https://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc

python scripts/make_genomics_matrix.py \
    -o data/tcga_kirc/genomics/mutated_genes.h5ad \
    --dataset tcga --feature gene_symbol \
    --patient_ids data/tcga_kirc/imaging/metadata.csv \
    data/tcga_kirc/genomics/mc3.v0.2.8.PUBLIC.maf.gz

python scripts/make_genomics_matrix.py \
    -o data/tcga_kirc/genomics/mutated_pathways.h5ad \
    --dataset tcga --feature pathway \
    --patient_ids data/tcga_kirc/imaging/metadata.csv \
    data/tcga_kirc/genomics/mc3.v0.2.8.PUBLIC.maf.gz

# TCGA-KIRC: gene expression matrix
gdc-client download -m data/tcga_kirc/genomics/gene_expression_manifest.txt -d data/tcga_kirc/genomics
tar -xzvf data/tcga_kirc/genomics/gene_expression.tar.gz -C data/tcga_kirc/genomics/gene_expression

python scripts/make_genomics_matrix.py \
    -o data/tcga_kirc/genomics/gene_expression.h5ad \
    --dataset tcga --feature gene_expression \
    --patient_ids data/tcga_kirc/imaging/metadata.csv \
    --filename_to_patientid data/tcga_kirc/genomics/gene_expression_filename_to_patientid.csv \
    data/tcga_kirc/genomics/gene_expression
```
```bash
for genomics_h5ad in data/tcga_kirc/genomics/*.h5ad; do
    for imaging_h5ad in data/tcga_kirc/imaging/*radimagenet.h5ad; do
        echo "Running recoverability analysis for genomics: $genomics_h5ad and imaging: $imaging_h5ad"
        output_notebook="notebooks/out/radiogenomic_recoverability_output_tcga_kirc_genomics
```
```bash
# NSCLC Radiogenomics: downloads
wget -O data/nsclc/imaging/manifest.tcia https://www.cancerimagingarchive.net/wp-content/uploads/NSCLC_Radiogenomics-6-1-21-Version-4.tcia
wget -O data/nsclc/imaging/metadata.xlsx https://www.cancerimagingarchive.net/wp-content/uploads/NSCLC_Radiogenomics-6-1-21-Version-4-nbia-digest.xlsx
wget -O data/nsclc/genomics/gene_expression.txt.gz https://ftp.ncbi.nlm.nih.gov/geo/series/GSE103nnn/GSE103584/suppl/GSE103584%5FR01%5FNSCLC%5FRNAseq%2Etxt%2Egz
nbia-data-retriever --cli data/nsclc/imaging/manifest.tcia -d data/nsclc/imaging/dicom -v -f

# NSCLC Radiogenomics: imaging and genomics matrices
python scripts/make_imaging_matrix.py \
    -o data/nsclc/imaging/organ_radiomics.h5ad \
    -m data/nsclc/imaging/metadata.csv \
    --mask_col organ_mask --embedder pyradiomics --label 1

python scripts/make_genomics_matrix.py \
    -o data/nsclc/genomics/gene_expression.h5ad \
    --dataset nsclc --feature gene_expression \
    data/nsclc/genomics/gene_expression.txt.gz
```
```bash
# ADNI: imaging and gene expression
python scripts/process_imaging_adni.py

python scripts/make_genomics_matrix.py \
    -o data/adni/genomics/gene_expression.h5ad \
    --dataset adni --feature gene_expression \
    data/adni/genomics/ADNI_Gene_Expression_Profile.csv

# ADNI: variant matrix (the absolute paths below are machine-specific; adjust them to your setup)
python /home/jrich/Desktop/rgit-private/scripts/make_adni_variant_matrix.py \
    --inputs /home/jrich/Desktop/rgit-private/data/adni/genomics/WGS_Omni2.5M_20140220 \
    --reference /home/jrich/data/reference/gatk_grch37/Homo_sapiens_assembly19.fasta \
    --data-sources /home/jrich/data/reference/gatk_grch37/funcotator_dataSources.v1.8.hg19.20230908g \
    --out-maf /home/jrich/Desktop/rgit-private/data/adni/genomics/WGS_Omni2.5M_20140220/genotype.maf.gz \
    --out /home/jrich/Desktop/rgit-private/data/adni/genomics/genotype.h5ad \
    --filter-impact --groupby-gene
```
Early-stage research repository. Methods and structure are under active development.