CPSC 445 (Algorithms in Bioinformatics) Course Project — University of British Columbia
This project systematically evaluates which statistical distributions best model single-cell RNA-seq UMI count data, addressing the "zero-inflation" debate in the field. We compare Poisson, Negative Binomial (NB), Zero-Inflated Poisson (ZIP), and Zero-Inflated Negative Binomial (ZINB) distributions on the PBMC 3k dataset, fitting all 9,107 post-QC genes (not just HVGs) under both no-offset and per-cell library-size offset modes. We complement the univariate analysis with scVI and GLM-PCA latent-variable models under multiple observation likelihoods, including an explicit multinomial family.
- HVG-only fitting is circularly biased: Seurat v3 HVG selection picks genes whose variance exceeds the empirical mean-variance trend, pre-stacking a Poisson-vs-NB BIC contest toward NB (99.1% NB on HVGs vs. 43.6% on all genes)
- Library-size offset shifts the balance further: adding a per-cell log-offset moves ~4 pp from NB to Poisson (56.2% Poisson, 39.8% NB)
- Cell-type conditioning makes Poisson dominant: within Leiden clusters, Poisson wins 86.5% of 58,614 gene–cluster BIC contests; NB 11.2%
- Zero-inflation is negligible: 0.1–0.3% of genes/gene-cluster pairs are flagged at ΔBIC > 10, depending on offset mode
- scVI confirms overdispersion > zero-inflation: ZINB–NB gap is ~10 nats/cell vs. NB–Poisson gap of ~50 nats/cell
- GLM-PCA confirms Poisson–multinomial equivalence: the two families differ by 0.2% in deviance and yield indistinguishable UMAP layouts
├── report.tex # LaTeX report (~10 pages)
├── report.pdf # Compiled report
├── ref.bib # Bibliography (11 references)
├── requirements.txt # Pinned Python dependencies
├── README.md # This file
├── src/
│ ├── config.py # Centralized constants, paths, size-factor computation
│ ├── utils.py # Offset-aware distribution fitters, AIC/BIC, sanity tests
│ ├── 01_data.py # Data acquisition, QC, clustering, cell type annotation
│ ├── 02_fit.py # Per-gene fitting (all 9,107 genes × {noff, offset} modes)
│ ├── 03_select.py # Model selection via BIC, zero-fraction analysis, decile stratification
│ ├── 04_heterogeneity.py # Per-cluster re-fitting under both offset modes
│ ├── 05_scvi.py # scVI comparison (ZINB, NB, Poisson likelihoods)
│ ├── 05b_glmpca.py # GLM-PCA comparison (multinomial, Poisson, NB families)
│ └── 06_figures.py # Publication-quality figure generation (8 figures)
└── data/
├── pbmc3k_processed.h5ad # Processed AnnData object
├── fit_results_all_cells_{noff,offset}.csv # Distribution fitting results
├── model_selection_all_cells_{noff,offset}.csv # Model selection summaries
├── fit_results_per_cluster_{noff,offset}.csv # Per-cluster fitting results
├── model_selection_per_cluster_{noff,offset}.csv # Per-cluster model selection
├── figures/ # All generated figures (PDF + PNG)
├── scvi_results/ # scVI trained models and metrics
└── glmpca_results/ # GLM-PCA factors, UMAPs, and metrics
- Python 3.12 (scanpy ≥ 1.12 requires Python ≥ 3.12)
- Python 3.13 is not supported: scVI crashes due to a multiprocessing/OMP incompatibility in PyTorch Lightning. Use Python 3.12.x.
- Internet connection (Step 1 downloads the PBMC 3k dataset from 10x Genomics via Scanpy)
- ~2 GB disk space for data and models
- GPU is not required; all scripts run on CPU. scVI training (Step 5a) benefits from a CUDA GPU but completes in ~5 min on CPU as well.
- Total pipeline runtime: ~60 minutes on a modern multi-core CPU (Steps 2 and 4 dominate).
# Create virtual environment (using uv, or pip below)
uv venv .venv --python 3.12
source .venv/bin/activate
# Install pinned dependencies
uv pip install -r requirements.txtAlternative (without uv):
python3.12 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txtAlternative (conda):
conda create -y -p .venv python=3.12
conda activate ./.venv
pip install -r requirements.txtScripts are numbered to indicate execution order. Run all scripts from the src/ directory:
cd src/
# Step 1: Download and preprocess PBMC 3k data (~1 min; requires internet)
# -> data/pbmc3k_processed.h5ad
python 01_data.py
# Step 2: Fit 4 distributions to all 9,107 post-QC genes (~12 min, both offset modes)
# -> data/fit_results_all_cells_{noff,offset}.csv
python 02_fit.py
# Step 3: Model selection, zero-fraction analysis, decile stratification
# -> data/model_selection_all_cells_{noff,offset}.csv
python 03_select.py
# Step 4: Per-cluster heterogeneity analysis (~30 min, both offset modes)
# -> data/fit_results_per_cluster_{noff,offset}.csv
# -> data/model_selection_per_cluster_{noff,offset}.csv
python 04_heterogeneity.py
# Step 5a: scVI comparison with 3 likelihoods (~5 min on CPU)
# -> data/scvi_results/
python 05_scvi.py
# Step 5b: GLM-PCA multinomial/Poisson/NB comparison (~12 min)
# -> data/glmpca_results/
python 05b_glmpca.py
# Step 6: Generate all publication figures
# -> data/figures/fig{1..8}_*.{pdf,png}
python 06_figures.pyRun from the project root (not src/):
cd .. # return to project root if still in src/
pdflatex report.tex && bibtex report && pdflatex report.tex && pdflatex report.texAll fits are performed both without offset (noff) and with a per-cell log-library-size offset (offset), where size factor
| Distribution | Parameters | Estimation Method |
|---|---|---|
| Poisson | μ | Closed-form MLE: μ̂ = ΣX / Σs |
| Negative Binomial | μ, φ | Profile likelihood (Brent on log φ) |
| Zero-Inflated Poisson | π₀, μ | statsmodels (BFGS/NM) with log-offset |
| Zero-Inflated NB | π₀, μ, φ | statsmodels (BFGS/NM) with log-offset |
Model selection uses BIC with ΔBIC > 10 threshold for zero-inflation evidence (Kass–Raftery "very strong" scale).
- Svensson (2020). Droplet scRNA-seq is not zero-inflated. Nature Biotechnology.
- Choi et al. (2020). Bayesian model selection reveals biological origins of zero inflation. Genome Biology.
- Kim et al. (2020). Demystifying "drop-outs" in single-cell UMI data. Genome Biology.
- Pan et al. (2023). The Poisson distribution model fits UMI-based scRNA-seq data. BMC Bioinformatics.
- Pierson & Yau (2015). ZIFA: Dimensionality reduction for zero-inflated scRNA-seq. Genome Biology.
- Lopez et al. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods.
- Risso et al. (2018). A general and flexible method for signal extraction from single-cell RNA-seq data. Nature Communications.
- Jiang et al. (2022). Statistics or biology: the zero-inflation controversy about scRNA-seq data. Genome Biology.
- Townes et al. (2019). Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biology.
- Wolf et al. (2018). SCANPY: large-scale single-cell gene expression data analysis. Genome Biology.
- Kass & Raftery (1995). Bayes Factors. Journal of the American Statistical Association.