Official implementation of "Differentiable Gene Set Enrichment Analysis for Pathway-Level Supervision in Transcriptomic Learning" (Li et al., 2026).
Transcriptomic prediction models are trained with gene-level objectives (MSE, Pearson correlation) but evaluated via pathway-level statistics such as Gene Set Enrichment Analysis (GSEA). Because of this objective–functional mismatch, pathway-level conclusions become unreliable when gene-level predictions are imperfect.
dGSEA bridges this gap by constructing a differentiable surrogate of classical GSEA that maps predicted gene-level scores to pathway enrichment with stable gradients. It replaces three non-differentiable operations with principled continuous relaxations:
| Classical GSEA | dGSEA |
|---|---|
| Hard ranking | Temperature-controlled soft ranking |
| Discrete prefix accumulation | Smooth sigmoid prefix kernel |
| Extremum selection | Softmax-weighted aggregation |
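
As a rough illustration of how the three relaxations in the table compose, the sketch below computes a soft enrichment score in plain Python next to the classical one. It is not the implementation shipped in this repository: the function names, the temperature parameters (`tau_rank`, `tau_prefix`, `beta`), and the exact form of each relaxation are assumptions chosen for readability, and the pairwise ranking keeps the naive O(G²) cost. It also assumes the gene set contains at least one hit with a nonzero score and at least one miss.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def step_weights(scores, in_set):
    """Per-gene increments of the GSEA running sum (hits go up, misses go down)."""
    hit_total = sum(abs(s) for s, m in zip(scores, in_set) if m)
    n_miss = sum(1 for m in in_set if not m)
    return [abs(s) / hit_total if m else -1.0 / n_miss
            for s, m in zip(scores, in_set)]


def hard_es(scores, in_set):
    """Classical enrichment score: signed extremum of the running sum."""
    steps = step_weights(scores, in_set)
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    running, best = 0.0, 0.0
    for i in order:
        running += steps[i]
        if abs(running) > abs(best):
            best = running
    return best


def soft_es(scores, in_set, tau_rank=0.1, tau_prefix=0.1, beta=20.0):
    """Hypothetical smooth surrogate of hard_es (illustrative only)."""
    n = len(scores)
    steps = step_weights(scores, in_set)
    # 1) Soft ranking: rank 1 is the largest score; pairwise sigmoids replace sorting.
    ranks = [1.0 + sum(sigmoid((sj - si) / tau_rank)
                       for j, sj in enumerate(scores) if j != i)
             for i, si in enumerate(scores)]
    # 2) Sigmoid prefix kernel: soft indicator that gene i falls in the top-k prefix.
    running = [sum(w * sigmoid((k + 0.5 - r) / tau_prefix)
                   for w, r in zip(steps, ranks))
               for k in range(1, n + 1)]
    # 3) Softmax-weighted aggregation: smooth stand-in for the signed extremum.
    logits = [beta * abs(v) for v in running]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, running))
```

With small temperatures and a large `beta`, `soft_es` approaches `hard_es`; larger temperatures trade fidelity to the classical score for smoother gradients.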
Sign-specific robust permutation normalization (dNES) preserves the statistical semantics of the classical normalized enrichment score. A Nyström–windowing approximation (nyswin) reduces the naive O(G²) complexity in the number of genes G to near-linear, making genome-scale use inside training loops practical.
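
To make "sign-specific" concrete, the sketch below scales an enrichment score by a summary of the permutation-null scores that share its sign. Classical GSEA divides by the mean of the same-sign null scores; using the median as the robust statistic here is an assumption, as are the function name and the `eps` guard, so treat this as an illustration rather than the dNES definition from the paper.

```python
import statistics


def sign_specific_normalize(es, null_es, eps=1e-12):
    """Scale `es` by a robust summary of null scores with the same sign.

    Assumption: the median of |null| over same-sign permutations is the scale.
    """
    if es >= 0:
        same_sign = [v for v in null_es if v >= 0]
    else:
        same_sign = [-v for v in null_es if v < 0]
    if not same_sign:
        # No same-sign permutations available: leave the score unnormalized.
        return es
    scale = statistics.median(same_sign)
    return es / max(scale, eps)
```

Normalizing positive and negative scores against separate null distributions keeps the two tails comparable even when the null is asymmetric.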
When used as an auxiliary objective for SMILES-to-transcriptome prediction on LINCS L1000, dGSEA improves pathway-level agreement without sacrificing gene-level fidelity:
| Metric | Baseline | + dGSEA |
|---|---|---|
| Macro pathway correlation | 0.257 | 0.306 (+19%) |
| Sign accuracy | 0.620 | 0.641 (+3.4%) |
| Pathway MSE | 1.784 | 1.610 (−9.8%) |
| Mean gene Pearson r | 0.449 | 0.452 |
| Gene RMSE | 0.420 | 0.418 |
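
The pathway-level rows of the table can be read through the sketch below. The metric definitions are assumptions made for illustration (per-pathway Pearson correlation averaged over pathways, sign agreement over all sample-pathway pairs, and mean squared error over the same pairs); the paper's exact definitions may differ. `relative_change` reproduces the percentages shown in the last column.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def pathway_metrics(pred, ref):
    """pred, ref: lists of rows (samples) by columns (pathways) of enrichment scores."""
    n_pathways = len(ref[0])
    cols_pred = [[row[j] for row in pred] for j in range(n_pathways)]
    cols_ref = [[row[j] for row in ref] for j in range(n_pathways)]
    # Macro average: correlate each pathway separately, then average.
    macro_corr = sum(pearson(p, r) for p, r in zip(cols_pred, cols_ref)) / n_pathways
    pairs = [(p, r) for prow, rrow in zip(pred, ref) for p, r in zip(prow, rrow)]
    sign_acc = sum(1 for p, r in pairs if (p >= 0) == (r >= 0)) / len(pairs)
    mse = sum((p - r) ** 2 for p, r in pairs) / len(pairs)
    return {"macro_corr": macro_corr, "sign_accuracy": sign_acc, "pathway_mse": mse}


def relative_change(baseline, new):
    """Fractional change relative to the baseline, e.g. 0.19 for +19%."""
    return (new - baseline) / baseline
```

For example, `relative_change(0.257, 0.306)` is about 0.19, matching the +19% reported for macro pathway correlation.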
git clone https://github.com/LeeShuaiyu/dgsea-paper-code
cd dgsea-paper-code
pip install -r requirements.txt

Requirements: Python ≥ 3.9, PyTorch ≥ 2.0, RDKit. See requirements.txt for the full dependency list.
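
A quick way to confirm the environment before running anything is sketched below. It is not a script from this repository; the helper names are hypothetical, and `torch` and `rdkit` are the usual import names for PyTorch and RDKit. It checks only that the modules can be located, not their versions.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def python_ok(minimum=(3, 9)):
    """True if the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python >= 3.9:", python_ok())
    print("Missing:", missing_modules(["torch", "rdkit"]) or "none")
```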
LINCS L1000 (Level-5 signatures, 978 landmark genes):
# Download from GEO accession GSE92742
# Expected file: data/GSE92742_Broad_LINCS_Level5_COMPZ.MODZ_n473647x12328.gctx

Reproduce the compound-level aggregation used in the paper (N = 10,554 compounds after SMILES filtering and replicate averaging):
python scripts/prepare_data.py \
    --gctx-path /path/to/GSE92742.gctx \
    --out data/lincs_compound_level.parquet

ChemBERTa encoder: downloaded automatically from HuggingFace Hub (seyonec/ChemBERTa-zinc-base-v1) on first run. To use a local copy, pass --model-dir /path/to/ChemBERTa.
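
For orientation, the SMILES filtering and replicate averaging mentioned above amount to a group-by-and-mean over signatures. The sketch below shows that idea on plain Python tuples; the record layout, the function name, and the use of an unweighted arithmetic mean are assumptions, and `scripts/prepare_data.py` may differ in its details.

```python
from collections import defaultdict


def aggregate_by_compound(records):
    """Average replicate signatures per compound.

    records: iterable of (compound_id, smiles, signature) tuples, where
    signature is a list of per-gene values. Records without a SMILES
    string are dropped before averaging.
    """
    grouped = defaultdict(list)
    for compound_id, smiles, signature in records:
        if smiles:  # skip missing or empty SMILES
            grouped[compound_id].append(signature)
    return {
        cid: [sum(col) / len(sigs) for col in zip(*sigs)]
        for cid, sigs in grouped.items()
    }
```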
Algorithm sanity check (no data required, runs in < 1 min):
python scripts/smoke_test.py

Expected output: dES values consistent with Table 3 (mean relative error vs. nyswin < 0.2%), Spearman(dNES, NES) > 0.85 across all synthetic scenarios.
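
The two acceptance criteria above are a mean relative error and a Spearman rank correlation. A minimal sketch of both is given below in case you want to check your own outputs against them; the function names are hypothetical, which array counts as the reference is an assumption, and this is not the code used by `scripts/smoke_test.py`.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


def mean_relative_error(approx, reference):
    """Mean of |approx - reference| / |reference| (reference must be nonzero)."""
    return sum(abs(a - r) / abs(r) for a, r in zip(approx, reference)) / len(reference)
```

A run would meet the stated thresholds if `mean_relative_error(...) < 0.002` and `spearman(dnes, nes) > 0.85`.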
Full training reproduction:
python scripts/run_training_repro.py \
    --data-path data/lincs_compound_level.parquet \
    --model-dir /path/to/ChemBERTa \
    --seed 42

Target: conclusion-level consistency with Tables 4–5. Checkpoints will not be bitwise identical because of GPU nondeterminism. Training takes approximately 2–3 hours on a single A100.
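
If you adapt the training code to your own scripts, a seeding helper along the following lines is a common starting point. It is a sketch, not the seeding logic of `run_training_repro.py`: the function name is hypothetical, and NumPy and PyTorch are seeded only if they happen to be installed. Fixing seeds narrows run-to-run variation but does not remove nondeterminism in some GPU kernels, which is why bitwise-identical checkpoints are not expected.

```python
import os
import random


def seed_everything(seed):
    """Seed the available RNGs and return the names of the libraries seeded."""
    seeded = ["random"]
    random.seed(seed)
    # Affects hash randomization only in Python processes started afterwards.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
        seeded.append("torch")
    except ImportError:
        pass
    return seeded
```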
dgsea-paper-code/
├── dgse/ # Core library
│ ├── functional.py # dES, dNES, nyswin (Algorithms 1–3)
│ ├── loss.py # Hybrid training objective (Eq. 7–9)
│ └── normalize.py # Sign-specific robust permutation normalization
├── scripts/
│ ├── smoke_test.py # Algorithm sanity check
│ ├── prepare_data.py # LINCS L1000 preprocessing
│ └── run_training_repro.py # Full training reproduction
├── configs/ # Hyperparameter configs used in the paper
├── figures/
└── CITATION.cff
If you use dGSEA in your work, please cite:
@article{li2026dgsea,
title = {Differentiable Gene Set Enrichment Analysis for Pathway-Level Supervision
in Transcriptomic Learning},
author = {Li, Shuaiyu and Ruan, Yang and Yang, Xinyue and Zhang, Wen and Saigo, Hiroto},
journal = {bioRxiv},
year = {2026},
doi = {10.64898/2026.03.18.712610}
}

Supported by JSPS KAKENHI Grant Number JP23H03356.
License: MIT


