A pipeline for performing common genetic epidemiology tasks at the University of Bristol.
Goals:
- Implement common steps in GWAS investigations to make research reproducible and more efficient
- Reusable pipelines that can be utilised on different projects
- Shared code and steps that can be updated according to the latest knowledge and practices
There are six Snakemake pipelines (grouped as two tables of three). Each pipeline is a .smk file under snakemake/; see PIPELINES.md for YAML inputs and parameters.
| standardise_gwas.smk | compare_gwases.smk | disease_progression.smk |
|---|---|---|
| Takes in any of: VCF, CSV, TSV, TXT (also zip/gz) | Runs standardise_gwas for each GWAS, then pairwise comparison tooling | Runs standardise_gwas for incident and subsequent GWASes |
| Optional liftover (e.g. GRCh38 → GRCh37) | PLINK clumping | Runs collider-bias-aware analyses (e.g. SlopeHunter, MR-IVW) and compares results |
| Optional RSID and EAF fill from reference panels | Heterogeneity across GWASes; LDSC h² and rg | Miami plots of unadjusted vs adjusted GWAS |
| Converts z-scores and odds ratios to BETA/SE | Expected vs observed replication metrics | Expected vs observed (before and after collider adjustment) |
| Harmonised SNP ID = CHR:BP_EA_OA (EA/OA sorted) | HTML report of LDSC, plots, and tables | HTML report; optional instructions to refit subsequent GWAS from collider output |
| Optional gene ↔ Ensembl mapping | | |
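
The harmonised SNP ID and the odds-ratio conversion listed for standardise_gwas.smk can be illustrated with a minimal Python sketch. This is not the pipeline's own code: the function names, the uppercase normalisation of alleles, and the use of a 95% confidence interval to recover the standard error are assumptions made for illustration.

```python
import math


def harmonised_snp_id(chrom, bp, ea, oa):
    """Build CHR:BP_A1_A2 with the two alleles sorted alphabetically.

    Sorting makes the ID independent of which allele a GWAS reports as
    the effect allele, so the same variant gets the same ID everywhere.
    """
    a1, a2 = sorted([ea.upper(), oa.upper()])
    return f"{chrom}:{bp}_{a1}_{a2}"


def odds_ratio_to_beta_se(odds_ratio, ci_lower, ci_upper, z_crit=1.959964):
    """Convert an odds ratio and its confidence interval to BETA and SE.

    Assumes the interval is symmetric on the log scale; z_crit is the
    normal quantile for a 95% interval (an assumption of this sketch).
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z_crit)
    return beta, se


# Both allele orderings map to the same harmonised ID.
print(harmonised_snp_id("1", 12345, "T", "C"))  # 1:12345_C_T
print(harmonised_snp_id("1", 12345, "C", "T"))  # 1:12345_C_T
```

Because the ID no longer encodes which allele is the effect allele, the sign of BETA still has to be aligned separately when two GWASes report opposite effect alleles.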
| qtl_mr.smk | finemap.smk | coloc.smk |
|---|---|---|
| Runs standardise_gwas, clumping, and SuSiE fine-mapping on one outcome GWAS | Same standardise + clump + SuSiE path for each input GWAS (one or more) | Same standardise + clump + SuSiE for ≥2 GWASes (required for colocalization) |
| Runs Mendelian randomization vs a chosen QTL panel (e.g. eQTLGen, MetaBrain) | Per-locus fine-mapping using summary stats and ancestry-matched LD (PLINK reference); outputs credible sets and LBF columns per locus | Pairwise coloc::coloc.bf_bf on overlapping finemapped signals (same chr, leads within ±overlap_kb kb) across all trait pairs |
| Volcano plot of MR results; **coloc** for exposures that pass MR FDR | Finemap-only: no MR or coloc between traits (use when you only need SuSiE outputs) | Full coloc table + HTML report (result_coloc.html), including a disclaimer when ancestries differ between GWASes |
| Requires QTL_DATA_DIR (see .env_example) for QTL files | Each GWAS must declare ancestry (for LD) | Configurable finemap and coloc priors/overlap; see PIPELINES.md |
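
The overlap rule used to pick signal pairs for coloc (same chromosome, lead variants within ±overlap_kb kb) can be sketched as below. The dictionary fields `id`, `chr` and `lead_bp`, the function names, and the inclusive boundary are hypothetical choices for this illustration; the pipeline's real data structures and defaults are described in PIPELINES.md.

```python
def signals_overlap(sig_a, sig_b, overlap_kb):
    """True when both lead variants sit on the same chromosome and are
    at most overlap_kb kilobases apart (boundary treated as inclusive)."""
    same_chr = sig_a["chr"] == sig_b["chr"]
    distance_bp = abs(sig_a["lead_bp"] - sig_b["lead_bp"])
    return same_chr and distance_bp <= overlap_kb * 1000


def candidate_pairs(signals_a, signals_b, overlap_kb):
    """List (id_a, id_b) for every pair of signals that overlaps."""
    return [
        (a["id"], b["id"])
        for a in signals_a
        for b in signals_b
        if signals_overlap(a, b, overlap_kb)
    ]


# Made-up finemapped signals for two traits.
trait_1 = [{"id": "t1_s1", "chr": "1", "lead_bp": 1_000_000}]
trait_2 = [
    {"id": "t2_s1", "chr": "1", "lead_bp": 1_150_000},  # 150 kb away
    {"id": "t2_s2", "chr": "2", "lead_bp": 1_000_000},  # other chromosome
]

print(candidate_pairs(trait_1, trait_2, overlap_kb=200))  # [('t1_s1', 't2_s1')]
print(candidate_pairs(trait_1, trait_2, overlap_kb=100))  # []
```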
1. Clone the repository:
git clone git@github.com:MRCIEU/GeneHackman.git && cd GeneHackman
2. Ensure you have conda installed and initialised, then create the environment:
conda env create --file environment.yml
Then activate it (this is the only command needed if you have already created the environment):
conda activate genehackman
The data needed to run the pipelines is split into two buckets: a mandatory bucket and a QTL bucket (only needed if you want to run the MR-QTL pipeline). To download them, install gsutil, then sync each bucket:
- Mandatory: `gs://genehackman`
  - Run `gsutil -m rsync -r gs://genehackman/ /path/to/my_pipeline_data/`
  - Update `PIPELINE_DATA_DIR` to `/path/to/my_pipeline_data/` in the .env file
- Optional: `gs://genehackman-qtl`
  - Run `gsutil -m rsync -r gs://genehackman-qtl/ /path/to/my_qtl_data/`
  - Update `QTL_DATA_DIR` to `/path/to/my_qtl_data/` in the .env file
To copy only selected prefixes instead of the whole QTL bucket, use `gsutil -m cp -r gs://genehackman-qtl/SOME_PREFIX/ ...` as needed. For example, you may only be interested in cis data, not trans data.
Then point your .env at those directories (trailing slashes are fine):
PIPELINE_DATA_DIR=/path/to/my_pipeline_data/
QTL_DATA_DIR=/path/to/my_qtl_data/
Alternatives if you don’t set QTL_DATA_DIR directly: keep the same directory layout under PIPELINE_DATA_DIR/qtl_datasets/, or follow PLATFORM_SETUP.md (bind mounts and defaulting QTL_DATA_DIR to PIPELINE_DATA_DIR/qtl_datasets).
Copy the template and edit paths for your machine:
cp .env_example .env
Variables match .env_example:
Mandatory
| Variable | Purpose |
|---|---|
| PROJECT_DIR | Root folder for this analysis. The pipeline uses PROJECT_DIR/data/ for inputs (GWAS, clumps, …) and PROJECT_DIR/results/ for outputs (finemap, coloc, plots, …). |
| PIPELINE_DATA_DIR | Path where you unpacked the shared genehackman reference data (see §3). Also used for the Apptainer image: PIPELINE_DATA_DIR/genomic_data/pipeline/genehackman_<version>.sif, where <version> comes from DOCKER_VERSION or defaults to Version: in DESCRIPTION. If that SIF is missing and the directory is writable, run_pipeline.sh builds it from docker://mrcieu/genehackman:<version>. |
Optional
| Variable | Purpose |
|---|---|
| DOCKER_VERSION | Docker/Apptainer image tag (e.g. 1.1.0 or develop). Defaults to Version: in DESCRIPTION when unset. Override when you need a different image than the checked-out package version. |
| QTL_DATA_DIR | Path to genehackman-qtl data if you run qtl_mr. Leave empty for other pipelines. You can instead place QTL data under PIPELINE_DATA_DIR/qtl_datasets/ (see PLATFORM_SETUP.md). |
| SNAKEMAKE_PROFILE | Snakemake profile directory. Default in .env_example: snakemake/profiles/local/ (local Apptainer). On HPC use e.g. snakemake/profiles/slurm/. |
| APPTAINER_MODULE | Environment module name to load Apptainer/Singularity before running on HPC (only used when the profile is not local). |
| SLURM_PARTITION | Slurm partition for cluster jobs. If unset, run_pipeline.sh tries to detect one from sinfo, else uses compute. |
| SLURM_ACCOUNT | Slurm account for cluster jobs. If unset, run_pipeline.sh tries to infer it from sacctmgr. |
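
To make the image lookup described for PIPELINE_DATA_DIR and DOCKER_VERSION concrete, here is a small Python sketch of how the SIF path could be resolved. It only mirrors the rule stated in the tables above; the function name `resolve_sif_path`, the regular expression, and the sample DESCRIPTION text are assumptions, and the real logic lives in run_pipeline.sh.

```python
import re
from pathlib import PurePosixPath


def resolve_sif_path(pipeline_data_dir, docker_version=None, description_text=""):
    """Return the expected Apptainer image path.

    The version is DOCKER_VERSION when set; otherwise it is read from the
    "Version:" field of the DESCRIPTION file contents passed in.
    """
    version = docker_version
    if not version:
        match = re.search(r"^Version:\s*(\S+)", description_text, flags=re.MULTILINE)
        if match is None:
            raise ValueError("No DOCKER_VERSION given and no Version: field found")
        version = match.group(1)
    return (
        PurePosixPath(pipeline_data_dir)
        / "genomic_data"
        / "pipeline"
        / f"genehackman_{version}.sif"
    )


# Hypothetical DESCRIPTION contents, used only for this example.
description = "Package: example\nVersion: 1.1.0\n"

print(resolve_sif_path("/path/to/my_pipeline_data/", description_text=description))
# /path/to/my_pipeline_data/genomic_data/pipeline/genehackman_1.1.0.sif
print(resolve_sif_path("/path/to/my_pipeline_data/", docker_version="develop"))
# /path/to/my_pipeline_data/genomic_data/pipeline/genehackman_develop.sif
```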
Example .env:
PROJECT_DIR=/path/to/my_project
PIPELINE_DATA_DIR=/path/to/my_pipeline_data/
# DOCKER_VERSION=1.1.0 # optional; defaults to Version: in DESCRIPTION
QTL_DATA_DIR=/path/to/my_qtl_data/
SNAKEMAKE_PROFILE=snakemake/profiles/local/
Create an input.yaml for the pipeline you want to run:
- Example: `cp snakemake/input_templates/compare_gwases.yaml input.yaml`
- Each pipeline has its own shape; examples live under `snakemake/input_templates/`.
- See PIPELINES.md for all fields.
- Pass the input YAML as the second argument to `run_pipeline.sh`, or rely on `input.yaml` in the working directory. To call `snakemake` yourself without the wrapper, set `PROJECT_DIR` in `.env` and export `DATA_DIR="${PROJECT_DIR%/}/data"` and `RESULTS_DIR="${PROJECT_DIR%/}/results"` (or load the same paths Snakemake uses) so shell rules and profile bind mounts resolve; also pass `--config genehackman_input=/path/to/file.yaml`.
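
For anyone calling snakemake without the wrapper, the path derivation and the `--config` argument described above can be sketched in Python. The helper names are hypothetical, and the command shows only the options mentioned in this README (`--snakefile`, `--profile`, `--config`); run_pipeline.sh may pass additional options that are not reproduced here.

```python
def derive_project_paths(project_dir):
    """Mirror ${PROJECT_DIR%/}/data and ${PROJECT_DIR%/}/results:
    drop a single trailing slash, then append the subdirectory."""
    root = project_dir[:-1] if project_dir.endswith("/") else project_dir
    return {"DATA_DIR": root + "/data", "RESULTS_DIR": root + "/results"}


def snakemake_command(snakefile, input_yaml, profile):
    """Assemble an argument list for a direct snakemake call."""
    return [
        "snakemake",
        "--snakefile", snakefile,
        "--profile", profile,
        "--config", f"genehackman_input={input_yaml}",
    ]


paths = derive_project_paths("/path/to/my_project/")
print(paths["DATA_DIR"])     # /path/to/my_project/data
print(paths["RESULTS_DIR"])  # /path/to/my_project/results

cmd = snakemake_command(
    "snakemake/compare_gwases.smk", "input.yaml", "snakemake/profiles/local/"
)
print(" ".join(cmd))

# To actually launch it, merge the derived paths into the environment, e.g.:
#   import os, subprocess
#   subprocess.run(cmd, env={**os.environ, **paths}, check=True)
```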
To check that you have configured everything correctly, you can run one of the test pipelines from run_test_pipelines.sh, for example:
./run_pipeline.sh snakemake/standardise_gwas.smk tests/testthat/data/snakemake_inputs/standardise_gwas.yaml
To run your pipeline:
./run_pipeline.sh snakemake/<specific_pipeline>.smk <optional_input_file.yaml>
Snakemake execution profiles (--profile) live under snakemake/profiles/ (see also PLATFORM_SETUP.md).
- By default `run_pipeline.sh` uses `SNAKEMAKE_PROFILE=snakemake/profiles/local/` (local Apptainer). On HPC, set e.g. `SNAKEMAKE_PROFILE=snakemake/profiles/slurm/` (generic Slurm) or add a site-specific directory beside `snakemake/profiles/local/` and `snakemake/profiles/slurm/`.
- `run_pipeline.sh` is just a convenience wrapper around the `snakemake` command. If you want to do anything out of the ordinary, please read up on Snakemake.
- If there are errors while running the pipeline, you can find the error messages either directly on the screen or in the Slurm log file that is written on error.
- It is recommended that you run your pipeline inside a tmux session.
The standard column names for GWASes are:
| CHR | BP | EA | OA | BETA | SE | P | EAF | SNP | RSID |
|---|---|---|---|---|---|---|---|---|---|
A full list of names and default values can be found here
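
As an illustration of what mapping an input header onto the standard names involves, the sketch below renames a few columns and reports which standard columns are still missing. The alias table is entirely hypothetical and is not the pipeline's accepted list of names; the function names are likewise made up for this example.

```python
STANDARD_COLUMNS = ["CHR", "BP", "EA", "OA", "BETA", "SE", "P", "EAF", "SNP", "RSID"]

# Hypothetical aliases for illustration only.
ALIASES = {
    "chromosome": "CHR",
    "position": "BP",
    "effect_allele": "EA",
    "other_allele": "OA",
    "pval": "P",
}


def standardise_header(header):
    """Rename known aliases to standard names; leave other columns as-is."""
    renamed = []
    for name in header:
        key = name.strip()
        if key.upper() in STANDARD_COLUMNS:
            renamed.append(key.upper())
        else:
            renamed.append(ALIASES.get(key.lower(), key))
    return renamed


def missing_columns(header):
    """Standard columns that are absent from an already-renamed header."""
    return [col for col in STANDARD_COLUMNS if col not in header]


header = standardise_header(
    ["chromosome", "position", "effect_allele", "other_allele", "beta", "se", "pval"]
)
print(header)                   # ['CHR', 'BP', 'EA', 'OA', 'BETA', 'SE', 'P']
print(missing_columns(header))  # ['EAF', 'SNP', 'RSID']
```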
There are 2 main components to the pipeline:
- Snakemake to define the steps to complete for each pipeline.
- A Docker / Singularity container with installed languages (R and Python), packages, OS libraries, and code.
The pipeline can be run either on its own or via your institution's HPC. Each Snakemake step spins up a Singularity container inside an HPC job (e.g. Slurm). Each step can specify different CPU/memory requirements.
Running on macOS, Linux, Slurm, or PBS? See PLATFORM_SETUP.md for platform-specific setup (Apptainer/Lima, SIF cache, Snakemake profiles under snakemake/profiles/, qsub template).
See CONTRIBUTING.md for development setup, Docker rebuilds, unit tests, and end-to-end tests. You can also open an issue in this repo.