This repository collects genomic and epigenomic annotation datasets and helper scripts used to build annotation resources for human genomes (hg19/hg38) and related analyses.
Overview
- Purpose: provide curated annotation files (gene lists, BEDs, drug targets, eQTL summaries, etc.) and utility scripts to build and match variant-level annotations.
- Scope: primarily human (hg19/hg38) with some comparative resources and support data for related analyses.
Quick start
- Inspect major folders: 1000G, GTEx, dbSNP, Gene, TCGA, drug, ENCODE.
- Run the main helper script (example):

      cd AnnotationDatabase
      ./run.sh

Notes & tips
- This collection contains large files (dbSNP chain files, GTEx significant pairs); make sure you have enough disk space and memory before loading them.
- Some scripts expect specific formats (e.g., rsid added to GTEx v8.signif_variant_gene_pairs.txt); see the per-folder bin/ scripts for details.
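
Because some of these files are too large to load comfortably into memory, one option is to stream them row by row. The following is a minimal sketch, not part of the repository's scripts; the function names are hypothetical, and the tab-delimited layout with a header row and the `gene_id` column used in the usage note are assumptions about the input file.

```python
import csv
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def iter_rows(path, delimiter="\t"):
    """Yield one dict per data row, reading the file lazily.

    Assumes the first line is a header naming the columns.
    """
    with open_text(path) as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        for row in reader:
            yield row


def count_rows_matching(path, column, value):
    """Count rows whose `column` equals `value` without holding the file in memory."""
    return sum(1 for row in iter_rows(path) if row.get(column) == value)
```

For example, `count_rows_matching(path, "gene_id", some_gene)` would count the pairs for one gene while keeping only a single row in memory at a time.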
Timeline / history
- 2020/02/26: Added references to NHLBI Exome Sequencing Project (ESP).
- 2020/01/12: Updated TCGA pan-meta differential and survival analysis (see TCGA/).
- 2020/01/01: Added rsid to GTEx v8 files using GTEx/bin/addrs2pairs.pl (see GTEx/).
What to include in annotations
- Genome assembly: hg19, hg38 (and mouse versions if added).
- Standard formats: BED/BED12, bedGraph for signal tracks, VCF for variant sets when appropriate.
- Metadata: sample counts, study/population (e.g., Han Chinese, European), method summary, and file provenance.
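
As an illustration of the BED convention listed above, here is a minimal sketch of a line-level sanity check. It is not taken from the repository; `parse_bed_line` is a hypothetical helper, and it only checks the three required columns (chrom, 0-based start, exclusive end), passing any further columns through untouched.

```python
def parse_bed_line(line):
    """Parse one BED record into (chrom, start, end, extra_fields).

    Returns None for blank lines, comments, and track/browser header lines.
    Raises ValueError when the required columns are missing or inconsistent.
    """
    stripped = line.rstrip("\r\n")
    if not stripped or stripped.startswith(("#", "track", "browser")):
        return None
    fields = stripped.split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 tab-separated columns, got {len(fields)}")
    chrom = fields[0]
    start = int(fields[1])  # BED starts are 0-based
    end = int(fields[2])    # BED ends are exclusive
    if start < 0 or end < start:
        raise ValueError(f"invalid interval {chrom}:{start}-{end}")
    return chrom, start, end, fields[3:]
```

Running a check like this over a new file before adding it is one cheap way to catch coordinate or delimiter problems early.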
Index & summary
- Full recursive index: dataset_index_recursive.csv (generated by scripts/generate_dataset_index.py).
- Human-readable summary: dataset_summary.md (generated by scripts/generate_summary.py).
Per-folder READMEs
- Many major subfolders now contain short README.md files describing contents (e.g., GTEx/, dbSNP/, hg19/, hg38/, scripts/).
Checksums & provenance
- Use scripts/generate_dataset_index.py --recursive --exclude .git --checksum to compute MD5/SHA256 checksums for files and populate the checksum_md5/checksum_sha256 columns in the index.
- Use scripts/compute_checksums.py --index dataset_index.csv --out dataset_index_with_checksums.csv to compute missing checksums for an existing index.
- Use scripts/generate_manifest.py to create a consumable dataset_catalog.csv manifest from an index.
- Catalog (no .git): dataset_catalog_nogit.csv and dataset_index_nogit.csv (generated with --exclude .git --checksum).
- Static browser viewer: open viewer/index.html via a local static server to browse dataset_catalog.csv.
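
To spot-check a single file against the checksum columns in the index, something like the following could be used. This is a minimal sketch built on Python's standard `hashlib`, not the implementation inside the repository's scripts; the function name `file_checksums` is hypothetical, and the returned keys simply mirror the `checksum_md5`/`checksum_sha256` column names mentioned above.

```python
import hashlib


def file_checksums(path, chunk_size=1024 * 1024):
    """Compute MD5 and SHA256 digests of a file in a single pass.

    The file is read in fixed-size chunks so large datasets do not need
    to fit in memory.
    """
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {
        "checksum_md5": md5.hexdigest(),
        "checksum_sha256": sha256.hexdigest(),
    }
```

Comparing the returned digests with the values stored in the index gives a quick integrity check after copying or downloading a dataset.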
Next steps (short)
- Add per-folder detailed README files describing formats and required provenance.
- Document run.sh and provide small example workflows for common tasks.
- Add tests and CI checks for parsers and indexing scripts.
See CONTRIBUTING.md for details on how to contribute improvements.