GitHub - Shicheng-Guo/AnnotationDatabase: annotation database for human genetics, genomics and epigenomics

Annotation Database

This repository collects genomic and epigenomic annotation datasets and helper scripts used to build annotation resources for human genomes (hg19/hg38) and related analyses.

Overview

Purpose: provide curated annotation files (gene lists, BEDs, drug targets, eQTL summaries, etc.) and utility scripts to build and match variant-level annotations.
Scope: primarily human (hg19/hg38) with some comparative resources and support data for related analyses.

Quick start

Inspect major folders: 1000G, GTEx, dbSNP, Gene, TCGA, drug, ENCODE.
Run the main helper script (example):

cd AnnotationDatabase
./run.sh

Notes & tips

This collection contains large files (dbSNP chain files, GTEx significant pairs); make sure you have enough disk space and memory before loading them.
Some scripts expect specific input formats (e.g., GTEx v8.signif_variant_gene_pairs.txt with rsids added); see the scripts in each folder's bin/ directory for details.
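
To make the rsid note above concrete, here is a minimal Python sketch of appending an rsid column to a GTEx-style pairs table. It is not the repository's GTEx/bin/addrs2pairs.pl; the function names, the `rsid` column name, and the in-memory lookup dictionary are assumptions made for illustration. It assumes a tab-delimited input with a `variant_id` column in the GTEx v8 form `chr1_64764_C_T_b38`.

```python
import csv


def parse_variant_id(variant_id):
    """Split an ID such as 'chr1_64764_C_T_b38' into its five parts."""
    # rsplit keeps any underscores inside the chromosome name intact.
    chrom, pos, ref, alt, build = variant_id.rsplit("_", 4)
    return chrom, int(pos), ref, alt, build


def add_rsid(pairs_handle, rsid_by_variant, out_handle, missing="."):
    """Copy a tab-delimited pairs table and append an 'rsid' column.

    rsid_by_variant is a hypothetical dict mapping variant_id -> rsid;
    how that lookup is built (e.g. from dbSNP) is outside this sketch.
    """
    reader = csv.DictReader(pairs_handle, delimiter="\t")
    fieldnames = list(reader.fieldnames) + ["rsid"]
    writer = csv.DictWriter(
        out_handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Variants without a known rsid get a placeholder value.
        row["rsid"] = rsid_by_variant.get(row["variant_id"], missing)
        writer.writerow(row)
```

Because rows are streamed one at a time, only the lookup dictionary has to fit in memory, which matters for the large pairs files mentioned above.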

Timeline / history

2020/02/26: Added references to NHLBI Exome Sequencing Project (ESP).
2020/01/12: Updated TCGA pan-meta differential and survival analysis (see TCGA/).
2020/01/01: Added rsid to GTEx v8 files using GTEx/bin/addrs2pairs.pl (see GTEx/).

What to include in annotations

Genome assembly: hg19, hg38 (and mouse versions if added).
Standard formats: BED/BED12, bedGraph for signal tracks, VCF for variant sets when appropriate.
Metadata: sample counts, study/population (e.g., Han Chinese, European), method summary, and file provenance.
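
Since most interval files here are BED, a short Python sketch of reading one BED line may help when checking new annotations. BED coordinates are 0-based and half-open (start included, end excluded). The `BedInterval` class and `parse_bed_line` function are illustrative names, not part of this repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BedInterval:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: Optional[str] = None

    @property
    def length(self):
        # Half-open intervals make the length a simple difference.
        return self.end - self.start

    def to_one_based(self):
        """Render as a 1-based, inclusive region string."""
        return f"{self.chrom}:{self.start + 1}-{self.end}"


def parse_bed_line(line):
    """Parse one BED line; return None for blank, comment, or header lines."""
    if not line.strip() or line.startswith(("#", "track", "browser")):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"BED line needs at least 3 columns: {line!r}")
    start, end = int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError(f"invalid BED coordinates: {line!r}")
    name = fields[3] if len(fields) > 3 else None
    return BedInterval(fields[0], start, end, name)
```

Keeping the 0-based to 1-based conversion in one place avoids off-by-one errors when BED intervals are compared with VCF positions, which are 1-based.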

Index & summary

Full recursive index: dataset_index_recursive.csv (generated by scripts/generate_dataset_index.py).
Human-readable summary: dataset_summary.md (generated by scripts/generate_summary.py).
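
For orientation, the following Python sketch shows roughly what building a recursive file index involves. It is not scripts/generate_dataset_index.py: the real column layout of dataset_index_recursive.csv is not described here, so the `path` and `size_bytes` columns and both function names are assumptions.

```python
import csv
import os


def build_index(root, exclude=(".git",)):
    """Walk root and return one row per file, skipping excluded directories."""
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = sorted(d for d in dirnames if d not in exclude)
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            rows.append({"path": rel, "size_bytes": os.path.getsize(full)})
    return rows


def write_index(rows, out_path):
    """Write the index rows to a CSV file with a header line."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["path", "size_bytes"])
        writer.writeheader()
        writer.writerows(rows)
```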

Per-folder READMEs

Many major subfolders now contain short README.md files describing contents (e.g., GTEx/, dbSNP/, hg19/, hg38/, scripts/).

Checksums & provenance

Use scripts/generate_dataset_index.py --recursive --exclude .git --checksum to compute MD5/SHA256 checksums for files and populate checksum_md5 / checksum_sha256 columns in the index.
Use scripts/compute_checksums.py --index dataset_index.csv --out dataset_index_with_checksums.csv to compute missing checksums for an existing index.
Use scripts/generate_manifest.py to create a consumable dataset_catalog.csv manifest from an index.
Catalog and index excluding the .git directory: dataset_catalog_nogit.csv and dataset_index_nogit.csv (generated with --exclude .git --checksum).
Static browser viewer: open viewer/index.html via a local static server to browse dataset_catalog.csv.
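
To verify a downloaded file independently of the scripts above, the checksums can be reproduced with Python's standard hashlib module. This is a sketch, not the repository's implementation; the function name is hypothetical, and the returned keys simply reuse the checksum_md5 / checksum_sha256 column names mentioned above.

```python
import hashlib


def file_checksums(path, chunk_size=1 << 20):
    """Return MD5 and SHA256 hex digests, reading the file in 1 MiB chunks."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Streaming keeps memory use flat even for multi-gigabyte files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {
        "checksum_md5": md5.hexdigest(),
        "checksum_sha256": sha256.hexdigest(),
    }
```

Both digests are computed in a single pass so that each large file is read from disk only once.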

Next steps (short)

Add per-folder detailed README files describing formats and required provenance.
Document run.sh and provide small example workflows for common tasks.
Add tests and CI checks for parsers and indexing scripts.

See CONTRIBUTING.md for how to contribute improvements.

