Shicheng-Guo/AnnotationDatabase

Annotation Database

This repository collects genomic and epigenomic annotation datasets and helper scripts used to build annotation resources for human genomes (hg19/hg38) and related analyses.

Overview

  • Purpose: provide curated annotation files (gene lists, BEDs, drug targets, eQTL summaries, etc.) and utility scripts to build and match variant-level annotations.
  • Scope: primarily human (hg19/hg38) with some comparative resources and support data for related analyses.

Quick start

  • Inspect major folders: 1000G, GTEx, dbSNP, Gene, TCGA, drug, ENCODE.
  • Run the main helper script (example):
    cd AnnotationDatabase
    ./run.sh

Notes & tips

  • This collection contains large files (e.g., dbSNP chain files, GTEx significant pairs); make sure you have enough disk space and memory before loading them.
  • Some scripts expect specific input formats (e.g., GTEx v8.signif_variant_gene_pairs.txt files with rsid added); see the per-folder bin/ scripts for details.
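Because the significant-pairs files can be large, it often helps to stream them row by row instead of loading them whole. The following is a minimal sketch, not one of the repository's scripts: the function names are hypothetical, and the `gene_id` column name is an assumption about the file header that should be checked against the actual file.

```python
import csv
import gzip
from typing import Dict, Iterator


def iter_pairs(path: str) -> Iterator[Dict[str, str]]:
    """Yield one row at a time from a tab-delimited file (plain or gzipped)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        # DictReader reads lazily, so only one row is held in memory at a time.
        yield from csv.DictReader(handle, delimiter="\t")


def count_pairs_per_gene(path: str, gene_column: str = "gene_id") -> Dict[str, int]:
    """Count rows per gene; `gene_column` is an assumed header name."""
    counts: Dict[str, int] = {}
    for row in iter_pairs(path):
        gene = row[gene_column]
        counts[gene] = counts.get(gene, 0) + 1
    return counts
```

Memory use then scales with the number of distinct genes rather than with the file size.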

Timeline / history

  • 2020/02/26: Added references to NHLBI Exome Sequencing Project (ESP).
  • 2020/01/12: Updated TCGA pan-meta differential and survival analysis (see TCGA/).
  • 2020/01/01: Added rsid to GTEx v8 files using GTEx/bin/addrs2pairs.pl (see GTEx/).

What to include in annotations

  • Genome assembly: hg19, hg38 (and mouse versions if added).
  • Standard formats: BED/BED12, bedGraph for signal tracks, VCF for variant sets when appropriate.
  • Metadata: sample counts, study/population (e.g., Han Chinese, European), method summary, and file provenance.
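As an illustration of checking the BED format before adding a file, here is a minimal sketch of a parser for the three required BED columns. It is not part of the repository; `BedInterval` and `parse_bed_line` are hypothetical names, and the only rules enforced are the standard BED conventions (tab-separated fields, 0-based start, end not smaller than start).

```python
from typing import NamedTuple


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def parse_bed_line(line: str) -> BedInterval:
    """Parse the first three columns of a BED line and validate coordinates."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 tab-separated fields, got {len(fields)}")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError(f"invalid interval {chrom}:{start}-{end}")
    return BedInterval(chrom, start, end)
```

Extra columns (name, score, strand, and so on) are ignored here; a stricter check for BED12 would need to validate those as well.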

Index & summary

  • Full recursive index: dataset_index_recursive.csv (generated by scripts/generate_dataset_index.py).
  • Human-readable summary: dataset_summary.md (generated by scripts/generate_summary.py).

Per-folder READMEs

  • Many major subfolders now contain short README.md files describing contents (e.g., GTEx/, dbSNP/, hg19/, hg38/, scripts/).

Checksums & provenance

  • Use scripts/generate_dataset_index.py --recursive --exclude .git --checksum to compute MD5/SHA256 checksums for files and populate checksum_md5 / checksum_sha256 columns in the index.

  • Use scripts/compute_checksums.py --index dataset_index.csv --out dataset_index_with_checksums.csv to compute missing checksums for an existing index.

  • Use scripts/generate_manifest.py to create a consumable dataset_catalog.csv manifest from an index.

  • Catalog and index excluding .git: dataset_catalog_nogit.csv and dataset_index_nogit.csv (generated with --exclude .git --checksum).

  • Static browser viewer: serve the repository root with a local static server (for example, python -m http.server) and open viewer/index.html to browse dataset_catalog.csv.
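To verify a downloaded file against the checksum_md5 / checksum_sha256 columns of the index, both digests can be computed in a single pass. This is a minimal sketch using Python's standard hashlib, not the implementation in scripts/compute_checksums.py; the function name `file_checksums` and the 1 MiB chunk size are assumptions made for illustration.

```python
import hashlib
from typing import Dict


def file_checksums(path: str, chunk_size: int = 1 << 20) -> Dict[str, str]:
    """Compute MD5 and SHA256 hex digests of a file, reading it in chunks."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large annotation files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {
        "checksum_md5": md5.hexdigest(),
        "checksum_sha256": sha256.hexdigest(),
    }
```

The returned keys mirror the index column names, so a result can be compared directly with the matching row of the index.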

Next steps (short)

  • Add per-folder detailed README files describing formats and required provenance.
  • Document run.sh and provide small example workflows for common tasks.
  • Add tests and CI checks for parsers and indexing scripts.

See CONTRIBUTING.md for details on how to contribute improvements.

About

annotation database for human genetics, genomics and epigenomics
