Note: This repo was updated in 2026 with Hermes Agent to test AI-assisted repository maintenance. The original content and structure have been preserved.
Structural variant analysis of common disease-associated genes in 1000 Genomes long-read sequencing data.
Three directories at the top level: data/ with the gene coordinates and sample list, scripts/ with the pipeline code, and a legacy/ subdirectory with earlier plotting versions.
Do structural variants — deletions, insertions, duplications — in genomic regions around known disease genes appear at different frequencies or lengths depending on genetic ancestry?
GWAS has found thousands of SNPs associated with common diseases, but SNPs are only part of the picture. Structural variants are harder to detect with short-read sequencing but contribute significantly to phenotypic diversity and disease risk. Long reads (PacBio or Oxford Nanopore) resolve them much better. This project takes a targeted approach: instead of scanning the whole genome, it looks at whether reads spanning 4,449 disease-gene regions show population-specific length differences that would suggest underlying structural variation. The data comes from the 1000 Genomes Project, with over a thousand samples across five super-populations — African, Ad Mixed American, East Asian, European, and South Asian.
read_stats.py opens the CRAM file for a given sample, iterates over the 4,449 gene coordinates, and for each gene it fetches all reads that overlap that region. It filters out supplementary and secondary alignments, determines where each read starts and ends relative to the region of interest, and excludes reads that don't fully span it. The output is one line per gene per sample: the Ensembl gene ID, the coordinates, the sample ID, and a comma-separated list of read lengths spanning that region. This runs as a SLURM array job (cdg.sbatch) that processes samples in parallel.
sort_by_genes.py takes all the per-sample output files and reorganises them into per-gene files. Every read line is appended to a file named after the Ensembl gene ID it belongs to, giving you one file per gene containing all reads from all samples.
plots1.3.py reads those per-gene files and generates stacked histograms. For each gene it groups the reads by population, removes outliers beyond six standard deviations using a z-score filter, and plots the distributions as colour-coded stacked bars, 25 subplots per figure. If one population shows a distinct peak at a different read length than the others, that is evidence of a population-specific structural variant at that locus.
There is also sv_analysis.py — a stub that was meant to call structural variants directly from a merged cuteSV VCF file, but it was never implemented beyond a basic argument parser.
The gene coordinates live in data/cdg_positions.txt — 4,449 Ensembl gene IDs with chromosome and start/end positions in BED-like format, representing specific features of each disease-associated gene (exon boundaries or GWAS lead SNP windows). Only genes with regions shorter than roughly 40,000 base pairs were included. The sample list is data/1kg_samples.txt with 1,027 sample IDs from the 1000 Genomes Project. An ancestry group file mapping sample IDs to population group and ancestry was maintained on the cluster and is not in this repo. Requires pysam, matplotlib, numpy, scipy — all pip-installable.
This was developed as part of the bi-marvl research group. The CRAM alignment files, reference genome, and ancestry metadata live on the group's cluster filesystem under /groups/bi-marvl/.