eDNA Sequence Analysis Pipeline

Automated species identification from environmental DNA — built for environmental consultancies, regulators, and field ecologists.

Environmental DNA (eDNA) surveys capture traces of biological material from water samples and use DNA sequencing to identify which species are present. Processing the resulting sequence data manually is slow, error-prone, and hard to reproduce. This pipeline automates the entire workflow — from a raw FASTA sequence file to a species identification table, biodiversity summary, protected species alerts, and a client-ready regulatory PDF report — with no internet connection required.

What it does

Local BLAST+ offline analysis ✅ — parallel species identification against curated local databases; no rate limits, no internet dependency, version-locked results
Regulatory PDF reports ✅ — 4-section PDF with cover page, executive summary metric tiles, colour-coded species table, and methodology with database version embedded
Protected species detection ✅ — automatic cross-reference against 20 UK aquatic and riparian protected species (Habitats Regulations 2017, WCA Schedule 5/6, NERC Act S41); CONFIRMED/POSSIBLE alert banners in the PDF
Curated database registry ✅ — built-in registry of SILVA, BOLD, MIDORI2, ITS_RefSeq_Fungi, PR2; --marker flag auto-selects the correct database for your gene marker
Multi-marker support ✅ — route COI, ITS, 16S, 18S, 12S markers in one FASTA to their own databases by ID prefix; per-marker diversity reported separately
Rarefaction & richness estimation ✅ — Chao1 estimator, analytical rarefaction curves, depth-fair subsampling via --rarefy; rarefaction curve embedded in the PDF
Beta diversity 🔄 — community comparison across sites with PCoA ordination (Phase 7)
Cloud API 🔄 — REST endpoint for integration with LIMS and third-party tools (Phase 8)
ASV/DADA2 🔄 — amplicon sequence variant denoising for higher-resolution ID (Phase 9)
Nanopore streaming 🔄 — real-time field identification from MinION devices (Phase 10)

Quick start

Offline demo (no BLAST install needed)

git clone https://github.com/ybaah-biotech/eDNA-sequence-analysis.git
cd eDNA-sequence-analysis
pip install -r requirements.txt
python mock_run.py

Local BLAST+ (recommended for production)

# Step 1 — download a curated database (run once)
python scripts/setup_db.py --db 16S_ribosomal_RNA --dest /data/blast/

# Step 2 — run the pipeline
python pipeline.py \
    --fasta data/sample_sequences.fasta \
    --local --db-path /data/blast/16S_ribosomal_RNA \
    --threads 4 \
    --report --site "Rutland Water" --analyst "Yaw Baah"

# Use --marker to auto-select the recommended database for your gene marker
python pipeline.py \
    --fasta data/sample_sequences.fasta \
    --local --db-path /data/blast/bold \
    --marker COI --threads 4 --report

Multi-marker run (Phase 5)

Tag each FASTA sequence with a marker prefix on its ID (16S_001, COI_002, 12S_fish), then route each marker to its own database in a single run:

python pipeline.py \
    --fasta data/multi_marker.fasta \
    --local --multi-marker \
    --db-map 16S=/data/blast/silva,COI=/data/blast/bold,12S=/data/blast/midori2 \
    --threads 4 --report

Diversity is reported per marker (never pooled) in marker_summary.csv and a dedicated PDF section. Sequences whose marker has no --db-map entry are skipped with a warning.

Rarefaction (Phase 6)

Add --rarefy DEPTH to normalise diversity to a fixed sequencing depth and report a Chao1 richness estimate:

python pipeline.py \
    --fasta data/sample_sequences.fasta \
    --local --db-path /data/blast/16S_ribosomal_RNA \
    --rarefy 5000 --rarefy-seed 42 --report

Always (when classified taxa exist) writes rarefaction_curve.csv; with --report, a rendered rarefaction curve is embedded in the PDF showing observed richness, the Chao1 estimate, and the chosen rarefaction depth.

Web BLAST (NCBI — requires internet)

python pipeline.py \
    --fasta data/sample_sequences.fasta \
    --email your@email.com \
    --report --site "Site A" --analyst "Yaw Baah"

All CLI flags:

  --fasta          Path to input FASTA file                       [required]
  --local          Use local BLAST+ instead of NCBI web API       [flag]
  --db-path        Path to local database (required with --local)
  --threads        Parallel workers for local BLAST               [default: 1]
  --marker         Gene marker: 16S | 18S | COI | 12S | ITS | ITS1 | ITS2
                   Auto-selects recommended curated database and logs it
  --multi-marker   Route markers in one FASTA to different databases [flag]
  --db-map         MARKER=PATH pairs for --multi-marker, comma-separated
                   e.g. 16S=/data/blast/silva,COI=/data/blast/bold
  --rarefy DEPTH   Rarefy diversity to DEPTH reads + report Chao1 estimate
  --rarefy-seed N  Random seed for the rarefaction subsample (reproducibility)
  --email          Email for NCBI Entrez (required without --local)
  --output         Output directory                    [default: data/results]
  --db             NCBI web database name                        [default: nt]
  --evalue         E-value threshold                         [default: 1e-10]
  --max-hits       Max hits per query                            [default: 5]
  --top-hit-only   Retain only the best hit per query            [flag]
  --report         Generate a regulatory PDF report              [flag]
  --site           Site name for the PDF cover page
  --sample-date    Sample collection date (YYYY-MM-DD)
  --analyst        Analyst name for the PDF cover page

Supported databases

Database	Marker	Target organisms	Size
SILVA 138.2 SSU	16S / 18S	Bacteria, archaea, eukaryotes	~10 GB
BOLD	COI	Animals — invertebrates, fish, birds	~3 GB
MIDORI2	12S, COI	Vertebrates — fish, amphibians, GCN	~2 GB
ITS_RefSeq_Fungi	ITS1 / ITS2	Fungi	~1 GB
PR2	18S	Protists, algae, phytoplankton	~500 MB
NCBI nt	All	General purpose	~300 GB

Use --marker COI (or 16S, 12S, ITS, 18S) and the pipeline recommends the correct database automatically.

Output files

File	Description
`blast_hit_table.csv`	All hits with resolved species (LCA), confidence flag, identity %, e-value, protected_flag (+ `marker` in multi-marker runs)
`biodiversity_summary.csv`	Shannon H', Pielou J', species richness, per-taxon counts
`marker_summary.csv`	Per-marker diversity (multi-marker runs only)
`rarefaction_curve.csv` / `.png`	Rarefaction curve data and rendered plot
`eDNA_Report.pdf`	Regulatory PDF — generated with `--report` flag
`db_version.json`	Database path, version string, timestamp — written per run for reproducibility

Confidence flags

Flag	Identity	Meaning
`HIGH`	≥ 97%	Species-level assignment reliable
`MEDIUM`	90–96%	Genus-level reliable; reported as Genus sp.
`LOW`	< 90%	Alignment too divergent; row highlighted red in PDF

Protected species detection

The pipeline cross-references every BLAST result against 20 UK aquatic and riparian protected species drawn from four legislative frameworks:

EPS — European Protected Species (Habitats Regulations 2017)
WCA Schedule 5 — Wildlife & Countryside Act 1981
WCA Schedule 6 — Prohibited methods of taking/killing
Section 41 — NERC Act 2006 (Species of Principal Importance)

Includes: Triturus cristatus (GCN), Lutra lutra (otter), Austropotamobius pallipes (white-clawed crayfish), Margaritifera margaritifera (freshwater pearl mussel), Salmo salar (Atlantic salmon), Anguilla anguilla (European eel), and 14 others.

When alerts fire, the PDF report opens with a red banner (CONFIRMED) or amber banner (POSSIBLE) before all other content, with a mandatory ecologist review disclaimer.

Phase roadmap

Phase	Name	Status
1	Local BLAST+ — offline, threaded, version-stamped	✅ Complete
2	Regulatory PDF reports — 4-section client PDF	✅ Complete
3	Protected species detection — 20 UK species, CONFIRMED/POSSIBLE alerts	✅ Complete
4	Curated database registry — 7 databases, marker→DB routing	✅ Complete
5	Multi-marker — route COI, ITS, 16S, 18S, 12S to their own databases in one run	✅ Complete
6	Rarefaction — Chao1, curves, depth-fair subsampling, embedded plot	✅ Complete
7	Beta diversity — Bray-Curtis, PCoA ordination, PERMANOVA	🔄 Next
8	Cloud API — Flask/FastAPI, pay-per-run endpoint	📋 Planned
9	ASV/DADA2 — denoising, higher-resolution ID	📋 Planned
10	Nanopore streaming — real-time field analysis	📋 Planned

Project structure

eDNA-sequence-analysis/
├── pipeline.py                    # CLI entry point
├── mock_run.py                    # Offline demo — no BLAST needed
│
├── src/
│   ├── blast_query.py             # Web BLAST via NCBI Entrez
│   ├── local_blast.py             # Local BLAST+ via subprocess (Phase 1)
│   ├── parser.py                  # BLAST XML → BlastHit dataclass
│   ├── summarise.py               # LCA resolution, diversity metrics, CSV export
│   ├── report.py                  # Regulatory PDF generator (Phase 2)
│   ├── protected.py               # Protected species detection (Phase 3)
│   ├── databases.py               # Curated database registry (Phase 4)
│   ├── markers.py                 # Multi-marker routing & per-marker diversity (Phase 5)
│   ├── rarefaction.py             # Chao1, rarefaction curves, subsampling (Phase 6)
│   ├── plots.py                   # matplotlib chart rendering (Phase 6)
│   └── utils.py                   # FASTA loading, logging, helpers
│
├── scripts/
│   └── setup_db.py                # Download and verify local BLAST databases
│
├── tests/
│   ├── test_parser.py             # 38 tests — parsing, LCA, diversity metrics
│   ├── test_protected.py          # 12 tests — protected species detection
│   ├── test_databases.py          # 28 tests — database registry and marker routing
│   ├── test_markers.py            # 30 tests — marker detection, per-marker diversity
│   ├── test_rarefaction.py        # 29 tests — Chao1, curves, subsampling
│   └── test_plots.py             # 3 smoke tests — rarefaction PNG rendering
│
├── notebooks/
│   ├── 01_BLAST_and_Species_Parsing.ipynb
│   └── 02_Local_BLAST_and_Databases.ipynb
│
├── data/
│   ├── sample_sequences.fasta     # Five representative UK eDNA sequences
│   └── results/                   # Pipeline outputs (auto-created)
│
├── .devcontainer/                 # GitHub Codespaces — Python 3.12 + BLAST+
├── .github/workflows/tests.yml   # CI — runs full test suite on push
└── requirements.txt

Running tests

python -m pytest tests/ -v

140 tests — all run offline, no network required. Covers: species name parsing, XML parsing and e-value filtering, LCA taxonomy resolution, confidence flag assignment, diversity metrics, protected species detection, database registry routing, multi-marker detection and per-marker diversity, and rarefaction (Chao1, curves, subsampling).

Author

Yaw Baah — MSc Biotechnology (Nottingham Trent University, High Commendation) · BSc Biology (University of Nottingham)

Building commercial-grade bioinformatics tools for UK environmental consulting.

github.com/ybaah-biotech

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

eDNA Sequence Analysis Pipeline

What it does

Quick start

Offline demo (no BLAST install needed)

Local BLAST+ (recommended for production)

Multi-marker run (Phase 5)

Rarefaction (Phase 6)

Web BLAST (NCBI — requires internet)

Supported databases

Output files

Confidence flags

Protected species detection

Phase roadmap

Project structure

Running tests

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.devcontainer		.devcontainer
.github		.github
data		data
notebooks		notebooks
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
README.md		README.md
mock_run.py		mock_run.py
pipeline.py		pipeline.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

eDNA Sequence Analysis Pipeline

What it does

Quick start

Offline demo (no BLAST install needed)

Local BLAST+ (recommended for production)

Multi-marker run (Phase 5)

Rarefaction (Phase 6)

Web BLAST (NCBI — requires internet)

Supported databases

Output files

Confidence flags

Protected species detection

Phase roadmap

Project structure

Running tests

Author

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages