daff-viral-assembly

A Nextflow pipeline for viral genome assembly and identification from Illumina short-read data. Developed through a collaboration between QUT eResearch and QCIF.

Overview

daff-viral-assembly detects and characterises known and novel viruses from paired-end Illumina RNA-seq data. It performs:

Read quality control and trimming
rRNA and PhiX decontamination
Metagenomic read classification (Kraken2 and Kaiju)
De novo viral genome assembly (SPAdes)
BLAST-based contig annotation
Reference mapping and consensus generation
ORF prediction and protein domain annotation (HMMER, GeNomad, DIAMOND)
Novel virus candidate identification (within results summary)
Per-sample HTML report generation

Requirements

Nextflow ≥ 21.05.0
Docker or Singularity (recommended for reproducibility)
The following databases (paths provided as parameters):
- NCBI BLAST nucleotide database (e.g. core_nt)
- TaxonKit taxonomy database
- RVDB viral database and taxonomy tab for Diamond analysis. Build the diamon database using the command diamond makedb --quiet --threads 2 --in U-RVDBv31.0-prot.fasta -d rvdb.
- Kraken2 database (e.g. core_nt)
- Kaiju database (e.g. kaiju_db_nr_euk)
- GeNomad database (follow the instructions on the github repo)
- HMMER Pfam database (Pfam-A.hmm)
- rRNA reference sequences for rRNA filtering. Download and untar smr_v4.3_sensitive_db.fasta. Make sure to convert the uracils to thymines in the fasta file: sed '/^[^>]/s/U/T/g; /^[^>]/s/u/t/g' smr_v4.3_sensitive_db.fasta > smr_v4.3_sensitive_db_DNA.fasta.

Quick Start

Install Nextflow (≥ 21.05.0).
Install Singularity or Docker.
Create a samplesheet (see Input for format).

Create a params file (copy and edit params/params_example.yml):

input: /path/to/samplesheet.csv
blastn_db: /path/to/ncbi/core_nt
taxdump: ~/.taxonkit
blast_threads: 2
genomad_db: /path/to/genomad/genomad_db
hmmer_db: /path/to/hmms/Pfam-A.hmm
kraken2_db: /path/to/kraken_databases/core_nt
kaiju_dbname: /path/to/kaiju_databases/kaiju_db_nr_euk.fmi
kaiju_names: /path/to/kaiju_databases/names.dmp
kaiju_nodes: /path/to/kaiju_databases/nodes.dmp
rrna_ref: /path/to/rrna_reference.fasta
outdir: results

Run the pipeline:

nextflow run main.nf \
  -profile singularity \
  -params-file params/user_params.yml \
  --analyst_name "Your Name" \
  --facility "Your Institution"

To resume a failed run, add -resume:

nextflow run main.nf \
  -profile singularity \
  -params-file params/user_params.yml \
  --analyst_name "Your Name" \
  --facility "Your Institution" \
  -resume

Input

Samplesheet

Provide a CSV samplesheet via --input (default: index.csv). The file must include the following columns:

Column	Required	Description
`sample`	Yes	Unique sample identifier (no spaces)
`fastq_1`	Yes	Absolute path to R1 FASTQ file (`.fastq.gz` or `.fq.gz`)
`fastq_2`	No	Absolute path to R2 FASTQ file (`.fastq.gz` or `.fq.gz`). Leave blank for single-end data
`sample_information`	No	Free-text sample information
`sample_type`	No	Sample type descriptor
`sample_receipt_date`	No	Date sample was received
`storage_location`	No	Physical storage location of the sample

Example samplesheet (index.csv):

sample,fastq_1,fastq_2,sample_information,sample_type,sample_receipt_date,storage_location
Sample_A,/data/Sample_A_R1.fastq.gz,/data/Sample_A_R2.fastq.gz,Tomato leaf,Plant,2024-01-15,Freezer_1
Sample_B,/data/Sample_B_R1.fastq.gz,/data/Sample_B_R2.fastq.gz,Rose stem,Plant,2024-01-16,Freezer_1

Multiple FASTQ files from the same sample (e.g. different sequencing lanes) can be listed as separate rows with the same sample identifier — they will be concatenated before processing.

Pipeline Steps

Input FASTQ files
      │
      ▼
 CAT_FASTQ ──────── Concatenate files from the same sample
      │
      ▼
 (optional) SEQTK_SAMPLE ── Subsample to 40 M reads (default)
      │
      ├──► FASTQC_RAW ──── Raw read quality control
      │
      ▼
 FASTP ──────────── Adapter trimming and quality filtering
      │
      ├──► FASTQC_TRIM ─── Post-trimming quality control
      │
      ▼
 BBMAP_BBDUK ────── rRNA read removal
      │
      ▼
 BBMAP_BBSPLIT ──── PhiX decontamination
      │
      ├──► KRAKEN2_KRAKEN2 ──► BRACKEN ──► KRAKEN2_TO_KRONA ──► (Krona chart)
      │         │
      │         └──► RETRIEVE_VIRAL_READS_KRAKEN2 ── Extract viral + unclassified reads
      │                         │
      ├──► KAIJU_KAIJU ──────►  │          ── Protein-level classification + Krona chart
      │         │               │
      │         └──► SUMMARISE_READ_CLASSIFICATION
      │                         │
      ▼                         ▼
 SPADES ─────────── De novo assembly (rnaSPAdes)
      │
      ▼
 SEQTK ──────────── Filter contigs < 150 bp
      │
      ▼
 BLASTN ─────────── Megablast search (split input for parallel execution)
      │
      ▼
 EXTRACT_VIRAL_BLAST_HITS ── Identify viral contigs
      │
      ├──► EXTRACT_CONTIGS ─── Separate viral / other contigs
      │         │
      │         ▼
      │    MAPPING_BACK_TO_CONTIGS ──► PILEUP ──► IDENTIFY_ERRORS ──► TRIM_ENDS
      │              │                                                      │
      │              └──────────────────────────── REALIGN ◄────────────────┘
      │                                               │
      │                          SAMTOOLS_CONTIGS ◄───┘
      │                          MOSDEPTH_CONTIGS
      │                          PYFAIDX_CONTIGS
      │
      ├──► BLASTN_ROUND2 ──── Second BLAST on trimmed contigs
      │         │
      │         ▼
      │    EXTRACT_VIRAL_BLAST_HITS_ROUND2 ──► FASTA2TABLE
      │                                              │
      │              ┌───────────────────────────────┘
      │              ▼
      │    EXTRACT_REF_FASTA ──► CLUSTER ──► MAPPING_BACK_TO_REF
      │                                            │
      │                                       SAMTOOLS2 ──► BCFTOOLS ──► BEDTOOLS
      │                                       PYFAIDX / MOSDEPTH
      │                                       REF_COVSTATS ──► FASTA2TABLE2
      │                                       CONTIG_COVSTATS
      │
      ├──► ORFIPY ──────────── ORF prediction from contigs
      │         │
      │         ▼
      │    HMMSCAN ─────────── Protein domain annotation (Pfam)
      │
      ├──► GENOMAD ─────────── Virus / provirus classification
      │
      ├──► DIAMOND ─────────── Protein-level contig alignment (RVDB)
      │
      ▼
 SUMMARISE_RESULTS ── Combine all evidence into per-sample summary table
                       (includes novel virus candidate identification)
      │
 QCREPORT ────────── Run-level QC report
      │
 HTML_REPORT ──────── Per-sample interactive HTML report

Step Descriptions

Step	Tool	Description
Read concatenation	`cat`	Merges multiple FASTQ files per sample
Subsampling	seqtk	Randomly subsample reads to a target depth (default 40 M reads)
Raw QC	FastQC	Quality metrics for raw reads
Adapter trimming	fastp	Trims adapters, low-quality bases, and short reads
Trimmed QC	FastQC	Quality metrics after trimming
rRNA removal	BBDuk	Removes ribosomal RNA reads using a reference
PhiX removal	BBSplit	Removes PhiX174 spike-in reads
Taxonomic classification	Kraken2 + Bracken	k-mer based classification; Bracken re-estimates abundance
Taxonomic classification	Kaiju	Protein-level classification against nr/euk database
Visualisation	Krona	Interactive pie-chart visualisation of classifications
Viral read retrieval	KrakenTools	Extracts unclassified + viral-classified reads
De novo assembly	SPAdes (rnaSPAdes mode)	Assembles viral genome contigs
Contig filtering	seqtk	Removes contigs shorter than 150 bp
BLAST annotation	BLASTN	Searches assembled contigs against the NCBI nucleotide database
Hit extraction	custom script	Filters to top viral BLAST hits
Reference retrieval	Entrez Direct	Downloads matched reference sequences from NCBI
Reference clustering	CD-HIT	Clusters highly similar references
Reference mapping	BWA	Aligns cleaned reads back to reference sequences
Consensus calling	bcftools	Calls variants and applies them to generate consensus sequences
Coverage masking	bedtools	Masks low-coverage regions in the consensus
Depth analysis	mosdepth	Calculates per-base sequencing depth
ORF prediction	orfipy	Predicts open reading frames from assembled contigs
Protein domain annotation	HMMER (hmmscan)	Searches ORFs against Pfam protein domain database
Virus/provirus prediction	GeNomad	Classifies contigs as viral or proviral
Protein-level alignment	DIAMOND (blastx)	Protein-level alignment of contigs against a protein database (e.g. RVDB)
Results summary	custom script	Integrates evidence from all tools into a summary table (including novel virus candidates)
QC report	custom script	Generates run-level quality control report
HTML report	custom script	Generates interactive per-sample HTML report

Parameters

Parameters can be provided via a YAML params file (-params-file) or on the command line (--param_name value).

Required Parameters

Parameter	Description
`--blastn_db`	Path to BLAST nucleotide database (e.g. NCBI `core_nt`)
`--taxdump`	Path to TaxonKit taxonomy database directory
`--kraken2_db`	Path to Kraken2 database directory
`--kaiju_dbname`	Path to Kaiju `.fmi` index file
`--kaiju_names`	Path to Kaiju `names.dmp` file
`--kaiju_nodes`	Path to Kaiju `nodes.dmp` file
`--genomad_db`	Path to GeNomad database directory
`--hmmer_db`	Path to Pfam HMM database file (`Pfam-A.hmm`)
`--rrna_ref`	Path to rRNA reference FASTA file (for BBDuk filtering)
`--analyst_name`	Name of the analyst (included in reports)
`--facility`	Name of the sequencing/analysis facility (included in reports)

Optional Parameters

Parameter	Default	Description
`--input`	`index.csv`	Path to samplesheet CSV file
`--outdir`	`results`	Output directory
`--blast_threads`	`2`	Number of CPU threads for BLASTN
`--subsample_enabled`	`true`	Enable/disable read subsampling before processing
`--subsample_size`	`40000000`	Target number of reads after subsampling (40 M)
`--kraken2_save_classified_reads`	`false`	Save Kraken2-classified reads to output
`--kraken2_save_unclassified_reads`	`true`	Save Kraken2-unclassified reads to output
`--kraken2_save_readclassifications`	`true`	Save per-read Kraken2 classification output
`--save_trimmed_fail`	`false`	Save reads that failed fastp quality trimming
`--save_merged`	`false`	Save fastp-merged reads
`--qfiltered_reads_threshold`	`8000000`	Minimum number of quality-filtered reads required per sample
`--clean_reads_threshold`	`2500000`	Minimum number of clean reads (after decontamination) required per sample
`--prot_db`	`null`	Path to DIAMOND protein database (e.g. RVDB) for blastx alignment
`--rvdb_taxonomy`	`null`	Path to RVDB taxonomy file for DIAMOND results annotation
`--publish_dir_mode`	`copy`	Nextflow output publish mode (`copy`, `symlink`, `link`, etc.)
`--help`	`false`	Print help message and exit

Output

All results are written to --outdir (default: results). Per-sample results are nested under the sample identifier.

results/
├── 01_pipeline_logs/
│   ├── execution_timeline_<timestamp>.html   # Nextflow timeline report
│   ├── execution_report_<timestamp>.html     # Nextflow execution report
│   ├── execution_trace_<timestamp>.txt       # Process trace file
│   ├── pipeline_dag_<timestamp>.html         # Pipeline DAG visualisation
│   └── <timestamp>_nextflow_start_timestamp.txt
│
├── 02_qc_report/
│   ├── run_qc_report_<timestamp>.html        # Run-level QC summary (HTML)
│   └── run_qc_report_<timestamp>.txt         # Run-level QC summary (text)
│
└── <sample_id>/
    ├── 05_read_classification/
    │   ├── <sample_id>_kraken2_report.txt          # Kraken2 report
    │   ├── <sample_id>_kraken2_to_krona.html       # Kraken2 Krona chart
    │   ├── <sample_id>_kaiju_summary.txt           # Kaiju classification summary
    │   ├── <sample_id>_kraken_summary.txt          # Kraken2 classification summary
    │   └── <sample_id>_bracken_report.txt          # Bracken abundance report
    │
    ├── 06_assembly/
    │   └── <sample_id>_spades_scaffolds.fasta      # Filtered assembled contigs (≥ 150 bp)
    │
    ├── 07_annotation/
    │   ├── <sample_id>_megablast_top_viral_hits.txt          # Top viral BLAST hits
    │   ├── <sample_id>_orfs.fasta                            # Predicted ORFs (orfipy)
    │   ├── <sample_id>_hmmscan_per_target_output.txt         # HMMER per-target results
    │   ├── <sample_id>_hmmscan_per_domain_output.txt         # HMMER per-domain results
    │   ├── <sample_id>_hmm_domain_summary_counts.tsv         # HMM domain summary counts
    │   ├── <sample_id>_diamond_matches.txt                   # DIAMOND protein alignment results
    │   └── genomad/                                          # GeNomad virus predictions
    │
    ├── 08_mapping_to_contigs/
    │   ├── <sample_id>_contig_aln.sorted.bam                 # BAM alignment to contigs
    │   ├── <sample_id>_contig_aln.sorted.bam.bai             # BAM index
    │   ├── <sample_id>_contig_coverage.txt                   # Per-base contig coverage
    │   ├── <sample_id>_contig_mapq.txt                       # Mapping quality statistics
    │   ├── <sample_id>_contigs.trimmed.fa                    # End-trimmed contig sequences
    │   └── <sample_id>_trim_coords.tsv                       # Trimming coordinates
    │
    ├── 09_mapping_to_ref/
    │   ├── <sample_id>_ref_aln.sorted.bam                    # BAM alignment to references
    │   ├── <sample_id>_ref_aln.sorted.bam.bai                # BAM index
    │   ├── <sample_id>_ref_coverage.txt                      # Per-base reference coverage
    │   ├── <sample_id>_vcf_applied.fasta                     # Consensus with variants applied
    │   ├── <sample_id>_bcftools_masked_consensus.fasta       # Coverage-masked consensus
    │   └── <sample_id>_reference_with_cov_stats_final.txt    # Final reference stats table
    │
    ├── 10_results_summary/
    │   ├── <sample_id>_summary_viral_results.tsv             # Known virus summary table
    │   ├── <sample_id>_novel_virus_candidates.tsv            # Novel virus candidates table
    │   └── <sample_id>_evidence_summary_novel.txt            # Novel virus evidence summary
    │
    └── 11_report/
        └── <sample_id>_report.html                           # Interactive per-sample report

Profiles

The pipeline includes several execution profiles selectable with -profile:

Profile	Description
`docker`	Run processes inside Docker containers
`singularity`	Run processes inside Singularity containers (recommended on HPC)
`test`	Test profile

Multiple profiles can be combined with commas, e.g. -profile singularity,mtdt_test.

Tool Versions

Tool	Version	Purpose
BBMap	39.37	rRNA and PhiX decontamination
bcftools	1.19	Variant calling and consensus generation
bedtools	2.27.1	Coverage masking
BLASTN	2.16.0	Nucleotide sequence alignment
BWA	0.7.19	Short-read alignment to reference sequences
CD-HIT	4.8.1	Reference sequence clustering
DIAMOND	2.1.24	Protein-level contig alignment (blastx)
Entrez Direct	22.4	NCBI sequence retrieval
FastQC	0.12.1	Read quality control
fastp	1.0.1	Adapter trimming and quality filtering
GeNomad	1.11.2	Virus / provirus classification
HMMER	3.4.0	Protein domain annotation
Kaiju	1.8.2	Protein-level taxonomic classification
Kraken2	2.1.3	k-mer based taxonomic classification
KrakenTools	1.2.1	Viral read extraction from Kraken2 output
Krona	2.8.1	Interactive taxonomic visualisation
mosdepth	0.3.3	Sequencing depth analysis
orfipy	0.0.4	ORF prediction
pyfaidx	0.8.1.3	FASTA indexing
Python	3.12.9	Scripting
SAMtools	1.22.1	BAM processing
seqtk	1.3	Sequence subsampling and filtering
SPAdes	4.2.0	De novo genome assembly

Name		Name	Last commit message	Last commit date
Latest commit History 277 Commits
.vscode		.vscode
assets		assets
bin		bin
conf		conf
db		db
deploy/azure		deploy/azure
docs/azure		docs/azure
modules		modules
params		params
tests		tests
.env.azure.sample		.env.azure.sample
.gitignore		.gitignore
README.md		README.md
index.csv		index.csv
main.nf		main.nf
nextflow.config		nextflow.config
nextflow_schema.json		nextflow_schema.json
versions.yml		versions.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

daff-viral-assembly

Table of Contents

Overview

Requirements

Quick Start

Input

Samplesheet

Pipeline Steps

Step Descriptions

Parameters

Required Parameters

Optional Parameters

Output

Profiles

Tool Versions

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

daff-viral-assembly

Table of Contents

Overview

Requirements

Quick Start

Input

Samplesheet

Pipeline Steps

Step Descriptions

Parameters

Required Parameters

Optional Parameters

Output

Profiles

Tool Versions

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages