The CORAL protocol is a Snakemake workflow designed to annotate compact genomes using long-read RNA-seq data.
It takes clean (primer-trimmed), pre-processed FASTQ files as input and maps them to the provided genome using Minimap2. It then creates non-assembled annotations for each FASTQ file (using StringTie) and identifies potential operons within those annotations using GAMBA.
After identifying operon transcripts, operon-contained transcripts, and non-operon-related transcripts, CORAL generates consensus annotations for each of the three sets. These sets are then merged (using StringTie) to generate two final consensus annotations:
- Merge clean_andOPRNs GTF: contains all three transcript sets.
- Merge clean_noOPRNs GTF: includes only operon-contained transcripts and non-operon-related transcripts.
The quality of the annotation is assayed with BUSCO, and optionally with Gffcompare when a reference annotation is provided. Finally, CORAL can generate an expression matrix for the consensus annotation including all the transcript sets (Merge clean_andOPRNs GTF), when specified in the configuration file.
Schematic pipeline:
This pipeline is built on Snakemake; therefore, you need to have Snakemake installed (tested on v5.24.1).
Source files for running this pipeline can be downloaded directly from the Releases page of this repository. However, because the repository includes a submodule, we recommend cloning it with:
git clone --recursive https://github.com/EvoDevoGenomics-UB/CORAL.git
Or if your git version is >2.13:
git clone --recurse-submodules https://github.com/EvoDevoGenomics-UB/CORAL.git
To run CORAL, modify the CORAL-config.yaml file with your desired parameters and execute it like any other Snakemake workflow. We recommend running it with --use-conda, which automatically creates an environment and installs all the dependencies specified in the CORAL-env.yml file. Example command:
snakemake --use-conda --snakefile Snakefile --configfile CORAL-config.yaml --cores 4
There are two ways to indicate to CORAL where to find your long-read FASTQ files:
- By using a samplesheet file

  Edit the samplesheet parameter to point to a TSV file:

      samplesheet: "/absolute/path/to/TSV_file.tsv"

  The TSV file should contain the 'sample names' and their 'absolute paths' separated by a tab (\t). It should look like this:

      Sample1	/absolute/path/to/your/sample1.fq
      Sample2	/absolute/path/to/your/sample2.part1.fastq
      Sample2	/absolute/path/to/your/sample2.part2.fastq

  NOTE: This format supports multiple FASTQ files for a single 'sample name', and accepts different FASTQ suffixes (.fq or .fastq).

- Using directory and naming parameters

  Edit the parameters data_dir, samples, and data_suffix. Example:

      data_dir: "/absolute/path/to/the/data/files/"
      samples: ["Sample1","Sample2","Sample3"]
      data_suffix: "_chip-runXXXXX.fastq"

  In this example, CORAL will use the names provided in samples ("Sample1", "Sample2", and "Sample3") as 'sample names', and will look for the following FASTQ files:

      /absolute/path/to/the/data/files/Sample1_chip-runXXXXX.fastq
      /absolute/path/to/the/data/files/Sample2_chip-runXXXXX.fastq
      /absolute/path/to/the/data/files/Sample3_chip-runXXXXX.fastq

  This method is useful when the sample files have highly similar names that differ only in their IDs, and each sample has a single FASTQ file to process. In this case all sample files must share the same FASTQ suffix (either .fq or .fastq).
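The two input modes above can be sanity-checked before launching the workflow. The following is a minimal sketch, not part of CORAL: the function names parse_samplesheet and expand_directory_mode are hypothetical, and the checks (exactly two columns, absolute paths, a .fq or .fastq suffix) are assumptions based on the description above rather than on CORAL's own validation logic.

```python
import csv
import io
from collections import defaultdict
from pathlib import PurePosixPath


def parse_samplesheet(handle):
    """Map each sample name to the list of FASTQ paths listed for it.

    Hypothetical helper: mirrors the TSV layout described above
    (sample name, tab, absolute path), one file per row.
    """
    samples = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated columns, got {len(row)}: {row}")
        name, path = (cell.strip() for cell in row)
        if not PurePosixPath(path).is_absolute():
            raise ValueError(f"path is not absolute: {path}")
        # Assumption: only the two suffixes named in this README are accepted.
        if not path.endswith((".fq", ".fastq")):
            raise ValueError(f"unexpected FASTQ suffix: {path}")
        samples[name].append(path)
    return dict(samples)


def expand_directory_mode(data_dir, samples, data_suffix):
    """Build one path per sample as data_dir + sample + data_suffix."""
    base = PurePosixPath(data_dir)
    return {name: [str(base / f"{name}{data_suffix}")] for name in samples}


# Samplesheet mode: Sample2 is split across two FASTQ files.
sheet = io.StringIO(
    "Sample1\t/absolute/path/to/your/sample1.fq\n"
    "Sample2\t/absolute/path/to/your/sample2.part1.fastq\n"
    "Sample2\t/absolute/path/to/your/sample2.part2.fastq\n"
)
by_sample = parse_samplesheet(sheet)

# Directory mode: exactly one FASTQ file per sample, shared suffix.
by_dir = expand_directory_mode(
    "/absolute/path/to/the/data/files/",
    ["Sample1", "Sample2", "Sample3"],
    "_chip-runXXXXX.fastq",
)

for name, paths in by_sample.items():
    print(name, paths)
```

Both helpers return the same shape (sample name mapped to a list of paths), which makes the difference between the modes explicit: only the samplesheet can attach several files to one sample.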
The CORAL pipeline creates several folders, including:
- alignments: contains all the reads alignments for each sample individually.
- index: contains the minimap2 index of the genome.
- logs: contains the log files for the different processes.
- sample_annotations: contains the GTF annotation files created for each sample.
- annotations: contains the consensus annotations (merged annotations).
- GAMBA_results: contains the output of the GAMBA tool for each sample (i.e. the operons found in each sample).
- busco_downloads: contains the BUSCO database used for the BUSCO analysis.
- busco_analysis: contains the BUSCO results for the main consensus annotations.
- TD2_results: contains the results from TransDecoder analysis.
- Expression_matix: contains the outputs generated to create the expression matrix of the 'noOPRNs' consensus annotation and of the reference annotation, if provided.
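After a run, it can be handy to see which of these folders were actually produced. The snippet below is a small sketch under stated assumptions: the helper missing_output_folders is hypothetical, the folder names are copied as spelled in the list above, and it assumes the folders are created directly inside the working directory. Folders belonging to optional steps (for example the expression matrix or Gffcompare) may legitimately be absent.

```python
import tempfile
from pathlib import Path

# Folder names copied verbatim from the list above (the list is not exhaustive).
EXPECTED_FOLDERS = [
    "alignments",
    "index",
    "logs",
    "sample_annotations",
    "annotations",
    "GAMBA_results",
    "busco_downloads",
    "busco_analysis",
    "TD2_results",
    "Expression_matix",
]


def missing_output_folders(workdir):
    """Return the expected folder names that do not exist under workdir."""
    root = Path(workdir)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


# Demonstration on a throwaway directory that mimics a partial run
# in which only the alignment step has completed.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("alignments", "index", "logs"):
        (Path(tmp) / name).mkdir()
    missing = missing_output_folders(tmp)

print("Not yet created:", ", ".join(missing))
```

In practice, missing_output_folders would be called on the directory from which snakemake was launched.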
It is possible to run specific parts of CORAL instead of the full workflow. Those parts are:
- Alignment of FASTQ files: rule 'do_alignment'
- Sample annotations: rule 'do_stringtie_sample_annotations'
- GAMBA (operon finder): rule 'do_operon_annotations'
- Consensus annotation: rule 'do_consensus_annotations'
- BUSCO analysis: rule 'do_busco_analyses'
- Gffcompare analysis: rule 'do_gffcompare'
- TransDecoder analysis: rule 'do_transdecoder'
- Expression matrix creation: rule 'do_expression_matrix'
To run any of those parts just run snakemake specifying the rule name. Example:
snakemake --use-conda --snakefile Snakefile --configfile CORAL-config.yaml --cores 4 do_alignment
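When CORAL is launched from a script, the command line above can be assembled programmatically. This is a minimal sketch: build_command is a hypothetical helper, the rule names are the ones listed in this section, and the flags are exactly those used in the example commands of this README. It only builds the argument list; passing it to subprocess.run is left to the caller.

```python
import shlex

# Rule names for partial runs, as listed above.
PARTIAL_RULES = {
    "do_alignment",
    "do_stringtie_sample_annotations",
    "do_operon_annotations",
    "do_consensus_annotations",
    "do_busco_analyses",
    "do_gffcompare",
    "do_transdecoder",
    "do_expression_matrix",
}


def build_command(rule=None, cores=4, snakefile="Snakefile",
                  configfile="CORAL-config.yaml"):
    """Return the snakemake argument list for a full or partial CORAL run."""
    if rule is not None and rule not in PARTIAL_RULES:
        raise ValueError(f"unknown CORAL rule: {rule}")
    argv = [
        "snakemake", "--use-conda",
        "--snakefile", snakefile,
        "--configfile", configfile,
        "--cores", str(cores),
    ]
    if rule is not None:
        argv.append(rule)  # the target rule goes last, as in the example above
    return argv


full_run = shlex.join(build_command())
alignment_only = shlex.join(build_command("do_alignment"))
print(alignment_only)
```

Checking the rule name up front catches typos before Snakemake starts resolving the workflow.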
If you use CORAL in your research, please cite the following publication:
- Torres-Aguila, N.P., Cassà, B., and Canestro, C. (2025). CORAL: Accurate annotation of compact genomes using long-read RNA-seq, demonstrated in Oikopleura dioica. bioRxiv. DOI: 10.64898/2025.12.04.692336