CORAL: Compact-genome Oriented RNA-based Annotation using Long reads

The CORAL protocol is a Snakemake workflow designed to annotate compact genomes using long-read RNA-seq data.

It takes clean (primer-trimmed), pre-processed FASTQ files as input and maps them to the provided genome using Minimap2. It then creates non-assembled annotations for each FASTQ file (using StringTie) and identifies potential operons within those annotations using GAMBA.

After identifying operon transcripts, operon-contained transcripts, and non-operon-related transcripts, CORAL generates consensus annotations for each of the three sets. These sets are then merged (using StringTie) to generate two final consensus annotations:

Merge clean_andOPRNs GTF: contains all three transcript sets.
Merge clean_noOPRNs GTF: includes only operon-contained transcripts and non-operon-related transcripts.

The quality of the annotation is assessed with BUSCO and, when a reference annotation is provided, with Gffcompare. Finally, if requested in the configuration file, CORAL generates an expression matrix for the consensus annotation that includes all the transcript sets (Merge clean_andOPRNs GTF).

Schematic pipeline:

Installation

This pipeline is built on Snakemake; therefore, you need to have Snakemake installed (tested on v5.24.1).

Source files for running this pipeline can be downloaded directly from the Releases page of this repository. However, because the repository includes a submodule, we recommend cloning it with:

git clone --recursive https://github.com/EvoDevoGenomics-UB/CORAL.git

Or, if your git version is 2.13 or later:

git clone --recurse-submodules https://github.com/EvoDevoGenomics-UB/CORAL.git

How to run

To run CORAL, edit the CORAL-config.yaml file with your desired parameters and run it like any other Snakemake workflow. We recommend running it with --use-conda, which automatically creates an environment and installs all the dependencies specified in the CORAL-env.yml file. Example command:

snakemake --use-conda --snakefile Snakefile --configfile CORAL-config.yaml --cores 4

Indicating the FASTQ files to use in the Config file

There are two ways to indicate to CORAL where to find your long-read FASTQ files:

By using a samplesheet file

Edit the samplesheet parameter to point to a TSV file:
```
samplesheet: "/absolute/path/to/TSV_file.tsv"
```
The TSV file should contain the 'sample names' and their 'absolute paths' separated by tab (\t). It should look like this:
```
Sample1    /absolute/path/to/your/sample1.fq
Sample2    /absolute/path/to/your/sample2.part1.fastq
Sample2    /absolute/path/to/your/sample2.part2.fastq
```
NOTE: This format supports multiple FASTQ files for a single 'sample name', and accepts different FASTQ suffixes (.fq or .fastq)
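
The samplesheet layout described above can be checked before launching the workflow. The following is a minimal sketch, not CORAL's own parser: the function name read_samplesheet is hypothetical, and it assumes a sheet with exactly two tab-separated columns and no header row.

```python
import csv
import io
from collections import defaultdict


def read_samplesheet(handle):
    """Group FASTQ paths by sample name (hypothetical helper, not part of CORAL)."""
    samples = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated columns, got {row!r}")
        name, path = (cell.strip() for cell in row)
        samples[name].append(path)  # repeated names collect several files
    return dict(samples)


# In-memory sheet mirroring the example layout, so nothing is read from disk.
example = io.StringIO(
    "Sample1\t/absolute/path/to/your/sample1.fq\n"
    "Sample2\t/absolute/path/to/your/sample2.part1.fastq\n"
    "Sample2\t/absolute/path/to/your/sample2.part2.fastq\n"
)
sheet = read_samplesheet(example)
for name, paths in sheet.items():
    print(name, len(paths))
```

A row that is separated by spaces instead of a tab raises a ValueError here, which is a common cause of samples not being found.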

Using directory and naming parameters

Edit the parameters data_dir, samples, and data_suffix. Example:
```
data_dir: "/absolute/path/to/the/data/files/"
samples: ["Sample1","Sample2","Sample3"]
data_suffix: "_chip-runXXXXX.fastq"
```
In this example, CORAL uses the entries provided in samples ("Sample1", "Sample2", and "Sample3") as sample names and expects the following FASTQ files:
```
/absolute/path/to/the/data/files/Sample1_chip-runXXXXX.fastq
/absolute/path/to/the/data/files/Sample2_chip-runXXXXX.fastq
/absolute/path/to/the/data/files/Sample3_chip-runXXXXX.fastq
```
This method is useful when each sample has a single FASTQ file and the file names differ only by sample ID. In this case, all sample files must share the same FASTQ suffix (either .fq or .fastq).
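
The way the three parameters combine into file paths can be sketched as follows. This is an illustration only, under the assumption that each path is data_dir + sample name + data_suffix, as the example above suggests; the function expected_fastq_paths is hypothetical and not part of CORAL.

```python
import posixpath


def expected_fastq_paths(data_dir, samples, data_suffix):
    """Map each sample name to the FASTQ path implied by the three parameters."""
    return {
        sample: posixpath.join(data_dir, sample + data_suffix)
        for sample in samples
    }


# Values copied from the configuration example above.
paths = expected_fastq_paths(
    data_dir="/absolute/path/to/the/data/files/",
    samples=["Sample1", "Sample2", "Sample3"],
    data_suffix="_chip-runXXXXX.fastq",
)
for sample, path in paths.items():
    print(sample, path)
```

Printing the resolved paths before a run is a quick way to confirm that the suffix and directory in the configuration match the files on disk.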

Output files

The CORAL pipeline creates several folders, including:

alignments: contains the read alignments for each individual sample.
index: contains the Minimap2 index of the genome.
logs: contains the log files for the different processes.
sample_annotations: contains the GTF annotation files created for each sample.
annotations: contains the consensus annotations (merged annotations).
GAMBA_results: contains the output of the GAMBA tool for each sample (i.e. the operons found in each sample).
busco_downloads: contains the BUSCO database used for the BUSCO analysis.
busco_analysis: contains the BUSCO results for the main consensus annotations.
TD2_results: contains the results from the TransDecoder analysis.
Expression_matix: contains the outputs generated to create the expression matrix of the 'noOPRNs' consensus annotation and of the reference annotation, if provided.
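
After a run, the folder list above can be compared against the working directory. The sketch below is a hypothetical helper, not part of CORAL; the folder names are copied verbatim from the list, and it assumes all folders are created directly under the directory you pass in. Some folders may legitimately be absent when the corresponding step is optional or disabled in the configuration.

```python
import tempfile
from pathlib import Path

# Folder names copied from the list above.
EXPECTED_DIRS = [
    "alignments", "index", "logs", "sample_annotations", "annotations",
    "GAMBA_results", "busco_downloads", "busco_analysis", "TD2_results",
    "Expression_matix",
]


def missing_output_dirs(workdir):
    """Return the expected folder names that do not exist under workdir."""
    root = Path(workdir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


# Demonstration on a temporary directory with only two folders present.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "alignments").mkdir()
    (Path(tmp) / "index").mkdir()
    print(missing_output_dirs(tmp))
```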

Running specific rules

It is possible to run specific parts of CORAL instead of the full workflow. Those parts are:

Alignment of FASTQ files: rule 'do_alignment'
Sample annotations: rule 'do_stringtie_sample_annotations'
GAMBA (operon finder): rule 'do_operon_annotations'
Consensus annotation: rule 'do_consensus_annotations'
BUSCO analysis: rule 'do_busco_analyses'
Gffcompare analysis: rule 'do_gffcompare'
TransDecoder analysis: rule 'do_transdecoder'
Expression matrix creation: rule 'do_expression_matrix'

To run any of those parts, run snakemake and specify the rule name as the target. Example:

snakemake --use-conda --snakefile Snakefile --configfile CORAL-config.yaml --cores 4 do_alignment
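
When CORAL is launched from a script, the same command line can be assembled programmatically. The following is a minimal sketch: the helper coral_command is hypothetical, the rule names are copied from the list above, and the flags are the ones used in the example commands in this README. It only builds and prints the command; it does not run Snakemake.

```python
import shlex

# Rule names copied from the list of runnable parts above.
RULES = (
    "do_alignment", "do_stringtie_sample_annotations", "do_operon_annotations",
    "do_consensus_annotations", "do_busco_analyses", "do_gffcompare",
    "do_transdecoder", "do_expression_matrix",
)


def coral_command(rule=None, cores=4, snakefile="Snakefile",
                  configfile="CORAL-config.yaml"):
    """Build the snakemake argument list; rule=None means the full workflow."""
    if rule is not None and rule not in RULES:
        raise ValueError(f"unknown rule: {rule!r}")
    cmd = ["snakemake", "--use-conda", "--snakefile", snakefile,
           "--configfile", configfile, "--cores", str(cores)]
    if rule is not None:
        cmd.append(rule)  # the target rule goes last
    return cmd


print(shlex.join(coral_command("do_alignment")))
# prints: snakemake --use-conda --snakefile Snakefile --configfile CORAL-config.yaml --cores 4 do_alignment
```

The returned list can be passed to subprocess.run if you want the script to start the workflow itself.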

Citation

If you use CORAL in your research, please cite the following publication:

Torres-Aguila, N.P., Cassà, B., and Canestro, C. (2025). CORAL: Accurate annotation of compact genomes using long-read RNA-seq, demonstrated in Oikopleura dioica. bioRxiv. DOI: 10.64898/2025.12.04.692336
