EvoDevoGenomics-UB/CORAL

CORAL: Compact-genome Oriented RNA-based Annotation using Long reads

DOI: 10.64898/2025.12.04.692336

The CORAL protocol is a Snakemake workflow designed to annotate compact genomes using long-read RNA-seq data.

It takes clean (primer-trimmed), pre-processed FASTQ files as input and maps them to the provided genome using Minimap2. It then creates a non-assembled annotation for each FASTQ file (using StringTie) and identifies potential operons within those annotations using GAMBA.

After identifying operon transcripts, operon-contained transcripts, and non-operon-related transcripts, CORAL generates consensus annotations for each of the three sets. These sets are then merged (using StringTie) to generate two final consensus annotations:

  • Merge clean_andOPRNs GTF: contains all three transcript sets.
  • Merge clean_noOPRNs GTF: includes only operon-contained transcripts and non-operon-related transcripts.

The quality of the annotation is assessed with BUSCO and, optionally, with Gffcompare when a reference annotation is provided. Finally, when specified in the configuration file, CORAL can generate an expression matrix for the consensus annotation that includes all the transcript sets (Merge clean_andOPRNs GTF).

Schematic pipeline:

[Figure: schematic of the CORAL pipeline]

Installation

This pipeline is built on Snakemake; therefore, you need to have Snakemake installed (tested on v5.24.1).

Source files for running this pipeline can be downloaded directly from the Releases page of this repository. However, because the repository contains a submodule, we recommend cloning it with:

git clone --recursive https://github.com/EvoDevoGenomics-UB/CORAL.git

Or, if your Git version is newer than 2.13:

git clone --recurse-submodules https://github.com/EvoDevoGenomics-UB/CORAL.git

How to run

To run CORAL, modify the CORAL-config.yaml file with your desired parameters and execute it like any other Snakemake workflow. We recommend running it with --use-conda, which automatically creates an environment and installs all the dependencies specified in the CORAL-env.yml file. Example command:

snakemake --use-conda --snakefile Snakefile --configfile CORAL-config.yaml --cores 4

Indicating the FASTQ files to use in the Config file

There are two ways to indicate to CORAL where to find your long-read FASTQ files:

  1. By using a samplesheet file

    Edit the samplesheet parameter to point to a TSV file:

    samplesheet: "/absolute/path/to/TSV_file.tsv"
    

    The TSV file should contain each sample name and the absolute path to its FASTQ file, separated by a tab (\t). It should look like this:

    Sample1    /absolute/path/to/your/sample1.fq
    Sample2    /absolute/path/to/your/sample2.part1.fastq
    Sample2    /absolute/path/to/your/sample2.part2.fastq
    

    NOTE: This format supports multiple FASTQ files for a single 'sample name', and accepts different FASTQ suffixes (.fq or .fastq)

  2. Using directory and naming parameters

    Edit the parameters data_dir, samples, and data_suffix. Example:

    data_dir: "/absolute/path/to/the/data/files/"
    samples: ["Sample1","Sample2","Sample3"]
    data_suffix: "_chip-runXXXXX.fastq"
    

    In this example, CORAL will use the names provided in samples ("Sample1", "Sample2", and "Sample3") as sample names, and will expect the FASTQ files to be:

    /absolute/path/to/the/data/files/Sample1_chip-runXXXXX.fastq
    /absolute/path/to/the/data/files/Sample2_chip-runXXXXX.fastq
    /absolute/path/to/the/data/files/Sample3_chip-runXXXXX.fastq
    

    This method is useful when the sample files have highly similar names that differ only in their IDs, and each sample has a single FASTQ file to process. In this case, all sample files must share the same FASTQ suffix (either .fq or .fastq).
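As a rough illustration of how the two methods above describe the same information, the following Python sketch turns either a samplesheet or the data_dir / samples / data_suffix parameters into a mapping from sample name to FASTQ paths. The function names samples_from_samplesheet and samples_from_naming are hypothetical and are not part of CORAL; the actual parsing done by the workflow may differ.

```python
import csv
import io
from collections import defaultdict


def samples_from_samplesheet(tsv_text):
    """Map each sample name to its FASTQ paths, read from tab-separated text."""
    samples = defaultdict(list)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated columns, got {row!r}")
        name, path = (field.strip() for field in row)
        samples[name].append(path)  # a sample may list several FASTQ files
    return dict(samples)


def samples_from_naming(data_dir, samples, data_suffix):
    """Build one FASTQ path per sample as data_dir + sample + data_suffix."""
    prefix = data_dir if data_dir.endswith("/") else data_dir + "/"
    return {name: [prefix + name + data_suffix] for name in samples}


# Example values copied from the README; the paths are placeholders.
sheet = (
    "Sample1\t/absolute/path/to/your/sample1.fq\n"
    "Sample2\t/absolute/path/to/your/sample2.part1.fastq\n"
    "Sample2\t/absolute/path/to/your/sample2.part2.fastq\n"
)
by_sheet = samples_from_samplesheet(sheet)
by_naming = samples_from_naming(
    "/absolute/path/to/the/data/files/",
    ["Sample1", "Sample2", "Sample3"],
    "_chip-runXXXXX.fastq",
)
print(by_sheet["Sample2"])
print(by_naming["Sample3"])
```

The sketch makes the practical difference visible: the samplesheet can attach several files to one sample name, while the naming parameters always yield exactly one file per sample.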

Output files

The CORAL pipeline creates several folders, including:

  • alignments: contains all the reads alignments for each sample individually.
  • index: contains the minimap2 index of the genome.
  • logs: contains the log files for the different processes.
  • sample_annotations: contains the GTF annotation files created for each sample.
  • annotations: contains the consensus annotations (merged annotations).
  • GAMBA_results: contains the output of the GAMBA tool for each sample (i.e. the operons found in each sample).
  • busco_downloads: contains the BUSCO database used for the BUSCO analysis.
  • busco_analysis: contains the BUSCO results for the main consensus annotations.
  • TD2_results: contains the results from TransDecoder analysis.
  • Expression_matix: contains the outputs generated to create the expression matrix of the 'noOPRNs' consensus annotation and of the reference annotation, if provided.
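After a run, a quick way to see which of these folders exist is a small check like the one below. This is a minimal sketch, not part of CORAL: the helper missing_folders is hypothetical, the folder names are copied verbatim from the list above, and it assumes the folders are created directly inside the directory you pass in. Some folders belong to optional steps, so a missing folder does not necessarily indicate a failure.

```python
import tempfile
from pathlib import Path

# Folder names copied verbatim from the README list above.
EXPECTED_FOLDERS = [
    "alignments",
    "index",
    "logs",
    "sample_annotations",
    "annotations",
    "GAMBA_results",
    "busco_downloads",
    "busco_analysis",
    "TD2_results",
    "Expression_matix",
]


def missing_folders(run_dir):
    """Return the expected output folders that are absent from run_dir."""
    run_dir = Path(run_dir)
    return [name for name in EXPECTED_FOLDERS if not (run_dir / name).is_dir()]


# Demonstration on a temporary directory that mimics a partial run.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("alignments", "index", "logs"):
        (Path(tmp) / name).mkdir()
    demo_missing = missing_folders(tmp)

print(demo_missing)
```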

Running specific rules

It is possible to run specific parts of CORAL instead of the full workflow. Those parts are:

  • Alignment of FASTQ files: rule 'do_alignment'
  • Sample annotations: rule 'do_stringtie_sample_annotations'
  • GAMBA (operon finder): rule 'do_operon_annotations'
  • Consensus annotation: rule 'do_consensus_annotations'
  • BUSCO analysis: rule 'do_busco_analyses'
  • Gffcompare analysis: rule 'do_gffcompare'
  • TransDecoder analysis: rule 'do_transdecoder'
  • Expression matrix creation: rule 'do_expression_matrix'

To run any of those parts, run snakemake and specify the rule name as the target. Example:

snakemake --use-conda --snakefile Snakefile --configfile CORAL-config.yaml --cores 4 do_alignment
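If you launch CORAL from a script, a thin wrapper can assemble the command and catch a mistyped rule name before Snakemake starts. The sketch below is only an illustration: coral_command is a hypothetical helper, the rule names are the ones listed above, and the defaults mirror the example command. The dry_run option adds Snakemake's -n flag, which lists the jobs that would run without executing them.

```python
import shlex

# Rule names as listed in this README.
CORAL_RULES = (
    "do_alignment",
    "do_stringtie_sample_annotations",
    "do_operon_annotations",
    "do_consensus_annotations",
    "do_busco_analyses",
    "do_gffcompare",
    "do_transdecoder",
    "do_expression_matrix",
)


def coral_command(rule=None, cores=4, configfile="CORAL-config.yaml", dry_run=False):
    """Return the snakemake argument list for the full workflow or one rule."""
    if rule is not None and rule not in CORAL_RULES:
        raise ValueError(f"unknown CORAL rule: {rule!r}")
    cmd = [
        "snakemake", "--use-conda",
        "--snakefile", "Snakefile",
        "--configfile", configfile,
        "--cores", str(cores),
    ]
    if dry_run:
        cmd.append("-n")  # Snakemake dry run: show jobs without executing them
    if rule is not None:
        cmd.append(rule)  # target rule goes last, as in the example above
    return cmd


print(shlex.join(coral_command("do_alignment")))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the directory that contains the Snakefile.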

Citation

If you use CORAL in your research, please cite the following publication:

  • Torres-Aguila, N.P., Cassà, B., and Canestro, C. (2025). CORAL: Accurate annotation of compact genomes using long-read RNA-seq, demonstrated in Oikopleura dioica. bioRxiv. DOI: 10.64898/2025.12.04.692336
