Tools for DNA sequence retrieval, alignment, concatenation, and phylogenetic analysis in R
catGenes provides tools in the R
environment for assembling, standardizing,
and analyzing multilocus DNA datasets for phylogenetic and phylogenomic
research. Although originally developed for concatenating multiple DNA
alignments, the package now includes a broader set of functions for
sequence retrieval from GenBank, combining FASTA files, automated
multiple sequence alignment, alignment conversion, partitioned dataset
export, evolutionary model selection, MrBayes workflow preparation and
execution, and phylogenetic tree visualization.
The package is intended to support reproducible workflows from sequence retrieval and alignment processing to phylogenetic inference and tree visualization.
catGenes currently includes functions for:
- retrieving DNA sequences from GenBank using accession numbers
- retrieving DNA sequences from GenBank using taxonomic queries
- mining targeted loci from plastid and mitochondrial genomes
- combining multiple FASTA files into a single file
- performing automated multiple sequence alignment
- converting alignments among NEXUS, FASTA, and PHYLIP formats
- comparing and concatenating multiple DNA alignments
- handling datasets with single sequences per species or multiple accessions per species
- writing concatenated datasets in NEXUS and PHYLIP formats with partition information
- selecting evolutionary models for phylogenetic analysis
- generating partitioned MrBayes command blocks
- running MrBayes directly from R
- plotting edited phylogenetic trees with ggtree

You can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("DBOSlab/catGenes")

Most catGenes workflows start from DNA sequences or individual DNA
alignments. Depending on the function, inputs may include GenBank
accession tables, FASTA files, or alignments in NEXUS, FASTA, or PHYLIP
format. For concatenation functions, taxon labels should be consistently
formatted across loci.
In general:
- datasets with a single sequence per species can use labels such as Genus_species
- datasets with multiple accessions per species should include a stable identifier after the taxon name
- alignment file names are best kept simple and informative, usually matching the gene or locus name

A more detailed guide to sequence-label formatting, duplicated accessions, and naming conventions will be provided in dedicated articles.
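
The labelling guidance above can be checked before concatenation with a few lines of base R. The sketch below is only an illustration and does not use any catGenes function: the taxon labels, the `label_pattern` regular expression, and the object names are all assumptions chosen for the example.

```r
# Hypothetical taxon labels for two loci (illustrative data only).
labels_by_locus <- list(
  ITS  = c("Vatairea_guianensis", "Luetzelburgia_auriculata", "Sweetia_fruticosa"),
  matK = c("Vatairea_guianensis", "Luetzelburgia auriculata", "Sweetia_fruticosa")
)

# Assumed convention: Genus_species, optionally followed by
# underscore-separated identifiers.
label_pattern <- "^[A-Z][a-z]+_[a-z-]+(_[A-Za-z0-9]+)*$"

# Labels in each locus that do not follow the assumed convention.
malformed <- lapply(labels_by_locus, function(x) x[!grepl(label_pattern, x)])

# Labels shared by every locus, and labels missing from at least one locus.
shared <- Reduce(intersect, labels_by_locus)
unshared <- lapply(labels_by_locus, setdiff, y = shared)

print(malformed)
print(unshared)
```

Here the matK label written with a space instead of an underscore is reported both as malformed and as unshared, which is the kind of mismatch that is easier to fix before alignments are compared across loci.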
A typical catGenes workflow may involve some or all of the following
steps:
- retrieve sequences from GenBank or mine loci from organellar genomes
- combine FASTA files when needed
- perform automated multiple sequence alignment
- convert alignments among standard file formats
- compare taxa across loci and build equally sized gene datasets
- write concatenated matrices in NEXUS or PHYLIP format
- select substitution models and prepare partition information
- run phylogenetic analyses
- visualize and edit resulting trees
The diagram below summarizes a typical catGenes workflow, from
sequence retrieval and alignment preparation to concatenation, model
selection, phylogenetic inference, and tree visualization.

Figure: Overview of the main catGenes workflow, highlighting sequence retrieval, FASTA combination, sequence alignment, alignment conversion, concatenation, export of partitioned datasets, model selection, phylogenetic inference, and tree visualization.

library(catGenes)

genes <- list.files(system.file("DNAlignments/Vataireoids",
                                package = "catGenes"))
Vataireoids <- list()
for (i in genes[1:3]) {
  Vataireoids[[i]] <- ape::read.nexus.data(
    system.file("DNAlignments/Vataireoids", i, package = "catGenes")
  )
}
names(Vataireoids) <- gsub("[.].*", "", names(Vataireoids))

Use catfullGenes() when each species is represented by a single
sequence per locus.

catdf <- catfullGenes(
  Vataireoids,
  shortaxlabel = TRUE,
  missdata = TRUE
)

When species are represented by multiple accessions across one or more
alignments, use catmultGenes() instead.
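
One way to decide between the two functions is to count how many labels map to the same species. The following base R sketch assumes labels of the form Genus_species followed by an identifier; the labels, the identifiers, and the object names are hypothetical, and this is not how catGenes itself detects duplicates.

```r
# Hypothetical labels: Genus_species followed by an accession-like identifier.
labels <- c("Vatairea_guianensis_A1", "Vatairea_guianensis_B2",
            "Sweetia_fruticosa_C3")

# Keep only the first two underscore-separated fields (Genus_species).
species <- sub("^([^_]+_[^_]+).*$", "\\1", labels)

# Species represented by more than one accession.
counts <- table(species)
multi_accession <- names(counts[counts > 1])
has_multiple <- length(multi_accession) > 0

print(multi_accession)
print(has_multiple)
```

If `has_multiple` is `TRUE` for any alignment, the dataset falls into the multiple-accession case described above.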

writeNexus(
  catdf,
  file = "Vataireoids.nex",
  genomics = FALSE,
  interleave = TRUE,
  bayesblock = TRUE
)

writePhylip(
  catdf,
  file = "Vataireoids_dataset.phy",
  genomics = FALSE,
  catalignments = TRUE,
  partitionfile = TRUE
)

Beyond concatenation, catGenes includes several additional tools that
can be combined into a broader phylogenetic workflow.

seqs <- mineSeq(
  inputdf = my_accession_table,
  gb.colnames = c("ITS", "matK", "rbcL")
)

mineTaxa(
  term = "Leguminosae[Organism] AND matK[Gene]",
  retmax = 2000,
  clean_taxa = TRUE
)

result <- combineFASTA(
  input_files = c("gene1.fasta", "gene2.fasta", "gene3.fasta"),
  output_file = "combined_sequences.fasta"
)

alignSeqs(
  filepath = "path_to_fasta_files",
  method = "ClustalW",
  format = "NEXUS"
)

convertAlign(
  filepath = "path_to_alignments",
  format = "FASTA"
)

minePlastome(
  genbank = c("NC_000000", "NC_000001"),
  genes = c("matK", "rbcL", "ndhF")
)

mineMitochondrion(
  genbank = c("NC_000000", "NC_000001"),
  genes = c("cox1", "nad1")
)

evomodelTest(
  nexus_file_path = "Vataireoids.nex",
  model_criteria = "BIC"
)

res <- mrbayesRun(
  nexus_file = "Vataireoids.nex",
  mrbayes_dir = "/path/to/mrbayes"
)

plotPhylo(
  tree = my_tree,
  layout = "rectangular",
  branch.supports = TRUE,
  show.tip.label = TRUE
)

| Function | Main purpose |
|---|---|
| mineSeq() | Download DNA sequences from GenBank using accession numbers |
| mineTaxa() | Mine DNA sequences from GenBank using taxonomic queries |
| minePlastome() | Retrieve targeted loci from plastid genomes available in GenBank |
| mineMitochondrion() | Retrieve targeted loci from mitochondrial genomes available in GenBank |
| combineFASTA() | Combine multiple FASTA files into a single FASTA file |
| alignSeqs() | Perform automated multiple sequence alignment using supported alignment algorithms |
| convertAlign() | Convert alignments among NEXUS, FASTA, and PHYLIP formats |
| catfullGenes() | Compare and prepare multiple alignments for concatenation when each species has a single sequence per locus |
| catmultGenes() | Compare and prepare multiple alignments for concatenation when species may have multiple accessions |
| dropSeq() | Remove redundant or less informative duplicated accessions from concatenated datasets |
| writeNexus() | Export concatenated datasets in NEXUS format, optionally with partition information and a MrBayes block |
| writePhylip() | Export concatenated datasets in PHYLIP format and write a partition file for downstream analyses |
| evomodelTest() | Perform substitution model selection and generate MrBayes-ready commands |
| mrbayesRun() | Run MrBayes directly from R using an existing NEXUS file |
| plotPhylo() | Plot and edit phylogenetic trees using ggtree |

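
To make one of the simpler operations in the table concrete, the sketch below merges FASTA files using base R only. It is not the implementation of combineFASTA(); the file names, the sequences, and the temporary directory are made up for the example.

```r
# Write two small FASTA files to a temporary directory (illustrative data).
tmp <- tempfile("fasta_demo_")
dir.create(tmp)
writeLines(c(">Vatairea_guianensis", "ACGTACGT"), file.path(tmp, "gene1.fasta"))
writeLines(c(">Sweetia_fruticosa", "ACGTTCGT"), file.path(tmp, "gene2.fasta"))

# Read every FASTA file and append the records in file-name order.
input_files <- sort(list.files(tmp, pattern = "[.]fasta$", full.names = TRUE))
combined <- unlist(lapply(input_files, readLines), use.names = FALSE)

# Write the merged records to a single output file.
output_file <- tempfile(fileext = ".fasta")
writeLines(combined, output_file)

# Count records by counting header lines.
n_records <- sum(startsWith(readLines(output_file), ">"))
print(n_records)
```

Counting header lines after writing is a cheap sanity check that no record was lost when the files were merged.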
The two main concatenation functions are:
- catfullGenes() for datasets without duplicated species/accessions across alignments
- catmultGenes() for datasets in which one or more species are represented by multiple accessions

Both functions return a list of equally sized gene data frames that can then be exported with writeNexus() or writePhylip().

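
The idea of equally sized gene datasets and partition information can be sketched in base R. The example below is a simplified illustration, not the package's code: it stores each locus as a named character vector rather than a data frame, assumes that taxa absent from a locus are filled with `?`, and uses made-up sequences and object names.

```r
# Hypothetical aligned loci as named character vectors (taxon -> sequence).
loci <- list(
  ITS  = c(Vatairea_guianensis = "ACGT-A", Sweetia_fruticosa = "ACGTTA"),
  matK = c(Vatairea_guianensis = "GGCC")
)

all_taxa <- sort(unique(unlist(lapply(loci, names))))

# Equalize loci: every taxon appears in every locus; absent taxa are
# filled with "?" (assumed missing-data symbol).
equalized <- lapply(loci, function(aln) {
  filler <- strrep("?", nchar(aln[[1]]))
  out <- setNames(rep(filler, length(all_taxa)), all_taxa)
  out[names(aln)] <- aln
  out
})

# Concatenate locus by locus and record 1-based partition ranges.
supermatrix <- do.call(paste0, unname(equalized))
names(supermatrix) <- all_taxa
lens <- vapply(equalized, function(aln) nchar(aln[[1]]), integer(1))
ends <- cumsum(lens)
starts <- ends - lens + 1
partitions <- paste0(names(lens), " = ", starts, "-", ends)

print(supermatrix)
print(partitions)
```

Because every locus holds the same taxa in the same order after equalization, concatenation reduces to pasting the loci side by side, and the partition ranges follow directly from the cumulative locus lengths.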
Full function documentation and articles are available at the catGenes
website. More detailed
articles describing individual functions and use cases will be added
progressively.
Cardoso, D. & Cavalcante, Q. (2026). catGenes: Tools for DNA Alignment Concatenation, Sequence Mining, and Phylogenetic Analysis. GitHub repository: https://github.com/DBOSlab/catGenes

