catGenes

Tools for DNA sequence retrieval, alignment, concatenation, and phylogenetic analysis in R

catGenes provides tools in the R environment for assembling, standardizing, and analyzing multilocus DNA datasets for phylogenetic and phylogenomic research. Although originally developed for concatenating multiple DNA alignments, the package now includes a broader set of functions for sequence retrieval from GenBank, combining FASTA files, automated multiple sequence alignment, alignment conversion, partitioned dataset export, evolutionary model selection, MrBayes workflow preparation and execution, and phylogenetic tree visualization.

The package is intended to support reproducible workflows from sequence retrieval and alignment processing to phylogenetic inference and tree visualization.

Main features

catGenes currently includes functions for:

retrieving DNA sequences from GenBank using accession numbers
retrieving DNA sequences from GenBank using taxonomic queries
mining targeted loci from plastid and mitochondrial genomes
combining multiple FASTA files into a single file
performing automated multiple sequence alignment
converting alignments among NEXUS, FASTA, and PHYLIP formats
comparing and concatenating multiple DNA alignments
handling datasets with single sequences per species or multiple accessions per species
writing concatenated datasets in NEXUS and PHYLIP formats with partition information
selecting evolutionary models for phylogenetic analysis
generating partitioned MrBayes command blocks
running MrBayes directly from R
plotting edited phylogenetic trees with ggtree

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("DBOSlab/catGenes")

Input data

Most catGenes workflows start from DNA sequences or individual DNA alignments. Depending on the function, inputs may include GenBank accession tables, FASTA files, or alignments in NEXUS, FASTA, or PHYLIP format. For concatenation functions, taxon labels should be consistently formatted across loci.

In general:

datasets with a single sequence per species can use labels such as Genus_species
datasets with multiple accessions per species should include a stable identifier after the taxon name
alignment file names are best kept simple and informative, usually matching the gene or locus name
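
The labeling guidelines above can be checked before concatenation. The following is a minimal base R sketch, not part of catGenes: the helper name `check_labels()`, the regular expression, and the example labels are all hypothetical, and the label formats actually accepted by the package may differ.

```r
# Hypothetical helper (not a catGenes function): flags labels that do not
# follow an assumed Genus_species or Genus_species_identifier pattern and
# lists taxa that are absent from each locus.
check_labels <- function(labels_by_locus,
                         pattern = "^[A-Z][a-z]+_[a-z-]+(_[A-Za-z0-9]+)?$") {
  # Labels that do not match the assumed pattern, per locus.
  malformed <- lapply(labels_by_locus, function(x) x[!grepl(pattern, x)])
  # Labels seen in any locus but missing from a given locus.
  all_taxa <- sort(unique(unlist(labels_by_locus, use.names = FALSE)))
  missing <- lapply(labels_by_locus, function(x) setdiff(all_taxa, x))
  list(malformed = malformed, missing = missing)
}

# Placeholder labels for illustration only; note the space in the matK label.
labels_by_locus <- list(
  ITS  = c("Genus_alpha", "Genus_beta", "Genus_gamma"),
  matK = c("Genus_alpha", "Genus beta", "Genus_gamma")
)

report <- check_labels(labels_by_locus)
report$malformed$matK
report$missing$matK
```

A label with a space instead of an underscore is reported both as malformed and as a taxon that appears to be missing from the other loci, which is the kind of inconsistency that is easier to fix before comparing alignments.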

A more detailed guide to sequence-label formatting, duplicated accessions, and naming conventions will be provided in dedicated articles.

General workflow

A typical catGenes workflow may involve some or all of the following steps:

retrieve sequences from GenBank or mine loci from organellar genomes
combine FASTA files when needed
perform automated multiple sequence alignment
convert alignments among standard file formats
compare taxa across loci and build equally sized gene datasets
write concatenated matrices in NEXUS or PHYLIP format
select substitution models and prepare partition information
run phylogenetic analyses
visualize and edit resulting trees

Typical phylogenetic workflow with `catGenes`

The diagram below summarizes a typical catGenes workflow, from sequence retrieval and alignment preparation to concatenation, model selection, phylogenetic inference, and tree visualization.

Figure: Overview of the main catGenes workflow, highlighting sequence retrieval, FASTA combination, sequence alignment, alignment conversion, concatenation, export of partitioned datasets, model selection, phylogenetic inference, and tree visualization.

Basic example

Load example DNA alignments

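library(catGenes)

# List the example alignment files shipped with the package
genes <- list.files(system.file("DNAlignments/Vataireoids",
                                package = "catGenes"))

# Read the first three NEXUS alignments into a list, one element per file
Vataireoids <- list()
for (i in genes[1:3]) {
  Vataireoids[[i]] <- ape::read.nexus.data(
    system.file("DNAlignments/Vataireoids", i, package = "catGenes")
  )
}
# Drop file extensions so each list element is named after its locus
names(Vataireoids) <- gsub("[.].*", "", names(Vataireoids))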

Compare loci and prepare a concatenated dataset

Use catfullGenes() when each species is represented by a single sequence per locus.

catdf <- catfullGenes(
  Vataireoids,
  shortaxlabel = TRUE,
  missdata = TRUE
)

When species are represented by multiple accessions across one or more alignments, use catmultGenes() instead.
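
To illustrate what "equally sized" gene datasets means, here is a conceptual base R sketch of one common approach: taxa absent from a locus are filled with missing-data characters so that every locus holds the same set of taxa. This does not reproduce the internals of catfullGenes() or catmultGenes(); the helper `pad_loci()` and the toy sequences are hypothetical.

```r
# Hypothetical helper (not a catGenes function): give every locus the same
# taxa by filling absent taxa with a run of missing-data characters.
pad_loci <- function(loci, missing_char = "?") {
  # Union of taxon labels across all loci.
  all_taxa <- sort(unique(unlist(lapply(loci, names), use.names = FALSE)))
  lapply(loci, function(seqs) {
    # Alignment length of this locus, taken from its first sequence.
    aln_len <- nchar(seqs[[1]])
    filler <- strrep(missing_char, aln_len)
    # Start from an all-missing matrix, then copy in the observed sequences.
    out <- setNames(rep(filler, length(all_taxa)), all_taxa)
    out[names(seqs)] <- seqs
    out
  })
}

# Toy aligned sequences for illustration only.
loci <- list(
  locusA = c(Genus_alpha = "ACGT", Genus_beta = "ACGA"),
  locusB = c(Genus_alpha = "TTGACC", Genus_gamma = "TTGATC")
)

padded <- pad_loci(loci)
lengths(padded)   # number of taxa per locus after padding
padded$locusA
```

After padding, each locus has one entry per taxon in the same order, so the loci can be joined end to end into a single matrix.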

Write a concatenated NEXUS matrix

writeNexus(
  catdf,
  file = "Vataireoids.nex",
  genomics = FALSE,
  interleave = TRUE,
  bayesblock = TRUE
)

Write a concatenated PHYLIP matrix and partition file

writePhylip(
  catdf,
  file = "Vataireoids_dataset.phy",
  genomics = FALSE,
  catalignments = TRUE,
  partitionfile = TRUE
)
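
Partition information for a concatenated matrix is essentially the start and end column of each locus. The sketch below computes these ranges in base R and prints them in a RAxML-style layout. The helper `partition_ranges()` and the locus lengths are hypothetical, and the partition file actually written by writePhylip() may use a different layout.

```r
# Hypothetical helper (not a catGenes function): derive column ranges for
# each locus in a concatenated alignment from the per-locus lengths.
partition_ranges <- function(locus_lengths) {
  ends <- cumsum(locus_lengths)            # last column of each locus
  starts <- ends - locus_lengths + 1L      # first column of each locus
  sprintf("DNA, %s = %d-%d", names(locus_lengths), starts, ends)
}

# Illustrative alignment lengths, not taken from the example dataset.
locus_lengths <- c(ITS = 700L, matK = 1500L, rbcL = 1400L)

parts <- partition_ranges(locus_lengths)
writeLines(parts)
```

The order of `locus_lengths` must match the order in which the loci were concatenated, otherwise the ranges will point at the wrong columns.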

Other key functions

Beyond concatenation, catGenes includes several additional tools that can be combined into a broader phylogenetic workflow.

Retrieve DNA sequences from GenBank

seqs <- mineSeq(
  inputdf = my_accession_table,
  gb.colnames = c("ITS", "matK", "rbcL")
)
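
The call above assumes an accession table already exists in the session. A plausible layout, inferred only from the `gb.colnames` argument, is one row per taxon and one column of accession numbers per locus. The `Species` column name and all values below are placeholders, and the layout expected by mineSeq() should be confirmed in the function documentation.

```r
# Assumed layout (not confirmed by the package documentation): one row per
# taxon, one accession column per locus. All values are placeholders, not
# real GenBank accessions; NA marks a locus with no sequence for that taxon.
my_accession_table <- data.frame(
  Species = c("Genus_alpha", "Genus_beta"),
  ITS     = c("ACCESSION_1", "ACCESSION_2"),
  matK    = c("ACCESSION_3", NA),
  rbcL    = c("ACCESSION_4", "ACCESSION_5"),
  stringsAsFactors = FALSE
)

# Check that the columns named in gb.colnames are present before retrieval.
gb_cols <- c("ITS", "matK", "rbcL")
all(gb_cols %in% names(my_accession_table))
```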

Mine sequences from GenBank using taxonomic queries

mineTaxa(
  term = "Leguminosae[Organism] AND matK[Gene]",
  retmax = 2000,
  clean_taxa = TRUE
)

Combine multiple FASTA files

result <- combineFASTA(
  input_files = c("gene1.fasta", "gene2.fasta", "gene3.fasta"),
  output_file = "combined_sequences.fasta"
)
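
For readers unfamiliar with what combining FASTA files involves, the following base R sketch shows the basic idea using temporary files. It is not the implementation of combineFASTA(), which may do more (for example, handling duplicated headers); the helper `combine_fasta_sketch()` and the toy records are hypothetical.

```r
# Hypothetical helper (not a catGenes function): append the records of
# several FASTA files into one file and return the number of sequences.
combine_fasta_sketch <- function(input_files, output_file) {
  records <- unlist(lapply(input_files, readLines), use.names = FALSE)
  records <- records[nzchar(trimws(records))]   # drop blank lines
  writeLines(records, output_file)
  invisible(sum(startsWith(records, ">")))      # count header lines
}

# Two small temporary FASTA files with toy records.
f1 <- tempfile(fileext = ".fasta")
f2 <- tempfile(fileext = ".fasta")
out <- tempfile(fileext = ".fasta")
writeLines(c(">Genus_alpha", "ACGT", "", ">Genus_beta", "ACGA"), f1)
writeLines(c(">Genus_gamma", "TTGA"), f2)

n_seqs <- combine_fasta_sketch(c(f1, f2), out)
n_seqs
readLines(out)
```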

Perform automated multiple sequence alignment

alignSeqs(
  filepath = "path_to_fasta_files",
  method = "ClustalW",
  format = "NEXUS"
)

Convert alignment formats

convertAlign(
  filepath = "path_to_alignments",
  format = "FASTA"
)

Mine loci from plastomes or mitochondrial genomes

minePlastome(
  genbank = c("NC_000000", "NC_000001"),
  genes = c("matK", "rbcL", "ndhF")
)

mineMitochondrion(
  genbank = c("NC_000000", "NC_000001"),
  genes = c("cox1", "nad1")
)

Select evolutionary models and generate MrBayes blocks

evomodelTest(
  nexus_file_path = "Vataireoids.nex",
  model_criteria = "BIC"
)

Run MrBayes from R

res <- mrbayesRun(
  nexus_file = "Vataireoids.nex",
  mrbayes_dir = "/path/to/mrbayes"
)

Plot phylogenetic trees

plotPhylo(
  tree = my_tree,
  layout = "rectangular",
  branch.supports = TRUE,
  show.tip.label = TRUE
)

Available functions

Function	Main purpose
`mineSeq()`	Download DNA sequences from GenBank using accession numbers
`mineTaxa()`	Mine DNA sequences from GenBank using taxonomic queries
`minePlastome()`	Retrieve targeted loci from plastid genomes available in GenBank
`mineMitochondrion()`	Retrieve targeted loci from mitochondrial genomes available in GenBank
`combineFASTA()`	Combine multiple FASTA files into a single FASTA file
`alignSeqs()`	Perform automated multiple sequence alignment using supported alignment algorithms
`convertAlign()`	Convert alignments among NEXUS, FASTA, and PHYLIP formats
`catfullGenes()`	Compare and prepare multiple alignments for concatenation when each species has a single sequence per locus
`catmultGenes()`	Compare and prepare multiple alignments for concatenation when species may have multiple accessions
`dropSeq()`	Remove redundant or less informative duplicated accessions from concatenated datasets
`writeNexus()`	Export concatenated datasets in NEXUS format, optionally with partition information and a MrBayes block
`writePhylip()`	Export concatenated datasets in PHYLIP format and write a partition file for downstream analyses
`evomodelTest()`	Perform substitution model selection and generate MrBayes-ready commands
`mrbayesRun()`	Run MrBayes directly from R using an existing NEXUS file
`plotPhylo()`	Plot and edit phylogenetic trees using `ggtree`

Notes on concatenation functions

The two main concatenation functions are:

catfullGenes() for datasets without duplicated species/accessions across alignments
catmultGenes() for datasets in which one or more species are represented by multiple accessions

Both functions return a list of equally sized gene data frames that can then be exported with writeNexus() or writePhylip().

Documentation

Full function documentation and articles are available at the catGenes website. More detailed articles describing individual functions and use cases will be added progressively.

Citation

Cardoso, D. & Cavalcante, Q. (2026). catGenes: Tools for DNA Alignment Concatenation, Sequence Mining, and Phylogenetic Analysis. GitHub repository: https://github.com/DBOSlab/catGenes
