Recovery of Eukaryotic Metagenome-Assembled Genomes using contrastive learning. A specialized metagenomic binning tool designed for recovering high-quality eukaryotic genomes from mixed prokaryotic-eukaryotic samples.
- Installation
- Quick Start
- Usage
- Common Options
- How It Works
- Output
- Requirements
- Acknowledgments
- License
- Citation
conda create -n remag -c bioconda -c conda-forge remag
conda activate remag

Install miniprot separately first:
conda create -n remag python=3.9
conda activate remag
conda install -c bioconda miniprot
pip install remag

conda create -n remag python=3.9
conda activate remag
git clone https://github.com/danielzmbp/remag.git
cd remag
conda install -c bioconda miniprot
pip install .

pip install -e ".[dev]"

docker pull danielzmbp/remag:latest

pip install remag[plotting]

REMAG uses PyTorch and will use GPU acceleration automatically when a supported backend is available. No extra REMAG flag is required.
If you want a CUDA-enabled PyTorch build, install REMAG first and then replace the CPU PyTorch package with the CUDA-enabled one that matches your system:
conda create -n remag -c bioconda -c conda-forge remag
conda activate remag
conda install -c pytorch -c nvidia pytorch pytorch-cuda=12.1

Adjust the CUDA version to match your driver and platform.
On Apple Silicon, PyTorch can use Metal (mps) automatically when available. In most cases no extra REMAG-specific setup is needed beyond installing a current PyTorch build.
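To check which backend PyTorch can see before starting a run, a minimal sketch using standard PyTorch calls is shown below. The helper name `detect_torch_backend` is hypothetical and not part of REMAG, and the order in which REMAG itself picks a device is an assumption here.

```python
def detect_torch_backend():
    """Return 'cuda', 'mps', 'cpu', or 'torch-not-installed'."""
    try:
        import torch
    except ImportError:
        return "torch-not-installed"
    # NVIDIA GPUs are reported through the CUDA backend.
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is only present in newer PyTorch releases,
    # so look it up defensively before calling it.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(detect_torch_backend())
```

If this prints `cpu` on a machine with a GPU, the installed PyTorch build most likely lacks the matching backend and should be replaced as described above.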
If you install REMAG with pip, install the PyTorch build you want first, then install REMAG:
conda create -n remag python=3.9
conda activate remag
conda install -c bioconda miniprot
# Install the desired PyTorch build first
pip install torch
# Then install REMAG
pip install remag

For NVIDIA systems, use the PyTorch install command from the official PyTorch selector so the wheel matches your CUDA runtime.
remag contigs.fasta -c alignments.bam

docker run --rm -v $(pwd):/data danielzmbp/remag:latest \
  /data/contigs.fasta -c /data/alignments.bam -o /data/output

singularity build remag.sif docker://danielzmbp/remag:latest
singularity run --bind $(pwd):/data remag.sif \
  /data/contigs.fasta -c /data/alignments.bam -o /data/output

After installation, you can use REMAG via the command line:
# Basic usage
remag contigs.fasta -c alignments.bam
# With explicit output directory
remag contigs.fasta -c alignments.bam -o output_directory
# Multiple samples
remag contigs.fasta -c sample1.bam -c sample2.bam
# Multiple samples using shell-expanded globs
remag contigs.fasta -c samples/*.bam
# Using precomputed coverage tables (one TSV per sample)
remag contigs.fasta -c sample1.tsv -c sample2.tsv
# Only run eukaryotic filtering (skip binning)
remag contigs.fasta --filter-only
# Use single-cell mode (adjusts k-NN defaults and skips eukaryotic filtering)
remag contigs.fasta -c alignments.bam -m single-cell
# Keep intermediate files
remag contigs.fasta -c alignments.bam -k

python -m remag contigs.fasta -c alignments.bam

Precomputed coverage TSVs are supported as an alternative to BAM/CRAM. Use one TSV per sample.
- Column 1: contig ID
- Last column: coverage value for that contig
- No header row
Example:
contig_1 12.4
contig_2 3.8
contig_3 0.0

TSV input provides contig-level coverage only. REMAG cannot infer fragment-specific coverage for augmented fragments from a TSV, so every fragment from the same contig gets the same coverage value. Use BAM/CRAM if you want fragment-level augmented coverage features. Do not mix TSV inputs with BAM/CRAM inputs in the same run.
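The TSV layout described above can be checked before a run with a short script. This is a sketch of the documented format, not REMAG's own parser; `read_coverage_tsv` and `fragment_coverage` are hypothetical helpers, and splitting on any whitespace (rather than strictly on tabs) is an assumption made for leniency.

```python
def read_coverage_tsv(path):
    """Map contig ID (first column) to coverage (last column).

    The file is expected to have no header row.
    """
    coverage = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()  # assumption: whitespace-delimited columns
            if not fields:
                continue  # skip blank lines
            if len(fields) < 2:
                raise ValueError(f"expected at least two columns: {line!r}")
            coverage[fields[0]] = float(fields[-1])
    return coverage


def fragment_coverage(coverage, contig_id, n_fragments):
    """With TSV input, every fragment inherits its contig's coverage value."""
    return [coverage[contig_id]] * n_fragments
```

A header row would fail the `float` conversion, which is a convenient way to catch tables that were exported with column names.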
- -c, --coverage: one or more BAM, CRAM, or TSV coverage inputs
- -o, --output: output directory; defaults to remag_output next to the input FASTA
- -k, --keep-intermediate: retain embeddings, features, model weights, and other intermediate files
- --filter-only: stop after eukaryotic filtering and write filtered FASTA output
- -m, --mode: select presets such as metagenomics, single-cell, or short-reads
- --save-filtered-contigs: also write the contigs removed by the eukaryotic filter
Use remag -h for a quick reference and remag --help for the full CLI, including training, clustering, filtering, and rescue options.
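When REMAG is driven from a Python workflow, the options above can be assembled into an argument list. The sketch below uses only the flags listed in this section; `build_remag_command` is a hypothetical helper, and guessing the input type from the file suffix is this helper's assumption, not a description of how REMAG detects formats.

```python
import shlex
from pathlib import Path

ALIGNMENT_SUFFIXES = {".bam", ".cram"}
TABLE_SUFFIXES = {".tsv"}


def build_remag_command(contigs, coverage, output=None, keep_intermediate=False):
    """Build an argv list for the remag CLI from documented options."""
    kinds = set()
    for path in coverage:
        suffix = Path(path).suffix.lower()
        if suffix in ALIGNMENT_SUFFIXES:
            kinds.add("alignment")
        elif suffix in TABLE_SUFFIXES:
            kinds.add("table")
        else:
            raise ValueError(f"unrecognized coverage input: {path}")
    # TSV and BAM/CRAM inputs must not be mixed in one run.
    if len(kinds) > 1:
        raise ValueError("do not mix TSV inputs with BAM/CRAM inputs")

    cmd = ["remag", str(contigs)]
    for path in coverage:
        cmd += ["-c", str(path)]  # one -c per sample
    if output is not None:
        cmd += ["-o", str(output)]
    if keep_intermediate:
        cmd.append("-k")
    return cmd


if __name__ == "__main__":
    cmd = build_remag_command("contigs.fasta", ["sample1.bam", "sample2.bam"], output="out")
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once REMAG is installed and on the `PATH`.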
REMAG recovers eukaryotic bins with a multi-stage pipeline:
- Eukaryotic filtering: By default, REMAG filters contigs with the integrated HyenaDNA classifier. This step can be disabled with --skip-bacterial-filter.
- Feature extraction: REMAG combines 4-mer composition with optional multi-sample coverage data. Large contigs are augmented into multiple fragments for training.
- Contrastive learning: A Siamese network trained with Barlow Twins learns embeddings that place fragments from the same contig close together.
- Core gene annotation: miniprot maps eukaryotic single-copy core genes to support clustering and quality checks.
- Greedy clustering and rescue: REMAG applies greedy Leiden clustering across multiple resolutions, then merges or rescues bins when single-copy gene checks support it.
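Two of the steps above, 4-mer composition and the Barlow Twins objective, can be illustrated in plain Python. This is a sketch of the general techniques, not REMAG's implementation: the function names, the handling of ambiguous bases, and the weight `lam` are assumptions, and REMAG's actual network, normalization, and hyperparameters are not shown here.

```python
from itertools import product

K = 4
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # 256 possible 4-mers
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_profile(seq):
    """Normalized 4-mer frequencies; windows with non-ACGT bases are skipped."""
    counts = [0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        j = INDEX.get(seq[i:i + K])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def barlow_twins_loss(z1, z2, lam=0.005):
    """Barlow Twins loss for two views of the same batch.

    z1 and z2 are lists of N embeddings (each a list of D floats), e.g. two
    fragments drawn from each of N contigs. lam is an illustrative weight.
    """
    n, d = len(z1), len(z1[0])

    def standardize(z):
        # Return D columns, each standardized across the batch.
        cols = []
        for j in range(d):
            col = [row[j] for row in z]
            mean = sum(col) / n
            std = (sum((x - mean) ** 2 for x in col) / n) ** 0.5 or 1.0
            cols.append([(x - mean) / std for x in col])
        return cols

    a, b = standardize(z1), standardize(z2)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c = sum(x * y for x, y in zip(a[i], b[j])) / n  # cross-correlation
            # Diagonal terms are pushed toward 1, off-diagonal terms toward 0.
            loss += (1 - c) ** 2 if i == j else lam * c ** 2
    return loss
```

The diagonal term rewards agreement between the two views of a contig, while the off-diagonal term discourages redundant embedding dimensions.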
- bins/: Directory containing FASTA files for each bin
- bins.csv: Final contig-to-bin assignments
- embeddings.csv: Contig embeddings from the neural network
- remag.log: Detailed log file
- *_eukaryotic_filtered.fasta: Filtered FASTA written when eukaryotic filtering is enabled

- siamese_model.pt: Trained Siamese neural network model
- kmer_embeddings.csv: K-mer encoder embeddings (before fusion)
- coverage_embeddings.csv: Coverage encoder embeddings (before fusion)
- params.json: Run parameters for reproducibility
- features.csv: Extracted k-mer and coverage features
- fragments.pkl: Fragment information used during training
- *_hyenadna_classification.tsv: HyenaDNA eukaryotic classification results (tab-separated)
- gene_contig_mappings.json: Cached gene-to-contig mappings
- core_gene_duplication_results.json: Core gene duplication analysis
- knn_graph_edges.csv: k-NN graph edge list used for Leiden clustering
- knn_graph_stats.json: k-NN graph construction statistics
- temp_miniprot/: Temporary directory for miniprot alignments

- *_non_eukaryotic.fasta: Contigs removed by the HyenaDNA filter when --save-filtered-contigs is used
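For a quick look at the recovered bins, the FASTA files in bins/ can be summarized with a few lines of Python. This sketch assumes the files are plain, uncompressed FASTA and treats every regular file in the directory as a bin; `summarize_bins` is a hypothetical helper, not a REMAG command.

```python
from pathlib import Path


def summarize_bins(bins_dir):
    """Return {file name: (number of contigs, total base pairs)} per bin FASTA."""
    summary = {}
    for path in sorted(Path(bins_dir).iterdir()):
        if not path.is_file():
            continue
        n_contigs = 0
        total_bp = 0
        with open(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    n_contigs += 1  # each header line starts a new contig
                else:
                    total_bp += len(line.strip())
        summary[path.name] = (n_contigs, total_bp)
    return summary
```

For example, `summarize_bins("output_directory/bins")` returns one entry per bin file, which is enough to spot very small or very fragmented bins before further quality checks.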
With plotting dependencies installed, you can generate UMAP plots from embeddings.csv and bins.csv:
pip install "remag[plotting]"
python scripts/plot_features.py --features output_directory/embeddings.csv --clusters output_directory/bins.csv --output output_directory

- umap_coordinates.csv: UMAP projections for visualization
- umap_plot.pdf: UMAP visualization plot with cluster assignments
- Python 3.9+
- miniprot is required for core gene analysis when installing outside conda packages or the project Docker image
- Plotting extras are optional:
pip install remag[plotting]
The package includes a pre-trained HyenaDNA classifier model for eukaryotic contig filtering.
The integrated HyenaDNA classifier uses a pre-trained genomic foundation model:
- Repository: HazyResearch/hyena-dna
- Paper: Nguyen E, Poli M, Faizi M, et al. HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. NeurIPS 2023.
MIT License - see LICENSE file for details.
If you use REMAG in your research, please cite:
@article {G{\'o}mez-P{\'e}rez2026.03.05.709928,
author = {G{\'o}mez-P{\'e}rez, Daniel and Raguideau, S{\'e}bastien and Warring, Sally and James, Robert and Hildebrand, Falk and Quince, Christopher},
title = {REMAG: recovery of eukaryotic genomes from metagenomic data using contrastive learning},
elocation-id = {2026.03.05.709928},
year = {2026},
doi = {10.64898/2026.03.05.709928},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2026/03/08/2026.03.05.709928},
eprint = {https://www.biorxiv.org/content/early/2026/03/08/2026.03.05.709928.full.pdf},
journal = {bioRxiv}
}