TagGen

High-Performance Barcode Generator and Demultiplexer for High-Throughput and Long-Read Sequencing Applications

TagGen generates diverse, error-tolerant DNA/RNA barcodes (tags) for multiplexed sequencing experiments and assigns sequencing reads back to their source barcodes through an integrated demultiplexer. It is purpose-built for long-read platforms such as Oxford Nanopore Technologies (ONT), where higher error rates (5--15%, dominated by insertions and deletions) demand longer, more robust barcodes than traditional short-read tools can produce.

Key highlights:

  • Generates barcodes at lengths of 8--30 bp in under 100 milliseconds, where existing tools fail at lengths above 12 bp due to memory exhaustion
  • Up to 13,600x faster than exhaustive enumeration (DNABarcodes) at 12 bp
  • Integrated anchor-free demultiplexer that works without knowledge of flanking adapter sequences
  • Validated against realistic nanopore error profiles across 154 parameter combinations
  • Both a cross-platform graphical interface (GTK3) and a comprehensive command-line interface


Overview

Sequence barcodes (tags) are short DNA or RNA sequences used to identify samples, cells, or spatial locations in multiplexed sequencing experiments. For short-read platforms (Illumina), barcodes of 8--12 bp with minimum Hamming distances of 3--4 provide adequate error tolerance. However, nanopore sequencing exhibits error rates of 5--15% dominated by insertions and deletions, demanding longer barcodes with greater inter-sequence distances.

Existing barcode generation tools enumerate all possible sequences exhaustively (O(4^n) in the barcode length n), which becomes computationally infeasible at lengths above 12 bp. TagGen replaces exhaustive enumeration with Monte Carlo candidate sampling followed by greedy diversity selection. Candidate generation is O(k), where k is the number of candidates generated (typically 10,000--100,000), and selecting n barcodes from that pool is O(n*k), so the cost no longer grows exponentially with barcode length. This allows generation of barcodes at 14--30 bp in milliseconds.
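To make the scaling argument concrete, the sketch below counts how many sequences an exhaustive enumerator would have to consider at several lengths and compares that with a fixed candidate pool. The pool size of 100,000 is the upper end of the range quoted above; the function name is illustrative and not part of TagGen.

```python
def sequence_space(length: int) -> int:
    """Number of distinct DNA sequences of the given length (4 bases per position)."""
    return 4 ** length

POOL_SIZE = 100_000  # upper end of the candidate range quoted in the text

for n in (8, 12, 14, 20, 30):
    total = sequence_space(n)
    # Ratio shows how much larger the full space is than a sampled pool.
    print(f"{n:>2} bp: {total:.3e} sequences, {total / POOL_SIZE:.1e}x the pool")
```

The sequence space grows sixteen-fold for every two extra bases, while the sampled pool stays the same size regardless of length.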

TagGen also includes an integrated demultiplexer (taggen-demux) that assigns ONT FASTQ reads to their source barcodes. Unlike primer-anchored tools (minibar, Dorado), taggen-demux is anchor-free: it locates barcodes at any read position using k-mer voting and banded edit-distance alignment, without requiring knowledge of flanking adapter sequences. This makes it suitable for direct RNA sequencing, spatial transcriptomics capture plates, and custom library protocols where the barcode is not flanked by a consistent adapter.


Features

Barcode Generation

  • Flexible barcode lengths: 8--30 bp core sequences with optional 5' prefix and 3' suffix
  • Distance metrics: Hamming distance (substitutions only) or Levenshtein edit distance (substitutions, insertions, deletions)
  • Quality constraints: Configurable GC content bounds (default 30--70%), maximum homopolymer run length (default 3), and custom exclude sequences
  • Exclude sequences: Load FASTA files of adapters, primers, or existing barcodes to ensure generated tags are dissimilar; supports sliding-window comparison for sequences longer than the tag length
  • Scalable output: Generate from 1 to 1,000,000 tags; select best N from a larger candidate pool
  • Multiple output formats: FASTA, TSV, and pairwise distance matrix (CSV)
  • Visualisation: Interactive pairwise distance heatmap (GUI) or ASCII heatmap (CLI)
  • Reproducibility: JSON configuration files for saving and sharing parameter sets
  • Parallel generation: Multi-threaded Monte Carlo sampling across all available CPU cores

Demultiplexing (taggen-demux)

  • Anchor-free matching: Locates barcodes at any read position without adapter sequence knowledge
  • Two search modes: End mode (standard dual-ended libraries) and Full mode (mid-read barcodes, direct RNA sequencing, spatial transcriptomics)
  • Position masks: Restrict the search to specific read regions (absolute bp, end-relative offsets, or read-length fractions) to reduce false positives
  • Adaptive acceptance thresholds: Automatic maximum edit distance based on tag length (overridable via --max-dist)
  • Ambiguity detection: Rejects reads where the margin between best and second-best match is less than 2
  • Confidence scoring: Configurable minimum confidence score for assignment
  • Tag trimming: Three modes -- none, ends (trim matched end only), all (trim from both ends)
  • Gzip support: Reads gzip-compressed FASTQ input; optionally compresses output
  • POD5 co-demultiplexing: Partitions raw signal files alongside FASTQ by sample via the pod5 subset tool
  • Quality filtering: Skip reads below a minimum mean Phred Q-score
  • Per-sample output: Assigned reads written to per-barcode subdirectories; unassigned reads collected separately with annotation of rejection reason
  • Statistics: Per-sample assignment counts, expected vs. observed read counts, mean Q-score; interactive charts in the GUI (match positions, edit distance distributions, Q-score profiles)

Interface

  • Graphical interface (GUI): GTK3-based, three-tab layout (Generate, Demultiplex, About) with real-time parameter validation, progress tracking, interactive heatmap, and post-demultiplexing statistics charts
  • Command-line interface (CLI): Full parameter set via getopt-style flags, suitable for scripted pipelines and HPC environments

Use Cases

| Scenario | Tag Length | Count | Min Distance | Metric | Demux Mode | Mask |
| --- | --- | --- | --- | --- | --- | --- |
| Standard ONT multiplexing (96-well plate) | 14 bp | 96 | d=4 | Hamming | End | -- |
| Direct RNA sequencing (high error tolerance) | 24 bp | 48 | d=10 | Levenshtein | End | -- |
| Older ONT chemistry / degraded samples | 30 bp | 96 | d=8 | Levenshtein | Full | 5p:60 |
| Spatial transcriptomics (zero misassignment) | 30 bp | 384 | d=8 | Levenshtein | Full | 0.05:0.20 |
| Environmental metagenomics | 20 bp | 96 | d=6 | Levenshtein | End | -- |

Installation

System Requirements

  • Operating system: Linux (Arch, Ubuntu 18.04+, Debian 10+), Windows 10+, or macOS 10.14+
  • Processor: Any x86-64 CPU; multi-core recommended for parallel candidate generation
  • Memory: 512 MB minimum; 2 GB recommended for large candidate pools
  • Disk: 50 MB for installation
  • Display: 1024x768 minimum for GUI (not required for CLI)

Pre-compiled Binaries

Pre-compiled binaries for Linux x86_64, Windows x86_64, and macOS arm64 (Apple Silicon) are available in bin/ and from the GitHub Releases page. SHA-256 hashes for every artefact are listed in bin/MACOS_BUILD_NOTES.md.

| Platform | Binary | Notes |
| --- | --- | --- |
| Linux x86_64 | bin/taggen-linux-x86_64 | release-static, glibc 2.34+, runs on most modern distros |
| Windows x86_64 | bin/taggen.exe + bundled GTK DLLs in the release ZIP/installer | Win10+ |
| macOS arm64 | bin/TagGen-1.2.5-macos-arm64.dmg, bin/TagGen.app, bin/taggen-macos-arm64/ | macOS 13+ (Ventura); Apple Silicon only (Rosetta 2 translates Intel binaries for Apple Silicon and does not run arm64 binaries on Intel Macs) |

Linux:

# Direct binary (statically-linked phobos/druntime; needs only system glibc + GTK3 for GUI)
chmod +x bin/taggen-linux-x86_64
./bin/taggen-linux-x86_64                 # GUI
./bin/taggen-linux-x86_64 --cli -n 96 -l 14 -d 4    # CLI generation
./bin/taggen-linux-x86_64 --demux ...     # CLI demux

# Install GTK3 runtime for the GUI:
#   sudo pacman -S gtk3                    # Arch
#   sudo apt install libgtk-3-0            # Ubuntu / Debian
#   sudo dnf install gtk3                  # Fedora

Windows: Double-click TagGen-1.2.5-windows-x86_64-setup.exe (Inno Setup installer — adds TagGen to Start Menu and optionally to the system PATH), or extract taggen-v1.2.5-windows-x86_64.zip and run taggen.exe from the extracted folder. No system-wide GTK install needed; the runtime DLLs are bundled.

macOS (arm64 / Apple Silicon):

  1. Double-click TagGen-1.2.5-macos-arm64.dmg to mount.
  2. Drag TagGen.app to Applications.
  3. First launch: right-click TagGen.app and choose Open. The Gatekeeper warning appears only once; this build is ad-hoc signed because we don't currently have an Apple Developer ID for full notarisation.
    • Alternatively, strip the quarantine attribute: xattr -dr com.apple.quarantine /Applications/TagGen.app
  4. For CLI use, taggen-macos-arm64/bin/taggen-launcher.sh is a relocatable wrapper.

POD5 co-demultiplexing requires the pod5 Python CLI on all platforms — pip install pod5.

Building from Source (Linux)

TagGen is written in D and uses the DUB package manager.

Prerequisites:

# Arch Linux
sudo pacman -S ldc dub gtk3

# Ubuntu / Debian
sudo apt install ldc dub libgtk-3-dev libgtkd-dev

# Fedora
sudo dnf install ldc dub gtk3-devel

Build:

git clone https://github.com/Arnaroo/taggen.git
cd taggen

# Recommended: statically-linked phobos/druntime (portable across distros)
dub build --compiler=ldc2 --config=linux --build=release-static

# Standard release build
dub build --compiler=ldc2 --config=linux --build=release

# Optimised build (AMD Zen+)
dub build --compiler=ldc2 --config=linux --build=release-zenplus

# Run unit tests
dub test --compiler=ldc2 --config=linux

The resulting binary taggen can be run directly or copied to a location on your PATH. v1.2.5 builds cleanly on LDC 1.41 and LDC 1.42+.

CLI-only build (no GTK dependency):

If you do not need the GUI and want to avoid the GTK3 dependency, you can build TagGen with --config=benchmark which excludes the GUI module, or compile a standalone demux binary:

dub build --compiler=ldc2 --config=benchmark --build=release

Building from Source (macOS)

Detailed walkthrough in BUILD_MACOS.md. Quick summary:

# Install prerequisites via Homebrew
brew install dub gtk+3 dylibbundler pkg-config

# Install LDC 1.41 (canonical; 1.42+ also works for TagGen)
curl -fsSLO https://github.com/ldc-developers/ldc/releases/download/v1.41.0/ldc2-1.41.0-osx-arm64.tar.xz
tar -xJf ldc2-1.41.0-osx-arm64.tar.xz
sudo mv ldc2-1.41.0-osx-arm64 /opt/ldc

# Clone + build
git clone https://github.com/Arnaroo/taggen.git
cd taggen
dub build --config=macos --compiler=/opt/ldc/bin/ldc2 -b release-static

# Bundle into a relocatable .app + .dmg (compiles installer/applauncher.c, runs dylibbundler,
# builds TagGen.app/Contents/MacOS/TagGen as a native Mach-O for Tahoe Gatekeeper compatibility,
# ad-hoc signs the bundle, and produces TagGen-<ver>-macos-arm64.dmg)
bash installer/package-macos.sh

The package-macos.sh script handles the macOS-specific gotchas (force-loading druntime/phobos archives for Apple's ld-prime, bundling 30+ GTK dylibs with dylibbundler, building a native Mach-O launcher so Tahoe Gatekeeper accepts the .app, etc.) — see BUILD_MACOS.md for the architectural notes.

Building from Source (Windows)

Windows builds use a hybrid toolchain: LDC2 (D compiler) + MSVC (linker) + MSYS2 (GTK3 runtime). See WINDOWS_BUILD.md for detailed instructions.

Quick summary:

  1. Install MSYS2 and run: pacman -S mingw-w64-x86_64-gtk3
  2. Install Visual Studio Build Tools (Desktop C++ workload)
  3. Install LDC2 (includes DUB)
  4. Open the x64 Native Tools Command Prompt and run:
set PATH=%PATH%;C:\D\ldc2\bin
cd C:\path\to\taggen
build-windows.bat                  :: Build + portable ZIP
build-windows.bat --installer      :: Build + ZIP + Inno Setup installer

Quick Start

Generate 96 barcodes (14 bp, Hamming distance 4)

taggen --cli -n 96 -l 14 -d 4 -o my_barcodes

This produces my_barcodes.fasta and my_barcodes.tsv containing 96 barcodes of 14 bp each with minimum pairwise Hamming distance of 4.

Generate barcodes with Levenshtein distance (recommended for ONT)

taggen --cli -n 96 -l 20 -d 8 --metric levenshtein --minGc 40 --maxGc 60 -o ont_tags -v

Demultiplex reads

taggen --demux --tags my_barcodes.fasta --reads reads.fastq --mode end --trim-mode ends --outdir demux_results/

Demultiplex reads + co-route POD5 signal

taggen --demux --tags my_barcodes.fasta \
    --reads reads.fastq \
    --pod5 raw_signal.pod5 \
    --mode end --trim-mode ends \
    --outdir demux_results/

--pod5 accepts one or more POD5 files (or pass the flag multiple times). Co-routing requires the pod5 Python CLI (pip install pod5); each per-sample folder will contain a matching <sample>.pod5 alongside <sample>.fastq.

Launch the GUI

./taggen

GUI Workflow

Generate Tab

  1. Configure barcode parameters:
    • Set tag length (8--30 bp recommended)
    • Set target barcode count (e.g., 96 for a standard plate)
    • Set minimum distance (3--8 depending on error tolerance)
    • Choose distance metric (Hamming for speed, Levenshtein for ONT accuracy)
  2. Set quality constraints:
    • GC content bounds (default: 30--70%)
    • Maximum homopolymer length (default: 3)
  3. Optionally load exclude sequences (FASTA format) to avoid similarity to adapters or primers
  4. Click "Generate Tags" to produce barcodes
  5. Review the generated tags in the preview table and the pairwise distance heatmap
  6. Export to FASTA or TSV format

Demultiplex Tab

  1. Load barcode FASTA file (generated above or user-supplied)
  2. Select FASTQ read file(s) for demultiplexing
  3. Choose search mode: end (standard libraries) or full (mid-read barcodes)
  4. Optionally configure:
    • Position mask (to restrict barcode search to a region of the read)
    • Maximum edit distance (auto-calculated by default)
    • Trim mode (none / ends / all)
  5. Click "Run Demultiplexing"
  6. Review results: per-sample assignment counts, barcode match positions, edit distance distributions, Q-score profiles
  7. Demultiplexed reads are written to per-sample subdirectories; unassigned reads to a separate directory

CLI Reference

Barcode Generation

USAGE:
    taggen --cli [OPTIONS]        Generate barcodes
    taggen                        Launch GUI (if available)

Tag Structure

| Flag | Long Form | Description | Default |
| --- | --- | --- | --- |
| -l | --length N | Core barcode length in bp | 12 |
| -p | --prefix SEQ | 5' prefix sequence prepended to each barcode | none |
| -x | --suffix SEQ | 3' suffix sequence appended to each barcode | none |

Diversity Parameters

| Flag | Long Form | Description | Default |
| --- | --- | --- | --- |
| -d | --difference N | Minimum pairwise distance between tags | 3 |
| -i | --identity N | Maximum sequence identity (0--100%) | 75 |
| -t | --tolerance N | Error tolerance buffer in bp | 2 |
| -r | --homopolymer N | Maximum homopolymer run length | 3 |
| | --metric METRIC | Distance metric: hamming or levenshtein | hamming |

Quality Constraints

| Flag | Long Form | Description | Default |
| --- | --- | --- | --- |
| | --minGc N | Minimum GC content (0--100%) | 30 |
| | --maxGc N | Maximum GC content (0--100%) | 70 |

Generation

| Flag | Long Form | Description | Default |
| --- | --- | --- | --- |
| -n | --count N | Number of tags to generate | 96 |
| -k | --select N | Select best N from a larger generated pool | all |

Exclude Sequences

| Flag | Long Form | Description | Default |
| --- | --- | --- | --- |
| -e | --exclude SEQ | Comma-separated sequences to avoid similarity with | none |
| -f | --excludeFile FILE | FASTA or text file with sequences to avoid | none |
| | --excludeDist N | Minimum distance from exclude sequences | 3 |
| | --excludeMode MODE | hard (reject), soft (rank), or combined | hard |

Exclude sequences can be longer than the tag length. TagGen uses a sliding-window comparison to find the minimum distance at any alignment position, ensuring that generated tags do not match anywhere within longer reference sequences (e.g., adapter or genome sequences).
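The sliding-window comparison described above can be sketched as follows. This is a minimal illustration, not TagGen's implementation: it assumes Hamming distance inside each window, treats the check as the hard (reject) mode, and uses hypothetical function names. The adapter is one of the exclude sequences shown in the JSON example elsewhere in this README.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def min_window_distance(tag: str, reference: str) -> int:
    """Smallest distance between the shorter sequence and any same-length
    window of the longer one (assumption: the shorter one is slid along)."""
    short, long_ = sorted((tag, reference), key=len)
    width = len(short)
    return min(hamming(short, long_[i:i + width])
               for i in range(len(long_) - width + 1))

def passes_hard_exclude(tag: str, references: list, exclude_dist: int = 3) -> bool:
    """Reject a tag that comes closer than exclude_dist to any reference window."""
    return all(min_window_distance(tag, ref) >= exclude_dist for ref in references)

adapter = "AGATCGGAAGAGCACACGTCT"
print(passes_hard_exclude("GGAAGAGC", [adapter]))  # tag occurs inside the adapter
print(passes_hard_exclude("TTTTTTTT", [adapter]))  # tag is far from every window
```

Taking the minimum over all windows is what makes a tag fail even when it matches only an internal stretch of a long adapter or genome sequence.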

Output

| Flag | Long Form | Description | Default |
| --- | --- | --- | --- |
| -o | --output NAME | Output filename prefix | tags |
| | --outDir DIR | Output directory | current |
| | --tsv | Export TSV format | yes |
| | --fasta | Export FASTA format | yes |
| -m | --matrix | Export pairwise distance matrix (CSV) | no |
| | --heatmap | Print ASCII heatmap to terminal | no |

Other

| Flag | Long Form | Description |
| --- | --- | --- |
| -c | --config FILE | Load parameters from JSON configuration file |
| -v | --verbose | Enable verbose output with statistics |
| -h | --help | Show help message |
| -V | --version | Show version information |

Demultiplexing

USAGE:
    taggen --demux --tags FILE --reads FILE [OPTIONS]

Required

| Flag | Long Form | Description |
| --- | --- | --- |
| -t | --tags FILE | Tag FASTA file (barcode sequences with sample IDs) |
| -r | --reads FILE... | Input FASTQ file(s); gzip-compressed files accepted |

Input Options

| Flag | Long Form | Description | Default |
| --- | --- | --- | --- |
| -p | --pod5 FILE... | POD5 signal file(s) to co-demultiplex by read ID | none |

Matching Parameters

| Flag | Long Form | Description | Default |
| --- | --- | --- | --- |
| -d | --max-dist N | Maximum edit distance for barcode acceptance | auto |
| | --metric METRIC | Distance metric: levenshtein or hamming | levenshtein |
| -s | --min-score F | Minimum confidence score (0--1) | 0.0 |
| -w | --search-window N | Bases to search at each read end | auto (tag_len + 15) |
| -m | --mode MODE | Search mode: end or full | end |
| -q | --min-qscore Q | Skip reads with mean Q-score below Q | 0 (off) |
| | --position-mask SPEC | Include zone for full-mode search (repeatable) | none |
| | --trim-mode MODE | Tag trimming: none, ends, or all | all |

Automatic max-dist: When --max-dist is not specified, TagGen automatically sets the maximum edit distance based on tag length:

  • Tags shorter than 20 bp: tag_length / 5
  • Tags 20--29 bp: tag_length / 4
  • Tags 30 bp or longer: tag_length / 3
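As a worked example of the three rules above, the sketch below computes the automatic threshold for a few tag lengths. The README does not state how fractional results are rounded, so integer (floor) division is an assumption here, and the function name is hypothetical.

```python
def auto_max_dist(tag_length: int) -> int:
    """Adaptive maximum edit distance by tag length.
    Assumption: fractional results are rounded down."""
    if tag_length < 20:
        return tag_length // 5
    if tag_length < 30:
        return tag_length // 4
    return tag_length // 3

for length in (14, 20, 24, 30):
    print(f"{length} bp -> max edit distance {auto_max_dist(length)}")
```

Under this reading a 30 bp tag accepts up to 10 edits by default, which is why the stricter examples in this README pass --max-dist 4 explicitly.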

Position mask formats (full mode only):

  • 5p:N -- Search the first N bp from the 5' end of the read
  • 3p:N -- Search the last N bp from the 3' end of the read
  • START:END -- Absolute base positions (e.g., 0:60)
  • F_START:F_END -- Read-length fractions (e.g., 0.05:0.20 for 5--20% of read length)

Multiple masks can be specified (comma-separated or repeated flags); their zones are combined as a union.
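The four mask formats and the union rule can be illustrated with a small parser. This is a sketch under stated assumptions, not TagGen's parser: zones are treated as half-open [start, end) intervals, a decimal point is taken to mean a read-length fraction, and out-of-range values are clamped to the read. All function names are hypothetical.

```python
def mask_zone(spec: str, read_len: int):
    """Translate one mask spec into a (start, end) zone on a read."""
    left, right = spec.split(":")
    if left == "5p":                      # first N bp from the 5' end
        start, end = 0, int(right)
    elif left == "3p":                    # last N bp from the 3' end
        start, end = read_len - int(right), read_len
    elif "." in left or "." in right:     # assumption: decimals mean fractions
        start, end = round(float(left) * read_len), round(float(right) * read_len)
    else:                                 # absolute base positions
        start, end = int(left), int(right)
    start = max(0, min(start, read_len))
    end = max(start, min(end, read_len))
    return start, end

def union_zones(specs, read_len: int):
    """Combine comma-separated or repeated mask specs into merged zones."""
    zones = sorted(mask_zone(part, read_len)
                   for spec in specs for part in spec.split(","))
    merged = []
    for start, end in zones:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching zones collapse into one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(union_zones(["5p:60,0:100", "3p:50"], read_len=1000))
```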

Output Options

| Flag | Long Form | Description | Default |
| --- | --- | --- | --- |
| -o | --outdir DIR | Output directory | demux_out |
| | --reads-per-file N | Split FASTQ output into N-read files | 0 (no split) |
| | --pod5-reads-per-file N | Split POD5 output into N-read files | 0 (no split) |
| | --pod5-parallel N | Concurrent pod5 subset processes during routing | 0 (auto: min(N_samples, max(1, min(nproc/2, 8)))) |
| -z | --compress | Gzip-compress output FASTQ files | off |
| | --unassigned | Write unassigned reads to separate directory | on |
| | --stats | Write per-sample statistics TSV | on |
| | --batch-size N | Reads per processing batch | 10000 |
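The automatic --pod5-parallel default is easier to read as code. The sketch below evaluates the formula from the table; it assumes nproc/2 is an integer division and that nproc is the logical CPU count, neither of which the table states. The function name is hypothetical.

```python
import os

def auto_pod5_parallel(n_samples: int, nproc=None) -> int:
    """Evaluate min(N_samples, max(1, min(nproc/2, 8))).
    Assumption: nproc/2 rounds down."""
    if nproc is None:
        nproc = os.cpu_count() or 1
    return min(n_samples, max(1, min(nproc // 2, 8)))

print(auto_pod5_parallel(14, nproc=32))  # capped at 8 processes
print(auto_pod5_parallel(3, nproc=12))   # never more processes than samples
print(auto_pod5_parallel(14, nproc=1))   # always at least one process
```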

Demux Output Structure

demux_out/
  Tag_001/
    Tag_001.fastq           # Reads assigned to Tag_001
    Tag_001.pod5            # (only if --pod5 was given) co-routed signal
  Tag_002/
    Tag_002.fastq
    Tag_002.pod5
  ...
  unassigned/
    unassigned.fastq        # Reads that failed matching criteria
  demux_stats.tsv           # Per-sample assignment summary

FASTQ headers of assigned reads are annotated with the matched barcode name, edit distance, and confidence score.

When --pod5-reads-per-file N is set, the POD5 outputs are split into chunks named Tag_001_001.pod5, Tag_001_002.pod5, ... in the same per-sample folder.


JSON Configuration Files

Save and reload generation parameters via JSON for reproducibility:

{
    "coreLength": 20,
    "numTags": 96,
    "minDifference": 8,
    "maxHomopolymer": 3,
    "minGC": 0.4,
    "maxGC": 0.6,
    "distanceMetric": "levenshtein",
    "outputPrefix": "experiment_tags",
    "exportFasta": true,
    "exportTsv": true,
    "exportMatrix": false,
    "excludeSequences": [
        "AGATCGGAAGAGCACACGTCT",
        "AGATCGGAAGAGCGTCGTGTA"
    ],
    "excludeMinDistance": 4
}

Usage:

taggen --cli -c experiment_config.json -v

Output Formats

FASTA (.fasta)

Standard FASTA format with sequentially numbered tag IDs:

>Tag_001
ACGTACGTACGTAC
>Tag_002
TGCATGCATGCATG
>Tag_003
GATCGATCGATCGA

TSV (.tsv)

Tab-separated file with header, suitable for spreadsheet analysis:

#       Tag Sequence    Core Only       Length  GC %
1       ACGTACGTACGTAC  ACGTACGTACGTAC  14      50.000000
2       TGCATGCATGCATG  TGCATGCATGCATG  14      50.000000

Distance Matrix (.csv)

Pairwise distance matrix in CSV format for downstream analysis or clustering:

,Tag_001,Tag_002,Tag_003
Tag_001,0,8,6
Tag_002,8,0,7
Tag_003,6,7,0
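One use for the matrix is to confirm the achieved minimum pairwise distance. The sketch below reads a matrix in the layout shown above and reports the smallest off-diagonal value; it assumes integer distances and the same tag order in rows and columns. The function name is hypothetical.

```python
import csv
import io

def min_off_diagonal(csv_text: str):
    """Return the smallest pairwise distance in a square distance-matrix CSV,
    ignoring the zero diagonal. Returns None for fewer than two tags."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    best = None
    for i, row in enumerate(rows[1:]):          # skip the header row
        for j, cell in enumerate(row[1:]):      # skip the row label
            if i != j:
                value = int(cell)
                best = value if best is None else min(best, value)
    return best

example = """,Tag_001,Tag_002,Tag_003
Tag_001,0,8,6
Tag_002,8,0,7
Tag_003,6,7,0
"""
print(min_off_diagonal(example))
```

For a real run, the same function can be given the contents of the exported .csv file.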

Demultiplexing Statistics (demux_stats.tsv)

Per-sample summary with assigned read counts, expected counts, and mean Q-scores.


Examples

Example 1: Standard ONT Sample Multiplexing

Scenario: Multiplex 96 samples on a standard ONT flow cell.

# Generate 96 barcodes of 14 bp, min Hamming distance 4
taggen --cli -n 96 -l 14 -d 4 --minGc 40 --maxGc 60 -r 3 -o ont_barcodes

# Demultiplex reads (standard dual-ended library)
taggen --demux --tags ont_barcodes.fasta --reads reads.fastq \
    --mode end --trim-mode ends

Expected performance: >97% correct barcode assignment at typical ONT error rates (10--15%).

Example 2: Direct RNA Sequencing with High Error Tolerance

Scenario: Direct RNA sequencing of degraded clinical samples. 48-sample experiment with a custom ligation adapter.

# Generate 48 barcodes with stringent diversity, excluding adapter sequences
taggen --cli -n 48 -l 24 -d 10 --metric levenshtein \
    --minGc 45 --maxGc 55 -r 2 \
    -f custom_adapter.fasta -o uc1_tags -v

# Demultiplex (end mode for standard 5'/3' tagging)
taggen --demux --tags uc1_tags.fasta --reads reads.fastq \
    --mode end --trim-mode all

Expected performance: >98% correct assignment even at 20% error rate.

Example 3: Older ONT Chemistry with Position Masking

Scenario: Older R9.4.1 flow cells producing reads at 85--90% identity. Barcode is at the 5' end but may be shifted a few bases due to adapter degradation.

# Generate 96 barcodes of 30 bp, Levenshtein distance 8
taggen --cli -n 96 -l 30 -d 8 --metric levenshtein -o uc2_tags -v

# Demultiplex with full-mode search + 5' position mask (first 60 bp)
taggen --demux --tags uc2_tags.fasta --reads reads.fastq \
    --mode full --position-mask 5p:60 --trim-mode all

Expected performance: 97% accuracy at 95% identity, 84% at 85% identity, with near-zero misassignment.

Example 4: Spatial Transcriptomics with Zero Misassignment

Scenario: Custom spatial transcriptomics array with 384 capture spots. Barcode is embedded in a synthetic RNA spike-in at 5--20% of read length. Zero cross-sample contamination required.

# Generate 384 barcodes with pairwise distance heatmap
taggen --cli -n 384 -l 30 -d 8 --metric levenshtein --heatmap -o uc3_tags -v

# Demultiplex with fractional position mask and strict threshold
taggen --demux --tags uc3_tags.fasta --reads reads.fastq \
    --mode full --position-mask 0.05:0.20 \
    --max-dist 4 --trim-mode all

Expected performance: Misassignment reduced to <0.001% at 90% read identity; approximately 10--15% of reads left unassigned (acceptable when sample integrity is paramount).


Algorithm

TagGen implements a two-phase algorithm:

Phase 1: Monte Carlo Candidate Generation

Rather than enumerating all 4^n possible sequences (which causes memory exhaustion at n >= 14), TagGen generates random candidate sequences in parallel across all CPU cores. Each candidate is validated in real-time against user-specified constraints:

  • GC content within bounds
  • No homopolymer runs exceeding the limit
  • Not matching any exclude sequence within the minimum distance

Valid candidates are collected into a thread-safe pool (typically ~100,000 candidates in ~55 ms). The Mersenne Twister PRNG is used for sequence generation.
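Phase 1 can be sketched in a few lines. The version below is single-threaded and omits the exclude-sequence check, so it illustrates the sampling and filtering idea only; it is not TagGen's D implementation. The defaults mirror the documented GC bounds and homopolymer limit, the de-duplication of candidates is an assumption, and all names are hypothetical. Python's random module happens to use a Mersenne Twister as well.

```python
import random

def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

def is_valid(seq: str, min_gc=0.30, max_gc=0.70, max_homopolymer=3) -> bool:
    """Apply the GC and homopolymer constraints to one candidate."""
    return (min_gc <= gc_fraction(seq) <= max_gc
            and longest_homopolymer(seq) <= max_homopolymer)

def sample_candidates(length: int, pool_size: int, seed: int = 0,
                      max_attempts: int = 1_000_000) -> list:
    """Draw random sequences and keep the distinct ones that pass the filters."""
    rng = random.Random(seed)
    pool = set()  # assumption: duplicate candidates are discarded
    for _ in range(max_attempts):
        if len(pool) >= pool_size:
            break
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        if is_valid(seq):
            pool.add(seq)
    return sorted(pool)

pool = sample_candidates(length=14, pool_size=200)
print(len(pool), pool[:3])
```

Because each candidate is drawn and checked independently, the work splits naturally across threads, which is what the parallel generation described above relies on.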

Phase 2: Greedy Diversity Selection

From the candidate pool, TagGen iteratively selects the sequence that maximises the minimum Hamming or Levenshtein distance to all previously selected barcodes. This greedy approach produces a locally optimal but not globally optimal set. For practical applications (96--384 barcodes), the approximation is highly effective, typically achieving minimum inter-tag distances of 12--15 bp for d=8 nominal targets.

Selection completes in ~25 ms for 96 barcodes from 100,000 candidates.
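The greedy step can be sketched as a farthest-point selection with a cached minimum distance per candidate, which is what keeps the cost at O(n*k). The sketch below is illustrative only: seeding with the first candidate and stopping once no candidate reaches the requested distance are assumptions, Hamming distance stands in for either metric, and min_distance is assumed to be at least 1. Names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def greedy_select(candidates, n_tags, min_distance, distance=hamming):
    """Pick up to n_tags candidates, each time taking the one whose minimum
    distance to the already selected set is largest."""
    if not candidates or n_tags <= 0:
        return []
    selected = [candidates[0]]  # assumption: the first candidate seeds the set
    # cached[i] = distance from candidate i to its nearest selected tag
    cached = [distance(c, selected[0]) for c in candidates]
    while len(selected) < n_tags:
        best = max(range(len(candidates)), key=cached.__getitem__)
        if cached[best] < min_distance:
            break  # nothing left that satisfies the distance target
        chosen = candidates[best]
        selected.append(chosen)
        # One pass over the pool per selected tag: O(k) work, O(n*k) overall.
        for i, cand in enumerate(candidates):
            d = distance(cand, chosen)
            if d < cached[i]:
                cached[i] = d
    return selected

pool = ["AAAA", "AAAT", "TTTT", "CCCC", "GGGG", "AATT"]
print(greedy_select(pool, n_tags=4, min_distance=4))
```

Without the cache, every round would recompute distances to all selected tags; with it, each round only compares the pool against the newest tag.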

Demultiplexer Pipeline

The demultiplexer uses a two-stage pipeline:

  1. K-mer voting (Stage 1): All 8-mers in the search region are looked up against a pre-built index mapping each 8-mer to the set of barcodes containing it. Candidate barcodes are ranked by hit count, and top candidates' hit positions are averaged to estimate barcode location.

  2. Banded edit-distance alignment (Stage 2): A sliding window of (tag_length + 15) bp centred on the estimated location is compared against each candidate barcode via banded Levenshtein or Hamming distance. The best match is accepted if:

    • The edit distance does not exceed the maximum threshold
    • The margin between best and second-best candidate is at least 2 (ambiguity guard)
    • The confidence score exceeds the minimum threshold

Reads failing any criterion are written to the unassigned output, annotated with the rejection reason (no_match, ambiguous, or low_confidence).
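The two stages can be illustrated with a compact sketch. It builds the 8-mer index, ranks barcodes by hit count with a mean hit position, and applies the distance and ambiguity checks to already computed edit distances. The alignment step itself and the confidence score are left out because their details are not given here, so the low_confidence outcome is not modelled. All names are hypothetical.

```python
from collections import Counter, defaultdict

K = 8  # k-mer size named in Stage 1

def build_index(barcodes: dict) -> dict:
    """Map each k-mer to the set of barcode names that contain it."""
    index = defaultdict(set)
    for name, seq in barcodes.items():
        for i in range(len(seq) - K + 1):
            index[seq[i:i + K]].add(name)
    return index

def vote(region: str, index: dict) -> list:
    """Return (barcode, hit count, mean hit position), best-supported first."""
    hits, positions = Counter(), defaultdict(list)
    for i in range(len(region) - K + 1):
        for name in index.get(region[i:i + K], ()):
            hits[name] += 1
            positions[name].append(i)
    return [(name, n, sum(positions[name]) / n) for name, n in hits.most_common()]

def decide(distances: dict, max_dist: int, min_margin: int = 2):
    """Apply the threshold and ambiguity guard to {barcode: edit distance}."""
    ranked = sorted(distances.items(), key=lambda item: item[1])
    if not ranked or ranked[0][1] > max_dist:
        return None, "no_match"
    if len(ranked) > 1 and ranked[1][1] - ranked[0][1] < min_margin:
        return None, "ambiguous"
    return ranked[0][0], "assigned"

barcodes = {"Tag_001": "ACGTACGTTGCAAC", "Tag_002": "TTGGCCAATTGGCC"}
region = "GGGG" + barcodes["Tag_001"] + "TTTT"
print(vote(region, build_index(barcodes)))
print(decide({"Tag_001": 1, "Tag_002": 9}, max_dist=2))
```

The voting stage narrows the search to a few barcodes and an approximate location, so the more expensive alignment only runs on a small window per candidate.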


Performance

Barcode Generation Speed

| Configuration | TagGen | DNABarcodes | Speedup |
| --- | --- | --- | --- |
| 8 bp, d>=3 | 24 ms | 120 ms | 5x |
| 10 bp, d>=3 | 31 ms | 24 s | 770x |
| 12 bp, d>=4 | 34 ms | 463 s | 13,600x |
| 14 bp, d>=4 | 35 ms | Memory exhaustion | -- |
| 16 bp, d>=4 | 35 ms | Memory exhaustion | -- |
| 20 bp, d>=6 | 44 ms | Memory exhaustion | -- |
| 30 bp, d>=8 | 29 ms | Memory exhaustion | -- |

All benchmarks: 96 target barcodes, Hamming distance, GC 25--75%. Wall-clock times averaged over 3 replicates. Linux, 12-core AMD processor.

Barcode Resolution Under Nanopore Error

| Barcode Length | Min. Dist. | 5% Error | 10% Error | 15% Error | 20% Error | 25% Error |
| --- | --- | --- | --- | --- | --- | --- |
| 10 bp | d>=3 | 99.8% | 96.2% | 91.3% | 83.3% | 74.7% |
| 12 bp | d>=4 | 99.7% | 99.0% | 96.1% | 88.6% | 82.0% |
| 14 bp | d>=4 | 100% | 99.5% | 97.9% | 94.1% | 86.1% |
| 16 bp | d>=5 | 99.9% | 100% | 99.1% | 95.8% | 92.2% |
| 20 bp | d>=6 | 100% | 99.9% | 99.6% | 99.6% | 96.4% |
| 30 bp | d>=8 | 100% | 100% | 100% | 100% | 99.6% |

Values indicate the percentage of reads correctly assigned to the original barcode. Error model: 50% deletions, 25% insertions, 25% substitutions. N = 1,000 simulated reads per barcode per condition.
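An error model with this mix of event types could be simulated along the following lines. This is a sketch, not the benchmark code: it assumes each base independently suffers at most one error with the given probability, that insertions add a random base after the original one, and that substitutions always change the base. The function name is hypothetical.

```python
import random

def mutate(seq: str, error_rate: float, rng: random.Random) -> str:
    """Apply per-base errors: 50% deletions, 25% insertions, 25% substitutions."""
    out = []
    for base in seq:
        if rng.random() >= error_rate:
            out.append(base)            # base read correctly
            continue
        kind = rng.random()
        if kind < 0.50:                 # deletion: drop the base
            continue
        if kind < 0.75:                 # insertion: keep base, add a random one
            out.append(base)
            out.append(rng.choice("ACGT"))
        else:                           # substitution: swap for a different base
            out.append(rng.choice([b for b in "ACGT" if b != base]))
    return "".join(out)

rng = random.Random(42)
barcode = "ACGTTGCAAGCTAGGTCATC"
for rate in (0.05, 0.15, 0.25):
    print(rate, mutate(barcode, rate, rng))
```

Because half of the events are deletions, mutated reads tend to be shorter than the barcode and shifted relative to it, which is the situation edit-distance matching is meant to handle.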


Recommended Parameters

By Application

| Application | Length | Count | Distance | Metric | Notes |
| --- | --- | --- | --- | --- | --- |
| Standard ONT (R10.4/Kit14) | 14 bp | 96 | d=4 | Hamming | >97% accuracy at 95% identity |
| High-throughput ONT | 20 bp | 96--384 | d=6--8 | Levenshtein | Recommended for most new experiments |
| Direct RNA sequencing | 24--30 bp | 48--96 | d=8--10 | Levenshtein | High error tolerance |
| Spatial transcriptomics | 30 bp | 384 | d=8 | Levenshtein | Use fractional position mask + --max-dist 4 |
| Clinical (zero misassignment) | 30 bp | 48--96 | d=8 | Levenshtein | Use --max-dist 4 for <0.001% misassignment |

General Guidance

  • Levenshtein distance is recommended over Hamming for all ONT workflows, as it accounts for the indel-dominated error profile and achieves 3--8 percentage points higher accuracy at 85--90% read identity.
  • Longer barcodes (20--30 bp) provide substantially better error resilience. At 20% error, 30 bp barcodes maintain 100% correct assignment while 10 bp barcodes degrade to 83%.
  • Position masks are important in full-mode search to avoid spurious off-target matches. Without a mask, accuracy drops by 6--8 percentage points.
  • --max-dist controls the trade-off between assignment sensitivity and sample purity. The default adaptive threshold keeps misassignment below 0.5%. For clinical applications, use --max-dist 4 with 30 bp tags to reduce misassignment to <0.001% at the cost of ~10--15% more unassigned reads.
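The first point is easy to see on a single example. In the sketch below a read has lost the first base of the barcode, so every later base is shifted by one position. The Levenshtein function is the plain dynamic-programming version rather than the banded variant the demultiplexer is described as using, and the sequences are made up for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]

barcode = "ACGTTGCAAGCT"
observed = "CGTTGCAAGCTA"  # first base deleted; one downstream base fills the window

print("Hamming:    ", hamming(barcode, observed))
print("Levenshtein:", levenshtein(barcode, observed))
```

A single deletion makes 10 of 12 positions disagree under Hamming distance, while Levenshtein distance scores the same read as 2 edits.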

Changelog

Full changelog in CHANGELOG.md. Recent highlights:

v1.2.5 "Frogmouth" (2026-05-13)

  • POD5 routing parallelised across K concurrent pod5 subset processes (--pod5-parallel N, GUI SpinButton). Bin-packed buckets share the OS-page-cached input. Measured 7.1× routing speedup vs 1.2.4 on a 14-sample / 92 GB nanopore run (kaya HPC, K=8).

v1.2.4 "Emu" (2026-05-13)

  • POD5 routing now issues one combined pod5 subset call instead of one-per-sample (~1.3× routing speedup from per-call setup amortisation).
  • macOS .app bundle: native Mach-O launcher replaces shell-script Contents/MacOS/TagGen to satisfy Tahoe Gatekeeper (fixes _LSOpenURLsWithCompletionHandler() error -10669).
  • TagGen confirmed LDC 1.42+ compatible on Linux and macOS.

v1.2.3 "Dingo" (2026-05-13)

  • FASTQ parser fix: read IDs now split on any whitespace (TAB or space). Dorado emits SAM-style auxiliary tags TAB-separated; v1.2.1/1.2.2 captured the entire TSV header line as the read id, breaking POD5 routing (pod5 subset reported Found 0 read_ids). Closed the cluster-side regression that affected all dorado-basecalled inputs.

v1.2.2 "Cassowary" (2026-05-13)

  • POD5 read-id normalisation (lowercase + strip read_ prefix) as defensive belt-and-braces for non-dorado pipelines that emit uppercase or prefixed UUIDs.
  • Failed-routing CSV preservation (TAGGEN_DEBUG_POD5=1) for post-mortem inspection.

v1.2.1 "Bilby" (2026-05-07)

  • Fixed POD5 co-demultiplexing (--pod5): switched from the unsupported --read-id-file flag to pod5 subset --csv direct mapping with positional inputs
  • All input POD5 files now passed in a single pod5 subset call per sample, with --missing-ok so partial coverage no longer aborts routing

v1.2.0 "Wombat" (2026-03-08)

  • Position masks refactored from exclude to include semantics
  • GUI: 3-mode anchor selector per zone (From 3' end / From 5' end / Custom range)
  • CLI: --position-mask accepts 3p:N and 5p:N shorthand
  • Comprehensive demux test framework with parameter sweeps

v1.1.3 "Echidna" (2026-03-08)

  • Positional mask for full-mode demux: constrain tag search to specific read regions
  • Post-deconvolution statistics window with four visualisation tabs

v1.1.2 "Platypus" (2026-03-05)

  • Integrated demultiplexer (GUI tab + CLI --demux mode)
  • POD5 signal file co-demultiplexing
  • Tag trimming modes: none, ends, all
  • Confidence scoring with ambiguity detection

v1.1.1 "Billabong" (2026-03-05)

  • Levenshtein edit distance for tag selection (--metric levenshtein)
  • Greedy selection performance fix: O(n*k) cached min-distances

v1.1.0 "Outback Explorer" (2026-01-16)

  • Interactive heatmap visualisation of pairwise tag distances
  • Exclude sequences feature with sliding-window comparison

v1.0.1 "Unlimited Outback" (2025-10-15)

  • Increased maximum tag limit to 1,000,000

v1.0.0 "Kangaroo Launch" (2025-10-15)

  • Initial release: GUI, parallel generation, greedy selection, TSV/FASTA export

License

TagGen is released under the MIT License.

MIT License
Copyright (c) 2026 Biocodecs, Arnaroo Ribologicals, RMODEL

Authors

  • Faiza Chowdhury
  • Tessa Swain
  • Roderik Shirokikh
  • Danielle L. Rudler
  • Archa H. Fox
  • Alice Cleynen
  • Nikolay E. Shirokikh

School of Human Sciences / School of Molecular Sciences, The University of Western Australia, Perth, WA, Australia

France-Australia Mathematical Sciences and Interactions, CNRS International Research Laboratory, Canberra, ACT, Australia

Contact: nikolay.shirokikh@uwa.edu.au, alice.cleynen@cnrs.fr

Repository: https://github.com/Arnaroo/taggen
