TagGen

High-Performance Barcode Generator and Demultiplexer for High-Throughput and Long-Read Sequencing Applications

TagGen generates diverse, error-tolerant DNA/RNA barcodes (tags) for multiplexed sequencing experiments and assigns sequencing reads back to their source barcodes through an integrated demultiplexer. It is purpose-built for long-read platforms such as Oxford Nanopore Technologies (ONT), where higher error rates (5--15%, dominated by insertions and deletions) demand longer, more robust barcodes than traditional short-read tools can produce.

Key highlights:

Generates barcodes at lengths of 8--30 bp in under 100 milliseconds, where existing tools fail at lengths above 12 bp due to memory exhaustion
Up to 13,600x faster than exhaustive enumeration (DNABarcodes) at 12 bp
Integrated anchor-free demultiplexer that works without knowledge of flanking adapter sequences
Validated against realistic nanopore error profiles across 154 parameter combinations
Both a cross-platform graphical interface (GTK3) and a comprehensive command-line interface

Overview

Sequence barcodes (tags) are short DNA or RNA sequences used to identify samples, cells, or spatial locations in multiplexed sequencing experiments. For short-read platforms (Illumina), barcodes of 8--12 bp with minimum Hamming distances of 3--4 provide adequate error tolerance. However, nanopore sequencing exhibits error rates of 5--15% dominated by insertions and deletions, demanding longer barcodes with greater inter-sequence distances.

Existing barcode generation tools enumerate all possible sequences exhaustively (O(4^n) complexity), making them computationally infeasible at lengths above 12 bp. TagGen replaces exhaustive enumeration with Monte Carlo candidate sampling and greedy diversity selection, so the cost scales with the number of candidates k (typically 10,000--100,000) rather than with 4^n: candidate generation is O(k), and selecting n tags from the pool is O(n*k) with cached minimum distances. This allows generation of barcodes at 14--30 bp in milliseconds.
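To make the scaling argument concrete, the following Python sketch counts how many sequences an exhaustive enumerator would have to consider at several lengths and compares that with a fixed candidate pool. The 2-bits-per-base packing is an illustrative assumption about the most compact possible storage, not a description of how any particular tool stores sequences.

```python
def search_space(length: int) -> int:
    """Number of distinct DNA sequences of a given length (4 bases per position)."""
    return 4 ** length


def packed_bytes(length: int) -> int:
    """Bytes needed to hold every sequence at 2 bits per base (illustrative lower bound)."""
    bits_per_sequence = 2 * length
    return search_space(length) * bits_per_sequence // 8


if __name__ == "__main__":
    candidate_pool = 100_000  # upper end of the typical pool size quoted above
    for n in (8, 12, 14, 20, 30):
        fraction = candidate_pool / search_space(n)
        print(f"{n:>2} bp: {search_space(n):.3e} sequences, "
              f"{packed_bytes(n) / 2**30:.3g} GiB packed, "
              f"pool samples {fraction:.2e} of the space")
```

The enumeration cost grows by a factor of 16 for every two added bases, while the sampled pool stays the same size regardless of length.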

TagGen also includes an integrated demultiplexer (taggen-demux) that assigns ONT FASTQ reads to their source barcodes. Unlike primer-anchored tools (minibar, Dorado), taggen-demux is anchor-free: it locates barcodes at any read position using k-mer voting and banded edit-distance alignment, without requiring knowledge of flanking adapter sequences. This makes it suitable for direct RNA sequencing, spatial transcriptomics capture plates, and custom library protocols where the barcode is not flanked by a consistent adapter.

Features

Barcode Generation

Flexible barcode lengths: 8--30 bp core sequences with optional 5' prefix and 3' suffix
Distance metrics: Hamming distance (substitutions only) or Levenshtein edit distance (substitutions, insertions, deletions)
Quality constraints: Configurable GC content bounds (default 30--70%), maximum homopolymer run length (default 3), and custom exclude sequences
Exclude sequences: Load FASTA files of adapters, primers, or existing barcodes to ensure generated tags are dissimilar; supports sliding-window comparison for sequences longer than the tag length
Scalable output: Generate from 1 to 1,000,000 tags; select best N from a larger candidate pool
Multiple output formats: FASTA, TSV, and pairwise distance matrix (CSV)
Visualisation: Interactive pairwise distance heatmap (GUI) or ASCII heatmap (CLI)
Reproducibility: JSON configuration files for saving and sharing parameter sets
Parallel generation: Multi-threaded Monte Carlo sampling across all available CPU cores
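The difference between the two distance metrics listed above matters most when a base is lost or gained. The sketch below is a plain textbook implementation of both metrics in Python, written for illustration only; it is not TagGen's own code (TagGen is written in D) and it is not banded or otherwise optimised.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions; only defined for equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    tag = "ACGTACGT"
    shifted = "CGTACGTA"   # first base lost, one base gained at the end
    print(hamming(tag, shifted), levenshtein(tag, shifted))
```

In this example a single-base shift makes every position mismatch under Hamming distance, while the edit distance is only 2. This is why an indel-dominated error profile is better served by Levenshtein distance.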

Demultiplexing (taggen-demux)

Anchor-free matching: Locates barcodes at any read position without adapter sequence knowledge
Two search modes: End mode (standard dual-ended libraries) and Full mode (mid-read barcodes, direct RNA sequencing, spatial transcriptomics)
Position masks: Restrict the search to specific read regions (absolute bp, end-relative offsets, or read-length fractions) to reduce false positives
Adaptive acceptance thresholds: Automatic maximum edit distance based on tag length (overridable via --max-dist)
Ambiguity detection: Rejects reads where the margin between best and second-best match is less than 2
Confidence scoring: Configurable minimum confidence score for assignment
Tag trimming: Three modes -- none, ends (trim matched end only), all (trim from both ends)
Gzip support: Reads gzip-compressed FASTQ input; optionally compresses output
POD5 co-demultiplexing: Partitions raw signal files alongside FASTQ by sample via the pod5 subset tool
Quality filtering: Skip reads below a minimum mean Phred Q-score
Per-sample output: Assigned reads written to per-barcode subdirectories; unassigned reads collected separately with annotation of rejection reason
Statistics: Per-sample assignment counts, expected vs. observed read counts, mean Q-score; interactive charts in the GUI (match positions, edit distance distributions, Q-score profiles)

Interface

Graphical interface (GUI): GTK3-based, three-tab layout (Generate, Demultiplex, About) with real-time parameter validation, progress tracking, interactive heatmap, and post-demultiplexing statistics charts
Command-line interface (CLI): Full parameter set via getopt-style flags, suitable for scripted pipelines and HPC environments

Use Cases

Scenario	Tag Length	Count	Min Distance	Metric	Demux Mode	Mask
Standard ONT multiplexing (96-well plate)	14 bp	96	d=4	Hamming	End	--
Direct RNA sequencing (high error tolerance)	24 bp	48	d=10	Levenshtein	End	--
Older ONT chemistry / degraded samples	30 bp	96	d=8	Levenshtein	Full	5p:60
Spatial transcriptomics (zero misassignment)	30 bp	384	d=8	Levenshtein	Full	0.05:0.20
Environmental metagenomics	20 bp	96	d=6	Levenshtein	End	--

Installation

System Requirements

Operating system: Linux (Arch, Ubuntu 18.04+, Debian 10+), Windows 10+, or macOS 10.14+
Processor: Any x86-64 CPU; multi-core recommended for parallel candidate generation
Memory: 512 MB minimum; 2 GB recommended for large candidate pools
Disk: 50 MB for installation
Display: 1024x768 minimum for GUI (not required for CLI)

Pre-compiled Binaries

Pre-compiled binaries for Linux x86_64, Windows x86_64, and macOS arm64 (Apple Silicon) are available in bin/ and from the GitHub Releases page. SHA-256 hashes for every artefact are listed in bin/MACOS_BUILD_NOTES.md.

Platform	Binary	Notes
Linux x86_64	`bin/taggen-linux-x86_64`	`release-static`, glibc 2.34+, runs on most modern distros
Windows x86_64	`bin/taggen.exe` + bundled GTK DLLs in the release ZIP/installer	Win10+
macOS arm64	`bin/TagGen-1.2.5-macos-arm64.dmg`, `bin/TagGen.app`, `bin/taggen-macos-arm64/`	macOS 13+ (Ventura); Intel Macs run via Rosetta 2

Linux:

# Direct binary (statically-linked phobos/druntime; needs only system glibc + GTK3 for GUI)
chmod +x bin/taggen-linux-x86_64
./bin/taggen-linux-x86_64                 # GUI
./bin/taggen-linux-x86_64 --cli -n 96 -l 14 -d 4    # CLI generation
./bin/taggen-linux-x86_64 --demux ...     # CLI demux

# Install GTK3 runtime for the GUI:
#   sudo pacman -S gtk3                    # Arch
#   sudo apt install libgtk-3-0            # Ubuntu / Debian
#   sudo dnf install gtk3                  # Fedora

Windows: Double-click TagGen-1.2.5-windows-x86_64-setup.exe (Inno Setup installer — adds TagGen to Start Menu and optionally to the system PATH), or extract taggen-v1.2.5-windows-x86_64.zip and run taggen.exe from the extracted folder. No system-wide GTK install needed; the runtime DLLs are bundled.

macOS (arm64 / Apple Silicon):

Double-click TagGen-1.2.5-macos-arm64.dmg to mount.
Drag TagGen.app to Applications.
First launch: right-click TagGen.app → Open. The Gatekeeper warning appears only once. This build is ad-hoc signed, as we do not currently have an Apple Developer ID for full notarisation.
- Alternatively, strip the quarantine attribute: xattr -dr com.apple.quarantine /Applications/TagGen.app
For CLI use, taggen-macos-arm64/bin/taggen-launcher.sh is a relocatable wrapper.

POD5 co-demultiplexing requires the pod5 Python CLI on all platforms (install it with pip install pod5).

Building from Source (Linux)

TagGen is written in D and uses the DUB package manager.

Prerequisites:

# Arch Linux
sudo pacman -S ldc dub gtk3

# Ubuntu / Debian
sudo apt install ldc dub libgtk-3-dev libgtkd-dev

# Fedora
sudo dnf install ldc dub gtk3-devel

Build:

git clone https://github.com/Arnaroo/taggen.git
cd taggen

# Recommended: statically-linked phobos/druntime (portable across distros)
dub build --compiler=ldc2 --config=linux --build=release-static

# Standard release build
dub build --compiler=ldc2 --config=linux --build=release

# Optimised build (AMD Zen+)
dub build --compiler=ldc2 --config=linux --build=release-zenplus

# Run unit tests
dub test --compiler=ldc2 --config=linux

The resulting binary taggen can be run directly or copied to a location on your PATH. v1.2.5 builds cleanly on LDC 1.41 and LDC 1.42+.

CLI-only build (no GTK dependency):

If you do not need the GUI and want to avoid the GTK3 dependency, you can build TagGen with --config=benchmark which excludes the GUI module, or compile a standalone demux binary:

dub build --compiler=ldc2 --config=benchmark --build=release

Building from Source (macOS)

Detailed walkthrough in BUILD_MACOS.md. Quick summary:

# Install prerequisites via Homebrew
brew install dub gtk+3 dylibbundler pkg-config

# Install LDC 1.41 (canonical; 1.42+ also works for TagGen)
curl -fsSLO https://github.com/ldc-developers/ldc/releases/download/v1.41.0/ldc2-1.41.0-osx-arm64.tar.xz
tar -xJf ldc2-1.41.0-osx-arm64.tar.xz
sudo mv ldc2-1.41.0-osx-arm64 /opt/ldc

# Clone + build
git clone https://github.com/Arnaroo/taggen.git
cd taggen
dub build --config=macos --compiler=/opt/ldc/bin/ldc2 -b release-static

# Bundle into a relocatable .app + .dmg (compiles installer/applauncher.c, runs dylibbundler,
# builds TagGen.app/Contents/MacOS/TagGen as a native Mach-O for Tahoe Gatekeeper compatibility,
# ad-hoc signs the bundle, and produces TagGen-<ver>-macos-arm64.dmg)
bash installer/package-macos.sh

The package-macos.sh script handles the macOS-specific gotchas (force-loading druntime/phobos archives for Apple's ld-prime, bundling 30+ GTK dylibs with dylibbundler, building a native Mach-O launcher so Tahoe Gatekeeper accepts the .app, etc.) — see BUILD_MACOS.md for the architectural notes.

Building from Source (Windows)

Windows builds use a hybrid toolchain: LDC2 (D compiler) + MSVC (linker) + MSYS2 (GTK3 runtime). See WINDOWS_BUILD.md for detailed instructions.

Quick summary:

Install MSYS2 and run: pacman -S mingw-w64-x86_64-gtk3
Install Visual Studio Build Tools (Desktop C++ workload)
Install LDC2 (includes DUB)
Open the x64 Native Tools Command Prompt and run:

set PATH=%PATH%;C:\D\ldc2\bin
cd C:\path\to\taggen
build-windows.bat                  :: Build + portable ZIP
build-windows.bat --installer      :: Build + ZIP + Inno Setup installer

Quick Start

Generate 96 barcodes (14 bp, minimum Hamming distance 4)

taggen --cli -n 96 -l 14 -d 4 -o my_barcodes

This produces my_barcodes.fasta and my_barcodes.tsv containing 96 barcodes of 14 bp each with minimum pairwise Hamming distance of 4.

Generate barcodes with Levenshtein distance (recommended for ONT)

taggen --cli -n 96 -l 20 -d 8 --metric levenshtein --minGc 40 --maxGc 60 -o ont_tags -v

Demultiplex reads

taggen --demux --tags my_barcodes.fasta --reads reads.fastq --mode end --trim-mode ends --outdir demux_results/

Demultiplex reads + co-route POD5 signal

taggen --demux --tags my_barcodes.fasta \
    --reads reads.fastq \
    --pod5 raw_signal.pod5 \
    --mode end --trim-mode ends \
    --outdir demux_results/

--pod5 accepts one or more POD5 files (or pass the flag multiple times). Co-routing requires the pod5 Python CLI (pip install pod5); each per-sample folder will contain a matching <sample>.pod5 alongside <sample>.fastq.

Launch the GUI

./taggen

GUI Workflow

Generate Tab

Configure barcode parameters:
- Set tag length (8--30 bp recommended)
- Set target barcode count (e.g., 96 for a standard plate)
- Set minimum distance (3--8 depending on error tolerance)
- Choose distance metric (Hamming for speed, Levenshtein for ONT accuracy)
Set quality constraints:
- GC content bounds (default: 30--70%)
- Maximum homopolymer length (default: 3)
Optionally load exclude sequences (FASTA format) to avoid similarity to adapters or primers
Click "Generate Tags" to produce barcodes
Review the generated tags in the preview table and the pairwise distance heatmap
Export to FASTA or TSV format

Demultiplex Tab

Load barcode FASTA file (generated above or user-supplied)
Select FASTQ read file(s) for demultiplexing
Choose search mode: end (standard libraries) or full (mid-read barcodes)
Optionally configure:
- Position mask (to restrict barcode search to a region of the read)
- Maximum edit distance (auto-calculated by default)
- Trim mode (none / ends / all)
Click "Run Demultiplexing"
Review results: per-sample assignment counts, barcode match positions, edit distance distributions, Q-score profiles
Demultiplexed reads are written to per-sample subdirectories; unassigned reads to a separate directory

CLI Reference

Barcode Generation

USAGE:
    taggen --cli [OPTIONS]        Generate barcodes
    taggen                        Launch GUI (if available)

Tag Structure

Flag	Long Form	Description	Default
`-l`	`--length N`	Core barcode length in bp	12
`-p`	`--prefix SEQ`	5' prefix sequence prepended to each barcode	none
`-x`	`--suffix SEQ`	3' suffix sequence appended to each barcode	none

Diversity Parameters

Flag	Long Form	Description	Default
`-d`	`--difference N`	Minimum pairwise distance between tags	3
`-i`	`--identity N`	Maximum sequence identity (0--100%)	75
`-t`	`--tolerance N`	Error tolerance buffer in bp	2
`-r`	`--homopolymer N`	Maximum homopolymer run length	3
	`--metric METRIC`	Distance metric: `hamming` or `levenshtein`	hamming

Quality Constraints

Flag	Long Form	Description	Default
	`--minGc N`	Minimum GC content (0--100%)	30
	`--maxGc N`	Maximum GC content (0--100%)	70

Generation

Flag	Long Form	Description	Default
`-n`	`--count N`	Number of tags to generate	96
`-k`	`--select N`	Select best N from a larger generated pool	all

Exclude Sequences

Flag	Long Form	Description	Default
`-e`	`--exclude SEQ`	Comma-separated sequences to avoid similarity with	none
`-f`	`--excludeFile FILE`	FASTA or text file with sequences to avoid	none
	`--excludeDist N`	Minimum distance from exclude sequences	3
	`--excludeMode MODE`	`hard` (reject), `soft` (rank), or `combined`	hard

Exclude sequences can be longer than the tag length. TagGen uses a sliding-window comparison to find the minimum distance at any alignment position, ensuring that generated tags do not match anywhere within longer reference sequences (e.g., adapter or genome sequences).
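The sliding-window comparison described above can be sketched in Python as follows. This is a minimal illustration using Hamming distance and a hard reject; the function names are hypothetical, and the handling of references shorter than the tag (rejected with an error here) is a simplifying assumption rather than TagGen's documented behaviour.

```python
from typing import Iterable


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def min_window_distance(tag: str, reference: str) -> int:
    """Smallest distance between the tag and any tag-length window of the reference."""
    if len(reference) < len(tag):
        raise ValueError("reference must be at least as long as the tag (sketch limitation)")
    last_start = len(reference) - len(tag)
    return min(hamming(tag, reference[i:i + len(tag)]) for i in range(last_start + 1))


def passes_exclude(tag: str, references: Iterable[str], min_dist: int = 3) -> bool:
    """Hard mode: keep the tag only if it stays min_dist away from every window."""
    return all(min_window_distance(tag, ref) >= min_dist for ref in references)


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGCACACGTCT"
    print(passes_exclude("GGAAGAGC", [adapter]))  # occurs inside the adapter
    print(passes_exclude("TTTTTTTT", [adapter]))  # far from every window
```

A tag that appears verbatim anywhere inside the adapter has a window distance of 0 and is rejected, even though a start-to-start comparison alone would not have caught it.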

Output

Flag	Long Form	Description	Default
`-o`	`--output NAME`	Output filename prefix	tags
	`--outDir DIR`	Output directory	current
	`--tsv`	Export TSV format	yes
	`--fasta`	Export FASTA format	yes
`-m`	`--matrix`	Export pairwise distance matrix (CSV)	no
	`--heatmap`	Print ASCII heatmap to terminal	no

Other

Flag	Long Form	Description
`-c`	`--config FILE`	Load parameters from JSON configuration file
`-v`	`--verbose`	Enable verbose output with statistics
`-h`	`--help`	Show help message
`-V`	`--version`	Show version information

Demultiplexing

USAGE:
    taggen --demux --tags FILE --reads FILE [OPTIONS]

Required

Flag	Long Form	Description
`-t`	`--tags FILE`	Tag FASTA file (barcode sequences with sample IDs)
`-r`	`--reads FILE...`	Input FASTQ file(s); gzip-compressed files accepted

Input Options

Flag	Long Form	Description	Default
`-p`	`--pod5 FILE...`	POD5 signal file(s) to co-demultiplex by read ID	none

Matching Parameters

Flag	Long Form	Description	Default
`-d`	`--max-dist N`	Maximum edit distance for barcode acceptance	auto
	`--metric METRIC`	Distance metric: `levenshtein` or `hamming`	levenshtein
`-s`	`--min-score F`	Minimum confidence score (0--1)	0.0
`-w`	`--search-window N`	Bases to search at each read end	auto (tag_len + 15)
`-m`	`--mode MODE`	Search mode: `end` or `full`	end
`-q`	`--min-qscore Q`	Skip reads with mean Q-score below Q	0 (off)
	`--position-mask SPEC`	Include zone for full-mode search (repeatable)	none
	`--trim-mode MODE`	Tag trimming: `none`, `ends`, or `all`	all

Automatic max-dist: When --max-dist is not specified, TagGen automatically sets the maximum edit distance based on tag length:

Tags shorter than 20 bp: tag_length / 5
Tags 20--29 bp: tag_length / 4
Tags 30 bp or longer: tag_length / 3
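The rule above is easy to reproduce when planning an experiment. The Python sketch below assumes the divisions are integer (floor) divisions, which the text does not state explicitly; the function name is hypothetical.

```python
def auto_max_dist(tag_length: int) -> int:
    """Default maximum edit distance for a given tag length (floor division assumed)."""
    if tag_length < 20:
        return tag_length // 5
    if tag_length < 30:
        return tag_length // 4
    return tag_length // 3


if __name__ == "__main__":
    for length in (14, 20, 24, 30):
        print(length, auto_max_dist(length))
```

Under this reading, a 30 bp tag defaults to a threshold of 10, so passing --max-dist 4 for the same tag length is a considerably stricter setting.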

Position mask formats (full mode only):

5p:N -- Search the first N bp from the 5' end of the read
3p:N -- Search the last N bp from the 3' end of the read
START:END -- Absolute base positions (e.g., 0:60)
F_START:F_END -- Read-length fractions (e.g., 0.05:0.20 for 5--20% of read length)

Multiple masks can be specified (comma-separated or repeated flags); their zones are combined as a union.
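As a rough illustration of how the four mask formats resolve to search zones on a given read, here is a Python sketch. Several details are assumptions made for the example: positions are treated as 0-based half-open intervals, a decimal point is what distinguishes a fraction from an absolute position, and zones are clamped to the read length. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Zone = Tuple[int, int]


def parse_mask(spec: str, read_len: int) -> Zone:
    """Resolve one mask spec to a (start, end) zone clamped to the read."""
    if spec.startswith("5p:"):
        start, end = 0, int(spec[3:])
    elif spec.startswith("3p:"):
        start, end = read_len - int(spec[3:]), read_len
    else:
        lo, hi = spec.split(":")
        if "." in lo or "." in hi:  # assumption: a decimal point marks a fraction
            start, end = round(float(lo) * read_len), round(float(hi) * read_len)
        else:
            start, end = int(lo), int(hi)
    clamp = lambda v: max(0, min(v, read_len))
    return clamp(start), clamp(end)


def merge_zones(specs: Iterable[str], read_len: int) -> List[Zone]:
    """Union of all zones; each spec may itself be a comma-separated list."""
    zones = sorted(parse_mask(part.strip(), read_len)
                   for spec in specs for part in spec.split(","))
    merged: List[Zone] = []
    for start, end in zones:
        if start >= end:
            continue  # empty after clamping
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    print(merge_zones(["5p:60,0:100", "3p:50"], read_len=1000))
```

Fractional masks scale with each read, so the same spec covers a different absolute region on a 1 kb read than on a 10 kb read.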

Output Options

Flag	Long Form	Description	Default
`-o`	`--outdir DIR`	Output directory	demux_out
	`--reads-per-file N`	Split FASTQ output into N-read files	0 (no split)
	`--pod5-reads-per-file N`	Split POD5 output into N-read files	0 (no split)
	`--pod5-parallel N`	Concurrent `pod5 subset` processes during routing	0 (auto: `min(N_samples, max(1, min(nproc/2, 8)))`)
`-z`	`--compress`	Gzip-compress output FASTQ files	off
	`--unassigned`	Write unassigned reads to separate directory	on
	`--stats`	Write per-sample statistics TSV	on
	`--batch-size N`	Reads per processing batch	10000
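The automatic value for --pod5-parallel in the table above can be worked out ahead of time for a given node. The Python sketch below transcribes the formula directly, assuming nproc/2 is an integer division; the function name is hypothetical.

```python
import os
from typing import Optional


def auto_pod5_parallel(n_samples: int, nproc: Optional[int] = None) -> int:
    """min(N_samples, max(1, min(nproc/2, 8))) with integer division assumed."""
    if nproc is None:
        nproc = os.cpu_count() or 1
    return min(n_samples, max(1, min(nproc // 2, 8)))


if __name__ == "__main__":
    for samples, cores in ((14, 32), (14, 12), (3, 64), (96, 1)):
        print(samples, cores, auto_pod5_parallel(samples, cores))
```

The result is capped at 8 processes and never exceeds the number of samples, so small runs on large nodes do not spawn idle workers.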

Demux Output Structure

demux_out/
  Tag_001/
    Tag_001.fastq           # Reads assigned to Tag_001
    Tag_001.pod5            # (only if --pod5 was given) co-routed signal
  Tag_002/
    Tag_002.fastq
    Tag_002.pod5
  ...
  unassigned/
    unassigned.fastq        # Reads that failed matching criteria
  demux_stats.tsv           # Per-sample assignment summary

FASTQ headers of assigned reads are annotated with the matched barcode name, edit distance, and confidence score.

When --pod5-reads-per-file N is set, the POD5 outputs are split into chunks named Tag_001_001.pod5, Tag_001_002.pod5, ... in the same per-sample folder.

JSON Configuration Files

Save and reload generation parameters via JSON for reproducibility:

{
    "coreLength": 20,
    "numTags": 96,
    "minDifference": 8,
    "maxHomopolymer": 3,
    "minGC": 0.4,
    "maxGC": 0.6,
    "distanceMetric": "levenshtein",
    "outputPrefix": "experiment_tags",
    "exportFasta": true,
    "exportTsv": true,
    "exportMatrix": false,
    "excludeSequences": [
        "AGATCGGAAGAGCACACGTCT",
        "AGATCGGAAGAGCGTCGTGTA"
    ],
    "excludeMinDistance": 4
}

Usage:

taggen --cli -c experiment_config.json -v
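For parameter sweeps it can be convenient to write such files from a script. The Python sketch below emits only keys that appear in the example above and applies a few sanity checks taken from ranges stated elsewhere in this README. The helper names are hypothetical, and whether TagGen accepts a config containing only this subset of keys is an assumption. GC bounds are written as fractions, as in the example, whereas the CLI flags take percentages.

```python
import json

ALLOWED_METRICS = {"hamming", "levenshtein"}


def build_config(core_length: int, num_tags: int, min_difference: int,
                 metric: str = "levenshtein", min_gc: float = 0.4,
                 max_gc: float = 0.6, max_homopolymer: int = 3,
                 output_prefix: str = "experiment_tags") -> dict:
    """Return a dict using the key names shown in the example config."""
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"unknown metric: {metric}")
    if not 0.0 <= min_gc <= max_gc <= 1.0:
        raise ValueError("GC bounds must satisfy 0 <= minGC <= maxGC <= 1")
    if not 8 <= core_length <= 30:
        raise ValueError("core length outside the 8--30 bp range")
    return {
        "coreLength": core_length,
        "numTags": num_tags,
        "minDifference": min_difference,
        "maxHomopolymer": max_homopolymer,
        "minGC": min_gc,
        "maxGC": max_gc,
        "distanceMetric": metric,
        "outputPrefix": output_prefix,
    }


def save_config(config: dict, path: str) -> None:
    """Write the config as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=4)
        handle.write("\n")


if __name__ == "__main__":
    save_config(build_config(20, 96, 8), "experiment_config.json")
```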

Output Formats

FASTA (`.fasta`)

Standard FASTA format with sequentially numbered tag IDs:

>Tag_001
ACGTACGTACGTAC
>Tag_002
TGCATGCATGCATG
>Tag_003
GATCGATCGATCGA

TSV (`.tsv`)

Tab-separated file with header, suitable for spreadsheet analysis:

#       Tag Sequence    Core Only       Length  GC %
1       ACGTACGTACGTAC  ACGTACGTACGTAC  14      50.000000
2       TGCATGCATGCATG  TGCATGCATGCATG  14      50.000000

Distance Matrix (`.csv`)

Pairwise distance matrix in CSV format for downstream analysis or clustering:

,Tag_001,Tag_002,Tag_003
Tag_001,0,8,6
Tag_002,8,0,7
Tag_003,6,7,0
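A matrix in this layout can be checked downstream with a few lines of Python. The sketch below assumes the layout shown above (an empty top-left cell, tag names in the first row and first column, integer distances) and reports the closest pair of tags. The function names are hypothetical.

```python
import csv
import io
from typing import Dict, List, Tuple


def load_matrix(text: str) -> Tuple[List[str], Dict[str, Dict[str, int]]]:
    """Parse the CSV text into tag names and a nested name -> name -> distance map."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    names = rows[0][1:]
    matrix = {row[0]: dict(zip(names, map(int, row[1:]))) for row in rows[1:]}
    return names, matrix


def closest_pair(names: List[str], matrix: Dict[str, Dict[str, int]]) -> Tuple[int, str, str]:
    """Return (distance, tag_a, tag_b) for the smallest off-diagonal entry."""
    pairs = ((matrix[a][b], a, b)
             for i, a in enumerate(names) for b in names[i + 1:])
    return min(pairs)


EXAMPLE = """,Tag_001,Tag_002,Tag_003
Tag_001,0,8,6
Tag_002,8,0,7
Tag_003,6,7,0
"""

if __name__ == "__main__":
    names, matrix = load_matrix(EXAMPLE)
    print(closest_pair(names, matrix))
```

The smallest off-diagonal value is the achieved minimum pairwise distance of the set, which can be compared against the -d value requested at generation time.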

Demultiplexing Statistics (`demux_stats.tsv`)

Per-sample summary with assigned read counts, expected counts, and mean Q-scores.

Examples

Example 1: Standard ONT Sample Multiplexing

Scenario: Multiplex 96 samples on a standard ONT flow cell.

# Generate 96 barcodes of 14 bp, min Hamming distance 4
taggen --cli -n 96 -l 14 -d 4 --minGc 40 --maxGc 60 -r 3 -o ont_barcodes

# Demultiplex reads (standard dual-ended library)
taggen --demux --tags ont_barcodes.fasta --reads reads.fastq \
    --mode end --trim-mode ends

Expected performance: >97% correct barcode assignment at typical ONT error rates (10--15%).

Example 2: Direct RNA Sequencing with High Error Tolerance

Scenario: Direct RNA sequencing of degraded clinical samples. 48-sample experiment with a custom ligation adapter.

# Generate 48 barcodes with stringent diversity, excluding adapter sequences
taggen --cli -n 48 -l 24 -d 10 --metric levenshtein \
    --minGc 45 --maxGc 55 -r 2 \
    -f custom_adapter.fasta -o uc1_tags -v

# Demultiplex (end mode for standard 5'/3' tagging)
taggen --demux --tags uc1_tags.fasta --reads reads.fastq \
    --mode end --trim-mode all

Expected performance: >98% correct assignment even at 20% error rate.

Example 3: Older ONT Chemistry with Position Masking

Scenario: Older R9.4.1 flow cells producing reads at 85--90% identity. Barcode is at the 5' end but may be shifted a few bases due to adapter degradation.

# Generate 96 barcodes of 30 bp, Levenshtein distance 8
taggen --cli -n 96 -l 30 -d 8 --metric levenshtein -o uc2_tags -v

# Demultiplex with full-mode search + 5' position mask (first 60 bp)
taggen --demux --tags uc2_tags.fasta --reads reads.fastq \
    --mode full --position-mask 5p:60 --trim-mode all

Expected performance: 97% accuracy at 95% identity, 84% at 85% identity, with near-zero misassignment.

Example 4: Spatial Transcriptomics with Zero Misassignment

Scenario: Custom spatial transcriptomics array with 384 capture spots. Barcode is embedded in a synthetic RNA spike-in at 5--20% of read length. Zero cross-sample contamination required.

# Generate 384 barcodes with pairwise distance heatmap
taggen --cli -n 384 -l 30 -d 8 --metric levenshtein --heatmap -o uc3_tags -v

# Demultiplex with fractional position mask and strict threshold
taggen --demux --tags uc3_tags.fasta --reads reads.fastq \
    --mode full --position-mask 0.05:0.20 \
    --max-dist 4 --trim-mode all

Expected performance: Misassignment reduced to <0.001% at 90% read identity; approximately 10--15% of reads left unassigned (acceptable when sample integrity is paramount).

Algorithm

TagGen implements a two-phase algorithm:

Phase 1: Monte Carlo Candidate Generation

Rather than enumerating all 4^n possible sequences (which causes memory exhaustion at n >= 14), TagGen generates random candidate sequences in parallel across all CPU cores. Each candidate is validated in real-time against user-specified constraints:

GC content within bounds
No homopolymer runs exceeding the limit
Not matching any exclude sequence within the minimum distance

Valid candidates are collected into a thread-safe pool (typically ~100,000 candidates in ~55 ms). The Mersenne Twister PRNG is used for sequence generation.
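Phase 1 can be sketched in a few lines of Python. This is a single-threaded illustration, not TagGen's D implementation: the exclude-sequence check is omitted, de-duplicating the pool with a set is an assumption, and the attempt limit is a safeguard added for the example. Python's random module happens to use the Mersenne Twister as well.

```python
import random
from typing import List


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def max_homopolymer(seq: str) -> int:
    """Length of the longest run of identical bases."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def generate_candidates(length: int, pool_size: int, min_gc: float = 0.30,
                        max_gc: float = 0.70, max_run: int = 3,
                        seed: int = 0) -> List[str]:
    """Sample random sequences and keep those that pass the quality constraints."""
    rng = random.Random(seed)          # Mersenne Twister under the hood
    pool = set()                       # assumption: duplicates are discarded
    attempts, limit = 0, pool_size * 100
    while len(pool) < pool_size and attempts < limit:
        attempts += 1
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        if min_gc <= gc_fraction(seq) <= max_gc and max_homopolymer(seq) <= max_run:
            pool.add(seq)
    return sorted(pool)


if __name__ == "__main__":
    pool = generate_candidates(length=14, pool_size=1000, seed=42)
    print(len(pool), pool[:3])
```

Because each candidate is validated independently, the loop parallelises trivially: each worker can fill its own pool with its own seed and the pools can be merged afterwards.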

Phase 2: Greedy Diversity Selection

From the candidate pool, TagGen iteratively selects the sequence that maximises the minimum Hamming or Levenshtein distance to all previously selected barcodes. This greedy approach produces a locally optimal set that is not guaranteed to be globally optimal. For practical applications (96--384 barcodes), the approximation is highly effective, typically achieving minimum inter-tag distances of 12--15 for a nominal target of d=8.

Selection completes in ~25 ms for 96 barcodes from 100,000 candidates.
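The greedy step can be illustrated with a farthest-point selection that caches each candidate's distance to its nearest selected tag, which keeps the work at one distance computation per candidate per selected tag. The Python sketch below uses Hamming distance and starts from the first candidate; both choices are assumptions made for the example, and the code is not TagGen's implementation.

```python
from typing import Callable, List


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def greedy_select(candidates: List[str], n_tags: int, min_dist: int,
                  distance: Callable[[str, str], int] = hamming) -> List[str]:
    """Repeatedly pick the candidate farthest from everything selected so far."""
    if not candidates or n_tags < 1:
        return []
    selected = [candidates[0]]  # assumption: the first candidate seeds the set
    # nearest[i] caches the distance from candidate i to its closest selected tag
    nearest = [distance(c, selected[0]) for c in candidates]
    while len(selected) < n_tags:
        best = max(range(len(candidates)), key=nearest.__getitem__)
        if nearest[best] < min_dist:
            break  # no remaining candidate is far enough from the selected set
        chosen = candidates[best]
        selected.append(chosen)
        for i, c in enumerate(candidates):  # refresh the cache against the new tag only
            d = distance(c, chosen)
            if d < nearest[i]:
                nearest[i] = d
    return selected


if __name__ == "__main__":
    pool = ["AAAA", "AAAT", "TTTT", "CCCC", "AATT", "GGGG"]
    print(greedy_select(pool, n_tags=4, min_dist=3))
```

If the pool runs out of candidates that satisfy the minimum distance, the function returns fewer tags than requested. Enlarging the candidate pool or relaxing the distance are the usual remedies.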

Demultiplexer Pipeline

The demultiplexer uses a two-stage pipeline:

K-mer voting (Stage 1): All 8-mers in the search region are looked up against a pre-built index mapping each 8-mer to the set of barcodes containing it. Candidate barcodes are ranked by hit count, and top candidates' hit positions are averaged to estimate barcode location.
Banded edit-distance alignment (Stage 2): A sliding window of (tag_length + 15) bp centred on the estimated location is compared against each candidate barcode via banded Levenshtein or Hamming distance. The best match is accepted if:
- The edit distance does not exceed the maximum threshold
- The margin between best and second-best candidate is at least 2 (ambiguity guard)
- The confidence score exceeds the minimum threshold

Reads failing any criterion are written to the unassigned output, annotated with the rejection reason (no_match, ambiguous, or low_confidence).
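The two stages can be sketched in Python as follows. The k-mer index and voting follow the description above. The acceptance function takes the distances and confidence score as inputs, because how the score is computed is not specified here. The order of the checks and the use of "greater than or equal" for the score threshold are assumptions, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

K = 8  # k-mer size used for voting


def build_index(tags: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every 8-mer to the names of the barcodes that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in tags.items():
        for i in range(len(seq) - K + 1):
            index[seq[i:i + K]].add(name)
    return index


def vote(region: str, index: Dict[str, Set[str]]) -> List[Tuple[str, int, float]]:
    """Rank barcodes by k-mer hits; the mean hit position estimates the location."""
    hits: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(region) - K + 1):
        for name in index.get(region[pos:pos + K], ()):
            hits[name].append(pos)
    ranked = sorted(hits.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(name, len(p), sum(p) / len(p)) for name, p in ranked]


def accept(best_dist: int, second_dist: int, confidence: float,
           max_dist: int, min_score: float = 0.0, margin: int = 2) -> str:
    """Apply the three acceptance criteria; return a label for the outcome."""
    if best_dist > max_dist:
        return "no_match"
    if second_dist - best_dist < margin:
        return "ambiguous"
    if confidence < min_score:
        return "low_confidence"
    return "assigned"


if __name__ == "__main__":
    tags = {"Tag_A": "ACGGTCATTGCAGTCA", "Tag_B": "TTGACCGATAGGCTAA"}
    index = build_index(tags)
    print(vote("GGGG" + tags["Tag_A"] + "CCCC", index))
    print(accept(best_dist=2, second_dist=5, confidence=0.9, max_dist=3))
```

Voting only narrows down the candidates and the approximate location. The alignment stage still decides the assignment, so a barcode with several sequencing errors can be recovered as long as at least one 8-mer survives intact.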

Performance

Barcode Generation Speed

Configuration	TagGen	DNABarcodes	Speedup
8 bp, d>=3	24 ms	120 ms	5x
10 bp, d>=3	31 ms	24 s	770x
12 bp, d>=4	34 ms	463 s	13,600x
14 bp, d>=4	35 ms	Memory exhaustion	--
16 bp, d>=4	35 ms	Memory exhaustion	--
20 bp, d>=6	44 ms	Memory exhaustion	--
30 bp, d>=8	29 ms	Memory exhaustion	--

All benchmarks: 96 target barcodes, Hamming distance, GC 25--75%. Wall-clock times averaged over 3 replicates. Linux, 12-core AMD processor.

Barcode Resolution Under Nanopore Error

Barcode Length	Min. Dist.	5% Error	10% Error	15% Error	20% Error	25% Error
10 bp	d>=3	99.8%	96.2%	91.3%	83.3%	74.7%
12 bp	d>=4	99.7%	99.0%	96.1%	88.6%	82.0%
14 bp	d>=4	100%	99.5%	97.9%	94.1%	86.1%
16 bp	d>=5	99.9%	100%	99.1%	95.8%	92.2%
20 bp	d>=6	100%	99.9%	99.6%	99.6%	96.4%
30 bp	d>=8	100%	100%	100%	100%	99.6%

Values indicate the percentage of reads correctly assigned to the original barcode. Error model: 50% deletions, 25% insertions, 25% substitutions. N = 1,000 simulated reads per barcode per condition.
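An error model with this 50/25/25 split can be sketched in Python as follows. The sketch assumes that errors strike each base independently at the stated rate and that an insertion adds one random base after the current one. The simulation behind the table may differ in these details, and the function name is hypothetical.

```python
import random
from typing import Sequence


def mutate(seq: str, error_rate: float, rng: random.Random,
           weights: Sequence[float] = (0.50, 0.25, 0.25)) -> str:
    """Apply per-base errors; weights are (deletion, insertion, substitution)."""
    out = []
    for base in seq:
        if rng.random() >= error_rate:
            out.append(base)              # base read correctly
            continue
        kind = rng.choices(("del", "ins", "sub"), weights=weights)[0]
        if kind == "del":
            continue                      # base is dropped
        if kind == "ins":
            out.append(base)
            out.append(rng.choice("ACGT"))  # extra random base after the original
        else:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(7)
    barcode = "ACGGTCATTGCAGTCAGGTA"
    for rate in (0.05, 0.15, 0.25):
        print(rate, mutate(barcode, rate, rng))
```

Feeding reads mutated this way into a demultiplexer and counting correct assignments per barcode gives an accuracy figure comparable in spirit to the table, under the stated assumptions.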

Recommended Parameters

By Application

Application	Length	Count	Distance	Metric	Notes
Standard ONT (R10.4/Kit14)	14 bp	96	d=4	Hamming	>97% accuracy at 95% identity
High-throughput ONT	20 bp	96--384	d=6--8	Levenshtein	Recommended for most new experiments
Direct RNA sequencing	24--30 bp	48--96	d=8--10	Levenshtein	High error tolerance
Spatial transcriptomics	30 bp	384	d=8	Levenshtein	Use fractional position mask + `--max-dist 4`
Clinical (zero misassignment)	30 bp	48--96	d=8	Levenshtein	Use `--max-dist 4` for <0.001% misassignment

General Guidance

Levenshtein distance is recommended over Hamming for all ONT workflows, as it accounts for the indel-dominated error profile and achieves 3--8 percentage points higher accuracy at 85--90% read identity.
Longer barcodes (20--30 bp) provide substantially better error resilience. At 20% error, 30 bp barcodes maintain 100% correct assignment while 10 bp barcodes degrade to 83%.
Position masks are important in full-mode search to avoid spurious off-target matches. Without a mask, accuracy drops by 6--8 percentage points.
--max-dist controls the trade-off between assignment sensitivity and sample purity. The default adaptive threshold keeps misassignment below 0.5%. For clinical applications, use --max-dist 4 with 30 bp tags to reduce misassignment to <0.001% at the cost of ~10--15% more unassigned reads.

Changelog

Full changelog in CHANGELOG.md. Recent highlights:

v1.2.5 "Frogmouth" (2026-05-13)

POD5 routing parallelised across K concurrent pod5 subset processes (--pod5-parallel N, GUI SpinButton). Bin-packed buckets share the OS-page-cached input. Measured 7.1× routing speedup vs 1.2.4 on a 14-sample / 92 GB nanopore run (kaya HPC, K=8).

v1.2.4 "Emu" (2026-05-13)

POD5 routing now issues one combined pod5 subset call instead of one-per-sample (~1.3× routing speedup from per-call setup amortisation).
macOS .app bundle: native Mach-O launcher replaces shell-script Contents/MacOS/TagGen to satisfy Tahoe Gatekeeper (fixes _LSOpenURLsWithCompletionHandler() error -10669).
TagGen confirmed LDC 1.42+ compatible on Linux and macOS.

v1.2.3 "Dingo" (2026-05-13)

FASTQ parser fix: read IDs now split on any whitespace (TAB or space). Dorado emits SAM-style auxiliary tags TAB-separated; v1.2.1/1.2.2 captured the entire TSV header line as the read id, breaking POD5 routing (pod5 subset reported Found 0 read_ids). Closed the cluster-side regression that affected all dorado-basecalled inputs.

v1.2.2 "Cassowary" (2026-05-13)

POD5 read-id normalisation (lowercase + strip read_ prefix) as defensive belt-and-braces for non-dorado pipelines that emit uppercase or prefixed UUIDs.
Failed-routing CSV preservation (TAGGEN_DEBUG_POD5=1) for post-mortem inspection.

v1.2.1 "Bilby" (2026-05-07)

Fixed POD5 co-demultiplexing (--pod5): switched from the unsupported --read-id-file flag to pod5 subset --csv direct mapping with positional inputs
All input POD5 files now passed in a single pod5 subset call per sample, with --missing-ok so partial coverage no longer aborts routing

v1.2.0 "Wombat" (2026-03-08)

Position masks refactored from exclude to include semantics
GUI: 3-mode anchor selector per zone (From 3' end / From 5' end / Custom range)
CLI: --position-mask accepts 3p:N and 5p:N shorthand
Comprehensive demux test framework with parameter sweeps

v1.1.3 "Echidna" (2026-03-08)

Positional mask for full-mode demux: constrain tag search to specific read regions
Post-deconvolution statistics window with four visualisation tabs

v1.1.2 "Platypus" (2026-03-05)

Integrated demultiplexer (GUI tab + CLI --demux mode)
POD5 signal file co-demultiplexing
Tag trimming modes: none, ends, all
Confidence scoring with ambiguity detection

v1.1.1 "Billabong" (2026-03-05)

Levenshtein edit distance for tag selection (--metric levenshtein)
Greedy selection performance fix: O(n*k) cached min-distances

v1.1.0 "Outback Explorer" (2026-01-16)

Interactive heatmap visualisation of pairwise tag distances
Exclude sequences feature with sliding-window comparison

v1.0.1 "Unlimited Outback" (2025-10-15)

Increased maximum tag limit to 1,000,000

v1.0.0 "Kangaroo Launch" (2025-10-15)

Initial release: GUI, parallel generation, greedy selection, TSV/FASTA export

License

TagGen is released under the MIT License.

MIT License
Copyright (c) 2026 Biocodecs, Arnaroo Ribologicals, RMODEL

Authors

Faiza Chowdhury
Tessa Swain
Roderik Shirokikh
Danielle L. Rudler
Archa H. Fox
Alice Cleynen
Nikolay E. Shirokikh

School of Human Sciences / School of Molecular Sciences, The University of Western Australia, Perth, WA, Australia

France-Australia Mathematical Sciences and Interactions, CNRS International Research Laboratory, Canberra, ACT, Australia

Contact: nikolay.shirokikh@uwa.edu.au, alice.cleynen@cnrs.fr

Repository: https://github.com/Arnaroo/taggen
