High-Performance Barcode Generator and Demultiplexer for High-Throughput and Long-Read Sequencing Applications
TagGen generates diverse, error-tolerant DNA/RNA barcodes (tags) for multiplexed sequencing experiments and assigns sequencing reads back to their source barcodes through an integrated demultiplexer. It is purpose-built for long-read platforms such as Oxford Nanopore Technologies (ONT), where higher error rates (5--15%, dominated by insertions and deletions) demand longer, more robust barcodes than traditional short-read tools can produce.
Key highlights:
- Generates barcodes at lengths of 8--30 bp in under 100 milliseconds, where existing tools fail at lengths above 12 bp due to memory exhaustion
- Up to 13,600x faster than exhaustive enumeration (DNABarcodes) at 12 bp
- Integrated anchor-free demultiplexer that works without knowledge of flanking adapter sequences
- Validated against realistic nanopore error profiles across 154 parameter combinations
- Both a cross-platform graphical interface (GTK3) and a comprehensive command-line interface
- Overview
- Features
- Use Cases
- Installation
- Quick Start
- GUI Workflow
- CLI Reference
- Output Formats
- Examples
- Algorithm
- Performance
- Recommended Parameters
- Changelog
- License
- Authors
Sequence barcodes (tags) are short DNA or RNA sequences used to identify samples, cells, or spatial locations in multiplexed sequencing experiments. For short-read platforms (Illumina), barcodes of 8--12 bp with minimum Hamming distances of 3--4 provide adequate error tolerance. However, nanopore sequencing exhibits error rates of 5--15% dominated by insertions and deletions, demanding longer barcodes with greater inter-sequence distances.
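To make the difference between substitution-only and indel-aware distances concrete, here is a minimal Python sketch. It is not TagGen's D implementation; the function names and the example barcode are illustrative assumptions. It shows how a single deleted base shifts every downstream position, inflating the Hamming distance while the Levenshtein distance stays small.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Edit distance (substitutions, insertions, deletions), row-wise DP."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x from a
                            curr[j - 1] + 1,          # insert y into a
                            prev[j - 1] + (x != y)))  # match or substitute
        prev = curr
    return prev[-1]


barcode = "ACGTTGCAAGCT"
# Hypothetical read: the first G is deleted and one flanking base (A) slides in.
observed = "ACTTGCAAGCTA"
print(hamming(barcode, observed), levenshtein(barcode, observed))  # 8 2
```

One deletion costs 8 under Hamming but only 2 under Levenshtein (the deletion plus the flanking base), which is why indel-dominated error profiles call for edit-distance-aware barcode design.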
Existing barcode generation tools enumerate all possible sequences exhaustively (O(4^n) complexity), making them computationally infeasible at lengths above 12 bp. TagGen replaces exhaustive enumeration with Monte Carlo candidate sampling and greedy diversity selection, so its cost scales with the number of candidates k (typically 10,000--100,000) rather than with 4^n: sampling is O(k), and greedy selection of n tags is O(n*k). This allows generation of barcodes at 14--30 bp in milliseconds.
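The scale gap can be checked with simple arithmetic. The sketch below (plain Python, with a helper name chosen here for illustration) prints the size of the exhaustive search space at several barcode lengths for comparison with a 100,000-candidate pool.

```python
CANDIDATE_POOL = 100_000  # upper end of the typical pool size quoted above


def search_space(length: int) -> int:
    """Number of distinct DNA sequences of a given length (4 bases per position)."""
    return 4 ** length


for length in (8, 12, 14, 20, 30):
    total = search_space(length)
    print(f"{length:>2} bp: {total:,} sequences (pool: {CANDIDATE_POOL:,})")
```

At 12 bp the space is 16,777,216 sequences; at 14 bp it is 268,435,456; at 30 bp it is about 1.15 x 10^18, far beyond anything that can be enumerated in memory.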
TagGen also includes an integrated demultiplexer (taggen-demux) that assigns ONT FASTQ reads to their source barcodes. Unlike primer-anchored tools (minibar, Dorado), taggen-demux is anchor-free: it locates barcodes at any read position using k-mer voting and banded edit-distance alignment, without requiring knowledge of flanking adapter sequences. This makes it suitable for direct RNA sequencing, spatial transcriptomics capture plates, and custom library protocols where the barcode is not flanked by a consistent adapter.
- Flexible barcode lengths: 8--30 bp core sequences with optional 5' prefix and 3' suffix
- Distance metrics: Hamming distance (substitutions only) or Levenshtein edit distance (substitutions, insertions, deletions)
- Quality constraints: Configurable GC content bounds (default 30--70%), maximum homopolymer run length (default 3), and custom exclude sequences
- Exclude sequences: Load FASTA files of adapters, primers, or existing barcodes to ensure generated tags are dissimilar; supports sliding-window comparison for sequences longer than the tag length
- Scalable output: Generate from 1 to 1,000,000 tags; select best N from a larger candidate pool
- Multiple output formats: FASTA, TSV, and pairwise distance matrix (CSV)
- Visualisation: Interactive pairwise distance heatmap (GUI) or ASCII heatmap (CLI)
- Reproducibility: JSON configuration files for saving and sharing parameter sets
- Parallel generation: Multi-threaded Monte Carlo sampling across all available CPU cores
- Anchor-free matching: Locates barcodes at any read position without adapter sequence knowledge
- Two search modes: End mode (standard dual-ended libraries) and Full mode (mid-read barcodes, direct RNA sequencing, spatial transcriptomics)
- Position masks: Restrict the search to specific read regions (absolute bp, end-relative offsets, or read-length fractions) to reduce false positives
- Adaptive acceptance thresholds: Automatic maximum edit distance based on tag length (overridable via --max-dist)
- Ambiguity detection: Rejects reads where the margin between best and second-best match is less than 2
- Confidence scoring: Configurable minimum confidence score for assignment
- Tag trimming: Three modes -- none, ends (trim matched end only), all (trim from both ends)
- Gzip support: Reads gzip-compressed FASTQ input; optionally compresses output
- POD5 co-demultiplexing: Partitions raw signal files alongside FASTQ by sample via the pod5 subset tool
- Quality filtering: Skip reads below a minimum mean Phred Q-score
- Per-sample output: Assigned reads written to per-barcode subdirectories; unassigned reads collected separately with annotation of rejection reason
- Statistics: Per-sample assignment counts, expected vs. observed read counts, mean Q-score; interactive charts in the GUI (match positions, edit distance distributions, Q-score profiles)
- Graphical interface (GUI): GTK3-based, three-tab layout (Generate, Demultiplex, About) with real-time parameter validation, progress tracking, interactive heatmap, and post-demultiplexing statistics charts
- Command-line interface (CLI): Full parameter set via getopt-style flags, suitable for scripted pipelines and HPC environments
| Scenario | Tag Length | Count | Min Distance | Metric | Demux Mode | Mask |
|---|---|---|---|---|---|---|
| Standard ONT multiplexing (96-well plate) | 14 bp | 96 | d=4 | Hamming | End | -- |
| Direct RNA sequencing (high error tolerance) | 24 bp | 48 | d=10 | Levenshtein | End | -- |
| Older ONT chemistry / degraded samples | 30 bp | 96 | d=8 | Levenshtein | Full | 5p:60 |
| Spatial transcriptomics (zero misassignment) | 30 bp | 384 | d=8 | Levenshtein | Full | 0.05:0.20 |
| Environmental metagenomics | 20 bp | 96 | d=6 | Levenshtein | End | -- |
- Operating system: Linux (Arch, Ubuntu 18.04+, Debian 10+), Windows 10+, or macOS 10.14+
- Processor: Any x86-64 CPU; multi-core recommended for parallel candidate generation
- Memory: 512 MB minimum; 2 GB recommended for large candidate pools
- Disk: 50 MB for installation
- Display: 1024x768 minimum for GUI (not required for CLI)
Pre-compiled binaries for Linux x86_64, Windows x86_64, and macOS arm64 (Apple Silicon) are available in bin/ and from the GitHub Releases page. SHA-256 hashes for every artefact are listed in bin/MACOS_BUILD_NOTES.md.
| Platform | Binary | Notes |
|---|---|---|
| Linux x86_64 | bin/taggen-linux-x86_64 | release-static, glibc 2.34+, runs on most modern distros |
| Windows x86_64 | bin/taggen.exe + bundled GTK DLLs in the release ZIP/installer | Win10+ |
| macOS arm64 | bin/TagGen-1.2.5-macos-arm64.dmg, bin/TagGen.app, bin/taggen-macos-arm64/ | macOS 13+ (Ventura); Apple Silicon only (Rosetta 2 translates x86_64 code on Apple Silicon and cannot run arm64 binaries on Intel Macs) |
Linux:
# Direct binary (statically-linked phobos/druntime; needs only system glibc + GTK3 for GUI)
chmod +x bin/taggen-linux-x86_64
./bin/taggen-linux-x86_64 # GUI
./bin/taggen-linux-x86_64 --cli -n 96 -l 14 -d 4 # CLI generation
./bin/taggen-linux-x86_64 --demux ... # CLI demux
# Install GTK3 runtime for the GUI:
# sudo pacman -S gtk3 # Arch
# sudo apt install libgtk-3-0 # Ubuntu / Debian
# sudo dnf install gtk3 # Fedora
Windows:
Double-click TagGen-1.2.5-windows-x86_64-setup.exe (Inno Setup installer — adds TagGen to Start Menu and optionally to the system PATH), or extract taggen-v1.2.5-windows-x86_64.zip and run taggen.exe from the extracted folder. No system-wide GTK install needed; the runtime DLLs are bundled.
macOS (arm64 / Apple Silicon):
- Double-click TagGen-1.2.5-macos-arm64.dmg to mount.
- Drag TagGen.app to Applications.
- First launch: right-click TagGen.app → Open (the Gatekeeper warning is one-time; this build is ad-hoc signed, as we don't currently have an Apple Developer ID for full notarisation).
  - Alternatively, strip the quarantine attribute:
    xattr -dr com.apple.quarantine /Applications/TagGen.app
- For CLI use, taggen-macos-arm64/bin/taggen-launcher.sh is a relocatable wrapper.
POD5 co-demultiplexing requires the pod5 Python CLI on all platforms — pip install pod5.
TagGen is written in D and uses the DUB package manager.
Prerequisites:
# Arch Linux
sudo pacman -S ldc dub gtk3
# Ubuntu / Debian
sudo apt install ldc dub libgtk-3-dev libgtkd-dev
# Fedora
sudo dnf install ldc dub gtk3-devel
Build:
git clone https://github.com/Arnaroo/taggen.git
cd taggen
# Recommended: statically-linked phobos/druntime (portable across distros)
dub build --compiler=ldc2 --config=linux --build=release-static
# Standard release build
dub build --compiler=ldc2 --config=linux --build=release
# Optimised build (AMD Zen+)
dub build --compiler=ldc2 --config=linux --build=release-zenplus
# Run unit tests
dub test --compiler=ldc2 --config=linux
The resulting binary taggen can be run directly or copied to a location on your PATH. v1.2.5 builds cleanly on LDC 1.41 and LDC 1.42+.
CLI-only build (no GTK dependency):
If you do not need the GUI and want to avoid the GTK3 dependency, you can build TagGen with --config=benchmark which excludes the GUI module, or compile a standalone demux binary:
dub build --compiler=ldc2 --config=benchmark --build=release
Detailed walkthrough in BUILD_MACOS.md. Quick summary:
# Install prerequisites via Homebrew
brew install dub gtk+3 dylibbundler pkg-config
# Install LDC 1.41 (canonical; 1.42+ also works for TagGen)
curl -fsSLO https://github.com/ldc-developers/ldc/releases/download/v1.41.0/ldc2-1.41.0-osx-arm64.tar.xz
tar -xJf ldc2-1.41.0-osx-arm64.tar.xz
sudo mv ldc2-1.41.0-osx-arm64 /opt/ldc
# Clone + build
git clone https://github.com/Arnaroo/taggen.git
cd taggen
dub build --config=macos --compiler=/opt/ldc/bin/ldc2 -b release-static
# Bundle into a relocatable .app + .dmg (compiles installer/applauncher.c, runs dylibbundler,
# builds TagGen.app/Contents/MacOS/TagGen as a native Mach-O for Tahoe Gatekeeper compatibility,
# ad-hoc signs the bundle, and produces TagGen-<ver>-macos-arm64.dmg)
bash installer/package-macos.sh
The package-macos.sh script handles the macOS-specific gotchas (force-loading druntime/phobos archives for Apple's ld-prime, bundling 30+ GTK dylibs with dylibbundler, building a native Mach-O launcher so Tahoe Gatekeeper accepts the .app, etc.); see BUILD_MACOS.md for the architectural notes.
Windows builds use a hybrid toolchain: LDC2 (D compiler) + MSVC (linker) + MSYS2 (GTK3 runtime). See WINDOWS_BUILD.md for detailed instructions.
Quick summary:
- Install MSYS2 and run: pacman -S mingw-w64-x86_64-gtk3
- Install Visual Studio Build Tools (Desktop C++ workload)
- Install LDC2 (includes DUB)
- Open the x64 Native Tools Command Prompt and run:
set PATH=%PATH%;C:\D\ldc2\bin
cd C:\path\to\taggen
build-windows.bat :: Build + portable ZIP
build-windows.bat --installer :: Build + ZIP + Inno Setup installer
taggen --cli -n 96 -l 14 -d 4 -o my_barcodes
This produces my_barcodes.fasta and my_barcodes.tsv containing 96 barcodes of 14 bp each with minimum pairwise Hamming distance of 4.
taggen --cli -n 96 -l 20 -d 8 --metric levenshtein --minGc 40 --maxGc 60 -o ont_tags -v
taggen --demux --tags my_barcodes.fasta --reads reads.fastq --mode end --trim-mode ends --outdir demux_results/
taggen --demux --tags my_barcodes.fasta \
  --reads reads.fastq \
  --pod5 raw_signal.pod5 \
  --mode end --trim-mode ends \
  --outdir demux_results/
--pod5 accepts one or more POD5 files (or pass the flag multiple times). Co-routing requires the pod5 Python CLI (pip install pod5); each per-sample folder will contain a matching <sample>.pod5 alongside <sample>.fastq.
./taggen
- Configure barcode parameters:
  - Set tag length (8--30 bp recommended)
  - Set target barcode count (e.g., 96 for a standard plate)
  - Set minimum distance (3--8 depending on error tolerance)
  - Choose distance metric (Hamming for speed, Levenshtein for ONT accuracy)
- Set quality constraints:
  - GC content bounds (default: 30--70%)
  - Maximum homopolymer length (default: 3)
- Optionally load exclude sequences (FASTA format) to avoid similarity to adapters or primers
- Click "Generate Tags" to produce barcodes
- Review the generated tags in the preview table and the pairwise distance heatmap
- Export to FASTA or TSV format
- Load barcode FASTA file (generated above or user-supplied)
- Select FASTQ read file(s) for demultiplexing
- Choose search mode: end (standard libraries) or full (mid-read barcodes)
- Optionally configure:
  - Position mask (to restrict barcode search to a region of the read)
  - Maximum edit distance (auto-calculated by default)
  - Trim mode (none / ends / all)
- Click "Run Demultiplexing"
- Review results: per-sample assignment counts, barcode match positions, edit distance distributions, Q-score profiles
- Demultiplexed reads are written to per-sample subdirectories; unassigned reads to a separate directory
USAGE:
taggen --cli [OPTIONS] Generate barcodes
taggen Launch GUI (if available)
| Flag | Long Form | Description | Default |
|---|---|---|---|
| -l | --length N | Core barcode length in bp | 12 |
| -p | --prefix SEQ | 5' prefix sequence prepended to each barcode | none |
| -x | --suffix SEQ | 3' suffix sequence appended to each barcode | none |
| Flag | Long Form | Description | Default |
|---|---|---|---|
| -d | --difference N | Minimum pairwise distance between tags | 3 |
| -i | --identity N | Maximum sequence identity (0--100%) | 75 |
| -t | --tolerance N | Error tolerance buffer in bp | 2 |
| -r | --homopolymer N | Maximum homopolymer run length | 3 |
| | --metric METRIC | Distance metric: hamming or levenshtein | hamming |
| Flag | Long Form | Description | Default |
|---|---|---|---|
| | --minGc N | Minimum GC content (0--100%) | 30 |
| | --maxGc N | Maximum GC content (0--100%) | 70 |
| Flag | Long Form | Description | Default |
|---|---|---|---|
| -n | --count N | Number of tags to generate | 96 |
| -k | --select N | Select best N from a larger generated pool | all |
| Flag | Long Form | Description | Default |
|---|---|---|---|
| -e | --exclude SEQ | Comma-separated sequences to avoid similarity with | none |
| -f | --excludeFile FILE | FASTA or text file with sequences to avoid | none |
| | --excludeDist N | Minimum distance from exclude sequences | 3 |
| | --excludeMode MODE | hard (reject), soft (rank), or combined | hard |
Exclude sequences can be longer than the tag length. TagGen uses a sliding-window comparison to find the minimum distance at any alignment position, ensuring that generated tags do not match anywhere within longer reference sequences (e.g., adapter or genome sequences).
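The sliding-window comparison described above can be sketched as follows. This is a minimal illustration in Python, assuming Hamming distance and references at least as long as the tag; the function names are hypothetical and TagGen's actual handling (for example of Levenshtein mode or shorter references) may differ.

```python
def hamming(a: str, b: str) -> int:
    """Mismatch count between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def min_window_distance(tag: str, reference: str) -> int:
    """Smallest Hamming distance between the tag and any tag-length window."""
    k = len(tag)
    if len(reference) < k:
        raise ValueError("this sketch assumes the reference is at least tag length")
    return min(hamming(tag, reference[i:i + k])
               for i in range(len(reference) - k + 1))


def passes_exclude(tag: str, references, min_dist: int = 3) -> bool:
    """Hard-mode check: reject a tag that comes too close to any reference."""
    return all(min_window_distance(tag, ref) >= min_dist for ref in references)


adapter = "AGATCGGAAGAGCACACGTCT"
print(min_window_distance("GGAAGAGCAC", adapter))  # 0: exact substring of the adapter
print(passes_exclude("GGAAGAGCAC", [adapter]))     # False
```

A tag that occurs verbatim inside a longer adapter has a window distance of 0 and is rejected under a hard exclude, even though a whole-sequence comparison would not detect it.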
| Flag | Long Form | Description | Default |
|---|---|---|---|
| -o | --output NAME | Output filename prefix | tags |
| | --outDir DIR | Output directory | current |
| | --tsv | Export TSV format | yes |
| | --fasta | Export FASTA format | yes |
| -m | --matrix | Export pairwise distance matrix (CSV) | no |
| | --heatmap | Print ASCII heatmap to terminal | no |
| Flag | Long Form | Description |
|---|---|---|
| -c | --config FILE | Load parameters from JSON configuration file |
| -v | --verbose | Enable verbose output with statistics |
| -h | --help | Show help message |
| -V | --version | Show version information |
USAGE:
taggen --demux --tags FILE --reads FILE [OPTIONS]
| Flag | Long Form | Description |
|---|---|---|
| -t | --tags FILE | Tag FASTA file (barcode sequences with sample IDs) |
| -r | --reads FILE... | Input FASTQ file(s); gzip-compressed files accepted |
| Flag | Long Form | Description | Default |
|---|---|---|---|
| -p | --pod5 FILE... | POD5 signal file(s) to co-demultiplex by read ID | none |
| Flag | Long Form | Description | Default |
|---|---|---|---|
| -d | --max-dist N | Maximum edit distance for barcode acceptance | auto |
| | --metric METRIC | Distance metric: levenshtein or hamming | levenshtein |
| -s | --min-score F | Minimum confidence score (0--1) | 0.0 |
| -w | --search-window N | Bases to search at each read end | auto (tag_len + 15) |
| -m | --mode MODE | Search mode: end or full | end |
| -q | --min-qscore Q | Skip reads with mean Q-score below Q | 0 (off) |
| | --position-mask SPEC | Include zone for full-mode search (repeatable) | none |
| | --trim-mode MODE | Tag trimming: none, ends, or all | all |
Automatic max-dist: When --max-dist is not specified, TagGen automatically sets the maximum edit distance based on tag length:
- Tags shorter than 20 bp: tag_length / 5
- Tags 20--29 bp: tag_length / 4
- Tags 30 bp or longer: tag_length / 3
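The rule above can be written as a small function. This Python sketch assumes the divisions are integer (floor) divisions, which the text does not state explicitly; the function name is illustrative.

```python
def auto_max_dist(tag_length: int) -> int:
    """Adaptive maximum edit distance from tag length (floor division assumed)."""
    if tag_length < 20:
        return tag_length // 5
    if tag_length < 30:
        return tag_length // 4
    return tag_length // 3


for length in (14, 20, 24, 30):
    print(length, auto_max_dist(length))
```

Under this assumption a 14 bp tag tolerates 2 edits, a 20 bp tag 5, a 24 bp tag 6, and a 30 bp tag 10. Passing --max-dist 4 with 30 bp tags is therefore considerably stricter than the automatic default.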
Position mask formats (full mode only):
- 5p:N -- Search the first N bp from the 5' end of the read
- 3p:N -- Search the last N bp from the 3' end of the read
- START:END -- Absolute base positions (e.g., 0:60)
- F_START:F_END -- Read-length fractions (e.g., 0.05:0.20 for 5--20% of read length)
Multiple masks can be specified (comma-separated or repeated flags); their zones are combined as a union.
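As an illustration of how the mask formats could resolve to search zones on a given read, here is a hedged Python sketch. The parsing rules are assumptions made for this example (in particular, a value containing a decimal point is treated as a fraction); TagGen's own parser may disambiguate differently.

```python
from typing import List, Tuple


def parse_mask(spec: str, read_len: int) -> Tuple[int, int]:
    """Turn one mask spec into a half-open [start, end) zone on the read."""
    left, right = spec.split(":", 1)
    if left == "5p":
        start, end = 0, int(right)
    elif left == "3p":
        start, end = read_len - int(right), read_len
    elif "." in left or "." in right:   # assumed marker for fractional specs
        start, end = round(float(left) * read_len), round(float(right) * read_len)
    else:
        start, end = int(left), int(right)
    return max(0, start), min(read_len, end)  # clamp to the read


def union_zones(specs: List[str], read_len: int) -> List[Tuple[int, int]]:
    """Combine several masks into a sorted list of non-overlapping zones."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(parse_mask(s, read_len) for s in specs):
        if start >= end:
            continue  # empty zone after clamping
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(union_zones(["5p:60", "0.05:0.20", "3p:50"], read_len=1000))
# [(0, 200), (950, 1000)]
```

On a 1,000 bp read, 5p:60 and 0.05:0.20 overlap and merge into a single zone covering bases 0--200, while 3p:50 stays separate.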
| Flag | Long Form | Description | Default |
|---|---|---|---|
| -o | --outdir DIR | Output directory | demux_out |
| | --reads-per-file N | Split FASTQ output into N-read files | 0 (no split) |
| | --pod5-reads-per-file N | Split POD5 output into N-read files | 0 (no split) |
| | --pod5-parallel N | Concurrent pod5 subset processes during routing | 0 (auto: min(N_samples, max(1, min(nproc/2, 8)))) |
| -z | --compress | Gzip-compress output FASTQ files | off |
| | --unassigned | Write unassigned reads to separate directory | on |
| | --stats | Write per-sample statistics TSV | on |
| | --batch-size N | Reads per processing batch | 10000 |
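The automatic value for --pod5-parallel in the table above is a nested min/max expression. A minimal Python sketch of that formula, assuming nproc/2 is an integer division and using os.cpu_count() as a stand-in for nproc:

```python
import os
from typing import Optional


def auto_pod5_parallel(n_samples: int, nproc: Optional[int] = None) -> int:
    """min(N_samples, max(1, min(nproc/2, 8))) with integer division assumed."""
    if nproc is None:
        nproc = os.cpu_count() or 1
    return min(n_samples, max(1, min(nproc // 2, 8)))


print(auto_pod5_parallel(14, nproc=32))  # 8
print(auto_pod5_parallel(3, nproc=12))   # 3
print(auto_pod5_parallel(14, nproc=1))   # 1
```

The formula uses half the cores, never more than 8 processes, never more than one process per sample, and always at least one.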
demux_out/
  Tag_001/
    Tag_001.fastq      # Reads assigned to Tag_001
    Tag_001.pod5       # (only if --pod5 was given) co-routed signal
  Tag_002/
    Tag_002.fastq
    Tag_002.pod5
  ...
  unassigned/
    unassigned.fastq   # Reads that failed matching criteria
  demux_stats.tsv      # Per-sample assignment summary
FASTQ headers of assigned reads are annotated with the matched barcode name, edit distance, and confidence score.
When --pod5-reads-per-file N is set, the POD5 outputs are split into chunks named Tag_001_001.pod5, Tag_001_002.pod5, ... in the same per-sample folder.
Save and reload generation parameters via JSON for reproducibility:
{
  "coreLength": 20,
  "numTags": 96,
  "minDifference": 8,
  "maxHomopolymer": 3,
  "minGC": 0.4,
  "maxGC": 0.6,
  "distanceMetric": "levenshtein",
  "outputPrefix": "experiment_tags",
  "exportFasta": true,
  "exportTsv": true,
  "exportMatrix": false,
  "excludeSequences": [
    "AGATCGGAAGAGCACACGTCT",
    "AGATCGGAAGAGCGTCGTGTA"
  ],
  "excludeMinDistance": 4
}
Usage:
taggen --cli -c experiment_config.json -v
Standard FASTA format with sequentially numbered tag IDs:
>Tag_001
ACGTACGTACGTAC
>Tag_002
TGCATGCATGCATG
>Tag_003
GATCGATCGATCGA
Tab-separated file with header, suitable for spreadsheet analysis:
# Tag Sequence Core Only Length GC %
1 ACGTACGTACGTAC ACGTACGTACGTAC 14 50.000000
2 TGCATGCATGCATG TGCATGCATGCATG 14 50.000000
Pairwise distance matrix in CSV format for downstream analysis or clustering:
,Tag_001,Tag_002,Tag_003
Tag_001,0,8,6
Tag_002,8,0,7
Tag_003,6,7,0
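For downstream checks, the matrix can be read with the Python standard library. The sketch below assumes the layout shown above (a header row of tag names, then one named row per tag) and reports the closest pair; the function name is illustrative.

```python
import csv
import io


def closest_pair(csv_text: str):
    """Return (distance, tag_a, tag_b) for the smallest off-diagonal entry."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    names = rows[0][1:]            # header row: empty corner cell, then tag names
    best = None
    for i, row in enumerate(rows[1:]):
        for j, value in enumerate(row[1:]):
            if j > i:              # upper triangle only; the matrix is symmetric
                d = int(value)
                if best is None or d < best[0]:
                    best = (d, names[i], names[j])
    return best


example = """,Tag_001,Tag_002,Tag_003
Tag_001,0,8,6
Tag_002,8,0,7
Tag_003,6,7,0
"""
print(closest_pair(example))  # (6, 'Tag_001', 'Tag_003')
```

The smallest off-diagonal value is the effective minimum distance of the set, which should be at least the -d value requested at generation time.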
Per-sample summary with assigned read counts, expected counts, and mean Q-scores.
Scenario: Multiplex 96 samples on a standard ONT flow cell.
# Generate 96 barcodes of 14 bp, min Hamming distance 4
taggen --cli -n 96 -l 14 -d 4 --minGc 40 --maxGc 60 -r 3 -o ont_barcodes
# Demultiplex reads (standard dual-ended library)
taggen --demux --tags ont_barcodes.fasta --reads reads.fastq \
  --mode end --trim-mode ends
Expected performance: >97% correct barcode assignment at typical ONT error rates (10--15%).
Scenario: Direct RNA sequencing of degraded clinical samples. 48-sample experiment with a custom ligation adapter.
# Generate 48 barcodes with stringent diversity, excluding adapter sequences
taggen --cli -n 48 -l 24 -d 10 --metric levenshtein \
--minGc 45 --maxGc 55 -r 2 \
-f custom_adapter.fasta -o uc1_tags -v
# Demultiplex (end mode for standard 5'/3' tagging)
taggen --demux --tags uc1_tags.fasta --reads reads.fastq \
  --mode end --trim-mode all
Expected performance: >98% correct assignment even at 20% error rate.
Scenario: Older R9.4.1 flow cells producing reads at 85--90% identity. Barcode is at the 5' end but may be shifted a few bases due to adapter degradation.
# Generate 96 barcodes of 30 bp, Levenshtein distance 8
taggen --cli -n 96 -l 30 -d 8 --metric levenshtein -o uc2_tags -v
# Demultiplex with full-mode search + 5' position mask (first 60 bp)
taggen --demux --tags uc2_tags.fasta --reads reads.fastq \
  --mode full --position-mask 5p:60 --trim-mode all
Expected performance: 97% accuracy at 95% identity, 84% at 85% identity, with near-zero misassignment.
Scenario: Custom spatial transcriptomics array with 384 capture spots. Barcode is embedded in a synthetic RNA spike-in at 5--20% of read length. Zero cross-sample contamination required.
# Generate 384 barcodes with pairwise distance heatmap
taggen --cli -n 384 -l 30 -d 8 --metric levenshtein --heatmap -o uc3_tags -v
# Demultiplex with fractional position mask and strict threshold
taggen --demux --tags uc3_tags.fasta --reads reads.fastq \
--mode full --position-mask 0.05:0.20 \
  --max-dist 4 --trim-mode all
Expected performance: Misassignment reduced to <0.001% at 90% read identity; approximately 10--15% of reads left unassigned (acceptable when sample integrity is paramount).
TagGen implements a two-phase algorithm:
Rather than enumerating all 4^n possible sequences (which causes memory exhaustion at n >= 14), TagGen generates random candidate sequences in parallel across all CPU cores. Each candidate is validated in real-time against user-specified constraints:
- GC content within bounds
- No homopolymer runs exceeding the limit
- Not matching any exclude sequence within the minimum distance
Valid candidates are collected into a thread-safe pool (typically ~100,000 candidates in ~55 ms). The Mersenne Twister PRNG is used for sequence generation.
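The sampling phase can be sketched in a few lines of Python. This is a single-threaded illustration under stated assumptions, not TagGen's parallel D code: the function names, the duplicate check, and the attempt cap are choices made for this example, and the exclude-sequence filter is omitted. Python's random module is itself a Mersenne Twister generator.

```python
import random


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def max_homopolymer(seq: str) -> int:
    """Length of the longest run of identical bases."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def sample_candidates(n_candidates, length, min_gc=0.30, max_gc=0.70,
                      max_run=3, seed=0, max_attempts=1_000_000):
    """Draw random sequences and keep those passing the quality constraints."""
    rng = random.Random(seed)
    pool, seen = [], set()
    for _ in range(max_attempts):
        if len(pool) == n_candidates:
            break
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        if seq in seen:
            continue
        if not (min_gc <= gc_fraction(seq) <= max_gc):
            continue
        if max_homopolymer(seq) > max_run:
            continue
        seen.add(seq)
        pool.append(seq)
    return pool


pool = sample_candidates(1000, 14, seed=42)
print(len(pool), pool[:3])
```

Because each candidate is validated independently, the work parallelises naturally: each thread can sample with its own generator and append to a shared pool.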
From the candidate pool, TagGen iteratively selects the sequence that maximises the minimum Hamming or Levenshtein distance to all previously selected barcodes. This greedy approach produces a locally optimal but not globally optimal set. For practical applications (96--384 barcodes), the approximation is highly effective, typically achieving minimum inter-tag distances of 12--15 bp for d=8 nominal targets.
Selection completes in ~25 ms for 96 barcodes from 100,000 candidates.
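A minimal Python sketch of greedy max-min selection with cached minimum distances follows. The starting tag (the first candidate), the stopping rule, and the function names are assumptions for illustration; TagGen's implementation may choose differently.

```python
def hamming(a: str, b: str) -> int:
    """Mismatch count between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def greedy_select(candidates, n_tags, min_distance, dist=hamming):
    """Repeatedly pick the candidate farthest from everything selected so far."""
    if not candidates or n_tags < 1:
        return []
    selected = [candidates[0]]  # arbitrary seed tag (assumption)
    # cached[i] = distance from candidates[i] to its nearest selected tag
    cached = [dist(c, selected[0]) for c in candidates]
    while len(selected) < n_tags:
        best = max(range(len(candidates)), key=cached.__getitem__)
        if cached[best] < min_distance:
            break  # no remaining candidate satisfies the distance target
        chosen = candidates[best]
        selected.append(chosen)
        # One pass per selected tag keeps the total cost at O(n*k) distance calls.
        for i, c in enumerate(candidates):
            d = dist(c, chosen)
            if d < cached[i]:
                cached[i] = d
    return selected


candidates = ["AAAA", "AAAT", "TTTT", "CCCC", "AATT", "GGGG"]
print(greedy_select(candidates, n_tags=4, min_distance=3))
# ['AAAA', 'TTTT', 'CCCC', 'GGGG']
```

Caching the nearest-selected distance for every candidate avoids recomputing all pairwise distances on each round, which is what makes selection from a 100,000-candidate pool fast.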
The demultiplexer uses a two-stage pipeline:
- K-mer voting (Stage 1): All 8-mers in the search region are looked up against a pre-built index mapping each 8-mer to the set of barcodes containing it. Candidate barcodes are ranked by hit count, and top candidates' hit positions are averaged to estimate barcode location.
- Banded edit-distance alignment (Stage 2): A sliding window of (tag_length + 15) bp centred on the estimated location is compared against each candidate barcode via banded Levenshtein or Hamming distance. The best match is accepted if:
  - The edit distance does not exceed the maximum threshold
  - The margin between best and second-best candidate is at least 2 (ambiguity guard)
  - The confidence score exceeds the minimum threshold
Reads failing any criterion are written to the unassigned output, annotated with the rejection reason (no_match, ambiguous, or low_confidence).
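The k-mer voting stage and the acceptance rules can be sketched as below. This Python illustration makes several assumptions: the function names and example barcodes are hypothetical, the Stage 2 alignment is replaced by precomputed distances passed to decide(), and the confidence score is left out because its formula is not given here.

```python
from collections import Counter, defaultdict

K = 8  # k-mer size used for voting


def build_index(barcodes: dict) -> dict:
    """Map each 8-mer to the set of barcode names containing it."""
    index = defaultdict(set)
    for name, seq in barcodes.items():
        for i in range(len(seq) - K + 1):
            index[seq[i:i + K]].add(name)
    return index


def vote(region: str, index: dict):
    """Rank barcodes by k-mer hits; report the mean position of their hits."""
    hits, positions = Counter(), defaultdict(list)
    for pos in range(len(region) - K + 1):
        for name in index.get(region[pos:pos + K], ()):
            hits[name] += 1
            positions[name].append(pos)
    return [(name, count, sum(positions[name]) / count)
            for name, count in hits.most_common()]


def decide(distances: dict, max_dist: int, min_margin: int = 2):
    """Apply the distance threshold and the ambiguity guard to candidates."""
    ranked = sorted(distances.items(), key=lambda kv: kv[1])
    best_name, best = ranked[0]
    if best > max_dist:
        return None, "no_match"
    if len(ranked) > 1 and ranked[1][1] - best < min_margin:
        return None, "ambiguous"
    return best_name, "assigned"


barcodes = {"Tag_001": "ACGTTGCAAGCTTAGC", "Tag_002": "TTGACCGTAGGATCCA"}
region = "GGGGG" + barcodes["Tag_001"] + "TTTTT"
print(vote(region, build_index(barcodes)))      # [('Tag_001', 9, 9.0)]
print(decide({"Tag_001": 1, "Tag_002": 5}, 2))  # ('Tag_001', 'assigned')
```

Voting only needs exact 8-mer matches, so it narrows the field cheaply before the more expensive banded alignment is run on a handful of candidates.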
| Configuration | TagGen | DNABarcodes | Speedup |
|---|---|---|---|
| 8 bp, d>=3 | 24 ms | 120 ms | 5x |
| 10 bp, d>=3 | 31 ms | 24 s | 770x |
| 12 bp, d>=4 | 34 ms | 463 s | 13,600x |
| 14 bp, d>=4 | 35 ms | Memory exhaustion | -- |
| 16 bp, d>=4 | 35 ms | Memory exhaustion | -- |
| 20 bp, d>=6 | 44 ms | Memory exhaustion | -- |
| 30 bp, d>=8 | 29 ms | Memory exhaustion | -- |
All benchmarks: 96 target barcodes, Hamming distance, GC 25--75%. Wall-clock times averaged over 3 replicates. Linux, 12-core AMD processor.
| Barcode Length | Min. Dist. | 5% Error | 10% Error | 15% Error | 20% Error | 25% Error |
|---|---|---|---|---|---|---|
| 10 bp | d>=3 | 99.8% | 96.2% | 91.3% | 83.3% | 74.7% |
| 12 bp | d>=4 | 99.7% | 99.0% | 96.1% | 88.6% | 82.0% |
| 14 bp | d>=4 | 100% | 99.5% | 97.9% | 94.1% | 86.1% |
| 16 bp | d>=5 | 99.9% | 100% | 99.1% | 95.8% | 92.2% |
| 20 bp | d>=6 | 100% | 99.9% | 99.6% | 99.6% | 96.4% |
| 30 bp | d>=8 | 100% | 100% | 100% | 100% | 99.6% |
Values indicate the percentage of reads correctly assigned to the original barcode. Error model: 50% deletions, 25% insertions, 25% substitutions. N = 1,000 simulated reads per barcode per condition.
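An error model with this 50/25/25 split can be simulated with a short Python function. The sketch assumes independent per-base errors at a fixed rate, which is a simplification chosen for illustration and not necessarily the procedure used for the table above.

```python
import random


def mutate(seq: str, error_rate: float, rng: random.Random,
           weights=(0.50, 0.25, 0.25)) -> str:
    """Apply per-base errors; weights are (deletion, insertion, substitution)."""
    out = []
    for base in seq:
        if rng.random() >= error_rate:
            out.append(base)                 # base read correctly
            continue
        kind = rng.choices(("del", "ins", "sub"), weights=weights)[0]
        if kind == "del":
            continue                         # drop the base
        if kind == "ins":
            out.append(base)
            out.append(rng.choice("ACGT"))   # extra random base after it
        else:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
    return "".join(out)


rng = random.Random(7)
barcode = "ACGTTGCAAGCTTAGCATGC"
for rate in (0.05, 0.15, 0.25):
    print(rate, mutate(barcode, rate, rng))
```

Feeding such mutated barcodes into a demultiplexer and counting correct assignments gives an accuracy estimate for any length and distance combination.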
| Application | Length | Count | Distance | Metric | Notes |
|---|---|---|---|---|---|
| Standard ONT (R10.4/Kit14) | 14 bp | 96 | d=4 | Hamming | >97% accuracy at 95% identity |
| High-throughput ONT | 20 bp | 96--384 | d=6--8 | Levenshtein | Recommended for most new experiments |
| Direct RNA sequencing | 24--30 bp | 48--96 | d=8--10 | Levenshtein | High error tolerance |
| Spatial transcriptomics | 30 bp | 384 | d=8 | Levenshtein | Use fractional position mask + --max-dist 4 |
| Clinical (zero misassignment) | 30 bp | 48--96 | d=8 | Levenshtein | Use --max-dist 4 for <0.001% misassignment |
- Levenshtein distance is recommended over Hamming for all ONT workflows, as it accounts for the indel-dominated error profile and achieves 3--8 percentage points higher accuracy at 85--90% read identity.
- Longer barcodes (20--30 bp) provide substantially better error resilience. At 20% error, 30 bp barcodes maintain 100% correct assignment while 10 bp barcodes degrade to 83%.
- Position masks are important in full-mode search to avoid spurious off-target matches. Without a mask, accuracy drops by 6--8 percentage points.
- --max-dist controls the trade-off between assignment sensitivity and sample purity. The default adaptive threshold keeps misassignment below 0.5%. For clinical applications, use --max-dist 4 with 30 bp tags to reduce misassignment to <0.001% at the cost of ~10--15% more unassigned reads.
Full changelog in CHANGELOG.md. Recent highlights:
- POD5 routing parallelised across K concurrent pod5 subset processes (--pod5-parallel N, GUI SpinButton). Bin-packed buckets share the OS-page-cached input. Measured 7.1× routing speedup vs 1.2.4 on a 14-sample / 92 GB nanopore run (kaya HPC, K=8).
- POD5 routing now issues one combined pod5 subset call instead of one-per-sample (~1.3× routing speedup from per-call setup amortisation).
- macOS .app bundle: native Mach-O launcher replaces shell-script Contents/MacOS/TagGen to satisfy Tahoe Gatekeeper (fixes _LSOpenURLsWithCompletionHandler() error -10669).
- TagGen confirmed LDC 1.42+ compatible on Linux and macOS.
- FASTQ parser fix: read IDs now split on any whitespace (TAB or space). Dorado emits SAM-style auxiliary tags TAB-separated; v1.2.1/1.2.2 captured the entire TSV header line as the read id, breaking POD5 routing (pod5 subset reported Found 0 read_ids). Closed the cluster-side regression that affected all dorado-basecalled inputs.
- POD5 read-id normalisation (lowercase + strip read_ prefix) as defensive belt-and-braces for non-dorado pipelines that emit uppercase or prefixed UUIDs.
- Failed-routing CSV preservation (TAGGEN_DEBUG_POD5=1) for post-mortem inspection.
- Fixed POD5 co-demultiplexing (--pod5): switched from the unsupported --read-id-file flag to pod5 subset --csv direct mapping with positional inputs
- All input POD5 files now passed in a single pod5 subset call per sample, with --missing-ok so partial coverage no longer aborts routing
- Position masks refactored from exclude to include semantics
- GUI: 3-mode anchor selector per zone (From 3' end / From 5' end / Custom range)
- CLI: --position-mask accepts 3p:N and 5p:N shorthand
- Comprehensive demux test framework with parameter sweeps
- Positional mask for full-mode demux: constrain tag search to specific read regions
- Post-deconvolution statistics window with four visualisation tabs
- Integrated demultiplexer (GUI tab + CLI --demux mode)
- POD5 signal file co-demultiplexing
- Tag trimming modes: none, ends, all
- Confidence scoring with ambiguity detection
- Levenshtein edit distance for tag selection (--metric levenshtein)
- Greedy selection performance fix: O(n*k) cached min-distances
- Interactive heatmap visualisation of pairwise tag distances
- Exclude sequences feature with sliding-window comparison
- Increased maximum tag limit to 1,000,000
- Initial release: GUI, parallel generation, greedy selection, TSV/FASTA export
TagGen is released under the MIT License.
MIT License
Copyright (c) 2026 Biocodecs, Arnaroo Ribologicals, RMODEL
- Faiza Chowdhury
- Tessa Swain
- Roderik Shirokikh
- Danielle L. Rudler
- Archa H. Fox
- Alice Cleynen
- Nikolay E. Shirokikh
School of Human Sciences / School of Molecular Sciences, The University of Western Australia, Perth, WA, Australia
France-Australia Mathematical Sciences and Interactions, CNRS International Research Laboratory, Canberra, ACT, Australia
Contact: nikolay.shirokikh@uwa.edu.au, alice.cleynen@cnrs.fr
Repository: https://github.com/Arnaroo/taggen
