Fast, modular FASTA toolkit built in Zig.
SIMD-accelerated indexing, O(1) region extraction, and instant assembly stats.
samtools-compatible FASTA indexing and extraction, benchmarked against seqkit, fastahack, and pyfaidx.
Current release: v0.2.8
Quick links: Supported Today · Installation · Usage · Performance & Correctness · Benchmarking · Roadmap
z-fasta is focused on uncompressed FASTA workflows: building indexes, extracting one or many indexed regions, and computing assembly/proteome statistics. It supports compact .zfi indexes and samtools-compatible .fai output. The get command accepts positional regions, BED files, BED from stdin, and names files, and it supports strand-aware extraction plus explicit orientation transforms through --rc, --reverse-only, --complement-only, and --annotate-rc. FASTQ and compressed FASTA/BGZF streams remain outside the current scope.
Modern bioinformatics workflows are often bottlenecked by legacy text parsers. z-fasta keeps the hot paths close to the data: memory-mapped FASTA input for the default indexer, explicit SIMD header scanning, compact binary indexes, and startup-conscious CLI dispatch for tiny commands.
- samtools-compatible output: Both z-fasta index --emit-fai and z-fasta get produce output byte-identical to samtools faidx for the verified cases. Lookup falls back from .zfi to .fai with mtime + file-size staleness validation.
- Single binary: No dependencies, no conda environments, no glibc version errors.
- Arena-scoped allocations: Uses Zig's ArenaAllocator for short-lived command state, keeping heap overhead low and cleanup simple.
# Download Zig 0.16.0 if you are not using the vendored toolchain
curl -L https://ziglang.org/download/0.16.0/zig-linux-x86_64-0.16.0.tar.xz | tar xJ
# Build with the repo-local Zig wrapper (uses ./zig-0.16.0/zig)
./zig build -Doptimize=ReleaseFast
# The executable is now at ./zig-out/bin/z-fasta
./zig-out/bin/z-fasta --help

z-fasta index [options] <file.fasta>
Options:
--emit-fai Output FAI format to stdout (default: create .zfi binary file)
--no-dedup Disable duplicate name filtering (maximizes speed)
--low-mem Use chunked reader instead of mmap (limits RAM to 4 MB)
--help Show help message
--version Print version

z-fasta get <file.fasta> [--bed file.bed|-] [--names file.txt]
[--strand-aware] [--summary]
[--rc|--complement-only|--reverse-only] [--annotate-rc]
[--chunk-size N|-1] <region> [region ...]

Extract one or more sequences or sub-regions from an indexed FASTA file. Output is byte-identical to samtools faidx for the positional-region path, and the BED / names batch flows are verified against bedtools getfasta and samtools faidx -r. Multiple regions are accepted in a single call; the index loads once and results stream in CLI order. BED rows and names-file entries are appended in source order ahead of later positional arguments.
Requires an index: either .zfi (preferred) or .fai. If .zfi is not found, falls back to .fai automatically.
Region formats:
| Format | Description |
|---|---|
| NAME | Full sequence |
| NAME:START-END | 1-based, inclusive sub-region |
| NAME:START- | From START to end of sequence |
Handles Ensembl-style names containing colons (e.g., chromosome:GRCh38:1:1:248956422:1).
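The colon handling above can be illustrated with a small Python sketch. This is not z-fasta's parser; it assumes one plausible rule, namely that a coordinate suffix is only recognized when the text after the last colon looks like START-END or START-, and that an exact match against a known sequence name wins. The function name parse_region and the known_names parameter are hypothetical.

```python
import re

# A coordinate suffix is ":START-END" or ":START-" at the very end of the string.
# The greedy ".+" keeps every earlier colon inside the sequence name.
_RANGE = re.compile(r"^(?P<name>.+):(?P<start>\d+)-(?P<end>\d*)$")


def parse_region(region, known_names=frozenset()):
    """Return (name, start, end); start/end are 1-based inclusive or None.

    Assumption for this sketch: an exact hit in known_names is treated as a
    full-sequence request even if the name happens to end in digits and a dash.
    """
    if region in known_names:
        return region, None, None
    match = _RANGE.match(region)
    if match is None:
        # No coordinate suffix: the whole string is the sequence name.
        return region, None, None
    start = int(match["start"])
    end = int(match["end"]) if match["end"] else None  # None means "to end of sequence"
    return match["name"], start, end
```

With this rule, chromosome:GRCh38:1:1:248956422:1 is returned as a full-sequence name because its final field has no dash, while appending :100-200 yields that same name with the range 100 to 200.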
Additional GET flags:
| Flag | Description |
|---|---|
| --bed file.bed | Read BED regions from a file. BED coordinates are 0-based, half-open; z-fasta converts them to 1-based inclusive internally. |
| --bed - | Read BED regions from stdin. |
| --names file.txt | Read one full-sequence name per line. Useful for long batch lists. |
| --strand-aware | Use BED column 6 (strand). A - strand applies reverse-complement orientation before any global orientation flag. Alias: --honor-strand. |
| --rc | Reverse-complement the extracted sequence. Verified against samtools faidx -i --mark-strand no. |
| --complement-only | Complement the extracted sequence without reversing it. Mutually exclusive with --rc and --reverse-only. |
| --reverse-only | Reverse the extracted sequence without complementing it. Mutually exclusive with --rc and --complement-only. |
| --annotate-rc | Append a human-readable transform suffix to headers, for example (reverse complement). Default headers stay samtools-style and unannotated. |
| --summary | Print region count, total bases, elapsed time, and regions/sec to stderr. |
| --chunk-size N | Process BED rows in batches instead of resolving the entire BED in one batch. Default: 4096, which is the current best speed/memory tradeoff on the checked benchmark workloads. |
| --chunk-size -1 | Process all BED rows in a single batch when memory use is acceptable. |
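As a rough illustration of the BED coordinate conversion and the --chunk-size batching described in the table, here is a Python sketch. It is not the Zig implementation. The names Region, parse_bed_line, and batches are hypothetical, and skipping blank lines and # comment lines is an assumption of this sketch rather than documented z-fasta behavior.

```python
from itertools import islice
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+", "-", or "." when BED column 6 is absent


def parse_bed_line(line):
    """Convert one tab-separated BED row (0-based, half-open) to a 1-based inclusive Region."""
    fields = line.rstrip("\n").split("\t")
    chrom, bed_start, bed_end = fields[0], int(fields[1]), int(fields[2])
    strand = fields[5] if len(fields) > 5 else "."
    # Half-open [start, end) becomes inclusive [start + 1, end]; the end value is unchanged.
    return Region(chrom, bed_start + 1, bed_end, strand)


def batches(rows, chunk_size=4096):
    """Yield lists of Regions; chunk_size == -1 means one batch holding every row."""
    regions = (
        parse_bed_line(row)
        for row in rows
        if row.strip() and not row.startswith("#")  # assumption: skip blanks and comments
    )
    if chunk_size == -1:
        everything = list(regions)
        if everything:
            yield everything
        return
    # islice pulls at most chunk_size rows at a time, so memory stays bounded per batch.
    while chunk := list(islice(regions, chunk_size)):
        yield chunk
```

A BED row chr1, 0, 100 therefore corresponds to the positional region chr1:1-100, and a bounded chunk size keeps at most that many resolved regions in memory at once.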
Complement-based transforms are rejected for protein FASTA input with a clear error. This keeps --rc and --complement-only biologically constrained to nucleotide-like records.
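The three orientation transforms and the protein rejection can be sketched in Python as follows. This is an illustration only: the mode strings mirror the CLI flag names, but the transform function, the IUPAC complement table, and the rule used here to decide that a record is not nucleotide-like (any character outside the IUPAC nucleotide codes) are assumptions, not z-fasta's actual detection logic.

```python
# IUPAC nucleotide complement table, case-preserving.
_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)
_NUCLEOTIDE_CODES = set("ACGTRYSWKMBDHVNacgtryswkmbdhvn")


def transform(seq, mode):
    """Apply one of 'rc', 'complement-only', or 'reverse-only' to a sequence string."""
    if mode == "reverse-only":
        # Pure reversal has no biological alphabet requirement.
        return seq[::-1]
    if mode not in ("rc", "complement-only"):
        raise ValueError(f"unknown mode: {mode}")
    if not set(seq) <= _NUCLEOTIDE_CODES:
        # Assumed stand-in for the protein check: refuse to complement
        # anything containing non-nucleotide characters.
        raise ValueError("complement-based transforms require nucleotide input")
    complemented = seq.translate(_COMPLEMENT)
    return complemented[::-1] if mode == "rc" else complemented
```

Because reverse-complement is its own inverse, applying it twice returns the original sequence. Under the ordering described for --strand-aware, a minus-strand BED row combined with a global --rc would therefore come out in forward orientation.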
z-fasta stats [options] <file.fasta>
Options:
--index-only Compute stats from index only (no FASTA scan; startup-dominated)

Compute assembly/proteome statistics. Automatically detects nucleotide vs. protein sequences.
Tier 1 (index-only): sequence count, total bases, min/max/mean/median lengths, N50, L50, N90, L90, AU, duplicate count.
Tier 2 (default): full composition scan: nucleotide frequencies, GC content (N excluded), GC skew, soft-masked fraction. For proteins: top 3 amino acids with full names.
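For readers unfamiliar with these metrics, the following Python sketch shows the conventional definitions. It is not taken from the z-fasta source. Two assumptions are made: AU is read as the area under the Nx curve (often written auN, the sum of squared lengths divided by total length), and "N excluded" is read as using only A, C, G, and T in the GC denominator. The function names are hypothetical, and the inputs are assumed to be non-empty.

```python
def nx_lx(lengths, percent):
    """Return (Nx, Lx): the length at which the cumulative sum of the sorted
    (longest-first) lengths reaches percent% of the total, and how many
    sequences it took to get there."""
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        # Integer comparison avoids floating-point edge cases at the threshold.
        if running * 100 >= total * percent:
            return length, count
    raise ValueError("lengths must be non-empty")


def au_n(lengths):
    """Area under the Nx curve (assumed meaning of AU): sum(L^2) / sum(L)."""
    return sum(length * length for length in lengths) / sum(lengths)


def composition(seq):
    """Return (gc_content, gc_skew, soft_masked_fraction) for a nucleotide string."""
    upper = seq.upper()
    a, c, g, t = (upper.count(base) for base in "ACGT")
    gc_content = (g + c) / (a + c + g + t)  # N and other codes left out of the denominator
    gc_skew = (g - c) / (g + c)
    soft_masked = sum(ch.islower() for ch in seq) / len(seq)  # lowercase = soft-masked
    return gc_content, gc_skew, soft_masked
```

The length-derived functions only need the per-sequence lengths stored in an index, which is why that tier can run without scanning sequence bytes; the composition function has to read every base.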
# Create .zfi binary index (default, compact binary format)
z-fasta index genome.fa
# Output .fai to stdout (samtools-compatible)
z-fasta index --emit-fai genome.fa > genome.fai
# Extract a full sequence
z-fasta get genome.fa chr1
# Extract a sub-region (1-based, inclusive)
z-fasta get genome.fa chr1:1000000-2000000
# Extract multiple regions in one call (index loads once)
z-fasta get genome.fa chr1:1000-2000 chr2:5000-6000 chrX:100-200
# Reverse-complement a region
z-fasta get genome.fa chr1:1000-2000 --rc
# Reverse without complementing
z-fasta get genome.fa chr1:1000-2000 --reverse-only
# Complement without reversing
z-fasta get genome.fa chr1:1000-2000 --complement-only
# Add explicit transform text to the FASTA header
z-fasta get genome.fa chr1:1000-2000 --rc --annotate-rc
# Extract regions from BED
z-fasta get genome.fa --bed regions.bed
# Read BED from stdin
awk '$5 > 100' raw.bed | z-fasta get genome.fa --bed -
# Extract whole sequences from a names file
z-fasta get genome.fa --names ids.txt
# Respect BED strand and print a stderr summary
z-fasta get genome.fa --bed regions.bed --strand-aware --summary
# Compose BED strand handling with a global reverse-complement flip
z-fasta get genome.fa --bed regions.bed --honor-strand --rc
# Assembly stats (full composition scan)
z-fasta stats genome.fa
# Quick stats from index only (does not scan FASTA sequence bytes)
z-fasta stats --index-only genome.fa

All timings on AMD Ryzen 9 3950X, warm cache.
| Dataset | Size | z-fasta (no-dedup) | samtools | fastahack | pyfaidx | Speedup vs samtools |
|---|---|---|---|---|---|---|
| Human Genome | 3.0 GB | 0.39s | 9.03s | 21.73s | 27.48s | 22.9× |
| Transcriptome | 972 MB | 0.093s | 1.79s | 5.72s | 6.50s | 19.3× |
| Proteome | 66 MB | 0.0056s | 0.055s | 0.275s | 0.368s | 10.0× |
| Mode | Genome timing | Memory behavior |
|---|---|---|
| --no-dedup | 0.39s | Fastest on repeated-name-free inputs. mmap-backed; MaxRSS reflects mapped pages. |
| default | 0.40s | Deduplicates names while staying in the same mmap-backed performance class. |
| --low-mem | 2.46s | Streaming path; measured at 4.5 MB MaxRSS on the genome benchmark. |
mmap modes show RSS close to the mapped FASTA size because /usr/bin/time -v counts mapped pages, not just private heap. See bench/index/REPORT.md for full scaling curves and memory analysis.
| Dataset | Region | z-fasta | samtools | seqtk | pyfaidx | Speedup vs samtools |
|---|---|---|---|---|---|---|
| Any (warm cache) | 100 bp – 10 kbp | 0.7–0.9 ms | 1.5–1.6 ms | 4–34 ms | ~60 ms | 1.8–2.1× |
| Proteome (14 MB) | 1 kbp region | 1.3 ms | 10.9 ms | 7.2 ms | 119 ms | 8.4× |
| Transcriptome (972 MB) | 1 kbp region | 25.3 ms | 278.7 ms | 220.3 ms | 1103 ms | 11.0× |
Small-region extraction is O(1), but on this host the end-to-end CLI path is startup-dominated below roughly 10 kbp. The historical checked-in benchmark report for v0.2.6 was generated under a faster local benchmark environment than the current reruns; direct side-by-side rebuilds of v0.2.6, v0.2.7, and current main on the same machine do not reproduce a material no-flag get regression. For very large full-sequence extraction, fastahack can still win on raw write-path overhead; z-fasta stays ahead of samtools across the real-dataset GET cases.
Orientation note: the shipped --rc path keeps the same mmap-backed extraction model and applies reverse traversal plus complement lookup during emission, rather than materializing a second copy of the region. The main GET benchmark report now includes dedicated RC timing and RSS sections against samtools faidx -i and bedtools getfasta | seqtk seq -r, and the implementation choice is summarized in bench/get/RC_STRATEGY.md.
Multi-region (v0.2.4): z-fasta get accepts multiple regions per call, loading the index once and streaming all results in CLI order.
| Regions | z-fasta | samtools | seqtk | Speedup vs samtools |
|---|---|---|---|---|
| 1 | 25.6 ms | 289 ms | 221 ms | 11.3× |
| 10 | 33.8 ms | 283 ms | 226 ms | 8.4× |
| 50 | 66.7 ms | 292 ms | 225 ms | 4.4× |
| 100 | 66.7 ms | 279 ms | 222 ms | 4.2× |
Benchmarked on REAL_Transcriptome.fa (972 MB, 254,070 sequences). Latency is dominated by index resolution and output setup rather than region byte count. seqtk performs a full-file scan per call regardless of region count and is listed for reference only.
Run .venv/bin/python bench/get/generate_report.py to regenerate the full GET report under bench/get/REPORT.md, including RC positional/BED comparisons and the RC memory snapshot.
| Mode | Dataset | z-fasta | seqkit -a | seqtk comp | Speedup vs seqkit -a |
|---|---|---|---|---|---|
| Index-only | Genome (3.0 GB) | 0.9 ms | 17.45 s | N/A | ~19,000× |
| Index-only | Proteome (14 MB) | 2.9 ms | 57.8 ms | N/A | ~20× |
| Full scan | 1 GB single-seq file | 0.78 s | 5.62 s | 2.65 s | ~7× |
| Full scan | Proteome (14 MB) | 11.8 ms | 57.8 ms | 93.0 ms | ~4.9× |
Index-only time is effectively constant with file size and is best described as startup-dominated. It reads .zfi index data and computes length-derived metrics without scanning FASTA sequence bytes. Full-scan throughput on synthetic files is ~1.3 GB/s, and the latest benchmark report has z-fasta ahead of seqkit on the real genome/proteome/transcriptome stats cases while still computing richer statistics. See bench/stats/REPORT.md for full results.
- Index: 20/20 edge cases match samtools faidx (exit codes and output).
- Get: 90/90 single-region and 22/22 multi-region byte-identical diff tests pass vs samtools across 5+ test files: full sequences, sub-regions, single bases, line-boundary spans, clamped ranges, duplicate regions, reversed CLI order, sort-path (≥16 regions).
- BED / names batch extraction: 16/16 verification cases pass in bench/get/verify_bed.sh, covering default BED, --bed -, stranded BED vs bedtools getfasta -s, default BED vs samtools faidx -r, and --names batch extraction.
- Reverse / complement extraction: 19/19 verification cases pass in bench/get/verify_rc.sh, covering --rc vs samtools faidx -i --mark-strand no, exact-output checks for --reverse-only, --complement-only, and --annotate-rc, multi-region concatenation, BED --honor-strand --rc composition, protein rejection, and a synthetic chromosome-like full-sequence case.
- Stats: 107/107 BioPython verification tests pass: exact agreement on all Tier 1 and Tier 2 values across nucleotide and protein files.
- Unit tests: 102/102 Zig unit tests (26 index · 30 get · 33 stats · 7 complement · 6 BED parser).
- Messy FASTA: z-fasta is the only tool tested that correctly indexes mixed-width and trailing-whitespace FASTA files. samtools, fastahack, and pyfaidx all reject them. See bench/index/REPORT.md for the full compatibility matrix.
# Download real test data (~4 GB, one-time)
bash bench/shared/download_data.sh
# ── Index ─────────────────────────────────────────────────────────
./zig build -Doptimize=ReleaseFast
bash bench/index/run_benchmarks.sh # timing + memory
bash bench/index/run_tests.sh # 20 edge-case correctness tests
.venv/bin/python bench/index/generate_report.py # → bench/index/REPORT.md
# ── Get ───────────────────────────────────────────────────────────
bash bench/get/run_benchmarks.sh # latency, scaling, real datasets
bash bench/get/verify_get.sh # 90 byte-identical diff tests vs samtools
bash bench/get/verify_bed.sh # 16 BED / names verification cases vs bedtools + samtools
bash bench/get/verify_rc.sh # 19 RC / reverse / complement verification cases
.venv/bin/python bench/get/generate_report.py # → bench/get/REPORT.md
# ── Stats ─────────────────────────────────────────────────────────
bash bench/stats/run_benchmarks.sh # full/index-only, scaling, throughput
.venv/bin/python bench/stats/verify_stats.py # 107 BioPython verification tests
.venv/bin/python bench/stats/generate_report.py # → bench/stats/REPORT.md

Full local refresh, in the same order used before publishing benchmark updates:

./zig build -Doptimize=ReleaseFast && bash bench/index/run_tests.sh && bash bench/get/verify_get.sh && bash bench/get/verify_multi_get.sh && .venv/bin/python bench/stats/verify_stats.py && bash bench/index/run_benchmarks.sh --runs 5 && .venv/bin/python bench/index/generate_report.py && bash bench/get/run_benchmarks.sh --runs 5 && .venv/bin/python bench/get/generate_report.py && bash bench/stats/run_benchmarks.sh --runs 5 && .venv/bin/python bench/stats/generate_report.py && bash bench/perf-recovery/run_startup.sh

Add --skip-real to the get / stats scripts to skip real dataset runs (~3 GB downloads required otherwise). See bench/README.md for prerequisites and full instructions. The shipped reverse-path note is in bench/get/RC_STRATEGY.md.
| Format | Flag | Description |
|---|---|---|
| .zfi | (default) | Compact binary index. Fast to read/write programmatically. |
| .fai | --emit-fai | Tab-separated text, identical to samtools faidx output. |
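To make the .fai columns and the O(1) offset arithmetic concrete, here is a Python sketch that builds the five standard samtools faidx fields (name, length, offset, line bases, line width) from FASTA bytes and then uses them to slice a region. It illustrates the text format only. It does not describe the .zfi layout or z-fasta's Zig code, it assumes uniform line widths and \n line endings, and the names FaiRecord, build_fai, and fetch are hypothetical.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    name: str
    length: int     # total bases in the sequence
    offset: int     # byte offset of the first base
    linebases: int  # bases per full line
    linewidth: int  # bytes per full line, newline included


def build_fai(data):
    """Scan FASTA bytes once and return one FaiRecord per sequence."""
    rows = []  # mutable [name, length, offset, linebases, linewidth]
    pos = 0
    while pos < len(data):
        line_end = data.find(b"\n", pos)
        if line_end == -1:
            line_end = len(data)
        line = data[pos:line_end]
        next_pos = min(line_end + 1, len(data))
        if line.startswith(b">"):
            parts = line[1:].split(None, 1)  # name is the first whitespace-delimited token
            rows.append([parts[0].decode() if parts else "", 0, next_pos, 0, 0])
        elif rows and line:
            row = rows[-1]
            if row[3] == 0:  # the first sequence line fixes the line geometry
                row[3], row[4] = len(line), next_pos - pos
            row[1] += len(line)
        pos = line_end + 1
    return [FaiRecord(*row) for row in rows]


def fetch(data, rec, start, end):
    """Return bases start..end (1-based, inclusive), clamping end to the sequence length.

    Assumes 1 <= start <= end and a non-empty sequence.
    """
    end = min(end, rec.length)

    def byte_at(position):
        index = position - 1
        # Full lines before the base contribute linewidth bytes each.
        return rec.offset + (index // rec.linebases) * rec.linewidth + index % rec.linebases

    raw = data[byte_at(start):byte_at(end) + 1]
    return raw.replace(b"\n", b"").decode()
```

Locating the first and last byte is pure arithmetic on the index record, so lookup cost does not depend on where the region sits in the file; only the copy itself scales with region length.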
# Build (debug)
./zig build
# Run all tests (index + get + stats)
./zig build test --summary all
# Build optimized binary
./zig build -Doptimize=ReleaseFast

Delivered
- z-fasta index: SIMD-accelerated FASTA indexing (v0.1)
- z-fasta get: O(1) byte-offset sequence extraction (v0.2)
- z-fasta stats: Assembly/proteome statistics with index-only mode (v0.2)
- Unified benchmark suite with per-module reports and figures (v0.2.2)
- Expanded tool comparison: pyfaidx, seqtk added across all benchmark modules; messy FASTA compatibility matrix (v0.2.3)
- Multi-region get: single call with N regions, index loads once, results stream in CLI order; ~2× faster than samtools across 1–100 regions (v0.2.4)
- Zig 0.16.0 migration plus benchmark/report refresh for v0.2.5
- v0.2.6 performance recovery: lower startup overhead, faster index loading, buffered GET emission, fixed-width stats/index fast paths, and refreshed benchmark reports
- v0.2.7 BED batch extraction: --bed, --bed -, --names, --strand-aware, bounded chunked processing, and verification/benchmark coverage
- v0.2.8 reverse/complement extraction: --rc, --reverse-only, --complement-only, --annotate-rc, RC verification, and integrated RC benchmark/report coverage

Near-term
- v0.3.0: Validate + Tier 2 benchmarks + release polish
- z-fasta validate: single-pass FASTA format checker with line-numbered error/warning output
- Checks: duplicate names, inconsistent line widths, invalid characters, empty sequences, missing terminal newline
- --strict flag treats warnings as errors
- Tier 2 benchmark suite: noodles, rust-bio, Fusta, htslib, bedtools comparisons
- Fix GET on messy FASTA (mixed-width and trailing-whitespace files indexed but not retrievable)

Long-term / Exploratory
- z-fasta digest: In-silico trypsin digestion for mass spectrometry (v0.4+)
- Parallel mmap scanning for multi-threaded indexing on NVMe arrays
- Native BGZF / gzip streaming read support

MIT. See LICENSE
Aligned life in bytes,
FASTA sings through mirrored streams.
Humans bloom as code.