
eneskemalergin/z-fasta

z-fasta ⚡

Fast, modular FASTA toolkit built in Zig.
SIMD-accelerated indexing, O(1) region extraction, and instant assembly stats.
samtools-compatible FASTA indexing and extraction, benchmarked against seqkit, fastahack, and pyfaidx.

Current release: v0.2.8


CI · Zig 0.16.0 · License: MIT · 22x faster than samtools for indexing

Quick links: Supported Today · Installation · Usage · Performance & Correctness · Benchmarking · Roadmap

Supported Today

z-fasta is focused on uncompressed FASTA workflows: building indexes, extracting one or many indexed regions, and computing assembly/proteome statistics. It supports compact .zfi indexes and samtools-compatible .fai output. get accepts positional regions, BED files, BED from stdin, names files, strand-aware extraction, and explicit orientation transforms through --rc, --reverse-only, --complement-only, and --annotate-rc. FASTQ and compressed FASTA/BGZF streams remain outside the current scope.

Why z-fasta?

Modern bioinformatics workflows are often bottlenecked by legacy text parsers. z-fasta keeps the hot paths close to the data: memory-mapped FASTA input for the default indexer, explicit SIMD header scanning, compact binary indexes, and startup-conscious CLI dispatch for tiny commands.

  • samtools-compatible output: Both z-fasta index --emit-fai and z-fasta get produce output byte-identical to samtools faidx for the verified cases. Lookup falls back from .zfi to .fai with mtime + file-size staleness validation.
  • Single binary: No dependencies, no conda environments, no glibc version errors.
  • Arena-scoped allocations: Uses Zig's ArenaAllocator for short-lived command state, keeping heap overhead low and cleanup simple.

Installation

# Download Zig 0.16.0 if you are not using the vendored toolchain
curl -L https://ziglang.org/download/0.16.0/zig-linux-x86_64-0.16.0.tar.xz | tar xJ

# Build with the repo-local Zig wrapper (uses ./zig-0.16.0/zig)
./zig build -Doptimize=ReleaseFast

# The executable is now at ./zig-out/bin/z-fasta
./zig-out/bin/z-fasta --help

Usage

Index

z-fasta index [options] <file.fasta>

Options:
  --emit-fai    Output FAI format to stdout (default: create .zfi binary file)
  --no-dedup    Disable duplicate name filtering (maximizes speed)
  --low-mem     Use chunked reader instead of mmap (limits RAM to 4 MB)
  --help        Show help message
  --version     Print version

Get (sequence extraction)

z-fasta get <file.fasta> [--bed file.bed|-] [--names file.txt]
            [--strand-aware] [--summary]
            [--rc|--complement-only|--reverse-only] [--annotate-rc]
            [--chunk-size N|-1] <region> [region ...]

Extract one or more sequences or sub-regions from an indexed FASTA file. Output is byte-identical to samtools faidx for the positional-region path, and the BED / names batch flows are verified against bedtools getfasta and samtools faidx -r. Multiple regions are accepted in a single call; the index loads once and results stream in CLI order. BED rows and names-file entries are appended in source order ahead of later positional arguments.

Requires an index: either .zfi (preferred) or .fai. If .zfi is not found, falls back to .fai automatically.

Region formats:

| Format | Description |
| --- | --- |
| NAME | Full sequence |
| NAME:START-END | 1-based, inclusive sub-region |
| NAME:START- | From START to end of sequence |

Handles Ensembl-style names containing colons (e.g., chromosome:GRCh38:1:1:248956422:1).
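One way to parse these region strings while tolerating colons inside names is sketched below. This is an illustration only, not z-fasta's parser: `parse_region` and `known_names` are hypothetical, and the rule of checking the full string against the index before looking for a trailing range is an assumption.

```python
import re

# Hypothetical helper, not z-fasta's actual parser.
# The range must follow the LAST colon, because digits and "-" contain no colons.
_RANGE = re.compile(r"^(?P<name>.+):(?P<start>\d+)-(?P<end>\d*)$")


def parse_region(region, known_names):
    """Return (name, start, end); start/end are 1-based inclusive, None means open."""
    # A string that is itself an indexed name is a full-sequence request,
    # even when it contains colons (Ensembl-style headers).
    if region in known_names:
        return region, None, None
    match = _RANGE.match(region)
    if match and match.group("name") in known_names:
        start = int(match.group("start"))
        end = int(match.group("end")) if match.group("end") else None
        return match.group("name"), start, end
    raise KeyError(f"unknown region: {region}")
```

Checking the index first means a name such as chromosome:GRCh38:1:1:248956422:1 is never split at one of its own colons.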

Additional GET flags:

| Flag | Description |
| --- | --- |
| --bed file.bed | Read BED regions from a file. BED coordinates are 0-based, half-open; z-fasta converts them to 1-based inclusive internally. |
| --bed - | Read BED regions from stdin. |
| --names file.txt | Read one full-sequence name per line. Useful for long batch lists. |
| --strand-aware | Use BED column 6 (strand). A - strand applies reverse-complement orientation before any global orientation flag. Alias: --honor-strand. |
| --rc | Reverse-complement the extracted sequence. Verified against samtools faidx -i --mark-strand no. |
| --complement-only | Complement the extracted sequence without reversing it. Mutually exclusive with --rc and --reverse-only. |
| --reverse-only | Reverse the extracted sequence without complementing it. Mutually exclusive with --rc and --complement-only. |
| --annotate-rc | Append a human-readable transform suffix to headers, for example (reverse complement). Default headers stay samtools-style and unannotated. |
| --summary | Print region count, total bases, elapsed time, and regions/sec to stderr. |
| --chunk-size N | Process BED rows in batches instead of resolving the entire BED in one batch. Default: 4096, which is the current best speed/memory tradeoff on the checked benchmark workloads. |
| --chunk-size -1 | Process all BED rows in a single batch when memory use is acceptable. |
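The BED coordinate conversion and batched processing described in the table can be sketched as follows. Both `bed_to_region` and `chunked` are hypothetical names for illustration; treating a missing strand column as "+" is an assumption, not documented z-fasta behavior.

```python
def bed_to_region(line):
    """Convert one tab-separated BED row to (name, start, end, strand), 1-based inclusive."""
    fields = line.rstrip("\n").split("\t")
    name, bed_start, bed_end = fields[0], int(fields[1]), int(fields[2])
    # Assumption: rows without column 6 are treated as forward strand.
    strand = fields[5] if len(fields) >= 6 else "+"
    # 0-based half-open [s, e) covers the same bases as 1-based inclusive [s + 1, e].
    return name, bed_start + 1, bed_end, strand


def chunked(rows, size):
    """Yield lists of at most `size` rows; size == -1 yields one single batch."""
    batch = []
    for row in rows:
        batch.append(row)
        if size != -1 and len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Only the start coordinate shifts by one; the end stays the same, so the region length is still `bed_end - bed_start`.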

Complement-based transforms are rejected for protein FASTA input with a clear error. This keeps --rc and --complement-only biologically constrained to nucleotide-like records.
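The three orientation transforms can be illustrated with a small sketch. The `transform` function and its mode strings are hypothetical, and the IUPAC complement table is an assumption about which codes are handled. Unlike the shipped --rc path, this version materializes a transformed copy rather than transforming during emission.

```python
# Complement table for IUPAC nucleotide codes, preserving case.
# Assumption: z-fasta's own table may cover a different set of symbols.
_COMPLEMENT = bytes.maketrans(
    b"ACGTUMRWSYKVHDBNacgtumrwsykvhdbn",
    b"TGCAAKYWSRMBDHVNtgcaakywsrmbdhvn",
)


def transform(seq, mode):
    """Apply one orientation transform to a bytes sequence."""
    if mode == "rc":
        # Complement every base, then reverse the order.
        return seq.translate(_COMPLEMENT)[::-1]
    if mode == "complement-only":
        return seq.translate(_COMPLEMENT)
    if mode == "reverse-only":
        return seq[::-1]
    raise ValueError(f"unknown mode: {mode}")
```

In this sketch, applying "rc" twice returns the original bytes for A, C, G, T input, which is what composing a - strand BED row with a global --rc amounts to.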

Stats

z-fasta stats [options] <file.fasta>

Options:
  --index-only  Compute stats from index only (no FASTA scan; startup-dominated)

Compute assembly/proteome statistics. Automatically detects nucleotide vs. protein sequences.

Tier 1 (index-only): sequence count, total bases, min/max/mean/median lengths, N50, L50, N90, L90, AU, duplicate count.
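The Nx/Lx metrics depend only on sequence lengths, which is why they can come from the index alone. A minimal sketch using the usual definition (the hypothetical `nx_lx` helper is not taken from the z-fasta source):

```python
def nx_lx(lengths, fraction):
    """Return (Nx, Lx); fraction=0.5 gives (N50, L50), fraction=0.9 gives (N90, L90)."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Nx is the length of the sequence that pushes the running total
        # to at least `fraction` of all bases; Lx is how many sequences that took.
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be non-empty")
```

For example, the lengths 2 through 10 sum to 54 bases; the three longest sequences (10, 9, 8) reach 27, so N50 is 8 and L50 is 3.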

Tier 2 (default): full composition scan: nucleotide frequencies, GC content (N excluded), GC skew, soft-masked fraction. For proteins: top 3 amino acids with full names.
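The nucleotide composition metrics can be sketched as below. The formulas are the commonly used definitions and are assumptions about what z-fasta reports: GC content over A+C+G+T only (so N and other ambiguity codes are excluded), GC skew as (G − C)/(G + C), and soft-masked fraction as the share of lowercase characters. The `composition` function is hypothetical.

```python
from collections import Counter


def composition(seq):
    """Return GC content, GC skew, and soft-masked fraction for a nucleotide string."""
    counts = Counter(seq.upper())
    a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
    acgt = a + c + g + t
    return {
        # N and other ambiguity codes are left out of the denominator.
        "gc_content": (g + c) / acgt if acgt else 0.0,
        "gc_skew": (g - c) / (g + c) if (g + c) else 0.0,
        # Lowercase characters mark soft-masked sequence.
        "soft_masked": sum(ch.islower() for ch in seq) / len(seq) if seq else 0.0,
    }
```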

Examples

# Create .zfi binary index (default, compact binary format)
z-fasta index genome.fa

# Output .fai to stdout (samtools-compatible)
z-fasta index --emit-fai genome.fa > genome.fai

# Extract a full sequence
z-fasta get genome.fa chr1

# Extract a sub-region (1-based, inclusive)
z-fasta get genome.fa chr1:1000000-2000000

# Extract multiple regions in one call (index loads once)
z-fasta get genome.fa chr1:1000-2000 chr2:5000-6000 chrX:100-200

# Reverse-complement a region
z-fasta get genome.fa chr1:1000-2000 --rc

# Reverse without complementing
z-fasta get genome.fa chr1:1000-2000 --reverse-only

# Complement without reversing
z-fasta get genome.fa chr1:1000-2000 --complement-only

# Add explicit transform text to the FASTA header
z-fasta get genome.fa chr1:1000-2000 --rc --annotate-rc

# Extract regions from BED
z-fasta get genome.fa --bed regions.bed

# Read BED from stdin
awk '$5 > 100' raw.bed | z-fasta get genome.fa --bed -

# Extract whole sequences from a names file
z-fasta get genome.fa --names ids.txt

# Respect BED strand and print a stderr summary
z-fasta get genome.fa --bed regions.bed --strand-aware --summary

# Compose BED strand handling with a global reverse-complement flip
z-fasta get genome.fa --bed regions.bed --honor-strand --rc

# Assembly stats (full composition scan)
z-fasta stats genome.fa

# Quick stats from index only (does not scan FASTA sequence bytes)
z-fasta stats --index-only genome.fa

Performance & Correctness

All timings on AMD Ryzen 9 3950X, warm cache.

Index: SIMD-Accelerated Indexing

| Dataset | Size | z-fasta (no-dedup) | samtools | fastahack | pyfaidx | Speedup vs samtools |
| --- | --- | --- | --- | --- | --- | --- |
| Human Genome | 3.0 GB | 0.39s | 9.03s | 21.73s | 27.48s | 22.9× |
| Transcriptome | 972 MB | 0.093s | 1.79s | 5.72s | 6.50s | 19.3× |
| Proteome | 66 MB | 0.0056s | 0.055s | 0.275s | 0.368s | 10.0× |

| Mode | Genome timing | Memory behavior |
| --- | --- | --- |
| --no-dedup | 0.39s | Fastest on repeated-name-free inputs. mmap-backed; MaxRSS reflects mapped pages. |
| default | 0.40s | Deduplicates names while staying in the same mmap-backed performance class. |
| --low-mem | 2.46s | Streaming path; measured at 4.5 MB MaxRSS on the genome benchmark. |

mmap modes show RSS close to the mapped FASTA size because /usr/bin/time -v counts mapped pages, not just private heap. See bench/index/REPORT.md for full scaling curves and memory analysis.

Get: O(1) Region Extraction

| Dataset | Region | z-fasta | samtools | seqtk | pyfaidx | Speedup vs samtools |
| --- | --- | --- | --- | --- | --- | --- |
| Any (warm cache) | 100 bp – 10 kbp | 0.7–0.9 ms | 1.5–1.6 ms | 4–34 ms | ~60 ms | 1.8–2.1× |
| Proteome (14 MB) | 1 kbp region | 1.3 ms | 10.9 ms | 7.2 ms | 119 ms | 8.4× |
| Transcriptome (972 MB) | 1 kbp region | 25.3 ms | 278.7 ms | 220.3 ms | 1103 ms | 11.0× |

Small-region extraction is an O(1) byte-offset lookup, but on this host the end-to-end CLI path is startup-dominated below roughly 10 kbp. The checked-in benchmark report for v0.2.6 was generated under a faster local benchmark environment than the current reruns. Direct side-by-side rebuilds of v0.2.6, v0.2.7, and current main on the same machine do not reproduce a material regression in no-flag get. For very large full-sequence extraction, fastahack can still win on raw write-path overhead, while z-fasta stays ahead of samtools across the real-dataset GET cases.
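The constant-time lookup follows from the standard .fai fields (OFFSET, LINEBASES, LINEWIDTH): the byte position of any base is plain arithmetic, with no scanning. The sketch below shows that arithmetic on an in-memory bytes object. `byte_offset` and `fetch` are hypothetical helpers, and this is not z-fasta's mmap-backed code path.

```python
def byte_offset(offset, linebases, linewidth, pos0):
    """File offset of 0-based base `pos0` in a record with uniform line layout."""
    # Whole lines before the base cost `linewidth` bytes each (bases + terminator);
    # the remainder is the column within the base's own line.
    return offset + (pos0 // linebases) * linewidth + pos0 % linebases


def fetch(data, offset, linebases, linewidth, start, end):
    """Return bases start..end (1-based, inclusive) from FASTA bytes `data`."""
    first = byte_offset(offset, linebases, linewidth, start - 1)
    last = byte_offset(offset, linebases, linewidth, end - 1)
    # The raw slice still contains line terminators, so strip them.
    return data[first:last + 1].replace(b"\n", b"").replace(b"\r", b"")
```

With a record whose sequence starts at byte 6 and is wrapped at 10 bases per 11-byte line, bases 9 to 12 span a line break: the slice covers bytes 14 through 18 and the embedded newline is dropped.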

Orientation note: the shipped --rc path keeps the same mmap-backed extraction model and applies reverse traversal plus complement lookup during emission, rather than materializing a second copy of the region. The main GET benchmark report now includes dedicated RC timing and RSS sections against samtools faidx -i and bedtools getfasta | seqtk seq -r, and the implementation choice is summarized in bench/get/RC_STRATEGY.md.

Multi-region (v0.2.4): z-fasta get accepts multiple regions per call, loading the index once and streaming all results in CLI order.

| Regions | z-fasta | samtools | seqtk | Speedup vs samtools |
| --- | --- | --- | --- | --- |
| 1 | 25.6 ms | 289 ms | 221 ms | 11.3× |
| 10 | 33.8 ms | 283 ms | 226 ms | 8.4× |
| 50 | 66.7 ms | 292 ms | 225 ms | 4.4× |
| 100 | 66.7 ms | 279 ms | 222 ms | 4.2× |

Benchmarked on REAL_Transcriptome.fa (972 MB, 254,070 sequences). Latency is dominated by index resolution and output setup rather than region byte count. seqtk performs a full-file scan per call regardless of region count and is listed for reference only.

Run .venv/bin/python bench/get/generate_report.py to regenerate the full GET report under bench/get/REPORT.md, including RC positional/BED comparisons and the RC memory snapshot.

Stats: Assembly/Proteome Statistics

| Mode | Dataset | z-fasta | seqkit -a | seqtk comp | Speedup vs seqkit -a |
| --- | --- | --- | --- | --- | --- |
| Index-only | Genome (3.0 GB) | 0.9 ms | 17.45 s | N/A | ~19,000× |
| Index-only | Proteome (14 MB) | 2.9 ms | 57.8 ms | N/A | ~20× |
| Full scan | 1 GB single-seq file | 0.78 s | 5.62 s | 2.65 s | ~7× |
| Full scan | Proteome (14 MB) | 11.8 ms | 57.8 ms | 93.0 ms | ~4.9× |

Index-only time is effectively constant with file size and is best described as startup-dominated. It reads .zfi index data and computes length-derived metrics without scanning FASTA sequence bytes. Full-scan throughput on synthetic files is ~1.3 GB/s, and the latest benchmark report has z-fasta ahead of seqkit on the real genome/proteome/transcriptome stats cases while still computing richer statistics. See bench/stats/REPORT.md for full results.

Correctness

  • Index: 20/20 edge cases match samtools faidx (exit codes and output).
  • Get: 90/90 single-region and 22/22 multi-region byte-identical diff tests pass vs samtools across 5+ test files: full sequences, sub-regions, single bases, line-boundary spans, clamped ranges, duplicate regions, reversed CLI order, sort-path (≥16 regions).
  • BED / names batch extraction: 16/16 verification cases pass in bench/get/verify_bed.sh, covering default BED, --bed -, stranded BED vs bedtools getfasta -s, default BED vs samtools faidx -r, and --names batch extraction.
  • Reverse / complement extraction: 19/19 verification cases pass in bench/get/verify_rc.sh, covering --rc vs samtools faidx -i --mark-strand no, exact-output checks for --reverse-only, --complement-only, and --annotate-rc, multi-region concatenation, BED --honor-strand --rc composition, protein rejection, and a synthetic chromosome-like full-sequence case.
  • Stats: 107/107 BioPython verification tests pass: exact agreement on all Tier 1 and Tier 2 values across nucleotide and protein files.
  • Unit tests: 102/102 Zig unit tests (26 index · 30 get · 33 stats · 7 complement · 6 BED parser).
  • Messy FASTA: z-fasta is the only tool tested that correctly indexes mixed-width and trailing-whitespace FASTA files. samtools, fastahack, and pyfaidx all reject them. See bench/index/REPORT.md for the full compatibility matrix.

Benchmarking

# Download real test data (~4 GB, one-time)
bash bench/shared/download_data.sh

# ── Index ─────────────────────────────────────────────────────────
./zig build -Doptimize=ReleaseFast
bash bench/index/run_benchmarks.sh       # timing + memory
bash bench/index/run_tests.sh            # 20 edge-case correctness tests
.venv/bin/python bench/index/generate_report.py   # → bench/index/REPORT.md

# ── Get ───────────────────────────────────────────────────────────
bash bench/get/run_benchmarks.sh         # latency, scaling, real datasets
bash bench/get/verify_get.sh             # 90 byte-identical diff tests vs samtools
bash bench/get/verify_bed.sh             # 16 BED / names verification cases vs bedtools + samtools
bash bench/get/verify_rc.sh              # 19 RC / reverse / complement verification cases
.venv/bin/python bench/get/generate_report.py     # → bench/get/REPORT.md

# ── Stats ─────────────────────────────────────────────────────────
bash bench/stats/run_benchmarks.sh       # full/index-only, scaling, throughput
.venv/bin/python bench/stats/verify_stats.py  # 107 BioPython verification tests
.venv/bin/python bench/stats/generate_report.py   # → bench/stats/REPORT.md

Full local refresh, in the same order used before publishing benchmark updates:

./zig build -Doptimize=ReleaseFast && bash bench/index/run_tests.sh && bash bench/get/verify_get.sh && bash bench/get/verify_multi_get.sh && .venv/bin/python bench/stats/verify_stats.py && bash bench/index/run_benchmarks.sh --runs 5 && .venv/bin/python bench/index/generate_report.py && bash bench/get/run_benchmarks.sh --runs 5 && .venv/bin/python bench/get/generate_report.py && bash bench/stats/run_benchmarks.sh --runs 5 && .venv/bin/python bench/stats/generate_report.py && bash bench/perf-recovery/run_startup.sh

Add --skip-real to the get / stats scripts to skip real dataset runs (~3 GB downloads required otherwise). See bench/README.md for prerequisites and full instructions. The shipped reverse-path note is in bench/get/RC_STRATEGY.md.

Output Formats

| Format | Flag | Description |
| --- | --- | --- |
| .zfi | (default) | Compact binary index. Fast to read/write programmatically. |
| .fai | --emit-fai | Tab-separated text, identical to samtools faidx output. |
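For readers unfamiliar with the .fai layout, each line holds five tab-separated columns: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH. The sketch below derives them from FASTA bytes in a single pass. It is a plain-Python illustration, not the SIMD indexer, and the hypothetical `fai_lines` function assumes well-formed input with uniform line widths inside each record and does no validation.

```python
def fai_lines(data):
    """Yield .fai rows (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH) for FASTA bytes."""
    pos = 0          # byte offset of the current line within `data`
    record = None    # [name, length, offset, linebases, linewidth]
    for raw in data.splitlines(keepends=True):
        if raw.startswith(b">"):
            if record is not None:
                yield "\t".join(str(field) for field in record)
            # The name is the header text up to the first whitespace.
            name = raw[1:].split()[0].decode()
            # Sequence bytes begin right after the header line.
            record = [name, 0, pos + len(raw), 0, 0]
        elif record is not None:
            bases = len(raw.rstrip(b"\r\n"))
            if record[3] == 0:
                # The first sequence line defines the line layout for the record.
                record[3], record[4] = bases, len(raw)
            record[1] += bases
        pos += len(raw)
    if record is not None:
        yield "\t".join(str(field) for field in record)
```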

Development

# Build (debug)
./zig build

# Run all tests (index + get + stats)
./zig build test --summary all

# Build optimized binary
./zig build -Doptimize=ReleaseFast

Roadmap

Delivered

  • z-fasta index: SIMD-accelerated FASTA indexing (v0.1)
  • z-fasta get: O(1) byte-offset sequence extraction (v0.2)
  • z-fasta stats: Assembly/proteome statistics with index-only mode (v0.2)
  • Unified benchmark suite with per-module reports and figures (v0.2.2)
  • Expanded tool comparison: pyfaidx, seqtk added across all benchmark modules; messy FASTA compatibility matrix (v0.2.3)
  • Multi-region get: single call with N regions, index loads once, results stream in CLI order; ~2× faster than samtools across 1–100 regions (v0.2.4)
  • Zig 0.16.0 migration plus benchmark/report refresh for v0.2.5
  • v0.2.6 performance recovery: lower startup overhead, faster index loading, buffered GET emission, fixed-width stats/index fast paths, and refreshed benchmark reports
  • v0.2.7 BED batch extraction: --bed, --bed -, --names, --strand-aware, bounded chunked processing, and verification/benchmark coverage
  • v0.2.8 reverse/complement extraction: --rc, --reverse-only, --complement-only, --annotate-rc, RC verification, and integrated RC benchmark/report coverage

Near-term

  • v0.3.0: Validate + Tier 2 benchmarks + release polish
    • z-fasta validate: single-pass FASTA format checker with line-numbered error/warning output
    • Checks: duplicate names, inconsistent line widths, invalid characters, empty sequences, missing terminal newline
    • --strict flag treats warnings as errors
    • Tier 2 benchmark suite: noodles, rust-bio, Fusta, htslib, bedtools comparisons
    • Fix GET on messy FASTA (mixed-width and trailing-whitespace files indexed but not retrievable)

Long-term / Exploratory

  • z-fasta digest: In-silico trypsin digestion for mass spectrometry (v0.4+)
  • Parallel mmap scanning for multi-threaded indexing on NVMe arrays
  • Native BGZF / gzip streaming read support

License

MIT. See LICENSE


Aligned life in bytes,
FASTA sings through mirrored streams.
Humans bloom as code.
