An energy-optimal compute fabric for the perception / media / AI workloads of a human–machine interface — written in Rust, riding ordinary binary hardware.
Apply ternary / low-bit / sparsity only where the workload has slack: redundant, error-tolerant, throughput-bound math (neural inference, signal, media). Never to exact general-purpose compute (OS kernels, data structures, crypto) — there it is pure overhead, because emulating 3 states on 2-state transistors costs more than it saves. The real lever is less work + less data moved. Ternary is one face of that.
Weights live in {-1, 0, +1} with one f32 scale. Seven kernels across two packings all
compute y = scale * (W · x) losslessly (to ~1e-4) with no multiply in the inner loop.
cargo run --release # quantize, verify lossless, benchmark all kernels
cargo test # every kernel agrees with the f32 reference
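A minimal sketch of what "no multiply in the inner loop" can look like for the 2-bit packing. The names `pack_2bit` and `matvec_2bit`, the `code = trit + 1` mapping, and the lowest-bits-first order within a byte are assumptions for illustration, not necessarily this repo's layout. The weight is applied as a bit mask plus a sign flip on the activation, and `scale` is multiplied once per row.

```rust
/// Pack trits (-1, 0, +1) four per byte as 2-bit codes `trit + 1`,
/// lowest bits first (assumed order). `trits.len()` must be a multiple of 4.
pub fn pack_2bit(trits: &[i8]) -> Vec<u8> {
    assert!(trits.len() % 4 == 0);
    trits
        .chunks_exact(4)
        .map(|c| {
            c.iter()
                .enumerate()
                .fold(0u8, |b, (k, &t)| b | (((t + 1) as u8) << (2 * k)))
        })
        .collect()
}

/// y = scale * (W · x) for a row-major packed ternary matrix.
/// The inner loop has no data-dependent branch and no weight multiply.
pub fn matvec_2bit(packed: &[u8], rows: usize, cols: usize, scale: f32, x: &[f32]) -> Vec<f32> {
    assert!(cols % 4 == 0 && x.len() == cols && packed.len() == rows * cols / 4);
    let bytes_per_row = cols / 4;
    (0..rows)
        .map(|r| {
            let row = &packed[r * bytes_per_row..(r + 1) * bytes_per_row];
            let mut acc = 0.0f32;
            for (j, &byte) in row.iter().enumerate() {
                for k in 0..4 {
                    let code = ((byte >> (2 * k)) & 3) as u32; // 0 => -1, 1 => 0, 2 => +1
                    // All-ones when the trit is nonzero, all-zeros when it is zero.
                    let keep = ((code & 1) ^ 1).wrapping_neg();
                    // Sign bit set when the trit is -1 (code 0).
                    let flip = ((code >> 1) ^ 1) << 31;
                    acc += f32::from_bits((x[4 * j + k].to_bits() ^ flip) & keep);
                }
            }
            scale * acc // the only floating-point multiply, once per row
        })
        .collect()
}
```

Calling `matvec_2bit(&pack_2bit(&trits), rows, cols, scale, &x)` gives the same result as a plain f32 matvec over the dequantized weights, up to summation order.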
| layout | size | bits/weight | vs f32 |
|---|---|---|---|
| f32 | 64.0 MiB | 32.0 | 1.0× |
| 2-bit pack | 4.0 MiB | 2.00 | 16× |
| 1.6-bit pack | 3.2 MiB | 1.60 | 20× (20% smaller than 2-bit) |
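Where the 1.60 bits/weight comes from: five trits have 3^5 = 243 combinations, which fit in one byte (8 / 5 = 1.6). A small sketch of that packing follows; the function names and the digit order (first trit in the least significant base-3 digit) are assumptions, and the decode here uses plain `%` and `/` for clarity.

```rust
/// Pack five trits (-1, 0, +1) into one byte as a base-3 number.
/// Digit k is `trit[k] + 1`, weighted by 3^k (assumed digit order).
pub fn pack_dense(trits: &[i8; 5]) -> u8 {
    // Horner's rule from the most significant digit down; max value is 242.
    trits.iter().rev().fold(0u8, |b, &t| b * 3 + (t + 1) as u8)
}

/// Inverse of `pack_dense`: peel off base-3 digits, least significant first.
pub fn unpack_dense(mut byte: u8) -> [i8; 5] {
    let mut out = [0i8; 5];
    for t in out.iter_mut() {
        *t = (byte % 3) as i8 - 1;
        byte /= 3;
    }
    out
}
```

Byte values 243..=255 are never produced by `pack_dense`, which is the small slack between 1.6 bits and the information-theoretic log2(3) ≈ 1.585 bits per trit.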
Speed (median ms/matvec, single core — absolute ms drift run-to-run on a shared core; ratios are the signal)
matvec_simd is hand-vectorized per arch: AVX2 on x86_64, NEON on aarch64.
Same kernels, both arches — the SIMD-vs-LUT frontier reproduces on each; only the
scalar f32 baseline (also unvectorized) shifts. x86 = Intel i7-1165G7 (Tiger Lake,
Win11); arm64 = Apple M3 Ultra. Both sets are real, measured 4096×4096 medians.
| kernel | x86 (AVX2) | arm64 (NEON) | takeaway |
|---|---|---|---|
| f32 reference (scalar) | ~15 | ~12 | baseline (also unvectorized) |
| 2-bit branchless | ~43 | ~11 | no data branch; scalar |
| 2-bit SIMD (v1) | ~2.3 | ~2.7 | AVX2/NEON, 8 trits/instr — fastest |
| 2-bit LUT (v2) | ~3.9 | ~3.0 | T-MAC-style; 4 trits → 1 read |
| 1.6-bit dense scalar (v3) | ~37 | ~11 | base-3 decode; ~even with 2-bit scalar |
| 1.6-bit dense LUT (v3) | ~3.6 | ~2.7 | byte = index; matches 2-bit LUT |
| 1.6-bit dense SIMD (v5) | ~2.7 | ~3.1 | base-3 decode vectorized — 14× over scalar dense |
| 2-bit int8 T-MAC (v6) | ~3.1 | ~3.0 | int8 acts, integer LUT — FP-free energy path |
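The "4 trits → 1 read" idea behind the LUT rows can be sketched as follows. This is a scalar illustration under assumptions (names `build_lut` and `matvec_lut`, `code = trit + 1`, lowest bits first), not the repo's kernel. For each group of four activations, a 256-entry table holds the partial sum for every possible packed byte, so each weight byte costs one table read.

```rust
/// One 256-entry table per group of 4 activations.
/// Entry `b` is the signed sum selected by the four 2-bit codes in byte `b`.
/// Rebuilt once per matvec: O(cols) work, cols/4 KiB of tables.
pub fn build_lut(x: &[f32]) -> Vec<[f32; 256]> {
    x.chunks_exact(4)
        .map(|g| {
            let mut table = [0.0f32; 256];
            for (byte, slot) in table.iter_mut().enumerate() {
                for k in 0..4 {
                    match (byte >> (2 * k)) & 3 {
                        0 => *slot -= g[k], // trit -1
                        2 => *slot += g[k], // trit +1
                        _ => {}             // code 1 is zero; code 3 is unused
                    }
                }
            }
            table
        })
        .collect()
}

/// y = scale * (W · x) where each packed weight byte is a table index.
pub fn matvec_lut(packed: &[u8], rows: usize, cols: usize, scale: f32, x: &[f32]) -> Vec<f32> {
    assert!(cols % 4 == 0 && x.len() == cols && packed.len() == rows * cols / 4);
    let lut = build_lut(x);
    let bytes_per_row = cols / 4;
    (0..rows)
        .map(|r| {
            let row = &packed[r * bytes_per_row..(r + 1) * bytes_per_row];
            let acc: f32 = row.iter().zip(&lut).map(|(&b, t)| t[b as usize]).sum();
            scale * acc
        })
        .collect()
}
```

At 4096 columns the tables total 1024 × 256 × 4 bytes = 1 MiB, and the reads into them are data-dependent rather than sequential.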
- v0: footprint (16×) and zero inner-loop multiplies are definitional; a naive branchy kernel is ~6× slower than f32 — branch misprediction eats the data win.
- v1: killing branches got ~2.3×; going wide (AVX2, independent lanes that also break the reduction's dependency chain) cashed the footprint win as ~55× over naive, ~5× over scalar f32.
- v2: a lookup table collapses 4 weights into one read — it reaches SIMD's league, but a scalar LUT doesn't beat hand-AVX2 (random reads into a ~1 MiB table + an O(cols) rebuild per matvec, vs AVX2's sequential 8-wide stream). LUT's real edge: int8 activations, big-batch amortization, vectorizable gather — which is why bitnet.cpp uses it for general low-bit.
- v3: the 1.6-bit layout is 20% smaller and — surprise vs the naive expectation — not slower: under a LUT it matches the 2-bit LUT (the dense byte is just an index), and even the scalar dense kernel lands ~even with 2-bit scalar. v3 concluded 2-bit's advantage was alignment → vectorizability — that base-3 fields couldn't line up for SIMD-arithmetic and dense would top out at LUT speed. v5 disproves the second half.
- v4a: the fabric eats real weights. A 200-line std-only GGUF reader + I2_S loader pulls BitNet b1.58 2B4T's ternary tensors straight onto our kernels. I2_S is just 4 codes/byte + a per-tensor f32 scale, `trit = code - 1`. Real `attn_q` is ~50% zeros (the sparsity dividend, measured) and our no-multiply matvec reproduces it losslessly. Aside worth keeping: candle and the official `gguf` tool both reject ggml type 36, so "load via candle" was a dead end; the hand-rolled reader was the shorter path, not the longer one.
- v4b: real model, real tokens, and the bug that hid from every test. The full BitNet forward pass ran end-to-end but emitted word-salad. Architecture matched HF `modeling_bitnet` line-for-line (RMSNorm, NEOX RoPE, GQA, ReLU²-GLU, tied output); the culprit was the I2_S bit layout. bitnet.cpp packs ternary planar: within a 128-weight block, byte `p` holds weights `{p, p+32, p+64, p+96}`, not `{4p..4p+3}`. v4a's checks (trit histogram, kernel-vs-reference) are order-independent, so a sequential misread passed them all yet scrambled every matvec. Lesson: validate the permutation, not just the values; an end-to-end coherence test catches what unit tests structurally can't.
- v5: dense base-3 does vectorize. Two moves dissolve the wall: the decode goes SIMD because `x/3 == (x*171)>>9` is exact for any byte (no integer divide), and instead of interleaving the five trit-planes into activation order, the activations are reorganized plane-major once per matvec (the LUT's amortization trick) so each plane meets a contiguous `f32x4`. NEON dense lands ~3.1 ms: frontier band, just behind 2-bit SIMD (~2.7); the gap is the base-3 decode the LUT/shift paths skip. Alignment doesn't gate vectorizability; it sets the price. The densest layout (20× vs f32) now also runs near-frontier: dense is no longer the LUT-only choice.
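The divide-free claim in the v5 note is easy to check exhaustively. The sketch below is scalar, with hypothetical names `div3` and `decode_dense` and an assumed least-significant-digit-first order. It shows the multiply-shift replacing `/ 3` in a base-3 decode; a SIMD kernel would apply the same arithmetic per lane.

```rust
/// Division by 3 without a divide: exact for every u8.
/// 255 * 171 = 43605 fits in u16, so the widened product cannot overflow.
#[inline]
pub fn div3(x: u8) -> u8 {
    ((x as u16 * 171) >> 9) as u8
}

/// Decode one dense byte (5 base-3 digits) into trits using only
/// multiply, shift, and subtract.
pub fn decode_dense(byte: u8) -> [i8; 5] {
    let mut out = [0i8; 5];
    let mut v = byte;
    for t in out.iter_mut() {
        let q = div3(v);
        *t = (v - 3 * q) as i8 - 1; // remainder in {0, 1, 2} maps to {-1, 0, +1}
        v = q;
    }
    out
}
```

The reason it holds: 171/512 exceeds 1/3 by 1/1536, so for x ≤ 255 the error stays below 1/6, which is never enough to carry a fractional part of at most 2/3 past the next integer.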
Same source, zero deps, generates coherent text end-to-end on each:
- arm64 — Apple M3 Ultra: ~15 tok/s decode (NEON), full Metal GPU backend (4.8× prefill)
- x86_64 — Intel i7-1165G7 / Win11: ~7.8 tok/s (AVX2; exercises the non-mmap File-read path)
- wasm32: `cargo build --lib --target wasm32-unknown-unknown` compiles (serial, no threads)
Microsoft's bitnet.cpp — its formats, quality bar, and lookup-table kernels
(TL1/TL2, built on T-MAC; the I2_S Int2+scale path). Match it, then beat it in Rust.
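The planar I2_S layout described in the v4b note, and why order-independent checks miss a sequential misread, can be sketched like this. The index mapping (byte `p` holds weights `p, p+32, p+64, p+96`) and `trit = code - 1` come from the notes above; the bit position of each group inside the byte (group 0 in the top two bits) is an assumption here, and the function names are hypothetical.

```rust
pub const BLOCK: usize = 128;

/// Planar decode: byte p carries weights p, p+32, p+64, p+96.
pub fn decode_planar(bytes: &[u8; 32]) -> [i8; BLOCK] {
    let mut out = [0i8; BLOCK];
    for (p, &b) in bytes.iter().enumerate() {
        for g in 0..4 {
            // ASSUMPTION: group g sits at bit shift 6 - 2g.
            let code = (b >> (6 - 2 * g)) & 3;
            out[p + 32 * g] = code as i8 - 1;
        }
    }
    out
}

/// The sequential misread: byte p treated as weights 4p..4p+3.
pub fn decode_sequential(bytes: &[u8; 32]) -> [i8; BLOCK] {
    let mut out = [0i8; BLOCK];
    for (p, &b) in bytes.iter().enumerate() {
        for g in 0..4 {
            let code = (b >> (6 - 2 * g)) & 3;
            out[4 * p + g] = code as i8 - 1;
        }
    }
    out
}

/// Counts of (-1, 0, +1): identical for both decodes, so it cannot
/// tell a correct layout from a scrambled one.
pub fn histogram(trits: &[i8]) -> [usize; 3] {
    let mut h = [0usize; 3];
    for &t in trits {
        h[(t + 1) as usize] += 1;
    }
    h
}
```

Both decoders emit the same multiset of trits, so a histogram check passes for either; only a position-sensitive comparison (or coherent generated text) separates them.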
- v0 seed kernel: packed ternary matvec, lossless, footprint/op proof
- v1 branchless + hand-vectorized SIMD kernels — AVX2 (x86) and NEON (aarch64); ~5× over scalar f32 on both arches
- v2 lookup-table (T-MAC-style) kernel: 4 trits → one table read
- v3 packing study: 2-bit (aligned, vectorizable) vs 1.6-bit (denser) — measured
- v4a load real `BitNet b1.58 2B4T` weights: std-only GGUF reader + I2_S loader (candle/`gguf` reject ggml type 36). Every kernel reproduces a real `attn_q` BitLinear matvec losslessly (~1e-4); decode byte-verified vs an independent decoder. `cargo run --release --bin bitnet`
- v4b end-to-end inference: real `BitNet b1.58 2B4T`, tokens in → tokens out, every BitLinear computed by our no-multiply ternary kernel. std-only Llama-3 BPE tokenizer + full forward pass (RMSNorm, NEOX RoPE, GQA, ReLU²-GLU, tied F16 output), KV-cached. "The capital of France is" → " Paris. Paris is a city known for its rich history, culture, and architecture." `cargo run --release --bin bitnet -- generate "<prompt>"`
- v4c/d multi-core decode: row-parallel matvec (`matvec_par`) over a persistent std-only broadcast thread-pool (`src/pool.rs`); each BitLinear + the 128k-vocab output projection fanned across cores. ~1.8 → ~15 tok/s, past reading speed.
- v5a dense-layout SIMD: NEON shuffle-based base-3 unpack; dense now runs at frontier speed (~3.1 ms), not just under a LUT
- v5b (GPU) ternary matvec on the GPU, dependency-free: raw objc-runtime FFI to Metal + a runtime-compiled MSL kernel (no candle/CubeCL/objc crates). CPU and GPU share one packed weight format; verified lossless on Apple M3 Ultra. `cargo run --release --bin bitnet -- gpu`
- v5c persistent GPU weight buffers + batched ternary matmul (`GpuContext`/`GpuTernary`). Weights upload once and stay resident; only activations stream. Measured crossover on M3 Ultra (4096²): batch 1 → CPU wins (0.17×, decode is latency-bound), batch 32 → 2.4×, batch 512 → 5.75×. The honest GPU win is matrix×matrix (prefill / batched serving), not single-stream decode.
- v5d outstanding items, all done:
  - mmap + parallel transcode: load ~4s → ~1s (zero-copy GGUF reads, layers de-striped across cores)
  - int8 T-MAC (`matvec_tmac`): bitnet.cpp's FP-free energy path (int8 acts, integer LUT); ~3 ms, LUT-league on the M3
  - GPU prefill (`bitnet generate --gpu`): batched BitLinears on the GPU, byte-identical to CPU; 362-token prompt 25.4s → 5.3s (4.8×)
  - WASM: `cargo build --lib --target wasm32-unknown-unknown` (serial, no threads)
  - dense AVX2: x86 sibling of the v5a NEON dense kernel; runtime-verified on a real Tiger Lake i7 (correct to 9.9e-5, ~2.7 ms = 14× over scalar dense)
- v6 a second model lineage: the runtime now ingests both low-bit families (ternary and 1-bit), same zero-dep CPU fabric. Beyond BitNet, it runs PrismML's Qwen3 "Bonsai" GGUFs:
  - Q1_0 (ggml type 41, `src/blockq.rs`): 1-bit binary {−1,+1}; 128-elem block = fp16 scale (first) + 16 code bytes. Layout confirmed verbatim against llama.cpp `dequantize_row_q1_0`.
  - Q2_0_g128 (ggml type 42, PrismML-custom): 2-bit ternary {−1,0,+1}; 128-elem block = fp16 scale (first) + 32 code bytes, sequential codes. No spec exists on disk, so the layout was reverse-engineered by cross-decoding the same tensor from the matching `q1_0` file: only scale-first + sequential makes the per-128 block scales agree and the signs match (79% on nonzeros, the rest being where 1-bit and ternary legitimately disagree near zero). Same v4b doctrine: validate the permutation, not just the values.
  - per-block fp16 scale: unlike BitNet's single per-tensor scale; applied once per 128-wide block, still no weight multiply in the inner loop.
  - Qwen3 forward pass (`src/qwen3.rs`): per-head QK-norm before NEOX RoPE, SwiGLU, GQA, tied or untied output head. Pure std, zero deps.
  - Coherent end-to-end on CPU at 1.7B / 4B / 8B, both formats: `cargo run --release --bin bonsai -- generate <model.gguf> "<prompt>"` → "The capital of France is Paris. Paris is located in the north of France." (Bonsai-8B, Q1_0)
  - wired into the OS: the `tos` conductor (`ternaryos infer`/`bonsai`) reads the GGUF architecture and dispatches BitNet or Qwen3/Bonsai through one generic decode loop, so the OS runs any low-bit lineage the same way.
  - substrate routing: `tos-vm`'s pipeline router now has two multiply-free inference fabrics, `tnn.bitnet` and `tnn.bonsai`, with honestly different energy curves (bitnet: one per-tensor scale, low setup; bonsai: per-128 fp16 scales, least data moved). `ternaryos pipeline` routes by cost (short jobs → bitnet, sustained → bonsai, crossover ~8 tokens), then executes on the chosen fabric for real, reporting measured J/token from the Apple AMU counter.
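How a per-block fp16 scale keeps the inner loop multiply-free can be sketched for the Q2_0_g128 shape described above (fp16 scale first, then 32 code bytes, sequential codes). Several details here are assumptions not stated in the notes: little-endian fp16, lowest bits first within a byte, `trit = code - 1`, and the names `f16_to_f32` and `dot_q2_g128`. The half-float conversion is hand-written because stable std has no `f16` type.

```rust
pub const BLOCK_BYTES: usize = 2 + 32; // fp16 scale + 128 two-bit codes

/// IEEE 754 binary16 to f32.
pub fn f16_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h >> 10) & 0x1f) as u32;
    let man = (h & 0x3ff) as u32;
    match exp {
        0 => {
            // Zero or subnormal: value is man * 2^-24.
            let mag = man as f32 * (1.0 / 16_777_216.0);
            if sign != 0 { -mag } else { mag }
        }
        0x1f => f32::from_bits(sign | 0x7f80_0000 | (man << 13)), // inf / NaN
        _ => f32::from_bits(sign | ((exp + 112) << 23) | (man << 13)), // rebias 15 -> 127
    }
}

/// Dot product of one packed row with `x`: adds and subtracts inside each
/// 128-wide block, then a single multiply by that block's scale.
pub fn dot_q2_g128(row: &[u8], x: &[f32]) -> f32 {
    assert!(row.len() % BLOCK_BYTES == 0);
    assert!(x.len() == row.len() / BLOCK_BYTES * 128);
    let mut y = 0.0f32;
    for (blk, xs) in row.chunks_exact(BLOCK_BYTES).zip(x.chunks_exact(128)) {
        let scale = f16_to_f32(u16::from_le_bytes([blk[0], blk[1]]));
        let mut acc = 0.0f32;
        for (i, &xi) in xs.iter().enumerate() {
            // ASSUMPTION: code i lives in byte i / 4, at bit shift 2 * (i % 4).
            match (blk[2 + i / 4] >> (2 * (i % 4))) & 3 {
                0 => acc -= xi, // trit -1
                2 => acc += xi, // trit +1
                _ => {}         // trit 0
            }
        }
        y += scale * acc; // one multiply per 128 weights
    }
    y
}
```

Compared with a per-tensor scale, this costs one extra multiply and two extra bytes per 128 weights.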