ternary-fabric

An energy-optimal compute fabric for the perception / media / AI workloads of a human–machine interface — written in Rust, riding ordinary binary hardware.

The one rule (the scope fence)

Apply ternary / low-bit / sparsity only where the workload has slack: redundant, error-tolerant, throughput-bound math (neural inference, signal, media). Never to exact general-purpose compute (OS kernels, data structures, crypto) — there it is pure overhead, because emulating 3 states on 2-state transistors costs more than it saves. The real lever is less work + less data moved. Ternary is one face of that.

Status

Weights live in {-1, 0, +1} with one f32 scale. Seven kernels across two packings all compute y = scale * (W · x) losslessly (within ~1e-4 of the f32 reference) with no multiply in the inner loop.

cargo run --release      # quantize, verify lossless, benchmark all kernels
cargo test               # every kernel agrees with the f32 reference
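As a minimal sketch of what "no multiply in the inner loop" means, the scalar form below adds, subtracts, or skips each activation and applies the scale once per output row. The function names and the one-i8-per-weight layout are assumptions for illustration; the crate's kernels work on packed weights and are not reproduced here.

```rust
/// Reference: y = scale * (W · x), with each ternary weight widened to f32.
fn matvec_f32_reference(w: &[i8], rows: usize, cols: usize, x: &[f32], scale: f32) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            let row = &w[r * cols..(r + 1) * cols];
            let dot: f32 = row.iter().zip(x).map(|(&t, &xi)| t as f32 * xi).sum();
            scale * dot
        })
        .collect()
}

/// Ternary matvec: the inner loop only adds, subtracts, or skips.
/// Hypothetical layout: one i8 in {-1, 0, +1} per weight, row-major.
fn matvec_ternary(w: &[i8], rows: usize, cols: usize, x: &[f32], scale: f32) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    let mut y = Vec::with_capacity(rows);
    for r in 0..rows {
        let row = &w[r * cols..(r + 1) * cols];
        let mut acc = 0.0f32;
        for (&t, &xi) in row.iter().zip(x) {
            match t {
                1 => acc += xi,
                -1 => acc -= xi,
                _ => {} // zero weight: no work at all
            }
        }
        y.push(scale * acc); // the only multiply, once per output row
    }
    y
}
```

This form still has the data-dependent branch that the v0 note below identifies as the bottleneck, so it illustrates the arithmetic rather than the fast path.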

Footprint (4096×4096)

layout	size	bits/weight	vs f32
f32	64.0 MiB	32.0	1.0×
2-bit pack	4.0 MiB	2.00	16×
1.6-bit pack	3.2 MiB	1.60	20× (20% smaller than 2-bit)
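The bits/weight column follows from how many trits fit in a byte: four at 2 bits each, or five as a single base-3 number, since 3^5 = 243 fits under 256. Below is a rough sketch of both packings, assuming code = trit + 1 and lowest field/digit first. The field order and function names are illustrative assumptions, not taken from the crate.

```rust
/// 2-bit packing: four trits per byte, each stored as code = trit + 1 (0, 1, 2).
/// Assumed order: first trit in the lowest two bits.
fn pack_2bit(trits: &[i8]) -> Vec<u8> {
    trits
        .chunks(4)
        .map(|c| {
            c.iter()
                .enumerate()
                .fold(0u8, |b, (i, &t)| b | (((t + 1) as u8) << (2 * i)))
        })
        .collect()
}

/// Dense packing: five trits per byte as a base-3 number (max value 242).
/// Assumed order: first trit is the lowest base-3 digit.
fn pack_base3(trits: &[i8]) -> Vec<u8> {
    trits
        .chunks(5)
        .map(|c| c.iter().rev().fold(0u8, |b, &t| b * 3 + (t + 1) as u8))
        .collect()
}

/// Inverse of `pack_base3`; `n` is the original trit count.
fn unpack_base3(bytes: &[u8], n: usize) -> Vec<i8> {
    let mut out = Vec::with_capacity(n);
    for &byte in bytes {
        let mut b = byte;
        for _ in 0..5 {
            if out.len() == n {
                break;
            }
            out.push((b % 3) as i8 - 1);
            b /= 3;
        }
    }
    out
}

/// Bytes needed for `n_weights` trits at a given packing density.
fn packed_bytes(n_weights: usize, trits_per_byte: usize) -> usize {
    (n_weights + trits_per_byte - 1) / trits_per_byte
}
```

For a 4096×4096 matrix, `packed_bytes(4096 * 4096, 4)` gives 4 MiB exactly and `packed_bytes(4096 * 4096, 5)` gives about 3.2 MiB, matching the table.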

Speed (median ms/matvec, single core — absolute ms drift run-to-run on a shared core; ratios are the signal)

matvec_simd is hand-vectorized per arch: AVX2 on x86_64, NEON on aarch64. The same kernels run on both arches, and the SIMD-vs-LUT frontier reproduces on each; what shifts is the scalar paths (the unvectorized f32 baseline slightly, the scalar ternary kernels by roughly 3-4×, slower on x86). x86 = Intel i7-1165G7 (Tiger Lake, Win11); arm64 = Apple M3 Ultra. Both sets are real, measured 4096×4096 medians.

kernel	x86 (AVX2)	arm64 (NEON)	takeaway
f32 reference (scalar)	~15	~12	baseline (also unvectorized)
2-bit branchless	~43	~11	no data branch; scalar
2-bit SIMD (v1)	~2.3	~2.7	AVX2/NEON, 8 trits/instr — fastest
2-bit LUT (v2)	~3.9	~3.0	T-MAC-style; 4 trits → 1 read
1.6-bit dense scalar (v3)	~37	~11	base-3 decode; ~even with 2-bit scalar
1.6-bit dense LUT (v3)	~3.6	~2.7	byte = index; matches 2-bit LUT
1.6-bit dense SIMD (v5)	~2.7	~3.1	base-3 decode vectorized; ~14× over scalar dense on x86, ~3.5× on arm64
2-bit int8 T-MAC (v6)	~3.1	~3.0	int8 acts, integer LUT — FP-free energy path
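The "4 trits → 1 read" idea behind the LUT rows can be sketched in scalar form: for every group of four activations, precompute the partial dot product for each possible weight byte, then let each packed byte index straight into its group's table. This is an illustrative model only; the names, the 256-entry f32 table, and the low-bits-first field order are assumptions rather than the crate's layout.

```rust
/// Build one 256-entry table per group of 4 activations.
/// Entry `b` of a group holds the partial dot product of those activations
/// with the 4 trits encoded in byte `b` (2 bits each, code = trit + 1).
fn build_lut(x: &[f32]) -> Vec<[f32; 256]> {
    x.chunks(4)
        .map(|xs| {
            let mut table = [0.0f32; 256];
            for (b, entry) in table.iter_mut().enumerate() {
                for (i, &xi) in xs.iter().enumerate() {
                    match (b >> (2 * i)) & 0b11 {
                        0 => *entry -= xi, // code 0 -> trit -1
                        2 => *entry += xi, // code 2 -> trit +1
                        _ => {}            // code 1 -> trit 0 (code 3 unused)
                    }
                }
            }
            table
        })
        .collect()
}

/// LUT matvec over 2-bit packed rows: one table read per weight byte.
fn matvec_lut(packed: &[u8], rows: usize, cols: usize, x: &[f32], scale: f32) -> Vec<f32> {
    assert!(cols > 0 && cols % 4 == 0);
    assert_eq!(x.len(), cols);
    let bytes_per_row = cols / 4;
    assert_eq!(packed.len(), rows * bytes_per_row);
    let lut = build_lut(x); // rebuilt once per matvec, O(cols) work
    packed
        .chunks(bytes_per_row)
        .map(|row| {
            let acc: f32 = row.iter().zip(&lut).map(|(&b, t)| t[b as usize]).sum();
            scale * acc
        })
        .collect()
}
```

With 4096 columns this builds 1024 tables of 1 KiB each, consistent with the ~1 MiB table size mentioned in the v2 note, and the per-byte reads land wherever the weight bytes point, which is the random-access cost that note describes.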

What the build taught

v0: footprint (16×) and zero inner-loop multiplies are definitional; a naive branchy kernel is ~6× slower than f32 — branch misprediction eats the data win.
v1: killing branches got ~2.3×; going wide (AVX2, independent lanes that also break the reduction's dependency chain) cashed the footprint win as ~55× over naive, ~5× over scalar f32.
v2: a lookup table collapses 4 weights into one read — it reaches SIMD's league, but a scalar LUT doesn't beat hand-AVX2 (random reads into a ~1 MiB table + an O(cols) rebuild per matvec, vs AVX2's sequential 8-wide stream). LUT's real edge: int8 activations, big-batch amortization, vectorizable gather — which is why bitnet.cpp uses it for general low-bit.
v3: the 1.6-bit layout is 20% smaller and — surprise vs the naive expectation — not slower: under a LUT it matches the 2-bit LUT (the dense byte is just an index), and even the scalar dense kernel lands ~even with 2-bit scalar. v3 concluded 2-bit's advantage was alignment → vectorizability — that base-3 fields couldn't line up for SIMD-arithmetic and dense would top out at LUT speed. v5 disproves the second half.
v4a: the fabric eats real weights. A 200-line std-only GGUF reader + I2_S loader pulls BitNet b1.58 2B4T's ternary tensors straight onto our kernels — I2_S is just 4 codes/byte + a per-tensor f32 scale, trit = code-1. Real attn_q is ~50% zeros (the sparsity dividend, measured) and our no-multiply matvec reproduces it losslessly. Aside worth keeping: candle and the official gguf tool both reject ggml type 36, so "load via candle" was a dead end — the hand-rolled reader was the shorter path, not the longer one.
v4b: real model, real tokens — and the bug that hid from every test. The full BitNet forward pass ran end-to-end but emitted word-salad. Architecture matched HF modeling_bitnet line-for-line (RMSNorm, NEOX RoPE, GQA, ReLU²-GLU, tied output); the culprit was the I2_S bit layout. bitnet.cpp packs ternary planar — within a 128-weight block, byte p holds weights {p,p+32,p+64,p+96}, not {4p..4p+3}. v4a's checks (trit histogram, kernel-vs-reference) are order-independent, so a sequential misread passed them all yet scrambled every matvec. Lesson: validate the permutation, not just the values — and an end-to-end coherence test catches what unit tests structurally can't.
v5: dense base-3 does vectorize. Two moves dissolve the wall: the decode goes SIMD because x/3 == (x*171)>>9 is exact for any byte (no integer divide), and instead of interleaving the five trit-planes into activation order, the activations are reorganized plane-major once per matvec (the LUT's amortization trick) so each plane meets a contiguous f32x4. NEON dense lands ~3.1 ms — frontier band, just behind 2-bit SIMD (~2.7); the gap is the base-3 decode the LUT/shift paths skip. Alignment doesn't gate vectorizability — it sets the price. The densest layout (20× vs f32) now also runs near-frontier: dense is no longer the LUT-only choice.
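The two v5 moves can be modeled in scalar Rust. The first function checks the divide-free identity x/3 == (x*171)>>9; the rest peel one base-3 digit per pass so that each pass walks one plane of activations contiguously, which is the access pattern a SIMD version would need. Function names and the lowest-digit-first order are assumptions for illustration, and this is a scalar stand-in, not the AVX2/NEON kernel.

```rust
/// Divide a byte by 3 with a multiply and a shift (no integer divide).
fn div3(x: u8) -> u8 {
    ((x as u16 * 171) >> 9) as u8
}

/// Decode one dense byte into five trits, lowest base-3 digit first.
fn decode5(byte: u8) -> [i8; 5] {
    let mut out = [0i8; 5];
    let mut b = byte;
    for t in out.iter_mut() {
        let q = div3(b);
        *t = (b - 3 * q) as i8 - 1; // remainder 0..=2 mapped to -1..=1
        b = q;
    }
    out
}

/// Reorganize activations plane-major: plane k holds x[k], x[5 + k], x[10 + k], ...
fn to_planes(x: &[f32]) -> [Vec<f32>; 5] {
    let mut planes: [Vec<f32>; 5] = Default::default();
    for (i, &xi) in x.iter().enumerate() {
        planes[i % 5].push(xi);
    }
    planes
}

/// Dot product of one dense row with plane-major activations.
/// Each outer pass extracts digit k from every byte and reads plane k in order.
fn dense_dot(row: &[u8], planes: &[Vec<f32>; 5]) -> f32 {
    let mut b: Vec<u8> = row.to_vec();
    let mut acc = 0.0f32;
    for plane in planes {
        for (bj, &xj) in b.iter_mut().zip(plane) {
            let q = div3(*bj);
            match *bj - 3 * q {
                0 => acc -= xj, // digit 0 -> trit -1
                2 => acc += xj, // digit 2 -> trit +1
                _ => {}         // digit 1 -> trit 0
            }
            *bj = q;
        }
    }
    acc
}
```

The identity holds for every byte because 171/512 overshoots 1/3 by only 1/1536, so for x ≤ 255 the error stays below 1/6 and never carries the floor past the next integer.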

Verified on three platforms

Same source, zero deps. arm64 and x86_64 generate coherent text end-to-end; wasm32 is build-verified only:

arm64 — Apple M3 Ultra: ~15 tok/s decode (NEON), full Metal GPU backend (4.8× prefill)
x86_64 — Intel i7-1165G7 / Win11: ~7.8 tok/s (AVX2; exercises the non-mmap File-read path)
wasm32 — cargo build --lib --target wasm32-unknown-unknown compiles (serial, no threads)

Optimize for

Microsoft's bitnet.cpp — its formats, quality bar, and lookup-table kernels (TL1/TL2, built on T-MAC; the I2_S Int2+scale path). Match it, then beat it in Rust.
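Matching the I2_S format means matching its permutation, as the v4b note found. The sketch below contrasts the planar layout described there (within a 128-weight block, byte p holds weights p, p+32, p+64, p+96) with the sequential misread, and shows why an order-independent check such as a trit histogram cannot tell them apart. Which 2-bit field inside the byte maps to which of the four weights is not stated in this README; most-significant-field-first is an assumption here, as are all function names.

```rust
const BLOCK: usize = 128; // weights per block
const LANES: usize = 32; // bytes per block; byte p covers p, p+32, p+64, p+96

/// Planar unpack: field g of byte p lands at weight index g * 32 + p in its block.
/// ASSUMPTION: field 0 is the top two bits of the byte.
fn unpack_planar(bytes: &[u8]) -> Vec<i8> {
    assert_eq!(bytes.len() % LANES, 0);
    let mut out = vec![0i8; bytes.len() * 4];
    for (blk, chunk) in bytes.chunks(LANES).enumerate() {
        for (p, &byte) in chunk.iter().enumerate() {
            for g in 0..4 {
                let code = (byte >> (6 - 2 * g)) & 0b11;
                out[blk * BLOCK + g * LANES + p] = code as i8 - 1; // trit = code - 1
            }
        }
    }
    out
}

/// The sequential misread: treats byte p as weights 4p..4p+3.
fn unpack_sequential(bytes: &[u8]) -> Vec<i8> {
    bytes
        .iter()
        .flat_map(|&byte| (0..4u32).map(move |g| ((byte >> (6 - 2 * g)) & 0b11) as i8 - 1))
        .collect()
}

/// Counts of -1, 0, +1: an order-independent check.
fn histogram(trits: &[i8]) -> [usize; 3] {
    let mut h = [0usize; 3];
    for &t in trits {
        h[(t + 1) as usize] += 1;
    }
    h
}
```

For a block of 32 identical bytes 0b10_01_00_01, both unpackers produce the same histogram, yet the planar result starts with 32 consecutive +1 weights while the sequential one cycles +1, 0, -1, 0. A check that compares element order against a known-good tensor would separate the two where the histogram cannot.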
