openIE-dev/tei-ternaryOS

ternary-fabric

An energy-optimal compute fabric for the perception / media / AI workloads of a human–machine interface — written in Rust, riding ordinary binary hardware.

The one rule (the scope fence)

Apply ternary / low-bit / sparsity only where the workload has slack: redundant, error-tolerant, throughput-bound math (neural inference, signal, media). Never to exact general-purpose compute (OS kernels, data structures, crypto) — there it is pure overhead, because emulating 3 states on 2-state transistors costs more than it saves. The real lever is less work + less data moved. Ternary is one face of that.

Status

Weights live in {-1, 0, +1} with one f32 scale. Seven kernels across two packings all compute y = scale * (W · x) losslessly (to ~1e-4) with no multiply in the inner loop.

cargo run --release      # quantize, verify lossless, benchmark all kernels
cargo test               # every kernel agrees with the f32 reference
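As a rough illustration of what "no multiply in the inner loop" means, here is a minimal reference sketch, not the repository's kernels. The names `pack2`, `matvec_ternary`, and `matvec_f32` are hypothetical, and the bit order (trit i of a byte at bits 2i..2i+1, code = trit + 1) is an assumption made for this example.

```rust
/// Pack trits {-1, 0, +1} four per byte as 2-bit codes (code = trit + 1).
/// Assumption: trit i of each group sits at bits 2i..2i+1.
fn pack2(trits: &[i8]) -> Vec<u8> {
    assert!(trits.len() % 4 == 0);
    trits
        .chunks(4)
        .map(|c| {
            c.iter()
                .enumerate()
                .fold(0u8, |b, (i, &t)| b | (((t + 1) as u8) << (2 * i)))
        })
        .collect()
}

/// y = scale * (W . x) using only add/subtract in the inner loop.
/// This is the simple branchy form, kept for readability rather than speed.
fn matvec_ternary(packed: &[u8], rows: usize, cols: usize, scale: f32, x: &[f32]) -> Vec<f32> {
    assert!(cols % 4 == 0 && x.len() == cols && packed.len() == rows * cols / 4);
    let row_bytes = cols / 4;
    (0..rows)
        .map(|r| {
            let row = &packed[r * row_bytes..(r + 1) * row_bytes];
            let mut acc = 0.0f32;
            for (&b, xs) in row.iter().zip(x.chunks_exact(4)) {
                for (i, &xv) in xs.iter().enumerate() {
                    match (b >> (2 * i)) & 3 {
                        0 => acc -= xv, // trit -1
                        2 => acc += xv, // trit +1
                        _ => {}         // trit 0: skip
                    }
                }
            }
            scale * acc // the only multiply: one per output row
        })
        .collect()
}

/// f32 reference over the same trits, with a multiply per weight.
fn matvec_f32(trits: &[i8], rows: usize, cols: usize, scale: f32, x: &[f32]) -> Vec<f32> {
    (0..rows)
        .map(|r| scale * (0..cols).map(|c| trits[r * cols + c] as f32 * x[c]).sum::<f32>())
        .collect()
}

fn main() {
    let trits = [1i8, 0, -1, 1, 0, 0, 1, -1]; // 2 rows x 4 cols
    let x = [1.0f32, 2.0, 3.0, 4.0];
    let packed = pack2(&trits);
    println!("{:?}", matvec_ternary(&packed, 2, 4, 0.5, &x));
    println!("{:?}", matvec_f32(&trits, 2, 4, 0.5, &x));
}
```

Both calls should agree on this small input; on large matrices the summation order can differ between kernels, which is presumably why the comparison above is stated to a tolerance.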

Footprint (4096×4096)

| layout | size | bits/weight | vs f32 |
|---|---|---|---|
| f32 | 64.0 MiB | 32.0 | 1.0× |
| 2-bit pack | 4.0 MiB | 2.00 | 16× |
| 1.6-bit pack | 3.2 MiB | 1.60 | 20× (20% smaller than 2-bit) |
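The footprint figures follow from simple arithmetic: four 2-bit codes fit in a byte, and five trits fit in a byte because 3^5 = 243 ≤ 256, which gives 8/5 = 1.6 bits per weight. The sketch below reproduces the sizes and shows one possible base-3 packing. The function names and the digit order (least-significant trit first) are assumptions for illustration, and the sizes assume the whole matrix is packed as one flat run rather than row by row.

```rust
const TRITS_PER_BYTE: usize = 5; // 3^5 = 243 fits in one byte

/// Pack trits as base-3 digits (digit = trit + 1), least-significant first.
fn pack_dense(trits: &[i8]) -> Vec<u8> {
    trits
        .chunks(TRITS_PER_BYTE)
        .map(|c| c.iter().rev().fold(0u8, |acc, &t| acc * 3 + (t + 1) as u8))
        .collect()
}

/// Inverse of `pack_dense`; `n` is the number of trits originally packed.
fn unpack_dense(bytes: &[u8], n: usize) -> Vec<i8> {
    let mut out = Vec::with_capacity(n);
    for &b in bytes {
        let mut v = b;
        for _ in 0..TRITS_PER_BYTE {
            if out.len() == n {
                break;
            }
            out.push((v % 3) as i8 - 1);
            v /= 3;
        }
    }
    out
}

/// Bytes needed for n weights as (f32, 2-bit pack, 1.6-bit pack).
fn footprint_bytes(n: usize) -> (usize, usize, usize) {
    (n * 4, (n + 3) / 4, (n + 4) / 5)
}

fn main() {
    let (f32_b, two_bit, dense) = footprint_bytes(4096 * 4096);
    let mib = |b: usize| b as f64 / (1024.0 * 1024.0);
    println!("f32     {:.1} MiB", mib(f32_b));
    println!("2-bit   {:.1} MiB", mib(two_bit));
    println!("1.6-bit {:.1} MiB", mib(dense));

    let trits = [-1i8, 0, 1, 0, -1, 1, 1];
    let packed = pack_dense(&trits);
    assert_eq!(unpack_dense(&packed, trits.len()), trits);
}
```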

Speed (median ms/matvec, single core — absolute ms drift run-to-run on a shared core; ratios are the signal)

matvec_simd is hand-vectorized per arch: AVX2 on x86_64, NEON on aarch64. Same kernels, both arches — the SIMD-vs-LUT frontier reproduces on each; only the scalar f32 baseline (also unvectorized) shifts. x86 = Intel i7-1165G7 (Tiger Lake, Win11); arm64 = Apple M3 Ultra. Both sets are real, measured 4096×4096 medians.

| kernel | x86 (AVX2), ms | arm64 (NEON), ms | takeaway |
|---|---|---|---|
| f32 reference (scalar) | ~15 | ~12 | baseline (also unvectorized) |
| 2-bit branchless | ~43 | ~11 | no data branch; scalar |
| 2-bit SIMD (v1) | ~2.3 | ~2.7 | AVX2/NEON, 8 trits/instr; fastest |
| 2-bit LUT (v2) | ~3.9 | ~3.0 | T-MAC-style; 4 trits → 1 read |
| 1.6-bit dense scalar (v3) | ~37 | ~11 | base-3 decode; ~even with 2-bit scalar |
| 1.6-bit dense LUT (v3) | ~3.6 | ~2.7 | byte = index; matches 2-bit LUT |
| 1.6-bit dense SIMD (v5) | ~2.7 | ~3.1 | base-3 decode vectorized; 14× over scalar dense |
| 2-bit int8 T-MAC (v6) | ~3.1 | ~3.0 | int8 acts, integer LUT; FP-free energy path |
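To make the "4 trits → 1 read" entry concrete, here is a small scalar sketch of a lookup-table kernel in the T-MAC style. It is an illustration under assumptions, not the repository's code: `build_lut` and `matvec_lut` are hypothetical names, and the 2-bit layout (trit i at bits 2i..2i+1, code = trit + 1) is assumed. For each group of four activations, the table stores the partial sum for every possible weight byte, so the inner loop becomes one indexed read per byte.

```rust
/// One 256-entry table per group of 4 activations: entry `byte` is the
/// partial dot product of the 4 trits encoded in `byte` with those activations.
/// Rebuilt once per matvec; for cols = 4096 this is 1024 tables * 1 KiB = 1 MiB.
fn build_lut(x: &[f32]) -> Vec<[f32; 256]> {
    assert!(x.len() % 4 == 0);
    x.chunks_exact(4)
        .map(|xs| {
            let mut table = [0.0f32; 256];
            for (byte, slot) in table.iter_mut().enumerate() {
                for (i, &xv) in xs.iter().enumerate() {
                    match (byte >> (2 * i)) & 3 {
                        0 => *slot -= xv, // trit -1
                        2 => *slot += xv, // trit +1
                        _ => {}           // trit 0 (code 3 is unused, treated as 0)
                    }
                }
            }
            table
        })
        .collect()
}

/// Each weight byte is used directly as a table index: 4 weights per read.
fn matvec_lut(packed: &[u8], rows: usize, cols: usize, scale: f32, lut: &[[f32; 256]]) -> Vec<f32> {
    assert!(cols >= 4 && cols % 4 == 0);
    let row_bytes = cols / 4;
    assert!(packed.len() == rows * row_bytes && lut.len() == row_bytes);
    packed
        .chunks_exact(row_bytes)
        .map(|row| scale * row.iter().zip(lut).map(|(&b, t)| t[b as usize]).sum::<f32>())
        .collect()
}

fn main() {
    // Rows [1, 0, -1, 1] and [0, 0, 1, -1] packed as bytes 134 and 37.
    let packed = [134u8, 37u8];
    let lut = build_lut(&[1.0, 2.0, 3.0, 4.0]);
    println!("{:?}", matvec_lut(&packed, 2, 4, 0.5, &lut));
}
```

The table rebuild costs O(cols) per matvec and the reads are data-dependent, which is consistent with the scalar LUT landing near, but not ahead of, the hand-vectorized SIMD kernel in the table above.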

What the build taught

  • v0: footprint (16×) and zero inner-loop multiplies are definitional; a naive branchy kernel is ~6× slower than f32 — branch misprediction eats the data win.
  • v1: killing branches got ~2.3×; going wide (AVX2, independent lanes that also break the reduction's dependency chain) cashed the footprint win as ~55× over naive, ~5× over scalar f32.
  • v2: a lookup table collapses 4 weights into one read — it reaches SIMD's league, but a scalar LUT doesn't beat hand-AVX2 (random reads into a ~1 MiB table + an O(cols) rebuild per matvec, vs AVX2's sequential 8-wide stream). LUT's real edge: int8 activations, big-batch amortization, vectorizable gather — which is why bitnet.cpp uses it for general low-bit.
  • v3: the 1.6-bit layout is 20% smaller and — surprise vs the naive expectation — not slower: under a LUT it matches the 2-bit LUT (the dense byte is just an index), and even the scalar dense kernel lands ~even with 2-bit scalar. v3 concluded 2-bit's advantage was alignment → vectorizability — that base-3 fields couldn't line up for SIMD-arithmetic and dense would top out at LUT speed. v5 disproves the second half.
  • v4a: the fabric eats real weights. A 200-line std-only GGUF reader + I2_S loader pulls BitNet b1.58 2B4T's ternary tensors straight onto our kernels — I2_S is just 4 codes/byte + a per-tensor f32 scale, trit = code-1. Real attn_q is ~50% zeros (the sparsity dividend, measured) and our no-multiply matvec reproduces it losslessly. Aside worth keeping: candle and the official gguf tool both reject ggml type 36, so "load via candle" was a dead end — the hand-rolled reader was the shorter path, not the longer one.
  • v4b: real model, real tokens — and the bug that hid from every test. The full BitNet forward pass ran end-to-end but emitted word-salad. Architecture matched HF modeling_bitnet line-for-line (RMSNorm, NEOX RoPE, GQA, ReLU²-GLU, tied output); the culprit was the I2_S bit layout. bitnet.cpp packs ternary planar — within a 128-weight block, byte p holds weights {p,p+32,p+64,p+96}, not {4p..4p+3}. v4a's checks (trit histogram, kernel-vs-reference) are order-independent, so a sequential misread passed them all yet scrambled every matvec. Lesson: validate the permutation, not just the values — and an end-to-end coherence test catches what unit tests structurally can't.
  • v5: dense base-3 does vectorize. Two moves dissolve the wall: the decode goes SIMD because x/3 == (x*171)>>9 is exact for any byte (no integer divide), and instead of interleaving the five trit-planes into activation order, the activations are reorganized plane-major once per matvec (the LUT's amortization trick) so each plane meets a contiguous f32x4. NEON dense lands ~3.1 ms — frontier band, just behind 2-bit SIMD (~2.7); the gap is the base-3 decode the LUT/shift paths skip. Alignment doesn't gate vectorizability — it sets the price. The densest layout (20× vs f32) now also runs near-frontier: dense is no longer the LUT-only choice.
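The two v5 moves can be checked in a few lines. The sketch below verifies the divide-free identity for every byte and shows a scalar stand-in for the plane-major reorganization. It is an illustration only: `div3`, `decode5`, `plane_major`, and `row_dot` are hypothetical names, and the digit order (least-significant base-3 digit first, digit = trit + 1) is an assumption.

```rust
/// Exact x / 3 for any u8 without an integer divide: (x * 171) >> 9.
fn div3(x: u8) -> u8 {
    ((x as u16 * 171) >> 9) as u8
}

/// Decode one dense byte into 5 trits using only multiply, shift, subtract.
fn decode5(b: u8) -> [i8; 5] {
    let mut v = b;
    let mut out = [0i8; 5];
    for t in out.iter_mut() {
        let q = div3(v);
        *t = (v - 3 * q) as i8 - 1; // remainder is the base-3 digit
        v = q;
    }
    out
}

/// Reorder activations plane-major, once per matvec:
/// plane k holds x[5*i + k] for every weight byte i.
fn plane_major(x: &[f32]) -> [Vec<f32>; 5] {
    assert!(x.len() % 5 == 0);
    let mut planes: [Vec<f32>; 5] = std::array::from_fn(|_| Vec::with_capacity(x.len() / 5));
    for chunk in x.chunks_exact(5) {
        for (k, &v) in chunk.iter().enumerate() {
            planes[k].push(v);
        }
    }
    planes
}

/// Scalar stand-in for the vectorized loop: trit k of byte i meets planes[k][i],
/// so consecutive bytes touch consecutive activations within each plane.
fn row_dot(row: &[u8], planes: &[Vec<f32>; 5]) -> f32 {
    let mut acc = 0.0f32;
    for (i, &b) in row.iter().enumerate() {
        for (k, &t) in decode5(b).iter().enumerate() {
            match t {
                1 => acc += planes[k][i],
                -1 => acc -= planes[k][i],
                _ => {}
            }
        }
    }
    acc
}

fn main() {
    // The identity holds for every byte value.
    assert!((0..=255u8).all(|x| div3(x) == x / 3));

    let x: Vec<f32> = (1..=10).map(|v| v as f32).collect();
    let row = [48u8, 242u8]; // trits [-1,0,1,0,-1] and [1,1,1,1,1]
    println!("{}", row_dot(&row, &plane_major(&x)));
}
```

The identity works because 171/512 overshoots 1/3 by only 1/1536, which is too small to push any value below 512 across an integer boundary.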

Verified on three platforms

Same source, zero deps, generates coherent text end-to-end on each:

  • arm64 — Apple M3 Ultra: ~15 tok/s decode (NEON), full Metal GPU backend (4.8× prefill)
  • x86_64 — Intel i7-1165G7 / Win11: ~7.8 tok/s (AVX2; exercises the non-mmap File-read path)
  • wasm32: cargo build --lib --target wasm32-unknown-unknown compiles (serial, no threads)

Optimize for

Microsoft's bitnet.cpp — its formats, quality bar, and lookup-table kernels (TL1/TL2, built on T-MAC; the I2_S Int2+scale path). Match it, then beat it in Rust.

Roadmap

  • v0 seed kernel: packed ternary matvec, lossless, footprint/op proof
  • v1 branchless + hand-vectorized SIMD kernels — AVX2 (x86) and NEON (aarch64); ~5× over scalar f32 on both arches
  • v2 lookup-table (T-MAC-style) kernel: 4 trits → one table read
  • v3 packing study: 2-bit (aligned, vectorizable) vs 1.6-bit (denser) — measured
  • v4a load real BitNet b1.58 2B4T weights: std-only GGUF reader + I2_S loader (candle/gguf reject ggml type 36). Every kernel reproduces a real attn_q BitLinear matvec losslessly (~1e-4); decode byte-verified vs an independent decoder. cargo run --release --bin bitnet
  • v4b end-to-end inference: real BitNet b1.58 2B4T, tokens in → tokens out, every BitLinear computed by our no-multiply ternary kernel. std-only Llama-3 BPE tokenizer + full forward pass (RMSNorm, NEOX RoPE, GQA, ReLU²-GLU, tied F16 output), KV-cached. "The capital of France is" → " Paris. Paris is a city known for its rich history, culture, and architecture." cargo run --release --bin bitnet -- generate "<prompt>"
  • v4c/d multi-core decode: row-parallel matvec (matvec_par) over a persistent std-only broadcast thread-pool (src/pool.rs) — each BitLinear + the 128k-vocab output projection fanned across cores. ~1.8 → ~15 tok/s, past reading speed.
  • v5a dense-layout SIMD: NEON shuffle-based base-3 unpack — dense now runs at frontier speed (~3.1 ms), not just under a LUT
  • v5b (GPU) ternary matvec on the GPU, dependency-free — raw objc-runtime FFI to Metal + a runtime-compiled MSL kernel (no candle/CubeCL/objc crates). CPU and GPU share one packed weight format; verified lossless on Apple M3 Ultra. cargo run --release --bin bitnet -- gpu
  • v5c persistent GPU weight buffers + batched ternary matmul (GpuContext / GpuTernary). Weights upload once and stay resident; only activations stream. Measured crossover on M3 Ultra (4096²): batch 1 → CPU wins (0.17×, decode is latency-bound), batch 32 → 2.4×, batch 512 → 5.75× — the honest GPU win is matrix×matrix (prefill / batched serving), not single-stream decode.
  • v5d outstanding items, all done:
    • mmap + parallel transcode — load ~4s → ~1s (zero-copy GGUF reads, layers de-striped across cores)
    • int8 T-MAC (matvec_tmac) — bitnet.cpp's FP-free energy path (int8 acts, integer LUT); ~3 ms, LUT-league on the M3
    • GPU prefill (bitnet generate --gpu) — batched BitLinears on the GPU, byte-identical to CPU; 362-token prompt 25.4s → 5.3s (4.8×)
    • WASM: cargo build --lib --target wasm32-unknown-unknown (serial, no threads)
    • dense AVX2 — x86 sibling of the v5a NEON dense kernel; runtime-verified on a real Tiger Lake i7 (correct to 9.9e-5, ~2.7 ms = 14× over scalar dense)
  • v6 a second model lineage — the runtime now ingests both low-bit families (ternary and 1-bit), same zero-dep CPU fabric. Beyond BitNet, it runs PrismML's Qwen3 "Bonsai" GGUFs:
    • Q1_0 (ggml type 41, src/blockq.rs) — 1-bit binary {−1,+1}; 128-elem block = fp16 scale (first) + 16 code bytes. Layout confirmed verbatim against llama.cpp dequantize_row_q1_0.
    • Q2_0_g128 (ggml type 42, PrismML-custom) — 2-bit ternary {−1,0,+1}; 128-elem block = fp16 scale (first) + 32 code bytes, sequential codes. No spec exists on disk, so the layout was reverse-engineered by cross-decoding the same tensor from the matching q1_0 file — only scale-first + sequential makes the per-128 block scales agree and the signs match (79% on nonzeros, the rest being where 1-bit and ternary legitimately disagree near zero). Same v4b doctrine: validate the permutation, not just the values.
    • per-block fp16 scale — unlike BitNet's single per-tensor scale; applied once per 128-wide block, still no weight multiply in the inner loop.
    • Qwen3 forward pass (src/qwen3.rs) — per-head QK-norm before NEOX RoPE, SwiGLU, GQA, tied or untied output head. Pure std, zero deps.
    • Coherent end-to-end on CPU at 1.7B / 4B / 8B, both formats: cargo run --release --bin bonsai -- generate <model.gguf> "<prompt>" → "The capital of France is Paris. Paris is located in the north of France." (Bonsai-8B, Q1_0)
    • wired into the OS — the tos conductor (ternaryos infer/bonsai) reads the GGUF architecture and dispatches BitNet or Qwen3/Bonsai through one generic decode loop, so the OS runs any low-bit lineage the same way.
    • substrate routing: tos-vm's pipeline router now has two multiply-free inference fabrics, tnn.bitnet and tnn.bonsai, with honestly different energy curves (bitnet: one per-tensor scale, low setup; bonsai: per-128 fp16 scales, least data moved). ternaryos pipeline routes by cost (short jobs → bitnet, sustained → bonsai; crossover ~8 tokens), then executes on the chosen fabric for real, reporting measured J/token from the Apple AMU counter.
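The "validate the permutation, not just the values" doctrine from v4b and v6 can be demonstrated with a small sketch. It encodes one 128-weight block in a planar layout where byte p holds weights {p, p+32, p+64, p+96}, then decodes it both correctly and with the sequential misread. All function names are hypothetical, and the bit position of each group inside the byte (group g at bit offset 6 − 2g) is an assumption for illustration; only the planar index pattern comes from the notes above.

```rust
const BLOCK: usize = 128; // weights per block
const BLOCK_BYTES: usize = 32; // 4 codes per byte

/// Planar encode: weight k of a block goes to byte k % 32, group k / 32.
/// Assumption: group g occupies bits (6 - 2g)..(8 - 2g); code = trit + 1.
fn encode_planar(trits: &[i8]) -> Vec<u8> {
    assert!(trits.len() % BLOCK == 0);
    let mut bytes = vec![0u8; trits.len() / 4];
    for (j, &t) in trits.iter().enumerate() {
        let (blk, k) = (j / BLOCK, j % BLOCK);
        bytes[blk * BLOCK_BYTES + k % 32] |= ((t + 1) as u8) << (6 - 2 * (k / 32));
    }
    bytes
}

/// Correct decode: byte p yields weights p, p+32, p+64, p+96 of its block.
fn decode_planar(bytes: &[u8]) -> Vec<i8> {
    assert!(bytes.len() % BLOCK_BYTES == 0);
    let mut out = vec![0i8; bytes.len() * 4];
    for (blk, chunk) in bytes.chunks_exact(BLOCK_BYTES).enumerate() {
        for (p, &b) in chunk.iter().enumerate() {
            for g in 0..4usize {
                out[blk * BLOCK + g * 32 + p] = ((b >> (6 - 2 * g)) & 3) as i8 - 1;
            }
        }
    }
    out
}

/// The misread: treats byte p as holding weights 4p..4p+3.
fn decode_sequential(bytes: &[u8]) -> Vec<i8> {
    bytes
        .iter()
        .flat_map(|&b| (0..4usize).map(move |g| ((b >> (6 - 2 * g)) & 3) as i8 - 1))
        .collect()
}

/// Counts of trits -1, 0, +1: an order-independent check.
fn histogram(trits: &[i8]) -> [usize; 3] {
    let mut h = [0usize; 3];
    for &t in trits {
        h[(t + 1) as usize] += 1;
    }
    h
}

/// Order-dependent check: a plain dot product against known activations.
fn dot(trits: &[i8], x: &[f32]) -> f32 {
    trits.iter().zip(x).map(|(&t, &v)| t as f32 * v).sum()
}

fn main() {
    let trits: Vec<i8> = (0..128).map(|j| (j % 3) as i8 - 1).collect();
    let x: Vec<f32> = (0..128).map(|j| j as f32).collect();
    let bytes = encode_planar(&trits);
    let wrong = decode_sequential(&bytes);

    // Same values, different order: the histogram cannot see the bug.
    println!("histograms equal: {}", histogram(&wrong) == histogram(&trits));
    println!("dot planar: {}", dot(&decode_planar(&bytes), &x));
    println!("dot sequential: {}", dot(&wrong, &x));
}
```

A check that compares decoded weights position by position against an independent decoder, or an end-to-end output test, exercises the ordering that a histogram cannot.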

About

ternaryOS — open-source ternary/low-bit OS + runtime for heterogeneous adaptive compute fabric (OpenIE/ThermoEdge). Public mirror; not accepting external contributions.
