Feat/low end models #25

Merged: Aatricks merged 24 commits into main from feat/low-end-models on Jun 3, 2026
Conversation

@Aatricks Aatricks commented Jun 3, 2026

Owner

This pull request improves low-end device support in llmedge in two ways: it adds ready-to-use presets for lightweight models, and it adds safetensors-to-GGUF model conversion, first specified as a host-side tool and then implemented on-device in the later commits. The goal is to let users run efficient models on low-resource devices without manual setup, and to provide a clear path for converting and loading models not natively supported by the runtime.

Low-end model presets (Track A):

  • Added public ModelPresets in Kotlin, exposing ready-to-use specs for Microsoft BitNet b1.58 2B4T (IQ2_BN, ~988 MB) and SmolVLM2-256M (vision, ~280 MB), optimized for low-end devices. These models are now easily accessible and do not require manual repo/filename entry. [1] [2] [3] [4]
  • Enhanced ModelHints to include an optional chatTemplate field, ensuring BitNet loads with the correct template and produces well-formed output out of the box.
  • Updated documentation to showcase the new presets and clarify usage, including notes on model size and template handling. [1] [2] [3] [4]

Safetensors → GGUF conversion (Track B):

  • Added a design spec for a safetensors-to-GGUF conversion pipeline, enabling users to convert and quantize Hugging Face safetensors models for use in llmedge. The spec outlines a two-phase approach: a host-side converter (shipping first) and an optional on-device native converter (deferred).
  • The spec details special handling for models like Bonsai, which require a pre-processing step to fold ternary scales before conversion, and describes the new ModelSpec.safetensors(...) API for seamless integration.

These changes make it much easier to run efficient, supported models on low-end Android devices and provide a clear path for supporting additional models through conversion.

Low-end model support:

  • Introduced ModelPresets for Microsoft BitNet b1.58 2B4T and SmolVLM2-256M, making them easily accessible for apps targeting low-end devices. [1] [2] [3] [4]
  • Ensured BitNet uses the correct chat template by extending ModelHints and updating the load path to prioritize the preset's template.

Safetensors conversion pipeline:

  • Designed a safetensors-to-GGUF conversion flow (host-side tool and API), supporting both direct (F16) and quantized (Q8_0, Q4_K_M, IQ2_BN) conversion, and handling special cases like Bonsai.
  • Outlined a future on-device native converter for broader architecture coverage, sequenced after the host-side tool.

Documentation updates:

  • Updated README.md and usage docs to highlight new presets, clarify template handling, and document model sizes and limitations. [1] [2] [3] [4]
  • Added detailed specs for both the low-end presets and the safetensors conversion pipeline for future reference and implementation guidance. [1] [2]

Aatricks and others added 24 commits June 1, 2026 23:40
Track A of the "low-end models" request. Adds ready-to-use presets for
Microsoft BitNet b1.58 2B4T (1-bit IQ2_BN_R4) and SmolVLM2-256M (vision),
both already supported by the bundled ik_llama.cpp runtime.

Key design points:
- ModelHints gains an optional chatTemplate so the BitNet preset is
  self-contained (its GGUF-embedded template is wrong).
- New public ModelPresets surface (DefaultModelCatalog is internal).
- ModelRegistry defaults unchanged to avoid forcing large downloads.
- Safetensors conversion split out as Track B; Bonsai is an appendix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
One converter with a precision param (F16="direct", quantized="lossy").
Recommends a host/build-time converter reusing the in-repo
convert_hf_to_gguf.py plus a ModelSpec.safetensors(source, precision)
consumption API (Phase B1); native on-device converter for stock-Llama
deferred to optional Phase B2. Documents the Bonsai QLinear scale-fold
(W_eff = W x scale) needed because Bonsai needs custom modeling and
stock conversion fails.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds ready-to-use ModelSpec presets for two models the bundled
ik_llama.cpp runtime already supports, exposed via a new public
ModelPresets surface (DefaultModelCatalog is internal):

- ModelPresets.bitnet  -> Microsoft BitNet b1.58 2B4T, native 1-bit
  IQ2_BN_R4 GGUF (tdh111/bitnet-b1.58-2B-4T-GGUF). ~988 MB.
- ModelPresets.smolVlm2 -> SmolVLM2-256M vision base + mmproj
  (ggml-org/SmolVLM2-256M-Video-Instruct-GGUF). ~280 MB.

ModelHints gains an optional chatTemplate so a preset stays
self-contained. BitNet's GGUF metadata template is wrong, so the
canonical template (verbatim from microsoft/bitnet-b1.58-2B-4T
tokenizer_config.json) is supplied via the preset and threaded into the
text load path as a fallback (caller-supplied template still wins). The
effective template is folded into the runtime cache key.

Registry defaults are unchanged, so existing users are not forced into
large downloads. Adds ModelPresetsTest and docs.

Note: could not compile/run locally (no JDK/Android SDK in this
environment); run `./gradlew :llmedge:testDebugUnitTest` to verify.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tools/safetensors-convert/: one converter with a --precision param
(f16="direct", q8_0/q4_k_m/iq2_bn="lossy"), reusing the in-repo
llama.cpp/convert_hf_to_gguf.py. Arbitrary HF archs inherit upstream
coverage; --adapter bonsai-qlinear folds Bonsai's per-output QLinear
scales into the weights (W_eff = W*scale) and rewrites the config to
stock LlamaForCausalLM so the upstream converter accepts it.

The fold math is backend-agnostic and unit-tested with numpy alone
(test_bonsai_fold.py, 5/5 passing): verifies (x@W^T)*scale == x@W_eff^T,
scale dropping, error cases, and config rewrite.
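
The identity the fold relies on can be checked in a few lines. This is a minimal pure-Python sketch, not the repo's test_bonsai_fold.py (which uses numpy); the function names and the sample values are illustrative assumptions.

```python
# Illustrative sketch: per-output scale folding, W_eff[o][i] = W[o][i] * scale[o].

def fold_scales(weight, scale):
    """Multiply each output row of W by its scale (hypothetical helper)."""
    if len(weight) != len(scale):
        raise ValueError("expected exactly one scale per output row")
    return [[w * s for w in row] for row, s in zip(weight, scale)]


def linear(x, weight):
    """Compute x @ W^T for a single input vector x."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


W = [[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]]  # ternary weights, shape (out=2, in=3)
scale = [0.5, 2.0]                        # one scale per output row
x = [3.0, 4.0, 5.0]

scaled_after = [y * s for y, s in zip(linear(x, W), scale)]  # (x @ W^T) * scale
folded = linear(x, fold_scales(W, scale))                    # x @ W_eff^T
```

Both paths give the same output vector, which is why the scale tensors can be dropped once they are folded into the weights.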

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ible

Gradle invokes this script via `bash`, which on stock macOS is bash 3.2
(no `mapfile`), failing :llmedge:generateNativeTargetNames with
"mapfile: command not found" (exit 127). Replace the two `mapfile`
calls with portable `while IFS= read -r` loops. Behavior-equivalent on
bash 4+/Linux; unblocks the build on macOS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a hint-based safetensors-conversion API (no sealed-variant ripple):

- ModelConversion {precision, adapter} + ConversionPrecision /
  ConversionAdapter enums; ModelHints gains an optional `conversion`.
- ModelSpec.safetensors(repoId, precision, adapter) and
  safetensorsLocal(path, ...) factories. The conversion is tagged into
  the cache key (only when present, so existing keys are unchanged).
- DefaultModelRepository.resolve() short-circuits on the conversion
  hint: returns a cached converted GGUF from <filesDir>/llmedge-converted,
  else throws an actionable LLMEdgeException naming the exact
  tools/safetensors-convert command + target path (on-device conversion
  is Phase B2, deferred).

Verified: :llmedge:testDebugUnitTest model.* green — SafetensorsSpecTest
(4), ConvertedModelResolveTest (2), ModelPresetsTest (4).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
IQ2_BN_R4 is a CPU row-interleaved repack that may not load on the app's
GPU-preferring backend path (OpenCL/Vulkan -> CPU). Default the preset to
the plain `bitnet1582b4t-iq2_bn.gguf` (same repo/size), which is portable;
ik_llama can repack to R4 at runtime on CPU. The _r4 variant is documented
as the pure-CPU alternative. All four preset artifacts HEAD-verified (200).

Re-verified: ModelPresetsTest 4/4 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Makes scripts/build_native_linux.sh + the jni-desktop CMake build work on
macOS (Apple Silicon), so the host E2E tests run without an emulator:

- build_native_linux.sh: replace Linux-only `nproc` with a portable core
  count; emit a `.dylib` alias next to each `.so` (System.loadLibrary maps
  names to .dylib on macOS, but the harness names libs .so).
- jni-desktop/CMakeLists.txt: force the smollm JNI target's suffix to .so on
  Apple so the harness finds it.
- README-macos-host-build.md: document the prereqs (Homebrew bash 5, cmake,
  ninja, gpatch as `patch`, a full JDK for JNI headers since the Android
  Studio JBR ships none) and the build+run commands.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Desktop-host E2E proving the bundled ModelChatTemplates.BITNET produces
coherent output where the GGUF's embedded template does not. Verified on
real native inference (host libsmollm + bitnet1582b4t-iq2_bn.gguf): with
the preset template, "What is the capital of France?" -> "Paris"; with the
embedded template the model emits only <|begin_of_text|>. Gated via Assume
on LLMEDGE_TEST_TEXT_MODEL_PATH (a bitnet gguf) + LLMEDGE_BUILD_NATIVE_LIB_PATH.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Begins the on-device native safetensors->GGUF converter. Reuse ruled out
(convert_hf_to_gguf.py hard-imports torch; impractical on-device), so it's
a C++ reimplementation, scoped v1 to ONE vertical slice: Llama + SentencePiece.

Plan (docs/.../2026-06-02-b2-native-converter-plan.md) follows the
ground-truth-oracle-first discipline: validate converted output by diffing
greedy tokens / logits against an official GGUF at matching precision (not
"non-empty"), and verify tensor-conversion and tokenizer-baking independently.

Layer 1: cpp/convert/safetensors_reader.{h,cpp} parses the safetensors
header (u64 len + JSON) and reads tensor bytes. Standalone clang++ test
(test_safetensors_reader.cpp) passes: round-trips F32/BF16 tensors,
shapes/dtypes/nbytes, data bytes, and malformed-input errors. Not yet wired
into the gradle/native build (added as later layers land).
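
For readers unfamiliar with the container layout, the header format named above (a little-endian u64 length followed by a JSON header, then raw tensor bytes) can be sketched as follows. This is a Python illustration under the public safetensors layout, not the C++ reader in cpp/convert; the function names are assumptions.

```python
import json
import struct


def build_safetensors(tensors):
    """Serialize {name: (dtype, shape, raw_bytes)} into a safetensors blob."""
    header, data = {}, b""
    for name, (dtype, shape, raw) in tensors.items():
        # data_offsets are relative to the start of the data section
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [len(data), len(data) + len(raw)]}
        data += raw
    header_json = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return struct.pack("<Q", len(header_json)) + header_json + data


def read_safetensors(blob):
    """Parse a blob into {name: (dtype, shape, raw_bytes)}; raise on bad input."""
    if len(blob) < 8:
        raise ValueError("truncated: missing u64 header length")
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    base = 8 + header_len
    if base > len(blob):
        raise ValueError("header length exceeds file size")
    header = json.loads(blob[8:base])
    out = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        begin, end = info["data_offsets"]
        if not 0 <= begin <= end <= len(blob) - base:
            raise ValueError("tensor %r points outside the data section" % name)
        out[name] = (info["dtype"], info["shape"], blob[base + begin:base + end])
    return out


sample = build_safetensors({"w": ("F32", [2], struct.pack("<2f", 1.0, 2.0))})
parsed = read_safetensors(sample)
```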

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Instrumented (androidTest) E2E that loads BitNet b1.58 via the preset chat
template and generates ("capital of France" -> "Paris"), and runs SmolVLM2
vision analysis. Assume-gated on model files in the test app's external dir,
so it skips on CI without models. Intended for a real arm64 device.

Documents (in KDoc) the Apple-Silicon Android-emulator HVF PAC quirk: the
inference thread SIGILLs on a pointer-auth instruction inside the NDK's
prebuilt libc++ there; the same native build produces correct output on real
arm64 (verified via the desktop host JNI build).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cpp/convert/gguf_writer.{h,cpp}: hand-rolled GGUF v3 writer (no ggml dep, so
host-testable and identical on-device) supporting string/int/float/bool/array
KVs and tensors (ne in ggml order, aligned data section).
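
As a rough picture of what such a writer emits, here is a minimal Python sketch of the GGUF v3 layout: header, metadata KVs, tensor infos, then an aligned data section. It is not the C++ gguf_writer; it only handles string and uint32 KVs and F32 tensors, and all names are illustrative assumptions.

```python
import struct

ALIGNMENT = 32                    # GGUF default alignment for tensor data
TYPE_UINT32, TYPE_STRING = 4, 8   # GGUF metadata value type ids
GGML_TYPE_F32 = 0


def gguf_string(text):
    """GGUF strings are a u64 byte length followed by UTF-8 bytes."""
    raw = text.encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw


def write_gguf(kvs, tensors):
    """kvs: [(key, str | int)]; tensors: [(name, ne, f32_values)], ne in ggml order.

    Returns (file_bytes, data_section_start).
    """
    out = bytearray(b"GGUF")
    out += struct.pack("<IQQ", 3, len(tensors), len(kvs))  # version, counts
    for key, value in kvs:
        out += gguf_string(key)
        if isinstance(value, str):
            out += struct.pack("<I", TYPE_STRING) + gguf_string(value)
        else:
            out += struct.pack("<II", TYPE_UINT32, value)
    data = bytearray()
    for name, ne, values in tensors:
        data += b"\x00" * (-len(data) % ALIGNMENT)   # align each tensor
        out += gguf_string(name) + struct.pack("<I", len(ne))
        out += struct.pack("<%dQ" % len(ne), *ne)
        out += struct.pack("<IQ", GGML_TYPE_F32, len(data))  # type, offset
        data += struct.pack("<%df" % len(values), *values)
    out += b"\x00" * (-len(out) % ALIGNMENT)         # pad up to the data section
    return bytes(out + data), len(out)


blob, data_start = write_gguf(
    [("general.architecture", "llama"), ("llama.block_count", 2)],
    [("token_embd.weight", [2, 1], [1.0, 2.0])],
)
```

Tensor offsets are stored relative to the start of the data section, so the reader needs the same alignment value to locate the bytes.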

Verified cross-implementation: test_gguf_writer.cpp writes a sample GGUF, and
verify_gguf.py reads it with the canonical gguf-py reader — 7 KVs + 2 tensors,
shapes and data values confirmed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cpp/convert/hf_to_gguf.{h,cpp}: convert a Llama-arch HF safetensors dir to GGUF
tensors + llama.* hparams. Handles HF->GGUF tensor-name mapping, the llama.cpp
Q/K RoPE row permutation (the classic correctness gotcha), tied embeddings,
bf16/f16/f32->f16 (2-D) / f32 (1-D) conversion, and ggml ne ordering.
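
The Q/K row permutation is easy to get subtly wrong, so a small sketch may help. It mirrors the reshape(n_head, 2, head_dim/2) plus swapaxes(1, 2) step in upstream convert_hf_to_gguf.py, written here over a plain list of rows; the function name is an assumption, and for K under grouped-query attention the head count passed in would be the KV head count.

```python
def permute_rows(rows, n_head):
    """Reorder Q/K weight rows from HF layout to llama.cpp layout.

    Within each head the rows are split into two halves, and the output
    interleaves them: half0[0], half1[0], half0[1], half1[1], ...
    """
    total = len(rows)
    if n_head <= 0 or total % (2 * n_head) != 0:
        raise ValueError("row count must be divisible by 2 * n_head")
    head_dim = total // n_head
    half = head_dim // 2
    out = []
    for h in range(n_head):
        base = h * head_dim
        for j in range(half):
            out.append(rows[base + j])         # row j of the first half
            out.append(rows[base + half + j])  # row j of the second half
    return out


# Two heads of four rows each; integers stand in for weight rows.
permuted = permute_rows(list(range(8)), n_head=2)
# permuted == [0, 2, 1, 3, 4, 6, 5, 7]
```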

Ground-truth oracle (the discipline the whole plan hinges on): convert
SmolLM-135M and diff every tensor against the upstream convert_hf_to_gguf.py
output as fp32. Result: 272/272 tensors match (shapes + values, rtol/atol
2e-3) -- a wrong permutation or name map fails allclose, not silently.

Tokenizer baking (Layer 4) and JNI wiring still pending; this GGUF carries
tensors + hparams only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…boundary

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds tokenizer_bake.{h,cpp} and wires it into convert_llama_dir via a new
optional tokenizer_pre argument. For a HF byte-level BPE tokenizer this emits
the tokenizer.ggml.* KVs llama.cpp needs:

  - model="gpt2", pre=<caller-supplied>
  - tokens[vocab] (vocab.json inverted, verbatim byte-level form)
  - token_type[vocab] (NORMAL, CONTROL for special added tokens)
  - merges[] (space-joined string pairs, verbatim)
  - bos/eos/unknown/padding_token_id (resolved by content via
    tokenizer_config.json, falling back to special_tokens_map.json)
  - add_space_prefix / add_bos_token
  - tokenizer.chat_template

The `pre` field is caller-supplied and required (mirrors ModelHints.chatTemplate):
upstream derives it by hashing the real tokenizer's output over a probe string,
which v1 does not reimplement. A wrong/guessed pre loads silently and
mis-tokenizes, so an empty pre throws rather than defaulting.

Fail-loud guards: model.type must be "BPE"; merges must be space-joined strings
(the newer array-pair form is rejected); vocab ids must be contiguous 0..N-1.
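
The fail-loud guards can be sketched as below. This is an illustrative Python version, not tokenizer_bake.cpp; it assumes an input dict shaped like the `model` section of a Hugging Face tokenizer.json, and the function name and returned subset of keys are assumptions.

```python
def bake_tokenizer(tokenizer_json, pre):
    """Validate a byte-level BPE tokenizer and return a few tokenizer.ggml.* KVs."""
    if not pre:
        # A guessed pre-tokenizer loads silently and mis-tokenizes, so refuse.
        raise ValueError("tokenizer pre must be supplied by the caller")
    model = tokenizer_json.get("model", {})
    if model.get("type") != "BPE":
        raise ValueError("only BPE tokenizers are supported")
    merges = model.get("merges", [])
    for merge in merges:
        if not isinstance(merge, str) or len(merge.split(" ")) != 2:
            raise ValueError("merges must be space-joined string pairs")
    vocab = model.get("vocab", {})
    tokens = [None] * len(vocab)
    for token, idx in vocab.items():
        # Every id must land in 0..N-1 exactly once, i.e. ids are contiguous.
        if not isinstance(idx, int) or not 0 <= idx < len(tokens):
            raise ValueError("vocab ids must be contiguous 0..N-1")
        if tokens[idx] is not None:
            raise ValueError("duplicate vocab id %d" % idx)
        tokens[idx] = token
    return {
        "tokenizer.ggml.model": "gpt2",
        "tokenizer.ggml.pre": pre,
        "tokenizer.ggml.tokens": tokens,
        "tokenizer.ggml.merges": merges,
    }


good = {"model": {"type": "BPE", "vocab": {"a": 0, "b": 1, "ab": 2},
                  "merges": ["a b"]}}
baked = bake_tokenizer(good, pre="smollm")
```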

Verified on SmolLM-135M, three ways:
  - compare_tokenizer_kv.py: all 12 tokenizer KVs byte-match the upstream
    convert_hf_to_gguf.py reference (tokens, token_type, merges, ids, flags,
    chat_template).
  - compare_gguf.py: 272/272 tensors still match (regression intact).
  - llama-quantize loads the converted GGUF end-to-end through the real
    llama.cpp loader (arch + full tokenizer + tensors) and requantizes it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lution

Makes ModelSpec.safetensors / safetensorsLocal actually convert on-device, end
to end, instead of throwing host-tool instructions.

Native (JNI bridge):
  - smollm_jni_convert.cpp exposes
    SmolLM.nativeConvertSafetensors(modelDir, outPath, tokenizerPre), a thin
    marshalling layer over llmedge::convert::convert_llama_dir.
  - convert/*.cpp (reader, writer, tokenizer_bake, hf_to_gguf) + the wrapper are
    added to both the Android (llmedge-common.cmake SMOLLM_SOURCES) and desktop
    (jni-desktop/CMakeLists.txt) smollm targets. nlohmann/json resolves via the
    existing vendor include path on both.

Kotlin:
  - SmolLM.convertSafetensorsToGguf(...) ensures the lib is loaded then calls the
    static native method.
  - ModelConversion gains tokenizerPre (folded into cacheToken); ModelSpec
    .safetensors / .safetensorsLocal thread it through.
  - DefaultModelRepository.resolveConvertedModel is now suspend and performs the
    conversion: download the HF model dir (config.json + model.safetensors +
    tokenizer files, flat-cached) or use the local dir, then convert into the
    cache target. Three guards per review:
      * fail fast (before any download) if tokenizerPre is absent — a text GGUF
        without a baked tokenizer is not loadable;
      * write to a temp path and rename only on success, so a mid-convert crash
        never leaves a corrupt cached GGUF that passes the size check;
      * catch UnsatisfiedLinkError (build without the converter) and fall back to
        the actionable host-tool instructions.
    Missing single-file model.safetensors (sharded/unsupported) -> clear error.
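
The second guard (write to a temp path, rename only on success) is a general pattern. A minimal Python sketch follows; the real code is Kotlin in DefaultModelRepository, and the function and parameter names here are illustrative assumptions.

```python
import os
import tempfile


def convert_atomically(target_path, convert):
    """Run convert(tmp_path), then publish the result with a single rename.

    A crash or exception mid-convert leaves no file at target_path, so a
    later cache check cannot mistake a partial file for a finished GGUF.
    """
    target_dir = os.path.dirname(os.path.abspath(target_path))
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    os.close(fd)
    try:
        convert(tmp_path)
        os.replace(tmp_path, target_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return target_path
```

Keeping the temp file in the cache directory itself is a deliberate choice in this sketch: a rename across filesystems would degrade into a copy and lose the all-or-nothing property.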

Verified end-to-end on host arm64 (B2ConvertE2ETest): resolve(
safetensorsLocal(SmolLM-135M, tokenizerPre="smollm")) runs the native converter,
caches a 270MB F16 GGUF under llmedge-converted/, loads it, and generates
"The capital of France is Paris!" — proving the baked GPT2-BPE tokenizer builds
a working vocab and the baked chat_template is used. Model-package unit tests
stay green (ConvertedModelResolveTest, SafetensorsSpecTest, ModelPresets, ...).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The native converter writes F16; this adds the quantization step so
ModelConversion.precision other than F16 produces a smaller GGUF on-device —
the point of the feature for low-end devices.

Quantization is layered in the JNI wrapper (smollm_jni_convert.cpp), not in
convert/*.cpp, keeping the converter ggml/llama-free and host-testable. The
wrapper now takes a precision label: "f16" writes the converter output directly;
"q8_0" / "q4_k_m" / "iq2_bn" / "iq2_bn_r4" convert to a temp F16 GGUF and then
requantize it via llama_model_quantize (with llama_backend_init/free scoped
around the call, mirroring the upstream llama-quantize tool). The temp file is
always removed; a non-zero quantize code throws.

SmolLM.convertSafetensorsToGguf + nativeConvertSafetensors gain the precision
argument; DefaultModelRepository passes conversion.precision.ggufLabel.

Verified end-to-end on host arm64 (B2ConvertE2ETest, 2/2):
  - F16:    270 MB GGUF -> "The capital of France is Paris!"
  - Q4_K_M: 105 MB GGUF (2.6x smaller than the source safetensors) ->
            "The capital of France!" (coherent text from the quantized model)
Model-package unit tests stay green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Bonsai adapter

Updates the B2 plan to reflect the full convert→bake→quantize→load→generate
pipeline working on host arm64 (B2ConvertE2ETest: F16 "...Paris!", Q4_K_M
coherent at 105MB). The only remaining piece is the on-device Bonsai QLinear
fold adapter, which is verification-blocked (no local Bonsai model) and already
covered offline by the B1 host path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n path

The quantize wrapper wrote a "<out>.f16.tmp" intermediate; previously std::remove
only ran on the normal/return paths, so if convert_llama_dir or
llama_model_quantize threw, the intermediate leaked (Kotlin's tmp.delete() only
removes the outer temp). Wrap the convert+quantize in a try that removes the
intermediate on every path before rethrowing.
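
The shape of the fix, sketched in Python rather than the C++ of smollm_jni_convert.cpp: the two callables stand in for convert_llama_dir and llama_model_quantize, and their signatures here are assumptions made for illustration.

```python
import os

QUANTIZED_LABELS = {"q8_0", "q4_k_m", "iq2_bn", "iq2_bn_r4"}


def convert_with_precision(out_path, precision, convert_f16, quantize):
    """Write out_path at the requested precision, never leaking the F16 temp."""
    if precision == "f16":
        convert_f16(out_path)          # converter output is already F16
        return
    if precision not in QUANTIZED_LABELS:
        raise ValueError("unknown precision label: %s" % precision)
    intermediate = out_path + ".f16.tmp"
    try:
        convert_f16(intermediate)
        code = quantize(intermediate, out_path, precision)
        if code != 0:
            raise RuntimeError("quantize failed with code %d" % code)
    finally:
        # Runs on success, on a non-zero code, and when either step throws.
        if os.path.exists(intermediate):
            os.remove(intermediate)
```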

Verified to parse + resolve includes/symbols under the Android NDK arm64
toolchain (cxx_std_20) alongside the other 4 convert TUs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…or workaround)

The Android emulator on Apple-Silicon Macs mis-virtualizes ARM pointer-auth keys,
so the auth instructions in the NDK prebuilt libc++/libunwind SIGILL at runtime.
This tool NOPs every PAC sign/auth hint (paciasp/autiasp/auti*1716/...) and turns
retaa/retab into plain ret, within executable sections only, whole-program so the
result is internally consistent. It refuses (does not silently mangle) braa/blraa
auth-branches. Emulator-only dev aid — never ship a depacified .so; real arm64
devices virtualize PAC correctly.
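
The core rewrite can be pictured with a small Python sketch over raw AArch64 instruction words. It is not the tool from this commit: it skips ELF parsing and the braa/blraa refusal, covers only four hint instructions, and assumes the caller passes the bytes of an executable section. The function name is an assumption.

```python
import struct

NOP = 0xD503201F
RET = 0xD65F03C0
# HINT-space PAC instructions: paciasp, autiasp, pacia1716, autia1716
PAC_HINTS = {0xD503233F, 0xD50323BF, 0xD503211F, 0xD503219F}
# Authenticated returns: retaa, retab
AUTH_RETURNS = {0xD65F0BFF, 0xD65F0FFF}


def depacify(code):
    """Return a copy of code with PAC hints NOPed and retaa/retab made ret."""
    if len(code) % 4 != 0:
        raise ValueError("AArch64 code must be a multiple of 4 bytes")
    count = len(code) // 4
    words = struct.unpack("<%dI" % count, code)
    patched = []
    for word in words:
        if word in PAC_HINTS:
            patched.append(NOP)
        elif word in AUTH_RETURNS:
            patched.append(RET)
        else:
            patched.append(word)
    return struct.pack("<%dI" % count, *patched)


# paciasp; autiasp; retaa  ->  nop; nop; ret
before = struct.pack("<3I", 0xD503233F, 0xD50323BF, 0xD65F0BFF)
after = depacify(before)
```

Because sign and auth instructions are removed together across the whole program, return addresses are never signed in one place and authenticated in another.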

Verified: depacifying the packaged libsmollm.so lets the B2 converter run on the
Apple-Silicon emulator (read safetensors -> convert -> bake tokenizer -> quantize
Q4_K_M -> valid 105MB GGUF, host-verified), past the prior PAC-at-load boundary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Point submodule to 2a9eaf8 (was c4d77d8)
presets

The safetensors-conversion docs were stale ("on-device conversion is not yet
available (Phase B2)"); they predated the shipped B2 converter.

- docs/usage.md "Converting safetensors models": rewritten to describe
  on-device conversion (download model dir → convert → quantize → cache →
  load), the required `tokenizerPre` hint, the v1 scope (Llama arch +
  GPT2-BPE, single-file safetensors; Bonsai fold + other arches still
  host-only), and that it's verified end-to-end on a real arm64 device.
- README: added Features bullets for the low-end `ModelPresets` and on-device
  safetensors→GGUF conversion; fixed the BitNet note to link the usage
  section instead of the design doc.
- docs/index.md: added a Core Features bullet.
@Aatricks Aatricks self-assigned this Jun 3, 2026
@Aatricks Aatricks merged commit 57c07c6 into main Jun 3, 2026
1 check failed
@Aatricks Aatricks linked an issue Jun 3, 2026 that may be closed by this pull request

Development

Successfully merging this pull request may close these issues.

bonsai 1 bit image and LLM support Microsoft lens also
