Feat/low end models by Aatricks · Pull Request #25 · Aatricks/llmedge

Aatricks · 2026-06-03T03:28:20Z

This pull request introduces significant improvements to low-end device support in llmedge by adding ready-to-use model presets for lightweight models and laying out a design for on-device safetensors-to-GGUF model conversion. The main focus is to make it easier for users to run efficient models on low-resource devices without manual setup, and to provide a clear path for converting and loading models not natively supported by the runtime.

Low-end model presets (Track A):

Added public ModelPresets in Kotlin, exposing ready-to-use specs for Microsoft BitNet b1.58 2B4T (IQ2_BN, ~988 MB) and SmolVLM2-256M (vision, ~280 MB), optimized for low-end devices. These models are now easily accessible and do not require manual repo/filename entry.
Enhanced ModelHints to include an optional chatTemplate field, ensuring BitNet loads with the correct template and produces well-formed output out of the box.
Updated documentation to showcase the new presets and clarify usage, including notes on model size and template handling.

Safetensors → GGUF conversion (Track B):

Added a design spec for a safetensors-to-GGUF conversion pipeline, enabling users to convert and quantize Hugging Face safetensors models for use in llmedge. The spec outlines a two-phase approach: a host-side converter (shipping first) and an optional on-device native converter (deferred).
The spec details special handling for models like Bonsai, which require a pre-processing step to fold ternary scales before conversion, and describes the new ModelSpec.safetensors(...) API for seamless integration.

These changes make it much easier to run efficient, supported models on low-end Android devices and provide a clear path for supporting additional models through conversion.

Low-end model support:

Introduced ModelPresets for Microsoft BitNet b1.58 2B4T and SmolVLM2-256M, making them easily accessible for apps targeting low-end devices.
Ensured BitNet uses the correct chat template by extending ModelHints and updating the load path to prioritize the preset's template.

Safetensors conversion pipeline:

Designed a safetensors-to-GGUF conversion flow (host-side tool and API), supporting both direct (F16) and quantized (Q8_0, Q4_K_M, IQ2_BN) conversion, and handling special cases like Bonsai.
Outlined a future on-device native converter for broader architecture coverage, sequenced after the host-side tool.

Documentation updates:

Updated README.md and usage docs to highlight new presets, clarify template handling, and document model sizes and limitations.
Added detailed specs for both the low-end presets and the safetensors conversion pipeline for future reference and implementation guidance.

Track A of the "low-end models" request. Adds ready-to-use presets for Microsoft BitNet b1.58 2B4T (1-bit IQ2_BN_R4) and SmolVLM2-256M (vision), both already supported by the bundled ik_llama.cpp runtime.

Key design points:
- ModelHints gains an optional chatTemplate so the BitNet preset is self-contained (its GGUF-embedded template is wrong).
- New public ModelPresets surface (DefaultModelCatalog is internal).
- ModelRegistry defaults unchanged to avoid forcing large downloads.
- Safetensors conversion split out as Track B; Bonsai is an appendix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

One converter with a precision param (F16="direct", quantized="lossy"). Recommends a host/build-time converter reusing the in-repo convert_hf_to_gguf.py plus a ModelSpec.safetensors(source, precision) consumption API (Phase B1); native on-device converter for stock-Llama deferred to optional Phase B2. Documents the Bonsai QLinear scale-fold (W_eff = W x scale) needed because Bonsai needs custom modeling and stock conversion fails. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds ready-to-use ModelSpec presets for two models the bundled ik_llama.cpp runtime already supports, exposed via a new public ModelPresets surface (DefaultModelCatalog is internal):
- ModelPresets.bitnet -> Microsoft BitNet b1.58 2B4T, native 1-bit IQ2_BN_R4 GGUF (tdh111/bitnet-b1.58-2B-4T-GGUF). ~988 MB.
- ModelPresets.smolVlm2 -> SmolVLM2-256M vision base + mmproj (ggml-org/SmolVLM2-256M-Video-Instruct-GGUF). ~280 MB.

ModelHints gains an optional chatTemplate so a preset stays self-contained. BitNet's GGUF metadata template is wrong, so the canonical template (verbatim from microsoft/bitnet-b1.58-2B-4T tokenizer_config.json) is supplied via the preset and threaded into the text load path as a fallback (caller-supplied template still wins). The effective template is folded into the runtime cache key.

Registry defaults are unchanged, so existing users are not forced into large downloads. Adds ModelPresetsTest and docs.

Note: could not compile/run locally (no JDK/Android SDK in this environment); run `./gradlew :llmedge:testDebugUnitTest` to verify.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

tools/safetensors-convert/: one converter with a --precision param (f16="direct", q8_0/q4_k_m/iq2_bn="lossy"), reusing the in-repo llama.cpp/convert_hf_to_gguf.py. Arbitrary HF archs inherit upstream coverage; --adapter bonsai-qlinear folds Bonsai's per-output QLinear scales into the weights (W_eff = W*scale) and rewrites the config to stock LlamaForCausalLM so the upstream converter accepts it. The fold math is backend-agnostic and unit-tested with numpy alone (test_bonsai_fold.py, 5/5 passing): verifies (x@W^T)*scale == x@W_eff^T, scale dropping, error cases, and config rewrite. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
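The fold identity that commit message describes can be illustrated with plain Python lists. This is a minimal sketch, not the code in test_bonsai_fold.py: the function names `fold_scales` and `linear` and the sample values are assumptions chosen for illustration, and it assumes one scale per output row as the message states.

```python
def fold_scales(weight, scale):
    """Fold per-output scales into the weights: W_eff[o][i] = W[o][i] * scale[o]."""
    if len(weight) != len(scale):
        raise ValueError("scale must have exactly one entry per output row")
    return [[w * s for w in row] for row, s in zip(weight, scale)]


def linear(x, weight):
    """Compute x @ W^T for a single input vector x (weight is outputs x inputs)."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


# Hypothetical ternary weight matrix: 2 outputs x 3 inputs.
W = [[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]]
scale = [0.5, 2.0]
x = [3.0, 4.0, 5.0]

# Scaling the output afterwards ...
scaled_after = [y * s for y, s in zip(linear(x, W), scale)]
# ... matches a plain linear layer over the folded weights.
folded = linear(x, fold_scales(W, scale))
```

Once the scales are folded, the layer is an ordinary linear layer, which is why a stock converter can then accept the checkpoint.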

…ible Gradle invokes this script via `bash`, which on stock macOS is bash 3.2 (no `mapfile`), failing :llmedge:generateNativeTargetNames with "mapfile: command not found" (exit 127). Replace the two `mapfile` calls with portable `while IFS= read -r` loops. Behavior-equivalent on bash 4+/Linux; unblocks the build on macOS. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds a hint-based safetensors-conversion API (no sealed-variant ripple):
- ModelConversion {precision, adapter} + ConversionPrecision / ConversionAdapter enums; ModelHints gains an optional `conversion`.
- ModelSpec.safetensors(repoId, precision, adapter) and safetensorsLocal(path, ...) factories. The conversion is tagged into the cache key (only when present, so existing keys are unchanged).
- DefaultModelRepository.resolve() short-circuits on the conversion hint: returns a cached converted GGUF from <filesDir>/llmedge-converted, else throws an actionable LLMEdgeException naming the exact tools/safetensors-convert command + target path (on-device conversion is Phase B2, deferred).

Verified: :llmedge:testDebugUnitTest model.* green: SafetensorsSpecTest (4), ConvertedModelResolveTest (2), ModelPresetsTest (4).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

IQ2_BN_R4 is a CPU row-interleaved repack that may not load on the app's GPU-preferring backend path (OpenCL/Vulkan -> CPU). Default the preset to the plain `bitnet1582b4t-iq2_bn.gguf` (same repo/size), which is portable; ik_llama can repack to R4 at runtime on CPU. The _r4 variant is documented as the pure-CPU alternative. All four preset artifacts HEAD-verified (200). Re-verified: ModelPresetsTest 4/4 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Makes scripts/build_native_linux.sh + the jni-desktop CMake build work on macOS (Apple Silicon), so the host E2E tests run without an emulator:
- build_native_linux.sh: replace Linux-only `nproc` with a portable core count; emit a `.dylib` alias next to each `.so` (System.loadLibrary maps names to .dylib on macOS, but the harness names libs .so).
- jni-desktop/CMakeLists.txt: force the smollm JNI target's suffix to .so on Apple so the harness finds it.
- README-macos-host-build.md: document the prereqs (Homebrew bash 5, cmake, ninja, gpatch as `patch`, a full JDK for JNI headers since the Android Studio JBR ships none) and the build+run commands.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Desktop-host E2E proving the bundled ModelChatTemplates.BITNET produces coherent output where the GGUF's embedded template does not. Verified on real native inference (host libsmollm + bitnet1582b4t-iq2_bn.gguf): with the preset template, "What is the capital of France?" -> "Paris"; with the embedded template the model emits only <|begin_of_text|>. Gated via Assume on LLMEDGE_TEST_TEXT_MODEL_PATH (a bitnet gguf) + LLMEDGE_BUILD_NATIVE_LIB_PATH. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Begins the on-device native safetensors->GGUF converter. Reuse ruled out (convert_hf_to_gguf.py hard-imports torch; impractical on-device), so it's a C++ reimplementation, scoped v1 to ONE vertical slice: Llama + SentencePiece. Plan (docs/.../2026-06-02-b2-native-converter-plan.md) follows the ground-truth-oracle-first discipline: validate converted output by diffing greedy tokens / logits against an official GGUF at matching precision (not "non-empty"), and verify tensor-conversion and tokenizer-baking independently. Layer 1: cpp/convert/safetensors_reader.{h,cpp} parses the safetensors header (u64 len + JSON) and reads tensor bytes. Standalone clang++ test (test_safetensors_reader.cpp) passes: round-trips F32/BF16 tensors, shapes/dtypes/nbytes, data bytes, and malformed-input errors. Not yet wired into the gradle/native build (added as later layers land). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
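The header layout named in that message (a little-endian u64 length followed by a JSON header, then raw tensor bytes) can be sketched in a few lines of Python. This is an illustrative sketch, not the C++ reader in cpp/convert: the function names and the in-memory sample file are assumptions.

```python
import json
import struct


def read_safetensors_header(blob):
    """Parse the u64 header length and the JSON header; return (header, data_start)."""
    if len(blob) < 8:
        raise ValueError("file too short for the 8-byte header length")
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    data_start = 8 + header_len
    if data_start > len(blob):
        raise ValueError("declared header length exceeds file size")
    header = json.loads(blob[8:data_start].decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return header, data_start


def tensor_bytes(blob, header, data_start, name):
    """Slice one tensor's bytes; data_offsets are relative to the data section."""
    begin, end = header[name]["data_offsets"]
    return blob[data_start + begin:data_start + end]


# Build a tiny in-memory file holding one F32 tensor of shape [2].
payload = struct.pack("<2f", 1.0, 2.0)
meta = json.dumps(
    {"w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
).encode("utf-8")
blob = struct.pack("<Q", len(meta)) + meta + payload

header, data_start = read_safetensors_header(blob)
values = struct.unpack("<2f", tensor_bytes(blob, header, data_start, "w"))
```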

Instrumented (androidTest) E2E that loads BitNet b1.58 via the preset chat template and generates ("capital of France" -> "Paris"), and runs SmolVLM2 vision analysis. Assume-gated on model files in the test app's external dir, so it skips on CI without models. Intended for a real arm64 device. Documents (in KDoc) the Apple-Silicon Android-emulator HVF PAC quirk: the inference thread SIGILLs on a pointer-auth instruction inside the NDK's prebuilt libc++ there; the same native build produces correct output on real arm64 (verified via the desktop host JNI build). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cpp/convert/gguf_writer.{h,cpp}: hand-rolled GGUF v3 writer (no ggml dep, so host-testable and identical on-device) supporting string/int/float/bool/array KVs and tensors (ne in ggml order, aligned data section). Verified cross-implementation: test_gguf_writer.cpp writes a sample GGUF, and verify_gguf.py reads it with the canonical gguf-py reader — 7 KVs + 2 tensors, shapes and data values confirmed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
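For orientation, the fixed part of a GGUF v3 file (magic, version, tensor count, KV count, then typed key/value pairs and an aligned data section) can be sketched as follows. This is a simplified Python illustration and not the C++ writer in this PR: it writes only string and uint32 KVs, emits no tensor infos, and the helper names are assumptions.

```python
import struct

GGUF_MAGIC = b"GGUF"
GGUF_VERSION = 3
GGUF_TYPE_UINT32 = 4   # value-type ids from the GGUF format
GGUF_TYPE_STRING = 8
DEFAULT_ALIGNMENT = 32


def gguf_string(text):
    """GGUF strings are a u64 byte length followed by UTF-8 bytes (no terminator)."""
    raw = text.encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw


def kv_string(key, value):
    return gguf_string(key) + struct.pack("<I", GGUF_TYPE_STRING) + gguf_string(value)


def kv_uint32(key, value):
    return gguf_string(key) + struct.pack("<II", GGUF_TYPE_UINT32, value)


def build_gguf_header(kvs, tensor_count=0):
    """Emit magic, version, counts and KVs, zero-padded up to the data alignment."""
    out = GGUF_MAGIC + struct.pack("<IQQ", GGUF_VERSION, tensor_count, len(kvs))
    out += b"".join(kvs)
    padding = (-len(out)) % DEFAULT_ALIGNMENT
    return out + b"\x00" * padding


gguf_header = build_gguf_header([
    kv_string("general.architecture", "llama"),
    kv_uint32("general.alignment", DEFAULT_ALIGNMENT),
])
```

In a real file the tensor-info records sit between the KVs and the padding; they are left out here to keep the sketch short.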

cpp/convert/hf_to_gguf.{h,cpp}: convert a Llama-arch HF safetensors dir to GGUF tensors + llama.* hparams. Handles HF->GGUF tensor-name mapping, the llama.cpp Q/K RoPE row permutation (the classic correctness gotcha), tied embeddings, bf16/f16/f32->f16 (2-D) / f32 (1-D) conversion, and ggml ne ordering. Ground-truth oracle (the discipline the whole plan hinges on): convert SmolLM-135M and diff every tensor against the upstream convert_hf_to_gguf.py output as fp32. Result: 272/272 tensors match (shapes + values, rtol/atol 2e-3) -- a wrong permutation or name map fails allclose, not silently. Tokenizer baking (Layer 4) and JNI wiring still pending; this GGUF carries tensors + hparams only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
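The Q/K row permutation mentioned there is easiest to see on row indices. The sketch below assumes the upstream reshape `(n_head, 2, head_dim // 2)` followed by a swap of the last two axes, which interleaves the two halves of each head; `permute_rows` is a hypothetical helper operating on a list of rows, not the C++ code in this PR.

```python
def permute_rows(rows, n_head):
    """Interleave the first and second half of each head's rows.

    Equivalent to reshaping the row axis to (n_head, 2, half), swapping the
    last two axes, and flattening back.
    """
    if n_head <= 0 or len(rows) % (2 * n_head) != 0:
        raise ValueError("row count must be a multiple of 2 * n_head")
    head_dim = len(rows) // n_head
    half = head_dim // 2
    out = []
    for h in range(n_head):
        base = h * head_dim
        for i in range(half):
            out.append(rows[base + i])         # row i of the first half
            out.append(rows[base + half + i])  # row i of the second half
    return out


# Eight rows, two heads of four rows each; integers stand in for weight rows.
permuted = permute_rows(list(range(8)), n_head=2)
```

For K projections in grouped-query models the head count passed in would be the KV head count rather than the attention head count.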

…boundary Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds tokenizer_bake.{h,cpp} and wires it into convert_llama_dir via a new optional tokenizer_pre argument. For a HF byte-level BPE tokenizer this emits the tokenizer.ggml.* KVs llama.cpp needs:
- model="gpt2", pre=<caller-supplied>
- tokens[vocab] (vocab.json inverted, verbatim byte-level form)
- token_type[vocab] (NORMAL, CONTROL for special added tokens)
- merges[] (space-joined string pairs, verbatim)
- bos/eos/unknown/padding_token_id (resolved by content via tokenizer_config.json, falling back to special_tokens_map.json)
- add_space_prefix / add_bos_token
- tokenizer.chat_template

The `pre` field is caller-supplied and required (mirrors ModelHints.chatTemplate): upstream derives it by hashing the real tokenizer's output over a probe string, which v1 does not reimplement. A wrong/guessed pre loads silently and mis-tokenizes, so an empty pre throws rather than defaulting.

Fail-loud guards: model.type must be "BPE"; merges must be space-joined strings (the newer array-pair form is rejected); vocab ids must be contiguous 0..N-1.

Verified on SmolLM-135M, three ways:
- compare_tokenizer_kv.py: all 12 tokenizer KVs byte-match the upstream convert_hf_to_gguf.py reference (tokens, token_type, merges, ids, flags, chat_template).
- compare_gguf.py: 272/272 tensors still match (regression intact).
- llama-quantize loads the converted GGUF end-to-end through the real llama.cpp loader (arch + full tokenizer + tensors) and requantizes it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
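Two of the fail-loud guards described above (contiguous vocab ids and a required `pre`) could look roughly like this. It is a hedged Python sketch of the idea only; `invert_vocab` and `require_pre` are hypothetical names, and the real checks live in the C++ tokenizer_bake code.

```python
def invert_vocab(vocab):
    """Turn a token->id mapping into an id-ordered token list.

    Raises if the ids are not exactly 0..N-1, instead of leaving holes
    that would silently shift every later token.
    """
    n = len(vocab)
    tokens = [None] * n
    for token, idx in vocab.items():
        if not isinstance(idx, int) or not 0 <= idx < n:
            raise ValueError(f"vocab id {idx!r} is outside 0..{n - 1}")
        if tokens[idx] is not None:
            raise ValueError(f"duplicate vocab id {idx}")
        tokens[idx] = token
    return tokens


def require_pre(tokenizer_pre):
    """Reject a missing pre-tokenizer name rather than guessing a default."""
    if not tokenizer_pre:
        raise ValueError("tokenizer_pre is required; refusing to guess")
    return tokenizer_pre


tokens = invert_vocab({"hello": 1, "<s>": 0, "world": 2})
```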

…lution Makes ModelSpec.safetensors / safetensorsLocal actually convert on-device, end to end, instead of throwing host-tool instructions.

Native (JNI bridge):
- smollm_jni_convert.cpp exposes SmolLM.nativeConvertSafetensors(modelDir, outPath, tokenizerPre), a thin marshalling layer over llmedge::convert::convert_llama_dir.
- convert/*.cpp (reader, writer, tokenizer_bake, hf_to_gguf) + the wrapper are added to both the Android (llmedge-common.cmake SMOLLM_SOURCES) and desktop (jni-desktop/CMakeLists.txt) smollm targets. nlohmann/json resolves via the existing vendor include path on both.

Kotlin:
- SmolLM.convertSafetensorsToGguf(...) ensures the lib is loaded then calls the static native method.
- ModelConversion gains tokenizerPre (folded into cacheToken); ModelSpec.safetensors / .safetensorsLocal thread it through.
- DefaultModelRepository.resolveConvertedModel is now suspend and performs the conversion: download the HF model dir (config.json + model.safetensors + tokenizer files, flat-cached) or use the local dir, then convert into the cache target. Three guards per review:
  * fail fast (before any download) if tokenizerPre is absent: a text GGUF without a baked tokenizer is not loadable;
  * write to a temp path and rename only on success, so a mid-convert crash never leaves a corrupt cached GGUF that passes the size check;
  * catch UnsatisfiedLinkError (build without the converter) and fall back to the actionable host-tool instructions.
  Missing single-file model.safetensors (sharded/unsupported) -> clear error.

Verified end-to-end on host arm64 (B2ConvertE2ETest): resolve(safetensorsLocal(SmolLM-135M, tokenizerPre="smollm")) runs the native converter, caches a 270MB F16 GGUF under llmedge-converted/, loads it, and generates "The capital of France is Paris!", proving the baked GPT2-BPE tokenizer builds a working vocab and the baked chat_template is used. Model-package unit tests stay green (ConvertedModelResolveTest, SafetensorsSpecTest, ModelPresets, ...).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
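The "write to a temp path and rename only on success" guard is a general pattern, sketched here in Python rather than the Kotlin used in the repository. The names `convert_into_cache` and `write_fake_gguf` and the `.tmp` suffix are assumptions for illustration.

```python
import os
import tempfile


def convert_into_cache(target_path, convert):
    """Run convert(tmp_path) and publish the result only if it succeeds.

    A failed or interrupted conversion never leaves a partial file at
    target_path, so a later cache lookup cannot mistake it for a valid model.
    """
    tmp_path = target_path + ".tmp"
    try:
        convert(tmp_path)
        os.replace(tmp_path, target_path)  # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return target_path


def write_fake_gguf(path):
    """Stand-in for the real converter: writes only the GGUF magic bytes."""
    with open(path, "wb") as f:
        f.write(b"GGUF")


workdir = tempfile.mkdtemp()
ok_target = os.path.join(workdir, "model.gguf")
convert_into_cache(ok_target, write_fake_gguf)
```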

The native converter writes F16; this adds the quantization step so ModelConversion.precision other than F16 produces a smaller GGUF on-device, which is the point of the feature for low-end devices.

Quantization is layered in the JNI wrapper (smollm_jni_convert.cpp), not in convert/*.cpp, keeping the converter ggml/llama-free and host-testable. The wrapper now takes a precision label: "f16" writes the converter output directly; "q8_0" / "q4_k_m" / "iq2_bn" / "iq2_bn_r4" convert to a temp F16 GGUF and then requantize it via llama_model_quantize (with llama_backend_init/free scoped around the call, mirroring the upstream llama-quantize tool). The temp file is always removed; a non-zero quantize code throws. SmolLM.convertSafetensorsToGguf + nativeConvertSafetensors gain the precision argument; DefaultModelRepository passes conversion.precision.ggufLabel.

Verified end-to-end on host arm64 (B2ConvertE2ETest, 2/2):
- F16: 270 MB GGUF -> "The capital of France is Paris!"
- Q4_K_M: 105 MB GGUF (2.6x smaller than the source safetensors) -> "The capital of France!" (coherent text from the quantized model)

Model-package unit tests stay green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…Bonsai adapter Updates the B2 plan to reflect the full convert→bake→quantize→load→generate pipeline working on host arm64 (B2ConvertE2ETest: F16 "...Paris!", Q4_K_M coherent at 105MB). The only remaining piece is the on-device Bonsai QLinear fold adapter, which is verification-blocked (no local Bonsai model) and already covered offline by the B1 host path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…n path The quantize wrapper wrote a "<out>.f16.tmp" intermediate; previously std::remove only ran on the normal/return paths, so if convert_llama_dir or llama_model_quantize threw, the intermediate leaked (Kotlin's tmp.delete() only removes the outer temp). Wrap the convert+quantize in a try that removes the intermediate on every path before rethrowing. Verified to parse + resolve includes/symbols under the Android NDK arm64 toolchain (cxx_std_20) alongside the other 4 convert TUs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…or workaround) The Android emulator on Apple-Silicon Macs mis-virtualizes ARM pointer-auth keys, so the auth instructions in the NDK prebuilt libc++/libunwind SIGILL at runtime. This tool NOPs every PAC sign/auth hint (paciasp/autiasp/auti*1716/...) and turns retaa/retab into plain ret, within executable sections only, whole-program so the result is internally consistent. It refuses (does not silently mangle) braa/blraa auth-branches. Emulator-only dev aid: never ship a depacified .so; real arm64 devices handle PAC correctly. Verified: depacifying the packaged libsmollm.so lets the B2 converter run on the Apple-Silicon emulator (read safetensors -> convert -> bake tokenizer -> quantize Q4_K_M -> valid 105MB GGUF, host-verified), past the prior PAC-at-load boundary. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
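The core rewrite that message describes can be sketched on a raw buffer of AArch64 instruction words. This Python sketch is not the tool from this PR: it handles only four encodings (paciasp, autiasp, retaa, retab), does not parse ELF sections, and does not implement the braa/blraa refusal; `depacify_words` is a hypothetical name.

```python
import struct

NOP = 0xD503201F
RET = 0xD65F03C0                          # ret (x30)
PAC_HINTS = {0xD503233F, 0xD50323BF}      # paciasp, autiasp (HINT-space encodings)
AUTH_RETURNS = {0xD65F0BFF, 0xD65F0FFF}   # retaa, retab


def depacify_words(code):
    """Replace PAC sign/auth hints with NOP and authenticated returns with RET.

    `code` must hold whole 32-bit little-endian instructions taken from an
    executable section; running this over data would corrupt it.
    """
    if len(code) % 4 != 0:
        raise ValueError("code length must be a multiple of 4 bytes")
    count = len(code) // 4
    patched = []
    for word in struct.unpack(f"<{count}I", code):
        if word in PAC_HINTS:
            patched.append(NOP)
        elif word in AUTH_RETURNS:
            patched.append(RET)
        else:
            patched.append(word)
    return struct.pack(f"<{count}I", *patched)


# paciasp; autiasp; retaa, followed by an instruction that must stay untouched.
sample = struct.pack("<4I", 0xD503233F, 0xD50323BF, 0xD65F0BFF, 0xD65F03C0)
patched_words = struct.unpack("<4I", depacify_words(sample))
```

Because the sign/auth hints live in the HINT encoding space, replacing them with NOP preserves instruction length and layout, so no branch offsets need fixing up.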

Point submodule to 2a9eaf8 (was c4d77d8)

presets The safetensors-conversion docs were stale ("on-device conversion is not yet available (Phase B2)"); they predated the shipped B2 converter.
- docs/usage.md "Converting safetensors models": rewritten to describe on-device conversion (download model dir → convert → quantize → cache → load), the required `tokenizerPre` hint, the v1 scope (Llama arch + GPT2-BPE, single-file safetensors; Bonsai fold + other arches still host-only), and that it's verified end-to-end on a real arm64 device.
- README: added Features bullets for the low-end `ModelPresets` and on-device safetensors→GGUF conversion; fixed the BitNet note to link the usage section instead of the design doc.
- docs/index.md: added a Core Features bullet.

Aatricks and others added 24 commits June 1, 2026 23:40

Track B: record decision to build B1+B2, arbitrary-arch target

d278e05

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Mark Track A done + Track B B1 done/B2 deferred in spec status

f44afad

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Track B (B2): record Layers 1-3 verified; Layer 4 (tokenizer) is the …

3ff9662

…boundary Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Update llmedge-examples submodule

4e2faa3

Point submodule to 2a9eaf8 (was c4d77d8)

Aatricks self-assigned this Jun 3, 2026

Aatricks merged commit 57c07c6 into main Jun 3, 2026
1 check failed

Aatricks linked an issue Jun 3, 2026 that may be closed by this pull request

bonsai 1 bit image and LLM support Microsoft lens also #24

Closed

Aatricks merged 24 commits into main from feat/low-end-models
