Feat/low end models #25
Merged
Track A of the "low-end models" request. Adds ready-to-use presets for Microsoft BitNet b1.58 2B4T (1-bit IQ2_BN_R4) and SmolVLM2-256M (vision), both already supported by the bundled ik_llama.cpp runtime.
Key design points:
- ModelHints gains an optional chatTemplate so the BitNet preset is self-contained (its GGUF-embedded template is wrong).
- New public ModelPresets surface (DefaultModelCatalog is internal).
- ModelRegistry defaults unchanged to avoid forcing large downloads.
- Safetensors conversion split out as Track B; Bonsai is an appendix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
One converter with a precision param (F16="direct", quantized="lossy"). Recommends a host/build-time converter reusing the in-repo convert_hf_to_gguf.py plus a ModelSpec.safetensors(source, precision) consumption API (Phase B1); a native on-device converter for stock Llama is deferred to optional Phase B2. Documents the Bonsai QLinear scale-fold (W_eff = W x scale), needed because Bonsai requires custom modeling and stock conversion fails.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds ready-to-use ModelSpec presets for two models the bundled ik_llama.cpp runtime already supports, exposed via a new public ModelPresets surface (DefaultModelCatalog is internal):
- ModelPresets.bitnet -> Microsoft BitNet b1.58 2B4T, native 1-bit IQ2_BN_R4 GGUF (tdh111/bitnet-b1.58-2B-4T-GGUF). ~988 MB.
- ModelPresets.smolVlm2 -> SmolVLM2-256M vision base + mmproj (ggml-org/SmolVLM2-256M-Video-Instruct-GGUF). ~280 MB.
ModelHints gains an optional chatTemplate so a preset stays self-contained. BitNet's GGUF metadata template is wrong, so the canonical template (verbatim from microsoft/bitnet-b1.58-2B-4T tokenizer_config.json) is supplied via the preset and threaded into the text load path as a fallback (a caller-supplied template still wins). The effective template is folded into the runtime cache key.
Registry defaults are unchanged, so existing users are not forced into large downloads. Adds ModelPresetsTest and docs.
Note: could not compile/run locally (no JDK/Android SDK in this environment); run `./gradlew :llmedge:testDebugUnitTest` to verify.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
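The template precedence described in this commit can be illustrated with a minimal Python sketch. It is not the Kotlin implementation: `effective_chat_template` and `runtime_cache_key` are hypothetical names, and the hash-based key format is an assumption made only for the example.

```python
import hashlib
from typing import Optional


def effective_chat_template(caller: Optional[str], preset: Optional[str]) -> Optional[str]:
    """Caller-supplied template wins; the preset's template is the fallback.

    Returning None means "use whatever template the GGUF metadata embeds".
    """
    if caller is not None:
        return caller
    return preset


def runtime_cache_key(model_path: str, template: Optional[str]) -> str:
    """Fold the effective template into the key (hypothetical key format).

    Two loads of the same file with different templates must not share a
    cached runtime, so the template contributes a short digest.
    """
    if template is None:
        tag = "embedded"
    else:
        tag = hashlib.sha256(template.encode("utf-8")).hexdigest()[:16]
    return f"{model_path}|tmpl={tag}"
```

With these helpers, a BitNet preset load without a caller template resolves to the preset's template, while an explicit caller template replaces it and yields a different cache key.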
tools/safetensors-convert/: one converter with a --precision param (f16="direct", q8_0/q4_k_m/iq2_bn="lossy"), reusing the in-repo llama.cpp/convert_hf_to_gguf.py. Arbitrary HF archs inherit upstream coverage; --adapter bonsai-qlinear folds Bonsai's per-output QLinear scales into the weights (W_eff = W*scale) and rewrites the config to stock LlamaForCausalLM so the upstream converter accepts it.
The fold math is backend-agnostic and unit-tested with numpy alone (test_bonsai_fold.py, 5/5 passing): verifies (x@W^T)*scale == x@W_eff^T, scale dropping, error cases, and config rewrite.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
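The fold identity (x@W^T)*scale == x@W_eff^T can be checked with a few lines of plain Python. This is a sketch of the math only, not the contents of test_bonsai_fold.py; `fold_scales` and `linear` are hypothetical helper names, and the weight layout (one row per output feature, one scale per row) is an assumption consistent with "per-output" scales.

```python
import math


def fold_scales(W, scale):
    """Return W_eff = W * scale, scaling each output row by its own factor."""
    if len(scale) != len(W):
        raise ValueError("need exactly one scale per output row")
    return [[w * s for w in row] for row, s in zip(W, scale)]


def linear(x, W):
    """y = x @ W^T for a single input vector x (W is [out][in])."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in W]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 outputs, 2 inputs
scale = [0.5, 2.0, -1.0]                   # one scale per output
x = [1.5, -2.0]

scaled_after = [y * s for y, s in zip(linear(x, W), scale)]   # (x @ W^T) * scale
folded = linear(x, fold_scales(W, scale))                     # x @ W_eff^T

assert all(math.isclose(a, b) for a, b in zip(scaled_after, folded))
```

Because the scale multiplies each output independently, it commutes with the dot product over inputs, which is why the scales can be dropped once they are folded into the weights.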
…ible
Gradle invokes this script via `bash`, which on stock macOS is bash 3.2 (no `mapfile`), failing :llmedge:generateNativeTargetNames with "mapfile: command not found" (exit 127). Replace the two `mapfile` calls with portable `while IFS= read -r` loops. Behavior-equivalent on bash 4+/Linux; unblocks the build on macOS.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a hint-based safetensors-conversion API (no sealed-variant ripple):
- ModelConversion {precision, adapter} + ConversionPrecision /
ConversionAdapter enums; ModelHints gains an optional `conversion`.
- ModelSpec.safetensors(repoId, precision, adapter) and
safetensorsLocal(path, ...) factories. The conversion is tagged into
the cache key (only when present, so existing keys are unchanged).
- DefaultModelRepository.resolve() short-circuits on the conversion
hint: returns a cached converted GGUF from <filesDir>/llmedge-converted,
else throws an actionable LLMEdgeException naming the exact
tools/safetensors-convert command + target path (on-device conversion
is Phase B2, deferred).
Verified: :llmedge:testDebugUnitTest model.* green — SafetensorsSpecTest
(4), ConvertedModelResolveTest (2), ModelPresetsTest (4).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
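A minimal Python sketch of the two behaviors in this commit: the cache key changes only when a conversion hint is present, and resolution returns a cached converted file or fails with an actionable message. All names (`cache_key`, `target_path`, `resolve_converted`, `ConversionRequired`) and the key and file-name formats are hypothetical; the real code is Kotlin and throws LLMEdgeException.

```python
import hashlib
import os


class ConversionRequired(Exception):
    """Stand-in for the actionable error raised when no converted GGUF exists."""


def cache_key(base_key, conversion=None):
    """Tag the conversion into the key only when present, so old keys are unchanged."""
    if conversion is None:
        return base_key
    precision, adapter = conversion
    return f"{base_key}|conv={precision}:{adapter or 'none'}"


def target_path(converted_dir, key):
    """Derive a file-system-safe name for the converted GGUF (assumed scheme)."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return os.path.join(converted_dir, digest + ".gguf")


def resolve_converted(converted_dir, key, precision):
    """Short-circuit: return the cached converted GGUF, else explain what to run."""
    target = target_path(converted_dir, key)
    if os.path.isfile(target) and os.path.getsize(target) > 0:
        return target
    raise ConversionRequired(
        f"No converted GGUF at {target}. Run the host converter in "
        f"tools/safetensors-convert with --precision {precision} and place "
        f"the output at that path."
    )
```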
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
IQ2_BN_R4 is a CPU row-interleaved repack that may not load on the app's GPU-preferring backend path (OpenCL/Vulkan -> CPU). Default the preset to the plain `bitnet1582b4t-iq2_bn.gguf` (same repo/size), which is portable; ik_llama can repack to R4 at runtime on CPU. The _r4 variant is documented as the pure-CPU alternative.
All four preset artifacts HEAD-verified (200). Re-verified: ModelPresetsTest 4/4 green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Makes scripts/build_native_linux.sh + the jni-desktop CMake build work on macOS (Apple Silicon), so the host E2E tests run without an emulator:
- build_native_linux.sh: replace Linux-only `nproc` with a portable core count; emit a `.dylib` alias next to each `.so` (System.loadLibrary maps names to .dylib on macOS, but the harness names libs .so).
- jni-desktop/CMakeLists.txt: force the smollm JNI target's suffix to .so on Apple so the harness finds it.
- README-macos-host-build.md: document the prereqs (Homebrew bash 5, cmake, ninja, gpatch as `patch`, a full JDK for JNI headers since the Android Studio JBR ships none) and the build+run commands.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Desktop-host E2E proving the bundled ModelChatTemplates.BITNET produces coherent output where the GGUF's embedded template does not. Verified on real native inference (host libsmollm + bitnet1582b4t-iq2_bn.gguf): with the preset template, "What is the capital of France?" -> "Paris"; with the embedded template the model emits only <|begin_of_text|>.
Gated via Assume on LLMEDGE_TEST_TEXT_MODEL_PATH (a bitnet gguf) + LLMEDGE_BUILD_NATIVE_LIB_PATH.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Begins the on-device native safetensors->GGUF converter. Reuse ruled out
(convert_hf_to_gguf.py hard-imports torch; impractical on-device), so it's
a C++ reimplementation, scoped v1 to ONE vertical slice: Llama + SentencePiece.
Plan (docs/.../2026-06-02-b2-native-converter-plan.md) follows the
ground-truth-oracle-first discipline: validate converted output by diffing
greedy tokens / logits against an official GGUF at matching precision (not
"non-empty"), and verify tensor-conversion and tokenizer-baking independently.
Layer 1: cpp/convert/safetensors_reader.{h,cpp} parses the safetensors
header (u64 len + JSON) and reads tensor bytes. Standalone clang++ test
(test_safetensors_reader.cpp) passes: round-trips F32/BF16 tensors,
shapes/dtypes/nbytes, data bytes, and malformed-input errors. Not yet wired
into the gradle/native build (added as later layers land).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
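The safetensors layout the reader parses (a little-endian u64 header length, a JSON header, then raw tensor bytes addressed by `data_offsets` relative to the end of the header) can be sketched in Python. This is an illustration of the file format, not a port of safetensors_reader.cpp; `parse_safetensors` and `build_safetensors` are hypothetical names.

```python
import json
import struct


def build_safetensors(name, dtype, shape, data):
    """Build a one-tensor safetensors blob (helper for the demo only)."""
    header = {name: {"dtype": dtype, "shape": shape, "data_offsets": [0, len(data)]}}
    raw = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw + data


def parse_safetensors(blob):
    """Return {name: (dtype, shape, bytes)} or raise ValueError on malformed input."""
    if len(blob) < 8:
        raise ValueError("truncated: missing 8-byte header length")
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    data_start = 8 + header_len
    if data_start > len(blob):
        raise ValueError("header length exceeds file size")
    header = json.loads(blob[8:data_start].decode("utf-8"))
    data_len = len(blob) - data_start
    tensors = {}
    for name, info in header.items():
        if name == "__metadata__":          # optional free-form metadata entry
            continue
        begin, end = info["data_offsets"]   # relative to the start of the data section
        if not 0 <= begin <= end <= data_len:
            raise ValueError(f"tensor {name!r} has out-of-range offsets")
        tensors[name] = (info["dtype"], info["shape"],
                         blob[data_start + begin:data_start + end])
    return tensors


blob = build_safetensors("w", "F32", [2], struct.pack("<2f", 1.0, 2.0))
dtype, shape, raw = parse_safetensors(blob)["w"]
```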
Instrumented (androidTest) E2E that loads BitNet b1.58 via the preset chat
template and generates ("capital of France" -> "Paris"), and runs SmolVLM2
vision analysis. Assume-gated on model files in the test app's external dir,
so it skips on CI without models. Intended for a real arm64 device.
Documents (in KDoc) the Apple-Silicon Android-emulator HVF PAC quirk: the
inference thread SIGILLs on a pointer-auth instruction inside the NDK's
prebuilt libc++ there; the same native build produces correct output on real
arm64 (verified via the desktop host JNI build).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cpp/convert/gguf_writer.{h,cpp}: hand-rolled GGUF v3 writer (no ggml dep, so
host-testable and identical on-device) supporting string/int/float/bool/array
KVs and tensors (ne in ggml order, aligned data section).
Verified cross-implementation: test_gguf_writer.cpp writes a sample GGUF, and
verify_gguf.py reads it with the canonical gguf-py reader — 7 KVs + 2 tensors,
shapes and data values confirmed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
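For orientation, the GGUF v3 container the writer targets can be sketched in Python: a magic and version, tensor and KV counts, the KVs, the tensor infos, then an aligned data section. This is a reduced illustration (string and uint32 KVs, F32 tensors only), not a port of gguf_writer.cpp; `write_gguf` is a hypothetical name and the 32-byte alignment is the format's default, assumed here.

```python
import struct

GGUF_MAGIC = b"GGUF"
GGUF_VERSION = 3
ALIGNMENT = 32                 # default alignment of the data section
T_UINT32, T_STRING = 4, 8      # GGUF metadata value type ids
GGML_TYPE_F32 = 0


def _gguf_string(text):
    """GGUF strings are a u64 byte length followed by UTF-8 bytes."""
    raw = text.encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw


def write_gguf(kvs, tensors):
    """kvs: [(key, str | int)], tensors: [(name, ne, f32_values)] with ne in ggml order."""
    out = bytearray(GGUF_MAGIC)
    out += struct.pack("<IQQ", GGUF_VERSION, len(tensors), len(kvs))
    for key, value in kvs:
        out += _gguf_string(key)
        if isinstance(value, str):
            out += struct.pack("<I", T_STRING) + _gguf_string(value)
        else:
            out += struct.pack("<II", T_UINT32, value)
    offset, payloads = 0, []
    for name, ne, values in tensors:
        out += _gguf_string(name) + struct.pack("<I", len(ne))
        out += struct.pack(f"<{len(ne)}Q", *ne)          # ne[0] is the fastest-varying dim
        out += struct.pack("<IQ", GGML_TYPE_F32, offset)  # offset is relative to data start
        data = struct.pack(f"<{len(values)}f", *values)
        data += b"\x00" * (-len(data) % ALIGNMENT)        # keep the next tensor aligned
        payloads.append(data)
        offset += len(data)
    out += b"\x00" * (-len(out) % ALIGNMENT)              # pad header up to the data section
    for data in payloads:
        out += data
    return bytes(out)


blob = write_gguf([("general.architecture", "llama"), ("llama.block_count", 2)],
                  [("w", [2], [1.0, 2.0])])
```

Reading the result back with an independent reader, as the commit does with gguf-py, is what guards against a writer and a test that agree on the same mistake.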
cpp/convert/hf_to_gguf.{h,cpp}: convert a Llama-arch HF safetensors dir to GGUF
tensors + llama.* hparams. Handles HF->GGUF tensor-name mapping, the llama.cpp
Q/K RoPE row permutation (the classic correctness gotcha), tied embeddings,
bf16/f16/f32->f16 (2-D) / f32 (1-D) conversion, and ggml ne ordering.
Ground-truth oracle (the discipline the whole plan hinges on): convert
SmolLM-135M and diff every tensor against the upstream convert_hf_to_gguf.py
output as fp32. Result: 272/272 tensors match (shapes + values, rtol/atol
2e-3) -- a wrong permutation or name map fails allclose, not silently.
Tokenizer baking (Layer 4) and JNI wiring still pending; this GGUF carries
tensors + hparams only.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
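The Q/K row permutation is easy to get subtly wrong, so a small Python sketch may help. It reorders rows per head from the HF "two halves" layout to the interleaved layout, which is the same index mapping as reshaping to (n_head, 2, head_dim/2), swapping the two middle axes, and flattening. `permute_qk_rows` and `allclose` are hypothetical helpers written for this example; for K under grouped-query attention the head count passed in would be the KV head count.

```python
def permute_qk_rows(rows, n_head):
    """Interleave the two halves of each head's rows: [a0..ak, b0..bk] -> [a0, b0, a1, b1, ...]."""
    if n_head <= 0 or len(rows) % (2 * n_head) != 0:
        raise ValueError("row count must be a multiple of 2 * n_head")
    head_dim = len(rows) // n_head
    half = head_dim // 2
    out = []
    for h in range(n_head):
        base = h * head_dim
        for i in range(half):
            out.append(rows[base + i])          # i-th row of the first half
            out.append(rows[base + half + i])   # i-th row of the second half
    return out


def allclose(a, b, rtol=2e-3, atol=2e-3):
    """Element-wise |a - b| <= atol + rtol * |b|, the usual allclose rule."""
    return len(a) == len(b) and all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


# Two heads of four rows each; row ids stand in for whole weight rows.
permuted = permute_qk_rows(list(range(8)), n_head=2)   # [0, 2, 1, 3, 4, 6, 5, 7]

# A converter that skipped the permutation would fail the oracle comparison.
assert not allclose([0.0, 1.0, 2.0, 3.0], permute_qk_rows([0.0, 1.0, 2.0, 3.0], 1))
```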
…boundary
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds tokenizer_bake.{h,cpp} and wires it into convert_llama_dir via a new
optional tokenizer_pre argument. For a HF byte-level BPE tokenizer this emits
the tokenizer.ggml.* KVs llama.cpp needs:
- model="gpt2", pre=<caller-supplied>
- tokens[vocab] (vocab.json inverted, verbatim byte-level form)
- token_type[vocab] (NORMAL, CONTROL for special added tokens)
- merges[] (space-joined string pairs, verbatim)
- bos/eos/unknown/padding_token_id (resolved by content via
tokenizer_config.json, falling back to special_tokens_map.json)
- add_space_prefix / add_bos_token
- tokenizer.chat_template
The `pre` field is caller-supplied and required (mirrors ModelHints.chatTemplate):
upstream derives it by hashing the real tokenizer's output over a probe string,
which v1 does not reimplement. A wrong/guessed pre loads silently and
mis-tokenizes, so an empty pre throws rather than defaulting.
Fail-loud guards: model.type must be "BPE"; merges must be space-joined strings
(the newer array-pair form is rejected); vocab ids must be contiguous 0..N-1.
Verified on SmolLM-135M, three ways:
- compare_tokenizer_kv.py: all 12 tokenizer KVs byte-match the upstream
convert_hf_to_gguf.py reference (tokens, token_type, merges, ids, flags,
chat_template).
- compare_gguf.py: 272/272 tensors still match (regression intact).
- llama-quantize loads the converted GGUF end-to-end through the real
llama.cpp loader (arch + full tokenizer + tensors) and requantizes it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
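The fail-loud guards can be sketched in Python. This is a simplified illustration, not a port of tokenizer_bake.cpp: `bake_bpe_tokenizer` is a hypothetical name, the input is assumed to be the parsed `model` object of a HF tokenizer.json (with `type`, `vocab`, `merges`), and added/special tokens, token types, and the id KVs are left out.

```python
def bake_bpe_tokenizer(model, pre):
    """Return a subset of the tokenizer.ggml.* KVs, or raise ValueError."""
    if not pre:
        # A guessed pre-tokenizer loads silently and mis-tokenizes, so refuse to default.
        raise ValueError("tokenizer pre is required and must be supplied by the caller")
    if model.get("type") != "BPE":
        raise ValueError("only BPE tokenizers are supported")

    vocab = model["vocab"]                       # token string -> id
    ids = sorted(vocab.values())
    if ids != list(range(len(ids))):
        raise ValueError("vocab ids must be contiguous 0..N-1")
    tokens = [None] * len(vocab)
    for token, idx in vocab.items():             # invert to id -> token string
        tokens[idx] = token

    merges = model["merges"]
    for merge in merges:
        # Accept only the "left right" string form; reject the newer [left, right] pairs.
        if not isinstance(merge, str) or len(merge.split(" ")) != 2:
            raise ValueError("merges must be space-joined string pairs")

    return {
        "tokenizer.ggml.model": "gpt2",
        "tokenizer.ggml.pre": pre,
        "tokenizer.ggml.tokens": tokens,
        "tokenizer.ggml.merges": list(merges),
    }


sample = {"type": "BPE", "vocab": {"b": 1, "a": 0, "ab": 2}, "merges": ["a b"]}
baked = bake_bpe_tokenizer(sample, pre="smollm")
```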
…lution
Makes ModelSpec.safetensors / safetensorsLocal actually convert on-device, end
to end, instead of throwing host-tool instructions.
Native (JNI bridge):
- smollm_jni_convert.cpp exposes
SmolLM.nativeConvertSafetensors(modelDir, outPath, tokenizerPre), a thin
marshalling layer over llmedge::convert::convert_llama_dir.
- convert/*.cpp (reader, writer, tokenizer_bake, hf_to_gguf) + the wrapper are
added to both the Android (llmedge-common.cmake SMOLLM_SOURCES) and desktop
(jni-desktop/CMakeLists.txt) smollm targets. nlohmann/json resolves via the
existing vendor include path on both.
Kotlin:
- SmolLM.convertSafetensorsToGguf(...) ensures the lib is loaded then calls the
static native method.
- ModelConversion gains tokenizerPre (folded into cacheToken); ModelSpec
.safetensors / .safetensorsLocal thread it through.
- DefaultModelRepository.resolveConvertedModel is now suspend and performs the
conversion: download the HF model dir (config.json + model.safetensors +
tokenizer files, flat-cached) or use the local dir, then convert into the
cache target. Three guards per review:
* fail fast (before any download) if tokenizerPre is absent — a text GGUF
without a baked tokenizer is not loadable;
* write to a temp path and rename only on success, so a mid-convert crash
never leaves a corrupt cached GGUF that passes the size check;
* catch UnsatisfiedLinkError (build without the converter) and fall back to
the actionable host-tool instructions.
Missing single-file model.safetensors (sharded/unsupported) -> clear error.
Verified end-to-end on host arm64 (B2ConvertE2ETest): resolve(
safetensorsLocal(SmolLM-135M, tokenizerPre="smollm")) runs the native converter,
caches a 270MB F16 GGUF under llmedge-converted/, loads it, and generates
"The capital of France is Paris!" — proving the baked GPT2-BPE tokenizer builds
a working vocab and the baked chat_template is used. Model-package unit tests
stay green (ConvertedModelResolveTest, SafetensorsSpecTest, ModelPresets, ...).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
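The "write to a temp path and rename only on success" guard is a general pattern; a minimal Python sketch follows. `convert_atomically` and the `.tmp` suffix are hypothetical, and `convert` stands in for the native converter call.

```python
import os


def convert_atomically(target, convert):
    """Run convert(tmp_path); publish to target only if it succeeds."""
    tmp = target + ".tmp"
    try:
        convert(tmp)               # may raise part-way through writing
        os.replace(tmp, target)    # atomic rename on the same file system
    finally:
        if os.path.exists(tmp):    # only true when convert failed before the rename
            os.remove(tmp)
    return target
```

A reader that checks only "does the cached file exist and have a plausible size" is then safe, because a partially written file never appears under the final name.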
The native converter writes F16; this adds the quantization step so
ModelConversion.precision other than F16 produces a smaller GGUF on-device —
the point of the feature for low-end devices.
Quantization is layered in the JNI wrapper (smollm_jni_convert.cpp), not in
convert/*.cpp, keeping the converter ggml/llama-free and host-testable. The
wrapper now takes a precision label: "f16" writes the converter output directly;
"q8_0" / "q4_k_m" / "iq2_bn" / "iq2_bn_r4" convert to a temp F16 GGUF and then
requantize it via llama_model_quantize (with llama_backend_init/free scoped
around the call, mirroring the upstream llama-quantize tool). The temp file is
always removed; a non-zero quantize code throws.
SmolLM.convertSafetensorsToGguf + nativeConvertSafetensors gain the precision
argument; DefaultModelRepository passes conversion.precision.ggufLabel.
Verified end-to-end on host arm64 (B2ConvertE2ETest, 2/2):
- F16: 270 MB GGUF -> "The capital of France is Paris!"
- Q4_K_M: 105 MB GGUF (2.6x smaller than the source safetensors) ->
"The capital of France!" (coherent text from the quantized model)
Model-package unit tests stay green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Bonsai adapter
Updates the B2 plan to reflect the full convert→bake→quantize→load→generate pipeline working on host arm64 (B2ConvertE2ETest: F16 "...Paris!", Q4_K_M coherent at 105MB). The only remaining piece is the on-device Bonsai QLinear fold adapter, which is verification-blocked (no local Bonsai model) and already covered offline by the B1 host path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n path
The quantize wrapper wrote a "<out>.f16.tmp" intermediate; previously std::remove only ran on the normal/return paths, so if convert_llama_dir or llama_model_quantize threw, the intermediate leaked (Kotlin's tmp.delete() only removes the outer temp). Wrap the convert+quantize in a try that removes the intermediate on every path before rethrowing.
Verified to parse + resolve includes/symbols under the Android NDK arm64 toolchain (cxx_std_20) alongside the other 4 convert TUs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
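The control flow of the fix (remove the F16 intermediate on every path, including exceptions) can be sketched in Python. This is an analogue of the C++ wrapper, not its code: `convert_with_precision` is a hypothetical name, and `convert_f16` and `quantize` are stand-in callables for convert_llama_dir and llama_model_quantize.

```python
import os


def convert_with_precision(out_path, precision, convert_f16, quantize):
    """F16 is written directly; other precisions go through a temp F16 file."""
    if precision == "f16":
        convert_f16(out_path)
        return
    intermediate = out_path + ".f16.tmp"
    try:
        convert_f16(intermediate)
        code = quantize(intermediate, out_path, precision)
        if code != 0:
            raise RuntimeError(f"quantize failed with code {code}")
    finally:
        # Runs on success, on a non-zero code, and when either step raises.
        if os.path.exists(intermediate):
            os.remove(intermediate)
```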
…or workaround)
The Android emulator on Apple-Silicon Macs mis-virtualizes ARM pointer-auth keys, so the auth instructions in the NDK prebuilt libc++/libunwind SIGILL at runtime. This tool NOPs every PAC sign/auth hint (paciasp/autiasp/auti*1716/...) and turns retaa/retab into plain ret, within executable sections only, whole-program so the result is internally consistent. It refuses (does not silently mangle) braa/blraa auth-branches.
Emulator-only dev aid: never ship a depacified .so; real arm64 devices virtualize PAC correctly.
Verified: depacifying the packaged libsmollm.so lets the B2 converter run on the Apple-Silicon emulator (read safetensors -> convert -> bake tokenizer -> quantize Q4_K_M -> valid 105MB GGUF, host-verified), past the prior PAC-at-load boundary.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
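The core rewrite can be sketched in Python over a raw buffer of A64 instructions. This is not the tool from this commit: `depacify` here is a hypothetical reduction that covers only the HINT-space PAC instructions and retaa/retab, takes the bytes of an executable section as given (no ELF parsing), and does not implement the braa/blraa refusal. The HINT numbers are the architectural ones; everything else is an assumption for illustration.

```python
import struct

NOP = 0xD503201F      # HINT #0
RET = 0xD65F03C0      # ret (x30)


def hint(n):
    """Encode HINT #n; the 7-bit immediate sits at bits [11:5]."""
    return NOP | (n << 5)


# pacia1716, pacib1716, autia1716, autib1716,
# paciaz, paciasp, pacibz, pacibsp, autiaz, autiasp, autibz, autibsp
PAC_HINTS = {hint(n) for n in (8, 10, 12, 14, 24, 25, 26, 27, 28, 29, 30, 31)}
AUTH_RETS = {0xD65F0BFF, 0xD65F0FFF}   # retaa, retab


def depacify(code):
    """Replace PAC hints with NOP and authenticated returns with plain ret."""
    if len(code) % 4 != 0:
        raise ValueError("A64 code must be a multiple of 4 bytes")
    out = bytearray(code)
    for off in range(0, len(code), 4):
        (insn,) = struct.unpack_from("<I", code, off)
        if insn in PAC_HINTS:
            struct.pack_into("<I", out, off, NOP)
        elif insn in AUTH_RETS:
            struct.pack_into("<I", out, off, RET)
    return bytes(out)


# paciasp; nop; autiasp; retaa  ->  nop; nop; nop; ret
patched = depacify(struct.pack("<4I", hint(25), NOP, hint(29), 0xD65F0BFF))
```

A word-by-word scan like this can also match data embedded in a code section, which is one reason the sketch is not a substitute for the real tool.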
Point submodule to 2a9eaf8 (was c4d77d8)
presets
The safetensors-conversion docs were stale ("on-device conversion is not yet
available (Phase B2)"); they predated the shipped B2 converter.
- docs/usage.md "Converting safetensors models": rewritten to describe on-device
  conversion (download model dir → convert → quantize → cache → load), the
  required `tokenizerPre` hint, the v1 scope (Llama arch + GPT2-BPE, single-file
  safetensors; Bonsai fold + other arches still host-only), and that it's
  verified end-to-end on a real arm64 device.
- README: added Features bullets for the low-end `ModelPresets` and on-device
  safetensors→GGUF conversion; fixed the BitNet note to link the usage section
  instead of the design doc.
- docs/index.md: added a Core Features bullet.
This pull request introduces significant improvements to low-end device support in llmedge by adding ready-to-use model presets for lightweight models and laying out a design for on-device safetensors-to-GGUF model conversion. The main focus is to make it easier for users to run efficient models on low-resource devices without manual setup, and to provide a clear path for converting and loading models not natively supported by the runtime.
Low-end model presets (Track A):
- Adds `ModelPresets` in Kotlin, exposing ready-to-use specs for Microsoft BitNet b1.58 2B4T (IQ2_BN, ~988 MB) and SmolVLM2-256M (vision, ~280 MB), optimized for low-end devices. These models are now easily accessible and do not require manual repo/filename entry. [1] [2] [3] [4]
- Extends `ModelHints` to include an optional `chatTemplate` field, ensuring BitNet loads with the correct template and produces well-formed output out of the box.

Safetensors → GGUF conversion (Track B):
- A `ModelSpec.safetensors(...)` API for seamless integration.

These changes make it much easier to run efficient, supported models on low-end Android devices and provide a clear path for supporting additional models through conversion.

Low-end model support:
- Adds `ModelPresets` for Microsoft BitNet b1.58 2B4T and SmolVLM2-256M, making them easily accessible for apps targeting low-end devices. [1] [2] [3] [4]
- Adds an optional `chatTemplate` to `ModelHints` and updates the load path to prefer the preset's template over the GGUF-embedded one (a caller-supplied template still wins).

Safetensors conversion pipeline:

Documentation updates:
- Updates `README.md` and usage docs to highlight new presets, clarify template handling, and document model sizes and limitations. [1] [2] [3] [4]