FLUX.2 Klein 4B image gen + Bonsai ternary low-end (sequential) #27
Merged
Add support for FLUX.2 Klein 4B, the distilled diffusion-transformer architecture behind PrismML's binary/ternary "Bonsai Image". Bonsai's own 1-bit/ternary weights ship only as MLX (Apple) and GemLite (CUDA) packings, neither of which loads on Android; this GGUF build via stable-diffusion.cpp is the Android-runnable equivalent.

FLUX.2 is a split model (separate diffusion transformer, Qwen3-4B text encoder, and VAE) rather than a single checkpoint, so the sdcpp JNI bridge gains diffusion_model_path + llm_path slots alongside the existing model_path/t5xxl_path. A new Flux2Klein helper + ImageGenerationRequest.splitDiffusionModel route the DiT to diffusion_model_path and the Qwen3 encoder to llm_path (model_path left empty) and offload weights to CPU.

- JNI: nativeCreate gains diffusionModelPath/llmPath (appended after preferredBackend in loadWithRuntimeBackend to preserve positional mock order)
- Kotlin: thread the two slots through the load request/support, runtime planner, and ImageClient request
- Flux2Klein: public preset (DiT + Qwen3-4B Q3_K_M encoder + VAE, CFG 1.0/4 steps)
- desktop CMake: compile the full sdcpp_jni_*.cpp split set so the host JNI library exports nativeCreate (enables host image/video E2E)
- tests: Flux2KleinLinuxE2ETest (real JNI split-model generation) + ImageClientTest split-routing assertion
- docs: README + docs/index.md

Verified on host (sd-cli reference + JNI E2E generate coherent images; full unit suite green). On-device viability is high-RAM-only and unverified; true low-end remains the deferred binary/ternary Bonsai conversion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
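The split-model routing described in this commit can be sketched roughly as follows. This is an illustrative Python model, not the Kotlin or JNI code from the PR: `LoadSlots`, `route_split_model`, the `vae_path` and `offload_to_cpu` field names, and the file names are all hypothetical. Only `model_path`, `diffusion_model_path`, and `llm_path` are slot names taken from the commit message.

```python
from dataclasses import dataclass


@dataclass
class LoadSlots:
    """Hypothetical stand-in for the path slots passed to the native loader."""
    model_path: str = ""            # single-checkpoint slot, left empty for FLUX.2
    diffusion_model_path: str = ""  # slot named in the commit message
    llm_path: str = ""              # slot named in the commit message
    vae_path: str = ""              # assumed name; the message only says "VAE"
    offload_to_cpu: bool = False    # assumed name for the CPU-offload switch


def route_split_model(dit: str, encoder: str, vae: str) -> LoadSlots:
    """Route each component of a split model to its own slot."""
    return LoadSlots(
        model_path="",              # no monolithic checkpoint to load
        diffusion_model_path=dit,   # DiT goes here
        llm_path=encoder,           # Qwen3 text encoder goes here
        vae_path=vae,
        offload_to_cpu=True,
    )


# Hypothetical file names, for illustration only.
slots = route_split_model("klein-dit.gguf", "qwen3-encoder.gguf", "vae.safetensors")
print(slots)
```

The point of the sketch is only that `model_path` stays empty while the DiT and the encoder travel through separate slots; the real request and loader types in the PR carry more fields than shown here.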
…e gen
Add scripts/convert_bonsai_flux2_to_bfl.py: converts PrismML Bonsai Image
"-unpacked" transformers (dense bf16, Flux2KleinPipeline diffusers naming)
into the BFL tensor layout stable-diffusion.cpp expects, so the QAT ternary
DiT can be quantized to GGUF and run on-device.
The mapping is mostly renames; the only structural op is concatenating the
separate double-block to_q/k/v (and add_{q,k,v}_proj) into the fused
*_attn.qkv tensors (raw byte concat, row-major, no numpy/safetensors dep).
169 -> 149 tensors, oracle-matched to the base FLUX.2 Klein GGUF layout.
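
Why a raw byte concat is enough here: for row-major tensors with the same column count, stacking along the first axis is the same as joining the underlying buffers. A minimal sketch, not the conversion script itself; it uses float32 via `struct` for convenience where the real tensors are bf16, and the helper names and the q, k, v order are assumptions for illustration:

```python
import struct


def pack_f32(rows):
    """Serialize a 2-D list as row-major little-endian float32 bytes."""
    return b"".join(struct.pack(f"<{len(r)}f", *r) for r in rows)


def unpack_f32(blob, cols):
    """Inverse of pack_f32 for a known column count."""
    flat = struct.unpack(f"<{len(blob) // 4}f", blob)
    return [list(flat[i:i + cols]) for i in range(0, len(flat), cols)]


def concat_rows(*tensors):
    """Stack row-major tensors along dim 0 by joining their raw bytes."""
    return b"".join(tensors)


# Toy 2x2 projections standing in for to_q / to_k / to_v weights.
q = [[1.0, 2.0], [3.0, 4.0]]
k = [[5.0, 6.0], [7.0, 8.0]]
v = [[9.0, 10.0], [11.0, 12.0]]

qkv_bytes = concat_rows(pack_f32(q), pack_f32(k), pack_f32(v))
qkv = unpack_f32(qkv_bytes, cols=2)   # fused 6x2 tensor
print(qkv == q + k + v)
```

The same identity would not hold for concatenation along the column axis, which needs the rows interleaved rather than the buffers appended.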
Verified end to end on the CPU JNI path (no emulator): convert -> sd
-M convert --type q2_K -> loads as "Flux.2 klein" -> generates a coherent
image at ~1.3 GB DiT (vs ~2.5 GB base Q4_0). ggml's literal ternary types
(tq1_0/tq2_0, ~0.8-1.0 GB) load and run on CPU but their per-256-weight
scale is too coarse for Bonsai's per-128 trained scales (degraded output);
Q2_K's per-16 sub-block scales preserve quality. README documents the recipe.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
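The scale-granularity argument above can be illustrated with a toy round trip. This is a sketch under stated assumptions, not ggml's tq1_0/tq2_0 or Q2_K code: it uses a plain absmax scale per block and a ternary grid, and only demonstrates that one scale per 256 weights cannot represent two 128-weight groups that were trained with different scales.

```python
def ternary_roundtrip(weights, block):
    """Quantize to {-s, 0, +s} with one absmax scale s per `block` weights,
    then dequantize. The absmax choice is an assumption of this toy."""
    out = []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        scale = max(abs(w) for w in chunk) or 1.0
        for w in chunk:
            q = max(-1, min(1, round(w / scale)))
            out.append(q * scale)
    return out


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Two 128-weight groups, each exactly ternary but with different scales.
group_a = [0.5 * t for t in (-1, 0, 1, 1) * 32]
group_b = [0.1 * t for t in (1, -1, 0, 1) * 32]
weights = group_a + group_b

err_128 = max_abs_error(weights, ternary_roundtrip(weights, 128))
err_256 = max_abs_error(weights, ternary_roundtrip(weights, 256))
print(err_128, err_256)
```

With per-128 scales both groups survive unchanged. With a single per-256 scale, the larger group sets the scale and the smaller group's weights all round to zero. Scales finer than the trained grouping, such as the per-16 sub-block scales mentioned above, avoid that collapse.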
Run the Qwen3 text encoder and the diffusion transformer in separate phases so peak RAM is max(encoder, DiT) instead of their sum, unlocking ~4 GB-RAM phones for Bonsai/FLUX.2 Klein image generation.

JNI:
- SdHandle.llm_ctx + try_create_llm_only_handle: an encoder-only (Qwen3) context (VERSION_FLUX2_KLEIN), mirroring the T5-only handle, for the precompute phase; routed in nativeCreate when only an llm path is given; freed in nativeDestroy.
- nativePrecomputeCondition: handle the llm_ctx (Qwen3) branch.
- Drop the image precompute nullptr shims (now real in stable-diffusion.cpp); remove the duplicate raw structs from sdcpp_jni_shared.h.

Kotlin:
- ImageGenerationRequest.sequential + Flux2Klein.imageRequest(sequential=).
- DiffusionRuntimeSpec.encoderOnly + loader routes it to llm_path only.
- StableDiffusionLoadSupport: encoder-only resolve branch (llm path only).
- ImageRuntimeRequestPlanner.imageSequentialPlan + ImageGenerationExecutor.generateSequential: phase 1 precompute on the encoder runtime, free it, phase 2 generate on the DiT-only runtime via the precomputed condition.

Bumps the stable-diffusion.cpp submodule to the precomputed-condition commit.

Verified on host (CPU JNI): encoder-only precompute (qwen3 2621MB, no DiT) -> DiT-only generate (flux 1298 + vae 95MB, no encoder) -> coherent image.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
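As a rough check of the max-versus-sum claim, the weight sizes from the host verification above can be plugged into a small calculation. This is a sketch that counts model weights only; compute buffers, activations, and the precomputed condition itself are ignored, and `peak_resident_mb` is a hypothetical helper.

```python
# Sizes in MB, taken from the host verification line in this commit message.
ENCODER_MB = 2621   # qwen3 text encoder
DIT_MB = 1298       # flux diffusion transformer
VAE_MB = 95


def peak_resident_mb(phases):
    """Peak over phases of the total weight size resident in that phase."""
    return max(sum(phase) for phase in phases)


# Joint loading: everything resident at once.
joint = peak_resident_mb([[ENCODER_MB, DIT_MB, VAE_MB]])

# Sequential loading: encoder alone, then freed before DiT + VAE load.
sequential = peak_resident_mb([[ENCODER_MB], [DIT_MB, VAE_MB]])

print(joint, sequential)  # 4014 2621
```

On these numbers the encoder phase dominates the sequential peak, so any further saving would have to come from the encoder rather than the DiT.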
prepare_sdcpp_mods bound a nameref named out_args_ref to the caller's nameref of the same name (a circular self-reference), so bash silently dropped its appends: SD_ROOT_OVERRIDE never reached cmake and the mods overlay was always bypassed. Bind a distinct nameref name and pass the underlying array name.

The mods/ overlay is also currently stale vs the pinned stable-diffusion.cpp submodule (won't compile), and the active sdcpp customizations now live in the submodule directly. So default the overlay OFF (opt in with LLMEDGE_SDCPP_USE_MODS=1 after refreshing it); the default build now uses the submodule directly and stays green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add Flux2Klein.bonsaiDiffusionModel pointing at the published Aatricks/bonsai-image-ternary-4B-FLUX2-klein-GGUF (Q2_K, ~1.3 GB) and a Flux2Klein.bonsaiImageRequest(...) convenience that defaults to sequential loading for ~4 GB-RAM devices. README documents the hosted model + sequential.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The submodule now carries the precomputed-condition (Lever 1) commit, which lives on the fork (leejet upstream doesn't have it). Repoint the submodule URL so the recorded commit is fetchable on a fresh clone.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On-device FLUX.2 Klein 4B image generation, plus PrismML Bonsai (ternary QAT) support tuned for low-end Android. Answers the "bonsai 1-bit image" GitHub issue (image half).
What's here
- diffusion_model_path + llm_path; Flux2Klein helper. ~6–8 GB phones.
- scripts/convert_bonsai_flux2_to_bfl.py converts Bonsai's diffusers transformer → BFL layout; quantize to Q2_K → coherent ~1.3 GB DiT (ggml ternary tq1_0/tq2_0 are too coarse for Bonsai's per-128 scales; Q2_K's per-16 sub-scales preserve quality). Published: Aatricks/bonsai-image-ternary-4B-FLUX2-klein-GGUF, wired into Flux2Klein.bonsaiImageRequest.
- sd_precompute_condition / sd_generate_image_with_precomputed_condition + FLUX.2 conditioner-skip + encoder-only Qwen3 handle; auto-orchestrated by ImageGenerationExecutor.
- Fixes the prepare_sdcpp_mods nameref bug (SD_ROOT_OVERRIDE never reached cmake); stale mods/ overlay now opt-in.

Verification (host JNI, no emulator)
- Flux2KleinLinuxE2ETest: base, Bonsai-Q2K, and sequential paths all generate coherent images on the real JNI.
- ImageClientTest asserts split-model arg routing; image unit suite green.

Notes
- Repoints the stable-diffusion.cpp submodule to a fork (Aatricks/stable-diffusion.cpp, branch llmedge-flux2-sequential) carrying the precomputed-condition API; upstream leejet doesn't have it.
- llmedge-examples demo-toggle edits are separate (not in this PR).