FLUX.2 Klein 4B image gen + Bonsai ternary low-end (sequential)#27

Merged
Aatricks merged 6 commits into main from feat/flux2-klein-image on Jun 3, 2026

@Aatricks Aatricks commented Jun 3, 2026

On-device FLUX.2 Klein 4B image generation, plus PrismML Bonsai (ternary QAT) support tuned for low-end Android. Answers the "bonsai 1-bit image" GitHub issue (image half).

What's here

  • FLUX.2 Klein 4B via stable-diffusion.cpp: split-model (DiT + Qwen3-4B encoder + VAE). New JNI slots diffusion_model_path + llm_path; Flux2Klein helper. ~6–8 GB phones.
  • Bonsai QAT → Q2_K: scripts/convert_bonsai_flux2_to_bfl.py converts Bonsai's diffusers transformer → BFL layout; quantize to Q2_K → coherent ~1.3 GB DiT (ggml ternary tq1_0/tq2_0 are too coarse for Bonsai's per-128 scales; Q2_K's per-16 sub-scales preserve quality). Published: Aatricks/bonsai-image-ternary-4B-FLUX2-klein-GGUF, wired into Flux2Klein.bonsaiImageRequest.
  • Sequential low-memory: run the Qwen3 encoder and the DiT in separate phases (precompute → free → generate) so peak RAM ≈ max(encoder, DiT) ≈ 2.6 GB instead of their sum (~4.0 GB), which brings 4 GB phones into range. New sdcpp sd_precompute_condition / sd_generate_image_with_precomputed_condition + FLUX.2 conditioner-skip + encoder-only Qwen3 handle; auto-orchestrated by ImageGenerationExecutor.
  • Build fix: prepare_sdcpp_mods nameref bug (SD_ROOT_OVERRIDE never reached cmake); stale mods/ overlay now opt-in.

Verification (host JNI, no emulator)

  • sd-cli + Flux2KleinLinuxE2ETest: base, Bonsai-Q2K, and sequential paths all generate coherent images on the real JNI.
  • ImageClientTest asserts split-model arg routing; image unit suite green.
  • Measured peaks: encoder-only 2621 MB → freed → DiT-only 1393 MB.
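The saving can be checked by hand from the measured figures. A minimal sketch, using only the two host measurements quoted above; the function names are illustrative and not part of the codebase, and the arithmetic ignores activations and other runtime overhead:

```python
# Peak-RAM arithmetic for concurrent vs. sequential loading.
# Figures are the host measurements quoted in this PR, in MB.
ENCODER_MB = 2621  # encoder-only (Qwen3) phase
DIT_MB = 1393      # DiT-only generate phase


def concurrent_peak(encoder_mb: int, dit_mb: int) -> int:
    """Both components resident at once: the peak is their sum."""
    return encoder_mb + dit_mb


def sequential_peak(encoder_mb: int, dit_mb: int) -> int:
    """Encoder is freed before the DiT loads: the peak is the larger phase."""
    return max(encoder_mb, dit_mb)


if __name__ == "__main__":
    print("concurrent:", concurrent_peak(ENCODER_MB, DIT_MB), "MB")  # 4014 MB
    print("sequential:", sequential_peak(ENCODER_MB, DIT_MB), "MB")  # 2621 MB
```

The sum of 4014 MB matches the ~4.0 GB figure in the description, and the maximum of 2621 MB matches the ~2.6 GB sequential peak.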

Notes

  • Bumps stable-diffusion.cpp submodule to a fork (Aatricks/stable-diffusion.cpp, branch llmedge-flux2-sequential) carrying the precomputed-condition API — upstream leejet doesn't have it.
  • TQ types load+run on CPU but Metal/Vulkan lack ternary kernels; low-end CPU path is the target.
  • llmedge-examples demo-toggle edits are separate (not in this PR).

Aatricks and others added 6 commits June 3, 2026 11:02
Add support for FLUX.2 Klein 4B — the distilled diffusion-transformer
architecture behind PrismML's binary/ternary "Bonsai Image". Bonsai's own
1-bit/ternary weights ship only as MLX (Apple) and GemLite (CUDA) packings,
neither of which loads on Android; this GGUF build via stable-diffusion.cpp
is the Android-runnable equivalent.

FLUX.2 is a split model (separate diffusion transformer, Qwen3-4B text
encoder, and VAE) rather than a single checkpoint, so the sdcpp JNI bridge
gains diffusion_model_path + llm_path slots alongside the existing
model_path/t5xxl_path. A new Flux2Klein helper + ImageGenerationRequest
.splitDiffusionModel route the DiT to diffusion_model_path and the Qwen3
encoder to llm_path (model_path left empty) and offload weights to CPU.

- JNI: nativeCreate gains diffusionModelPath/llmPath (appended after
  preferredBackend in loadWithRuntimeBackend to preserve positional mock order)
- Kotlin: thread the two slots through the load request/support, runtime
  planner, and ImageClient request
- Flux2Klein: public preset (DiT + Qwen3-4B Q3_K_M encoder + VAE, CFG 1.0/4 steps)
- desktop CMake: compile the full sdcpp_jni_*.cpp split set so the host
  JNI library exports nativeCreate (enables host image/video E2E)
- tests: Flux2KleinLinuxE2ETest (real JNI split-model generation) +
  ImageClientTest split-routing assertion
- docs: README + docs/index.md
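The split-model routing described above can be sketched as a small pure function. This is a Python illustration of the idea only, not the Kotlin code in this commit: `LoadArgs`, `split_model_args`, `vae_path`, `offload_to_cpu`, and the file names are hypothetical, while `model_path`, `diffusion_model_path`, and `llm_path` are the slot names from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadArgs:
    """Hypothetical stand-in for the arguments passed to the native loader."""
    model_path: str = ""            # single-checkpoint slot; stays empty for split models
    diffusion_model_path: str = ""  # DiT weights
    llm_path: str = ""              # Qwen3 text encoder
    vae_path: str = ""              # assumed slot name, not confirmed by the commit
    offload_to_cpu: bool = False    # assumed flag name, not confirmed by the commit


def split_model_args(dit: str, encoder: str, vae: str) -> LoadArgs:
    """Route each component of a split model to its own slot."""
    return LoadArgs(
        diffusion_model_path=dit,
        llm_path=encoder,
        vae_path=vae,
        offload_to_cpu=True,
    )


# Placeholder file names, for illustration only.
args = split_model_args("dit.gguf", "qwen3-encoder.gguf", "vae.safetensors")
```

Keeping the routing in one function makes it straightforward to assert in a unit test, in the spirit of the ImageClientTest split-routing assertion.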

Verified on host (sd-cli reference + JNI E2E generate coherent images; full
unit suite green). On-device use is unverified and expected to need a
high-RAM phone; true low-end support remains deferred to the binary/ternary
Bonsai conversion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e gen

Add scripts/convert_bonsai_flux2_to_bfl.py: converts PrismML Bonsai Image
"-unpacked" transformers (dense bf16, Flux2KleinPipeline diffusers naming)
into the BFL tensor layout stable-diffusion.cpp expects, so the QAT ternary
DiT can be quantized to GGUF and run on-device.

The mapping is mostly renames; the only structural op is concatenating the
separate double-block to_q/k/v (and add_{q,k,v}_proj) into the fused
*_attn.qkv tensors (raw byte concat, row-major, no numpy/safetensors dep).
169 -> 149 tensors, oracle-matched to the base FLUX.2 Klein GGUF layout.
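Why a raw byte concat is enough: in a row-major layout, stacking matrices along the first (row) axis is the same as appending their bytes. A minimal sketch, assuming the fused tensor stacks q, k, v in that order along the row axis. It uses float32 via `struct` for simplicity (the real tensors are bf16), and the helper names are illustrative rather than taken from the script:

```python
import struct


def pack_rows(rows):
    """Serialize a 2-D list as row-major little-endian float32 bytes."""
    return b"".join(struct.pack(f"<{len(row)}f", *row) for row in rows)


def unpack_rows(data, cols):
    """Inverse of pack_rows for a known column count."""
    flat = struct.unpack(f"<{len(data) // 4}f", data)
    return [list(flat[i:i + cols]) for i in range(0, len(flat), cols)]


def fuse_qkv(q_bytes, k_bytes, v_bytes):
    """Row-wise stacking of row-major tensors is plain byte concatenation."""
    return q_bytes + k_bytes + v_bytes


# Tiny 2x2 stand-ins for the separate to_q / to_k / to_v weights.
q_rows = [[1.0, 2.0], [3.0, 4.0]]
k_rows = [[5.0, 6.0], [7.0, 8.0]]
v_rows = [[9.0, 10.0], [11.0, 12.0]]

fused = fuse_qkv(pack_rows(q_rows), pack_rows(k_rows), pack_rows(v_rows))

# Reading the fused buffer back as a 6x2 matrix yields q, k, v stacked by rows.
assert unpack_rows(fused, cols=2) == q_rows + k_rows + v_rows
```

Each fusion replaces three tensors with one, so the drop from 169 to 149 tensors would correspond to ten fusions if nothing else changes the count (an inference, not something the commit states).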

Verified end to end on the CPU JNI path (no emulator): convert -> sd
-M convert --type q2_K -> loads as "Flux.2 klein" -> generates a coherent
image at ~1.3 GB DiT (vs ~2.5 GB base Q4_0). ggml's literal ternary types
(tq1_0/tq2_0, ~0.8-1.0 GB) load and run on CPU but their per-256-weight
scale is too coarse for Bonsai's per-128 trained scales (degraded output);
Q2_K's per-16 sub-block scales preserve quality. README documents the recipe.
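The scale-granularity argument can be illustrated with a toy quantizer. The sketch below is not ggml's tq1_0/tq2_0 or Q2_K format (Q2_K also stores per-sub-block minimums and 2-bit codes); it is a plain absmax ternary round-trip whose only variable is the number of weights sharing one scale. The weights are synthetic, with one scale per group of 128 as described for Bonsai:

```python
GROUP = 128  # Bonsai's trained scale granularity, per the commit message


def make_weights(group_scales):
    """Synthetic ternary weights {-s, 0, +s} with one scale s per group of 128."""
    pattern = (-1.0, 0.0, 1.0)
    return [s * pattern[i % 3] for s in group_scales for i in range(GROUP)]


def ternary_roundtrip(weights, block):
    """Quantize to {-1, 0, +1} times one absmax scale per block, then dequantize."""
    out = []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        scale = max(abs(w) for w in chunk)
        if scale == 0.0:
            out.extend(0.0 for _ in chunk)
            continue
        out.extend(scale * max(-1, min(1, round(w / scale))) for w in chunk)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Four groups with deliberately different scales (arbitrary illustrative values).
weights = make_weights([0.02, 0.10, 0.03, 0.08])
err_256 = mean_abs_error(weights, ternary_roundtrip(weights, 256))
err_16 = mean_abs_error(weights, ternary_roundtrip(weights, 16))

print("block 256 error:", err_256)
print("block 16 error: ", err_16)
```

With one scale per 256 weights, each block spans two trained groups and the smaller-scale group is rounded away. With one scale per 16 weights, every block sits inside a single group and the toy round-trip is exact.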

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Run the Qwen3 text encoder and the diffusion transformer in separate phases so
peak RAM is max(encoder, DiT) instead of their sum, unlocking ~4 GB-RAM phones
for Bonsai/FLUX.2 Klein image generation.

JNI:
- SdHandle.llm_ctx + try_create_llm_only_handle: an encoder-only (Qwen3)
  context (VERSION_FLUX2_KLEIN), mirroring the T5-only handle, for the
  precompute phase; routed in nativeCreate when only an llm path is given;
  freed in nativeDestroy.
- nativePrecomputeCondition: handle the llm_ctx (Qwen3) branch.
- Drop the image precompute nullptr shims (now real in stable-diffusion.cpp);
  remove the duplicate raw structs from sdcpp_jni_shared.h.

Kotlin:
- ImageGenerationRequest.sequential + Flux2Klein.imageRequest(sequential=).
- DiffusionRuntimeSpec.encoderOnly + loader routes it to llm_path only.
- StableDiffusionLoadSupport: encoder-only resolve branch (llm path only).
- ImageRuntimeRequestPlanner.imageSequentialPlan + ImageGenerationExecutor
  .generateSequential: phase 1 precompute on the encoder runtime, free it,
  phase 2 generate on the DiT-only runtime via the precomputed condition.
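The two-phase flow can be sketched as follows. This is a Python illustration of the ordering only, not the Kotlin implementation: `FakeRuntime`, `generate_sequential`, and the string payloads are hypothetical stand-ins, and the event log exists just to make the order observable.

```python
class FakeRuntime:
    """Hypothetical stand-in for a loaded native runtime; records what happens."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.log.append(f"load {name}")

    def precompute(self, prompt):
        self.log.append("precompute")
        return f"condition({prompt})"

    def generate(self, condition):
        self.log.append("generate")
        return f"image[{condition}]"

    def close(self):
        self.log.append(f"free {self.name}")


def generate_sequential(prompt, load_encoder, load_dit):
    """Phase 1: encode and free the encoder. Phase 2: load the DiT and generate."""
    encoder = load_encoder()
    try:
        condition = encoder.precompute(prompt)
    finally:
        encoder.close()  # released before the DiT is loaded, even on failure
    dit = load_dit()
    return dit.generate(condition)


log = []
image = generate_sequential(
    "a bonsai tree",
    load_encoder=lambda: FakeRuntime("encoder", log),
    load_dit=lambda: FakeRuntime("dit", log),
)
```

The property that matters is that "free encoder" appears in the log before "load dit", which is what keeps the peak at the larger of the two phases rather than their sum.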

Bumps the stable-diffusion.cpp submodule to the precomputed-condition commit.
Verified on host (CPU JNI): encoder-only precompute (qwen3 2621MB, no DiT) ->
DiT-only generate (flux 1298 + vae 95MB, no encoder) -> coherent image.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

prepare_sdcpp_mods bound a nameref named out_args_ref to the caller's nameref
of the same name (a circular self-reference), so bash silently dropped its
appends: SD_ROOT_OVERRIDE never reached cmake and the mods overlay was always
bypassed. Bind a distinct nameref name and pass the underlying array name.

The mods/ overlay is also currently stale vs the pinned stable-diffusion.cpp
submodule (won't compile), and the active sdcpp customizations now live in the
submodule directly. So default the overlay OFF (opt in with
LLMEDGE_SDCPP_USE_MODS=1 after refreshing it) — the default build now uses the
submodule directly and stays green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add Flux2Klein.bonsaiDiffusionModel pointing at the published
Aatricks/bonsai-image-ternary-4B-FLUX2-klein-GGUF (Q2_K, ~1.3 GB) and a
Flux2Klein.bonsaiImageRequest(...) convenience that defaults to sequential
loading for ~4 GB-RAM devices. README documents the hosted model + sequential.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The submodule now carries the precomputed-condition (Lever 1) commit, which
lives on the fork (leejet upstream doesn't have it). Repoint the submodule URL
so the recorded commit is fetchable on a fresh clone.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Aatricks Aatricks self-assigned this Jun 3, 2026
@Aatricks Aatricks merged commit 7d3121a into main Jun 3, 2026
1 check passed
@Aatricks Aatricks linked an issue Jun 3, 2026 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

bonsai 1 bit image and LLM support Microsoft lens also
