Hydralisk is the standalone OpenAgents Python/NVIDIA inference lane. It owns conventional serving work such as vLLM, SGLang, TensorRT-LLM, CUDA host runbooks, model profiles, smoke tests, and public-safe receipts.
Current Khala lane:
- served model:
openai/gpt-oss-20b - internal alias:
khala - public alias:
openagents/khala - compatibility aliases:
openagents/khala-oss-20b,gpt-oss-20b - engine: vLLM
- first host class: one NVIDIA L4
- proxy port:
127.0.0.1:8012 - raw vLLM port:
127.0.0.1:8000
The proxy exposes public-safe health, capabilities, receipt lookup, and bearer-authenticated Chat Completions forwarding. Raw vLLM stays localhost-only.
uv sync --extra dev
uv run pytestRun the proxy against a local vLLM server:
export HYDRALISK_BEARER_TOKEN=local-dev-token
uv run hydralisk-proxy --host 127.0.0.1 --port 8012Smoke it:
uv run hydralisk-smoke \
--base-url http://127.0.0.1:8012 \
--bearer-token "$HYDRALISK_BEARER_TOKEN" \
--model openai/gpt-oss-20bUse docs/gce-l4-vllm-runbook.md for the GCE L4 setup, systemd services, start/stop, rollback, and public-safe evidence rules.
Hydralisk exists beside Psionic, not inside it. Psionic is the Rust-native ML substrate where we build the runtime ourselves. Hydralisk is the pragmatic Python stack for environments where conventional NVIDIA-serving practice is the fastest honest path.
Hydralisk must not claim to be Psionic-native, an admitted Pylon payout lane, or OpenAgents product authority. Pricing, credits, routing, settlement, payout, public copy, and product-promise promotion remain in the product repos.
Initial targets:
gpt-oss-20bon L4 with vLLM for the first cheap internal dogfood lane.gpt-oss-120bon H100/H200/B200/G4-class high-memory GPUs with vLLM.- GLM-5.2 first as a hosted baseline, then as a high-memory self-hosting
campaign. The current accessible-hardware target is
0xSero/GLM-5.2-504BREAP/NVFP4 on 4 x GCE G4 RTX PRO 6000 with the b12x/vLLM SM120 recipe; the olderzai-org/GLM-5.2-FP8SGLang G4 profile remains a blocked FP8 evidence lane, not the REAP serving plan. As of 2026-06-25, the first REAP lane has a private Hydralisk proxy on the admitted G4 fallback host, with raw vLLM still bound to localhost, bearer auth required, fail-closed profile evidence checks, and GLM sampler defaults injected by the proxy. The tuned speed envelope is 250K context,max_num_seqs=2,max_num_batched_tokens=4096, and MTP-2 speculative decoding with defaultmin_pomitted for vLLM compatibility; two concurrent full-250K requests are not admitted. See docs/evidence/2026-06-25-glm-52-reap-504b-mtp2-speed-gate.md. The first lane also has a Worker-reachable authenticated HTTPS origin shape for Khala arming, with the concrete URL and bearer token kept out of tracked files: docs/evidence/2026-06-25-glm-52-reap-504b-public-https-origin.md. Durability is admitted as a Spot auto-restart canary: the boot disk is preserved, host services are enabled, keep-warm units are installed for post-benchmark use, and Cloud Scheduler triggers a Cloud Run watchdog that conditionally starts the VM after STOP: docs/evidence/2026-06-25-glm-52-reap-504b-durable-canary.md. A second independent 4 x G4 Spot endpoint was admitted on 2026-06-25 while the first lane was reserved for the Harbor Terminal-Bench run. It uses the same MTP-2/no-min_pprofile, a cloned model disk, an authenticated HTTPS origin, a distinct watchdog, host-local keep-warm, and singleflight admission. Its warmed median proxy benchmark is about0.281sTTFT and46.7completion tok/s including TTFT for 160-token streamed outputs; same-endpoint concurrency still admits one request and rejects the other with 429. See docs/evidence/2026-06-25-glm-52-reap-504b-second-endpoint.md. The proxy now exposes public-safe replica routing metadata for Khala pool selection: stablereplicaRef/profileRef, inflight and 429/backpressure counters, keep-warm summary status, lifecycle/provisioning class, and reserved/draining flags without endpoint URLs, IPs, bearer tokens, prompts, responses, weights, or raw logs: docs/evidence/2026-06-25-glm-52-reap-504b-replica-routing-metadata.md. A repeatable replica provisioner now wraps the second-endpoint bring-up path: G4 admission, same-zone model-disk clone or download staging, read-only mount verification, vLLM launch, proxy/HTTPS setup, distinct watchdog/keep-warm resources, public-safe evidence, and cleanup instructions: docs/evidence/2026-06-25-glm-52-reap-504b-replica-provisioner.md. Multi-replica capacity should be modeled as one warmed 4 x G4 replica per fast interactive slot, with Spot reserved for cheap interruptible burst and DWS/reservation/on-demand procurement used for the durable Khala floor. See docs/evidence/2026-06-25-glm-52-reap-504b-capacity-plan.md. The private lane is now operator-hardened with a raw vLLM Docker restart policy, a systemd-managed private proxy, public-safe metrics, durable model and cache paths, and a stop/start recovery runbook in docs/evidence/2026-06-24-glm-52-reap-504b-operator-hardening.md. The consolidated runbook and public-safe integration receipt mark the lane asprivate_canary, not a public endpoint or product SLA: docs/glm-5.2-reap-504b-g4-runbook.md. - DeepSeek-V4-Flash as a Google GPU admission experiment: G4 capacity was
admitted on 2026-06-24 with 2 x RTX PRO 6000. The current blocker is now
past the original vLLM
0.23.0Blackwell FP8 scaled-mm failure: direct CUTLASS FP8 cases still fail, Triton block FP8 works after E8M0 scales are upcast, and localo_projRHS rank/scale hotpatches still stop in DeepGEMM before/v1/models. A clean provider-guided vLLM/DeepGEMM container also builds and imports successfully on the G4 host, then fails before readiness on the original CUTLASSdispatch_scaled_mmpath. The published-recipe GCE probe can see H100/H200/B200/GB200 catalog entries and machine types, but this project currently exposes only L4 regional GPU quota. The available Google lane today is therefore a custom RTX PRO 6000 kernel/offload path unless we obtain H100/H200/B200/GB200 quota. The NVFP4 Blackwell variant (nvidia/DeepSeek-V4-Flash-NVFP4) also builds and imports on the G4 host, but stock vLLM rejects every tested NVFP4 MoE backend before readiness; the remaining G4 path starts at the FlashInfer TRTLLM NVFP4 device gate. A default-off SM120 gate patch advances the two-card G4 probe into vLLM startup withflashinfer_trtllm, but the private-only host now blocks on Hugging Face artifact access before weight load. Cloud NAT for thedefaultus-central1subnet fixed private config/artifact egress without restoring a VM external IP, and the rerun advanced into real model load withFLASHINFER_TRTLLM; it then stalled in the Hugging Face Xet / vLLM load path before/v1/models. Disabling Xet for the next private G4 run made snapshot acquisition deterministic enough to expose the current blocker again: non-expert FP8 layers hit vLLM's CUTLASSdispatch_scaled_mmpath on SM120 before/v1/models. Forcing vLLM's dense FP8 linear backend totritonwith the derived-image E8M0 upcast patch removed that CUTLASS blocker. The active blocker is now DeepSeek V4's NVIDIAo_projDeepGEMMfp8_einsumscale-factor layout assertion before readiness; an explicit invalid zero-o_projload-only bypass then moves to a FlashInfer TRTLLM NVFP4 MoE GEMM runtime failure on the same SM120 host. A synthetic FlashInfer repro now reproduces that MoE GEMM failure without loading DeepSeek weights, Hugging Face artifacts, prompts, or vLLM scheduling. The newer FlashInfer B12x SM12x path does run DeepSeek-like synthetic MoE shapes on RTX PRO 6000 when all experts are local, but it rejects expert parallelism (num_local_experts != num_experts). That makes the two-card G4 lane a compatibility research lane, not a near-serving stock-vLLM path, unless we use a wider no-EP G4 shape, add B12x expert parallelism, or build the custom offload/prefetch path. The eight-card G4 full-model attempt then proved B12x is rejected for DeepSeek-V4 because it lacks the model's required SwiGLU clamp. A clamp-backend sweep on the same private 8 x G4 host removedflashinfer_cutlassfor the same reason and advancedflashinfer_trtllminto expert-parallel startup with 32 local experts per rank, where it now blocks in DeepGEMMo_projscale-factor layout handling. The next useful G4 issue is therefore a correctness-first DeepSeek V4o_projfallback or scale-factor layout fix that preserves the TRTLLM MoE path. That fallback now exists as a default-offbf16_einsumprobe path and moves the full model pasto_projon all eight ranks. The active blocker is now the FlashInfer TRTLLM NVFP4 MoE GEMM itself on RTX PRO 6000:trtllm_batched_gemm_runner.cu:286,numBatches=32, andGemmMNK=512 4096 4096. That exact full-model MoE shape is now reproduced synthetically without weights or vLLM scheduling, so stockflashinfer_trtllmis no longer a near-serving G4 lane by wrapper changes alone. The follow-up B12x viability probe shows the only positive SM120 MoE path is also not ready as-is: FlashInfer B12x has noswiglu_limitclamp surface, rejects the exact32 / 256expert-parallel shard, and only runs the DeepSeek-like synthetic shape when all 256 experts are local. The next G4 work must be real kernel/scheduler work: B12x clamp, B12x expert parallelism/offload, or a SGLang-style expert repack plus prefetch lane. A follow-up local-shard remap probe proved B12x can run the exact per-rank shard when global expert IDs are remapped to a local 32-expert domain (globalNumExperts=256,kernelNumExperts=32,localNumExperts=32). That leaves clamp semantics plus dispatcher/offload correctness as the next real implementation step. Hydralisk now has a pure-Python local-shard reference fixture for that boundary: DeepSeek/vLLM SwiGLU clamp, global-to-local expert remap, nonlocal expert skipping, and deterministic nonzero routed output. A live wrapper-surface probe then foundB12xMoEWrapperin the installed FlashInfer0.6.12image, but it only exposesnum_local_experts; it lacks bothlocal_expert_offsetandswiglu_limit, so the G4 path still needs a wrapper upgrade or a Hydralisk-local B12x dispatcher/clamp shim before any full-model retry. A matched FlashInfer nightly upgrade (flashinfer-python,flashinfer-cubin, andflashinfer-jit-cache) reached0.6.13.dev20260612on the same G4 host, but the B12x wrapper surface was unchanged for our blocker: nolocal_expert_offset, noswiglu_limit, and the direct256 / 32expert-parallel call still rejects before launch. The next G4 issue should therefore build the Hydralisk-local dispatcher/clamp shim against the reference fixture. Hydralisk now has the dispatcher half of that shim: fixed-shape global-to-local expert remap, zero-scale masking for nonlocal/out-of-range routes, reference-equivalence tests on nonzero inputs, and a fail-closed gate for missing DeepSeekswiglu_limitsupport. The live B12x kernel also accepts that dispatcher-shaped masked local-domain input on RTX PRO 6000 (maskedRouteCount=1536,outShape=[512,4096]). The source audit now maps that remaining blocker to exact FlashInfer B12x SM120 patch points: API surface,launch_sm120_moe, and the fused gated-SiLU activation path. Hydralisk now also has a repeatable FlashInfer B12x clamp overlay that dry-runs against the local reference checkout and source-marks the static, micro, dynamic, and W4A16 activation sites for the next G4 compile/runtime fixture. A disposable G4 container then converted the static marker into real CuTe/CUTLASS clamp ops and ran both zero and nonzero tiny B12x fixtures withswiglu_limit=10.0on RTX PRO 6000. The static clamp path is now a real GPU proof. The follow-up dynamic fixture then patchedmoe_dynamic_kernel.pyand ran the 512-token DeepSeek-shaped masked local-shard case (kernelNumExperts=32,globalNumExperts=256,topK=6) with finite nonzero output. The clamp-patched B12x full-model image now builds, imports on all eight G4 GPUs, and starts vLLM withmoe_backend=flashinfer_b12x. With the existingbf16_einsumo_projfallback enabled, execution moves past the oldo_projDeepGEMM blocker and stops in DeepSeek MLA attention metadata during vLLM cudagraph memory profiling:get_paged_mqa_logits_metadatafails withattention.hpp:219: Unsupported architectureon SM120. Enabling vLLM eager mode avoids that cudagraph profiling path and brings the full model to a live/v1/modelsendpoint on the same 8 x G4 host, but the first public-safe generation smoke fails inflash_mla_sparse_fwdbecause the sparse prefill kernel only admits SM90a and SM100f. The remaining work is therefore an SM120-safe DeepSeek FlashMLA sparse-prefill backend or a correctness-first prefill fallback before any generation or serving claim. A source audit then found a better next probe already in vLLM: explicitFLASHINFER_MLA_SPARSE_DSV4backend selection routes DeepSeek V4 toDeepseekV4FlashInferMLAAttention, which avoidsflash_mla_sparse_fwdand calls FlashInfer's TRTLLM sparse MLA launcher instead. Hydralisk now exposes that asVLLM_ATTENTION_BACKEND=FLASHINFER_MLA_SPARSE_DSV4for the next G4 smoke. The first auth-cleared run found a wrapper mismatch: the selectedbf16_einsumo_projfallback requiresHYDRALISK_DEEPSEEK_O_PROJ_RECIPE=hopper, so the DSV4 wrapper now defaults that recipe. The corrected run reached/v1/models, then the tiny generation smoke failed inflashinfer.mla._core.trtllm_batch_decode_sparse_mla_dsv4withTllmGenFmhaRunnerreportingUnsupported architecturefrom FlashInfer's TRTLLM FMHA runner. A direct one-token synthetic DSV4 FMHA repro now confirms the same guard without model weights, prompts, vLLM scheduling, B12x MoE, oro_proj. The installed FlashInfer package defineskSM_120, but this TRTLLM-gen FMHA path guards to SM100/SM103, its compatibility helper has no SM120 special case, and its installed FMHA cubin inventory has zero SM120 cubins. This is not a safe one-line allowlist bug. The next useful G4 step is either SM120-built DSV4 FMHA cubins plus dispatch metadata or a correctness-first DeepSeek V4 attention fallback for SM120, not another full-model flag trial. Hydralisk now has that fallback's local oracle:reference_sparse_mla_decodecovers the issue #52 sparse MLA shape family with deterministic top-k masking, sequence-length truncation, HND KV cache handling, empty-route zero output, and stable softmax. The remaining step is wiring that contract into a derived vLLM/container fallback and rerunning the synthetic shape before another full-model smoke. Thehydralisk-deepseek-v4-sparse-mla-smokeentry point now runs the exact issue #52-sized fallback shape locally with finite nonzero output, and the GCE wrapper can inject that smoke into a target Docker image; the latest run recordedtarget_missingbecause there is no live DeepSeek G4 host. The vLLM patcher now dry-runs cleanly against the realDeepseekV4FlashInferMLAAttention._forwardsource and adds a default-offHYDRALISK_DEEPSEEK_SPARSE_MLA_FALLBACK=1branch before the missing FlashInfer DSV4 FMHA calls. The patched-vLLM container smoke wrapper is also ready: it patches the installed vLLM source inside the target image and runs the issue #52 tensor shape with torch tensors. The live requirement is now a full 8 x G4 model retry: a bounded one-GPU G4 spot target successfully ran the patched-vLLM synthetic smoke on real RTX PRO 6000 hardware withHYDRALISK_DEEPSEEK_SPARSE_MLA_FALLBACK=1, vLLM0.23.0, Torch2.11.0+cu130, CUDA13.0, finite nonzero[1,64,512]output, and GPU memory back to0 MiBafter the run. The full 8 x G4 retry then built the derived provider image, confirmed all eight SM120 GPUs, and loaded the full model far enough to enter vLLM memory/profile initialization. That moved past the earlier B12x clamp,o_proj, and DSV4 FMHA blockers. It still did not reach/v1/models: the current blocker is tensor-parallel logits all-gather failing through NCCL with CUDA failure 800operation not permittedon the PCIe-only G4 topology. The next useful issue is an 8-rank Torch/NCCL all-gather fixture under the same Docker/runtime envelope, then safe NCCL transport toggles if the fixture reproduces the error. Issue #60 proved the all-gather path itself passes on the same 8 x G4 Docker/runtime envelope, then patched the remaining SM120 DeepGEMM sparse indexer/metadata blockers behindHYDRALISK_DEEPSEEK_INDEXER_SWA_ONLY=1. The derived image now reaches/v1/modelsand completes a public-safe/v1/chat/completionssmoke fornvidia/DeepSeek-V4-Flash-NVFP4on the same 8 x RTX PRO 6000 host. This is an MVP execution proof, not a production serving claim: it is capped atmax_model_len=2048, one sequence, and SWA-only sparse attention without quality or throughput gates. Issue #61 vectorized the sparse MLA fallback's cache gather and attention math. The same G4 lane now reaches warmed 32-token streaming with about0.317sTTFT and11.2 tok/sdecode, up from roughly13.1sTTFT and0.89 tok/s. That makes it worth a Khala readiness gate, but not a Khala serving promise: startup, first warmup, concurrency, long context, and quality still need to pass. Issue #62 added that first resident-server timing gate. The v3 image passed five repeated warmed streaming requests with0.289sTTFT p95 and11.3 tok/sdecode p50. DeepSeek V4 Flash is now a Khala integration candidate, but quality, longer output/context, concurrency, and the SWA-only sparse-indexer bypass still block a serving claim. Issue #63 added a runtime-supplied public-safe quality gate. Three tiny deterministic cases passed without committing raw prompts or responses, and the same resident timing gate still passed. This clears the first quality smoke only: two tiny nonstream quality completions still took roughly10sand18s, and longer output/context plus concurrency remain unproven. Issue #64 added minimum prompt/completion token thresholds plus uncounted streaming warmups. A 1,796-token prompt with two measured 160-token streamed outputs passes after one long streaming prewarm: TTFT p950.207s, decode p5011.1 tok/s, and end-to-end p5011.0 tok/s. Without that streaming prewarm, the first long stream still pays roughly10.8sTTFT. Issue #65 added a measured concurrency mode and ranmax_num_seqs=2with two concurrent streamed requests. The server admitted the configuration and completed both requests, but the gate failed: decode p50 fell to3.0 tok/s, end-to-end p50 to2.5 tok/s, and one request waited13.7sfor first token. The current G4 lane is therefore single-flight/prewarmed canary material only, not a shared Khala serving lane. Issue #66 added that explicit canary envelope to the Hydralisk proxy:HYDRALISK_MAX_INFLIGHT_REQUESTS=1fail-closes saturated traffic with HTTP 429, holds the slot through full streaming responses, and publishes admission metadata in capabilities and receipts. This makes the current DeepSeek lane enforceably single-flight; it does not fix true concurrency.
Hydralisk should produce public-safe capability and run receipts for Khala and OpenAgents to consume. It should not own pricing, credits, payout, referral, customer routing, or public product promises.
The first runtime scaffold has landed: an authenticated Hydralisk proxy for
openai/gpt-oss-20b, systemd units for a one-L4 vLLM host, and a GCE runbook.
Live host promotion still requires installing the repo on a fresh or explicitly
reused L4 VM, setting the bearer token out-of-band, smoking the proxy, and
publishing the HTTPS origin to OpenAgents.
The design anchor lives in the OpenAgents inference docs:
openagents/docs/inference/2026-06-23-hydralisk-python-nvidia-inference-stack.md
First execution roadmap:
docs/gpt-oss-20b-khala-live-roadmap.mddocs/gce-l4-vllm-runbook.mddocs/glm-5.2-sglang-preflight-runbook.mddocs/glm-5.2-reap-504b-g4-runbook.mddocs/evidence/2026-06-24-glm-52-reap-504b-profile.mddocs/evidence/2026-06-24-glm-52-reap-504b-g4-admission.mddocs/evidence/2026-06-24-glm-52-reap-504b-staging.mddocs/evidence/2026-06-24-glm-52-reap-504b-load-smoke.mddocs/evidence/2026-06-24-glm-52-reap-504b-private-endpoint.mddocs/evidence/2026-06-24-glm-52-reap-504b-tuning.mddocs/evidence/2026-06-24-glm-52-reap-504b-terminal-bench-20.mddocs/evidence/2026-06-24-glm-52-reap-504b-fallback-matrix.mddocs/evidence/2026-06-24-glm-52-reap-504b-operator-hardening.mddocs/evidence/2026-06-24-glm-52-reap-504b-integration-receipt.jsondocs/evidence/2026-06-24-glm-52-reap-504b-tracking-closure.mddocs/evidence/2026-06-25-glm-52-reap-504b-khala-canary-status.mddocs/evidence/2026-06-25-glm-52-reap-504b-mtp2-speed-gate.mddocs/evidence/2026-06-25-glm-52-reap-504b-replica-routing-metadata.mddocs/evidence/2026-06-25-glm-52-reap-504b-replica-provisioner.mddocs/evidence/2026-06-25-glm-52-reap-504b-capacity-plan.mddocs/deepseek-v4-flash-gce-preflight.mddocs/evidence/2026-06-24-deepseek-v4-flash-gce-load-smoke.mddocs/evidence/2026-06-24-deepseek-v4-flash-g4-backend-matrix.mddocs/evidence/2026-06-24-deepseek-v4-flash-scaled-mm-g4-probe.mddocs/evidence/2026-06-24-deepseek-v4-flash-e8m0-upcast-g4.mddocs/evidence/2026-06-24-deepseek-v4-flash-o-proj-g4.mddocs/evidence/2026-06-24-deepseek-v4-flash-o-proj-group-rhs-g4.mddocs/evidence/2026-06-24-deepseek-v4-flash-o-proj-rhs-scale-g4.mddocs/evidence/2026-06-24-deepseek-v4-flash-provider-stack-g4.mddocs/evidence/2026-06-24-deepseek-v4-flash-published-recipe-gce-admission.mddocs/evidence/2026-06-24-deepseek-v4-flash-nvfp4-g4-probe.mddocs/evidence/2026-06-24-deepseek-v4-flash-nvfp4-sm120-g4-probe.mddocs/evidence/2026-06-24-deepseek-v4-flash-nvfp4-private-egress-g4.mddocs/evidence/2026-06-24-deepseek-v4-flash-nvfp4-no-xet-g4.mddocs/evidence/2026-06-24-deepseek-v4-flash-nvfp4-triton-g4.mddocs/evidence/2026-06-24-deepseek-v4-flash-nvfp4-oproj-g4.mddocs/evidence/2026-06-24-flashinfer-trtllm-nvfp4-moe-g4.mddocs/evidence/2026-06-24-flashinfer-b12x-moe-g4.mddocs/evidence/2026-06-24-deepseek-v4-flash-b12x-wide-g4.mddocs/evidence/2026-06-24-deepseek-v4-flash-clamp-backends-wide-g4.mddocs/evidence/2026-06-24-deepseek-v4-flash-oproj-fallback-wide-g4.mddocs/evidence/2026-06-24-flashinfer-trtllm-nvfp4-moe-full-shape-g4.mddocs/evidence/2026-06-24-flashinfer-b12x-clamp-ep-g4.mddocs/evidence/2026-06-24-flashinfer-b12x-local-shard-remap-g4.mddocs/evidence/2026-06-24-deepseek-b12x-local-shard-reference-fixture.mddocs/evidence/2026-06-24-flashinfer-b12x-wrapper-surface-g4.mddocs/evidence/2026-06-24-flashinfer-b12x-nightly-wrapper-g4.mddocs/evidence/2026-06-24-deepseek-b12x-local-dispatcher-shim.mddocs/evidence/2026-06-24-flashinfer-b12x-masked-dispatch-g4.mddocs/evidence/2026-06-24-deepseek-b12x-clamp-patch-points.mddocs/evidence/2026-06-24-deepseek-b12x-clamp-overlay.mddocs/evidence/2026-06-24-deepseek-b12x-static-clamp-g4.mddocs/evidence/2026-06-24-deepseek-b12x-dynamic-clamp-g4.mddocs/evidence/2026-06-24-deepseek-b12x-full-model-g4.mddocs/evidence/2026-06-24-deepseek-b12x-eager-mla-g4.mddocs/evidence/2026-06-24-deepseek-v4-sparse-mla-full-g4.mddocs/evidence/2026-06-24-deepseek-v4-issue60-g4-mvp-smoke.mddocs/evidence/2026-06-24-deepseek-v4-vector-gather-g4-timing.mddocs/evidence/2026-06-24-deepseek-v4-khala-readiness-g4-gate.mddocs/evidence/2026-06-24-deepseek-v4-fable-adapter-compatibility.mddocs/evidence/2026-06-24-deepseek-v4-fable-load-canary.mddocs/evidence/2026-06-24-deepseek-v4-fable-authorized-security-policy.mddocs/evidence/2026-06-24-deepseek-v4-fable-lab-eval-decision.mddocs/evidence/2026-06-24-deepseek-v4-fable-retarget-plan.mddocs/evidence/2026-06-24-deepseek-v4-fable-o-proj-ownership.mddocs/evidence/2026-06-24-deepseek-v4-fable-transform-smoke.mddocs/evidence/2026-06-24-deepseek-v4-fable-context-map.mddocs/evidence/2026-06-24-deepseek-v4-fable-indexer-loader-proof.mddocs/evidence/2026-06-24-deepseek-v4-fable-packed-delta.mddocs/evidence/2026-06-24-deepseek-v4-fable-upstream-payload.mddocs/evidence/2026-06-24-deepseek-v4-fable-merged-g4-preflight.mddocs/evidence/2026-06-24-deepseek-v4-fable-merged-staging.mddocs/evidence/2026-06-24-deepseek-v4-fable-merged-staging-manifest.tsvdocs/evidence/2026-06-24-deepseek-v4-fable-merged-canary.mddocs/evidence/2026-06-24-deepseek-v4-fable-google-g4-final.mddocs/evidence/2026-06-24-deepseek-flashmla-sparse-audit.mddocs/evidence/2026-06-24-deepseek-g4-gcloud-auth-preflight.mddocs/evidence/2026-06-24-deepseek-flashinfer-dsv4-g4-wrapper.mddocs/evidence/2026-06-24-deepseek-tailnet-executor-check.mddocs/evidence/2026-06-24-deepseek-gcloud-account-override.mddocs/evidence/2026-06-24-deepseek-g4-iam-preflight.mddocs/evidence/2026-06-24-deepseek-g4-iam-grant-helper.mddocs/evidence/2026-06-24-deepseek-g4-grant-authority-preflight.mddocs/evidence/2026-06-24-deepseek-gcloud-credential-authority-probe.mddocs/evidence/2026-06-24-deepseek-google-alt-credential-probe.mddocs/evidence/2026-06-24-deepseek-service-account-key-probe.mdprofiles/glm-5.2-fp8-sglang.jsonprofiles/deepseek-v4-flash-gce-preflight.json
hydralisk/
hydralisk/
serve/
engines/
models/
dynamo/
receipts/
bench/
evals/
deploy/
gce/
gke/
containers/
docs/
- Do not commit secrets, raw prompts, private source, model credentials, or hidden reasoning traces.
- Do not commit model weights, checkpoints, compiled engines, benchmark output, or large generated artifacts.
- Keep engine versions, model revisions, container images, GPU shape, CUDA runtime, parser behavior, and quantization mode explicit in receipts.
- Fail closed when a model profile, engine pin, GPU admission check, quantization eval, or public-safe receipt path is missing.