cld2labs/sglang-gpt-oss by arpannookala-12 · Pull Request #113 · opea-project/Enterprise-Inference

arpannookala-12 · 2026-06-02T15:21:08Z

Summary

- Adds SGLang Helm chart (core/helm-charts/sglang/) for deploying gpt-oss-20b on Intel Xeon CPU, including image-build scripts, Dockerfile, and model-patch overlays for MXFP4/FP32/MoE/dequant paths
- Adds scripts/bootstrap-k3s.sh helper to bring up an EI-shape cluster where generate-token.sh works out of the box
- Adds model-deployment card + Helm-based deployment guide for gpt-oss-20b under third_party/Dell/model-deployment/gpt-oss-20b/
- Adds shared third_party/Dell/model-deployment/sglang-troubleshooting.md
- The image-build script (build-and-import.sh) auto-detects nerdctl (kubeadm) vs k3s container runtimes and pulls BuildKit from the upstream GitHub release on demand

Standalone Helm chart at core/helm-charts/sglang/ that deploys lmsysorg/sglang:v0.5.12-xeon serving openai/gpt-oss-20b on a Xeon CPU node. Follows the same standalone pattern as core/helm-charts/ovms (no Ansible playbook wiring): a single helm install/upgrade command brings up the server. Mirrors the OVMS chart's OIDC + APISIX + nginx ingress topology so it slots into the existing auth-apisix stack when those are enabled, and can be deployed bare for smoke tests by disabling them.

Defaults:
- PVC-backed HuggingFace cache (80Gi) so weights survive pod restarts
- /dev/shm sized for CPU IPC
- OpenAI-compatible API on port 30000
- liveness/readiness on /health

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

…eon CPU

Adds eight cumulative patches on top of lmsysorg/sglang:v0.5.12-xeon:
- fix1: sgl-kernel rebuild with -mavx512bf16 / -mamx-bf16 / -mamx-tile. The published binary has 0 AVX-512 BF16 instructions, causing `tinygemm_kernel_nn: scalar path not implemented!` on the first bf16 forward pass. Genuine upstream bug.
- fix2: register mxfp4 for CPU + extend GptOss attention-backend allowlist to include intel_amx / torch_native.
- fix3: guard hardcoded .cuda() calls in gpt_oss.py weight loaders so CPU-only torch doesn't abort.
- fix4: add `_process_weights_for_cpu` + `forward_cpu` to Mxfp4MoEMethod so MXFP4 weights are dequantized to bf16 and the MoE forward routes through CPU instead of triton_kernels.
- fix5b: add sinks-attention forward (the gpt-oss-specific scalar added to softmax denominator) to torch_native_backend via an _sdpa_with_sinks wrapper.
- fix6: route Mxfp4MoEMethod.apply through forward_cpu on CPU so the CPU path is actually reached from FusedMoE.run_moe_core.
- fix7: self-contained MXFP4 dequantizer with MXFP4_NIBBLE_ORDER=low_first (gpt-oss's actual packing). Fixes random-vocab output that fix6 produced due to wrong nibble order.
- fix8: delegate forward_cpu to moe_forward_native, which already handles gpt-oss's swiglu_gpt_oss_sigmoid_alpha + W13/W2 biases. Produces coherent output.

Chart now serves:
- Qwen2.5-7B end-to-end on Xeon with fix1 alone.
- openai/gpt-oss-20b end-to-end on Xeon with the full fix1..fix8 stack (short-form coherent; long-form degrades into repetition due to accumulating numerical error in the pure-Python CPU MoE path).

Build artifacts:
- core/helm-charts/sglang/image-build/Dockerfile + 7 anchored patch scripts. `build-and-import.sh` installs docker, builds the image, and imports it into k3s containerd.
- scripts/bootstrap-k3s.sh installs a single-node k3s + helm + kubectl on a fresh Ubuntu host for chart smoke-testing.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
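To make the nibble-order failure mode concrete, here is a minimal pure-Python sketch of MXFP4 block dequantization. It is not the code from fix7: the function name `dequant_mxfp4_block` and the `high_first` value are hypothetical, and it assumes the standard MX layout (4-bit E2M1 elements, two per byte, sharing one E8M0 power-of-two scale with bias 127). Only `low_first` is taken from the commit message.

```python
# Illustrative sketch only; not the dequantizer shipped in fix7.
# FP4 (E2M1) code points, indexed by the 4-bit code; the top bit is the sign.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
              -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]


def dequant_mxfp4_block(packed, scale_e8m0, nibble_order="low_first"):
    """Decode packed FP4 bytes that share a single E8M0 scale.

    `nibble_order` mirrors the MXFP4_NIBBLE_ORDER knob; "high_first" is a
    hypothetical alternative used here only to show the failure mode.
    """
    if nibble_order not in ("low_first", "high_first"):
        raise ValueError(f"unknown nibble order: {nibble_order}")
    # E8M0 stores a biased exponent; the scale is a pure power of two.
    scale = 2.0 ** (scale_e8m0 - 127)
    out = []
    for byte in packed:
        lo, hi = byte & 0x0F, byte >> 4
        first, second = (lo, hi) if nibble_order == "low_first" else (hi, lo)
        out.append(FP4_VALUES[first] * scale)
        out.append(FP4_VALUES[second] * scale)
    return out


packed = bytes([0x21, 0x74])
print(dequant_mxfp4_block(packed, 127))                # [0.5, 1.0, 2.0, 6.0]
print(dequant_mxfp4_block(packed, 127, "high_first"))  # [1.0, 0.5, 6.0, 2.0]
```

Both orders yield values of plausible magnitude, so a wrong order raises no error; adjacent weights are simply swapped in pairs, which is consistent with the random-vocab output described for fix6.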

…xRoute templating

Image-build (fix9-fix11, all behind env-var or flag gates):
- enable-fp32-override-debug.py: allow --dtype float32 with mxfp4 models via ALLOW_FP32_MXFP4=1
- enable-dequant-dtype-debug.py: make MXFP4 dequant output dtype env-controlled via MXFP4_OUT_DTYPE
- enable-fp32-moe-promotion-debug.py: promote per-expert moe_forward_native intermediates to fp32 via FP32_PROMOTE_MOE=1
- enable-fp32-kv-cache-debug.py: patch sglang's --kv-cache-dtype allowlist, configure_kv_cache_dtype mapping, and torch_native_backend dtype-mismatch handler so fp32 KV cache flows end-to-end

Tag bumped to v0.5.12-xeon-fix11-debug.

Chart:
- values.yaml: default image is now the patched build; MXFP4_NIBBLE_ORDER=low_first baked into extraEnv (required for correct MXFP4 weight decode)
- gpt-oss-20b-values.yaml: canonical helm-upgrade override for this model
- templates/apisixroute.yaml: ingressClassName field templated

All debug patches are no-ops unless the corresponding env var or flag is set; default chart behavior is byte-identical to upstream for the unpatched code paths.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
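The "no-op unless set" gating described above could look roughly like the following sketch. The environment variable names come from the commit message; the helper `debug_knobs`, the dictionary keys, and the `bfloat16` fallback string are assumptions made for illustration, not the patched sglang code.

```python
import os


def debug_knobs(env=None):
    """Read the debug gates; every knob is off unless explicitly set.

    Hypothetical helper: only the variable names are taken from the
    commit message, the rest is illustrative.
    """
    env = os.environ if env is None else env
    return {
        # Allow --dtype float32 with mxfp4 models only when explicitly enabled.
        "allow_fp32_mxfp4": env.get("ALLOW_FP32_MXFP4") == "1",
        # Assumed fallback: fix4 dequantizes to bf16 when nothing is set.
        "mxfp4_out_dtype": env.get("MXFP4_OUT_DTYPE", "bfloat16"),
        # Promote per-expert MoE intermediates to fp32 only when enabled.
        "fp32_promote_moe": env.get("FP32_PROMOTE_MOE") == "1",
    }


print(debug_knobs({}))
print(debug_knobs({"FP32_PROMOTE_MOE": "1", "MXFP4_OUT_DTYPE": "float32"}))
```

Comparing against the literal string "1" rather than testing for mere presence keeps values such as "0" or an empty string from switching a debug path on by accident.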

…only notes

Rewrite the chart README to match the conventions of core/scripts/vllm-quickstart/README.md: emoji section headers, configuration tables, troubleshooting matrix, project-structure tree.

The new README covers:
- Build the patched image (image-build/build-and-import.sh)
- Deploy on a stock OPEA cluster (single helm upgrade with gpt-oss-20b-values.yaml)
- Smoke-test and auth-routed inference curls
- Configuration tables for chart values and debug env vars
- What each of the 11 patches does and why
- Known limitations (long-form drift, throughput, no-TP)
- Troubleshooting matrix
- From-scratch single-node bootstrap appendix (k3s, nginx, Keycloak, APISIX, TLS) for setups without the OPEA Ansible playbooks

Add .gitignore for two local-only working notes that should not be shared (REMAINING_WORK.md, UPSTREAM_BUG_REPORT.md).

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

Lead the README with the framework: SGLang on Xeon CPU, default model Qwen3-8B, any HF model SGLang supports works. Move gpt-oss-20b content into a single "Noteworthy" section that explains why the model is the driver of the patch stack and links to the full deployment recipe under third_party/Dell/model-deployment/gpt-oss-20b/.

What's Patched table now annotates each patch with its scope (all bf16 models / MXFP4 only / gpt-oss specific / debug knob) so it's clear which patches actually apply to a given deployment.

Troubleshooting moved out to a symptom-indexed sibling doc at third_party/Dell/model-deployment/sglang-troubleshooting.md; the README links to it.

values.yaml: tighten the comment on MXFP4_NIBBLE_ORDER so it reads as a chart default that is a no-op for non-MXFP4 models, not a gpt-oss-only override.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

…ng troubleshooting

Add deployment recipe for openai/gpt-oss-20b on the SGLang chart in the same shape as the llama-3.1-8b-instruct card on cld2labs/llama-3.1-8b-instruct:

third_party/Dell/model-deployment/gpt-oss-20b/
├── model-card.md: model metadata, license (Apache 2.0), intended use, limitations
└── deployment.md: step-by-step Keycloak token, image build, helm install, verify, test, undeploy, parameter table

Add sibling troubleshooting doc covering issues specific to SGLang deployments (Gateway Timeout 504, content:null with Harmony format, MXFP4 quantization gate errors, scalar-path crashes, nibble-order gibberish, long-form drift, APISIX issuer-claim mismatches).

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

…odelName explicitly

The Qwen3-8B chart default was a leftover from when this branch had concluded gpt-oss-CPU was impossible. After fix1-fix8 made gpt-oss work the default was never revisited.

Make the chart fully opinion-free on model selection:
- values.yaml: modelSource/modelName both default to "" with a comment pointing at the canonical values file pattern
- templates/deployment.yaml: fail loudly at render time if either is unset, with an error message pointing to gpt-oss-20b-values.yaml as a working example

Strip the section-header emojis from the README at the same time.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

…deployment dir

The chart at core/helm-charts/sglang/ is model-agnostic; a gpt-oss-20b-specific values file living inside it was inconsistent with that framing and with the llama precedent on cld2labs/llama-3.1-8b-instruct (generic chart + per-model recipes living elsewhere).

Move via git mv so the rename is preserved in history:

  core/helm-charts/sglang/gpt-oss-20b-values.yaml -> third_party/Dell/model-deployment/gpt-oss-20b/values.yaml

The model-deployment/gpt-oss-20b/ directory now holds the complete per-model recipe in one place:
- model-card.md (metadata, license, intended use, limitations)
- deployment.md (step-by-step deploy guide)
- values.yaml (canonical chart overrides)

Update all references: README install example, project-structure tree, the deployment.md helm command + parameter table, and the deployment.yaml fail message.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

Found during end-to-end revalidation by following the deployment guide verbatim against a fresh redeploy.

Step 1 (Prerequisites): generate-token.sh hits https://${BASE_URL}/token and assumes that hostname resolves on port 443 with a real TLS cert. That works on production OPEA clusters on Dell hardware but silently returns an empty TOKEN on a single-node k3s lab where api.example.com isn't in DNS and nginx is on a NodePort. Add a callout pointing lab/single-node users at the cluster-internal token-fetch recipe in sglang-troubleshooting.md issue #7.

Step 4 (Verify the Deployment): the expected-output block was copied from the llama deployment.md and showed keycloak-0 / keycloak-postgresql-0 (StatefulSet + Postgres backend). On lab installs Keycloak is often a single Deployment pod with H2 embedded, and APISIX/nginx pod names also depend on how those components were rolled out. Generalize the block so the sglang pod is the only thing called out, with a note that other component pod names depend on the cluster's deployment shape.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

…ags, rewrite Step 1, repair appendix

Match the cld2labs/llama-3.1-8b-instruct precedent and drop the per-model values.yaml: there is no longer a values file sitting next to deployment.md. All gpt-oss-specific runtime flags (Harmony parsers, torch_native CPU attention backend) come through as --set overrides on the helm install command directly.

Deletions:
- third_party/Dell/model-deployment/gpt-oss-20b/values.yaml
- third_party/Dell/model-deployment/README.md (placeholder)

deployment.md:
- Step 1 rewritten to be context-free with two explicit paths: Path A, production OPEA cluster (generate-token.sh), and Path B, single-node lab (cluster-internal token one-liner). Both paths declare the same four exports (BASE_URL, KEYCLOAK_CLIENT_ID, KEYCLOAK_CLIENT_SECRET, TOKEN) so later steps are shell-state-portable across the two paths.
- Step 3 helm install is now the canonical recipe: no --values flag, all model-specific knobs as --set, including 'server.extraArgs={--attention-backend,torch_native,--reasoning-parser,gpt-oss,--tool-call-parser,gpt-oss}'.
- Step 4 includes the upfront kubectl patch to bump the ApisixRoute 60s default timeout (otherwise inference past ~240 tokens 504s).
- Step 5 adds a callout for --resolve when running against a lab NodePort instead of a real DNS hostname.

core/helm-charts/sglang/README.md:
- Drop the model-values column from the model-specific recipes table.
- Appendix A.3: replace <your-client-secret> with the lab-default secret that matches what the chart and deployment.md actually consume, and add a verification curl that round-trips the client_credentials grant before moving on.
- Appendix A.4: expand from a vague "also needs GatewayProxy and IngressClass parameters" note into the actual commands. The APISIX v2 ingress controller silently drops every ApisixRoute without these, which was the largest gap in the prior appendix.
- Appendix A.5: fix the TLS secret namespace (was 'default', actually needs to be 'auth-apisix' to match where the chart-rendered Ingress lives) and the secret name (was 'api-example-com-tls', actually needs to equal the BASE_URL because the chart passes --set ingress.secretName=${BASE_URL}).
- Add Appendix A.6 documenting the /etc/hosts vs --resolve trade-off for lab clusters where api.example.com isn't in real DNS.

This pass was driven by an honest audit of the cluster vs the appendix. Phase 2 (actually rebuilding from a true blank slate following these fixed docs, to validate them end-to-end) comes next.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

Found during the Phase 2 from-scratch validation: chart values.yaml default image.tag was still v0.5.12-xeon-fix10-debug but build-and-import.sh imports v0.5.12-xeon-fix11-debug. A fresh user running `helm install ./core/helm-charts/sglang ...` after the build script hit ImagePullBackOff because kubelet treated the missing local fix10 tag as a remote pull request and got "pull access denied" from docker.io.

Bump tag and tighten the surrounding comment to say:
- fix1 is generic (benefits any bf16 model)
- fix2..fix8 are gpt-oss specific and runtime no-ops for others
- fix9..fix11 are debug knobs
- the tag MUST match build-and-import.sh's IMAGE_TAG verbatim because the locally-imported image is the only place this tag exists

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

…a cluster where generate-token.sh works

The previous Step 1 split into "Path A: production OPEA" and "Path B: single-node lab" was a framing I invented; the rest of the repo just talks about "the Enterprise Inference stack" and a "Single Node Deployment". The dual-path layout also implied that lab users had a different runtime to learn, which they don't. Intel testers will go through generate-token.sh, and the docs need to make that path land end-to-end on whatever cluster the appendix produces.

deployment.md:
- Step 1 collapses to the llama-3.1-8b-instruct precedent: a single "Ensure the EI stack is deployed" paragraph and `source generate-token.sh`. No paths. No alternative one-liner. The troubleshooting doc now owns the cluster-internal recovery recipe.
- Reword the BASE_URL note in Step 5 to talk about what generate-token.sh actually exports instead of "lab clusters where ...".

core/helm-charts/sglang/README.md:
- A.3: pin Keycloak's issuer with KC_HOSTNAME=http://keycloak.default.svc.cluster.local (+ KC_HOSTNAME_STRICT=false, KC_HOSTNAME_BACKCHANNEL_DYNAMIC=false). Without this, tokens issued via the edge route in A.7 carry iss=https://api.example.com:30443/... and APISIX's bearer_only check against the cluster-internal discovery URL returns 401. Switched KC_PROXY (deprecated in v26+) to KC_PROXY_HEADERS=xforwarded.
- A.5: create the TLS secret in both auth-apisix (for the chart's Ingress) and default (for the Keycloak-edge Ingresses in A.7).
- A.6: simplified to `/etc/hosts` only, plus a note that BASE_URL needs the :30443 NodePort since nginx isn't on :443 in the appendix.
- A.7 (new): two nginx Ingresses that publish Keycloak under api.example.com:30443, with pass-through for /realms and /admin and a rewrite for /token to Keycloak's openid-connect token endpoint. Without these generate-token.sh can't reach Keycloak's admin REST API to fetch the client secret. A.7 ends with a verification curl that round-trips the client_credentials grant against api.example.com:30443/token.

troubleshooting.md:
- Rewrite issue #7 from "fetch the token via kubectl run" (Path B workaround) to the actual root cause (missing KC_HOSTNAME) and the permanent fix.

Phase 3 commits the docs; the from-scratch rebuild that exercises them happens next.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

…I-shape cluster

Make the framing explicit so an Intel tester knows immediately whether they need the appendix at all:
- On an OPEA-Ansible-deployed cluster, the appendix is unnecessary. Go straight to "Build the Image". The same deployment guide applies.
- For users without Ansible bootstrap, the appendix produces the same component shape (Keycloak with KC_HOSTNAME pinned, edge routes for /realms/admin/token, OIDC client `my-client-id`, TLS secret in both consuming namespaces, APISIX GatewayProxy wiring). After it runs, generate-token.sh and the deploy work the same way as on an OPEA cluster.

This is the same logical scope the previous "From-Scratch Bootstrap" header implied, but the previous prose left it ambiguous whether production OPEA users would land here too.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

…eadm) vs k3s

The build script previously assumed k3s and ran:

  docker build → docker save | k3s ctr images import -

That fails on OPEA-deployed (kubeadm + containerd) clusters where neither docker nor k3s is present, but nerdctl is. Detect the runtime and pick the right path:
- nerdctl present: nerdctl --namespace k8s.io build (single step; containerd's image store IS where kubelet pulls from)
- k3s present: keep the existing docker → k3s ctr import path
- neither: hard fail with a helpful message

Caught when validating Path A against a real inference-stack-deploy.sh cluster (kubeadm 1.31, containerd 1.7.24).

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
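The detection order described above can be sketched as follows. The real logic lives in the build-and-import.sh shell script; this Python rendering is only an illustration, and the function name `pick_build_plan`, the injectable `which` parameter, the `-t <image> .` arguments, and the image reference in the demo are assumptions.

```python
import shutil


def pick_build_plan(image, which=shutil.which):
    """Return the shell commands to run for the detected container runtime."""
    if which("nerdctl"):
        # kubeadm + containerd: build straight into the k8s.io namespace,
        # which is the image store kubelet pulls from.
        return [f"nerdctl --namespace k8s.io build -t {image} ."]
    if which("k3s"):
        # k3s: build with docker, then import into k3s's containerd.
        return [
            f"docker build -t {image} .",
            f"docker save {image} | k3s ctr images import -",
        ]
    raise RuntimeError(
        "neither nerdctl nor k3s found on PATH; install one of them first"
    )


def only_k3s(name):
    # Stand-in for shutil.which on a host that has k3s but not nerdctl.
    return "/usr/local/bin/k3s" if name == "k3s" else None


# Hypothetical image reference; only the tag comes from the commit messages.
for cmd in pick_build_plan("example/sglang:v0.5.12-xeon-fix11-debug", which=only_k3s):
    print(cmd)
```

Checking nerdctl before k3s matches the order in the commit message, and passing the lookup in as a parameter lets each branch be exercised without either tool installed.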

…ctl path

nerdctl needs buildkitd to satisfy `nerdctl build`. OPEA-deployed clusters (kubeadm + containerd) ship nerdctl but not buildkit, so the first `nerdctl build` invocation errors out with "buildctl needs to be installed".

Make build-and-import.sh install + start buildkit on the fly if it's missing, following the same one-shot ergonomic pattern the k3s branch already uses for docker.io. Falls back to a background buildkitd if the system doesn't have a `buildkit` systemd unit.

Caught when validating Path A against a real inference-stack-deploy.sh cluster.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

Ubuntu 22.04 doesn't ship a `buildkit` apt package; `apt install buildkit` errors with "Unable to locate package". Fetch buildkit (~30 MB) from moby/buildkit GitHub releases and install /usr/local/bin/buildctl + /usr/local/bin/buildkitd directly. Default pinned version v0.18.1; override with BUILDKIT_VERSION env var.

Also tighten the buildkitd-startup poll: wait up to 10 s for the unix socket, hard-fail with a pointer to the log file if it never appears (better than the previous silent continue).

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
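The tightened startup poll amounts to a bounded wait with a hard failure. A minimal sketch of that pattern follows; the actual script is shell, and the function name `wait_for_socket`, the polling interval, and the log path `/tmp/buildkitd.log` are hypothetical. The socket path in the usage comment is buildkitd's usual rootful default and is likewise an assumption about this setup.

```python
import os
import time


def wait_for_socket(path, timeout_s=10.0, log_path="/tmp/buildkitd.log",
                    interval_s=0.2):
    """Block until `path` exists; raise with a pointer to the log on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return
        if time.monotonic() >= deadline:
            # Hard-fail instead of silently continuing into a broken build.
            raise TimeoutError(
                f"{path} did not appear within {timeout_s:g}s; see {log_path}"
            )
        time.sleep(interval_s)


# Typical call after starting buildkitd in the background (path assumed):
#   wait_for_socket("/run/buildkit/buildkitd.sock")
```

Checking for the socket before checking the deadline guarantees at least one probe even with a very small timeout.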

…yment.md

The build script already auto-detects nerdctl (kubeadm/containerd) vs k3s, but the docs around it still said "imports into k3s containerd" and gave a `k3s ctr images ls` as the only verify command. On a real OPEA cluster (kubeadm + containerd, no k3s binary) that command errors and there's no signal that a different runtime is supported.

Rephrase both the chart README and the gpt-oss deployment guide:
- Replace "k3s containerd" with "local containerd image store" / generic phrasing in the prose.
- Replace the single k3s verify line with a dual block that shows the nerdctl form (kubeadm) and the k3s ctr form, with a one-line explanation of which to use.
- Prerequisites: list both kubeadm/containerd and k3s as validated targets instead of saying k3s.
- Project-structure tree comment for build-and-import.sh: "local containerd (kubeadm or k3s)" instead of "k3s containerd".

The appendix (which explicitly bootstraps k3s as a convenience) stays k3s-specific; that's a deliberate choice for the self-bootstrap path, not a claim about how the chart runs.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

…) and Scenario 2 (k3s bootstrap)

Add a decision table at the top so readers go directly to the right path. Elevate the former appendix to a first-class Scenario 2 section (S2.1–S2.7) with consistent step numbering. Mark the convergence point at "Build the Image" explicitly so both paths meet cleanly.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

…s-20b deployment.md

The kubectl patch apisixroute block was under Step 4 (Verify), but the timeout only surfaces when actually testing inference. Move the callout to Step 5 (Test), framed as a reaction to a 504 rather than a pre-emptive action, and link to the full fix in sglang-troubleshooting.md.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

arpannookala-12 added 18 commits June 2, 2026 10:25

cld2labs/sglang-gpt-oss: drop local-only .gitignore from chart dir

1171136

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>

arpannookala-12 force-pushed the cld2labs/sglang-gpt-oss branch from 700aae9 to 1171136 Compare June 2, 2026 15:25

arpannookala-12 added 2 commits June 5, 2026 12:30

arpannookala-12 wants to merge 20 commits into opea-project:main from cld2labs:cld2labs/sglang-gpt-oss

arpannookala-12 commented Jun 2, 2026
