cld2labs/sglang-gpt-oss#113

Open
arpannookala-12 wants to merge 20 commits into opea-project:main from cld2labs:cld2labs/sglang-gpt-oss

@arpannookala-12

Contributor

Summary

  • Adds SGLang Helm chart (core/helm-charts/sglang/) for deploying gpt-oss-20b on Intel Xeon CPU, including image-build scripts, Dockerfile, and model-patch overlays for MXFP4/FP32/MoE/dequant paths
  • Adds scripts/bootstrap-k3s.sh helper to bring up an Enterprise Inference (EI)-shaped cluster where generate-token.sh works out of the box
  • Adds model-deployment card + Helm-based deployment guide for gpt-oss-20b under third_party/Dell/model-deployment/gpt-oss-20b/
  • Adds shared third_party/Dell/model-deployment/sglang-troubleshooting.md
  • Chart auto-detects nerdctl (kubeadm) vs k3s container runtimes and pulls BuildKit from the upstream GitHub release on demand

Standalone Helm chart at core/helm-charts/sglang/ that deploys
lmsysorg/sglang:v0.5.11-xeon serving openai/gpt-oss-20b on a Xeon CPU node.

Follows the same standalone pattern as core/helm-charts/ovms (no Ansible
playbook wiring): a single helm install/upgrade command brings up the
server. Mirrors the OVMS chart's OIDC + APISIX + nginx ingress topology so
it slots into the existing auth-apisix stack when those are enabled,
and can be deployed bare for smoke tests by disabling them.

Defaults: PVC-backed HuggingFace cache (80Gi) so weights survive pod
restarts, /dev/shm sized for CPU IPC, OpenAI-compatible API on port 30000,
liveness/readiness on /health.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
…eon CPU

Adds eight cumulative patches on top of lmsysorg/sglang:v0.5.12-xeon:
- fix1: sgl-kernel rebuild with -mavx512bf16 / -mamx-bf16 / -mamx-tile.
  The published binary contains no AVX-512 BF16 instructions, causing
  `tinygemm_kernel_nn: scalar path not implemented!` on the first bf16
  forward pass. Genuine upstream bug.
- fix2: register mxfp4 for CPU + extend GptOss attention-backend
  allowlist to include intel_amx / torch_native.
- fix3: guard hardcoded .cuda() calls in gpt_oss.py weight loaders so
  CPU-only torch doesn't abort.
- fix4: add `_process_weights_for_cpu` + `forward_cpu` to Mxfp4MoEMethod
  so MXFP4 weights are dequantized to bf16 and the MoE forward routes
  through CPU instead of triton_kernels.
- fix5b: add sinks-attention forward (the gpt-oss-specific scalar added
  to softmax denominator) to torch_native_backend via an _sdpa_with_sinks
  wrapper.
- fix6: route Mxfp4MoEMethod.apply through forward_cpu on CPU so the
  CPU path is actually reached from FusedMoE.run_moe_core.
- fix7: self-contained MXFP4 dequantizer with MXFP4_NIBBLE_ORDER=low_first
  (gpt-oss's actual packing). Fixes random-vocab output that fix6
  produced due to wrong nibble order.
- fix8: delegate forward_cpu to moe_forward_native, which already handles
  gpt-oss's swiglu_gpt_oss_sigmoid_alpha + W13/W2 biases. Produces
  coherent output.
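
The sinks term in fix5b can be illustrated in isolation. The following is a minimal sketch in plain Python, not the `_sdpa_with_sinks` wrapper from the patch. It assumes the sink is a single extra logit per head that enters the softmax denominator but has no value vector of its own; the function name `softmax_with_sink` is hypothetical.

```python
import math


def softmax_with_sink(scores, sink):
    """Softmax over `scores` with one extra sink logit in the denominator.

    Assumption: the sink only absorbs probability mass; it produces no
    output weight of its own, so the returned weights sum to less than 1.
    """
    # Subtract the running maximum (sink included) for numerical stability.
    m = max(max(scores), sink)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)
    return [e / denom for e in exps]


if __name__ == "__main__":
    print(softmax_with_sink([0.0, 0.0], 0.0))
```

Passing `float("-inf")` as the sink makes its term vanish, which recovers an ordinary softmax.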

Chart now serves:
- Qwen2.5-7B end-to-end on Xeon with fix1 alone.
- openai/gpt-oss-20b end-to-end on Xeon with the full fix1..fix8 stack
  (short-form coherent; long-form degrades into repetition due to
  accumulating numerical error in the pure-Python CPU MoE path).

Build artifacts:
- core/helm-charts/sglang/image-build/Dockerfile + 7 anchored patch
  scripts. `build-and-import.sh` installs docker, builds the image,
  and imports it into k3s containerd.
- scripts/bootstrap-k3s.sh installs a single-node k3s + helm + kubectl
  on a fresh Ubuntu host for chart smoke-testing.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
…xRoute templating

Image-build (fix9-fix11, all behind env-var or flag gates):
- enable-fp32-override-debug.py:  allow --dtype float32 with mxfp4 models
                                  via ALLOW_FP32_MXFP4=1
- enable-dequant-dtype-debug.py:  make MXFP4 dequant output dtype
                                  env-controlled via MXFP4_OUT_DTYPE
- enable-fp32-moe-promotion-debug.py: promote per-expert moe_forward_native
                                  intermediates to fp32 via FP32_PROMOTE_MOE=1
- enable-fp32-kv-cache-debug.py:  patch sglang's --kv-cache-dtype allowlist,
                                  configure_kv_cache_dtype mapping, and
                                  torch_native_backend dtype-mismatch handler
                                  so fp32 KV cache flows end-to-end

Tag bumped to v0.5.12-xeon-fix11-debug.

Chart:
- values.yaml: default image is now the patched build; MXFP4_NIBBLE_ORDER=low_first
               baked into extraEnv (required for correct MXFP4 weight decode)
- gpt-oss-20b-values.yaml: canonical helm-upgrade override for this model
- templates/apisixroute.yaml: ingressClassName field templated
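
Why `MXFP4_NIBBLE_ORDER` matters can be shown with a small sketch. This is not the dequantizer shipped in the image; it assumes the usual MXFP4 layout (two 4-bit E2M1 codes per byte, one shared power-of-two E8M0 scale per block with bias 127), and the names `unpack_nibbles` and `dequant_block` are hypothetical.

```python
# E2M1 lookup table: index is the 4-bit code, high bit is the sign.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]


def unpack_nibbles(packed: bytes, order: str = "low_first") -> list[int]:
    """Split each byte into two 4-bit codes in the requested order."""
    if order not in ("low_first", "high_first"):
        raise ValueError(f"unknown nibble order: {order}")
    codes = []
    for byte in packed:
        lo, hi = byte & 0x0F, byte >> 4
        codes.extend((lo, hi) if order == "low_first" else (hi, lo))
    return codes


def dequant_block(packed: bytes, scale_e8m0: int,
                  order: str = "low_first") -> list[float]:
    """Decode one block: element value times the shared 2**(e - 127) scale."""
    scale = 2.0 ** (scale_e8m0 - 127)
    return [FP4_E2M1[c] * scale for c in unpack_nibbles(packed, order)]


if __name__ == "__main__":
    # Same byte, two orders: adjacent weights swap places.
    print(dequant_block(bytes([0x21]), 127, "low_first"))
    print(dequant_block(bytes([0x21]), 127, "high_first"))
```

With the wrong order every pair of adjacent weights is swapped, which is consistent with the random-vocab output described for fix6.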

All debug patches are no-ops unless the corresponding env var or flag
is set; default chart behavior is byte-identical to upstream for the
unpatched code paths.
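
The env-var gating can be pictured as follows. This is a sketch of the pattern only, not the patch code: the variable names come from the list above, while the function name `debug_knobs`, the dictionary keys, and the dtype strings it accepts are assumptions made for illustration.

```python
import os


def debug_knobs(env=None):
    """Read the debug env vars; every knob is off unless explicitly set.

    Assumption: "1" enables a flag, and the dequant output dtype falls back
    to bf16 (the dtype fix4 dequantizes to) when MXFP4_OUT_DTYPE is unset.
    """
    env = os.environ if env is None else env
    return {
        "allow_fp32_mxfp4": env.get("ALLOW_FP32_MXFP4") == "1",
        "mxfp4_out_dtype": env.get("MXFP4_OUT_DTYPE", "bfloat16"),
        "fp32_promote_moe": env.get("FP32_PROMOTE_MOE") == "1",
    }


if __name__ == "__main__":
    print(debug_knobs({}))                           # all defaults
    print(debug_knobs({"FP32_PROMOTE_MOE": "1"}))    # one knob enabled
```

Taking the environment as a parameter keeps the defaults easy to check without touching the real process environment.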

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
…only notes

Rewrite the chart README to match the conventions of
core/scripts/vllm-quickstart/README.md — emoji section headers,
configuration tables, troubleshooting matrix, project-structure tree.

The new README covers:
- Build the patched image (image-build/build-and-import.sh)
- Deploy on a stock OPEA cluster (single helm upgrade with
  gpt-oss-20b-values.yaml)
- Smoke-test and auth-routed inference curls
- Configuration tables for chart values and debug env vars
- What each of the 11 patches does and why
- Known limitations (long-form drift, throughput, no-TP)
- Troubleshooting matrix
- From-scratch single-node bootstrap appendix (k3s, nginx, Keycloak,
  APISIX, TLS) for setups without the OPEA Ansible playbooks

Add .gitignore for two local-only working notes that should not be
shared (REMAINING_WORK.md, UPSTREAM_BUG_REPORT.md).

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
Lead the README with the framework: SGLang on Xeon CPU, default model
Qwen3-8B, any HF model SGLang supports works. Move gpt-oss-20b content
into a single "Noteworthy" section that explains why the model is the
driver of the patch stack and links to the full deployment recipe under
third_party/Dell/model-deployment/gpt-oss-20b/.

What's Patched table now annotates each patch with its scope (all bf16
models / MXFP4 only / gpt-oss specific / debug knob) so it's clear which
patches actually apply to a given deployment.

Troubleshooting moved out to a symptom-indexed sibling doc at
third_party/Dell/model-deployment/sglang-troubleshooting.md; the README
links to it.

values.yaml: tighten the comment on MXFP4_NIBBLE_ORDER so it reads as a
chart default that is a no-op for non-MXFP4 models, not a gpt-oss-only
override.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
…ng troubleshooting

Add deployment recipe for openai/gpt-oss-20b on the SGLang chart in the
same shape as the llama-3.1-8b-instruct card on cld2labs/llama-3.1-8b-instruct:

  third_party/Dell/model-deployment/gpt-oss-20b/
  ├── model-card.md   — model metadata, license (Apache 2.0), intended
  │                     use, limitations
  └── deployment.md   — step-by-step Keycloak token, image build, helm
                        install, verify, test, undeploy, parameter table

Add sibling troubleshooting doc covering issues specific to SGLang
deployments (Gateway Timeout 504, content:null with Harmony format,
MXFP4 quantization gate errors, scalar-path crashes, nibble-order
gibberish, long-form drift, APISIX issuer-claim mismatches).

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
…odelName explicitly

The Qwen3-8B chart default was a leftover from when this branch had
concluded gpt-oss-CPU was impossible. After fix1-fix8 made gpt-oss work,
the default was never revisited.

Make the chart fully opinion-free on model selection:
- values.yaml: modelSource/modelName both default to "" with a comment
  pointing at the canonical values file pattern
- templates/deployment.yaml: fail loudly at render time if either is
  unset, with an error message pointing to gpt-oss-20b-values.yaml as
  a working example

Strip the section-header emojis from the README at the same time.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
…deployment dir

The chart at core/helm-charts/sglang/ is model-agnostic; a
gpt-oss-20b-specific values file living inside it was inconsistent with
that framing and with the llama precedent on cld2labs/llama-3.1-8b-instruct
(generic chart + per-model recipes living elsewhere).

Move via git mv so the rename is preserved in history:
  core/helm-charts/sglang/gpt-oss-20b-values.yaml
    -> third_party/Dell/model-deployment/gpt-oss-20b/values.yaml

The model-deployment/gpt-oss-20b/ directory now holds the complete
per-model recipe in one place:
  - model-card.md   (metadata, license, intended use, limitations)
  - deployment.md   (step-by-step deploy guide)
  - values.yaml     (canonical chart overrides)

Update all references: README install example, project-structure tree,
the deployment.md helm command + parameter table, and the
deployment.yaml fail message.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
Found during end-to-end revalidation by following the deployment guide
verbatim against a fresh redeploy.

Step 1 (Prerequisites): generate-token.sh hits https://${BASE_URL}/token
and assumes that hostname resolves on port 443 with a real TLS cert. That
works on production OPEA clusters on Dell hardware but silently returns an
empty TOKEN on a single-node k3s lab where api.example.com isn't in DNS
and nginx is on a NodePort. Add a callout pointing lab/single-node users
at the cluster-internal token-fetch recipe in sglang-troubleshooting.md
issue #7.

Step 4 (Verify the Deployment): the expected-output block was copied
from the llama deployment.md and showed keycloak-0 / keycloak-postgresql-0
(StatefulSet + Postgres backend). On lab installs Keycloak is often a
single Deployment pod with H2 embedded, and APISIX/nginx pod names also
depend on how those components were rolled out. Generalize the block so
the sglang pod is the only thing called out, with a note that other
component pod names depend on the cluster's deployment shape.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
…ags, rewrite Step 1, repair appendix

Match the cld2labs/llama-3.1-8b-instruct precedent and drop the
per-model values.yaml: there is no longer a values file sitting next to
deployment.md. All gpt-oss-specific runtime flags (Harmony parsers,
torch_native CPU attention backend) come through as --set overrides on
the helm install command directly.

Deletions:
  - third_party/Dell/model-deployment/gpt-oss-20b/values.yaml
  - third_party/Dell/model-deployment/README.md  (placeholder)

deployment.md:
- Step 1 rewritten to be context-free with two explicit paths:
    Path A — production OPEA cluster (generate-token.sh)
    Path B — single-node lab (cluster-internal token one-liner)
  Both paths declare the same four exports (BASE_URL,
  KEYCLOAK_CLIENT_ID, KEYCLOAK_CLIENT_SECRET, TOKEN) so later steps are
  shell-state-portable across the two paths.
- Step 3 helm install is now the canonical recipe — no --values flag,
  all model-specific knobs as --set, including
  'server.extraArgs={--attention-backend,torch_native,--reasoning-parser,gpt-oss,--tool-call-parser,gpt-oss}'.
- Step 4 includes the upfront kubectl patch to bump the ApisixRoute
  60s default timeout (otherwise inference past ~240 tokens returns 504).
- Step 5 adds a callout for --resolve when running against a lab
  NodePort instead of a real DNS hostname.
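
The shape of the `--set`-only install can be sketched by assembling the argument list in Python. This is illustrative, not the command from deployment.md: the release name is a hypothetical placeholder, and the required `modelSource`/`modelName` overrides are left out because their values are not shown here.

```python
def helm_install_args(release, base_url, extra_args):
    """Build a helm argv that passes every model knob as --set (no --values).

    Assumption: `release` is whatever release name the cluster uses; a real
    install must also set modelSource and modelName or the chart fails to render.
    """
    return [
        "helm", "upgrade", "--install", release, "core/helm-charts/sglang",
        "--set", f"ingress.secretName={base_url}",
        # Helm's {a,b,c} syntax turns the value into a list.
        "--set", "server.extraArgs={" + ",".join(extra_args) + "}",
    ]


if __name__ == "__main__":
    args = helm_install_args(
        "sglang",  # hypothetical release name
        "api.example.com",
        ["--attention-backend", "torch_native",
         "--reasoning-parser", "gpt-oss",
         "--tool-call-parser", "gpt-oss"],
    )
    print(" ".join(args))
```

When typing the command in a shell, the `server.extraArgs={...}` value needs quoting so the braces are not expanded.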

core/helm-charts/sglang/README.md:
- Drop the model-values column from the model-specific recipes table.
- Appendix A.3: replace <your-client-secret> with the lab-default
  secret that matches what the chart and deployment.md actually consume,
  and add a verification curl that round-trips the client_credentials
  grant before moving on.
- Appendix A.4: expand from a vague "also needs GatewayProxy and
  IngressClass parameters" note into the actual commands. The APISIX v2
  ingress controller silently drops every ApisixRoute without these,
  which was the largest gap in the prior appendix.
- Appendix A.5: fix the TLS secret namespace (was 'default', actually
  needs to be 'auth-apisix' to match where the chart-rendered Ingress
  lives) and the secret name (was 'api-example-com-tls', actually needs
  to equal the BASE_URL because the chart passes
  --set ingress.secretName=${BASE_URL}).
- Add Appendix A.6 documenting the /etc/hosts vs --resolve trade-off for
  lab clusters where api.example.com isn't in real DNS.

This pass was driven by an honest audit of the cluster vs the appendix.
Phase 2 — actually rebuild from a true blank slate following these
fixed docs to validate them end-to-end — comes next.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
Found during the Phase 2 from-scratch validation: chart values.yaml
default image.tag was still v0.5.12-xeon-fix10-debug but
build-and-import.sh imports v0.5.12-xeon-fix11-debug. A fresh user
running `helm install ./core/helm-charts/sglang ...` after the build
script hit ImagePullBackOff: kubelet did not find the fix10 tag in the
local image store, tried to pull it from docker.io, and got
"pull access denied".

Bump tag and tighten the surrounding comment to say:
- fix1 is generic (benefits any bf16 model)
- fix2..fix8 are gpt-oss specific and runtime no-ops for others
- fix9..fix11 are debug knobs
- the tag MUST match build-and-import.sh's IMAGE_TAG verbatim because
  the locally-imported image is the only place this tag exists
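
A check like the one below could catch this drift before a deploy. It is a sketch under assumptions: that values.yaml carries a plain `tag:` line and that build-and-import.sh assigns `IMAGE_TAG=` on its own line. The function names are hypothetical and neither file's real layout is shown here.

```python
import re


def read_tags(values_yaml_text: str, build_script_text: str):
    """Extract the chart image tag and the build script's IMAGE_TAG."""
    chart = re.search(r'^\s*tag:\s*"?([^"\s]+)"?\s*$', values_yaml_text, re.M)
    built = re.search(r'^\s*IMAGE_TAG="?([^"\s]+)"?\s*$', build_script_text, re.M)
    if not chart or not built:
        raise ValueError("could not find an image tag in one of the inputs")
    return chart.group(1), built.group(1)


def tags_match(values_yaml_text: str, build_script_text: str) -> bool:
    """True only when both tags are identical, character for character."""
    chart_tag, built_tag = read_tags(values_yaml_text, build_script_text)
    return chart_tag == built_tag


if __name__ == "__main__":
    values = "image:\n  tag: v0.5.12-xeon-fix10-debug\n"
    script = 'IMAGE_TAG="v0.5.12-xeon-fix11-debug"\n'
    print(tags_match(values, script))  # the mismatch described above
```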

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
…a cluster where generate-token.sh works

The previous Step 1 split into "Path A — production OPEA" and "Path B —
single-node lab" was a framing I invented; the rest of the repo just
talks about "the Enterprise Inference stack" and a "Single Node
Deployment". The dual-path layout also implied that lab users had a
different runtime to learn, which they don't — Intel testers will go
through generate-token.sh and the docs need to make that path land
end-to-end on whatever cluster the appendix produces.

deployment.md:
- Step 1 collapses to the llama-3.1-8b-instruct precedent: a single
  "Ensure the EI stack is deployed" paragraph and `source
  generate-token.sh`. No paths. No alternative one-liner. The
  troubleshooting doc now owns the cluster-internal recovery recipe.
- Reword the BASE_URL note in Step 5 to talk about what generate-token.sh
  actually exports instead of "lab clusters where ...".

core/helm-charts/sglang/README.md:
- A.3: pin Keycloak's issuer with KC_HOSTNAME=http://keycloak.default.svc.cluster.local
  (+ KC_HOSTNAME_STRICT=false, KC_HOSTNAME_BACKCHANNEL_DYNAMIC=false).
  Without this, tokens issued via the edge route in A.7 carry
  iss=https://api.example.com:30443/... and APISIX's bearer_only check
  against the cluster-internal discovery URL returns 401. Switched
  KC_PROXY (deprecated in v26+) to KC_PROXY_HEADERS=xforwarded.
- A.5: create the TLS secret in both auth-apisix (for the chart's
  Ingress) and default (for the Keycloak-edge Ingresses in A.7).
- A.6: simplified — `/etc/hosts` only, plus a note that BASE_URL needs
  the :30443 NodePort since nginx isn't on :443 in the appendix.
- A.7 (new): two nginx Ingresses that publish Keycloak under
  api.example.com:30443 — pass-through for /realms and /admin, rewrite
  for /token to Keycloak's openid-connect token endpoint. Without these
  generate-token.sh can't reach Keycloak's admin REST API to fetch the
  client secret. A.7 ends with a verification curl that round-trips the
  client_credentials grant against api.example.com:30443/token.
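
The client_credentials round trip that the verification curl performs can be sketched with the standard library. This only builds the request and sends nothing; the function name and the placeholder secret are hypothetical, and the form fields are the standard OAuth 2.0 ones rather than anything copied from generate-token.sh.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(base_url: str, client_id: str, client_secret: str) -> Request:
    """Build a POST to https://<base_url>/token for the client_credentials grant."""
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode()
    return Request(
        f"https://{base_url}/token",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_token_request("api.example.com:30443", "my-client-id", "example-secret")
    print(req.get_method(), req.full_url)
    # Sending it (urllib.request.urlopen) should return JSON with an
    # access_token field if the edge route and issuer are set up correctly.
```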

troubleshooting.md:
- Rewrite issue #7 from "fetch the token via kubectl run" (Path B
  workaround) to the actual root cause (missing KC_HOSTNAME) and the
  permanent fix.

Phase 3 commits the docs; the from-scratch rebuild that exercises them
happens next.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
…I-shape cluster

Make the framing explicit so an Intel tester knows immediately whether
they need the appendix at all:

- On an OPEA-Ansible-deployed cluster, the appendix is unnecessary. Go
  straight to "Build the Image". The same deployment guide applies.
- For users without Ansible bootstrap, the appendix produces the same
  component shape (Keycloak with KC_HOSTNAME pinned, edge routes for
  /realms/admin/token, OIDC client `my-client-id`, TLS secret in both
  consuming namespaces, APISIX GatewayProxy wiring). After it runs,
  generate-token.sh and the deploy work the same way as on an OPEA
  cluster.

This is the same logical scope the previous "From-Scratch Bootstrap"
header implied, but the previous prose left it ambiguous whether
production OPEA users would land here too.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
…eadm) vs k3s

The build script previously assumed k3s and ran:
  docker build  →  docker save  |  k3s ctr images import -

That fails on OPEA-deployed (kubeadm + containerd) clusters where
neither docker nor k3s is present, but nerdctl is.

Detect the runtime and pick the right path:
- nerdctl present: nerdctl --namespace k8s.io build  (single step;
  containerd's image store IS where kubelet pulls from)
- k3s present: keep the existing docker → k3s ctr import path
- neither: hard fail with a helpful message
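
The detection order above can be sketched as a small function. The real logic lives in the shell script build-and-import.sh; this Python version only mirrors the decision, and the function name, return labels, and error text are hypothetical.

```python
import shutil


def pick_build_path(which=shutil.which):
    """Return which build path to take, preferring nerdctl over k3s.

    `which` is injectable so the decision can be exercised without the
    binaries actually being installed.
    """
    if which("nerdctl"):
        # kubeadm + containerd: build straight into the k8s.io namespace.
        return "nerdctl"
    if which("k3s"):
        # k3s: docker build, then import into k3s containerd.
        return "k3s"
    raise RuntimeError(
        "neither nerdctl nor k3s found; install one of them before building"
    )


if __name__ == "__main__":
    try:
        print(pick_build_path())
    except RuntimeError as err:
        print(f"error: {err}")
```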

Caught when validating Path A against a real inference-stack-deploy.sh
cluster (kubeadm 1.31, containerd 1.7.24).

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
…ctl path

nerdctl needs buildkitd to satisfy `nerdctl build`. OPEA-deployed
clusters (kubeadm + containerd) ship nerdctl but not buildkit, so the
first `nerdctl build` invocation errors out with "buildctl needs to be
installed".

Make build-and-import.sh install + start buildkit on the fly if it's
missing — same one-shot ergonomic pattern the k3s branch already uses
for docker.io. Falls back to a background buildkitd if the system
doesn't have a `buildkit` systemd unit.

Caught when validating Path A against a real inference-stack-deploy.sh
cluster.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
Ubuntu 22.04 doesn't ship a `buildkit` apt package — `apt install buildkit`
errors with "Unable to locate package".

Fetch buildkit (~30 MB) from moby/buildkit GitHub releases and install
/usr/local/bin/buildctl + /usr/local/bin/buildkitd directly. Default
pinned version v0.18.1; override with BUILDKIT_VERSION env var.

Also tighten the buildkitd-startup poll: wait up to 10 s for the
unix socket, hard-fail with a pointer to the log file if it never
appears (better than the previous silent continue).
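
The bounded startup poll could look like the sketch below. The script itself is shell; this Python version only shows the wait-then-hard-fail pattern, and the function name, default interval, and log path argument are assumptions.

```python
import time
from pathlib import Path


def wait_for_socket(path, timeout=10.0, interval=0.2, log_path="buildkitd.log"):
    """Block until `path` is a unix socket, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if Path(path).is_socket():
            return
        time.sleep(interval)
    # Fail loudly and point at the log instead of continuing silently.
    raise TimeoutError(
        f"{path} did not appear within {timeout:.0f}s; see {log_path}"
    )


if __name__ == "__main__":
    try:
        wait_for_socket("/tmp/does-not-exist.sock", timeout=1.0)
    except TimeoutError as err:
        print(f"error: {err}")
```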

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
…yment.md

The build script already auto-detects nerdctl (kubeadm/containerd) vs
k3s, but the docs around it still said "imports into k3s containerd"
and gave a `k3s ctr images ls` as the only verify command. On a real
OPEA cluster (kubeadm + containerd, no k3s binary) that command errors
and there's no signal that a different runtime is supported.

Rephrase both the chart README and the gpt-oss deployment guide:
- Replace "k3s containerd" with "local containerd image store" /
  generic phrasing in the prose.
- Replace the single k3s verify line with a dual block that shows the
  nerdctl form (kubeadm) and the k3s ctr form, with a one-line
  explanation of which to use.
- Prerequisites: list both kubeadm/containerd and k3s as validated
  targets instead of saying k3s.
- Project-structure tree comment for build-and-import.sh: "local
  containerd (kubeadm or k3s)" instead of "k3s containerd".

The appendix (which explicitly bootstraps k3s as a convenience) stays
k3s-specific; that's a deliberate choice for the self-bootstrap path,
not a claim about how the chart runs.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
@arpannookala-12 force-pushed the cld2labs/sglang-gpt-oss branch from 700aae9 to 1171136 on June 2, 2026 15:25
…) and Scenario 2 (k3s bootstrap)

Add a decision table at the top so readers go directly to the right
path. Elevate the former appendix to a first-class Scenario 2 section
(S2.1–S2.7) with consistent step numbering. Mark the convergence point
at "Build the Image" explicitly so both paths meet cleanly.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
…s-20b deployment.md

The kubectl patch apisixroute block was under Step 4 (Verify), but
the timeout only surfaces when actually testing inference. Move the
callout to Step 5 (Test), framed as a reaction to a 504 rather than
a pre-emptive action, and link to the full fix in sglang-troubleshooting.md.

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>