16 changes: 16 additions & 0 deletions .claude/CLAUDE.md
@@ -42,6 +42,7 @@ internal/ # Shared utilities not part of the public API
charts/topograph/ # Helm chart (with node-data-broker subchart); tests/ holds the helm-unittest suites + snapshots
docs/ # Public-facing docs — overview.md, architecture.md, api.md + providers/, engines/, reference/ subdirectories
tests/models/ # YAML simulation fixtures
tests/chainsaw/ # Chainsaw E2E test suites (label-application, label-truncation, node-observer, slinky)
config/ # Sample topograph-config.yaml
scripts/ # Build scripts (deb, rpm, SSL, clean)
localdev/ # Developer-local workspace — not tracked; personal scratch files
@@ -94,6 +95,20 @@ make coverage # human-readable per-package summary

Run `make qualify` before pushing. The individual targets are available if you want to run a single check during iteration. Run `make chart-test` when you change `charts/topograph/` or its subcharts; CI runs it on every workflow trigger.

### E2E tests (Chainsaw)

Chainsaw conformance tests live in `tests/chainsaw/` and exercise the full Helm deploy → generate → assert cycle against a real cluster.

```bash
make e2e-local # build image, create kind cluster, run all suites, delete cluster
make kind-load KIND_CLUSTER=<name> # load image into an existing kind cluster (run before make e2e)
make e2e # run suites against current KUBECONFIG context
```

`make e2e` uses `E2E_IMAGE_TAG` (defaults to the short commit SHA) as the image tag. For a local kind cluster, run `make image-build && make kind-load KIND_CLUSTER=<name>` before `make e2e`. The tag changes with every commit, so repeat both steps after each new commit. Prerequisites: `chainsaw`, `kind`, `helm`, `kubectl`, `docker`. See `tests/chainsaw/README.md` for details.

These tests are triggered manually via `.github/workflows/e2e.yml` (`workflow_dispatch`). Run them before merging changes to the Helm chart, Node Observer, or engine output.

### Coverage policy

From `codecov.yml`:
@@ -109,6 +124,7 @@ Coverage checks run on pull requests. A drop below target with no matching uplif
- `.github/workflows/docker.yml` — container image build (manual trigger)
- `.github/workflows/docker-ib.yml` — InfiniBand-variant container (manual trigger)
- `.github/workflows/helm-release.yaml` — Helm chart release (manual trigger)
- `.github/workflows/e2e.yml` — Chainsaw E2E suite against a kind cluster (manual trigger via `workflow_dispatch`)

### Deployment surfaces

100 changes: 100 additions & 0 deletions .github/workflows/e2e.yml
@@ -0,0 +1,100 @@
# Copyright 2026 NVIDIA CORPORATION
# SPDX-License-Identifier: Apache-2.0

name: E2E

on:
  workflow_dispatch:
    inputs:
      chainsaw_version:
        description: "Chainsaw version to install (e.g. v0.2.12)"
        required: false
        default: "latest"

env:
  KIND_CLUSTER: topograph-e2e
  IMAGE_REPO: ghcr.io/nvidia/topograph
  CHAINSAW_VERSION: ${{ github.event.inputs.chainsaw_version || 'latest' }}

jobs:
  e2e:
    name: Chainsaw E2E
    runs-on: ubuntu-latest
    permissions:
      contents: read

    steps:
      - uses: actions/checkout@v5

      - name: Set up Go
        uses: actions/setup-go@v6
        with:
          go-version: '1.25.9'

      - name: Install kind
        run: go install sigs.k8s.io/kind@latest

      - name: Install Chainsaw
        run: |
          if [ "$CHAINSAW_VERSION" = "latest" ]; then
            TAG=$(curl -s https://api.github.com/repos/kyverno/chainsaw/releases/latest \
              | grep '"tag_name"' | cut -d'"' -f4)
          else
            TAG="$CHAINSAW_VERSION"
          fi
          echo "Installing Chainsaw $TAG"
          BASE_URL="https://github.com/kyverno/chainsaw/releases/download/${TAG}"
          curl -fsSL "${BASE_URL}/chainsaw_linux_amd64.tar.gz" -o chainsaw.tar.gz
          curl -fsSL "${BASE_URL}/chainsaw_checksums.txt" -o chainsaw_checksums.txt
          grep "chainsaw_linux_amd64.tar.gz" chainsaw_checksums.txt | sha256sum -c -
          tar xz -f chainsaw.tar.gz chainsaw
          sudo mv chainsaw /usr/local/bin/
          rm -f chainsaw.tar.gz chainsaw_checksums.txt
          chainsaw version

      - name: Create kind cluster
        run: |
          kind create cluster \
            --name "$KIND_CLUSTER" \
            --config tests/chainsaw/kind-config.yaml \
            --wait 120s

      - name: Build Linux/amd64 image
        run: make build-linux-amd64

      - name: Build container image
        env:
          GOOS: linux
          GOARCH: amd64
        run: |
          # Use the short commit SHA as the image tag: always a valid Docker tag,
          # works regardless of branch naming conventions.
          IMAGE_TAG=$(git rev-parse --short HEAD)
          make image-build IMAGE_TAG="$IMAGE_TAG"
          echo "IMAGE_TAG=$IMAGE_TAG" >> "$GITHUB_ENV"

      - name: Load image into kind
        run: |
          kind load docker-image "${IMAGE_REPO}:${IMAGE_TAG}" \
            --name "$KIND_CLUSTER"

      - name: Run E2E tests
        env:
          TOPOGRAPH_IMAGE_REPO: ${{ env.IMAGE_REPO }}
          TOPOGRAPH_IMAGE_PULL_POLICY: Never
        run: |
          make e2e E2E_IMAGE_TAG="$IMAGE_TAG"

      - name: Collect diagnostic logs on failure
        if: failure()
        run: |
          echo "=== kind nodes ==="
          kubectl get nodes -o wide
          echo "=== all pods ==="
          kubectl get pods -A -o wide
          echo "=== recent events ==="
          kubectl get events -A --sort-by='.lastTimestamp' | tail -50

      - name: Delete kind cluster
        if: always()
        run: kind delete cluster --name "$KIND_CLUSTER"
16 changes: 16 additions & 0 deletions AGENTS.md
@@ -42,6 +42,7 @@ internal/ # Shared utilities not part of the public API
charts/topograph/ # Helm chart (with node-data-broker subchart); tests/ holds the helm-unittest suites + snapshots
docs/ # Public-facing docs — overview.md, architecture.md, api.md + providers/, engines/, reference/ subdirectories
tests/models/ # YAML simulation fixtures
tests/chainsaw/ # Chainsaw E2E test suites (label-application, label-truncation, node-observer, slinky)
config/ # Sample topograph-config.yaml
scripts/ # Build scripts (deb, rpm, SSL, clean)
localdev/ # Developer-local workspace — not tracked; personal scratch files
@@ -94,6 +95,20 @@ make coverage # human-readable per-package summary

Run `make qualify` before pushing. The individual targets are available if you want to run a single check during iteration. Run `make chart-test` when you change `charts/topograph/` or its subcharts; CI runs it on every workflow trigger.

### E2E tests (Chainsaw)

Chainsaw conformance tests live in `tests/chainsaw/` and exercise the full Helm deploy → generate → assert cycle against a real cluster.

```bash
make e2e-local # build image, create kind cluster, run all suites, delete cluster
make kind-load KIND_CLUSTER=<name> # load image into an existing kind cluster (run before make e2e)
make e2e # run suites against current KUBECONFIG context
```

`make e2e` uses `E2E_IMAGE_TAG` (defaults to the short commit SHA) as the image tag. For a local kind cluster, run `make image-build && make kind-load KIND_CLUSTER=<name>` before `make e2e`. The tag changes with every commit, so repeat both steps after each new commit. Prerequisites: `chainsaw`, `kind`, `helm`, `kubectl`, `docker`. See `tests/chainsaw/README.md` for details.

These tests are triggered manually via `.github/workflows/e2e.yml` (`workflow_dispatch`). Run them before merging changes to the Helm chart, Node Observer, or engine output.

### Coverage policy

From `codecov.yml`:
@@ -109,6 +124,7 @@ Coverage checks run on pull requests. A drop below target with no matching uplif
- `.github/workflows/docker.yml` — container image build (manual trigger)
- `.github/workflows/docker-ib.yml` — InfiniBand-variant container (manual trigger)
- `.github/workflows/helm-release.yaml` — Helm chart release (manual trigger)
- `.github/workflows/e2e.yml` — Chainsaw E2E suite against a kind cluster (manual trigger via `workflow_dispatch`)

### Deployment surfaces

47 changes: 45 additions & 2 deletions Makefile
@@ -23,7 +23,7 @@ OUTPUT_DIR := ./bin

IMAGE_REPO ?=ghcr.io/nvidia/topograph
GIT_REF ?=$(shell git rev-parse --abbrev-ref HEAD)
-IMAGE_TAG ?=$(GIT_REF)
+IMAGE_TAG ?=$(shell git rev-parse --short HEAD)

.PHONY: build
build:
@@ -102,7 +102,7 @@ coverage: test

.PHONY: image-build
image-build:
-	$(DOCKER_BIN) build --build-arg TARGETOS=$(GOOS) --build-arg TARGETARCH=$(GOARCH) -t $(IMAGE_REPO):$(IMAGE_TAG) -f ./Dockerfile .
+	$(DOCKER_BIN) build --build-arg TARGETOS=linux --build-arg TARGETARCH=$(GOARCH) -t $(IMAGE_REPO):$(IMAGE_TAG) -f ./Dockerfile .

.PHONY: image-push
image-push: image-build
@@ -115,6 +115,49 @@ docker-buildx:
	$(DOCKER_BIN) buildx build --platform $(PLATFORMS) -t $(IMAGE_REPO):$(IMAGE_TAG) -f ./Dockerfile --push .
	- $(DOCKER_BIN) buildx rm topograph-builder

CHAINSAW_BIN ?= chainsaw
KIND_CLUSTER ?= topograph-e2e
E2E_IMAGE_TAG ?= $(IMAGE_TAG)

# Check that chainsaw is installed; print an install hint if not.
.PHONY: chainsaw-install
chainsaw-install:
	@command -v $(CHAINSAW_BIN) >/dev/null 2>&1 || \
		(echo "chainsaw not found: install from https://kyverno.github.io/chainsaw/latest/quick-start/install/"; exit 1)

# Load the locally built image into an existing kind cluster under
# E2E_IMAGE_TAG. Run this before make e2e against a local kind cluster:
#   make kind-load KIND_CLUSTER=topograph-test && make e2e
.PHONY: kind-load
kind-load:
	kind load docker-image $(IMAGE_REPO):$(E2E_IMAGE_TAG) --name $(KIND_CLUSTER)

# Run all Chainsaw E2E suites against the current KUBECONFIG context.
# For a pre-pushed registry image: set IMAGE_REPO and E2E_IMAGE_TAG; the recipe
# passes them to the suites as TOPOGRAPH_IMAGE_REPO and TOPOGRAPH_IMAGE_TAG.
# For a local kind cluster: run "make kind-load KIND_CLUSTER=<cluster>" first.
.PHONY: e2e
e2e: chainsaw-install
	TOPOGRAPH_IMAGE_REPO=$(IMAGE_REPO) \
	TOPOGRAPH_IMAGE_TAG=$(E2E_IMAGE_TAG) \
	$(CHAINSAW_BIN) test --test-dir tests/chainsaw

# Build the image, create a 4-worker kind cluster, load the image, run all
# Chainsaw suites, and destroy the cluster. Requires kind and chainsaw.
# KUBECONFIG must be a file path, and "kind get kubeconfig" prints the file
# contents, so the recipe writes them to a temporary file first.
.PHONY: e2e-local
e2e-local: chainsaw-install image-build
	kind create cluster --name $(KIND_CLUSTER) \
		--config tests/chainsaw/kind-config.yaml --wait 120s \
		|| kind get clusters | grep -q "^$(KIND_CLUSTER)$$"
	kind load docker-image $(IMAGE_REPO):$(E2E_IMAGE_TAG) --name $(KIND_CLUSTER)
	E2E_KUBECONFIG="$$(mktemp)"; \
	kind get kubeconfig --name $(KIND_CLUSTER) > "$$E2E_KUBECONFIG"; \
	KUBECONFIG="$$E2E_KUBECONFIG" \
	TOPOGRAPH_IMAGE_REPO=$(IMAGE_REPO) \
	TOPOGRAPH_IMAGE_TAG=$(E2E_IMAGE_TAG) \
	TOPOGRAPH_IMAGE_PULL_POLICY=Never \
	$(CHAINSAW_BIN) test --test-dir tests/chainsaw; \
	E2E_STATUS=$$?; \
	rm -f "$$E2E_KUBECONFIG"; \
	kind delete cluster --name $(KIND_CLUSTER); \
	exit $$E2E_STATUS

.PHONY: ssl
ssl:
SSL_DIR=ssl ./scripts/configure-ssl.sh
46 changes: 46 additions & 0 deletions docs/engines/k8s.md
@@ -344,6 +344,52 @@ tests:
enabled: false
```

### Conformance testing with Chainsaw

`helm test` verifies that a deployed instance is healthy. To verify that the engine actually **applies correct topology labels** to nodes, use the Chainsaw E2E suite in `tests/chainsaw/`.

[Chainsaw](https://kyverno.github.io/chainsaw/) is Kyverno's declarative E2E framework. Each suite drives `apply → wait → assert → cleanup` against a real cluster using the built-in **test provider** — no cloud credentials required.

#### Test suites

| Suite | What it checks |
|---|---|
| `k8s/label-application` | `leaf`, `spine`, and `accelerator` labels applied to nodes after generation |
| `k8s/label-truncation` | Switch names longer than 63 characters are replaced with an FNV64a hash so the result is a valid label value |
| `slinky/tree-topology` | Slinky engine writes correct `topology.conf` (tree topology) into a ConfigMap |
| `slinky/dra-provider` | DRA provider discovers NVLink clique topology; Slinky engine writes correct `topology.conf` (block topology) into a ConfigMap |
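
The 63-character limit checked by `k8s/label-truncation` comes from the Kubernetes label-value syntax. The sketch below is a minimal validity check for a candidate value, assuming only those syntax rules; the function name and the switch names are hypothetical, and it does not reproduce the FNV64a hashing that the engine performs.

```bash
#!/usr/bin/env bash
# Sketch: check whether a string is a valid Kubernetes label value.
# Rules: at most 63 characters; either empty, or starting and ending with an
# alphanumeric character, with '-', '_' and '.' allowed in between.
# is_valid_label_value is a hypothetical helper, not part of topograph.
is_valid_label_value() {
  local value="$1"
  local pattern='^([A-Za-z0-9]([A-Za-z0-9_.-]*[A-Za-z0-9])?)?$'
  (( ${#value} <= 63 )) && [[ "$value" =~ $pattern ]]
}

# Hypothetical switch names, for illustration only.
short_name="switch-leaf-01"
long_name="$(printf 'switch-%.0s' {1..12})end"   # 12 x "switch-" plus "end"

is_valid_label_value "$short_name" && echo "valid: $short_name"
is_valid_label_value "$long_name" || echo "invalid (${#long_name} chars): needs truncation"
```

A name that fails this check is the kind of input the truncation suite exercises.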

#### How suites map topology to nodes

Each suite ships a `topology-model.yaml` in its directory. The suite creates
fake K8s Node objects whose names match the node IDs in that file, loads the
file into a ConfigMap, and mounts it at `/etc/topograph/models/` inside the
pod. The `/v1/generate` request passes `modelFileName` pointing at the mounted
file. No node annotations are required.
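
As a rough illustration of the first step, the sketch below prints a minimal Node manifest per node ID. The node IDs and the function name are hypothetical, and the manifests used by the real suites may carry additional fields.

```bash
#!/usr/bin/env bash
# Sketch: print a minimal Node manifest whose name matches a node ID.
# fake_node_manifest is a hypothetical helper, not part of the suites.
fake_node_manifest() {
  local node_id="$1"
  cat <<EOF
apiVersion: v1
kind: Node
metadata:
  name: ${node_id}
EOF
}

# Hypothetical node IDs; a real suite takes them from topology-model.yaml.
for node_id in node-a node-b; do
  fake_node_manifest "$node_id"
  echo "---"
done
# To create the objects, the output could be piped to: kubectl apply -f -
```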

#### Running locally

```bash
# Prerequisites: chainsaw, kind, helm, kubectl, docker

# Full lifecycle — build, create cluster, run all suites, delete cluster:
make e2e-local

# Against an existing local kind cluster (repeat after each commit):
make image-build # rebuild with the current commit SHA tag
make kind-load KIND_CLUSTER=<cluster-name> # load into the cluster
make e2e

# Single suite only:
chainsaw test --test-dir tests/chainsaw/k8s/label-application
```

See `tests/chainsaw/README.md` for full prerequisites and environment variable reference.
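
Before starting a local run, it can help to confirm that the prerequisites listed above are on `PATH`. This is a minimal sketch; `missing_tools` is a hypothetical helper and is not provided by the repository.

```bash
#!/usr/bin/env bash
# Sketch: report which of the given tools are not found on PATH.
# missing_tools is a hypothetical helper; it prints one missing tool per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing="$(missing_tools chainsaw kind helm kubectl docker)"
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names print on one line.
  echo "missing prerequisites:" $missing
else
  echo "all prerequisites found"
fi
```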

#### Running in CI

The `.github/workflows/e2e.yml` workflow runs on `workflow_dispatch`. Trigger it manually from the GitHub UI before merging changes to the Helm chart, Node Observer, or engine output code paths.
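
The workflow can also be dispatched from a terminal with the GitHub CLI. The sketch below assumes `gh` is installed and authenticated for the repository; `trigger_e2e` and the `DRY_RUN` variable are hypothetical names, and by default the function only prints the command it would run.

```bash
#!/usr/bin/env bash
# Sketch: dispatch the E2E workflow with the GitHub CLI.
# trigger_e2e and DRY_RUN are hypothetical; DRY_RUN defaults to 1, which only
# prints the command. Set DRY_RUN=0 to actually run it.
trigger_e2e() {
  local chainsaw_version="${1:-latest}"
  local cmd=(gh workflow run e2e.yml -f "chainsaw_version=${chainsaw_version}")
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Pin the Chainsaw version used by the run (value taken from the input's description).
trigger_e2e v0.2.12
```

Passing no argument leaves `chainsaw_version` at `latest`, which matches the workflow input's default.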

### Chart README

For installation, prerequisites, values reference, and configuration examples, see [`charts/topograph/README.md`](../../charts/topograph/README.md) — also surfaced via `helm show readme topograph/topograph`.