3 changes: 1 addition & 2 deletions .claude/CLAUDE.md
@@ -65,7 +65,7 @@ These structures propagate across every provider and engine. Changing them in a
- **make**
- **golangci-lint** — `brew install golangci-lint` or via `go install`
- **helm 3.10+ or 4.x** — required for `make chart-test`; the `helm-unittest` plugin is installed automatically by the target (`brew install helm`). CI pins helm `v4.1.1` in `.github/workflows/chart-test.yaml`.
-- **docker** — only for container image builds and the IB variant
+- **docker** — for container image builds (the main image includes `rdma-core` / `ibnetdiscover` for InfiniBand deployments)

### Clone and build

@@ -107,7 +107,6 @@ Coverage checks run on pull requests. A drop below target with no matching uplif
- `.github/workflows/go.yml` — build, test, and lint on every push and PR
- `.github/workflows/chart-test.yaml` — Helm chart lint + helm-unittest suites (`make chart-test`) on every push and PR
- `.github/workflows/docker.yml` — container image build (manual trigger)
-- `.github/workflows/docker-ib.yml` — InfiniBand-variant container (manual trigger)
- `.github/workflows/helm-release.yaml` — Helm chart release (manual trigger)

### Deployment surfaces
70 changes: 0 additions & 70 deletions .github/workflows/docker-ib.yml

This file was deleted.

4 changes: 4 additions & 0 deletions .gitignore
@@ -1,5 +1,9 @@
/coverage.out
/bin
+# Binaries built at the repo root by `go build ./cmd/...`
+/topograph
+/node-observer
+/node-data-broker-initc
/ssl
/deb/topograph/DEBIAN/control
/deb/topograph/usr/local/bin/
3 changes: 1 addition & 2 deletions AGENTS.md
@@ -65,7 +65,7 @@ These structures propagate across every provider and engine. Changing them in a
- **make**
- **golangci-lint** — `brew install golangci-lint` or via `go install`
- **helm 3.10+ or 4.x** — required for `make chart-test`; the `helm-unittest` plugin is installed automatically by the target (`brew install helm`). CI pins helm `v4.1.1` in `.github/workflows/chart-test.yaml`.
-- **docker** — only for container image builds and the IB variant
+- **docker** — for container image builds (the main image includes `rdma-core` / `ibnetdiscover` for InfiniBand deployments)

### Clone and build

@@ -107,7 +107,6 @@ Coverage checks run on pull requests. A drop below target with no matching uplif
- `.github/workflows/go.yml` — build, test, and lint on every push and PR
- `.github/workflows/chart-test.yaml` — Helm chart lint + helm-unittest suites (`make chart-test`) on every push and PR
- `.github/workflows/docker.yml` — container image build (manual trigger)
-- `.github/workflows/docker-ib.yml` — InfiniBand-variant container (manual trigger)
- `.github/workflows/helm-release.yaml` — Helm chart release (manual trigger)

### Deployment surfaces
2 changes: 2 additions & 0 deletions Dockerfile
@@ -10,4 +10,6 @@ RUN make build-${TARGETOS}-${TARGETARCH}

FROM alpine:3

+RUN apk add --no-cache rdma-core
+
COPY --from=builder /go/src/github.com/NVIDIA/topograph/bin/* /usr/local/bin/
8 changes: 0 additions & 8 deletions Dockerfile.ib

This file was deleted.

2 changes: 1 addition & 1 deletion charts/topograph/README.md
@@ -93,7 +93,7 @@ Both test pods are removed automatically on success (`helm.sh/hook-delete-policy

By default, the test pods reuse the main topograph image. Topograph's default image is Alpine-based and ships with busybox `wget`, which the test probes use — so `helm test` works without pulling any additional image, including in air-gapped environments where only mirrored images are reachable.

-If you run a topograph image variant without busybox `wget` (for example, the IB variant built on `ubuntu`), override the test image to point at one that does, via `tests.image.repository` and `tests.image.tag`. You can also disable the tests entirely with `tests.enabled=false`.
+If your mirrored image lacks busybox `wget`, override the test image to point at one that does, via `tests.image.repository` and `tests.image.tag`. You can also disable the tests entirely with `tests.enabled=false`.

## Subcharts

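For reference, the test-image override described in the README change above could be expressed as a values fragment like the following. This is a minimal sketch: the `tests.*` keys are the ones named in the README text, while the repository and tag are placeholders chosen for illustration, not images the project publishes.

```yaml
# Hypothetical values override for the topograph chart.
# Only the key names come from the README; the image below is a placeholder
# for any mirrored image that ships busybox `wget`.
tests:
  enabled: true            # set to false to skip `helm test` pods entirely
  image:
    repository: registry.example.com/mirror/busybox   # placeholder
    tag: "1.36"                                        # placeholder
```

The same keys can be passed on the command line with `--set tests.image.repository=... --set tests.image.tag=...`.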
@@ -37,13 +37,6 @@ Container image reference. The tag defaults to the chart appVersion when unset.
{{- printf "%s:%s" .Values.image.repository (.Values.image.tag | default .Chart.AppVersion) -}}
{{- end }}

-{{/*
-Init container image reference. The tag defaults to the chart appVersion when unset.
-*/}}
-{{- define "node-data-broker.initImage" -}}
-{{- printf "%s:%s" .Values.initc.image.repository (.Values.initc.image.tag | default .Chart.AppVersion) -}}
-{{- end }}
-
{{/*
Common labels
*/}}
63 changes: 32 additions & 31 deletions charts/topograph/charts/node-data-broker/templates/daemonset.yaml
@@ -25,52 +25,53 @@ spec:
serviceAccountName: {{ include "node-data-broker.serviceAccountName" . }}
securityContext:
{{- toYaml .Values.podSecurityContext | nindent 8 }}
-{{- if .Values.initc.enabled }}
-initContainers:
-- name: init-node-labels
-image: {{ include "node-data-broker.initImage" . | quote }}
-imagePullPolicy: {{ .Values.initc.image.pullPolicy }}
+containers:
+- name: {{ .Chart.Name }}
+securityContext:
+{{- toYaml .Values.securityContext | nindent 12 }}
+image: {{ include "node-data-broker.image" . | quote }}
+imagePullPolicy: {{ .Values.image.pullPolicy }}
command:
- /usr/local/bin/node-data-broker-initc
args:
- --provider={{ .Values.global.provider.name }}
- -v={{ .Values.verbosity }}
+- --port={{ .Values.port }}
+- --refresh-interval={{ .Values.refreshInterval }}
{{- if $useGpuCliqueLabel }}
- --set=useGpuCliqueLabel=true
{{- end }}
-{{- range .Values.initc.extraArgs }}
+{{- range .Values.extraArgs }}
- --set={{ . }}
{{- end }}
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
-{{- if or $configMapMounts .Values.volumeMounts }}
-volumeMounts:
-{{- range $configMapMounts }}
-- name: {{ include "node-data-broker.configMapMountVolumeName" (dict "name" .name) }}
-mountPath: {{ required "node-data-broker.configMapMounts[].mountPath is required" .mountPath | quote }}
-{{- with .subPath }}
-subPath: {{ . | quote }}
-{{- end }}
-readOnly: true
-{{- end }}
-{{- with .Values.volumeMounts }}
-{{- toYaml . | nindent 12 }}
-{{- end }}
-{{- end }}
-{{- end }}
-containers:
-- name: {{ .Chart.Name }}
-securityContext:
-{{- toYaml .Values.securityContext | nindent 12 }}
-image: {{ include "node-data-broker.image" . | quote }}
-imagePullPolicy: {{ .Values.image.pullPolicy }}
-{{- with .Values.command }}
-command:
-{{ toYaml . | nindent 12 }}
-{{- end }}
+ports:
+- name: http
+containerPort: {{ .Values.port }}
+protocol: TCP
+# The broker applies node annotations before it starts serving
+# /healthz. The startup probe gates liveness/readiness so slow
+# providers (e.g. infiniband ibnetdiscover) have up to
+# failureThreshold * periodSeconds to finish before the container can
+# be restarted.
+startupProbe:
+httpGet:
+path: /healthz
+port: http
+failureThreshold: {{ .Values.startupProbe.failureThreshold }}
+periodSeconds: {{ .Values.startupProbe.periodSeconds }}
+livenessProbe:
+httpGet:
+path: /healthz
+port: http
+readinessProbe:
+httpGet:
+path: /healthz
+port: http
resources:
{{- toYaml .Values.resources | nindent 12 }}
{{- if or $configMapMounts .Values.volumeMounts }}
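To make the template change above easier to follow, here is a rough sketch of what the container section might render to with the chart defaults shown in `values.yaml` (port `8080`, `refreshInterval: 5m`, startup probe `30 x 10s`). It is not actual `helm template` output: the indentation is reconstructed, the container name assumes the subchart is named `node-data-broker`, and `<provider>`, `<verbosity>`, and `<appVersion>` are placeholders for values this diff does not show. Security context, resources, and volume mounts are omitted.

```yaml
# Illustrative rendering only; placeholders in angle brackets are assumptions.
containers:
  - name: node-data-broker
    image: "ghcr.io/nvidia/topograph:<appVersion>"
    imagePullPolicy: IfNotPresent
    command:
      - /usr/local/bin/node-data-broker-initc
    args:
      - --provider=<provider>
      - -v=<verbosity>
      - --port=8080
      - --refresh-interval=5m
    env:
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
    ports:
      - name: http
        containerPort: 8080
        protocol: TCP
    # Liveness and readiness checks only begin once the startup probe succeeds.
    startupProbe:
      httpGet:
        path: /healthz
        port: http
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /healthz
        port: http
    readinessProbe:
      httpGet:
        path: /healthz
        port: http
```

The net effect of the diff is that the former init container's command and arguments move into the long-running container, which now exposes `/healthz` in place of the old `tail -f /dev/null` placeholder command.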
45 changes: 27 additions & 18 deletions charts/topograph/charts/node-data-broker/values.yaml
@@ -3,26 +3,35 @@
# Declare variables to be passed into your templates.

image:
-repository: curlimages/curl
+repository: ghcr.io/nvidia/topograph
pullPolicy: IfNotPresent
-tag: 8.13.0
+# Overrides the image tag whose default is the chart appVersion.
+tag: ""

enabled: true

-command:
-- tail
-- -f
-- /dev/null
-
-initc:
-enabled: true
-image:
-repository: ghcr.io/nvidia/topograph
-pullPolicy: IfNotPresent
-# Overrides the image tag whose default is the chart appVersion.
-tag: ""
-extraArgs:
-# - key=val
+# Port the node-data-broker serves its /healthz endpoint on after applying
+# node annotations.
+port: 8080
+
+# How often to re-apply node annotations after the initial startup apply.
+# Slow providers (e.g. infiniband ibnetdiscover) may need a longer interval.
+# Set to 0 to disable periodic refresh.
+refreshInterval: 5m
+
+# Extra key=value parameters passed to node-data-broker-initc via --set
+# (e.g. gpu-operator-namespace / device-plugin-daemonset for infiniband-k8s).
+extraArgs: []
+# - key=val
+
+# Startup probe for the node-data-broker container. The broker applies node
+# annotations before it starts serving /healthz, so the startup probe gates the
+# liveness/readiness probes and gives slow providers (e.g. infiniband
+# ibnetdiscover) time to finish. The total startup budget is
+# failureThreshold * periodSeconds (default 30 * 10s = 5m).
+startupProbe:
+failureThreshold: 30
+periodSeconds: 10

imagePullSecrets: []
nameOverride: ""
@@ -67,8 +76,8 @@ resources:
cpu: 100m
memory: 128Mi

-# Optional ConfigMaps rendered by the chart and mounted into both the
-# node-data-broker container and init container. Set subPath for file mounts;
+# Optional ConfigMaps rendered by the chart and mounted into the
+# node-data-broker container. Set subPath for file mounts;
# omit it to mount the whole ConfigMap as a directory.
configMapMounts: []
# name is used as a Kubernetes resource and volume suffix; keep it DNS-label-safe.
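As an illustration of the new tunables, a deployment with a slow provider might widen the startup budget and lengthen the refresh interval roughly as follows. This is a sketch under assumptions: the top-level `node-data-broker` key assumes the subchart is consumed under its own name without an alias, and the specific numbers are examples rather than recommendations from this change.

```yaml
# Hypothetical override in the parent chart's values file.
node-data-broker:
  # Re-apply node annotations every 15 minutes instead of the default 5m.
  refreshInterval: 15m
  startupProbe:
    # Startup budget = failureThreshold * periodSeconds = 60 * 10s = 10m.
    failureThreshold: 60
    periodSeconds: 10
```

Raising `failureThreshold` rather than `periodSeconds` keeps the probe cadence unchanged, so a fast startup is still detected within one 10-second period.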