test(e2e): e2e k8s environment tests using chainsaw #365

Open
ravisoundar wants to merge 1 commit into main from rs-chainsaw
@ravisoundar

Collaborator

Description

End-to-end tests for the Kubernetes environment using Chainsaw.
Addresses issue #263.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • All commits are signed off per DCO (git commit -s).

@ravisoundar ravisoundar requested a review from dmitsh as a code owner June 26, 2026 03:52
@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Contributor

Greptile Summary

This PR introduces a Chainsaw-based E2E test suite for the Kubernetes and Slinky engines, covering six test scenarios (label application, label truncation, tree topology, DRA provider, block complement, and dynamic nodes) against a real kind cluster. It also ships a GitHub Actions workflow (e2e.yml) triggered manually via workflow_dispatch, and adds make e2e, make e2e-local, and make kind-load convenience targets to the Makefile.

  • Image tagging: IMAGE_TAG is now derived from the short commit SHA (git rev-parse --short HEAD) instead of the branch name, eliminating the prior slash-in-tag problem; the CI step correctly passes the tag via a make command-line assignment (make e2e E2E_IMAGE_TAG="$IMAGE_TAG") so the Makefile ?= variable is properly overridden.
  • Chainsaw install: the workflow downloads chainsaw_checksums.txt and verifies the tarball's SHA-256 before extraction, addressing the prior integrity concern.
  • Test structure: all six suites follow a consistent prepare → install (Helm) → assert → finally (cleanup) pattern with catch blocks that emit diagnostic logs on failure; fake slurmd pods are status-patched to Ready so the Slinky engine's pod-based node-resolution path is exercised correctly.
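
As an illustration of the checksum step described in the list above, here is a minimal Go sketch of verifying a download against a checksums file before extracting it. It is not the workflow's actual step (the workflow itself is not shown here). It assumes the checksums file uses the `<hex digest>  <file name>` layout that `sha256sum` emits; the tarball name and the `expectedSum` and `verify` helpers are hypothetical.

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// expectedSum looks up fileName in the text of a checksums file whose lines
// are assumed to have the form "<hex sha256>  <file name>".
func expectedSum(checksums, fileName string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(checksums))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// sha256sum marks binary-mode entries with a leading '*'.
		if len(fields) == 2 && strings.TrimPrefix(fields[1], "*") == fileName {
			return strings.ToLower(fields[0]), true
		}
	}
	return "", false
}

// verify returns an error unless data hashes to the digest listed for fileName.
func verify(data []byte, checksums, fileName string) error {
	want, ok := expectedSum(checksums, fileName)
	if !ok {
		return fmt.Errorf("no checksum entry for %s", fileName)
	}
	sum := sha256.Sum256(data)
	if got := hex.EncodeToString(sum[:]); got != want {
		return fmt.Errorf("checksum mismatch for %s: got %s, want %s", fileName, got, want)
	}
	return nil
}

func main() {
	// Hypothetical payload and file name, standing in for the real tarball.
	payload := []byte("pretend tarball bytes")
	sum := sha256.Sum256(payload)
	list := hex.EncodeToString(sum[:]) + "  chainsaw_example.tar.gz\n"

	fmt.Println(verify(payload, list, "chainsaw_example.tar.gz") == nil)            // true
	fmt.Println(verify([]byte("tampered"), list, "chainsaw_example.tar.gz") == nil) // false
}
```

Failing when the entry is missing, rather than skipping the check, keeps a renamed or absent tarball from being installed unverified.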

Confidence Score: 5/5

Safe to merge; all tests are additive and the workflow is manually triggered, so no risk to existing CI pipelines.

The two previously flagged image-tag correctness issues are now resolved — the workflow uses the short commit SHA throughout and passes the tag via a make command-line assignment. The Chainsaw binary is now verified against a downloaded checksum file before installation. The test suites follow a disciplined prepare/install/assert/cleanup pattern with diagnostic catch blocks. The remaining notes are minor quality items (unpinned kind version, hardcoded TARGETOS) that do not affect test correctness.
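
For context on the slash-in-tag issue: container image tags may contain only letters, digits, underscores, periods, and hyphens, so a branch name containing `/` cannot be used as a tag, while a short commit SHA always can. The following Go sketch checks candidate tags against that grammar. The `validTag` helper and the `feature/e2e` branch name are hypothetical and are not taken from this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagPattern follows the image-reference tag grammar: a word character,
// then up to 127 word characters, periods, or hyphens.
var tagPattern = regexp.MustCompile(`^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$`)

// validTag reports whether tag can be used as a container image tag.
func validTag(tag string) bool {
	return tagPattern.MatchString(tag)
}

func main() {
	// "feature/e2e" is a hypothetical branch name; "5a07060" is a short SHA.
	for _, tag := range []string{"rs-chainsaw", "feature/e2e", "5a07060"} {
		fmt.Printf("%-12s valid=%v\n", tag, validTag(tag))
	}
}
```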

Makefile: the image-build target now hardcodes TARGETOS=linux, which silently drops any GOOS override callers might pass.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/workflows/e2e.yml | New workflow for Chainsaw E2E tests; uses short commit SHA for image tags (addressing prior slash-in-tag issue), verifies Chainsaw binary checksums, and passes image tag via make command-line override. Minor: kind is installed from @latest without version pinning. |
| Makefile | Adds e2e, e2e-local, kind-load, and chainsaw-install targets; switches IMAGE_TAG to short commit SHA; hardcodes TARGETOS=linux in image-build, silently ignoring any GOOS override. |
| tests/chainsaw/chainsaw-config.yaml | Global Chainsaw config with sequential execution (parallel:1) and reasonable timeouts; fullName:true aids diagnostics. |
| tests/chainsaw/k8s/label-application/chainsaw-test.yaml | Correct Chainsaw test for topology label application; uses Node Observer to auto-trigger generation, asserts leaf/spine labels on fake nodes, and includes good diagnostic catch blocks. |
| tests/chainsaw/k8s/label-truncation/chainsaw-test.yaml | Correctly tests FNV64a hash truncation of switch names exceeding 63 chars; polling loop with length and prefix checks is well-structured. |
| tests/chainsaw/slinky/dra-provider/chainsaw-test.yaml | Tests DRA provider NVLink clique topology discovery end-to-end; fake slurmd pods are correctly status-patched to Ready, and namespace is properly injected at install time. |
| tests/chainsaw/slinky/block-complement/chainsaw-test.yaml | Tests block-complementing with absent nodes across three NVLink cliques; asserts correct placeholder block004 is emitted; cleanup is thorough. |
| tests/chainsaw/slinky/dynamic-nodes/chainsaw-test.yaml | Tests skeleton-only ConfigMap and per-node topology annotations for dynamic nodes; asserts both ConfigMap structure and per-node annotation values correctly. |
| tests/chainsaw/slinky/tree-topology/chainsaw-test.yaml | Tests Slinky tree topology ConfigMap generation; correctly relies on real kind worker nodes to trigger the Node Observer while asserting model-driven output. |
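
The label-truncation suite exercises names longer than the 63-character limit Kubernetes places on label values. The Go sketch below shows one way such truncation could work with FNV-64a. The exact scheme the project uses is not shown in this PR page, so the prefix length, the `-` separator, the hex encoding, and the `truncateLabel` name are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// maxLabelLen is the Kubernetes limit on the length of a label value.
const maxLabelLen = 63

// truncateLabel returns name unchanged when it fits. Otherwise it keeps a
// prefix of name and appends the hex-encoded FNV-64a hash of the full name,
// so distinct long names stay distinguishable. (Assumed scheme, ASCII names.)
func truncateLabel(name string) string {
	if len(name) <= maxLabelLen {
		return name
	}
	h := fnv.New64a()
	h.Write([]byte(name)) // Write on a hash.Hash never returns an error
	suffix := fmt.Sprintf("%016x", h.Sum64())
	keep := maxLabelLen - len(suffix) - 1 // leave room for the separator
	return name[:keep] + "-" + suffix
}

func main() {
	long := strings.Repeat("s", 70) // hypothetical over-long switch name
	out := truncateLabel(long)
	fmt.Println(len(out))                     // 63
	fmt.Println(strings.HasPrefix(long, out[:46])) // true
	fmt.Println(truncateLabel("leaf-1"))      // leaf-1
}
```

Checking both the length and the retained prefix, as the suite's polling loop is described as doing, avoids hard-coding a hash value in the assertion.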

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([workflow_dispatch]) --> B[actions/checkout]
    B --> C[Set up Go]
    C --> D[Install kind via go install]
    D --> E[Install Chainsaw + verify SHA256]
    E --> F[kind create cluster]
    F --> G[make build-linux-amd64]
    G --> H[make image-build IMAGE_TAG=short-sha]
    H --> I[kind load docker-image]
    I --> J[make e2e E2E_IMAGE_TAG=short-sha]

    J --> K{Chainsaw suites}
    K --> L[k8s/label-application]
    K --> M[k8s/label-truncation]
    K --> N[slinky/tree-topology]
    K --> O[slinky/dra-provider]
    K --> P[slinky/block-complement]
    K --> Q[slinky/dynamic-nodes]

    L --> R[prepare fake nodes + ConfigMap]
    R --> S[helm upgrade --install topograph]
    S --> T[Node Observer fires → /v1/generate]
    T --> U{assert labels / ConfigMap}
    U -->|pass| V[finally: helm uninstall + cleanup]
    U -->|fail| W[catch: kubectl logs + describe]
    W --> V

    J -->|always| X[kind delete cluster]

Reviews (8): Last reviewed commit: "test(e2e): e2e k8s environment tests usi..."

@ravisoundar

Collaborator Author

/ok-to-test 9fee2e8

@codecov

codecov Bot commented Jun 26, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.36%. Comparing base (1875ab8) to head (5a07060).
⚠️ Report is 86 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #365      +/-   ##
==========================================
+ Coverage   68.46%   72.36%   +3.90%     
==========================================
  Files          82       86       +4     
  Lines        4842     5312     +470     
==========================================
+ Hits         3315     3844     +529     
+ Misses       1395     1278     -117     
- Partials      132      190      +58     

@ravisoundar ravisoundar force-pushed the rs-chainsaw branch 2 times, most recently from 31d1489 to 91eea39 Compare June 26, 2026 04:45
@ravisoundar

Collaborator Author

/ok-to-test 91eea39

@ravisoundar

Collaborator Author

/ok-to-test a10bc7d

@ravisoundar ravisoundar force-pushed the rs-chainsaw branch 2 times, most recently from b7ac4d8 to 5a07060 Compare June 27, 2026 01:08
@ravisoundar

Collaborator Author

/ok-to-test 5a07060

Signed-off-by: Ravi Shankar <ravish@nvidia.com>