Skip to content

[WIP]feat(ci): add testgrid-publish workflow (TG5)#1477

Open
srao-nv wants to merge 1 commit into
mainfrom
feat/tg5-testgrid-ci
Open

[WIP]feat(ci): add testgrid-publish workflow (TG5)#1477
srao-nv wants to merge 1 commit into
mainfrom
feat/tg5-testgrid-ci

Conversation

@srao-nv

@srao-nv srao-nv commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds .github/workflows/testgrid-publish.yml — the GitHub Actions workflow that automatically publishes AICR UAT evidence bundles to the TestGrid dashboard after every completed cluster run.

Motivation / Context

TG5 from the AICR TestGrid epic (#1263). Once TG2 (tools/testgrid-publish) merges, this wires it into CI so every UAT run appears as a column in the TestGrid dashboard without manual intervention.

Fixes: #1268
Related: #1263, #1267

Type of Change

  • Build/CI/tooling

Component(s) Affected

  • Other: .github/workflows/testgrid-publish.yml (new CI workflow)

Implementation Notes

Trigger: workflow_run on "UAT GCP" and "UAT AWS" (success or failure — failing runs are valid TestGrid columns showing which tests broke). Also supports workflow_dispatch for manual backfills.

Pipeline:

UAT completes → download evidence pointer artifact
             → extract OCI bundle ref + sha256 digest from pointer.yaml
             → GCP WIF auth (publish SA, groups/ prefix write-only)
             → go build ./tools/testgrid-publish
             → testgrid-publish --bundle <oci-ref>@<digest> --bucket aicr-testgrid

Auth: WIF pool aicr-testgrid-prod-github (provisioned in aicr-testgrid Terraform). Workflow filename starts with testgrid-publish to satisfy the WIF attribute_condition in wif-publish.tf.

Environment: Defaults to prod (gs://aicr-testgrid). Override to staging via workflow_dispatch(environment=staging) for testing. WIF pool, publish SA, and bucket are all derived from the environment input.

Dependency: Requires TG2 (tools/testgrid-publish) — PR #1447.

Before merge:

Testing

Tested the full pipeline manually today:

# Built testgrid-publish from source
make testgrid-publish

# Ran against real GKE H100 cluster data
./dist/testgrid-publish \
  --bundle-dir /tmp/tg-bundle \
  --bucket aicr-testgrid-staging \
  --source-class uat

# Verified GCS objects landed in correct order
# Manually ran config-gen + updater + tabulator + summarizer
# Confirmed gke-h100-cos/training tab appeared in TestGrid UI

Risk Assessment

  • Low — New workflow file only; no existing code modified. Cannot break existing CI. Will be a no-op until prod Terraform is applied.

Rollout notes: Workflow is inert until prod WIF pool and SA exist. Safe to merge ahead of Terraform apply.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@srao-nv srao-nv requested a review from a team as a code owner June 25, 2026 20:19
@srao-nv srao-nv added the theme/ci-dx CI pipelines, developer experience, and build tooling label Jun 25, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Welcome to AICR, @srao-nv! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 77.3%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-77.3%25-green)

No Go source files changed in this PR.

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a new GitHub Actions workflow for TestGrid publishing. It runs after completed UAT GCP or AWS workflows and supports manual dispatch. The workflow validates the producer repository, resolves environment-scoped GCP settings, downloads and parses evidence pointer artifacts when present, conditionally authenticates to GCP and GHCR, builds tools/testgrid-publish, and runs it with the resolved bundle and bucket.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/aicr#1479: Changes the UAT evidence OCI coordinates that this workflow reads from pointer.yaml and validates before publishing.

Suggested labels

theme/validation

Suggested reviewers

  • mchmarny
  • njhensley
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The PR adds a TestGrid publishing workflow, but #1268 requires TG3 processing workers and config-gen, so the objectives do not match. Align the PR to #1268 by implementing TG3 worker deployment, AICR config-gen, and the TG2↔TG3 metadata-key contract test.
Out of Scope Changes check ⚠️ Warning The workflow is outside the linked issue scope because #1268 focuses on processing workers and config generation, not publishing automation. Remove the publishing workflow from this PR or retarget it to the TG3 processing deliverables described in #1268.
✅ Passed checks (2 passed)
Check name Status Explanation
Description check ✅ Passed The description describes adding the TestGrid publish workflow and its trigger/pipeline, which matches the changeset.
Title check ✅ Passed The title clearly matches the main change: adding the TestGrid publish GitHub Actions workflow.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/tg5-testgrid-ci

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 56-58: The TestGrid publishing workflow currently maps the prod
environment to suffixed resource names, which conflicts with the intended
unsuffixed prod resources. Update the environment-to-resource expressions in the
workflow so the prod branch uses the plain TestGrid bucket, GitHub secret, and
publish secret names, while only non-prod keeps the suffixed staging-style
values; check the environment mapping near the environment input and the related
expressions referenced in the later section of the same workflow.
- Around line 68-71: The TestGrid publish trigger is too broad because it runs
on failed workflow_run conclusions even though the producer jobs only upload
pointer.yaml on success, so the download will fail. Update the workflow
condition in the publish job to match the artifact contract, or change the UAT
producer steps that emit the pointer artifact to always publish a failure
evidence pointer; use the existing workflow gate and the
conformance/pointer.yaml upload path to locate the affected logic.
- Around line 167-170: The Publish to TestGrid step in the workflow is
interpolating steps.bundle.outputs.bundle_ref directly into a shell command,
which can allow shell injection. Update the Publish to TestGrid run block to
pass the bundle reference through env first, then reference the shell variable
inside the command instead; use the existing Publish to TestGrid job step and
bundle output wiring to locate the fix.
- Around line 53-55: The publish workflow is ignoring the `source_class`
dispatch input and always using `uat`, which misclassifies manual backfills.
Update the testgrid publish command in the workflow to read from the
`source_class` input instead of hardcoding `uat`, and make sure the same input
is honored anywhere else the publish step is assembled so the value passed to
the publish action matches the manual dispatch request.
- Around line 134-142: The bundle reference assembly in the workflow step that
reads `OCI` and `DIGEST` from `POINTER` only validates `OCI`, so a missing or
null digest can still produce an invalid `BUNDLE_REF`. Add a digest check
alongside the existing OCI validation before constructing `bundle_ref`, and fail
early with a clear error message if `.attestations[0].bundle.digest` is missing
or null. Keep the fix in the same shell block that sets `OCI`, `DIGEST`, and
`BUNDLE_REF` so the pointer contract is enforced before publishing.
- Around line 42-44: The workflow_run trigger is referencing the wrong producer
workflow names, so the publish job will never fire. Update the workflows list in
the testgrid publish workflow to match the exact producer names used elsewhere,
specifically the UAT GCP and UAT AWS workflow identifiers in the workflow_run
configuration.
- Around line 68-75: The workflow_run job currently grants id-token: write and
publishes artifacts without verifying the completed producer run came from the
expected repository/default branch. Add a trusted-ref guard to the job’s if
condition in testgrid-publish so it only runs for workflow_run events from the
intended repo and default branch, while still allowing workflow_dispatch; use
the existing workflow_run and permissions block to locate the fix.
- Around line 108-143: The manual backfill path is still tied to
`github.event.workflow_run.id`, so `workflow_dispatch` cannot download the
artifact and the bundle extraction ignores the required `inputs.bundle_ref`.
Update the `Download evidence pointer` / `Extract bundle ref` logic in
`testgrid-publish.yml` to branch on the trigger: use the artifact-based
`pointer.yaml` flow only for workflow-run driven executions, and for manual
dispatch set `bundle_ref` directly from `inputs.bundle_ref` so the later steps
consume the provided value instead of trying to derive it.
- Around line 73-75: The workflow job is missing the `actions: read` permission
needed by `actions/download-artifact` when using `run-id` and `github-token`.
Update the permissions block in the testgrid publish job to include `actions:
read` alongside the existing `contents: read` and `id-token: write` entries so
the artifact download can access the triggering run.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a01c8ebc-fd54-489b-944b-da7f2a68750b

📥 Commits

Reviewing files that changed from the base of the PR and between 6fc3444 and 21b9f86.

📒 Files selected for processing (1)
  • .github/workflows/testgrid-publish.yml

Comment thread .github/workflows/testgrid-publish.yml Outdated
Comment thread .github/workflows/testgrid-publish.yml
Comment thread .github/workflows/testgrid-publish.yml
Comment thread .github/workflows/testgrid-publish.yml
Comment thread .github/workflows/testgrid-publish.yml Outdated
Comment thread .github/workflows/testgrid-publish.yml Outdated
Comment thread .github/workflows/testgrid-publish.yml
Comment thread .github/workflows/testgrid-publish.yml Outdated
Comment thread .github/workflows/testgrid-publish.yml Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
.github/workflows/testgrid-publish.yml (1)

131-149: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Move bundle_ref input out of the shell template.

Line 133 still injects the manual bundle_ref input directly into run:. A value containing shell metacharacters can break out before WIF-backed publishing. Pass it via env and validate the final ref before writing GITHUB_OUTPUT.

🛡️ Proposed fix
       - name: Resolve bundle ref
         id: bundle
+        env:
+          EVENT_NAME: ${{ github.event_name }}
+          INPUT_BUNDLE_REF: ${{ inputs.bundle_ref }}
         run: |
-          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
-            BUNDLE_REF="${{ inputs.bundle_ref }}"
+          if [ "${EVENT_NAME}" = "workflow_dispatch" ]; then
+            BUNDLE_REF="${INPUT_BUNDLE_REF}"
           else
             POINTER="$(find evidence-pointer/ -name 'pointer.yaml' | head -1)"
             if [ -z "${POINTER}" ]; then
               echo "::error::No pointer.yaml found in evidence pointer artifact"
@@
             fi
             BUNDLE_REF="${OCI}@${DIGEST}"
           fi
-          echo "bundle_ref=${BUNDLE_REF}" >> "${GITHUB_OUTPUT}"
+          if [[ ! "${BUNDLE_REF}" =~ ^ghcr\.io/nvidia/[^[:space:]]+@sha256:[0-9a-fA-F]{64}$ ]]; then
+            echo "::error::bundle_ref must be ghcr.io/nvidia/...@sha256:<64 hex chars>"
+            exit 1
+          fi
+          printf 'bundle_ref=%s\n' "${BUNDLE_REF}" >> "${GITHUB_OUTPUT}"
           echo "Resolved bundle: ${BUNDLE_REF}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/testgrid-publish.yml around lines 131 - 149, The workflow
step still interpolates the manual bundle_ref input directly inside the shell
script, which can allow shell metacharacters to affect execution. Update the
publishing step to pass bundle_ref through env instead of embedding it in run,
then use that variable in the workflow_dispatch branch and validate the final
BUNDLE_REF before echoing it to GITHUB_OUTPUT. Keep the fix localized to the
workflow step that sets BUNDLE_REF and writes bundle_ref.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 131-149: The workflow step still interpolates the manual
bundle_ref input directly inside the shell script, which can allow shell
metacharacters to affect execution. Update the publishing step to pass
bundle_ref through env instead of embedding it in run, then use that variable in
the workflow_dispatch branch and validate the final BUNDLE_REF before echoing it
to GITHUB_OUTPUT. Keep the fix localized to the workflow step that sets
BUNDLE_REF and writes bundle_ref.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a872862d-dd25-4b4e-a7fd-d1982f392cc9

📥 Commits

Reviewing files that changed from the base of the PR and between 21b9f86 and 01fee9b.

📒 Files selected for processing (1)
  • .github/workflows/testgrid-publish.yml

@github-actions github-actions Bot added size/L and removed size/M labels Jun 25, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/testgrid-publish.yml (1)

171-194: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Validate BUNDLE_REF before writing it to GITHUB_OUTPUT.

Manual input can be empty/malformed, and any newline in BUNDLE_REF can inject additional outputs. Enforce the digest-pinned ref shape before Line 194.

🛡️ Proposed fix
           else
             POINTER="$(find evidence-pointer/ -name 'pointer.yaml' 2>/dev/null | head -1)"
@@
             fi
             BUNDLE_REF="${OCI}@${DIGEST}"
           fi
+          if ! printf '%s' "${BUNDLE_REF}" | grep -Eq '^ghcr\.io/nvidia/[^[:space:]]+@sha256:[0-9a-fA-F]{64}$'; then
+            echo "::error::Invalid bundle ref '${BUNDLE_REF}'"
+            exit 1
+          fi
           echo "bundle_ref=${BUNDLE_REF}" >> "${GITHUB_OUTPUT}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/testgrid-publish.yml around lines 171 - 194, The TestGrid
publish step writes BUNDLE_REF to GITHUB_OUTPUT without validating the final
value, so malformed manual input or embedded newlines could inject extra
outputs. In the workflow logic around the BUNDLE_REF assignment in the shell
block, validate that BUNDLE_REF is non-empty, matches a digest-pinned OCI ref
shape, and contains no newline characters before echoing it to GITHUB_OUTPUT.
Apply this check for both the workflow_dispatch path and the pointer-derived
path, and fail fast with an error if validation does not pass.
♻️ Duplicate comments (1)
.github/workflows/testgrid-publish.yml (1)

95-107: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Reject untrusted producer heads before publishing.

This guard still publishes non-main runs and checks workflow_run.repository.full_name instead of the producer head repository. Use github.event.workflow_run.head_repository.full_name and fail on non-main so fork/branch runs cannot reach WIF-backed publishing.

🛡️ Proposed fix
-          PRODUCER_REPO: ${{ github.event.workflow_run.repository.full_name }}
+          PRODUCER_REPO: ${{ github.event.workflow_run.head_repository.full_name }}
           PRODUCER_BRANCH: ${{ github.event.workflow_run.head_branch }}
@@
           if [ "${PRODUCER_BRANCH}" != "main" ]; then
-            echo "::warning::Producer ran on branch '${PRODUCER_BRANCH}' (not main) — publishing anyway"
+            echo "::error::Producer ran on branch '${PRODUCER_BRANCH}' (not main) — rejecting"
+            exit 1
           fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/testgrid-publish.yml around lines 95 - 107, In the
Validate producer step, the workflow is still allowing non-main producer runs to
publish and is checking the wrong repository field. Update the guard in the
workflow_run path to use github.event.workflow_run.head_repository.full_name
instead of github.event.workflow_run.repository.full_name, and make the branch
check on github.event.workflow_run.head_branch fail the job for anything other
than main so untrusted fork/branch heads cannot reach the publishing flow.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 220-234: The release lookup in the testgrid-publish workflow
currently falls back to the latest GitHub release, which can reuse an older
binary for untagged commits. Update the logic around the RELEASE check in the
testgrid-publish step so it only downloads when HEAD is an exact tag (for
example by verifying the current ref/tag before calling gh release view);
otherwise, always take the existing source-build path for
./tools/testgrid-publish.

In @.goreleaser.yaml:
- Around line 201-207: The testgrid-publish release artifact is being shipped as
an unverified tarball, so update the goreleaser config for the testgrid-publish
build artifact and the .github/workflows/testgrid-publish.yml consumer to add an
integrity verification step before extraction. Either publish and verify signed
checksum/attestation metadata for the testgrid-publish artifact using the
existing release artifact pattern as a guide, or remove this binary download
path and build from source in the workflow instead.

---

Outside diff comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 171-194: The TestGrid publish step writes BUNDLE_REF to
GITHUB_OUTPUT without validating the final value, so malformed manual input or
embedded newlines could inject extra outputs. In the workflow logic around the
BUNDLE_REF assignment in the shell block, validate that BUNDLE_REF is non-empty,
matches a digest-pinned OCI ref shape, and contains no newline characters before
echoing it to GITHUB_OUTPUT. Apply this check for both the workflow_dispatch
path and the pointer-derived path, and fail fast with an error if validation
does not pass.

---

Duplicate comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 95-107: In the Validate producer step, the workflow is still
allowing non-main producer runs to publish and is checking the wrong repository
field. Update the guard in the workflow_run path to use
github.event.workflow_run.head_repository.full_name instead of
github.event.workflow_run.repository.full_name, and make the branch check on
github.event.workflow_run.head_branch fail the job for anything other than main
so untrusted fork/branch heads cannot reach the publishing flow.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 6f188cb3-9041-43c5-9752-7fe9041bf5e0

📥 Commits

Reviewing files that changed from the base of the PR and between 01fee9b and 03dac49.

📒 Files selected for processing (2)
  • .github/workflows/testgrid-publish.yml
  • .goreleaser.yaml

Comment thread .github/workflows/testgrid-publish.yml Outdated
Comment thread .goreleaser.yaml Outdated
@srao-nv srao-nv force-pushed the feat/tg5-testgrid-ci branch from 03dac49 to b5dd3dd Compare June 25, 2026 21:56

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (3)
.goreleaser.yaml (1)

201-207: 🔒 Security & Privacy | 🟠 Major | 🏗️ Heavy lift

Release artifact ships without integrity/attestation material.

This testgrid-publish archive is downloaded and executed by .github/workflows/testgrid-publish.yml under WIF-issued GCP credentials, yet—unlike the aicr archive (lines 214-222)—it carries no checksum/attestation bundle. Either publish and verify signed integrity metadata for this artifact, or keep the workflow source-built only.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.goreleaser.yaml around lines 201 - 207, The testgrid-publish release
artifact is being shipped without any integrity or attestation material, unlike
the aicr artifact setup. Update the goreleaser configuration for the
testgrid-publish entry so it publishes signed checksum/attestation metadata, and
make sure the .github/workflows/testgrid-publish.yml flow verifies that metadata
before execution. If signed verification cannot be added, adjust the workflow to
build from source instead of downloading and running the release archive.
.github/workflows/testgrid-publish.yml (2)

164-178: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Don’t silently skip failed UAT runs that should produce columns.

The workflow advertises failure columns, but missing pointer artifacts on failed producer runs become a successful skip. Either make producers always upload a failure evidence pointer, or restrict this workflow to conclusions with a guaranteed pointer contract.

Also applies to: 207-214

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/testgrid-publish.yml around lines 164 - 178, The “Download
evidence pointer” step in the testgrid publish workflow is treating missing
pointer artifacts from failed UAT runs as a benign skip, which can hide failure
columns. Update the workflow logic around the download-pointer step and the
downstream bundle resolution so it only runs when the producer guarantees a
pointer contract, or change the producer side to always upload a failure
evidence pointer for failed runs; use the existing workflow_run gating and the
download-pointer step’s outcome handling to make the failure path explicit
instead of silently succeeding.

87-90: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Reject non-main producer runs before WIF-capable publishing.

Line 121 only warns for non-main producers, but the job still proceeds to download attacker-influenced artifacts and later obtains GCP credentials. Tighten the job condition and make the producer validation fail closed; workflow_run is specifically called out by GitHub as risky when it can grant write/secrets access from triggering-workflow data. (docs.github.com)

🛡️ Proposed fix
     if: >
       (github.event_name == 'workflow_dispatch' && github.ref == 'refs/heads/main') ||
-      github.event.workflow_run.conclusion == 'success' ||
-      github.event.workflow_run.conclusion == 'failure'
+      (
+        github.event_name == 'workflow_run' &&
+        github.event.workflow_run.head_repository.full_name == github.repository &&
+        github.event.workflow_run.head_branch == 'main' &&
+        (
+          github.event.workflow_run.conclusion == 'success' ||
+          github.event.workflow_run.conclusion == 'failure'
+        )
+      )
@@
-          PRODUCER_REPO: ${{ github.event.workflow_run.repository.full_name }}
+          PRODUCER_REPO: ${{ github.event.workflow_run.head_repository.full_name }}
           PRODUCER_BRANCH: ${{ github.event.workflow_run.head_branch }}
@@
           if [ "${PRODUCER_BRANCH}" != "main" ]; then
-            echo "::warning::Producer ran on branch '${PRODUCER_BRANCH}' (not main) — publishing anyway"
+            echo "::error::Producer ran on branch '${PRODUCER_BRANCH}' (not main) — rejecting"
+            exit 1
           fi

Also applies to: 113-123

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/testgrid-publish.yml around lines 87 - 90, The testgrid
publish workflow currently allows workflow_run executions from non-main producer
branches to continue, which can lead to downloading untrusted artifacts before
GCP/WIF access is granted. Tighten the job-level `if` in testgrid-publish.yml so
`workflow_run` only proceeds when the producing workflow is on `main`, and make
the producer validation step in the publish job fail closed instead of only
warning. Use the existing `workflow_run` gating and the validation logic around
the producer branch check to locate the change.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 64-69: The workflow dispatch inputs for source_class and
environment are currently free-form, which can lead to invalid publish target
derivation. Constrain these inputs in the workflow definition used by
testgrid-publish so only the supported values are accepted, and update the
target derivation logic to assume validated values rather than handling
arbitrary strings. Make the same fix in the publish-target mapping section
referenced by the testgrid-publish workflow so the WIF provider, bucket, and
backfill classification are selected only from known source_class/environment
combinations.
- Around line 199-230: The bundle reference is written to GITHUB_OUTPUT without
enforcing the expected ghcr.io/nvidia/...@sha256:<64-hex> contract, so validate
BUNDLE_REF once before the final echo in this workflow step. Add a single
validation point covering both DISPATCH_BUNDLE_REF and the OCI/DIGEST values
derived from pointer.yaml, and reject any ref containing newlines or not
matching the required registry/digest format. Use the existing bundle_ref
assembly block and the pointer.yaml extraction logic to keep the fix localized.

---

Duplicate comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 164-178: The “Download evidence pointer” step in the testgrid
publish workflow is treating missing pointer artifacts from failed UAT runs as a
benign skip, which can hide failure columns. Update the workflow logic around
the download-pointer step and the downstream bundle resolution so it only runs
when the producer guarantees a pointer contract, or change the producer side to
always upload a failure evidence pointer for failed runs; use the existing
workflow_run gating and the download-pointer step’s outcome handling to make the
failure path explicit instead of silently succeeding.
- Around line 87-90: The testgrid publish workflow currently allows workflow_run
executions from non-main producer branches to continue, which can lead to
downloading untrusted artifacts before GCP/WIF access is granted. Tighten the
job-level `if` in testgrid-publish.yml so `workflow_run` only proceeds when the
producing workflow is on `main`, and make the producer validation step in the
publish job fail closed instead of only warning. Use the existing `workflow_run`
gating and the validation logic around the producer branch check to locate the
change.

In @.goreleaser.yaml:
- Around line 201-207: The testgrid-publish release artifact is being shipped
without any integrity or attestation material, unlike the aicr artifact setup.
Update the goreleaser configuration for the testgrid-publish entry so it
publishes signed checksum/attestation metadata, and make sure the
.github/workflows/testgrid-publish.yml flow verifies that metadata before
execution. If signed verification cannot be added, adjust the workflow to build
from source instead of downloading and running the release archive.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: cb309592-598b-4012-9a40-c98c2e2ffeb6

📥 Commits

Reviewing files that changed from the base of the PR and between 03dac49 and b5dd3dd.

📒 Files selected for processing (2)
  • .github/workflows/testgrid-publish.yml
  • .goreleaser.yaml

Comment thread .github/workflows/testgrid-publish.yml
Comment thread .github/workflows/testgrid-publish.yml
Wires tools/testgrid-publish into CI so every completed UAT run
(GCP or AWS) automatically publishes to the AICR TestGrid dashboard.

Trigger: workflow_run on "UAT - GCP" / "UAT - AWS" (success or failure).
Also supports workflow_dispatch (main branch only) for backfills.

Pipeline:
  1. Validate producer run is from this repository (pre-credential guard)
  2. Download evidence pointer artifact; distinguish network errors from
     benign "conformance did not complete" skips
  3. Extract OCI bundle ref + sha256 digest from pointer.yaml; validate
     both fields and enforce ghcr.io/nvidia/...@sha256:<64hex> format
     before writing to GITHUB_OUTPUT (guards against newline injection)
  4. Authenticate to GCP via WIF (publish SA, objectCreator on groups/ only)
  5. Login to GHCR (packages: read) so testgrid-publish can pull the bundle
  6. Build testgrid-publish from source (vendor — no integrity risk from
     downloading a release artifact into a privileged WIF context)
  7. Publish: pull OCI bundle → parse predicate → write GCS in order

Security hardening:
  - Top-level permissions: contents: read (least privilege default)
  - Job-level: contents:read + actions:read + packages:read + id-token:write
  - No project IDs/numbers in source — all from vars.GCP_PROJECT_NUMBER
    and vars.GCP_PROJECT_ID (repo variables, not secrets)
  - All actions pinned to SHA
  - All user-controlled values through env: (no inline ${{ }} in run:)
  - set -euo pipefail in every shell script
  - workflow_dispatch restricted to refs/heads/main
  - workflow_dispatch inputs use type: choice / type: string to prevent
    typos silently deriving non-existent WIF pools
  - BUNDLE_REF validated against ghcr.io/nvidia/...@sha256:<64hex> regex
  - concurrency block (one publish per triggering run, no cancel)
  - timeout-minutes: 15
  - Release branches intentionally allowed (qualify release → publish)

Environment-aware: defaults to prod (gs://aicr-testgrid). Override to
staging via workflow_dispatch(environment=staging).

Pending before this can merge:
  - TG2 (#1447) must be merged first (tools/testgrid-publish must exist)
  - Apply prod Terraform in aicr-testgrid (creates aicr-testgrid-prod-github
    WIF pool and aicr-testgrid-prod-publish SA)
  - Set vars.GCP_PROJECT_NUMBER and vars.GCP_PROJECT_ID in repo settings
@srao-nv srao-nv force-pushed the feat/tg5-testgrid-ci branch from b5dd3dd to 37366ea Compare June 26, 2026 14:58
@srao-nv srao-nv changed the title feat(ci): add testgrid-publish workflow (TG5) [WIP]feat(ci): add testgrid-publish workflow (TG5) Jun 26, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
.github/workflows/testgrid-publish.yml (2)

131-135: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Reject unexpected same-repo branches, not just forks.

Line 133 currently allows every non-main branch and only logs a notice, so any same-repo feature branch running a matching UAT workflow can drive a privileged prod TestGrid publish. If release branches are intended, allowlist them explicitly and fail everything else before WIF auth.

🛡️ Proposed fix
-          # Release branches (e.g. release/v1.x) intentionally publish —
-          # their UAT runs qualify the release and belong in TestGrid.
-          if [ "${PRODUCER_BRANCH}" != "main" ]; then
-            echo "::notice::Producer ran on branch '${PRODUCER_BRANCH}' — publishing (release branches are intentional)"
-          fi
+          case "${PRODUCER_BRANCH}" in
+            main|release/*)
+              echo "::notice::Producer ran on branch '${PRODUCER_BRANCH}' — publishing"
+              ;;
+            *)
+              echo "::error::Producer ran on unexpected branch '${PRODUCER_BRANCH}' — rejecting"
+              exit 1
+              ;;
+          esac
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/testgrid-publish.yml around lines 131 - 135, The branch
gate in the publish workflow is too broad because it treats every non-main
PRODUCER_BRANCH as publishable; tighten the check so only explicitly allowlisted
release branches are accepted and all other same-repo branches fail before WIF
auth. Update the branch validation logic in the testgrid publish workflow around
the PRODUCER_BRANCH notice to reject unexpected branches, keeping the intent for
release branches clear while preventing privileged publishes from feature
branches.

Source: Linters/SAST tools


242-248: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Validate the whole bundle ref before writing GITHUB_OUTPUT.

grep -qE is line-based, so valid-ref\nextra=value can pass because one line matches, then Line 248 writes injected output lines. Use Bash regex on the variable and a stricter path character class.

🛡️ Proposed fix
-          if ! echo "${BUNDLE_REF}" | grep -qE '^ghcr\.io/nvidia/[^@]+@sha256:[0-9a-f]{64}$'; then
+          if [[ ! "${BUNDLE_REF}" =~ ^ghcr\.io/nvidia/[A-Za-z0-9._/-]+@sha256:[0-9a-f]{64}$ ]]; then
             echo "::error::BUNDLE_REF '${BUNDLE_REF}' does not match expected format ghcr.io/nvidia/...@sha256:<64hex>"
             exit 1
           fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/testgrid-publish.yml around lines 242 - 248, The bundle
ref check in the testgrid publish workflow is only validating a single line with
grep, so a multiline value can still inject extra GITHUB_OUTPUT entries. Update
the validation around BUNDLE_REF to use a Bash regex match on the full variable
before the echo to GITHUB_OUTPUT, and tighten the allowed ghcr.io/nvidia path
character class so only a single safe bundle ref can pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 217-226: The evidence-pointer lookup in the TestGrid publish
workflow is too brittle under set -euo pipefail because the find-based POINTER
assignment can terminate the script before the missing-artifact skip logic runs.
Update the pointer discovery in the workflow step to tolerate a absent
evidence-pointer/ directory while preserving the existing branch that checks
POINTER, DOWNLOAD_OUTCOME, and UAT_CONCLUSION so the skip=true path remains
reachable.

---

Duplicate comments:
In @.github/workflows/testgrid-publish.yml:
- Around line 131-135: The branch gate in the publish workflow is too broad
because it treats every non-main PRODUCER_BRANCH as publishable; tighten the
check so only explicitly allowlisted release branches are accepted and all other
same-repo branches fail before WIF auth. Update the branch validation logic in
the testgrid publish workflow around the PRODUCER_BRANCH notice to reject
unexpected branches, keeping the intent for release branches clear while
preventing privileged publishes from feature branches.
- Around line 242-248: The bundle ref check in the testgrid publish workflow is
only validating a single line with grep, so a multiline value can still inject
extra GITHUB_OUTPUT entries. Update the validation around BUNDLE_REF to use a
Bash regex match on the full variable before the echo to GITHUB_OUTPUT, and
tighten the allowed ghcr.io/nvidia path character class so only a single safe
bundle ref can pass.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 175178f3-cd87-4a45-ac0f-d4db1f89650e

📥 Commits

Reviewing files that changed from the base of the PR and between b5dd3dd and 37366ea.

📒 Files selected for processing (1)
  • .github/workflows/testgrid-publish.yml

Comment on lines +217 to +226
POINTER="$(find evidence-pointer/ -name 'pointer.yaml' 2>/dev/null | head -1)"
DOWNLOAD_OUTCOME="${DOWNLOAD_POINTER_OUTCOME}"
if [ -z "${POINTER}" ]; then
if [ "${DOWNLOAD_OUTCOME}" = "failure" ] && [ "${UAT_CONCLUSION}" = "success" ]; then
echo "::error::UAT succeeded but evidence pointer download failed — possible network error"
exit 1
fi
echo "::notice::No evidence pointer found (conformance did not complete) — skipping TestGrid publish"
echo "skip=true" >> "${GITHUB_OUTPUT}"
exit 0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Keep the missing-artifact skip path reachable.

With set -euo pipefail, POINTER="$(find evidence-pointer/ ... | head -1)" can exit the script when evidence-pointer/ does not exist, before Line 219 handles the missing pointer. Make the lookup tolerate the absent directory.

🐛 Proposed fix
-            POINTER="$(find evidence-pointer/ -name 'pointer.yaml' 2>/dev/null | head -1)"
+            POINTER="$(find evidence-pointer/ -name 'pointer.yaml' -print -quit 2>/dev/null || true)"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
POINTER="$(find evidence-pointer/ -name 'pointer.yaml' 2>/dev/null | head -1)"
DOWNLOAD_OUTCOME="${DOWNLOAD_POINTER_OUTCOME}"
if [ -z "${POINTER}" ]; then
if [ "${DOWNLOAD_OUTCOME}" = "failure" ] && [ "${UAT_CONCLUSION}" = "success" ]; then
echo "::error::UAT succeeded but evidence pointer download failed — possible network error"
exit 1
fi
echo "::notice::No evidence pointer found (conformance did not complete) — skipping TestGrid publish"
echo "skip=true" >> "${GITHUB_OUTPUT}"
exit 0
POINTER="$(find evidence-pointer/ -name 'pointer.yaml' -print -quit 2>/dev/null || true)"
DOWNLOAD_OUTCOME="${DOWNLOAD_POINTER_OUTCOME}"
if [ -z "${POINTER}" ]; then
if [ "${DOWNLOAD_OUTCOME}" = "failure" ] && [ "${UAT_CONCLUSION}" = "success" ]; then
echo "::error::UAT succeeded but evidence pointer download failed — possible network error"
exit 1
fi
echo "::notice::No evidence pointer found (conformance did not complete) — skipping TestGrid publish"
echo "skip=true" >> "${GITHUB_OUTPUT}"
exit 0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/testgrid-publish.yml around lines 217 - 226, The
evidence-pointer lookup in the TestGrid publish workflow is too brittle under
set -euo pipefail because the find-based POINTER assignment can terminate the
script before the missing-artifact skip logic runs. Update the pointer discovery
in the workflow step to tolerate a absent evidence-pointer/ directory while
preserving the existing branch that checks POINTER, DOWNLOAD_OUTCOME, and
UAT_CONCLUSION so the skip=true path remains reachable.

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, well-structured CI workflow — injection hygiene is solid (all user-controlled values flow through env, anchored format regex, digest-pinned bundle refs), actions are SHA-pinned, permissions are least-privilege at both workflow and job level, and cancel-in-progress: false correctly protects in-flight GCS writes. Verified against the repo: UAT workflow names, the uat-*-evidence-pointer-<run_id> artifact naming, the attestations[0].bundle.oci/.digest pointer schema, and the referenced composite actions all line up.

No blocking issues. Three non-critical notes inline: a community source_class option that the nvidia-only bundle regex makes unusable (medium), a producer-guard comment that overstates what the repo-name check actually does (low — moot since UAT can't be fork-triggered), and unguarded vars.GCP_PROJECT_* lookups (low).

WIP as titled — go build ./tools/testgrid-publish will fail until TG2 (#1447) merges, and the workflow is inert until the prod WIF pool/SA exist; both are already documented in the description. Nothing here blocks merge once those land.

fi
# Validate format: ghcr.io/nvidia/...@sha256:<64 hex chars>
# Also guards against newline injection into GITHUB_OUTPUT.
if ! echo "${BUNDLE_REF}" | grep -qE '^ghcr\.io/nvidia/[^@]+@sha256:[0-9a-f]{64}$'; then

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source_class: community (line 71) is offered as a dispatch option, but this regex hard-codes ^ghcr\.io/nvidia/. Committed community evidence bundles live under contributor namespaces — e.g. recipes/evidence/gb200-eks-ubuntu-training.yaml points at ghcr.io/yuanchen8911/aicr-evidence:.... So a dispatch with source_class: community and a real community bundle_ref always fails this check. Either the community option is dead, or the regex needs to permit non-nvidia namespaces (while still requiring @sha256:<64hex>).

Comment on lines +116 to +128
# Guard: reject workflow_run events from forks or unexpected branches
# before any credential exchange. WIF is gated by the attribute_condition
# in wif-publish.tf, but an explicit check here fails fast and produces
# a clear error message rather than a cryptic WIF rejection.
- name: Validate producer
if: github.event_name == 'workflow_run'
env:
PRODUCER_REPO: ${{ github.event.workflow_run.repository.full_name }}
PRODUCER_BRANCH: ${{ github.event.workflow_run.head_branch }}
run: |
set -euo pipefail
if [ "${PRODUCER_REPO}" != "${GITHUB_REPOSITORY}" ]; then
echo "::error::Producer workflow is from '${PRODUCER_REPO}', expected '${GITHUB_REPOSITORY}' — rejecting"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says this "rejects workflow_run events from forks," but github.event.workflow_run.repository.full_name is always the repo that hosts the workflow (the base repo) — never the fork. For a fork-originated run the fork identity is in workflow_run.head_repository.full_name, so this != GITHUB_REPOSITORY check can never fire on a fork.

In practice this is moot: UAT only triggers on workflow_dispatch + schedule, never fork PRs, so no fork can ever produce a UAT - GCP/AWS run. But the comment overstates the guard. Either soften the comment to match what the check actually does (base-repo sanity check) or, if fork defense-in-depth is the intent, check head_repository.full_name.

Comment on lines +144 to +156
GCP_PROJECT_NUMBER: ${{ vars.GCP_PROJECT_NUMBER }}
GCP_PROJECT_ID: ${{ vars.GCP_PROJECT_ID }}
run: |
set -euo pipefail
if [ "${TG_ENV}" = "prod" ]; then
CLUSTER="aicr-testgrid"
else
CLUSTER="aicr-testgrid-${TG_ENV}"
fi
echo "cluster=${CLUSTER}" >> "${GITHUB_OUTPUT}"
echo "bucket=${CLUSTER}" >> "${GITHUB_OUTPUT}"
echo "wif_provider=projects/${GCP_PROJECT_NUMBER}/locations/global/workloadIdentityPools/${CLUSTER}-github/providers/github-actions" >> "${GITHUB_OUTPUT}"
echo "wif_sa=${CLUSTER}-publish@${GCP_PROJECT_ID}.iam.gserviceaccount.com" >> "${GITHUB_OUTPUT}"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vars.GCP_PROJECT_NUMBER / GCP_PROJECT_ID are consumed unguarded. If either repo var is unset, wif_provider silently becomes projects//locations/global/... and wif_sa becomes ...-publish@.iam.gserviceaccount.com, so the failure surfaces later as a cryptic WIF rejection. A quick [ -n "${GCP_PROJECT_NUMBER}" ]/[ -n "${GCP_PROJECT_ID}" ] guard with an ::error:: would fail fast here — same "fail fast with a clear message" rationale you gave for the producer guard.

@njhensley njhensley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔍 Reconciled multi-persona review

Method: 4 independent persona reviewers (Security/CI-supply-chain · Operability/CI-DX · Correctness · Domain & Architecture) → adversarial senior meta-reviewer re-derived every finding from the resolved code, then reconciled against all prior reviews on this PR (CodeRabbit ×5 runs, @mchmarny's human review, pre-merge bot + zizmor). Findings are deduplicated and re-tiered to the cross-source consensus; line links pin to head 37366ea.

Legend: 🔴 Blocker · 🟠 Major · 🟡 Minor · 🔵 Nitpick

Overall

This is a carefully built, security-conscious workflow — all three review sources agree the logic is safe (injection hygiene via env-routing, SHA-pinned actions, least-privilege per-job permissions, persist-credentials: false, cancel-in-progress: false, build-from-vendor) and that the live risks are operability + the prod-Terraform sequencing, not the workflow logic. @mchmarny found no blocking issues; this reconciliation concurs — no hard blocker remains now that TG2 (#1447) has merged.

Recommendation: Approve-with-comments, gated on two operational items before prod traffic:

  1. Reconcile the prod WIF pool/SA naming against the actual wif-publish.tf (the inline 🟠 at the Resolve-GCP-config step) — with TG2 merged this is the most likely rollout-breaker.
  2. Merge after the prod Terraform apply, or add a feature-flag guard — otherwise every passing UAT goes red at WIF auth until the pool/SA exist.

The other two 🟠 (line-based grep + its inaccurate comment; unguarded vars.GCP_PROJECT_*) are quick fixes worth landing in this PR.

TG2 status (confirmed on main)

tools/testgrid-publish is now merged, and its flags match the workflow exactly — --bundle / --bucket / --source-class (main.go:71-74), --source-class defaulting to uat and accepting community. So the earlier flag-drift concern is resolved; the build step will now succeed.

Reconciliation with existing reviews

Already addressed by the author on earlier commits (CodeRabbit threads marked ✅): producer workflow names "UAT - GCP"/"UAT - AWS", source_class wired through env, bundle_ref moved to env, actions: read added, prod bucket un-suffixed.

Moot for the current PR: CodeRabbit's .goreleaser.yaml release-artifact-integrity / unsigned-tarball findings — the PR no longer touches .goreleaser.yaml and the workflow builds from source only. zizmor concurrency-limits is resolved (concurrency block present); zizmor's workflow_run dangerous-triggers is informational and mitigated by the producer guard.

Process note — linked-issue / scope mismatch: the PR says Fixes #1268, but the pre-merge bot flags #1268 as TG3 (processing workers + config-gen), not TG5 (publishing) → "Out of Scope" + "Linked Issues" warnings. Worth correcting the linked issue or the Fixes target.

Correction to my own panel: my Security/meta pass initially called the grep regex a newline-injection guard — CodeRabbit is right that it is not (see the inline 🟠). Reconciliation defers to CodeRabbit there.

Confirmed non-issues (examined, agreed by panel + @mchmarny)

Script-injection hygiene (all event/user values env-routed, never inline ${{ }} in run:), SHA-pinned actions, least-privilege per-job permissions, persist-credentials: false, cancel-in-progress: false on the upload job, build-from-vendor in the privileged context, and the producer-repo guard running before any credential exchange (defense-in-depth — moot since UAT can't be fork-triggered). Digest construction, yq null/empty handling, the skip output coverage, and the top-level if: matrix were all traced correct.

Tier tally (this review)

🔴 Blocker 🟠 Major 🟡 Minor 🔵 Nitpick
0 5 7 2

🤖 Posted via the reviewing-prs-multipersona skill (multi-persona + adversarial meta-review, reconciled across CodeRabbit and human reviews).

GCP_PROJECT_ID: ${{ vars.GCP_PROJECT_ID }}
run: |
set -euo pipefail
if [ "${TG_ENV}" = "prod" ]; then

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Major — prod WIF pool/SA naming contradicts the PR's stated Terraform

The prod branch emits unsuffixed names: WIF pool aicr-testgrid-github, SA aicr-testgrid-publish, bucket aicr-testgrid. The PR body states prod Terraform creates aicr-testgrid-prod-github and aicr-testgrid-prod-publish (only the bucket matches). CodeRabbit flagged the inverse on an earlier commit; the bucket got reconciled but the pool/SA names still disagree.

Blast radius: Once prod Terraform is applied, a name mismatch = silent WIF auth failure on every publish → the dashboard never updates. With TG2 now merged, this is the single most likely rollout-breaker.

Fix: Reconcile against the actual wif-publish.tf in the aicr-testgrid repo before/with the apply. Drop the -prod suffix in Terraform, or set CLUSTER="aicr-testgrid-prod" here. (Source: persona panel + CodeRabbit history.)

fi
# Validate format: ghcr.io/nvidia/...@sha256:<64 hex chars>
# Also guards against newline injection into GITHUB_OUTPUT.
if ! echo "${BUNDLE_REF}" | grep -qE '^ghcr\.io/nvidia/[^@]+@sha256:[0-9a-f]{64}$'; then

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Major — grep -qE is line-based — does NOT prevent GITHUB_OUTPUT newline injection; the comment is wrong

echo "${BUNDLE_REF}" | grep -qE '^...$' matches per line, so a value like valid-ref\nfoo=bar passes (line 1 matches) and the subsequent echo "bundle_ref=${BUNDLE_REF}" >> GITHUB_OUTPUT writes both lines, injecting foo=bar as a second output. The inline comment claiming this 'guards against newline injection into GITHUB_OUTPUT' is therefore inaccurate.

Blast radius: Bounded — exploitable via the workflow_dispatch bundle_ref input by a main-branch actor (or a newline reaching the trusted pointer). Low likelihood, but the guard does not do what its comment says.

Fix: Use a whole-string Bash match instead of line-based grep: if [[ ! "${BUNDLE_REF}" =~ ^ghcr\.io/nvidia/[A-Za-z0-9._/-]+@sha256:[0-9a-f]{64}$ ]]; then .... Then fix the comment. (Source: CodeRabbit — corrects an earlier claim in my own panel.)

id: gcp
env:
TG_ENV: ${{ inputs.environment || 'prod' }}
GCP_PROJECT_NUMBER: ${{ vars.GCP_PROJECT_NUMBER }}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Major — Unguarded vars.GCP_PROJECT_NUMBER / GCP_PROJECT_ID → malformed WIF, cryptic failure

No non-empty guard (set -u does not catch empty-but-set env vars). If either repo var is unset, the step still succeeds and emits projects//locations/.../aicr-testgrid-github/... and ...-publish@.iam.gserviceaccount.com, which then fail opaquely inside google-github-actions/auth three steps later.

Blast radius: Operator sees a cryptic WIF rejection instead of 'GCP_PROJECT_NUMBER is unset' — the same legibility the Validate-producer step provides, missing for config vars.

Fix: Fail closed early: : "${GCP_PROJECT_NUMBER:?repo var GCP_PROJECT_NUMBER is unset}" (and GCP_PROJECT_ID) with an ::error::. (Source: persona panel + mchmarny + CodeRabbit.)

# a failure run is a valid TestGrid column showing what broke.
# For workflow_dispatch: restrict to main to prevent feature-branch
# code from obtaining WIF credentials (see comment on trigger above).
if: >

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Major — Job if: silently drops cancelled / timed_out / skipped UAT conclusions

The gate admits only workflow_dispatch-on-main or workflow_run conclusion success/failure. A cancelled, timed_out, action_required, or skipped UAT (e.g. an operator-aborted or 180-min-capped run) makes the whole job skip with no step output.

Blast radius: Observability gap that contradicts the header comment ('after every completed cluster run'): no TestGrid column AND no log explaining the no-op. On-call investigating 'why no column for run X' finds a job that simply didn't run.

Fix: Either broaden the gate and let Resolve-bundle-ref decide (it already handles no-pointer → skip), or add a ::notice:: documenting that non-terminal conclusions are intentionally not published. (Source: persona panel — net-new, not raised by prior reviewers.)


# Authenticate to GCP via WIF using the testgrid publish SA.
# This SA has objectCreator on gs://aicr-testgrid[-<env>]/groups/ only.
- name: Authenticate to GCP

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Major — Not 'inert until Terraform': successful UATs go red at WIF auth until prod Terraform applies

TG2 is now merged so go build succeeds, but on a passing UAT the job still reaches Authenticate to GCP and fails there until the prod WIF pool/SA exist. The skip/no-pointer path is genuinely green; the success path is not.

Blast radius: Every passing UAT produces a red TestGrid Publish run (after exercising prod auth) between merge and the Terraform apply — alert noise that erodes CI signal.

Fix: Merge after the prod Terraform apply, OR gate the job behind a vars.TESTGRID_PUBLISH_ENABLED == 'true' feature flag (also makes 'inert until Terraform' literally true and gives on-call one switch). (Source: persona panel + mchmarny; re-pointed from the TG2 gate now that TG2 merged.)

- name: Resolve GCP config
id: gcp
env:
TG_ENV: ${{ inputs.environment || 'prod' }}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor — environment input selects bucket/WIF but is never passed to the tool

environment (prod/staging) flows only into bucket/WIF resource names; the binary call passes --bucket and --source-class but no --environment. Architecturally fine iff the bucket fully encodes env (it does: bucket == cluster == aicr-testgrid[-]).

Blast radius: None today; a coupling worth documenting so a future change doesn't assume the tool sees the env dimension.

Fix: Add a one-line comment that env is encoded in the bucket. (Source: persona panel.)

# workflow code could otherwise obtain GCS write credentials.
workflow_dispatch:
inputs:
bundle_ref:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor — workflow_dispatch bundle_ref provenance unverified beyond prefix+digest

The regex constrains namespace+digest shape but not provenance. A main-branch operator (or anyone with push to ghcr.io/nvidia/*) could feed an arbitrary conforming bundle to the publish binary, which pulls and parses it under WIF creds.

Blast radius: Bounded by the SA's groups/-prefix-only objectCreator scope and the main-only dispatch gate. Real but well-contained.

Fix: Optionally verify the bundle's attestation/signature before pull, or document workflow_dispatch as trusted-operator-only. (Source: persona panel.)

POINTER="$(find evidence-pointer/ -name 'pointer.yaml' 2>/dev/null | head -1)"
DOWNLOAD_OUTCOME="${DOWNLOAD_POINTER_OUTCOME}"
if [ -z "${POINTER}" ]; then
if [ "${DOWNLOAD_OUTCOME}" = "failure" ] && [ "${UAT_CONCLUSION}" = "success" ]; then

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor — Narrow false hard-error on a manual skip_tests dispatch

Producer uploads the pointer only if conformance succeeded. The DOWNLOAD_OUTCOME=='failure' && UAT_CONCLUSION=='success' branch is meant to catch a network error, but a deliberate skip_tests UAT concludes success with no pointer → this branch fires exit 1. (Downgraded from Major by meta-review: a normal conformance failure forces conclusion=failure, so this path is reachable only via the manual skip_tests dispatch, which has no evidence to publish anyway.)

Blast radius: A skip_tests UAT shows a confusing red publish run; the not-found-vs-network distinction remains imperfect (both surface as outcome=failure).

Fix: Accept the narrow edge, or have the producer always upload a sentinel pointer on success so absence unambiguously means failure. (Source: persona panel.)

# If the download step failed, distinguish "artifact not found"
# (conformance did not complete — benign skip) from a real error
# (network failure — should not silently drop the publish).
POINTER="$(find evidence-pointer/ -name 'pointer.yaml' 2>/dev/null | head -1)"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Nitpick — find evidence-pointer/ ... | head -1 brittle under set -euo pipefail

If the download dir is ever absent, find <missing-dir> exits 1 and pipefail+set -e kills the step before the -z disambiguation; head -1 closing the pipe can also SIGPIPE find on many matches. In practice download-artifact@v8 pre-creates the path: dir and merge-multiple yields one file, so it's benign today.

Blast radius: Latent; would surface only on an unexpected layout/failure mode, producing an opaque exit-1 instead of the intended notice/error.

Fix: Cheap insurance: POINTER="$(find evidence-pointer/ -name pointer.yaml 2>/dev/null | head -1 || true)" (or guard [ -d evidence-pointer/ ] first). (Source: persona panel + CodeRabbit.)

echo "skip=true" >> "${GITHUB_OUTPUT}"
exit 0
fi
OCI=$(yq '.attestations[0].bundle.oci' "${POINTER}")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Nitpick — yq .attestations[0] hardcodes index 0

Correct for schema V1 (BuildPointer always emits exactly one attestation), but couples to that invariant with no guard.

Blast radius: None today; a future V2 multi-attestation pointer could silently pick the wrong entry.

Fix: Optional fail-loud guard: assert yq '.attestations | length' == 1, or select by .bundle.predicateType. (Source: persona panel.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/ci size/L theme/ci-dx CI pipelines, developer experience, and build tooling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TG3 — Processing (upstream workers + AICR-native config-gen + metadata-key contract)

3 participants