From 080abd8a133e8a9e4ca1acba46461f1aed534255 Mon Sep 17 00:00:00 2001 From: Ciaran Roche Date: Mon, 22 Jun 2026 12:58:43 +0100 Subject: [PATCH] HYPERFLEET-1268 - docs: add Konflux operations and release runbook Adds six engineer-facing operations docs under hyperfleet/docs/release/operations/ to complement the existing design and process docs: - configuration-map.md: every config file across konflux-release-data, the component repos, hyperfleet-release, and openshift/release - pipeline-anatomy.md: build vs release run, DAG, version flow, UI reading guide, skopeo confirmation - debugging.md: failure-mode runbook organised by symptom - notifications.md: #hyperfleet-e2e-status, webhook rotation, Pyxis, GitHub PR checks, Prow signals - release-runbook.md: copy-paste commands for RC -> GA, fix cycle, and hotfix - support.md: Slack channels (#forum-conforma, not the renamed #forum-konflux-contract), JIRA queues, escalation contacts README index updated with an Operations section linking the new docs. Per-repo RELEASING.md stubs (hyperfleet-api, -sentinel, -adapter, -release) will follow in separate PRs. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- hyperfleet/docs/release/README.md | 15 +- .../release/operations/configuration-map.md | 133 +++++++++ .../docs/release/operations/debugging.md | 158 ++++++++++ .../docs/release/operations/notifications.md | 132 +++++++++ .../release/operations/pipeline-anatomy.md | 165 +++++++++++ .../release/operations/release-runbook.md | 275 ++++++++++++++++++ hyperfleet/docs/release/operations/support.md | 88 ++++++ 7 files changed, 965 insertions(+), 1 deletion(-) create mode 100644 hyperfleet/docs/release/operations/configuration-map.md create mode 100644 hyperfleet/docs/release/operations/debugging.md create mode 100644 hyperfleet/docs/release/operations/notifications.md create mode 100644 hyperfleet/docs/release/operations/pipeline-anatomy.md create mode 100644 hyperfleet/docs/release/operations/release-runbook.md create mode 100644 hyperfleet/docs/release/operations/support.md diff --git a/hyperfleet/docs/release/README.md b/hyperfleet/docs/release/README.md index 0868ff0b..40d05e22 100644 --- a/hyperfleet/docs/release/README.md +++ b/hyperfleet/docs/release/README.md @@ -1,7 +1,7 @@ --- Status: Active Owner: HyperFleet Team -Last Updated: 2026-05-11 +Last Updated: 2026-06-22 --- # Release Documentation @@ -17,6 +17,19 @@ Last Updated: 2026-05-11 | [ADR 0014](../../adrs/0014-konflux-build-and-release.md) | Decision record for adopting Konflux | | [ADR 0016](../../adrs/0016-helm-oci-distribution.md) | Decision record for Helm OCI distribution | +## Operations + +The `operations/` subdirectory contains engineer-facing operational docs: what to read *during* a release or *when something fails*. 
+ +| Document | Purpose | +|----------|---------| +| [Release Runbook](./operations/release-runbook.md) | Copy-paste command sequence for RC → GA, fix cycle, and hotfix | +| [Pipeline Anatomy](./operations/pipeline-anatomy.md) | Reading a Konflux PipelineRun, the build-vs-release distinction, where to look in the UI | +| [Debugging](./operations/debugging.md) | Failure-mode runbook organized by symptom | +| [Configuration Map](./operations/configuration-map.md) | Every release-related config file across the six repos — what it does, who reviews it | +| [Notifications](./operations/notifications.md) | Slack `#hyperfleet-e2e-status`, Pyxis, GitHub PR checks, and Prow status signals | +| [Support](./operations/support.md) | Slack channels, JIRA queues, Konflux UI, escalation contacts | + ## Prow Test and Release The `test-release/` subdirectory contains Prow-specific docs for CI job setup and E2E testing infrastructure. diff --git a/hyperfleet/docs/release/operations/configuration-map.md b/hyperfleet/docs/release/operations/configuration-map.md new file mode 100644 index 00000000..73d70b8b --- /dev/null +++ b/hyperfleet/docs/release/operations/configuration-map.md @@ -0,0 +1,133 @@ +--- +Status: Active +Owner: HyperFleet Team +Last Updated: 2026-06-22 +--- + +# Configuration Map + +> **Audience:** HyperFleet engineers who need to find, read, or change a release-related config file. Tells you which repo holds what, what it does, and who reviews changes. + +The HyperFleet Konflux setup is split across six repos. This page is the index. For the WHY behind the design, see [Konflux Release Pipeline Design](../konflux-release-pipeline-design.md) and [ADR 0014](../../../adrs/0014-konflux-build-and-release.md). 
+ +--- + +## At a glance + +```mermaid +flowchart LR + subgraph github["GitHub: openshift-hyperfleet"] + api["hyperfleet-api"] + sen["hyperfleet-sentinel"] + adp["hyperfleet-adapter"] + rel["hyperfleet-release"] + end + subgraph gitlab["GitLab: releng"] + krd["konflux-release-data"] + end + subgraph os["GitHub: openshift"] + ocr["openshift/release"] + end + api -- ".tekton/*" --> kf["Konflux
kflux-prd-rh02"] + sen -- ".tekton/*" --> kf + adp -- ".tekton/*" --> kf + krd -- RPA, EC policy, tenant --> kf + kf --> quay["Quay.io"] + quay --> ocr + rel -- RC E2E trigger --> ocr +``` + +--- + +## `konflux-release-data` (GitLab) + +URL: . This is the GitOps source of truth for everything the Konflux platform applies to our tenant. Changes go via MR; ArgoCD syncs them onto `kflux-prd-rh02`. CI runs `tox` — see the repo's `CLAUDE.md` for the lint/test matrix. + +| File | Purpose | +|------|---------| +| `config/kflux-prd-rh02.0fk9.p1/service/ReleasePlanAdmission/hyperfleet/hyperfleet.yaml` | RPA for the three component images. Maps Snapshots to Quay paths, applies tags, references the Slack webhook secret. | +| `config/kflux-prd-rh02.0fk9.p1/service/ReleasePlanAdmission/hyperfleet/hyperfleet-charts.yaml` | RPA for Helm chart OCI releases. Uses `push-to-external-registry`; targets `…/hyperfleet-api-chart`. | +| `constraints/service/hyperfleet.yaml` | JSON-schema constraint that validates our RPAs (origin, policy, registry URL prefix, pipeline source, service account). | +| `config/kflux-prd-rh02.0fk9.p1/service/EnterpriseContractPolicy/registry-hyperfleet-chart-prod.yaml` | EC policy for chart releases. Derived from `app-interface-standard`; excludes container-only checks. | +| `tenants-config/cluster/kflux-prd-rh02/tenants/hyperfleet-tenant/` | Tenant namespace, RBAC, Application (`hyperfleet`), three Components, ReleasePlan. Source files only — never edit `auto-generated/`. | +| `CODEOWNERS` | Approval routing. HyperFleet paths require team approval. | + +The container RPA uses policy `app-interface-standard`; the chart RPA uses `registry-hyperfleet-chart-prod`. Both auto-release (`block-releases: false`). Service account for releases is `release-app-interface-prod`. + +--- + +## Component repos: `hyperfleet-api`, `hyperfleet-sentinel`, `hyperfleet-adapter` + +Each repo has the same shape. Replace `` with `api`, `sentinel`, or `adapter`. 
+ +| File | Triggers on | Notes | +|------|-------------|-------| +| `.tekton/hyperfleet--push.yaml` | Merge to `main` | Builds with `APP_VERSION=0.0.0-dev` default. Powers nightly. | +| `.tekton/hyperfleet--tag.yaml` | Push of a semver tag (`vX.Y.Z` or `vX.Y.Z-rcN`) | CEL match in PaC annotation — see [Pipeline Anatomy](./pipeline-anatomy.md#what-triggers-what) for the exact pattern. `extract-version` task strips `refs/tags/v` → injects `APP_VERSION`. | +| `.tekton/hyperfleet--chart-push.yaml` | Merge to `main` (chart path) | Builds and releases the component's Helm chart alongside the image, via the chart RPA. | +| `Dockerfile` | — | Contract: `ARG APP_VERSION="0.0.0-dev"` and `LABEL version="${APP_VERSION}"`. The label is what the RPA's `{{ labels.version }}` template reads. | + +If you change the CEL regex or the Dockerfile `APP_VERSION` contract, you change the release flow. See [Pipeline Anatomy](./pipeline-anatomy.md) for the version chain. + +--- + +## Helm chart artifacts + +Each component repo ships its own Helm chart through its `.tekton/hyperfleet--chart-push.yaml` pipeline, and the chart artifact is released alongside the image via the `hyperfleet-charts.yaml` RPA. + +For the chart distribution design see [Helm OCI Distribution Design](../helm-oci-distribution-design.md) and [ADR 0016](../../../adrs/0016-helm-oci-distribution.md). + +--- + +## `hyperfleet-release` + +The release coordination repo. Holds the manifest that pins which component versions form a release candidate or GA. + +| File | Purpose | +|------|---------| +| `RELEASE_MANIFEST.yaml` | Pinned component versions for the current release. Schema: `release`, `e2e_ref`, `components.{hyperfleet-api,hyperfleet-sentinel,hyperfleet-adapter}`. | +| `scripts/trigger-rc-e2e.sh` | Reads the manifest, verifies each image exists in Quay, calls Gangway to start the Prow RC E2E job. | +| `scripts/README.md` | Prerequisites and dry-run usage for the trigger script. 
| +| `RELEASE` | Top-level release status / notes (per-release). | + +The manifest is the source of truth for *which combination of images is under test* — Konflux Snapshots are the build source of truth. + +--- + +## `openshift/release` (Prow) + +| Path | Purpose | +|------|---------| +| `ci-operator/config/openshift-hyperfleet/` | ci-operator configs per repo (PR presubmits, nightly E2E, RC E2E). | +| `ci-operator/jobs/openshift-hyperfleet/` | Generated Prow job YAML (regenerated from configs). | +| `ci-operator/step-registry/hyperfleet/` | Reusable Hyperfleet steps (commitlint, risk-scorer). | +| `ci-operator/step-registry/openshift-hyperfleet/chart-deployment/` | Chart deployment step for E2E. | + +For how to add a new E2E job or modify the Gangway trigger flow, see [Add Hyperfleet E2E CI Job in Prow](../test-release/add-hyperfleet-e2e-ci-job-in-prow.md) and [Trigger HyperFleet E2E Jobs via Gangway API](../test-release/trigger-e2e-jobs-via-gangway.md). + +--- + +## Konflux cluster + +Things that exist on the cluster, not in a repo. The repos above declare them via GitOps. + +| Resource | Location | +|----------|----------| +| Konflux UI | | +| Tenant namespace | `hyperfleet-tenant` on `kflux-prd-rh02` | +| Application | `hyperfleet` (single app, three Components) | +| Slack webhook secret | `hyperfleet-slack-webhook-notification-secret` (key `webhook-url`) in `rhtap-releng-tenant`. Created and rotated by RelEng. | +| Pyxis secret | Shared RelEng secret in `rhtap-releng-tenant`. 
| + +--- + +## Who reviews what + +| Area | Reviewer source | +|------|-----------------| +| `konflux-release-data` HyperFleet paths | `CODEOWNERS` in that repo — HyperFleet team | +| Component repo `.tekton/` and `Dockerfile` | Repo `CODEOWNERS` / `OWNERS` | +| `openshift/release` HyperFleet configs | `OWNERS` under `ci-operator/config/openshift-hyperfleet/` | +| Cluster-side resources (secrets in `rhtap-releng-tenant`) | RelEng — coordinate via `#forum-konflux-release` | + +For escalation contacts and Slack channels see [Support](./support.md). diff --git a/hyperfleet/docs/release/operations/debugging.md b/hyperfleet/docs/release/operations/debugging.md new file mode 100644 index 00000000..38a9a401 --- /dev/null +++ b/hyperfleet/docs/release/operations/debugging.md @@ -0,0 +1,158 @@ +--- +Status: Active +Owner: HyperFleet Team +Last Updated: 2026-06-22 +--- + +# Debugging: When the Release Pipeline Breaks + +> **Audience:** HyperFleet engineers who pushed a tag, merged a PR, or triggered an RC E2E and something didn't happen. Organized by symptom. + +Each entry: what you see → where to look → what to check → who to ping. For the happy-path mental model, read [Pipeline Anatomy](./pipeline-anatomy.md) first. For where each config lives, see [Configuration Map](./configuration-map.md). + +--- + +## I pushed a tag and no PipelineRun started + +**Where to look** + +- Konflux UI → Applications → `hyperfleet` → Pipeline runs (filter by component). +- The component repo's GitHub: **Actions** tab is NOT where PaC reports; check the commit status (green check / yellow dot near the commit). + +**What to check** + +1. **The tag actually pushed.** `git push origin v0.3.0-rc1` is one tag; `git push --tags` pushes everything. Verify with `git ls-remote --tags origin | grep v0.3.0-rc1`. +2. **The CEL regex matches your tag.** Open `.tekton/hyperfleet--tag.yaml` and look for the `on-cel-expression` annotation. It must match `^refs/tags/v[0-9]+\.[0-9]+\.[0-9]+(-rc[0-9]+)?$`. 
A tag like `v0.3` or `release-0.3-rc1` won't match. +3. **The PaC GitHub App is installed on the org.** Visit and confirm "Red Hat Konflux" has access to the component repo. +4. **The Component is wired in `konflux-release-data`.** `tenants-config/cluster/kflux-prd-rh02/tenants/hyperfleet-tenant/applications/hyperfleet/components/.yaml` should have `build.appstudio.openshift.io/request: configure-pac`. + +**Who to ping:** PaC install issues — `#forum-konflux-release` on Red Hat Slack. + +--- + +## A build PipelineRun failed + +**Where to look** + +- The failing task in the Konflux UI → **Logs** tab. + +**What to check** + +1. **Is it advisory or blocking?** See the task table in [Pipeline Anatomy](./pipeline-anatomy.md#the-build-dag). Under `app-interface-standard`, scan tasks (Snyk, Clair, ClamAV, shell/unicode SAST) are advisory — they surface findings but don't block the release. Even so, a red badge on `sast-snyk-check` can still leave the build PipelineRun marked failed in the UI while the EC verdict treats the finding as advisory, so read the actual error before deciding whether it blocks. +2. **`build-container` failed → Dockerfile issue.** To reproduce locally, run `buildah build --build-arg APP_VERSION=0.3.0-rc1 .` from a clean checkout of the tag. +3. **`extract-version` failed.** The task expects `target_branch` to start with `refs/tags/v`. A non-tag trigger or a malformed tag breaks it. +4. **`prefetch-dependencies` failed.** Cachi2 — non-hermetic today, so most failures are transient. Re-run the PipelineRun from the UI. +5. **Snyk SARIF inspection.** Pull the SARIF from the task results; the UI columns mislead. See the spike note `Verifying Snyk SAST Outputs` for the skopeo-based extraction. + +**Who to ping:** Konflux build infra — `#konflux-users`. EC policy questions — `#forum-conforma`. + +--- + +## Build green but image not in Quay + +The classic. The build PipelineRun and the release PipelineRun are separate. Build green ≠ image pushed. 
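Because the push happens in that second run, a registry check made right after the build can legitimately come back empty for a short while. A minimal sketch of a retry wrapper for such checks; the `retry` helper and its arguments are illustrative and not part of any HyperFleet tooling:

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: reruns COMMAND until it exits 0 or ATTEMPTS is used up.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Stop as soon as the wrapped command succeeds.
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

Wrapping a `skopeo list-tags … | grep` check as `retry 6 30 <check command>` would poll for a few minutes (the attempt count and delay here are arbitrary). If the tag still has not appeared after that, the Releases tab is the place to look rather than the registry.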
+ +**Where to look** + +- Konflux UI → **Releases** tab (NOT pipeline runs). There should be a Release CR created from the Snapshot. +- If the Release exists but is incomplete, click into it and inspect the managed PipelineRun (`rh-push-to-external-registry`). + +**What to check** + +1. **Snapshot created.** Build run → Snapshots → the Snapshot for this commit. +2. **EC verdict.** Snapshot → Enterprise Contract result. Failed EC → release will not auto-fire. Read the violation messages; if they're advisory-tagged, they shouldn't block (`app-interface-standard` excludes most). +3. **Release CR exists.** If no Release was created, the Snapshot's `auto-release` is off — check the ReleasePlan label `release.appstudio.openshift.io/auto-release: 'true'`. +4. **`rh-push-to-external-registry` ran and failed.** Open the managed PipelineRun; most failures are missing secrets (see next section). +5. **`skopeo list-tags` empty 60s after release run finishes.** Almost always a Quay propagation delay — wait two minutes and retry. + +**Who to ping:** RPA / managed pipeline behavior — `#forum-konflux-release`. + +--- + +## Release pipeline failed + +Once the build Snapshot passes EC, `rh-push-to-external-registry` runs. The common failures: + +| Symptom in the managed PipelineRun | Likely cause | Fix | +|------------------------------------|--------------|-----| +| `verify-access` task failed | RPA / ReleasePlan mismatch, or the service account lost Quay push access | Check `serviceAccountName: release-app-interface-prod` in the RPA matches what's provisioned. Ping RelEng. 
| +| `collect-data` task failed | RPA `data.mapping` references a component not in the Snapshot | Confirm the Application's Components and the RPA mapping list are aligned | +| `push-snapshot` task failed | Quay push auth failed (`konflux-release-service-access-management-token` rotated) | RelEng — `#forum-konflux-release` | +| `create-pyxis-image` task failed | Pyxis secret missing or expired in `rhtap-releng-tenant` | RelEng | +| `slack-notification` task failed | `hyperfleet-slack-webhook-notification-secret` missing, or webhook URL revoked | See [Notifications](./notifications.md#rotating-the-slack-webhook). The release itself usually still completes — the notification is in the `finally` block. | +| EC violation in the release run (different from build EC) | Policy `app-interface-standard` constraint we don't meet | Inspect; if the violation reflects a real engineering problem, fix it. If it's an exemption candidate, file an exception in `konflux-release-data/exceptions/`. | + +--- + +## Wrong version baked into the image + +Symptom: `skopeo inspect …:0.3.0-rc1 | jq -r '.Labels.version'` returns `0.0.0-dev` or empty. + +**Where to look** + +- The build run's `extract-version` task → **Logs**. +- The build run's `build-container` task → **Logs** (look for `--build-arg APP_VERSION=…`). + +**What to check** + +1. The `extract-version` output is the version string with the `v` stripped. If it printed `0.0.0-dev`, the trigger was a `push` event on main, not a tag — the run name will be `…-on-push-…`. The build did what it was told. +2. If the trigger was the tag and `extract-version` printed correctly but the LABEL is `0.0.0-dev`: the Dockerfile lost its `ARG APP_VERSION` or its `LABEL version="${APP_VERSION}"`. Re-check the Dockerfile contract — see [Configuration Map: component repos](./configuration-map.md#component-repos-hyperfleet-api-hyperfleet-sentinel-hyperfleet-adapter). +3. Multi-stage Dockerfiles: `ARG APP_VERSION` must be **redeclared in every stage** that references it. 
Missing redeclaration is the #1 cause. + +--- + +## RC E2E didn't trigger after I tagged hyperfleet-release + +**Where to look** + +- `hyperfleet-release/scripts/trigger-rc-e2e.sh` — run with `--dry-run` first. +- The Gangway HTTP response from the script's log output. +- The Prow dashboard for our job. + +**What to check** + +1. **Manifest images exist in Quay.** The script verifies each component image at the version pinned in `RELEASE_MANIFEST.yaml`. If any image is missing, the script bails before calling Gangway. `skopeo list-tags` to confirm. +2. **`oc` token from app.ci is fresh.** The Gangway call needs a token; if it 401s, refresh from . +3. **Job config drift in `openshift/release`.** If the periodic / triggered job name doesn't match what the script sends, Gangway accepts the call but no job appears. Compare against the latest ci-operator config under `ci-operator/config/openshift-hyperfleet/`. +4. **GitHub Action not yet wired.** The interim workflow is manual via the script — there's no auto-trigger on tag push yet (see [HYPERFLEET-1038](https://redhat.atlassian.net/browse/HYPERFLEET-1038)). + +For full mechanics see [Trigger HyperFleet E2E Jobs via Gangway API](../test-release/trigger-e2e-jobs-via-gangway.md). + +--- + +## RC E2E ran but pulled the wrong image tags + +**Where to look** + +- The Prow job → environment variables → `*_IMAGE_TAG`, `*_IMAGE_REPO`. + +**What to check** + +1. **`RELEASE_MANIFEST.yaml` was current when the script ran.** The script reads the file at invocation time; if you committed the manifest after pushing the tag, the previous version was used. +2. **The script strips the `v` prefix.** Quay tags have no `v`; the manifest entries do. The script handles the conversion — if you bypass the script and call Gangway directly, you have to do this yourself. +3. **Manifest typo.** Component keys must be `hyperfleet-api`, `hyperfleet-sentinel`, `hyperfleet-adapter` exactly. 
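The `v`-prefix conversion from check 2 is a one-line parameter expansion. A minimal sketch for the case where you bypass the script; `to_quay_tag` is a hypothetical helper name, not a function taken from `trigger-rc-e2e.sh`:

```shell
# Hypothetical helper: turn a manifest version (git-tag form, leading "v")
# into the Quay tag form (no "v"). Input without a "v" is returned unchanged.
to_quay_tag() {
  printf '%s\n' "${1#v}"
}

to_quay_tag v0.3.0-rc1   # prints 0.3.0-rc1
to_quay_tag 1.5.0        # prints 1.5.0
```

This mirrors the `${TAG_REF#refs/tags/v}` expansion the build pipeline uses, so a manually built Gangway payload and the pipeline-produced Quay tag end up in the same form.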
+ +--- + +## Enterprise Contract violation + +EC runs in two places: in the build PipelineRun's `verify-enterprise-contract` task, and again in the release pipeline. The build-side verdict gates the Snapshot; the release-side verdict gates the push. + +**What to check** + +1. **Policy applied.** Container RPA uses `app-interface-standard`; chart RPA uses `registry-hyperfleet-chart-prod`. Mismatch → wrong rules applied → spurious failures. +2. **RPA `origin` matches the tenant.** `origin: hyperfleet-tenant`. Required for the constraint to bind. +3. **Constraint regex.** `constraints/service/hyperfleet.yaml` enforces the Quay URL prefix. A typo in the RPA's `mapping.components[].repositories[].url` will trip the constraint at MR-validation time (`tox -e test`), not at runtime. +4. **Genuine policy gap.** If the violation is real and unfixable in the short term, file an exception under `konflux-release-data/exceptions/` with rationale. + +--- + +## Escalation + +- General Konflux platform — `#konflux-users` (Red Hat Slack) +- Release pipeline (RPA, managed pipelines) — `#forum-konflux-release` +- Enterprise Contract — `#forum-conforma` +- Hyperfleet release coordination — `#hyperfleet-e2e-status`, then page the Release Owner +- File a Konflux support ticket: project `KFLUXSPRT` on JIRA + +See [Support](./support.md) for the full list with links. diff --git a/hyperfleet/docs/release/operations/notifications.md b/hyperfleet/docs/release/operations/notifications.md new file mode 100644 index 00000000..a6e02fa4 --- /dev/null +++ b/hyperfleet/docs/release/operations/notifications.md @@ -0,0 +1,132 @@ +--- +Status: Active +Owner: HyperFleet Team +Last Updated: 2026-06-22 +--- + +# Notifications and Status Signals + +> **Audience:** HyperFleet engineers wiring or troubleshooting release notifications, or trying to find out where to watch a release land. + +The HyperFleet release pipeline emits signals to five places. 
This page covers what they are, what posts there, and how to change or rotate each one. + +--- + +## Where to watch a release + +| Surface | What lands there | Driven by | +|---------|-----------------|-----------| +| **Slack `#hyperfleet-e2e-status`** | Per-component release success/failure messages from `rh-push-to-external-registry` | `data.slack` in the RPA | +| **Konflux UI** | Pipeline runs, Snapshots, Releases, EC verdicts, scan results | Konflux platform | +| **Pyxis / Red Hat Container Catalog** | Image entry + metadata + CVE tracking | `create-pyxis-image` task in the release pipeline | +| **GitHub commit status** | PaC reports build pass/fail back to the commit | PaC controller | +| **Prow dashboard** | E2E results (nightly + RC) | Prow / ci-operator | + +--- + +## Slack: `#hyperfleet-e2e-status` + +The release pipeline posts to `#hyperfleet-e2e-status` on Red Hat Slack on each release run. The notification is part of the `rh-push-to-external-registry` pipeline's `finally` block — even if the push fails, the message goes out. If the message itself fails to send, the release still records as complete (the notification is best-effort). + +What a successful message contains: + +- Release CR name +- Application (`hyperfleet`) +- Snapshot +- Component images and the tags they were pushed with +- Link back to the Konflux UI + +What a failed message contains: + +- Same as success, plus the failed task name and a UI link to its logs + +### Configuration + +The RPA references two values in `data.slack`: + +```yaml +data: + slack: + slack-notification-secret: hyperfleet-slack-webhook-notification-secret + slack-webhook-notification-secret-keyname: webhook-url +``` + +The secret lives in the `rhtap-releng-tenant` namespace (the same namespace as the RPA). The HyperFleet team does not have direct write access there — RelEng manages it. The secret name and keyname are referenced by the RPA; the secret's value is an incoming-webhook URL for the channel. 
+ +File: `konflux-release-data/config/kflux-prd-rh02.0fk9.p1/service/ReleasePlanAdmission/hyperfleet/hyperfleet.yaml`. + +### Adding a second channel + +Konflux supports one webhook per RPA. To post to a second channel: + +1. Create a second incoming webhook in Slack for the new channel. +2. Coordinate with RelEng to create a second secret in `rhtap-releng-tenant`. +3. The Konflux platform doesn't natively fan out — options are (a) make the webhook a Slack workflow that re-broadcasts, or (b) configure a Slack channel mirror. Option (a) is simpler. + +If you need different content per channel, that's a custom finally-task — out of scope for `rh-push-to-external-registry`. + +### Rotating the Slack webhook + +1. Create a new incoming webhook in Slack for `#hyperfleet-e2e-status`. +2. Open an MR or ticket with RelEng asking them to update the `webhook-url` key on `hyperfleet-slack-webhook-notification-secret` in `rhtap-releng-tenant` on `kflux-prd-rh02`. +3. Revoke the old webhook in Slack. + +No `konflux-release-data` change is needed — the RPA references the secret by name, not the URL. + +### Webhook missing / silent + +If `#hyperfleet-e2e-status` stops getting messages: + +1. Confirm a release actually fired: Konflux UI → Releases. +2. Open the release's managed PipelineRun → look for the `slack-notification` task in the `finally` block. +3. Common failures: secret rotated incorrectly, webhook URL revoked, Slack rate-limited. All show up in the task logs. + +--- + +## Pyxis + +Pyxis is Red Hat's container metadata catalog. The release pipeline registers each image so that: + +- The image appears in the Red Hat Container Catalog. +- Continuous CVE scanning runs against it. +- RPM-level metadata is tracked (where applicable). + +Configured via `data.pyxis` in the RPA (the secret itself is shared infrastructure managed by RelEng). 
To confirm a freshly-released image is in Pyxis, search by the SHA or repository name in the Red Hat Container Catalog UI — there's a propagation delay of a few minutes. + +If `create-pyxis-image` fails in the release run, the image is still in Quay but missing from Pyxis. Re-running the release usually fixes it; persistent failures are a RelEng ticket. + +--- + +## GitHub commit status (PaC) + +PaC reports build pass/fail back to the GitHub commit. You'll see: + +- A green check / red X next to the commit on the component repo. +- A "Konflux / hyperfleet--on-push" or `…-on-tag` status line. +- Click → opens the PipelineRun in the Konflux UI. + +This is the fastest way to confirm "did my push trigger a build" without opening the Konflux UI. If the status line is missing, the PaC trigger didn't fire — see [Debugging: I pushed a tag and no PipelineRun started](./debugging.md#i-pushed-a-tag-and-no-pipelinerun-started). + +--- + +## Prow + +E2E results land on the Prow dashboard, not in Slack (by default). Nightly E2E results are accessible from the periodic job; RC E2E results from the Gangway-triggered job. The `trigger-rc-e2e.sh` script prints a Prow job URL on submission — open that to follow the run. + +For details on configuring or extending Prow notifications, see [Add Hyperfleet E2E CI Job in Prow](../test-release/add-hyperfleet-e2e-ci-job-in-prow.md). + +--- + +## Konflux UI + +- URL: +- The cluster is `kflux-prd-rh02`. Bookmark the URL — it's the single pane for everything that runs on our tenant. +- Sign-in is Red Hat SSO; you'll need access to the `hyperfleet-tenant` namespace via the team Rover group. 
+ +--- + +## Related + +- [Configuration Map](./configuration-map.md) — exact file paths for the RPA and Slack config +- [Debugging](./debugging.md) — what to do when a release fails silently +- [Support](./support.md) — escalation contacts diff --git a/hyperfleet/docs/release/operations/pipeline-anatomy.md b/hyperfleet/docs/release/operations/pipeline-anatomy.md new file mode 100644 index 00000000..21122a6c --- /dev/null +++ b/hyperfleet/docs/release/operations/pipeline-anatomy.md @@ -0,0 +1,165 @@ +--- +Status: Active +Owner: HyperFleet Team +Last Updated: 2026-06-22 +--- + +# Pipeline Anatomy: Reading a Release Build + +> **Audience:** HyperFleet engineers staring at a Konflux PipelineRun for the first time. Tells you what each task does, how to find what you need in the UI, and the one trap that catches everyone. + +For the architectural rationale (why two pipeline files per component, why CEL, why one Application), see [Konflux Release Pipeline Design](../konflux-release-pipeline-design.md). This page is operational: what runs, where to look, how to confirm. + +--- + +## The one thing to remember + +**The build PipelineRun is not the release.** They are two separate runs: + +```mermaid +flowchart LR + push["Push tag
<br/>v0.3.0-rc1"] --> build["BUILD PipelineRun<br/>(tag.yaml)"] + build --> snap["Snapshot"] + snap --> ec["EC verdict"] + ec --> rel["RELEASE PipelineRun<br/>
(rh-push-to-external-registry)"] + rel --> quay["Image in Quay"] +``` + +When the build run goes green, the image is *not in the target registry yet*. A second run — the Release — pushes it. People watch the build finish, run `skopeo`, see nothing, and panic. Give it another minute or two and look under the **Releases** tab in the Konflux UI. + +--- + +## What triggers what + +| Action | PaC matches | Build pipeline | `APP_VERSION` value | +|--------|-------------|----------------|---------------------| +| Merge to `main` | `.tekton/hyperfleet--push.yaml` | docker-build-oci-ta | `0.0.0-dev` (Dockerfile default) | +| Push `vX.Y.Z` tag | `.tekton/hyperfleet--tag.yaml` | docker-build-oci-ta | extracted from tag (e.g. `0.3.0`) | +| Push `vX.Y.Z-rcN` tag | same | same | e.g. `0.3.0-rc1` | +| Chart path on `main` | `.tekton/hyperfleet--chart-push.yaml` | helm chart build | n/a | + +CEL match for the tag pipeline (used in the PaC annotation): + +```text +event == "push" && target_branch.matches("^refs/tags/v[0-9]+\\.[0-9]+\\.[0-9]+(-rc[0-9]+)?$") +``` + +The run name in the UI is `hyperfleet--on-tag-` (or `…-on-push-…`). Find it under **Applications → hyperfleet → Pipeline runs**. + +--- + +## The build DAG + +What the graph shows, left to right (from the `v0.3.0-rc1` reference run): + +```text +init + ├─ clone-repository ─┐ + └─ extract-version ──┤ + ▼ + prefetch-dependencies + ▼ + build-container + ▼ + build-image-index + ▼ + ┌──────────────┴──────────── fan-out (parallel) ──────────────┐ + build-source-image sast-snyk-check apply-tags + deprecated-base-image-check clamav-scan push-dockerfile + clair-scan sast-shell-check rpms-signature-scan + sast-unicode-check ecosystem-cert-preflight-checks +``` + +| Task | What it does | Blocking? | +|------|--------------|-----------| +| `extract-version` | Strips `refs/tags/v` → produces the version string (e.g. `v0.3.0-rc1` → `0.3.0-rc1`). Feeds `APP_VERSION`. HyperFleet-specific. 
| Blocking | +| `clone-repository` | Pulls source at the tagged commit. | Blocking | +| `prefetch-dependencies` | Cachi2 prefetch (hermetic support). Non-hermetic today. | Blocking | +| `build-container` | Buildah build, injects `APP_VERSION` build-arg. | Blocking | +| `build-image-index` | Manifest-list / image index. | Blocking | +| `apply-tags` | Build-time tags on the internal image. | Blocking | +| `build-source-image` | Source-container image. RPA's `pushSourceContainer: false` means the release doesn't push it. | Blocking on build | +| `clair-scan` | CVE scan. | Advisory (`app-interface-standard` excludes the gate) | +| `sast-snyk-check` | Snyk Code SAST. | Advisory | +| `clamav-scan` | Malware scan. | Advisory | +| `sast-shell-check`, `sast-unicode-check` | Additional SAST. | Advisory | +| `deprecated-base-image-check`, `rpms-signature-scan`, `ecosystem-cert-preflight-checks` | Supply-chain checks. | Mostly advisory | + +Most scan tasks run on every build and surface findings without stopping the release. A failed advisory task is signal, not a stop-the-line event — investigate, but the release continues. + +--- + +## How the version gets into the Quay tag + +```text +git tag v0.3.0-rc1 + → extract-version: VERSION = "${TAG_REF#refs/tags/v}" = 0.3.0-rc1 + → build-container: --build-arg APP_VERSION=0.3.0-rc1 + → Dockerfile: ARG APP_VERSION="0.0.0-dev" (overridden) → LABEL version="0.3.0-rc1" + → RPA tag template: {{ labels.version }} = 0.3.0-rc1 + → Quay tag: hyperfleet-api:0.3.0-rc1 +``` + +The `v` prefix is intentionally stripped — git tag `v0.3.0-rc1`, Quay tag `0.3.0-rc1`. The Dockerfile default `0.0.0-dev` is what nightly main builds get, because push.yaml doesn't pass `APP_VERSION`. + +--- + +## The release run + +Once the build run's Snapshot passes Enterprise Contract, the Release Service auto-runs `rh-push-to-external-registry` (because `block-releases: false` on the RPA). Find it under the **Releases** or **Snapshots** tab. 
+ +It pushes to: + +```text +quay.io/redhat-services-prod/hyperfleet-tenant/hyperfleet/hyperfleet- +``` + +with tags from the RPA's `defaults.tags` block: + +- `{{ labels.version }}` — e.g. `0.3.0-rc1` +- `{{ labels.version }}-{{ timestamp }}` — unique per release +- `{{ git_sha }}` — commit pin +- `latest` — moves on every successful release + +The release run also handles Pyxis registration and (when wired) Slack notification — see [Notifications](./notifications.md). + +--- + +## Reading the UI: where to look + +| You want to see... | Go to | +|--------------------|-------| +| Was the build triggered? | Applications → `hyperfleet` → **Pipeline runs** → `…-on-tag-…` or `…-on-push-…` | +| What version was extracted | the run → `extract-version` task → **Logs** | +| Why a scan failed | the run → the scan task → **Logs** (Snyk SARIF artifact under task results) | +| Whether the release fired | **Releases** or **Snapshots** tab → the auto-created Release | +| EC verdict | Snapshot details → Enterprise Contract result | +| The actual image | `skopeo list-tags` against the prod registry (below) | + +--- + +## Confirming it landed + +```bash +# -F: fixed string (dots are literal), -x: match the whole line +skopeo list-tags docker://quay.io/redhat-services-prod/hyperfleet-tenant/hyperfleet/hyperfleet-api \ + | jq -r '.Tags[]' | grep -Fx '0.3.0-rc1' +``` + +Inspect the label to confirm the version baked in correctly: + +```bash +skopeo inspect docker://quay.io/redhat-services-prod/hyperfleet-tenant/hyperfleet/hyperfleet-api:0.3.0-rc1 \ + | jq -r '.Labels.version' +# → 0.3.0-rc1 +``` + +If the build ran but `skopeo list-tags` shows nothing matching, the release run has either not fired yet or has failed. See [Debugging: build green but image not in Quay](./debugging.md#build-green-but-image-not-in-quay). 
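When confirming a release, it can help to derive the expected Quay tag from the git tag instead of typing it by hand. The sketch below is an illustration only: `expected_quay_tag` is a hypothetical helper, not something that exists in the HyperFleet repos. It assumes the tag pattern from the PaC CEL expression and the `v`-stripping that `extract-version` performs, as described on this page.

```bash
# Hypothetical helper (assumption, not part of any HyperFleet repo).
# Validates a git tag against the release-tag pattern used by the tag
# pipeline, then prints the Quay tag the release is expected to publish.
expected_quay_tag() {
  # Same shape as the CEL match, without the refs/tags/ prefix.
  if ! printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+(-rc[0-9]+)?$'; then
    echo "not a release tag: $1" >&2
    return 1
  fi
  # Strip the leading "v", mirroring what extract-version does.
  printf '%s\n' "${1#v}"
}

# Usage sketch (needs skopeo and jq, as elsewhere on this page):
#   want="$(expected_quay_tag v0.3.0-rc1)" &&
#     skopeo list-tags docker://quay.io/redhat-services-prod/hyperfleet-tenant/hyperfleet/hyperfleet-api \
#       | jq -r '.Tags[]' | grep -Fx "$want"
```

Rejecting malformed tags up front distinguishes "the tag never matched the pipeline trigger" from "the release run has not pushed yet", which are two different debugging paths.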
+ +--- + +## Related + +- [Configuration Map](./configuration-map.md) — where each file in the diagram lives +- [Debugging](./debugging.md) — what to do when any of this fails +- [Notifications](./notifications.md) — Slack/Pyxis behavior on success/failure +- [Release Runbook](./release-runbook.md) — copy-paste release commands diff --git a/hyperfleet/docs/release/operations/release-runbook.md b/hyperfleet/docs/release/operations/release-runbook.md new file mode 100644 index 00000000..b3c3127e --- /dev/null +++ b/hyperfleet/docs/release/operations/release-runbook.md @@ -0,0 +1,275 @@ +--- +Status: Active +Owner: HyperFleet Team +Last Updated: 2026-06-22 +--- + +# Release Runbook + +> **Audience:** Any HyperFleet engineer cutting a release. This is the command sequence — copy, paste, verify. For the broader process (Release Owner duties, communication, sign-offs), read [HyperFleet Release Process](../hyperfleet-release-process.md). For *what happens* between commands, read [Pipeline Anatomy](./pipeline-anatomy.md). For *when things break*, read [Debugging](./debugging.md). + +This runbook covers the four real flows: + +1. [Cut a release (RC → GA)](#1-cut-a-release-rc--ga) +2. [Fix cycle during RC](#2-fix-cycle-during-rc) +3. [Hotfix after GA](#3-hotfix-after-ga) +4. [Verify what's in Quay](#verification-snippets) + +Examples assume Release 1.5 with `hyperfleet-api` going to `v1.5.0`, `hyperfleet-sentinel` to `v1.4.2`, `hyperfleet-adapter` to `v2.0.0`. Substitute your own versions. Independent versioning is intentional — see [Konflux Release Pipeline Design §6](../konflux-release-pipeline-design.md#6-multi-component-configuration). + +--- + +## Preflight + +Before you start, confirm: + +- [ ] All intended PRs are merged to `main` on each component repo. +- [ ] Nightly E2E has been green for at least 24 hours. +- [ ] You know the target version per component (use existing tags + semver as the guide). 
+- [ ] You have push access to the component repos and `hyperfleet-release`. +- [ ] You can reach the Konflux UI: . +- [ ] `skopeo` and `jq` are installed locally. + +--- + +## 1. Cut a release (RC → GA) + +### 1.1 Create release branches + +Per component that has changes since the last release. Components without changes for this release reuse their existing release branch. + +```bash +# API: new features → minor bump +cd hyperfleet-api && git checkout main && git pull +git checkout -b release-1.5 && git push origin release-1.5 + +# Sentinel: bug fixes only → patch bump +cd ../hyperfleet-sentinel && git checkout main && git pull +git checkout -b release-1.4 && git push origin release-1.4 + +# Adapter: breaking changes → major bump +cd ../hyperfleet-adapter && git checkout main && git pull +git checkout -b release-2.0 && git push origin release-2.0 + +# Release coordination repo +cd ../hyperfleet-release && git checkout main && git pull +git checkout -b release-1.5 && git push origin release-1.5 +``` + +Nothing happens in Konflux yet — `push.yaml` only triggers on `main`. + +### 1.2 Push RC tags per component + +```bash +cd hyperfleet-api && git checkout release-1.5 +git tag v1.5.0-rc1 && git push origin v1.5.0-rc1 + +cd ../hyperfleet-sentinel && git checkout release-1.4 +git tag v1.4.2-rc1 && git push origin v1.4.2-rc1 + +cd ../hyperfleet-adapter && git checkout release-2.0 +git tag v2.0.0-rc1 && git push origin v2.0.0-rc1 +``` + +Each tag triggers PaC → `tag.yaml` → Konflux build → auto-release. 
After ~15-20 minutes confirm: + +```bash +for svc in api sentinel adapter; do + case "$svc" in + api) tag=1.5.0-rc1 ;; + sentinel) tag=1.4.2-rc1 ;; + adapter) tag=2.0.0-rc1 ;; + esac + echo "== hyperfleet-$svc:$tag ==" + skopeo inspect docker://quay.io/redhat-services-prod/hyperfleet-tenant/hyperfleet/hyperfleet-$svc:$tag \ + | jq -r '.Labels.version' +done +``` + +If any image is missing or the version label is wrong, see [Debugging: build green but image not in Quay](./debugging.md#build-green-but-image-not-in-quay) or [wrong version baked into the image](./debugging.md#wrong-version-baked-into-the-image). + +### 1.3 Update the release manifest and trigger RC E2E + +```bash +cd hyperfleet-release && git checkout release-1.5 + +# Edit RELEASE_MANIFEST.yaml — set per-component versions and e2e_ref +cat > RELEASE_MANIFEST.yaml <<'EOF' +release: "1.5" +e2e_ref: release-1.5 +components: + hyperfleet-api: v1.5.0-rc1 + hyperfleet-sentinel: v1.4.2-rc1 + hyperfleet-adapter: v2.0.0-rc1 +EOF + +git add RELEASE_MANIFEST.yaml +git commit -m "RC1: API v1.5.0-rc1, Sentinel v1.4.2-rc1, Adapter v2.0.0-rc1" +git push origin release-1.5 + +# Dry-run the trigger first +./scripts/trigger-rc-e2e.sh --dry-run + +# Real run +./scripts/trigger-rc-e2e.sh +``` + +The script prints the Prow job URL. Watch for tier0 + tier1 pass. If it doesn't start or pulls the wrong tags, see [Debugging: RC E2E didn't trigger](./debugging.md#rc-e2e-didnt-trigger-after-i-tagged-hyperfleet-release). + +### 1.4 GA tags + +Only after the RC E2E run is green and the Release Owner has signed off. The tag push *is* the gate. + +```bash +cd hyperfleet-api && git checkout release-1.5 +git tag v1.5.0 && git push origin v1.5.0 + +cd ../hyperfleet-sentinel && git checkout release-1.4 +git tag v1.4.2 && git push origin v1.4.2 + +cd ../hyperfleet-adapter && git checkout release-2.0 +git tag v2.0.0 && git push origin v2.0.0 +``` + +Verify (see [verification snippets](#verification-snippets)). 
+ +### 1.5 Finalize the release + +```bash +cd hyperfleet-release && git checkout release-1.5 + +# Update manifest to GA versions +cat > RELEASE_MANIFEST.yaml <<'EOF' +release: "1.5" +e2e_ref: release-1.5 +components: + hyperfleet-api: v1.5.0 + hyperfleet-sentinel: v1.4.2 + hyperfleet-adapter: v2.0.0 +EOF + +git add RELEASE_MANIFEST.yaml +git commit -m "GA: HyperFleet Release 1.5" +git tag release-1.5 +# The branch and the tag now share a name, so a bare "release-1.5" is ambiguous to git push. +git push origin refs/heads/release-1.5 # push the branch by its full ref +git push origin tag release-1.5 # push the tag explicitly (same name as branch) +``` + +Then: + +1. Create the GitHub Release on `hyperfleet-release` with notes and the compatibility matrix. +2. Run smoke tests against the GA images. +3. Post to `#hyperfleet-e2e-status` so partner teams can pick up the release. + +--- + +## 2. Fix cycle during RC + +Bug found during RC E2E. Only the affected component gets a new RC — others stay pinned. + +```bash +# Fix on main first — always +cd hyperfleet-api && git checkout main +# ... open PR with fix + tests, get review, merge ... + +# Cherry-pick to the release branch +git checkout release-1.5 +git cherry-pick && git push origin release-1.5 + +# New RC tag for the affected component only +git tag v1.5.0-rc2 && git push origin v1.5.0-rc2 +``` + +Wait for the build → release run, confirm in Quay (see [verification snippets](#verification-snippets)). + +Update the manifest — only the affected component changes: + +```bash +cd ../hyperfleet-release && git checkout release-1.5 +# Edit RELEASE_MANIFEST.yaml — set hyperfleet-api: v1.5.0-rc2 +git add RELEASE_MANIFEST.yaml +git commit -m "RC2: API v1.5.0-rc2 — fix " +git push origin release-1.5 + +./scripts/trigger-rc-e2e.sh +``` + +Repeat until the E2E run is green. + +--- + +## 3. Hotfix after GA + +Target: 1 working day for Blocker / Critical CVE. Patches skip the RC cycle — the tag push is the gate, and the focused smoke test is the verification. + +```bash +# 1. Fix on main first +cd hyperfleet-api && git checkout main +# ... 
open PR with fix + tests, get Release Owner review, merge ... + +# 2. Cherry-pick to the existing release branch +git checkout release-1.5 +git cherry-pick && git push origin release-1.5 + +# 3. Push the patch tag — no RC +git tag v1.5.1 && git push origin v1.5.1 +``` + +After ~15-20 minutes: + +```bash +skopeo inspect docker://quay.io/redhat-services-prod/hyperfleet-tenant/hyperfleet/hyperfleet-api:1.5.1 \ + | jq -r '.Labels.version' +``` + +Then: + +1. Smoke test against `…/hyperfleet-api:1.5.1` only. +2. Update `hyperfleet-release/RELEASE_MANIFEST.yaml` — bump the affected component, leave others. +3. Tag `release-1.5.1` on `hyperfleet-release`. +4. Notify in `#hyperfleet-e2e-status` and tag partner teams. + +For the broader hotfix process (severity assessment, communication, retrospective triggers), see [HyperFleet Release Process §1.9](../hyperfleet-release-process.md#19-emergency-hotfix-process-post-ga). + +--- + +## Verification snippets + +### List all tags for a component + +```bash +skopeo list-tags docker://quay.io/redhat-services-prod/hyperfleet-tenant/hyperfleet/hyperfleet-api \ + | jq -r '.Tags[]' | sort -V | tail -20 +``` + +### Confirm a specific tag and its baked-in version label + +```bash +TAG=1.5.0-rc1 +skopeo inspect docker://quay.io/redhat-services-prod/hyperfleet-tenant/hyperfleet/hyperfleet-api:$TAG \ + | jq -r '.Labels.version' +# Should print exactly: 1.5.0-rc1 +``` + +### Spot-check all three components at the same version family + +```bash +# Prints MISSING for any tag a component has not published. +for svc in api sentinel adapter; do + for tag in 1.5.0-rc1 1.5.0; do + echo "hyperfleet-$svc:$tag" + # If skopeo fails, feed jq an empty object so the MISSING fallback actually fires. + { skopeo inspect docker://quay.io/redhat-services-prod/hyperfleet-tenant/hyperfleet/hyperfleet-$svc:$tag 2>/dev/null || echo '{}'; } \ + | jq -r '.Labels.version // "MISSING"' + done +done +``` + +--- + +## If something fails + +- Tag pushed, no PipelineRun → [Debugging: I pushed a tag and no PipelineRun started](./debugging.md#i-pushed-a-tag-and-no-pipelinerun-started) +- Build red → [Debugging: A build PipelineRun 
failed](./debugging.md#a-build-pipelinerun-failed) +- Build green but `skopeo` empty → [Debugging: Build green but image not in Quay](./debugging.md#build-green-but-image-not-in-quay) +- Wrong version label → [Debugging: Wrong version baked into the image](./debugging.md#wrong-version-baked-into-the-image) +- RC E2E didn't trigger → [Debugging: RC E2E didn't trigger](./debugging.md#rc-e2e-didnt-trigger-after-i-tagged-hyperfleet-release) +- Anything else → [Support](./support.md) diff --git a/hyperfleet/docs/release/operations/support.md b/hyperfleet/docs/release/operations/support.md new file mode 100644 index 00000000..02f85ee6 --- /dev/null +++ b/hyperfleet/docs/release/operations/support.md @@ -0,0 +1,88 @@ +--- +Status: Active +Owner: HyperFleet Team +Last Updated: 2026-06-22 +--- + +# Support and Escalation + +> **Audience:** HyperFleet engineers who need to find the right channel, dashboard, or contact fast. No prose — just tables. + +For *why* something is configured a particular way, see [Konflux Release Pipeline Design](../konflux-release-pipeline-design.md). This page is a phone book. + +--- + +## Slack channels + +| Channel | When to use it | +|---------|----------------| +| `#hyperfleet-e2e-status` | Watch HyperFleet release notifications. Coordinate releases here. | +| `#konflux-users` | General Konflux platform questions, build infra issues | +| `#forum-konflux-release` | RPA, managed pipelines, release service behavior | +| `#forum-conforma` | Enterprise Contract policy questions, exceptions | + +All on Red Hat enterprise Slack. 
+ +--- + +## JIRA queues + +| Project | Purpose | URL | +|---------|---------|-----| +| `HYPERFLEET` | Our own tickets | | +| `KFLUXSPRT` | Konflux support tickets (file when the platform team needs to act) | | +| `KFLUXMIG` | Konflux migration tracker (per-tenant onboarding tracking) | | + +--- + +## Konflux UI and source repos + +| Resource | URL | +|----------|-----| +| Konflux UI (`kflux-prd-rh02`) | | +| `konflux-release-data` (GitLab) | | +| Release service catalog (pipelines) | | +| Konflux docs | | +| App Interface onboarding walkthrough | | +| Enterprise Contract docs | | +| EC policy source | | +| Pipelines as Code docs | | + +--- + +## Prow + +| Resource | URL | +|----------|-----| +| Prow dashboard (`app.ci`) | | +| `openshift/release` repo | | +| Gangway token request | | + +--- + +## Access management + +| Need | How to get it | +|------|---------------| +| `hyperfleet-tenant` access on `kflux-prd-rh02` | Join the team Rover group — see | +| `konflux-release-data-users` (GitLab) | Self-service join on GitLab | +| `openshift-hyperfleet` GitHub team | Org admin grants | +| `rhtap-releng-tenant` secrets (Slack, Pyxis) | File a ticket with RelEng in `#forum-konflux-release` — HyperFleet does not have direct write access | + +--- + +## Escalation order + +1. **Search the docs in this directory first** — most operational answers are here. +2. **`#hyperfleet-e2e-status` or the team channel** — for HyperFleet-specific coordination. +3. **Forum channels** above — for platform-side issues. Tag your message with what you've already checked. +4. **File a `KFLUXSPRT` ticket** — when the platform team needs to act and Slack isn't moving fast enough. +5. **Page the Release Owner** — for in-flight release blockers. See the release process doc for the on-call rotation. 
+ +--- + +## Related + +- [Configuration Map](./configuration-map.md) — where each file mentioned in support conversations lives +- [Debugging](./debugging.md) — try this before pinging +- [HyperFleet Release Process](../hyperfleet-release-process.md) — Release Owner role and the broader process