Skip to content

HYPERFLEET-1229 - feat: automate GCP developer cluster lifecycle enforcement#58

Open
rafabene wants to merge 1 commit into
openshift-hyperfleet:mainfrom
rafabene:HYPERFLEET-1229-lifecycle-enforcement
Open

HYPERFLEET-1229 - feat: automate GCP developer cluster lifecycle enforcement#58
rafabene wants to merge 1 commit into
openshift-hyperfleet:mainfrom
rafabene:HYPERFLEET-1229-lifecycle-enforcement

Conversation

@rafabene

@rafabene rafabene commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Cloud Function (Go) that runs hourly via Cloud Scheduler to enforce lifecycle rules on developer GKE clusters in hcm-hyperfleet
  • Terraform module (terraform/modules/lifecycle/) for Cloud Function Gen2, Cloud Scheduler, IAM, and GCS bucket
  • Adds ttl label (current date + 5 days via plantimestamp()) and environment variable to Terraform common_labels
  • Updates dev-prow.tfvars with environment = "cicd" (exempt from enforcement)
  • Makefile targets: test-lifecycle-function, build-lifecycle-function, lint-lifecycle-function, add-ttl-labels
  • Migration script (scripts/add-ttl-labels.sh) for adding TTL labels to existing clusters
  • Dry-run mode enabled by default for safe rollout

Enforcement rules

Rule Action Deletion
Idle nodes (>12h) Scale node pools to 0 No
TTL expired Scale to 0, set shutdown-date After 48h
Missing owner Scale to 0, set shutdown-date After 7 days
environment: cicd or hyperfleet-dev-ci-infra-* Skip Skip

Rollout

  1. Deploy with lifecycle_enforcer_dry_run = true (default)
  2. Validate decisions in Cloud Logging
  3. Run DRY_RUN=false make add-ttl-labels to label existing clusters
  4. Set lifecycle_enforcer_dry_run = false and re-apply

Test plan

  • 36 unit tests covering all enforcement scenarios (skip, shutdown, delete, label-only, edge cases)
  • make validate-terraform passes
  • make test-lifecycle-function passes
  • make lint-lifecycle-function passes
  • Deploy with enable_lifecycle_enforcer = true + lifecycle_enforcer_dry_run = true and validate logs
  • Run add-ttl-labels.sh in dry-run mode against hcm-hyperfleet
  • Disable dry-run and verify enforcement actions on a test cluster

Ref: https://redhat.atlassian.net/browse/HYPERFLEET-1229

@openshift-ci openshift-ci Bot requested review from ma-hill and pnguyen44 June 24, 2026 19:26
@openshift-ci

openshift-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ldornele for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds lifecycle enforcement documentation, Terraform inputs and module wiring, a Go Cloud Function, decision logic and tests, and a TTL labeling script. The new flow evaluates environment, owner, ttl, shutdown-date, node age, and node-pool size to skip, label, shut down, or delete clusters. Configuration adds environment, lifecycle toggles, and the archive provider. CWE-284, CWE-494.

Sequence Diagram(s)

sequenceDiagram
  participant CloudSchedulerJob
  participant EnforceLifecycle
  participant ContainerAPI
  participant ComputeAPI

  CloudSchedulerJob->>EnforceLifecycle: HTTP POST with OIDC
  EnforceLifecycle->>ContainerAPI: list clusters
  EnforceLifecycle->>ComputeAPI: list managed instances and creation timestamps
  EnforceLifecycle->>EnforceLifecycle: EvaluateCluster(clusterInfo, now)
  EnforceLifecycle->>ContainerAPI: update labels, scale node pools, or delete cluster
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (10 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: automating GCP developer cluster lifecycle enforcement.
Description check ✅ Passed The description matches the changeset and rollout plan for the lifecycle-enforcer work.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Sec-02: Secrets In Log Output ✅ Passed No log/print calls in changed non-test/example files emit token/password/credential/secret fields or interpolations.
No Hardcoded Secrets ✅ Passed PASS: no PR-added CWE-798/CWE-321 patterns found; scans of touched files showed no API keys, tokens, passwords, private keys, or long base64 strings.
No Weak Cryptography ✅ Passed No md5/des/rc4/SHA1-crypto, ECB, custom crypto, or secret compares in changed files; no CWE-327/CWE-208 findings.
No Injection Vectors ✅ Passed No CWE-78/79/89/502 sink was added; changed code lacks os/exec, SQL concat, template.HTML, or yaml.Unmarshal, and fmt.Sprintf only builds GCP resource paths.
No Privileged Containers ✅ Passed No added/modified manifests or Dockerfiles contain privileged/root settings; scans found no privileged flags (CWE-250/CWE-732).
No Pii Or Sensitive Data In Logs ✅ Passed PASS: logs only cluster/project metadata and errors; no request bodies, emails, tokens, or secrets were added. No CWE-532/CWE-200 exposure found.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Comment @coderabbitai help to get the list of available commands.

@rafabene rafabene force-pushed the HYPERFLEET-1229-lifecycle-enforcement branch from 3867ff6 to f7c3d46 Compare June 24, 2026 19:32

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@functions/lifecycle-enforcer/decision_test.go`:
- Around line 311-324: The test case names in decision_test.go are misleading
because they describe the wrong scenario or implementation details rather than
the asserted behavior. Rename the affected cases in TestEvaluateCluster and any
related shutdown test blocks so the names match the actual expected
action/reason (for example, describe empty labels leading to missing owner and
shutdown), and ensure the "shutdown sets shutdown-date label" case reflects what
it actually verifies instead of implying SetLabels is asserted when it is not.

In `@functions/lifecycle-enforcer/decision.go`:
- Around line 88-95: The TTL-based delete path in decision.go is still allowing
deletion when owner metadata is missing, which bypasses the intended 7-day owner
grace period. Update the Decision logic in the shutdown-date block so the TTL
delete branch is only considered when hasOwner is true, keeping the
missing-owner delete path exclusive to its own grace period. Use the existing
Decision, hasOwner, hasTTL, and ttlExpired checks to make the TTL branch
dependent on owner presence.

In `@functions/lifecycle-enforcer/function.go`:
- Around line 178-225: `EvaluateCluster()` is deriving `NodeCount` from stale
`NodePool.InitialNodeCount` and overwriting it per MIG instead of using live
membership across all `InstanceGroupUrls`. Update the `lifecycle-enforcer` loop
to aggregate `managed.ManagedInstances` from every MIG into `npInfo.NodeCount`,
and if any `computeSvc.InstanceGroupManagers.ListManagedInstances` or
`computeSvc.Instances.Get` call fails, return an error rather than continuing
with partial/stale data. Keep the change localized around the `NodePoolInfo`
population logic so the pool is classified from current MIG state.
- Around line 236-279: Wait for the GKE long-running operations returned by
Cluster.SetResourceLabels, Clusters.NodePools.SetSize, and Clusters.Delete
before treating the work as complete. In lifecycle-enforcer’s shutdown/delete
flow, poll each returned Operation to DONE before logging success, and make sure
shutdown-date is only written after every node pool has reached zero. Update the
ActionShutdown and ActionDelete handling to use the operation result instead of
assuming the request finished immediately.

In `@functions/lifecycle-enforcer/go.mod`:
- Around line 32-39: Bump the vulnerable indirect module entries in go.mod for
the lifecycle-enforcer function: update golang.org/x/crypto,
golang.org/x/oauth2, and google.golang.org/grpc to versions at or above the
fixed releases called out in the review. Locate the existing dependency lines in
the go.mod block and raise them to the minimum safe versions, then make sure the
module graph still resolves cleanly after the version changes.

In `@Makefile`:
- Line 251: The recipe shells in the Makefile are vulnerable because
$(LIFECYCLE_DIR) is interpolated unquoted in the commands. Update the affected
targets that use $(LIFECYCLE_DIR) (including the go test, and the other commands
at the same pattern) to wrap the variable in quotes so the shell treats it as a
single path value. Use the existing recipe commands as the place to apply the
fix, keeping the behavior unchanged aside from preventing shell metacharacter
expansion.

In `@scripts/add-ttl-labels.sh`:
- Around line 5-8: `add-ttl-labels.sh` currently treats `DRY_RUN` as false for
any non-literal true value and uses `TTL_DAYS` without validation, so tighten
the CLI boundary before any label-write path. In the script’s startup logic
around `TTL_DAYS`, `DRY_RUN`, and `TTL_DATE`, validate that `TTL_DAYS` is a
positive integer within an expected range, and normalize `DRY_RUN` to an
explicit boolean with only accepted values. Ensure the downstream label
application flow only runs when the sanitized values are valid, and keep the
same checks consistent in the write path referenced by the script’s main action
block.

In `@terraform/main.tf`:
- Around line 5-12: The ttl label is being recomputed from plantimestamp()
inside the local common_labels setup, which causes every plan/apply to push the
cutoff forward and create drift. Update the ttl handling in the locals that
build common_labels so it comes from a creation-time value persisted in state or
is only stamped once, and avoid continuously managing ttl through this shared
label map. Keep the change localized around the ttl_date/common_labels symbols
so the lifecycle-enforcer logic continues to read a stable shutdown/delete
cutoff.

In `@terraform/modules/lifecycle/main.tf`:
- Around line 1-4: The lifecycle module is creating project-global GCP resources
with fixed names, which will collide across multiple Terraform states and make
one state able to destroy shared infrastructure. Update the naming in lifecycle
module symbols like local.function_name, local.bucket_name, the service account
definitions, and the scheduler job resource to include a unique namespace such
as the project or developer identifier, or otherwise restrict this module to a
single shared state. Ensure all references to these names stay consistent
throughout the module so each deployment gets isolated resource names.
- Around line 9-17: The google_storage_bucket resource for function_source is
missing public access prevention, so add the bucket-level setting to enforce
blocking public IAM exposure. Update the function_source bucket definition to
set public_access_prevention to "enforced" alongside the existing
uniform_bucket_level_access and other attributes, using the
google_storage_bucket "function_source" block as the place to make the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 998c6f8d-f89f-4e76-a917-cb86565f11be

📥 Commits

Reviewing files that changed from the base of the PR and between 931c935 and 3867ff6.

⛔ Files ignored due to path filters (1)
  • functions/lifecycle-enforcer/go.sum is excluded by !**/*.sum, !**/go.sum
📒 Files selected for processing (18)
  • AGENTS.md
  • Makefile
  • README.md
  • functions/lifecycle-enforcer/README.md
  • functions/lifecycle-enforcer/decision.go
  • functions/lifecycle-enforcer/decision_test.go
  • functions/lifecycle-enforcer/function.go
  • functions/lifecycle-enforcer/go.mod
  • scripts/add-ttl-labels.sh
  • terraform/README.md
  • terraform/envs/gke/dev-prow.tfvars
  • terraform/envs/gke/dev.tfvars.example
  • terraform/main.tf
  • terraform/modules/lifecycle/main.tf
  • terraform/modules/lifecycle/outputs.tf
  • terraform/modules/lifecycle/variables.tf
  • terraform/variables.tf
  • terraform/versions.tf
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)

Comment thread functions/lifecycle-enforcer/decision_test.go
Comment thread functions/lifecycle-enforcer/decision.go
Comment thread functions/lifecycle-enforcer/function.go
Comment thread functions/lifecycle-enforcer/function.go Outdated
Comment thread functions/lifecycle-enforcer/go.mod Outdated
Comment thread Makefile Outdated
Comment thread scripts/add-ttl-labels.sh
Comment thread terraform/main.tf
Comment thread terraform/modules/lifecycle/main.tf
Comment thread terraform/modules/lifecycle/main.tf

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Makefile`:
- Around line 251-276: The lifecycle Make targets currently treat a missing go
binary as a non-fatal warning under CI, which lets ci-validate pass without
running the checks. Update the go-missing branches in the lifecycle targets (the
test/build/lint recipes in Makefile, plus the ci-validate dependency chain) so
CI fails fast instead of skipping, while keeping the existing behavior only for
local non-CI usage if needed. Use the lifecycle target names and ci-validate
wiring to ensure the canonical pre-merge path always runs or errors when go is
unavailable.

In `@terraform/modules/lifecycle/main.tf`:
- Around line 80-82: Update the lifecycle function to stop using the deprecated
Go 1.23 runtime: in build_config for EnforceLifecycle, replace the go123 runtime
with a supported version such as go124 or newer, and make the matching Go
version change in functions/lifecycle-enforcer/go.mod so both places stay
aligned. Verify any lifecycle-enforcer build settings or related module
declarations reference the new Go version consistently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 69ebe96a-0331-4efe-8007-b45ed875ca76

📥 Commits

Reviewing files that changed from the base of the PR and between 3867ff6 and f7c3d46.

⛔ Files ignored due to path filters (1)
  • functions/lifecycle-enforcer/go.sum is excluded by !**/*.sum, !**/go.sum
📒 Files selected for processing (18)
  • AGENTS.md
  • Makefile
  • README.md
  • functions/lifecycle-enforcer/README.md
  • functions/lifecycle-enforcer/decision.go
  • functions/lifecycle-enforcer/decision_test.go
  • functions/lifecycle-enforcer/function.go
  • functions/lifecycle-enforcer/go.mod
  • scripts/add-ttl-labels.sh
  • terraform/README.md
  • terraform/envs/gke/dev-prow.tfvars
  • terraform/envs/gke/dev.tfvars.example
  • terraform/main.tf
  • terraform/modules/lifecycle/main.tf
  • terraform/modules/lifecycle/outputs.tf
  • terraform/modules/lifecycle/variables.tf
  • terraform/variables.tf
  • terraform/versions.tf
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)
✅ Files skipped from review due to trivial changes (4)
  • functions/lifecycle-enforcer/README.md
  • terraform/modules/lifecycle/variables.tf
  • terraform/README.md
  • README.md
🚧 Files skipped from review as they are similar to previous changes (11)
  • terraform/versions.tf
  • terraform/envs/gke/dev.tfvars.example
  • terraform/modules/lifecycle/outputs.tf
  • terraform/variables.tf
  • terraform/envs/gke/dev-prow.tfvars
  • AGENTS.md
  • functions/lifecycle-enforcer/decision_test.go
  • scripts/add-ttl-labels.sh
  • functions/lifecycle-enforcer/function.go
  • terraform/main.tf
  • functions/lifecycle-enforcer/decision.go

Comment thread Makefile Outdated
Comment thread terraform/modules/lifecycle/main.tf
@rafabene rafabene force-pushed the HYPERFLEET-1229-lifecycle-enforcement branch from f7c3d46 to 51bfcaf Compare June 24, 2026 19:50
}

env := cluster.Labels[LabelEnvironment]
if env == EnvCICD {

@ma-hill ma-hill Jun 25, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The management of ci-infra-* clusters is the responsibility of the Test Platform team. I don't think we should interfere with that in the scope of this ticket — the lifecycle enforcer intentionally skips them to avoid conflicts with Prow's own cleanup mechanism.

logger := slog.Default()
now := time.Now().UTC()

projectID := os.Getenv("GCP_PROJECT_ID")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we standardize this to be GCP_PROJECT_ID or PROJECT_ID? Because I think it's set to PROJECT_ID

https://github.com/openshift-hyperfleet/hyperfleet-infra/blob/main/env.gcp#L2
https://github.com/openshift-hyperfleet/hyperfleet-infra/blob/main/terraform/bootstrap/setup-backend.sh#L22

here but in the e2e tests its set as GCP_PROJECT_ID

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — standardized to PROJECT_ID across the Cloud Function code, Terraform module env vars, shell script, and README to match the existing convention in env.gcp and setup-backend.sh.

variable "schedule" {
description = "Cron schedule for the enforcement job (Cloud Scheduler format)"
type = string
default = "0 * * * *"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jira says to run at 19:00 UTC but this will be running hourly.
HF-1229

runs daily at 19:00 UTC via Cloud Scheduler, iterates all GKE clusters in hcm-hyperfleet, and enforces lifecycle rules:

Maybe update jira to match?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — updated the JIRA ticket with a comment explaining the change from nightly to hourly. Hourly is better because it catches idle clusters throughout the day and enforces TTL/owner checks more frequently. The schedule is configurable via lifecycle_enforcer_schedule (default: 0 * * * *).

instanceName := lastSegment(mi.Instance)
inst, err := computeSvc.Instances.Get(projectID, igZone, instanceName).Context(ctx).Do()
if err != nil {
slog.Warn("failed to get instance details",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we just warn and continue will get cause silent failures? Should we instead be collecting the failures and erroring?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added a retry mechanism (3 attempts with linear backoff: 500ms, 1s, 1.5s) for the Instances.Get call. If all retries fail, it still logs a warning and continues (the instance count from ListManagedInstances remains accurate, only the creation timestamp is lost for that specific instance).

functions.HTTP("EnforceLifecycle", handleEnforceLifecycle)
}

func handleEnforceLifecycle(w http.ResponseWriter, r *http.Request) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validate that the request is an http.MethodPost?
Since that's what is defined here https://github.com/openshift-hyperfleet/hyperfleet-infra/pull/58/changes#diff-e16d384f48e95c49a8aa88db12e2e8582bf119542fe3101748eb9269950d3c80R119

Suggested change
func handleEnforceLifecycle(w http.ResponseWriter, r *http.Request) {
func handleEnforceLifecycle(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPost {
http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
return
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added POST method validation at the top of the handler.

@rafabene rafabene force-pushed the HYPERFLEET-1229-lifecycle-enforcement branch from 51bfcaf to 3564e2b Compare June 26, 2026 13:47

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@functions/lifecycle-enforcer/function.go`:
- Around line 246-277: The lifecycle-enforcer logic in the shutdown path applies
the `shutdown-date` label before the node pool scale-to-zero work completes, so
move the label update out of the early `SetResourceLabels` block and only apply
it after the `ActionShutdown` loop in `enforceLifecycle` succeeds for every node
pool. Keep the existing cluster label merge behavior, but ensure
`decision.SetLabels` is written only after all
`svc.Projects.Locations.Clusters.NodePools.SetSize` calls have completed without
error so a failed shutdown does not advance the delete window.
- Around line 78-158: The handler currently logs and continues on cluster-level
failures in the main loop, which leaves the overall request returning success
even when enforcement partially fails. In the lifecycle-enforcer function, add
an explicit failure-tracking flag around the cluster processing path that is set
when buildClusterInfo or executeDecision fails, then after encoding the JSON
response use that flag to return a non-2xx status. Keep the per-cluster results
and logging in place, but make the degraded behavior intentional with a clear
comment near the continue paths and the final response handling.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 88b71d29-efb9-484a-a088-c1b1b182662d

📥 Commits

Reviewing files that changed from the base of the PR and between 51bfcaf and 3564e2b.

⛔ Files ignored due to path filters (1)
  • functions/lifecycle-enforcer/go.sum is excluded by !**/*.sum, !**/go.sum
📒 Files selected for processing (19)
  • AGENTS.md
  • Makefile
  • README.md
  • functions/lifecycle-enforcer/README.md
  • functions/lifecycle-enforcer/decision.go
  • functions/lifecycle-enforcer/decision_test.go
  • functions/lifecycle-enforcer/function.go
  • functions/lifecycle-enforcer/go.mod
  • scripts/add-ttl-labels.sh
  • terraform/README.md
  • terraform/envs/gke/dev-prow.tfvars
  • terraform/envs/gke/dev.tfvars.example
  • terraform/main.tf
  • terraform/modules/lifecycle/.gitignore
  • terraform/modules/lifecycle/main.tf
  • terraform/modules/lifecycle/outputs.tf
  • terraform/modules/lifecycle/variables.tf
  • terraform/variables.tf
  • terraform/versions.tf
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)
✅ Files skipped from review due to trivial changes (6)
  • terraform/modules/lifecycle/.gitignore
  • terraform/modules/lifecycle/variables.tf
  • terraform/README.md
  • terraform/envs/gke/dev.tfvars.example
  • AGENTS.md
  • README.md
🚧 Files skipped from review as they are similar to previous changes (11)
  • terraform/envs/gke/dev-prow.tfvars
  • terraform/versions.tf
  • functions/lifecycle-enforcer/go.mod
  • functions/lifecycle-enforcer/decision_test.go
  • terraform/variables.tf
  • terraform/modules/lifecycle/outputs.tf
  • functions/lifecycle-enforcer/decision.go
  • terraform/modules/lifecycle/main.tf
  • scripts/add-ttl-labels.sh
  • terraform/main.tf
  • Makefile

Comment thread functions/lifecycle-enforcer/function.go
Comment thread functions/lifecycle-enforcer/function.go Outdated
@rafabene rafabene force-pushed the HYPERFLEET-1229-lifecycle-enforcement branch from 3564e2b to 86e8910 Compare June 26, 2026 14:17

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
functions/lifecycle-enforcer/function.go (1)

223-235: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Do not drop nodes with unknown creation time.

Lines 223-235 turn an instance inventory failure into partial age data, but the idle rule is only safe when you know the age of every running node. If one fresh node is skipped here, EvaluateCluster() can misclassify the pool as “all nodes >12h” and scale down an active cluster. CWE-703.

Suggested fix
 				if lastErr != nil {
-					slog.Warn("failed to get instance details after retries",
-						"instance", instanceName,
-						"attempts", 3,
-						"error", lastErr,
-					)
-					continue
+					return ClusterInfo{}, fmt.Errorf("getting instance details for %s/%s after retries: %w", igZone, instanceName, lastErr)
 				}
 
 				creationTime, err := time.Parse(time.RFC3339, inst.CreationTimestamp)
 				if err != nil {
-					continue
+					return ClusterInfo{}, fmt.Errorf("parsing creation timestamp for %s: %w", instanceName, err)
 				}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@functions/lifecycle-enforcer/function.go` around lines 223 - 235, The
instance age check in EvaluateCluster must not silently skip nodes when instance
details or CreationTimestamp parsing fails. In the loop around lastErr and
time.Parse, treat unknown creation time as an evaluation failure instead of
continuing, so the cluster is not misclassified as fully idle; update the logic
in the instance inventory handling path to return or mark the cluster
unsafe-to-scale when any node’s age cannot be determined.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@functions/lifecycle-enforcer/README.md`:
- Around line 90-101: The README wording for the idle shutdown flow is
misleading because the enforcer is based on node age, not workload or traffic.
Update the description in the lifecycle enforcer README to refer to node
age/time-to-live eligibility rather than “inactivity,” and keep the mitigation
guidance aligned with that behavior so operators understand that actively used
but old nodes can still be shut down.

---

Duplicate comments:
In `@functions/lifecycle-enforcer/function.go`:
- Around line 223-235: The instance age check in EvaluateCluster must not
silently skip nodes when instance details or CreationTimestamp parsing fails. In
the loop around lastErr and time.Parse, treat unknown creation time as an
evaluation failure instead of continuing, so the cluster is not misclassified as
fully idle; update the logic in the instance inventory handling path to return
or mark the cluster unsafe-to-scale when any node’s age cannot be determined.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3777ff06-b90e-409a-afc2-ef0f9b692fd0

📥 Commits

Reviewing files that changed from the base of the PR and between 3564e2b and 86e8910.

⛔ Files ignored due to path filters (1)
  • functions/lifecycle-enforcer/go.sum is excluded by !**/*.sum, !**/go.sum
📒 Files selected for processing (19)
  • AGENTS.md
  • Makefile
  • README.md
  • functions/lifecycle-enforcer/README.md
  • functions/lifecycle-enforcer/decision.go
  • functions/lifecycle-enforcer/decision_test.go
  • functions/lifecycle-enforcer/function.go
  • functions/lifecycle-enforcer/go.mod
  • scripts/add-ttl-labels.sh
  • terraform/README.md
  • terraform/envs/gke/dev-prow.tfvars
  • terraform/envs/gke/dev.tfvars.example
  • terraform/main.tf
  • terraform/modules/lifecycle/.gitignore
  • terraform/modules/lifecycle/main.tf
  • terraform/modules/lifecycle/outputs.tf
  • terraform/modules/lifecycle/variables.tf
  • terraform/variables.tf
  • terraform/versions.tf
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)
✅ Files skipped from review due to trivial changes (4)
  • terraform/modules/lifecycle/.gitignore
  • terraform/README.md
  • AGENTS.md
  • README.md
🚧 Files skipped from review as they are similar to previous changes (13)
  • terraform/modules/lifecycle/variables.tf
  • terraform/envs/gke/dev-prow.tfvars
  • functions/lifecycle-enforcer/go.mod
  • terraform/envs/gke/dev.tfvars.example
  • terraform/modules/lifecycle/outputs.tf
  • terraform/versions.tf
  • terraform/main.tf
  • terraform/modules/lifecycle/main.tf
  • scripts/add-ttl-labels.sh
  • Makefile
  • terraform/variables.tf
  • functions/lifecycle-enforcer/decision.go
  • functions/lifecycle-enforcer/decision_test.go

Comment thread functions/lifecycle-enforcer/README.md Outdated
…rcement

Cloud Function (Go) that runs hourly via Cloud Scheduler to enforce
idle shutdown (>12h), TTL expiration (48h grace), and missing owner
(7 day grace) rules on developer GKE clusters. Includes Terraform
module, Makefile targets, migration script, and dry-run mode.
@rafabene rafabene force-pushed the HYPERFLEET-1229-lifecycle-enforcement branch from 86e8910 to ea5643e Compare June 26, 2026 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants