Skip to content

Use AWS Secrets Manager for privileged CI config#57

Open
Kravalg wants to merge 33 commits into
mainfrom
codex/issue20-pulumi-esc
Open

Use AWS Secrets Manager for privileged CI config#57
Kravalg wants to merge 33 commits into
mainfrom
codex/issue20-pulumi-esc

Conversation

@Kravalg

@Kravalg Kravalg commented May 24, 2026

Copy link
Copy Markdown
Contributor

Pull Request

Description

Replaces the Pulumi Cloud/ESC CI configuration path with an AWS-only path:

GitHub Actions -> GitHub OIDC -> AWS Secrets Manager -> Pulumi CLI with S3 backend + AWS KMS secrets provider

The old .github/actions/load-esc-ci-env action and .github/ci/pulumi-esc.json are removed. Privileged workflows now call .github/actions/load-aws-ci-env, which assumes a GitHub OIDC config-read role, reads one fixed AWS Secrets Manager JSON secret, validates required keys without printing values, and exports the role/backend/stack values needed by later AWS OIDC steps.

Pulumi now creates the AWS-side config foundation directly: Secrets Manager secret containers and one GitHubCiConfigRead-* role per fixed CI suffix (test-pr, test, prod-preview, prod). Production preview and production apply use separate config-read role variables so the protected prod apply secret is not readable through the prod-preview path.

Deployment apply paths are saved-plan-only. The previous test-account direct make pulumi-up fallback after a saved-plan KMS decrypt failure has been removed, scripts/run_pulumi_command.py rejects direct pulumi up whenever GITHUB_ACTIONS=true, and same-repo PR jobs now fail fast if AWS_TEST_PR_CI_CONFIG_ROLE_ARN is missing instead of falling back to the main AWS_TEST_CI_CONFIG_ROLE_ARN.

Related Issues

Fixes #20.
Refs #49, #50, #52, #53, #54, #55, #56, #58.

The operations-alert issues are legacy duplicate candidates. They should stay open until the AWS-backed triage creates or updates a canonical fingerprinted issue and SRE supplies the required confirmation reference for the protected reconcile workflow.

Manual Secure Follow-Up

Follow docs/aws-secrets-manager-ci-cutover.md for the AWS-only setup.

Required non-secret GitHub repository variables:

  • AWS_TEST_REGION
  • AWS_TEST_PR_CI_CONFIG_ROLE_ARN
  • AWS_TEST_CI_CONFIG_ROLE_ARN
  • AWS_PROD_REGION
  • AWS_PROD_PREVIEW_CI_CONFIG_ROLE_ARN
  • AWS_PROD_CI_CONFIG_ROLE_ARN

Required AWS Secrets Manager JSON secret IDs:

  • /bootstrap-infrastructure/ci/test-pr
  • /bootstrap-infrastructure/ci/test
  • /bootstrap-infrastructure/ci/prod-preview
  • /bootstrap-infrastructure/ci/prod

Do not paste secret JSON values into GitHub, chat, logs, docs, or Pulumi config. Use put-secret-value from a private local file and verify with describe-secret, not get-secret-value.

Current External Setup State

  • The six non-secret GitHub repository variables for the AWS config-read roles are populated.
  • Privileged PR jobs reach GitHub OIDC and fail at AWS authorization for the test role: Could not assume role with OIDC: Not authorized to perform sts:AssumeRoleWithWebIdentity for arn:aws:iam::891377212104:role/GitHubCiConfigRead-bootstrap-infrastructure-test-pr.
  • The local AWS CLI currently has only the default profile, and after unsetting inherited AWS env vars it resolves to production account 933245420672 as arn:aws:iam::933245420672:user/codex; a separate test-account profile for account 891377212104 is still needed before fixing test-account resources from this workstation.
  • The current shell process still has inherited AWS credential env vars; values were not recorded. Use unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN AWS_PROFILE before profile-based verification.
  • AWS MCP production identity succeeds for account 933245420672 as arn:aws:iam::933245420672:user/codex.
  • AWS MCP production metadata currently shows the new prod AWS-only resources are not present before stack apply: /bootstrap-infrastructure/ci/prod-preview and /bootstrap-infrastructure/ci/prod return ResourceNotFoundException; the corresponding GitHubCiConfigRead-* roles are not present yet.

How Has This Been Tested?

Latest local validation on this branch:

  • make test-unit (389 passed, 100% coverage)
  • uv run pytest tests/pulumi/test_project_structure.py tests/pulumi/test_delivery_contracts.py -q (58 passed)
  • uv run pytest tests/pulumi/test_delivery_contracts.py::test_pr_comment_workflows_gate_prod_after_successful_test_apply tests/pulumi/test_delivery_contracts.py::test_prod_workflow_requires_successful_test_deploy_for_same_sha -q (2 passed)
  • uv run pytest tests/unit/test_components.py::test_ci_configuration_requires_github_oidc_provider tests/unit/test_mutation_targets.py::test_mutation_target_ci_config_validation_and_lookup_helpers -q (2 passed)
  • uv run pytest tests/pulumi/test_project_structure.py::test_docs_cover_current_testing_and_guardrail_guidance tests/pulumi/test_project_structure.py::test_issue20_cutover_manual_is_secret_safe_and_actionable tests/pulumi/test_project_structure.py::test_issue20_closeout_evidence_tracks_external_manual_steps -q (3 passed)
  • uv run pytest tests/pulumi/test_delivery_contracts.py::test_aws_ci_loader_reads_secrets_manager_without_pulumi_cloud tests/pulumi/test_delivery_contracts.py::test_multi_account_workflows_use_fixed_aws_ci_config_contracts tests/pulumi/test_delivery_contracts.py::test_multi_account_environment_docs_are_explicit tests/unit/test_script_entrypoints.py::test_run_up_plan_stack_rejects_plan_decrypt_without_direct_apply -q (4 passed)
  • uv run pytest tests/unit/test_script_entrypoints.py::test_run_up_stack_does_not_retry_after_lock tests/unit/test_script_entrypoints.py::test_run_up_stack_rejects_direct_apply_in_github_actions tests/unit/test_script_entrypoints.py::test_run_pulumi_command_unhandled_apply_failures_return_status tests/unit/test_script_entrypoints.py::test_run_pulumi_command_dispatch_propagates_apply_failures -q (4 passed)
  • uv run pytest tests/pulumi/test_delivery_contracts.py::test_makefile_keeps_pulumi_guardrails_secret_safe tests/pulumi/test_project_structure.py::test_issue20_cutover_manual_is_secret_safe_and_actionable -q (2 passed)
  • uv run pytest tests/pulumi/test_ci_guardrails.py::test_preview_guardrail_workflow_requires_preview_diff_and_iam_jobs tests/pulumi/test_ci_guardrails.py::test_well_architected_evidence_workflow_uploads_enforced_reports tests/pulumi/test_delivery_contracts.py::test_multi_account_workflows_use_fixed_aws_ci_config_contracts -q (3 passed)
  • uv run pytest tests/pulumi/test_project_structure.py::test_issue20_closeout_evidence_tracks_external_manual_steps -q (1 passed)
  • uv run ruff check scripts/run_pulumi_command.py tests/unit/test_script_entrypoints.py tests/pulumi/test_delivery_contracts.py tests/pulumi/test_project_structure.py tests/pulumi/test_ci_guardrails.py
  • make test-actionlint
  • make test-yaml
  • make test-secrets (latest commit scanned; no leaks found)
  • qlty check
  • git diff --check

bmalph workflow status:

  • bmalph doctor ran and reported the project is not initialized for bmalph/BMAD artifacts.
  • bmalph init --dry-run --platform codex ran to inspect the generated artifact set without adding unrelated workflow scaffolding to this PR.

Checklist

  • My code follows the code style of this project.
  • I have performed a self-review of my code.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • Local tests passed; external AWS/GitHub setup blockers are called out above.
  • You have only one commit. Repository rules allow squash merge into main; this branch is intentionally not force-squashed while review/check state is active.

@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@coderabbitai

coderabbitai Bot commented May 24, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a composite GitHub Action that resolves/authenticates to Pulumi Cloud, loads and validates Pulumi ESC environmentVariables, exports AWS/Pulumi outputs for OIDC assume-role steps, implements deterministic operations-alert fingerprinting and reconcile tooling, rewires workflows to consume ESC outputs, updates IAM trust-policy generation, tests, and documentation.

Changes

Pulumi ESC-Driven CI Configuration

Layer / File(s) Summary
ESC loader action and outputs
.github/actions/load-esc-ci-env/action.yml
Composite action resolves Pulumi ESC target (from .github/ci/pulumi-esc.json or override), authenticates to Pulumi Cloud, loads ESC environmentVariables, validates required keys/patterns, and emits AWS/Pulumi outputs (steps.esc.outputs.*).
Committed ESC target config
.github/ci/pulumi-esc.json
Adds committed Pulumi CI target configuration (organization: vilnacrm-org, project: bootstrap-infrastructure) used by the ESC loader action.
CI environment validator & unit tests
scripts/validate_ci_environment.py, tests/unit/test_validate_ci_environment.py
CLI validator parses --required-keys, enforces per-key regex/shape validators (account ID, region, role ARN, backend/secrets provider, stack lists, SNS topic, metadata-only names), writes AWS_DEFAULT_REGION to GITHUB_ENV when applicable, rejects newline injection, and is thoroughly tested.
Operations alert fingerprinting & triage script
scripts/operations_alert_triage.py, tests/unit/test_operations_alert_triage.py
Sanitizes nested SNS/SQS/EventBridge payloads, computes deterministic SHA-256 batch fingerprints from stable metadata, renders sanitized issue bodies, and writes body+fingerprint files for workflow consumption and dedupe.
Operations alert workflow integration
.github/workflows/operations-alert-triage.yml
Workflow loads triage ESC config via ESC loader, assumes triage role via OIDC using ESC outputs, delegates rendering/fingerprinting to the script, searches issues by fingerprint, comments or creates accordingly, and deletes SQS messages only after success.
Legacy reconcile & cleanup workflows
.github/workflows/operations-alert-reconcile.yml, .github/workflows/github-environment-legacy-cleanup.yml
Manual reconcile workflow to close legacy pre-fingerprint issues as duplicates of a canonical fingerprinted issue; manual cleanup workflow to list/delete allowlisted legacy GitHub Environment variables with dry-run and confirmation gating.
Preview / test / drift workflow rewires
.github/workflows/pulumi-pr-guardrails.yml, .github/workflows/pulumi-test-deploy.yml, .github/workflows/nightly-guardrails.yml, .github/workflows/well-architected-evidence.yml, tests/pulumi/test_ci_guardrails.py
PR guardrails, test deploy, nightly drift, and evidence jobs now load fixed ESC environments per-purpose via the ESC loader, grant id-token permissions as needed, consume steps.esc.outputs.* for OIDC credential configuration, remove privileged job-level env wiring, and emit ESC-derived evidence.
Production workflows & command-runner
.github/workflows/pulumi-prod.yml, .github/workflows/pulumi-pr-command-runner.yml
Prod preview/IAM validation/apply/post-apply drift jobs load prod-preview/prod ESC environments with purpose-specific required-keys, use ESC outputs for OIDC role assumption/account/region allowlists, and keep production apply gated by protected prod GitHub Environment approval.
Pulumi automation IAM trust policy
pulumi/infra/automation.py, tests/unit/test_components.py
Trust-policy builders accept branch_name and produce branch/workflow-scoped token.subjects and explicit job_workflow_ref allowlists; operations-alert triage role is bound to a specific workflow ref; production apply retains repo:...:environment:prod subject for approval gating.
Structural and unit tests updated
tests/pulumi/*, tests/unit/*
Structural tests updated to discover ESC loader steps and required-keys assertions, ensure AWS credential inputs come from steps.esc.outputs.*, skip SHA pin checks for local ./ actions, and add unit tests covering validator, alert triage, and IAM subject behaviors.
Documentation and specs
README.md, docs/*, specs/issue-20-pulumi-esc-ci-config/*, .github/github-actions-secrets.md
Extensive docs/specs updated to require fixed Pulumi ESC environments (test-pr, test, prod-preview, prod), list required ESC environmentVariables, describe OIDC-first flows, migration checklist, operations-alert fingerprinting/dedupe semantics, and readiness artifacts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

Suggested labels

ci, pulumi, aws, documentation, needs-admin, external-control

Suggested reviewers

  • dmytrocraft
  • cubic-dev-ai

"🐰 I hopped through ESC fields with glee,
I moved secrets where they ought to be.
Fingerprints hush duplicate alarm,
Branch-scoped trust keeps production calm.
Cheers — a carrot for the CI tree!"

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/issue20-pulumi-esc

@qltysh

qltysh Bot commented May 24, 2026

Copy link
Copy Markdown

All good ✅

Comment thread scripts/operations_alert_triage.py Outdated
Comment thread scripts/validate_ci_environment.py Outdated
Comment thread scripts/validate_ci_environment.py Outdated
Comment thread scripts/validate_ci_environment.py Outdated
Comment thread scripts/validate_ci_environment.py Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 issues found across 31 files

Confidence score: 2/5

  • High merge risk due to a concrete security concern in pulumi/infra/automation.py: broadened prod role trust to branch/PR subjects could bypass environment-based production approval controls.
  • scripts/validate_ci_environment.py has two user-impacting validation gaps: newline handling when writing $GITHUB_ENV (environment-file injection risk) and allowing an empty --required-keys set to pass without actually validating CI config.
  • Given the high severities (9/10 and 7/10) and strong confidence signals, this does not look safe to merge until these are addressed; the docs issues are lower risk but still worth fixing for operator correctness.
  • Pay close attention to pulumi/infra/automation.py, scripts/validate_ci_environment.py, docs/ci-guardrails.md, docs/github-actions-secrets.md - security trust boundaries, CI validation hardening, and preventing misleading/broken guidance.

Tip: cubic can generate docs of your entire codebase and keep them up to date. Try it here.

Re-trigger cubic

Comment thread pulumi/infra/automation.py Outdated
Comment thread scripts/validate_ci_environment.py
Comment thread scripts/validate_ci_environment.py
Comment thread docs/ci-guardrails.md
Comment thread docs/github-actions-secrets.md
@dmytrocraft dmytrocraft force-pushed the codex/issue20-pulumi-esc branch from 5389d3e to 84bf011 Compare May 24, 2026 16:12

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pulumi/infra/automation.py (1)

592-645: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Require environment:prod as the only prod subject.

Line 601 still whitelists repo:...:ref:refs/heads/{branch_name} for the production role. Combined with the workflow refs on Lines 623-645, any allowed main-branch workflow can satisfy the trust policy without ever entering the protected GitHub prod environment, which turns the manual approval gate into an optional path instead of an enforcement point.

🔐 Suggested tightening
 def _automation_assume_role_policy(
     oidc_provider_arn: str,
     org: str,
     repo_name: str,
     production_environment: str,
     branch_name: str,
 ) -> str:
     """Build the GitHub OIDC trust policy for fixed workflow automation."""
     workflow_prefix = f"{org}/{repo_name}/.github/workflows"
-    subjects = [f"repo:{org}/{repo_name}:ref:refs/heads/{branch_name}"]
     if production_environment == "prod":
-        subjects.append(f"repo:{org}/{repo_name}:environment:{production_environment}")
+        subjects = [
+            f"repo:{org}/{repo_name}:environment:{production_environment}"
+        ]
     else:
-        subjects.append(f"repo:{org}/{repo_name}:pull_request")
+        subjects = [
+            f"repo:{org}/{repo_name}:ref:refs/heads/{branch_name}",
+            f"repo:{org}/{repo_name}:pull_request",
+        ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pulumi/infra/automation.py` around lines 592 - 645, The trust policy in
_automation_assume_role_policy currently always includes the branch ref subject
(repo:...:ref:refs/heads/{branch_name}), which allows workflows on the main
branch to assume the prod role without entering the protected GitHub
environment; change the subject construction so that when production_environment
== "prod" the subjects list contains only the environment subject
(repo:{org}/{repo_name}:environment:prod) and does not include the branch ref,
otherwise keep the existing non-prod subjects (ref and pull_request); update the
subjects variable used in the "token.actions.githubusercontent.com:sub"
condition accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/operations-alert-reconcile.yml:
- Around line 33-89: The workflow calls gh commands (functions: gh issue view
and gh issue close) without an explicit repository context so they can fail;
update every gh invocation in the script (the gh issue view calls that set
canonical_body, state, title, body and the gh issue close call) to include
--repo "${GH_REPO}" (or alternately add an actions/checkout step and export
GH_REPO before use) so the GitHub CLI has an explicit repository context at
runtime.

In `@scripts/operations_alert_triage.py`:
- Around line 138-142: main() assumes the file at args.alerts_json parses to a
dict and currently lets json errors or non-object top-level JSON raise; wrap the
JSON load in a try/except to catch json.JSONDecodeError and TypeError, validate
that the parsed alerts is a dict (isinstance(alerts, dict)), and if either the
parse fails or the result is not a dict, log a clear error mentioning
args.alerts_json and return 1 instead of raising; keep using
alerts_fingerprint(alerts) only after validation.

---

Outside diff comments:
In `@pulumi/infra/automation.py`:
- Around line 592-645: The trust policy in _automation_assume_role_policy
currently always includes the branch ref subject
(repo:...:ref:refs/heads/{branch_name}), which allows workflows on the main
branch to assume the prod role without entering the protected GitHub
environment; change the subject construction so that when production_environment
== "prod" the subjects list contains only the environment subject
(repo:{org}/{repo_name}:environment:prod) and does not include the branch ref,
otherwise keep the existing non-prod subjects (ref and pull_request); update the
subjects variable used in the "token.actions.githubusercontent.com:sub"
condition accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 73e5b8ff-0988-4aad-b921-7cfb82aabba5

📥 Commits

Reviewing files that changed from the base of the PR and between 84bf011 and 760a838.

📒 Files selected for processing (34)
  • .github/actions/load-esc-ci-env/action.yml
  • .github/ci/pulumi-esc.json
  • .github/github-actions-secrets.md
  • .github/workflows/nightly-guardrails.yml
  • .github/workflows/operations-alert-reconcile.yml
  • .github/workflows/operations-alert-triage.yml
  • .github/workflows/pulumi-pr-command-runner.yml
  • .github/workflows/pulumi-pr-guardrails.yml
  • .github/workflows/pulumi-prod.yml
  • .github/workflows/pulumi-test-deploy.yml
  • .github/workflows/well-architected-evidence.yml
  • README.md
  • docs/README.md
  • docs/alert-routing-evidence.md
  • docs/ci-architecture.md
  • docs/ci-guardrails.md
  • docs/github-actions-secrets.md
  • docs/security-operating-evidence.md
  • docs/sre-operations.md
  • pulumi/infra/automation.py
  • scripts/operations_alert_triage.py
  • scripts/validate_ci_environment.py
  • specs/issue-18-multi-account-test-prod-pulumi-environments/epics.md
  • specs/issue-18-multi-account-test-prod-pulumi-environments/prd.md
  • specs/issue-20-pulumi-esc-ci-config/architecture.md
  • specs/issue-20-pulumi-esc-ci-config/epics.md
  • specs/issue-20-pulumi-esc-ci-config/implementation-readiness-report.md
  • specs/issue-20-pulumi-esc-ci-config/prd.md
  • tests/pulumi/test_ci_guardrails.py
  • tests/pulumi/test_delivery_contracts.py
  • tests/pulumi/test_project_structure.py
  • tests/unit/test_components.py
  • tests/unit/test_operations_alert_triage.py
  • tests/unit/test_validate_ci_environment.py
✅ Files skipped from review due to trivial changes (4)
  • README.md
  • specs/issue-20-pulumi-esc-ci-config/implementation-readiness-report.md
  • docs/sre-operations.md
  • specs/issue-20-pulumi-esc-ci-config/epics.md

Comment thread .github/workflows/operations-alert-reconcile.yml
Comment thread scripts/operations_alert_triage.py

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 10 files (changes from recent commits).

Requires human review: Auto-approval blocked by 2 unresolved issues from previous reviews.

Re-trigger cubic

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 5 files (changes from recent commits).

Re-trigger cubic

@dmytrocraft dmytrocraft force-pushed the codex/issue20-pulumi-esc branch 3 times, most recently from 33b2fea to 378c21e Compare May 24, 2026 18:33

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/ci-guardrails.md (1)

516-527: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Complete the post-cutover checklist.

The rollout steps stop short of the actions that actually finish this migration: applying the Pulumi stacks to stamp the tightened IAM trust, running the legacy environment-variable cleanup, and deleting the temporary GH_ENVIRONMENT_ADMIN_TOKEN afterward. As written, an operator can follow this section and still leave stale AWS trust or stale privileged GitHub config behind.

📝 Suggested checklist update
 1. create the GitHub OIDC IAM roles in AWS
 2. create and populate the four Pulumi ESC environments listed above
 3. configure Pulumi Cloud OIDC for this repository and organization
-4. create only the protected `prod` GitHub Environment for production approval
-5. enable required reviewers and branch restrictions on `prod`
-6. mark the required PR checks in GitHub branch protection
-7. decide whether production repositories want stricter stack lists or narrower
+4. apply the Pulumi test/prod stacks so the updated IAM trust policies converge in AWS
+5. run the legacy GitHub Environment variable cleanup workflow (dry-run, then confirmed apply)
+6. delete the temporary `GH_ENVIRONMENT_ADMIN_TOKEN` secret after cleanup
+7. create only the protected `prod` GitHub Environment for production approval
+8. enable required reviewers and branch restrictions on `prod`
+9. mark the required PR checks in GitHub branch protection
+10. decide whether production repositories want stricter stack lists or narrower
    IAM role scopes than the template defaults

As per coding guidelines, docs/**: Update docs/ whenever the developer workflow, CI surface, or credential contract changes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/ci-guardrails.md` around lines 516 - 527, Update the "Manual maintainer
follow-up" section to include the missing post-cutover steps: explicitly
instruct maintainers to apply the Pulumi stacks to enforce the new IAM trust
(reference "Pulumi ESC environments" and "apply the Pulumi stacks"), run the
legacy environment-variable cleanup, and delete the temporary
GH_ENVIRONMENT_ADMIN_TOKEN; also add a short confirmation step to verify no
stale AWS trust or privileged GitHub config remains (referencing the protected
`prod` GitHub Environment and branch protection items) so operators have a
complete checklist to finish the migration.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/operations-alert-reconcile.yml:
- Around line 59-91: The loop over LEGACY_ISSUES should first parse, trim and
deduplicate tokens and error if no valid issue IDs were produced; change the
parsing around normalized/raw_issue/issue to skip empty/whitespace-only tokens,
validate and add each numeric issue to a deduplicated collection (e.g. an
associative array or temporary list keyed by issue) and after parsing fail with
a non-zero exit if the deduped collection is empty, then iterate over the
deduped issue list (instead of the raw tokens) for the existing validation and
gh issue close logic (keeping checks that issue != canonical, state is OPEN,
title matches, and no fingerprint marker).

---

Outside diff comments:
In `@docs/ci-guardrails.md`:
- Around line 516-527: Update the "Manual maintainer follow-up" section to
include the missing post-cutover steps: explicitly instruct maintainers to apply
the Pulumi stacks to enforce the new IAM trust (reference "Pulumi ESC
environments" and "apply the Pulumi stacks"), run the legacy
environment-variable cleanup, and delete the temporary
GH_ENVIRONMENT_ADMIN_TOKEN; also add a short confirmation step to verify no
stale AWS trust or privileged GitHub config remains (referencing the protected
`prod` GitHub Environment and branch protection items) so operators have a
complete checklist to finish the migration.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e368df4c-8673-4013-b117-26c62db7549a

📥 Commits

Reviewing files that changed from the base of the PR and between 760a838 and 378c21e.

📒 Files selected for processing (35)
  • .github/actions/load-esc-ci-env/action.yml
  • .github/ci/pulumi-esc.json
  • .github/github-actions-secrets.md
  • .github/workflows/github-environment-legacy-cleanup.yml
  • .github/workflows/nightly-guardrails.yml
  • .github/workflows/operations-alert-reconcile.yml
  • .github/workflows/operations-alert-triage.yml
  • .github/workflows/pulumi-pr-command-runner.yml
  • .github/workflows/pulumi-pr-guardrails.yml
  • .github/workflows/pulumi-prod.yml
  • .github/workflows/pulumi-test-deploy.yml
  • .github/workflows/well-architected-evidence.yml
  • README.md
  • docs/README.md
  • docs/alert-routing-evidence.md
  • docs/ci-architecture.md
  • docs/ci-guardrails.md
  • docs/github-actions-secrets.md
  • docs/security-operating-evidence.md
  • docs/sre-operations.md
  • pulumi/infra/automation.py
  • scripts/operations_alert_triage.py
  • scripts/validate_ci_environment.py
  • specs/issue-18-multi-account-test-prod-pulumi-environments/epics.md
  • specs/issue-18-multi-account-test-prod-pulumi-environments/prd.md
  • specs/issue-20-pulumi-esc-ci-config/architecture.md
  • specs/issue-20-pulumi-esc-ci-config/epics.md
  • specs/issue-20-pulumi-esc-ci-config/implementation-readiness-report.md
  • specs/issue-20-pulumi-esc-ci-config/prd.md
  • tests/pulumi/test_ci_guardrails.py
  • tests/pulumi/test_delivery_contracts.py
  • tests/pulumi/test_project_structure.py
  • tests/unit/test_components.py
  • tests/unit/test_operations_alert_triage.py
  • tests/unit/test_validate_ci_environment.py
✅ Files skipped from review due to trivial changes (6)
  • .github/ci/pulumi-esc.json
  • specs/issue-20-pulumi-esc-ci-config/epics.md
  • specs/issue-20-pulumi-esc-ci-config/implementation-readiness-report.md
  • README.md
  • docs/sre-operations.md
  • docs/alert-routing-evidence.md

Comment thread .github/workflows/operations-alert-reconcile.yml
@dmytrocraft dmytrocraft force-pushed the codex/issue20-pulumi-esc branch from 378c21e to 647c160 Compare May 24, 2026 18:48
@Kravalg

Kravalg commented May 24, 2026

Copy link
Copy Markdown
Contributor Author

All currently visible review threads have been addressed in 647c160768b118e710207577d5603e488c15875f.

Specifically:

  • operations-alert-reconcile.yml now parses, deduplicates, and rejects empty legacy issue input before closing anything.
  • docs/ci-guardrails.md now includes the post-cutover Pulumi stack apply, legacy GitHub Environment cleanup, temporary token deletion, and stale-trust/config verification steps.
  • Focused validation plus project-structure, Ruff, YAML, actionlint, and diff checks passed locally.

@coderabbitai please re-review the latest head.

@coderabbitai

coderabbitai Bot commented May 24, 2026

Copy link
Copy Markdown

@Kravalg Thanks for the detailed summary of the addressed threads. I'll kick off a full re-review against the latest head now.

✅ Actions performed

Full review triggered.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
.github/workflows/operations-alert-triage.yml (1)

123-129: ⚡ Quick win

Use uv run for the triage Python CLI invocation.

Switching this call to uv run keeps interpreter/dependency resolution consistent with the repo’s Python command contract.

♻️ Proposed change
-          python3 scripts/operations_alert_triage.py \
+          uv run python scripts/operations_alert_triage.py \
             --alerts-json "${alerts_json}" \
             --queue-name "${OPERATIONS_ALERT_QUEUE_NAME}" \
             --account-id "${AWS_ACCOUNT_ID}" \
             --region "${AWS_REGION}" \
             --body-file "${body_file}" \
             --fingerprint-file "${fingerprint_file}"

As per coding guidelines, "Use uv run ... for Python CLI commands instead of invoking tools directly from the global environment."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/operations-alert-triage.yml around lines 123 - 129,
Replace the direct python3 invocation of the triage CLI with the repository
Python runner; specifically change the call that runs operations_alert_triage.py
(the block invoking "python3 scripts/operations_alert_triage.py ...") to use "uv
run" so the command becomes "uv run scripts/operations_alert_triage.py ..."
ensuring interpreter and dependency resolution follow the repo's Python command
contract.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In @.github/workflows/operations-alert-triage.yml:
- Around line 123-129: Replace the direct python3 invocation of the triage CLI
with the repository Python runner; specifically change the call that runs
operations_alert_triage.py (the block invoking "python3
scripts/operations_alert_triage.py ...") to use "uv run" so the command becomes
"uv run scripts/operations_alert_triage.py ..." ensuring interpreter and
dependency resolution follow the repo's Python command contract.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fb1ae087-5e62-43d9-846c-f922c467a76e

📥 Commits

Reviewing files that changed from the base of the PR and between d301569 and 647c160.

📒 Files selected for processing (35)
  • .github/actions/load-esc-ci-env/action.yml
  • .github/ci/pulumi-esc.json
  • .github/github-actions-secrets.md
  • .github/workflows/github-environment-legacy-cleanup.yml
  • .github/workflows/nightly-guardrails.yml
  • .github/workflows/operations-alert-reconcile.yml
  • .github/workflows/operations-alert-triage.yml
  • .github/workflows/pulumi-pr-command-runner.yml
  • .github/workflows/pulumi-pr-guardrails.yml
  • .github/workflows/pulumi-prod.yml
  • .github/workflows/pulumi-test-deploy.yml
  • .github/workflows/well-architected-evidence.yml
  • README.md
  • docs/README.md
  • docs/alert-routing-evidence.md
  • docs/ci-architecture.md
  • docs/ci-guardrails.md
  • docs/github-actions-secrets.md
  • docs/security-operating-evidence.md
  • docs/sre-operations.md
  • pulumi/infra/automation.py
  • scripts/operations_alert_triage.py
  • scripts/validate_ci_environment.py
  • specs/issue-18-multi-account-test-prod-pulumi-environments/epics.md
  • specs/issue-18-multi-account-test-prod-pulumi-environments/prd.md
  • specs/issue-20-pulumi-esc-ci-config/architecture.md
  • specs/issue-20-pulumi-esc-ci-config/epics.md
  • specs/issue-20-pulumi-esc-ci-config/implementation-readiness-report.md
  • specs/issue-20-pulumi-esc-ci-config/prd.md
  • tests/pulumi/test_ci_guardrails.py
  • tests/pulumi/test_delivery_contracts.py
  • tests/pulumi/test_project_structure.py
  • tests/unit/test_components.py
  • tests/unit/test_operations_alert_triage.py
  • tests/unit/test_validate_ci_environment.py

@dmytrocraft dmytrocraft force-pushed the codex/issue20-pulumi-esc branch 2 times, most recently from b638d1e to a95ab85 Compare May 24, 2026 19:57

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
.github/actions/load-esc-ci-env/action.yml (1)

95-97: 💤 Low value

Consider using uv run for the validation script.

The coding guidelines specify using uv run ... for Python CLI commands. The validation script is invoked directly with python3.

♻️ Suggested change
-        python3 scripts/validate_ci_environment.py \
+        uv run python scripts/validate_ci_environment.py \
           --purpose "${{ inputs.purpose }}" \
           --required-keys "${{ inputs.required-keys }}"

As per coding guidelines: **/*.{py,sh}: Use uv run ... for Python CLI commands instead of invoking tools directly from the global environment.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/actions/load-esc-ci-env/action.yml around lines 95 - 97, Replace the
direct python3 invocation of the validation script with the project's task
runner by calling "uv run" for the CLI command: change the step that runs
"python3 scripts/validate_ci_environment.py --purpose \"${{ inputs.purpose }}\"
--required-keys \"${{ inputs.required-keys }}\"" to use "uv run" so the script
is executed via the project's uv task runner; update the command that references
scripts/validate_ci_environment.py and the inputs purpose/required-keys
accordingly.
tests/pulumi/test_project_structure.py (1)

316-366: ⚡ Quick win

Normalize trigger access instead of indexing workflow[True] directly.

Line 316 and Line 358 assume PyYAML always converts on to True. If a workflow uses quoted "on", these tests fail for the wrong reason. Use the same normalized fallback used elsewhere in this suite.

Proposed fix
-    assert "workflow_dispatch" in reconcile_workflow[True]  # nosec B101
+    reconcile_triggers = reconcile_workflow.get("on", reconcile_workflow.get(True, {}))
+    assert "workflow_dispatch" in reconcile_triggers  # nosec B101
@@
-    assert "workflow_dispatch" in cleanup_workflow[True]  # nosec B101
+    cleanup_triggers = cleanup_workflow.get("on", cleanup_workflow.get(True, {}))
+    assert "workflow_dispatch" in cleanup_triggers  # nosec B101
@@
-    assert cleanup_workflow[True]["workflow_dispatch"]["inputs"]["dry_run"][  # nosec B101
+    assert cleanup_triggers["workflow_dispatch"]["inputs"]["dry_run"][  # nosec B101
         "default"
     ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/pulumi/test_project_structure.py` around lines 316 - 366, The tests
index the parsed YAML with True (reconcile_workflow[True],
cleanup_workflow[True]) which breaks when PyYAML preserves the "on" key as the
string "on"; update those assertions to use the same normalized fallback used
elsewhere: read the trigger via reconcile_workflow.get("on",
reconcile_workflow.get(True)) and cleanup_workflow.get("on",
cleanup_workflow.get(True)) (use that value for checks like workflow_dispatch
and inputs), replacing direct [True] indexing on reconcile_workflow and
cleanup_workflow.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/operations_alert_triage.py`:
- Around line 58-66: The code assumes incoming "Messages" is a list of dicts and
directly accesses message.get(...) which will crash for non-list or non-dict
payloads; update the message-shape handling so you first validate that Messages
is a list and iterate only over items that are dicts (skip or coerce others),
and ensure local variables like message, attributes, sns, and event are always
dicts (default to {} when missing) before calling .get; apply this defensive
pattern to the blocks building the return dict (references: variables message,
attributes, sns, event and the code paths around lines 58-66 and the regions
noted 90-97 and 105-129) so malformed payloads are skipped or safely handled
instead of raising.

---

Nitpick comments:
In @.github/actions/load-esc-ci-env/action.yml:
- Around line 95-97: Replace the direct python3 invocation of the validation
script with the project's task runner by calling "uv run" for the CLI command:
change the step that runs "python3 scripts/validate_ci_environment.py --purpose
\"${{ inputs.purpose }}\" --required-keys \"${{ inputs.required-keys }}\"" to
use "uv run" so the script is executed via the project's uv task runner; update
the command that references scripts/validate_ci_environment.py and the inputs
purpose/required-keys accordingly.

In `@tests/pulumi/test_project_structure.py`:
- Around line 316-366: The tests index the parsed YAML with True
(reconcile_workflow[True], cleanup_workflow[True]) which breaks when PyYAML
preserves the "on" key as the string "on"; update those assertions to use the
same normalized fallback used elsewhere: read the trigger via
reconcile_workflow.get("on", reconcile_workflow.get(True)) and
cleanup_workflow.get("on", cleanup_workflow.get(True)) (use that value for
checks like workflow_dispatch and inputs), replacing direct [True] indexing on
reconcile_workflow and cleanup_workflow.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9cfc1992-ca5f-4e88-9b7d-9838e3472f10

📥 Commits

Reviewing files that changed from the base of the PR and between 647c160 and a95ab85.

📒 Files selected for processing (35)
  • .github/actions/load-esc-ci-env/action.yml
  • .github/ci/pulumi-esc.json
  • .github/github-actions-secrets.md
  • .github/workflows/github-environment-legacy-cleanup.yml
  • .github/workflows/nightly-guardrails.yml
  • .github/workflows/operations-alert-reconcile.yml
  • .github/workflows/operations-alert-triage.yml
  • .github/workflows/pulumi-pr-command-runner.yml
  • .github/workflows/pulumi-pr-guardrails.yml
  • .github/workflows/pulumi-prod.yml
  • .github/workflows/pulumi-test-deploy.yml
  • .github/workflows/well-architected-evidence.yml
  • README.md
  • docs/README.md
  • docs/alert-routing-evidence.md
  • docs/ci-architecture.md
  • docs/ci-guardrails.md
  • docs/github-actions-secrets.md
  • docs/security-operating-evidence.md
  • docs/sre-operations.md
  • pulumi/infra/automation.py
  • scripts/operations_alert_triage.py
  • scripts/validate_ci_environment.py
  • specs/issue-18-multi-account-test-prod-pulumi-environments/epics.md
  • specs/issue-18-multi-account-test-prod-pulumi-environments/prd.md
  • specs/issue-20-pulumi-esc-ci-config/architecture.md
  • specs/issue-20-pulumi-esc-ci-config/epics.md
  • specs/issue-20-pulumi-esc-ci-config/implementation-readiness-report.md
  • specs/issue-20-pulumi-esc-ci-config/prd.md
  • tests/pulumi/test_ci_guardrails.py
  • tests/pulumi/test_delivery_contracts.py
  • tests/pulumi/test_project_structure.py
  • tests/unit/test_components.py
  • tests/unit/test_operations_alert_triage.py
  • tests/unit/test_validate_ci_environment.py
✅ Files skipped from review due to trivial changes (7)
  • specs/issue-20-pulumi-esc-ci-config/epics.md
  • README.md
  • specs/issue-18-multi-account-test-prod-pulumi-environments/prd.md
  • docs/sre-operations.md
  • specs/issue-20-pulumi-esc-ci-config/implementation-readiness-report.md
  • specs/issue-20-pulumi-esc-ci-config/architecture.md
  • docs/alert-routing-evidence.md

Comment thread scripts/operations_alert_triage.py Outdated
@dmytrocraft dmytrocraft force-pushed the codex/issue20-pulumi-esc branch 3 times, most recently from defa592 to a810515 Compare May 24, 2026 20:51
Comment thread pulumi/infra/bootstrap_infrastructure.py
@cubic-dev-ai

cubic-dev-ai Bot commented May 24, 2026

Copy link
Copy Markdown

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

Comment thread scripts/validate_ci_environment.py Fixed
@Kravalg Kravalg changed the title Move privileged CI config to Pulumi ESC Use AWS Secrets Manager for privileged CI config May 25, 2026
Comment thread .github/workflows/operations-alert-backfill.yml
@Kravalg Kravalg mentioned this pull request May 25, 2026
12 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use AWS Secrets Manager and minimize GitHub Environment configuration

3 participants