add: AWS ARM node tooling upgrade guide for Orka 3.6 (OK-5476)#260

Draft
celanthe wants to merge 8 commits into main from update/aws-arm-tooling-upgrade-ok5476

Conversation

@celanthe

Collaborator

Summary

  • Adds upgrading-orka-on-aws.mdx: customer-facing upgrade guide for AWS deployments (3.5 to 3.6)
  • Documents the new Ansible-based in-place ARM node tooling upgrade (OK-5476, PR macstadium/monorepo-dev#23804), replacing the AMI replacement approach
  • Covers SSH/SSM prerequisites, required EC2 node tag (role=orka-arm), what's preserved during upgrade (node name, IP, cluster registration, license key, VM quota, storage layout, running VMs), and all AWS-specific changes in 3.6
  • Wires the new page into the Upgrade Guides nav in docs.json

Tracks MPD-67. Scoped to EC2 ARM nodes only — hybrid deployment section (EKS control plane + on-prem Mac nodes) is a separate follow-on pending engineering input (DI-623).

Test plan

  • Preview renders correctly in Mintlify
  • Nav link works from Upgrade Guides section
  • Engineering confirms SSH/tag requirements with Ivan before merge

🤖 Generated with Claude Code

Documents the new Ansible-based in-place upgrade path for ARM EC2 Mac
nodes, replacing the AMI replacement approach. Covers SSH/SSM prereqs,
required node tag (role=orka-arm), what's preserved during upgrade, and
what changes in 3.5 to 3.6 for AWS deployments.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mintlify

mintlify Bot commented Jun 15, 2026

Contributor

Preview deployment for your docs. Learn more about Mintlify Previews.

| Project | Status | Preview | Updated (UTC) |
|---|---|---|---|
| macstadiuminc | 🟢 Ready | View Preview | Jun 15, 2026, 7:47 PM |

…ides to AWS upgrade guide

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread orka/orka-upgrades-and-release-notes/upgrading-orka-on-aws.mdx Outdated
@ispasov

ispasov commented Jun 16, 2026

Contributor

A lot of the examples in the original doc are missing.
For example, all of the ones below are ready for customers to copy and paste, so they don't have to figure things out themselves:

  1. The CodeBuild steps.
  2. The SSM permissions needed to access S3 buckets
  3. The SSM permissions needed to run SSM from CodeBuild

My suggestion would be to add these to the doc.

Comment thread orka/orka-upgrades-and-release-notes/upgrading-orka-on-aws.mdx Outdated
@ispasov

ispasov commented Jun 16, 2026

Contributor

One general theme I notice is that we say that:

  1. The upgrade is requested and scheduled
  2. MacStadium does it

It is a self-service upgrade. No need to contact us for anything.
I mentioned this in several places, but not everywhere, as I do not want to spam the same comment.

@ispasov

ispasov commented Jun 16, 2026

Contributor

The doc description says: "Upgrade your Orka cluster on AWS from 3.5 to 3.6. Covers ARM node tooling updates"

It talks more about the ARM nodes, but it doesn't mention the services upgrade.

Is the goal to be a general upgrade guide (both services and nodes) or a node-specific one only?
If the former, then I would mention that there are two specific parts that need an upgrade - the Orka K8s services and the nodes.
The services are upgraded the same way they are installed (we can show the CodeBuild example from the installation guide). Customers only need to change the Ansible image.

Nodes are now upgraded with Ansible.
We should also mention that the Ansible upgrade for the nodes won't always be possible. There will be cases where a new ARM instance from a new AMI needs to be provisioned (for example, when the host OS needs to be upgraded). We will specifically mention this in the release notes when it is needed.

Comment thread orka/orka-upgrades-and-release-notes/upgrading-orka-on-aws.mdx Outdated
…review

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…/IAM examples

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n sets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@celanthe

Collaborator Author

A lot of the examples in the original doc are missing. For example, all of the ones below are ready for customers to copy and paste, so they don't have to figure things out themselves:

  1. The CodeBuild steps.
  2. The SSM permissions needed to access S3 buckets
  3. The SSM permissions needed to run SSM from CodeBuild

My suggestion would be to add these to the doc.

Thanks for the review, @ispasov. I updated the PR to address your feedback: rewrote the framing as fully self-service throughout, and added the CodeBuild buildspec, Secrets Manager IAM policy, and SSM Session Manager and S3 permissions as copy-paste blocks.

Removed the credential scoping section. Expanded scope to cover both services and node tooling upgrades, with a note that AMI replacement is still required for host OS upgrades. Multi-region section updated to cover both the multi-step and per-run override approaches. If anything else needs updating, do let me know and I will address it.

env:
  shell: bash
  secrets-manager:
    SSH_PRIVATE_KEY: "<your-secret-name>"

@ispasov ispasov Jun 17, 2026

Contributor

Let's show an example of how to set the default region here as well.
You have it in the command above, but ideally we show people they have options.
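
As a minimal sketch of this suggestion, assuming the buildspec keeps the `env` block quoted above, the default region could be set under `env.variables`. The region value `us-east-1` is a placeholder chosen for illustration and is not taken from the guide.

```yaml
env:
  shell: bash
  variables:
    # Placeholder region (assumption): replace with the region of your ARM nodes.
    AWS_DEFAULT_REGION: "us-east-1"
  secrets-manager:
    # Secret name is left as the guide's own placeholder.
    SSH_PRIVATE_KEY: "<your-secret-name>"
```

Setting the region in the buildspec gives every run the same default; a per-run override in the CodeBuild project remains an option for multi-region deployments.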


Previously, updating tooling on ARM nodes required replacing the EC2 Mac AMI: the instance had to be deleted, a new one provisioned (a process that takes approximately 2 hours), and the node's name, namespace, and custom tags had to be manually reapplied.

Starting with the 3.5 to 3.6 upgrade path, ARM node tooling is updated in place using Ansible over SSH. The upgrade takes under 10 minutes per node. The following are read from the running node and reapplied automatically: node name, node IP, cluster registration, license key, VM quota, and storage layout (including data volumes on instances with local NVMe). Running VMs are not interrupted.

Contributor

It takes < 10min combined, not per node.
Although this depends on the number of nodes.
If they have 100, it may take 15.


### Upgrade Service is installed

As part of the 3.6 upgrade, the Orka Upgrade Service is deployed to your cluster. This enables smoother tooling updates in future Orka releases without requiring AMI replacement.

@ispasov ispasov Jun 17, 2026

Contributor

"without requiring AMI replacement": there will be cases where a replacement will be needed (you mention them above).
Maybe we should clarify this so people do not get confused.
We can say "without needing to run CodeBuild".


### cert-manager behavior change

Orka no longer installs its own cert-manager if one is already present in the cluster. If your cluster runs its own cert-manager and you previously experienced version or configuration conflicts with Orka's bundled installation, those conflicts are resolved in 3.6.

Contributor

A small nuance needs to be added: it no longer installs it if the customer configures it not to install it. It does not check for the presence of another cert-manager installation.

## After the upgrade

1. [Download and install](/orka/orka-overview/tools-integrations) the Orka 3.6 CLI if you haven't already.
2. Regenerate Service Account tokens for any automated workflows: `orka3 serviceaccount token <name>`

Contributor

2 and 3 are not relevant.
SA regeneration is MSDC specific. Only when the K8s cluster is recreated.
Images are not removed from the cache during an upgrade

3. Repopulate the image cache on your ARM nodes if needed: `orka3 imagecache add <image> --all`

<Warning>
Service Account tokens must be regenerated after this upgrade. Any automated workflows using service account tokens will fail until tokens are regenerated with `orka3 serviceaccount token <name>`.

Contributor

mentioned this above - this is not relevant


## Upgrading the Orka services

The Orka Kubernetes services are upgraded the same way they were installed: run the CodeBuild project pointed at the Orka 3.6 Ansible image. No additional configuration is required.

Contributor

Let's add a link to the original doc, in case people do not remember what they did.


If your nodes cannot accept SSH, the upgrade can run over SSM instead. SSM upgrades require an S3 bucket in the same region as your ARM nodes for Ansible file transfer, and can take significantly longer (up to 4 hours). SSH is strongly recommended.
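
For illustration only, an SSM-based dynamic inventory might look like the sketch below. It assumes the `amazon.aws.aws_ec2` inventory plugin and the `community.aws.aws_ssm` connection plugin, and it filters on the `role=orka-arm` tag named in the PR summary. The region and bucket name are placeholders, and the guide's actual inventory file may differ.

```yaml
# Hypothetical contents for an SSM inventory file such as arm.ssm.aws_ec2.yml.
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1                     # placeholder: region of your ARM nodes
filters:
  "tag:role": orka-arm            # required node tag from the upgrade guide
  instance-state-name: running
hostnames:
  - instance-id                   # SSM targets instances by ID, not by IP
compose:
  # compose values are Jinja2 expressions, so string literals are quoted twice.
  ansible_connection: "'community.aws.aws_ssm'"
  ansible_aws_ssm_region: "'us-east-1'"
  # Bucket used for Ansible file transfer; must be in the same region as the nodes.
  ansible_aws_ssm_bucket_name: "'<your-s3-bucket>'"
```

Keeping the SSM settings in the inventory file means the playbook invocation itself stays the same for SSH and SSM runs apart from the `-i` argument.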

### Enabling SSH on nodes launched without a key pair

Contributor

We can link the article from the Confluence doc that points to the AWS doc explaining how to rotate the SSH key.
I imagine people would want to rotate it.

ansible-playbook -i arm.ssm.aws_ec2.yml configure-arm.yml
```

## Changing node values during the upgrade

Contributor

Let's mention that this can be run not only during an upgrade.
Customers may want to rename nodes: they run the CodeBuild project configured above and pass one of the Ansible vars here.

Collaborator Author

Let's mention that this can be run not only during an upgrade. Customers may want to rename nodes: they run the CodeBuild project configured above and pass one of the Ansible vars here.

Thanks for the second pass, @ispasov. Updated:

  • Timing fixed: "Under 10 minutes," total (Ansible parallel), not per node
  • Section heading and Upgrade Service copy updated to "no longer require provisioning a new EC2 Mac instance," made it clearer what's changed without implying AMI replacement is gone entirely
  • cert-manager: rewritten as explicit opt-out, not presence-detection
  • "After the upgrade" trimmed to just the CLI install, Service Account token regeneration and image cache repopulation removed as these are MSDC specific
  • Added AWS_DEFAULT_REGION example variable in the buildspec
  • Added key rotation link and installation guide link
  • "Changing node values" section: Renamed and clarified it can be run independently, not only during an upgrade

Let me know if any of these landed wrong, and thank you again! :)

- Add AWS blog link for SSH key pair rotation in Enabling SSH section
- Add link to installation guide in Upgrading the Orka services section
- Add AWS_DEFAULT_REGION example variable in buildspec env.variables
- Rename "Changing node values during the upgrade" to "Changing node values"; clarify playbook can run independently to rename nodes
- Fix timing: Ansible runs nodes in parallel, typical deployment under 10 minutes total (not per node)
- Update section heading and Upgrade Service copy: "no longer require provisioning a new EC2 Mac instance" instead of "AMI replacement" (AMI replacement still required for host OS upgrades)
- Fix cert-manager language: opt-out via explicit config, not auto-detected presence
- Remove SA token regeneration and image cache items from After the upgrade (MSDC-specific, not applicable to AWS self-service); remove Warning block

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>