OKE Agent Plugin

Give your coding agent practical OCI Kubernetes Engine (OKE) operating knowledge.

This repo packages reusable agent skills for Claude Code and Codex. The skills help you generate OKE Terraform, troubleshoot cluster incidents, configure Generic VNIC Attachment (GVA), and validate Multus multi-home workloads without starting from a blank prompt every time.

An agent skill is a packaged workflow: instructions, references, and helper scripts that tell the agent how to handle a specific OKE task. Instead of explaining your preferred OKE troubleshooting or Terraform-generation process every session, you can ask the agent to use one of these skills.

Why Try It

OKE work often crosses several layers at once: Kubernetes objects, OKE add-ons, OCI networking, IAM, node pools, load balancers, and storage. General-purpose agents can miss those connections unless you give them a lot of context.

This plugin gives the agent that context up front:

| If you need to... | Use this skill |
|---|---|
| Create an OKE Terraform stack and OCI Resource Manager schema | oke-cluster-generator |
| Diagnose OKE pods, services, networking, add-ons, DNS, storage, image pulls, or IAM | oke-troubleshooter |
| Build an OKE node pool with GVA secondary VNIC profiles | oke-gva-deployer |
| Deploy and test Multus pods on a GVA-enabled node pool | oke-multihome-deployer |

The troubleshooting and multihome skills also include checks for DPDK/SR-IOV paths: Multus attachments, SR-IOV device-plugin resources, Mellanox mlx5, vfio-pci, RDMA/verbs devices, hugepages, and DPDK application configuration.

Quick Start

Clone the repo:

git clone https://github.com/chiphwang1/oke-agent-plugin.git
cd oke-agent-plugin

Pick the path that matches your agent:

| I use... | Start here |
|---|---|
| Claude Code | From this repo, run: claude --plugin-dir . |
| Codex, and I want to try the plugin from this repo | From this repo, run: codex "What OKE skills are available in this plugin?" |
| Codex, and I want the skills available in other workspaces | Follow the Codex Local Skill Install section |

Claude Code

claude --plugin-dir .

Then try:

/oke-agent-plugin:oke-troubleshooter "pods stuck Pending in prod namespace"

Codex

From inside the cloned repository, run:

codex "What OKE skills are available in this plugin?"

That command is a quick sanity check: Codex should read the local plugin context and list the OKE skills in this repo.

Then try a natural-language prompt:

Use OKE Agent Plugin to troubleshoot pods stuck Pending in my OKE cluster.

Live cluster workflows need your own OCI CLI session and kubeconfig. Offline smoke tests do not need OCI or Kubernetes access.

Good First Prompts

Start with the outcome you want:

Generate a fast-path OKE Terraform stack for us-ashburn-1 with private workers.
Troubleshoot why my OKE LoadBalancer service has no public IP.
Create a GVA-enabled node pool for my current OKE cluster.
Deploy Multus multi-home test pods on my GVA node pool and validate net1.
Check an OKE workload that uses Multus, SR-IOV, hugepages, and DPDK.

The agent will ask follow-up questions when it needs cluster, region, compartment, subnet, node-pool, or workload details.

What You Get

Generate OKE Terraform

/oke-agent-plugin:oke-cluster-generator

The agent walks through a structured OKE design conversation, summarizes the architecture, then generates Terraform and an OCI Resource Manager schema.

Use it when you want:

  • main.tf, variables.tf, outputs.tf, provider.tf, and terraform.tfvars.example
  • schema.yaml for OCI Resource Manager
  • Guided choices for cluster type, networking, node pools, storage, IAM, encryption, observability, GPU, RDMA/RoCE, and validation rules
  • A fast-path mode for a private, production-friendly starter stack with minimal questions

Example prompts:

/oke-agent-plugin:oke-cluster-generator
/oke-agent-plugin:oke-cluster-generator fast-path us-ashburn-1 demo-private-oke
/oke-agent-plugin:oke-cluster-generator ai/ml us-ashburn-1 prod-cluster
/oke-agent-plugin:oke-cluster-generator hpc us-frankfurt-1

Example: see the successful transcript and sample Terraform output in examples/.

Troubleshoot OKE Incidents

/oke-agent-plugin:oke-troubleshooter

The agent turns a symptom into an evidence plan, collects Kubernetes and OCI signals, ranks likely causes, and gives remediation steps.

It covers:

  • OKE add-ons: CoreDNS, OCI CNI, CSI, metrics, daemonsets, and deployments
  • Pod networking: OCI CNI/IPAM, Multus, NADs, pod sandboxes, and secondary interfaces
  • Load balancers, private endpoints, DNS, OCIR pulls, Workload Identity, and ingress
  • Autoscaler and node-pool scale-up failures
  • Storage, PVCs, CSI logs, and OCI limits
  • OCI object correlation for Pod-to-Node-to-Instance, Service/Ingress-to-LB, PVC-to-volume, VNIC/subnet, and node-pool graph evidence
  • Incident timelines across Kubernetes events, rollouts, object descriptions, and OCI alarms

Example prompts:

/oke-agent-plugin:oke-troubleshooter "pods stuck Pending in prod namespace"
/oke-agent-plugin:oke-troubleshooter "service payments-lb has no IP us-phoenix-1"
/oke-agent-plugin:oke-troubleshooter "OCIR ImagePullBackOff unauthorized"
/oke-agent-plugin:oke-troubleshooter "private OKE API endpoint unreachable"

Example: see the successful transcript and sample troubleshooting report in examples/.

The troubleshooter runs scripts/oke-object-correlator.sh before domain-specific collectors when a namespace and at least one selector are known. The correlator builds a read-only graph that links Kubernetes resources to OCI resources, so hypothesis ranking can use explicit paths like:

pod/default/web-0 -> node/node-a -> instance/ocid1.instance...
service/default/web -> loadbalancer/ocid1.loadbalancer...
pvc/default/data -> pv/pvc-123 -> volume/ocid1.volume...

You can run the correlator directly when you want the graph without a full troubleshooting session:

./scripts/oke-object-correlator.sh \
  --namespace default \
  --cluster-id <cluster_ocid> \
  --compartment-id <compartment_ocid> \
  --region <region> \
  --pod web-0 \
  --service web

The output includes graph.kubernetes, graph.oci, graph.edges, findings, anomalies, and fallback_used.
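
A quick way to confirm those sections are present is to check a saved report for each one. This is a minimal sketch, assuming the correlator writes a single JSON object to stdout and that dotted names such as graph.edges denote nested keys. The function name check_correlator_report is hypothetical and is not shipped in this repo.

```shell
# Hypothetical helper: reads correlator JSON on stdin and lists every
# expected section that is missing. Assumes dotted names are nested keys.
check_correlator_report() {
  python3 -c '
import json, sys

expected = ["graph.kubernetes", "graph.oci", "graph.edges",
            "findings", "anomalies", "fallback_used"]
report = json.load(sys.stdin)
missing = []
for path in expected:
    node = report
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            missing.append(path)
            break
        node = node[key]
for path in missing:
    print("missing: " + path)
sys.exit(1 if missing else 0)
'
}

# Usage (report.json is a placeholder for saved correlator output):
#   check_correlator_report < report.json
```

The function exits 0 when every section is found and 1 otherwise, so it can gate later steps in a script.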

Configure GVA Node Pools

/oke-agent-plugin:oke-gva-deployer

The agent helps create OKE node pools configured with Generic VNIC Attachment. It discovers cluster context, walks through subnet and shape choices, generates an oci ce node-pool create command, and gives a validation deployment.

Use it when you need:

  • Secondary VNIC profiles on an OKE node pool
  • Application Resource labels for scheduling
  • IPv4-only subnet validation for secondary VNIC subnets
  • A repeatable command instead of a hand-built CLI invocation

Example prompt:

/oke-agent-plugin:oke-gva-deployer

Example: see the successful transcript and sample GVA command output in examples/.

Validate Multus Multi-Home Pods

/oke-agent-plugin:oke-multihome-deployer

The agent generates and validates Multus NetworkAttachmentDefinition resources and pinned test pods for a GVA-enabled node pool.

It checks:

  • eth0 and net1 pod interfaces
  • Multus network-status
  • Pod-to-pod connectivity over net1
  • OCI CNI/IPAM state
  • Common failures such as missing ipvlan, CRI-O short-name rejection, pod sandbox failures, and DPDK/SR-IOV Mellanox mlx5 validation gaps
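
The network-status check above can also be done by hand. The sketch below assumes the pod carries the standard Multus annotation k8s.v1.cni.cncf.io/network-status, whose value is a JSON array with one entry per interface. The function name ips_for_interface is hypothetical, and the skill may perform this check differently.

```shell
# Hypothetical helper: reads a Multus network-status annotation (a JSON
# array) on stdin and prints the IPs reported for the given interface.
ips_for_interface() {
  local iface="${1:-net1}"
  python3 -c '
import json, sys

iface = sys.argv[1]
for entry in json.load(sys.stdin):
    if entry.get("interface") == iface:
        for ip in entry.get("ips", []):
            print(ip)
' "$iface"
}

# Usage against a live pod (placeholders in angle brackets):
#   kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}' \
#     | ips_for_interface net1
```

Empty output for net1 means Multus did not record a secondary attachment for that pod.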

Example prompt:

/oke-agent-plugin:oke-multihome-deployer

Example: see the successful transcript and sample Multus manifest in examples/.

Requirements

For live OKE workflows:

  • OCI CLI installed and authenticated
  • kubectl configured for the target OKE cluster
  • Read access to the OCI resources you want the agent to inspect
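
Before a live session it can help to fail early when a prerequisite is missing. This is a minimal sketch that only checks whether each command is on PATH. It does not verify authentication or cluster access, and require_tools is a hypothetical name, not a script shipped in this repo.

```shell
# Hypothetical preflight: report every required command that is not on PATH.
# Returns 0 when all are present, 1 otherwise.
require_tools() {
  local missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing required tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Live workflows:       require_tools oci kubectl
# Offline smoke tests:  require_tools bash python3
```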

For security-token auth, use one of these explicitly:

export OCI_CLI_AUTH=security_token
# or pass --auth security_token to OCI CLI commands

For offline development and smoke tests:

  • bash
  • python3

Codex Local Skill Install

If you want Codex to use the skills outside this repo, copy the skills and helper assets into ~/.codex:

mkdir -p ~/.codex/skills ~/.codex/scripts ~/.codex/shared ~/.codex/agents

cp -R skills/oke-cluster-generator ~/.codex/skills/
cp -R skills/oke-troubleshooter ~/.codex/skills/
cp -R skills/oke-gva-deployer ~/.codex/skills/
cp -R skills/oke-multihome-deployer ~/.codex/skills/

cp scripts/*.sh ~/.codex/scripts/
chmod +x ~/.codex/scripts/*.sh

cp -R shared/. ~/.codex/shared/
cp -R agents/. ~/.codex/agents/

Then start Codex in the workspace you want to operate on:

codex login
codex -C /path/to/your/workspace

If you update this repo later, reinstall the changed folders into ~/.codex.
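
To make reinstalling repeatable, the copy steps above can be wrapped in one function. This is a sketch only: install_codex_skills is a hypothetical name, and it assumes the repo layout shown in this README. Re-running it overwrites files that changed but does not delete files that were removed upstream.

```shell
# Hypothetical helper: copy the plugin's skills and helper assets from a
# checkout ($1) into a Codex home directory ($2, default ~/.codex).
install_codex_skills() {
  local repo="$1"
  local dest="${2:-$HOME/.codex}"

  mkdir -p "$dest/skills" "$dest/scripts" "$dest/shared" "$dest/agents" || return 1

  for skill in oke-cluster-generator oke-troubleshooter \
               oke-gva-deployer oke-multihome-deployer; do
    cp -R "$repo/skills/$skill" "$dest/skills/" || return 1
  done

  cp "$repo"/scripts/*.sh "$dest/scripts/" || return 1
  chmod +x "$dest"/scripts/*.sh || return 1

  cp -R "$repo/shared/." "$dest/shared/" || return 1
  cp -R "$repo/agents/." "$dest/agents/"
}

# Usage, from the cloned repository:
#   install_codex_skills . ~/.codex
```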

Safety Model

The skills are designed to be useful without hiding what they are doing:

  • Live OCI and Kubernetes commands require your local credentials and kubeconfig.
  • Offline tests use mocks and static inputs.
  • Public examples use placeholders such as <cluster_ocid>, <region>, and <node-name>.
  • Generated OKE examples use fully qualified container images, for example docker.io/nicolaka/netshoot:v0.13.
  • The troubleshooting workflow separates facts from hypotheses so you can see which evidence supports each recommendation.
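
Fully qualified images matter because a runtime that enforces short-name rules, such as the CRI-O rejection mentioned earlier, can refuse a reference like nicolaka/netshoot:v0.13. The sketch below applies the common rule of thumb for spotting a registry host. is_fully_qualified_image is a hypothetical name, and the rule is an approximation rather than the runtime's own parser.

```shell
# Hypothetical check: succeed only when the image reference names a registry.
# Rule of thumb: the part before the first "/" is a registry host if it
# contains "." or ":" or is exactly "localhost".
is_fully_qualified_image() {
  local first
  case "$1" in
    */*) ;;             # has at least one path separator
    *)   return 1 ;;    # e.g. "busybox:1.36" has no registry part
  esac
  first="${1%%/*}"
  case "$first" in
    localhost|*.*|*:*) return 0 ;;
    *)                 return 1 ;;
  esac
}

# Usage:
#   is_fully_qualified_image docker.io/nicolaka/netshoot:v0.13 && echo "ok"
```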

Repository Layout

oke-agent-plugin/
|-- .codex-plugin/plugin.json        # Codex plugin manifest
|-- .claude-plugin/plugin.json       # Claude Code plugin manifest
|-- AGENTS.md                        # Codex repository instructions
|-- agents/                          # Optional Claude Code subagents
|-- examples/                        # Successful transcripts and sample outputs
|-- scripts/                         # Shared shell helpers
|-- shared/                          # Shared OKE reference material
|-- skills/
|   |-- oke-cluster-generator/       # Terraform and ORM generation
|   |-- oke-troubleshooter/          # OKE incident diagnosis
|   |-- oke-gva-deployer/            # GVA node-pool workflow
|   `-- oke-multihome-deployer/      # Multus multi-home validation
`-- tests/scripts-smoke.sh           # Offline smoke tests

Development

Run the offline validation suite after edits:

bash tests/scripts-smoke.sh
git diff --check

The smoke tests intentionally avoid live OCI and Kubernetes access so they can run in CI or a cloud coding environment.
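
One common way to keep a shell test offline is to shadow the CLI with a function that returns canned output. The sketch below illustrates that technique only and is not necessarily how tests/scripts-smoke.sh is written. The list_node_pool_names function and the canned JSON are hypothetical, and real CLI responses carry many more fields.

```shell
# Stub: a shell function named oci shadows the real CLI in this shell only.
oci() {
  # Canned response; the shape is illustrative, not captured CLI output.
  printf '%s\n' '{"data": [{"name": "pool-a"}]}'
}

# Code under test (hypothetical): print node-pool names from the response.
list_node_pool_names() {
  oci ce node-pool list --compartment-id "$1" \
    | python3 -c 'import json, sys
for pool in json.load(sys.stdin)["data"]:
    print(pool["name"])'
}

# Offline assertion: no network, credentials, or cluster needed.
[ "$(list_node_pool_names ocid1.compartment.example)" = "pool-a" ] && echo "ok"
```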

Script errors follow a consistent contract:

| Exit code | Meaning |
|---|---|
| 0 | Success |
| 1 | Expected error, such as missing OCI auth or a CIDR overlap |
| 2 | Unexpected error, such as a missing CLI or invalid argument |

Errors are emitted as structured JSON to stderr:

{
  "error_code": "OCI_CLI_NOT_AUTHENTICATED",
  "message": "The OCI CLI is installed but not authenticated.",
  "remediation": "Run: oci setup config",
  "docs_url": "https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliconfigure.htm"
}
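
A caller can combine the exit code with the JSON on stderr to produce a one-line summary. This is a minimal sketch, assuming the error object carries the error_code, message, and remediation fields shown above. run_helper is a hypothetical wrapper, not part of the repo's scripts.

```shell
# Hypothetical wrapper: run a helper, pass its stdout through, and on failure
# summarize the structured error it wrote to stderr. Returns the helper's
# exit code unchanged.
run_helper() {
  local err_file rc=0
  err_file=$(mktemp) || return 2
  "$@" 2>"$err_file" || rc=$?
  if [ "$rc" -ne 0 ]; then
    python3 -c '
import json, sys
try:
    with open(sys.argv[1]) as handle:
        err = json.load(handle)
    print("%s: %s (%s)" % (err["error_code"], err["message"],
                           err["remediation"]), file=sys.stderr)
except (ValueError, KeyError, TypeError):
    print("helper failed without a structured error", file=sys.stderr)
' "$err_file"
  fi
  rm -f "$err_file"
  return "$rc"
}

# Usage (placeholder script name):
#   run_helper ./scripts/<helper>.sh --region <region>
```

Because the wrapper preserves the exit code, callers can still branch on 1 (expected) versus 2 (unexpected).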

Manual Verification Ideas

Use these scenarios when testing against a real cluster:

  • Broken image: run the troubleshooter against an ImagePullBackOff.
  • Load balancer pending: diagnose a LoadBalancer service with no external IP.
  • PVC pending: validate storage, CSI, volume attachment, and service-limit evidence.
  • DNS outage: scale down or break CoreDNS, then check DNS timeout diagnosis.
  • Autoscaler no-scale: create a Pending workload and inspect scale-up refusal signals.
  • Multus validation: run the multihome skill on a GVA-enabled node pool and confirm pods expose eth0 and net1.
  • Workload Identity: diagnose a pod receiving NotAuthorized from OCI APIs.
