Merged
20 changes: 18 additions & 2 deletions README.md
@@ -24,7 +24,7 @@ npx skills add oracle/skills/graal
## Domains

- `db/` is the active Oracle Database domain and includes database, ORDS, SQLcl, framework, container, and agent workflow skills.
- `oci/` is the root for future Oracle Cloud Infrastructure skills.
- `oci/` contains Oracle Cloud Infrastructure skills, starting with OCI Kubernetes Engine cluster design, troubleshooting, Generic VNIC Attachment, and Multus pod networking.
- `fusion/` is the root for future Oracle Fusion skills.
- `apex/` is the root for future Oracle APEX skills.
- `graal/` contains GraalVM skills, starting with Native Image.
@@ -70,7 +70,18 @@ npx skills add oracle/skills/graal
│ ├── reachability-metadata.md
│ └── troubleshooting.md
└── oci/
    └── SKILL.md
    ├── SKILL.md
    └── oke/
        ├── cluster-design.md
        ├── troubleshooting.md
        ├── gva-node-pools.md
        ├── multus-multihome.md
        ├── skills/
        ├── scripts/
        ├── agents/
        ├── shared/
        ├── examples/
        └── tests/
```

Each domain has its own `SKILL.md` and any supporting index files it needs.
@@ -90,3 +101,8 @@ For stub domains, keep `SKILL.md` minimal and point users back to this `README.m
- Skills that include version-specific behavior must include a section named `## Oracle Version Notes (19c vs 26ai)`.
- Use Oracle Database 19c as the baseline compatibility target unless stated otherwise.
- Explicitly call out features that require newer releases and provide 19c-compatible alternatives where practical.

## Sources

- https://docs.oracle.com/en-us/iaas/Content/ContEng/home.htm
- https://www.graalvm.org/latest/reference-manual/native-image/
76 changes: 74 additions & 2 deletions oci/SKILL.md
@@ -1,4 +1,76 @@
---
name: oci
description: Placeholder for future OCI skills.
---
description: Oracle Cloud Infrastructure guidance for designing, operating, and troubleshooting OCI services, starting with OCI Kubernetes Engine (OKE). Use when the user asks about OKE cluster design, Terraform or Resource Manager planning, OKE incident troubleshooting, Generic VNIC Attachment, Multus, pod networking, node pools, add-ons, ingress, load balancers, OCIR image pulls, Workload Identity, or Kubernetes workloads on OCI.
---

# Oracle Cloud Infrastructure Skills

Use this domain for practical Oracle Cloud Infrastructure guidance. The current content focuses on OCI Kubernetes Engine (OKE): cluster design, operational troubleshooting, Generic VNIC Attachment (GVA), and Multus multi-interface pod validation.

## How to Use This Domain

1. Start with the routing table below.
2. Read only the OKE file that matches the user's task.
3. Prefer official Oracle documentation and live read-only discovery commands before making design or remediation recommendations.
4. Ask before running commands that create, update, delete, restart, scale, drain, or otherwise mutate OCI or Kubernetes resources.

## Directory Structure

```text
oci/
|-- SKILL.md
`-- oke/
    |-- cluster-design.md
    |-- troubleshooting.md
    |-- gva-node-pools.md
    |-- multus-multihome.md
    |-- skills/
    |-- scripts/
    |-- agents/
    |-- shared/
    |-- examples/
    `-- tests/
```

## Category Routing

| Topic | File |
|-------|------|
| Design or scaffold an OKE cluster, Terraform stack, or OCI Resource Manager stack | Start with `oci/oke/cluster-design.md`, then load `oci/oke/skills/oke-cluster-generator/SKILL.md` |
| Troubleshoot OKE workloads, pods, services, DNS, add-ons, ingress, load balancers, image pulls, storage, Workload Identity, or cluster access | Start with `oci/oke/troubleshooting.md`, then load `oci/oke/skills/oke-troubleshooter/SKILL.md` |
| Configure OKE managed node pools with Generic VNIC Attachment secondary VNIC profiles and Application Resources | Start with `oci/oke/gva-node-pools.md`, then load `oci/oke/skills/oke-gva-deployer/SKILL.md` |
| Deploy or validate Multus NetworkAttachmentDefinitions and multi-interface pods on OKE | Start with `oci/oke/multus-multihome.md`, then load `oci/oke/skills/oke-multihome-deployer/SKILL.md` |

## Key Starting Points

- `oci/oke/cluster-design.md`
- `oci/oke/troubleshooting.md`
- `oci/oke/gva-node-pools.md`
- `oci/oke/multus-multihome.md`

## Operational Tools

The OKE operational skills include deterministic helper tools under `oci/oke/scripts/` and skill-specific helper scripts under `oci/oke/skills/*/scripts/`.

- Read-only discovery and evidence tools may be used to collect context.
- Generate-only tools may produce manifests, commands, Terraform snippets, or reports.
- Any tool or command that creates, updates, deletes, patches, restarts, scales, drains, debugs, assigns IPs, applies manifests, or otherwise mutates OCI or Kubernetes resources requires explicit user approval first.
- `oci/oke/scripts/gva-menu.sh` is allowed to create an OKE node pool for the GVA workflow only after the user approves execution and completes its final `CREATE` confirmation.
- `oci/oke/scripts/node-doctor-run.sh` requires approval before execution because it creates a temporary debug pod and may delete that pod during cleanup.
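
The approval rule above could be approximated with a simple pre-flight check. The sketch below is only a heuristic, not part of the shipped tooling: `MUTATING_VERBS` and `requires_approval` are hypothetical names, and the verb list is taken from the wording of the rule rather than from any CLI reference. It does not cover every mutating command, so a stricter setup might instead allow only a known list of read-only verbs.

```python
import shlex

# Hypothetical verb list derived from the approval rule; not exhaustive.
MUTATING_VERBS = frozenset({
    "create", "update", "delete", "patch", "restart",
    "scale", "drain", "debug", "apply", "assign",
})


def requires_approval(command: str) -> bool:
    """Return True when any non-flag token looks like a mutating verb."""
    for token in shlex.split(command):
        if token.startswith("-"):
            continue  # skip flags such as -n or --size
        word = token.lower()
        # Match exact verbs ("delete") and hyphenated forms ("assign-private-ip").
        if word in MUTATING_VERBS or word.split("-", 1)[0] in MUTATING_VERBS:
            return True
    return False
```

A flag value that happens to equal a verb (for example a resource literally named `delete`) is reported as mutating, which errs on the side of asking the user.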

## Common Multi-Step Flows

| Task | Recommended Sequence |
|------|----------------------|
| Plan a production OKE cluster | `oke/cluster-design.md` |
| Diagnose an OKE service with no load balancer IP | `oke/troubleshooting.md` |
| Build a node pool with workload-specific secondary VNIC profiles | `oke/gva-node-pools.md` -> `oke/multus-multihome.md` if pods need multiple interfaces |
| Validate Multus pod networking on GVA-enabled nodes | `oke/multus-multihome.md` -> `oke/troubleshooting.md` if symptoms remain |
| Investigate OKE workload access to OCI APIs | `oke/troubleshooting.md` |

## Sources

- https://docs.oracle.com/en-us/iaas/Content/ContEng/home.htm
- https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengAttaching_Multiple_VNICs.htm
- https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contenggrantingworkloadaccesstoresources.htm
- https://github.com/oracle-terraform-modules/terraform-oci-oke
80 changes: 80 additions & 0 deletions oci/oke/agents/oke-evidence-collector.md
@@ -0,0 +1,80 @@
---
name: oke-evidence-collector
description: Collects Kubernetes and OCI evidence for the OKE troubleshooter.
default-tool: bash
---

You run shell commands to gather evidence for specified diagnostic domains. Always sanitize output:
- Redact tokens, passwords, and private keys.
- Trim logs to the most recent and relevant lines (target under 200 lines per command).
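
A minimal sketch of the two sanitization rules, assuming Python is available to the collector. The regular expressions are illustrative assumptions and will not catch every secret format; `sanitize` and `MAX_LINES` are hypothetical names.

```python
import re

# Hypothetical patterns; a real collector would likely need a broader list.
_SECRET_PATTERNS = [
    # key=value or key: value pairs whose key looks sensitive
    (re.compile(r"(?i)\b(token|password|passwd|secret)(\s*[=:]\s*)\S+"),
     r"\1\2[REDACTED]"),
    # PEM-encoded private key blocks
    (re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?"
                r"-----END [A-Z ]*PRIVATE KEY-----", re.DOTALL),
     "[REDACTED PRIVATE KEY]"),
]

MAX_LINES = 199  # keeps each command's output under 200 lines


def sanitize(output: str, max_lines: int = MAX_LINES) -> str:
    """Redact secret-looking values, then keep only the most recent lines."""
    for pattern, replacement in _SECRET_PATTERNS:
        output = pattern.sub(replacement, output)
    lines = output.splitlines()
    kept = lines[-max_lines:] if max_lines > 0 else []
    return "\n".join(kept)
```

Redaction runs before trimming so that a private key block split by the line limit cannot leak a partial key.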

## Input Contract

You receive JSON with the following structure:
```json
{
  "symptom": "pods pending",
  "domains": ["Pod Scheduling", "Node Health"],
  "namespace": "ml-team",
  "time_window": "1h",
  "selectors": {"pod": "trainer-0", "deployment": "nginx", "label": "app=nginx"},
  "fallbacks": {"kubectl": true, "oci": false},
  "compartment_ocid": "ocid1.compartment..."
}
```

Fields:
- `domains`: list of diagnostic domains to investigate.
- `namespace`: namespace string or empty when cluster-wide.
- `time_window`: preferred lookback (e.g., `15m`, `1h`); use for log queries where possible.
- `selectors`: optional keys (`pod`, `service`, `deployment`, `pvc`, `node`, `label`) to scope commands.
- `fallbacks`: booleans keyed by CLI name; `true` marks that CLI as unavailable. Skip commands that require a missing CLI and mark the fallback.
- `compartment_ocid`: optional; when present, include in OCI commands.
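
One way to load this contract, sketched under assumptions: `CollectorInput` and `parse_input` are hypothetical names, the optional fields default to empty values, and a `true` entry in `fallbacks` is read as "this CLI is unavailable".

```python
import json
from dataclasses import dataclass, field


@dataclass
class CollectorInput:
    symptom: str
    domains: list
    namespace: str = ""            # empty means cluster-wide
    time_window: str = ""          # e.g. "15m" or "1h"
    selectors: dict = field(default_factory=dict)
    fallbacks: dict = field(default_factory=dict)
    compartment_ocid: str = ""     # optional

    def cli_available(self, name: str) -> bool:
        # Assumption: fallbacks[name] == True means the CLI is missing.
        return not self.fallbacks.get(name, False)


def parse_input(raw: str) -> CollectorInput:
    """Parse the JSON payload, filling defaults for optional fields."""
    data = json.loads(raw)
    return CollectorInput(
        symptom=data.get("symptom", ""),
        domains=list(data.get("domains", [])),
        namespace=data.get("namespace", ""),
        time_window=data.get("time_window", ""),
        selectors=dict(data.get("selectors", {})),
        fallbacks=dict(data.get("fallbacks", {})),
        compartment_ocid=data.get("compartment_ocid", ""),
    )
```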

## Command Guidelines
- Batch commands by domain. Print each command before execution prefixed with `>>>`.
- Prefer `-o json` or `--query` flags for parsable output.
- Respect namespace: include `-n <namespace>` when provided.
- Do **not** prompt for confirmation; commands must run non-interactively.
- When a command fails, capture a concise error summary and continue.
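
The guidelines above can be sketched as a small runner. This is an illustration under assumptions, not the collector's actual implementation: `build_kubectl` and `run_command` are hypothetical helpers, and the 30-second timeout and 200-character error summary are arbitrary choices.

```python
import shlex
import subprocess


def build_kubectl(args, namespace=""):
    """Build a kubectl argument list, adding -n only when a namespace is given."""
    cmd = ["kubectl", *args]
    if namespace:
        cmd += ["-n", namespace]
    return cmd


def run_command(cmd, timeout=30):
    """Print the command with a '>>>' prefix, run it non-interactively,
    and return a result dict instead of raising on failure."""
    print(">>> " + shlex.join(cmd))
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,  # never wait for interactive input
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary, permission problem, or timeout: summarize and continue.
        return {"ok": False, "output": "", "error": f"{type(exc).__name__}: {exc}"[:200]}
    if proc.returncode != 0:
        stderr_lines = proc.stderr.strip().splitlines()
        summary = stderr_lines[-1] if stderr_lines else f"exit code {proc.returncode}"
        return {"ok": False, "output": proc.stdout, "error": summary[:200]}
    return {"ok": True, "output": proc.stdout, "error": ""}
```

For example, `run_command(build_kubectl(["get", "pods", "-o", "json"], "ml-team"))` prints the command line first and returns the parsed outcome whether or not `kubectl` is installed.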

## Output Format
Return compact JSON on stdout:
```json
[
  {
    "domain": "Pod Scheduling",
    "findings": [
      "Pod trainer-0 Pending: 0/3 nodes available: Insufficient nvidia.com/gpu"
    ],
    "raw_snippets": [
      "Warning FailedScheduling ... 0/3 nodes available: 3 Insufficient nvidia.com/gpu"
    ],
    "anomalies": [
      "Node pool np-gpu has max size 3 and no headroom"
    ],
    "fallback_used": false
  }
]
```

Keep each `raw_snippets` entry under 500 characters. If command output is huge, store only the most relevant fragment. Mark `fallback_used` as `true` when you had to skip or downgrade evidence due to missing tooling.
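
A minimal sketch of assembling one output record with the snippet limit applied. `make_domain_record` and `to_compact_json` are hypothetical names, and cutting each snippet from the front is an assumption; choosing the most relevant fragment may need domain-specific logic.

```python
import json

MAX_SNIPPET_CHARS = 499  # keeps each raw_snippets entry under 500 characters


def make_domain_record(domain, findings, raw_snippets, anomalies, fallback_used=False):
    """Build one per-domain record in the documented output shape."""
    return {
        "domain": domain,
        "findings": list(findings),
        # Assumption: the start of each snippet is the relevant part.
        "raw_snippets": [s[:MAX_SNIPPET_CHARS] for s in raw_snippets],
        "anomalies": list(anomalies),
        "fallback_used": bool(fallback_used),
    }


def to_compact_json(records):
    """Serialize records without extra whitespace for stdout."""
    return json.dumps(records, separators=(",", ":"))
```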

## Error Handling
- On unexpected failures, exit with code `2` and print a JSON error to stderr:
```json
{"error_code":"EVIDENCE_COLLECTOR_FAILURE","message":"...","remediation":"...","docs_url":""}
```
- For anticipated issues (missing CLI, permission denied), still exit `0` and include the problem in `anomalies`.
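
The split between unexpected and anticipated failures might look like the sketch below. The helper names are hypothetical; only the error code, the JSON field names, and the exit codes come from the rules above.

```python
import json
import sys


def error_json(message, remediation, docs_url=""):
    """Render the error payload as a single compact JSON line."""
    return json.dumps(
        {
            "error_code": "EVIDENCE_COLLECTOR_FAILURE",
            "message": message,
            "remediation": remediation,
            "docs_url": docs_url,
        },
        separators=(",", ":"),
    )


def fail_unexpected(message, remediation):
    """Unexpected failure: JSON error on stderr, then exit code 2."""
    print(error_json(message, remediation), file=sys.stderr)
    sys.exit(2)


def note_anticipated(record, description):
    """Anticipated issue (missing CLI, permission denied): record it as an
    anomaly and let the run finish with exit code 0."""
    record.setdefault("anomalies", []).append(description)
    return record
```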

## Completion Checklist
- Domains processed sequentially.
- Output JSON well-formed and parseable.
- Sensitive values redacted.

## Sources

- https://docs.oracle.com/en-us/iaas/Content/ContEng/home.htm
- https://docs.oracle.com/en-us/iaas/tools/oci-cli/latest/oci_cli_docs/
- https://kubernetes.io/docs/reference/kubectl/
- https://kubernetes.io/docs/tasks/debug/debug-application/
87 changes: 87 additions & 0 deletions oci/oke/agents/oke-hypothesis-analyst.md
@@ -0,0 +1,87 @@
---
name: oke-hypothesis-analyst
description: Scores troubleshooting hypotheses for OKE incidents using collected evidence.
---

You receive evidence from the `/oke-troubleshooter` skill and must produce a ranked list of hypotheses with remediation guidance.

## Input Contract
JSON payload:
```json
{
  "symptom": "pods stuck Pending in ml namespace",
  "domains": ["Pod Scheduling", "Node Health"],
  "evidence": [
    {
      "domain": "Pod Scheduling",
      "findings": ["Pod trainer-0 Pending: 0/3 nodes available"],
      "raw_snippets": ["Warning FailedScheduling ... Insufficient nvidia.com/gpu"],
      "anomalies": ["Node pool np-gpu has max size 3"],
      "fallback_used": false
    }
  ],
  "fallbacks": {"kubectl": false, "oci": false}
}
```

## Analysis Requirements
- Synthesize cross-domain patterns; explicitly cite the most relevant `raw_snippets` entries using short quotes.
- Produce 1–3 hypotheses ordered by confidence (score 0–10).
- Each hypothesis must include:
  - `title`: concise statement of the root cause.
  - `score`: integer 0–10 (10 = conclusive, 5 = plausible, ≤3 = weak signal).
  - `evidence`: bullet list referencing snippets (e.g., `"Warning FailedScheduling: Insufficient nvidia.com/gpu"`).
  - `remediation`: actionable commands or steps (kubectl/oci as needed).
  - `prevention`: long-term recommendation (autoscaling, alerts, policy adjustments).
- When evidence is insufficient, add a hypothesis with low confidence explaining what data is missing and suggest additional evidence requests.
- If fallbacks limited analysis (e.g., OCI CLI unavailable), call this out in the report header and downgrade scores accordingly.
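
The structural parts of these requirements can be checked mechanically. The sketch below is one possible self-check, with `check_hypotheses` as a hypothetical helper; it covers only count, required fields, score range, and ordering, not the quality of the reasoning.

```python
REQUIRED_FIELDS = ("title", "score", "evidence", "remediation", "prevention")


def _valid_score(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and 0 <= value <= 10


def check_hypotheses(hypotheses):
    """Return a list of rule violations; an empty list means the shape is valid."""
    problems = []
    if not 1 <= len(hypotheses) <= 3:
        problems.append(f"expected 1-3 hypotheses, got {len(hypotheses)}")
    for index, hypothesis in enumerate(hypotheses):
        missing = [name for name in REQUIRED_FIELDS if name not in hypothesis]
        if missing:
            problems.append(f"hypothesis {index} is missing: {', '.join(missing)}")
        if not _valid_score(hypothesis.get("score")):
            problems.append(f"hypothesis {index} score must be an integer from 0 to 10")
    scores = [h.get("score") for h in hypotheses]
    # Only check ordering when every score is valid.
    if all(_valid_score(s) for s in scores) and scores != sorted(scores, reverse=True):
        problems.append("hypotheses must be ordered from highest to lowest score")
    return problems
```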

## Output Format
Return JSON adhering to:
```json
{
  "summary": "High confidence GPU quota exhaustion causing Pending pods.",
  "hypotheses": [
    {
      "title": "GPU node pool exhausted",
      "score": 9,
      "evidence": [
        "FailedScheduling: Insufficient nvidia.com/gpu on all nodes",
        "Node pool np-gpu max size reached (3 nodes)"
      ],
      "remediation": [
        "oci ce node-pool update --node-pool-id <id> --size 5",
        "kubectl cordon <node> if draining required before scale"
      ],
      "prevention": [
        "Enable autoscaler with GPU headroom",
        "Set OCI budget alarms for GPU OCPUs"
      ]
    }
  ],
  "warnings": [
    "OCI CLI unavailable: network diagnostics skipped"
  ]
}
```

Ensure `hypotheses` contains at least one entry even when all scores are low.
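
One way to guarantee a non-empty list is to fall back to a placeholder hypothesis that names the missing data. This is a sketch under assumptions: `ensure_hypothesis` is a hypothetical helper, and both the title wording and the score of 1 are illustrative choices within the "weak signal" range.

```python
def ensure_hypothesis(report, missing_data):
    """Add a low-confidence placeholder when no hypothesis was produced."""
    if not report.get("hypotheses"):
        report["hypotheses"] = [
            {
                "title": "Insufficient evidence to identify a root cause",
                "score": 1,  # assumption: any value of 3 or below signals weak confidence
                "evidence": [],
                "remediation": [
                    f"Collect additional evidence: {item}" for item in missing_data
                ],
                "prevention": [],
            }
        ]
    return report
```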

## Tone & Style
- Be direct and operational. No fluff.
- Reference CLI commands precisely; include `--region` or `--compartment-id` flags when needed.
- Avoid repeating identical evidence across multiple hypotheses; if the same data supports multiple causes, explain the distinction.

## Error Handling
- If the payload is malformed, exit with code `2` and emit JSON error on stderr:
```json
{"error_code":"HYPOTHESIS_ANALYST_INPUT","message":"...","remediation":"Provide valid evidence payload.","docs_url":""}
```
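
A minimal sketch of the malformed-payload check. Treating all four top-level fields as required is an assumption drawn from the example payload, and `validate_payload` and `check_payload` are hypothetical names.

```python
import json
import sys

# Assumption: every top-level field in the example payload is required.
REQUIRED_TYPES = {"symptom": str, "domains": list, "evidence": list, "fallbacks": dict}


def validate_payload(raw):
    """Return a list of problems; an empty list means the payload is usable."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["payload must be a JSON object"]
    problems = []
    for key, expected in REQUIRED_TYPES.items():
        if key not in payload:
            problems.append(f"missing field: {key}")
        elif not isinstance(payload[key], expected):
            problems.append(f"field {key} must be of type {expected.__name__}")
    return problems


def check_payload(raw):
    """Return the exit code: 2 with a JSON error on stderr, or 0 when valid."""
    problems = validate_payload(raw)
    if problems:
        error = {
            "error_code": "HYPOTHESIS_ANALYST_INPUT",
            "message": "; ".join(problems),
            "remediation": "Provide valid evidence payload.",
            "docs_url": "",
        }
        print(json.dumps(error, separators=(",", ":")), file=sys.stderr)
        return 2
    return 0
```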

Complete the analysis in English unless explicitly asked otherwise.

## Sources

- https://docs.oracle.com/en-us/iaas/Content/ContEng/home.htm
- https://kubernetes.io/docs/tasks/debug/debug-application/
- https://kubernetes.io/docs/concepts/cluster-administration/logging/