AKA is an end-to-end Agent project for GPU kernel implementation, analysis, profiling, and iterative optimization. It helps an Agent turn PyTorch logic or an existing kernel into a high-performance GPU kernel through a structured, profile-driven workflow.
- Creates an isolated optimization workspace under /tmp/kernel_opt_<name>/.
- Looks up target hardware specs from the local gpu-wiki knowledge base.
- Runs Roofline analysis and sets auditable performance targets.
- Implements a correct baseline kernel before entering optimization.
- Runs the profile-driven optimization loop: profile with ncu or rocprofv3, extract bottleneck evidence, query gpu-wiki / reference projects / web sources for relevant optimization knowledge, write an evidence-based plan, apply one optimization category, validate correctness and performance, record memory, commit, then repeat until Stop Conditions are met.
- Records plans, profile artifacts, structured memory, reports, and Git commits for every accepted iteration.
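The Roofline step listed above can be illustrated with a minimal sketch. It is not AKA's own implementation: the function name `roofline` and the peak numbers are hypothetical placeholders (real values would come from the gpu-wiki hardware specs), and the workload is an assumed bf16 elementwise add.

```python
def roofline(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Classic Roofline bound for one kernel launch.

    Returns (attainable FLOP/s, arithmetic intensity in FLOP/byte,
    regime, lower bound on runtime in seconds).
    """
    intensity = flops / bytes_moved
    ridge = peak_flops / peak_bandwidth  # intensity where the two roofs meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    min_time_s = max(flops / peak_flops, bytes_moved / peak_bandwidth)
    return attainable, intensity, regime, min_time_s


# Placeholder hardware numbers for illustration only, NOT real GPU specs.
PEAK_FLOPS = 100e12      # FLOP/s
PEAK_BANDWIDTH = 2e12    # bytes/s

# Assumed workload: c = a + b over n bf16 elements (2 bytes each).
n = 1 << 24
flops = n                # one add per element
bytes_moved = 3 * n * 2  # read a, read b, write c

attainable, intensity, regime, min_time_s = roofline(
    flops, bytes_moved, PEAK_FLOPS, PEAK_BANDWIDTH
)
print(f"{regime}: intensity={intensity:.3f} FLOP/B, "
      f"attainable={attainable:.3e} FLOP/s, min time={min_time_s:.3e} s")
```

A target derived this way is auditable in the sense that every input (FLOP count, bytes moved, peak numbers) can be checked against the kernel and the hardware spec.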
For the full architecture and workflow design, see docs/design.md.
Installation requires:
- bash
- git
- jq
- Codex or Claude Code installed
Running optimization tasks also requires platform-specific profiling tools:
- NVIDIA: ncu
- AMD: rocprofv3, wrapped by tools/profile_kernel.sh
./install.sh
Common options:
./install.sh --hooks-only # Install or update hooks only
./install.sh --without-github # Skip GitHub reference repositories listed by gpu-wiki
./install.sh --max-iterations N # Configure hook stop behavior after memory/vN.json exceeds N
./install.sh --uninstall # Remove hooks installed by this script
The installer detects:
- Codex: $CODEX_HOME or ~/.codex
- Claude Code: $CLAUDE_HOME or ~/.claude
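The detection rule above can be read as "use the environment variable if set, otherwise fall back to the dot-directory in the user's home". The sketch below illustrates that reading; the precedence order is an assumption, and `resolve_agent_home` is a hypothetical helper, not a function from install.sh.

```python
import os
from pathlib import Path


def resolve_agent_home(env_var, default_dirname, env=None, home=None):
    """Pick an agent config directory.

    Assumed precedence: a non-empty environment variable wins,
    otherwise fall back to <home>/<default_dirname>.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    override = env.get(env_var)
    if override:
        return Path(override)
    return home / default_dirname


# Explicit env and home are passed so the example is deterministic.
codex_home = resolve_agent_home(
    "CODEX_HOME", ".codex", env={"CODEX_HOME": "/opt/codex"}, home="/home/user"
)
claude_home = resolve_agent_home(
    "CLAUDE_HOME", ".claude", env={}, home="/home/user"
)
print(codex_home, claude_home)
```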
It also prepares the default local knowledge base at /tmp/gpu-wiki/ and optional reference projects at /tmp/reference-projects/.
After installation, restart Codex / Claude Code or open a new session so the hooks and Skills are loaded.
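The --max-iterations N option is described as stopping once memory/vN.json exceeds N. One possible reading of that check is sketched below. It assumes memory files are named v1.json, v2.json, and so on inside a memory/ directory; `latest_memory_version` and `should_stop` are hypothetical names, and the real hook may work differently.

```python
import re
import tempfile
from pathlib import Path

_VERSION_FILE = re.compile(r"^v(\d+)\.json$")


def latest_memory_version(memory_dir):
    """Return the highest N among files named vN.json, or 0 if none exist."""
    memory_dir = Path(memory_dir)
    if not memory_dir.is_dir():
        return 0
    versions = [
        int(match.group(1))
        for path in memory_dir.iterdir()
        if (match := _VERSION_FILE.match(path.name))
    ]
    return max(versions, default=0)


def should_stop(memory_dir, max_iterations):
    """Assumed rule: stop once the latest version number exceeds the limit."""
    return latest_memory_version(memory_dir) > max_iterations


# Demo in a throwaway directory with three recorded iterations.
with tempfile.TemporaryDirectory() as workspace:
    memory = Path(workspace) / "memory"
    memory.mkdir()
    for version in (1, 2, 3):
        (memory / f"v{version}.json").write_text("{}")
    stop_at_2 = should_stop(memory, max_iterations=2)
    stop_at_3 = should_stop(memory, max_iterations=3)

print(stop_at_2, stop_at_3)
```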
Ask the Agent to optimize a kernel with at least:
- platform: target hardware platform, such as H20 or MI308X.
- framework: target implementation framework, such as CuteDSL or FlyDSL.
- kernel_demo: path to the initial PyTorch logic or kernel implementation file.
Example:
/gpu-kernel-optimizer Optimize /path/to/kernel_demo.py on MI308X with FlyDSL, dtype bf16, rel_err < 0.01.
The Agent will initialize a workspace, source hardware specs from gpu-wiki, write the workspace configuration, build a baseline, profile the kernel, and iterate until the configured Stop Conditions are met.
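The rel_err < 0.01 tolerance in the example request implies a correctness check against a reference output. The README does not define rel_err, so the sketch below assumes one common choice, the L2-norm relative error ||out - ref|| / ||ref||. The helper name and the sample values are illustrative only.

```python
import math


def relative_error(candidate, reference, eps=1e-12):
    """L2-norm relative error between two equal-length sequences of floats.

    eps guards against division by zero when the reference is all zeros.
    """
    if len(candidate) != len(reference):
        raise ValueError("candidate and reference must have the same length")
    diff_norm = math.sqrt(sum((c - r) ** 2 for c, r in zip(candidate, reference)))
    ref_norm = math.sqrt(sum(r * r for r in reference))
    return diff_norm / max(ref_norm, eps)


TOLERANCE = 0.01  # from the example request: rel_err < 0.01

reference = [1.0, 2.0, 3.0]
close_output = [1.0, 2.0, 3.003]  # small rounding-style deviation
bad_output = [1.1, 2.0, 3.0]      # larger deviation in one element

close_passes = relative_error(close_output, reference) < TOLERANCE
bad_passes = relative_error(bad_output, reference) < TOLERANCE
print(close_passes, bad_passes)
```

For low-precision dtypes such as bf16, a norm-based metric like this tends to be less sensitive to single near-zero elements than an elementwise relative error, which is one reason it is a frequent choice.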
.
├── SKILL.md # Top-level gpu-kernel-optimizer Skill router
├── install.sh # Installer / uninstaller
├── docs/ # Detailed project design docs
├── reference/ # Workspace, plan, memory, and profiling templates
├── skills/ # Baseline, optimizer, restart, and output-contract Skills
├── tools/ # Profiling, utilization, memory, and measurement tools
└── gpu-wiki/ # Local GPU knowledge base
Licensed under the Apache License 2.0.

