Knowledge cutoff: 2026-05-25. See `data/refresh-cutoff.yaml`.
A structured knowledge base of AMD GPU kernel optimization for MI300X (CDNA3, gfx942) and MI350 / MI355 (CDNA4, gfx950), packaged as a Claude Code skill. The repository root is the skill directory.
Architecture and conventions mirror KernelWiki (Blackwell/Hopper). Initial content distilled from Apex/tools — skills/, jsons/{hip_sheet,triton_sheet,rocm}.json, and mcps/.
```bash
git clone <repo-url> ~/.claude/skills/RocmKernelWiki
pip install -r ~/.claude/skills/RocmKernelWiki/requirements.txt
```

Smoke test:

```bash
cd ~/.claude/skills/RocmKernelWiki
python3 scripts/query.py --tag mfma --type hardware --compact
python3 scripts/get_page.py hw-mfma-cdna3 --frontmatter-only
```

Optional override:

```bash
export ROCM_WIKI_ROOT=/path/to/RocmKernelWiki
```

- Hardware pages — MFMA (CDNA3 + CDNA4), SMFMAC + 4:2 sparsity, AccVGPR / dual register files, LDS + bank conflicts, GWS, buffer resource ops, async-copy direct-to-LDS, wave64, FP8/BF8/MX-FP4
- Technique pages — coalesced/vectorized access, grid-stride loops, software pipelining, LDS double-buffering, swizzling, register budgeting, occupancy tuning, persistent + stream-K kernels, kernel/epilogue fusion, autotune knobs (`num_stages`, `num_warps`, `matrix_instr_nonkdim`, `kpack`, `waves_per_eu`)
- Pattern pages — memory-bound / compute-bound / register-pressure / low-CU-utilization / lds-bank-conflict / lane-conflict diagnostics with candidate techniques
- Kernel case studies — GEMM (HIP MFMA, Triton-AMD, CK), FlashAttention on MI300/MI350, fused MoE, paged attention, RMSNorm
- Language guides — HIP C++, Triton ROCm backend, Composable Kernel, Gluon-AMD, FlyDSL
- Migration guides — CUDA → HIP, Triton-NVIDIA → Triton-AMD, WGMMA → MFMA, TMA → buffer-load + async-copy, MI300 → MI350 (gfx942 → gfx950)
- Source ledgers under `candidates/` — `triton-amd`, `vllm-rocm`, `sglang-rocm`, `aiter`, `magpie`, `flydsl`
- Auto-generated query indices under `queries/`

| Tool | Purpose |
|---|---|
| `scripts/query.py` | Unified search across all pages (keywords + filters + alias-aware) |
| `scripts/get_page.py` | Fetch any page by id or path; `--follow-sources` expands cited sources |
| `scripts/grep_wiki.py` | Regex text search across wiki bodies and source pages |
| `scripts/validate.py` | Schema/tag/source-id validator |
| `scripts/generate-indices.py` | Rebuilds `queries/*.md` from frontmatter |

- `SKILL.md` — Skill entry point: when to engage, query paths, output contract.
- `CLAUDE.md` — Extended schema + navigation reference.
- `references/primer.md` — Topic map.
- `references/schema.md` — Frontmatter schema.
- `references/examples.md` — Worked query patterns.
- `index.md` — Curated top-level index.

Three layers (the Karpathy LLM Wiki pattern, same as KernelWiki):
- `sources/` — raw data: PRs, vendor docs, blogs, contests.
- `wiki/` — synthesized pages cross-referenced by `id`. All have YAML frontmatter.
- `queries/` — auto-generated indices. Do not edit by hand.

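
To illustrate how the `queries/` layer can be derived from `wiki/` frontmatter, here is a minimal sketch. It is not the code in `scripts/generate-indices.py`. The field names (`id`, `tags`), the flat `key: value` parsing, the helper names, and the example page path are all assumptions made for illustration; real pages use full YAML and may be laid out differently.

```python
from collections import defaultdict


def parse_frontmatter(text):
    """Parse a leading '---' delimited block of flat 'key: value' lines.

    Assumption: this sketch handles only flat scalars and inline lists
    such as 'tags: [mfma, cdna3]', not arbitrary YAML.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'key: value'
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            value = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        meta[key.strip()] = value
    return meta


def build_tag_index(pages):
    """Group page ids by tag; `pages` maps a file path to its text."""
    index = defaultdict(list)
    for path, text in sorted(pages.items()):
        meta = parse_frontmatter(text)
        tags = meta.get("tags", [])
        if isinstance(tags, str):
            tags = [tags] if tags else []  # a single scalar tag
        for tag in tags:
            index[tag].append(meta.get("id", path))
    return dict(index)


# Hypothetical page content and path, for illustration only.
example_pages = {
    "wiki/hardware/hw-mfma-cdna3.md": (
        "---\n"
        "id: hw-mfma-cdna3\n"
        "type: hardware\n"
        "tags: [mfma, cdna3]\n"
        "---\n"
        "Body text.\n"
    ),
}
tag_index = build_tag_index(example_pages)
```

Because the index is a pure function of the frontmatter, regenerating it is safer than editing it by hand: any manual change would be overwritten on the next rebuild.
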
Supporting files:
- `data/schemas.yaml` — required/optional fields per page type
- `data/tags.yaml` — controlled vocabulary
- `data/aliases.yaml` — canonical → synonym mappings (e.g. `gfx950` ← `MI350`, `mfma` ← `matrix-fma`)
- `data/refresh-cutoff.yaml` / `data/tool-versions.yaml` — versioning

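
A canonical → synonym mapping can be inverted to make lookups alias-aware. The sketch below shows one way that might work. It is not the logic in `scripts/query.py`: the in-memory dict stands in for `data/aliases.yaml`, whose real layout may differ, the function names are hypothetical, and only the two example pairs mentioned above are included.

```python
# Stand-in for data/aliases.yaml (assumed shape: canonical -> list of synonyms).
ALIASES = {
    "gfx950": ["MI350"],
    "mfma": ["matrix-fma"],
}


def build_lookup(aliases):
    """Invert canonical -> synonyms into a case-insensitive term -> canonical map."""
    lookup = {}
    for canonical, synonyms in aliases.items():
        lookup[canonical.lower()] = canonical
        for synonym in synonyms:
            lookup[synonym.lower()] = canonical
    return lookup


def canonicalize(term, lookup):
    """Return the canonical form of a term, or the term unchanged if unknown."""
    return lookup.get(term.strip().lower(), term)


LOOKUP = build_lookup(ALIASES)
```

With this lookup, a query for `MI350` and a query for `gfx950` resolve to the same canonical tag before any filtering happens, while terms with no alias entry pass through unchanged.
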
- CDNA3/CDNA4-first. gfx942 + gfx950 are primary. gfx90a (CDNA2) only with explicit `cdna_relevance`. RDNA out of scope.
- Kernel-only. No RCCL / distributed-training / scheduler topics.
- First-class DSLs. HIP, Triton-AMD, Composable Kernel, Gluon-AMD, FlyDSL.
- English canonical.
Summaries and wiki syntheses are derivative works citing upstream PRs, docs, and the Apex/KernelWiki projects. Tooling (scripts/, data/, references/) is MIT-style.