Atrex-Bench

End-to-End Operator Generation Benchmark for Multiple DSLs

Production-trace driven · Executable PyTorch references · Source-aware metadata


Atrex-Bench measures whether an agent platform + model stack can turn a PyTorch reference into a DSL kernel that compiles, runs correctly, and approaches the achievable hardware peak (speed-of-light, SOL).

Every operator ships as a small, self-contained directory: an executable PyTorch Model, an input generator, a shape spec, and hidden source-aware metadata. The evaluator runs a candidate through three stages — compile → correctness → performance — and reports SOL efficiency against a cached roofline.
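As a rough illustration of how the three stages gate one another, the sketch below models the pipeline in plain Python. The names `EvalOutcome`, `evaluate`, and `sol_efficiency`, and the definition of SOL efficiency as `sol_time_ms / measured_time_ms`, are assumptions made for this example; they are not the evaluator's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class EvalOutcome:
    """Hypothetical summary of one candidate's run through the three stages."""
    compiled: bool
    correct: bool
    measured_time_ms: Optional[float]
    sol_efficiency: Optional[float]


def sol_efficiency(sol_time_ms: float, measured_time_ms: float) -> float:
    """Assumed definition: roofline-bound time divided by measured time."""
    if sol_time_ms <= 0 or measured_time_ms <= 0:
        raise ValueError("times must be positive")
    return sol_time_ms / measured_time_ms


def evaluate(
    compile_fn: Callable[[], Any],
    check_fn: Callable[[Any], bool],
    bench_fn: Callable[[Any], float],
    sol_time_ms: float,
) -> EvalOutcome:
    # Stage 1: compile. Any exception counts as a compile failure.
    try:
        kernel = compile_fn()
    except Exception:
        return EvalOutcome(False, False, None, None)

    # Stage 2: correctness. Performance is only measured for correct kernels.
    if not check_fn(kernel):
        return EvalOutcome(True, False, None, None)

    # Stage 3: performance, reported relative to the cached SOL time.
    measured = bench_fn(kernel)
    return EvalOutcome(True, True, measured, sol_efficiency(sol_time_ms, measured))
```

Under this assumed definition, a kernel that takes 2.0 ms against a 1.0 ms SOL bound scores 0.5, and a kernel that fails an earlier stage gets no performance number at all.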

Highlights

  • 30 operators, mostly derived from real production traces (vLLM / SGLang / AITER / rtp-llm).
  • 4 DSL backends out of the box: Triton, Gluon, FlyDSL, CuteDSL.
  • Multi-vendor GPUs: the same operators, references, and evaluator run on both AMD (ROCm) and NVIDIA (CUDA), with prebuilt images for each.
  • Three-stage evaluator: compile, numerical correctness, and performance vs. roofline SOL.
  • Generation harness that drives an LLM CLI (Claude Code / Codex) inside a scoped workspace.
  • Leak-resistant by design: a one-shot cleanup strips the checkout to the agent-visible surface, and provenance / roofline numbers are never staged for the agent.

Environment

Everything runs inside a container: a clean runtime with PyTorch, GPU-accelerated kernels, Node 22, and the Claude Code / Codex CLIs, but no project code. You git clone the repo inside the container (see Usage), so the code you run is always your own checkout, with no host mounts.

We provide prebuilt images for both AMD and NVIDIA GPUs:

| Platform | Image |
| --- | --- |
| AMD (ROCm) | treinfra/atrex-bench:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.10.0 |
| NVIDIA (CUDA) | treinfra/atrex-bench:2.10.0-cuda12.8-cudnn9-devel |

1 · Pull or build the image for your GPU platform:

# AMD GPU (ROCm)
docker pull treinfra/atrex-bench:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.10.0
# or build from source:
docker build -t atrex-bench:rocm -f docker/Dockerfile.rocm .

# NVIDIA GPU (CUDA)
docker pull treinfra/atrex-bench:2.10.0-cuda12.8-cudnn9-devel
# or build from source:
docker build -t atrex-bench:cuda -f docker/Dockerfile.cuda .

2 · Start the container. Everything under Usage — including your API key — runs inside this shell:

# AMD GPU
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri --group-add video \
  treinfra/atrex-bench:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.10.0

# NVIDIA GPU
docker run -it --rm \
  --gpus all \
  treinfra/atrex-bench:2.10.0-cuda12.8-cudnn9-devel

Usage

Everything below runs inside the container you started above (Python ≥ 3.10 and PyTorch already ship in the image). The example uses the ROCm image; on the CUDA image the commands are the same, except that you skip source /opt/venv/bin/activate, because the CUDA image ships PyTorch in the system Python.

1 · Configure the agent CLI

Point Claude Code at your API key. Append this to ~/.bashrc — fill in your own ANTHROPIC_API_KEY (and set ANTHROPIC_BASE_URL only if you go through a gateway instead of the official API):

case ":$PATH:" in
  *":$HOME/.local/bin:"*) ;;
  *) export PATH="$HOME/.local/bin:$PATH" ;;
esac
export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
export IS_SANDBOX=1
export ANTHROPIC_API_KEY="<YOUR_ANTHROPIC_API_KEY>"
# export ANTHROPIC_BASE_URL="<YOUR_API_BASE_URL>"   # optional: custom gateway
export ANTHROPIC_MODEL="<YOUR_MODEL>"
claude() {
  command claude --dangerously-skip-permissions --effort max "$@"
}

Reload it with source ~/.bashrc. (Prefer Codex? Export OPENAI_API_KEY instead and pass --cli codex below.)
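Before launching a long generation run, it can help to confirm that the variables above are actually set in the current shell. The following is a minimal sketch; `missing_env` and the `REQUIRED_ENV` mapping are hypothetical helpers, and treating `ANTHROPIC_MODEL` as required is an assumption based on the snippet above rather than a documented rule.

```python
import os
from typing import Dict, List, Mapping, Optional

# Variables each CLI is assumed to need, taken from the setup steps above.
REQUIRED_ENV: Dict[str, List[str]] = {
    "claude": ["ANTHROPIC_API_KEY", "ANTHROPIC_MODEL"],
    "codex": ["OPENAI_API_KEY"],
}


def missing_env(cli: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or empty for the given CLI."""
    if cli not in REQUIRED_ENV:
        raise ValueError(f"unknown cli: {cli!r}")
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[cli] if not env.get(name)]
```

Calling `missing_env("claude")` inside the container returns an empty list when the shell is configured as described, and the names of the missing variables otherwise.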

2 · Clone and run

Generation and evaluation run as separate sessions (e.g. separate pods). The generation agent has full filesystem access, so its pod must never hold the answer files (metadata.json, roofline.json, the evaluator); otherwise it can simply read the targets. You therefore clone the repo and strip those files in place before generating, while evaluation uses its own fresh, full clone.

Generate

git clone the repo, run the cleanup script, then generate:

git clone https://github.com/alibaba/atrex-bench.git && cd atrex-bench
source /opt/venv/bin/activate                    # activate the bundled venv (ROCm image)
python scripts/cleanup_for_generation.py        # strip the answer files in place
pip install -e . --no-deps --no-build-isolation  # deps already ship in the image

python scripts/run_generate.py \
  --operator attention_forward \
  --backend flydsl \
  --cli claude \
  --output-dir outputs/attention_forward_flydsl \
  --mirror-trace

cleanup_for_generation.py keeps only each operator's reference.py / input.py / shapes.json, the prompt/ templates, and the generation runner; it deletes metadata.json, roofline.json, configs/, the evaluator, tests/, the README, and the .git history, so provenance and SOL targets cannot leak. It rewrites the checkout in place — copy out your generated kernel before evaluating, and never run cleanup in an evaluation session.
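A quick sanity check before starting the agent is to scan the stripped checkout for anything that should have been removed. The sketch below is not part of the repository: `find_leaks` is a hypothetical helper, and it only looks for the file and directory names listed above (it does not know where the evaluator lives inside the package, so it cannot check for that).

```python
from pathlib import Path
from typing import List

# Names taken from the cleanup description above.
HIDDEN_FILE_NAMES = {"metadata.json", "roofline.json"}
HIDDEN_TOP_LEVEL = {"configs", "tests", ".git"}


def find_leaks(root: str) -> List[str]:
    """List answer files or directories still present under a stripped checkout."""
    root_path = Path(root)
    # Top-level directories that cleanup is described as deleting.
    leaks = [name for name in sorted(HIDDEN_TOP_LEVEL) if (root_path / name).exists()]
    # Hidden per-operator files anywhere in the tree.
    leaks += sorted(
        p.relative_to(root_path).as_posix()
        for p in root_path.rglob("*")
        if p.is_file() and p.name in HIDDEN_FILE_NAMES
    )
    return leaks
```

An empty result from `find_leaks(".")` suggests the cleanup ran; a non-empty one names what is still visible to the agent.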

  • Backends: triton, gluon, flydsl, cutedsl. CLI: --cli claude (default) or --cli codex.
  • Each run writes generated_kernel.py plus a generation.json bundle and a trace sidecar under the output directory.

Evaluate

Evaluation needs the full repo (metadata.json, roofline.json, configs/, and the evaluator), so run it in a separate session with a fresh clone you do not strip:

git clone https://github.com/alibaba/atrex-bench.git && cd atrex-bench
source /opt/venv/bin/activate                    # activate the bundled venv (ROCm image)
pip install -e . --no-deps --no-build-isolation  # deps already ship in the image

python scripts/run_eval.py \
  --input path/to/attention_forward_flydsl/generated_kernel.py \
  --reference-dir data/attention_forward \
  --output results/attention_forward_flydsl \
  --num-correctness-cases 5 \
  --bench-iters 3

  • --input: the candidate from the generate stage (a single file exposing class Model). Bring the generated_kernel.py you produced into this session and point --input at it.
  • --reference-dir: an operator directory under data/.
  • --num-correctness-cases / --bench-iters: correctness samples per shape (default 1) and timed perf iterations (default 100).
  • The output directory archives every input artifact plus eval_result.json.
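When several kernels need evaluating, the invocation above can be assembled programmatically. This is a sketch only: `build_eval_command` and `run_eval` are hypothetical wrappers, they use just the flags documented above, and the paths passed in are whatever layout you chose when copying kernels into the evaluation session.

```python
import subprocess
import sys
from typing import List


def build_eval_command(
    kernel_path: str,
    operator: str,
    output_dir: str,
    num_correctness_cases: int = 5,
    bench_iters: int = 3,
) -> List[str]:
    """Assemble the run_eval.py command line shown above for one operator."""
    return [
        sys.executable, "scripts/run_eval.py",
        "--input", kernel_path,
        "--reference-dir", f"data/{operator}",
        "--output", output_dir,
        "--num-correctness-cases", str(num_correctness_cases),
        "--bench-iters", str(bench_iters),
    ]


def run_eval(kernel_path: str, operator: str, output_dir: str) -> int:
    """Run one evaluation from the repo root and return its exit code."""
    cmd = build_eval_command(kernel_path, operator, output_dir)
    return subprocess.run(cmd, check=False).returncode
```

Returning the exit code rather than raising lets a batch loop continue past a failing kernel and collect the results afterwards.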

Measure the torch.compile baseline instead of a candidate:

python scripts/run_eval.py --torch-compile --reference-dir data/attention_forward --output results/torch_compile

Repository Layout

atrex-bench/
├── configs/            # hardware SKU profiles (roofline peaks)
├── data/               # one directory per operator (+ data/README.md)
├── prompt/             # generation-stage prompt templates
├── scripts/            # CLI entrypoints: generate, cleanup, eval, roofline, trace
├── src/atrex_bench/    # package: generation runner, evaluator, roofline
└── tests/              # unit + end-to-end tests

Data Format

Each operator directory is self-describing:

data/<operator>/
├── reference.py    # class Model(nn.Module) — definition only
├── input.py        # _make_inputs(**kwargs) -> dict[str, Tensor]
├── shapes.json     # shape spec keyed by id (init_kwargs + input_kwargs)
├── metadata.json   # id, dtype, input/output dtypes, origin    (hidden from agent)
└── roofline.json   # cached W / Q / SOL_time_ms per device     (hidden from agent)

See data/README.md for the full schema and the data-maintenance workflow.
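The five-file contract can be checked mechanically. The sketch below is illustrative rather than the repository's own validation: `check_operator_dir` is a hypothetical helper, and it assumes shapes.json is a JSON object mapping each shape id to an entry with `init_kwargs` and `input_kwargs` keys, which is an inference from the tree above (data/README.md holds the real schema).

```python
import json
from pathlib import Path
from typing import List

AGENT_VISIBLE = ("reference.py", "input.py", "shapes.json")
HIDDEN = ("metadata.json", "roofline.json")


def check_operator_dir(op_dir: str, full_checkout: bool = True) -> List[str]:
    """Return a list of problems found in one data/<operator>/ directory."""
    path = Path(op_dir)
    problems: List[str] = []

    # A stripped (generation) checkout is expected to lack the hidden files.
    expected = AGENT_VISIBLE + (HIDDEN if full_checkout else ())
    for name in expected:
        if not (path / name).is_file():
            problems.append(f"missing {name}")

    shapes_file = path / "shapes.json"
    if shapes_file.is_file():
        shapes = json.loads(shapes_file.read_text())
        if not isinstance(shapes, dict):
            problems.append("shapes.json is not an object keyed by shape id")
        else:
            # Assumed layout: {shape_id: {"init_kwargs": ..., "input_kwargs": ...}}
            for shape_id, spec in shapes.items():
                for key in ("init_kwargs", "input_kwargs"):
                    if not isinstance(spec, dict) or key not in spec:
                        problems.append(f"shape {shape_id!r} lacks {key}")
    return problems
```

Passing `full_checkout=False` makes the same check usable inside a generation session, where only the agent-visible files should exist.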

Operators

data/ ships 30 operator directories, each following the same five-file contract above. Most are trace-derived (status: "trace_reference"); the rest are curated. data/operator_importance.json holds trace-based prioritization scores.

Full operator list
| Operator | id | dtype | Upstream | Status |
| --- | --- | --- | --- | --- |
| attention_forward | atrex_001 | bf16 | vllm.attention_forward_varlen | trace_reference |
| block_scaled_mm | atrex_002 | fp8_e4m3 | vllm.w8a8_triton_block_scaled_mm | curated |
| causal_conv1d | atrex_003 | bf16 | sglang.causal_conv1d_fn | trace_reference |
| chunk_delta_rule_output | atrex_004 | bf16 | sglang/vllm.chunk_fwd_o | trace_reference |
| chunk_gated_delta_rule_state | atrex_005 | bf16 | sglang/vllm.chunk_gated_delta_rule_fwd_h | trace_reference |
| fp8_blockscale_fused_moe | atrex_006 | fp8_e4m3 | aiter.fmoe_fp8_blockscale_g1u1 | trace_reference |
| fp8_dynamic_per_token_quant | atrex_007 | fp8_e4m3 | rtp-llm.dynamic_per_token_scaled_quant | trace_reference |
| fused_add_rms_norm | atrex_008 | bf16 | vllm.fused_add_rms_norm | trace_reference |
| fused_moe | atrex_009 | bf16 | vllm.fused_experts | curated |
| fused_qk_rmsnorm | atrex_010 | fp16 | rtp-llm.fusedQkRmsNorm | trace_reference |
| fused_qkv_rope | atrex_011 | fp16 | rtp-llm.add_fusedQKV_bias_transpose_prefill_kernel | trace_reference |
| fused_rmsnorm_quant | atrex_012 | fp8_e4m3 | aiter.rmsnorm2d_fwd_with_add_dynamicquant | trace_reference |
| gated_delta_rule_update | atrex_013 | bf16 | sglang/vllm.fused_sigmoid_gating_delta_rule_update | trace_reference |
| gated_rms_norm | atrex_014 | bf16 | sglang.rms_norm_gated | trace_reference |
| l2_norm | atrex_015 | bf16 | vllm.l2norm_fwd | trace_reference |
| layer_norm | atrex_016 | bf16 | vllm.layer_norm | trace_reference |
| linear_sigmoid_mul | atrex_017 | bf16 | sglang.sgl_kernel.fused_linear_sigmoid_mul | trace_reference |
| mla_decode_attention | atrex_018 | bf16 | aiter.mla_decode_stage1_asm_fwd | trace_reference |
| moe_align_block_size | atrex_019 | int32 | vllm.moe_align_block_size | trace_reference |
| moe_count_and_sort | atrex_020 | int32 | vllm.moe_count_and_sort_expert_tokens | trace_reference |
| moe_sum_reduce | atrex_021 | bf16 | sglang.moe_sum_reduce_triton | trace_reference |
| moe_topk_gating_softmax | atrex_022 | fp32 | vllm.moe_topk_gating_softmax | trace_reference |
| mrope | atrex_023 | bf16 | vllm/sglang.triton_mrope | trace_reference |
| paged_attention_decode | atrex_024 | bf16 | rtp-llm.paged_attention_rocm | trace_reference |
| per_token_group_quant_fp8 | atrex_025 | fp8_e4m3 | vllm.per_token_group_quant_fp8 | curated |
| reshape_and_cache | atrex_026 | bf16 | vllm.reshape_and_cache_flash | trace_reference |
| rms_norm | atrex_027 | bf16 | vllm.rms_norm | trace_reference |
| silu_and_mul | atrex_028 | bf16 | vllm.vllm_silu_and_mul | trace_reference |
| topk_filter | atrex_029 | fp32 | vllm / FlashInfer top-k masking | curated |
| unified_attention | atrex_030 | bf16 | vllm.unified_attention | curated |

When adding or reshaping operator data, keep shapes.json and roofline.json aligned and refresh SOL_time_ms through scripts/roofline.py.
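For intuition about what a refreshed SOL_time_ms represents, the classical roofline bound can be written in a few lines. This is a sketch under stated assumptions: it reads W as work in FLOPs and Q as memory traffic in bytes, takes the peaks as inputs, and the function name `sol_time_ms` and the example numbers are hypothetical. scripts/roofline.py may compute the value differently.

```python
def sol_time_ms(
    work_flops: float,
    traffic_bytes: float,
    peak_flops_per_s: float,
    peak_bytes_per_s: float,
) -> float:
    """Classical roofline lower bound on runtime, in milliseconds.

    Assumes W = work_flops and Q = traffic_bytes; the kernel cannot finish
    faster than the slower of its compute time and its memory time.
    """
    if peak_flops_per_s <= 0 or peak_bytes_per_s <= 0:
        raise ValueError("peaks must be positive")
    compute_s = work_flops / peak_flops_per_s
    memory_s = traffic_bytes / peak_bytes_per_s
    return max(compute_s, memory_s) * 1e3
```

Because the bound depends on the shape through W and Q, a change to shapes.json invalidates the cached value, which is why the two files need to stay aligned.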

License

Licensed under the Apache License 2.0.

Copyright 2026 Alibaba Group.
