End-to-End Operator Generation Benchmark for Multiple DSLs
Production-trace driven · Executable PyTorch references · Source-aware metadata
Atrex-Bench measures whether an agent platform + model stack can turn a PyTorch reference into a DSL kernel that compiles, runs correctly, and approaches the achievable hardware peak (speed-of-light, SOL).
Every operator ships as a small, self-contained directory: an executable PyTorch Model, an
input generator, a shape spec, and hidden source-aware metadata. The evaluator runs a candidate
through three stages — compile → correctness → performance — and reports SOL efficiency against
a cached roofline.
- 30 operators, mostly derived from real production traces (vLLM / SGLang / AITER / rtp-llm).
- 4 DSL backends out of the box: Triton, Gluon, FlyDSL, CuteDSL.
- Multi-vendor GPUs: the same operators, references, and evaluator run on both AMD (ROCm) and NVIDIA (CUDA), with prebuilt images for each.
- Three-stage evaluator: compile, numerical correctness, and performance vs. roofline SOL.
- Generation harness that drives an LLM CLI (Claude Code / Codex) inside a scoped workspace.
- Leak-resistant by design: a one-shot cleanup strips the checkout to the agent-visible surface, and provenance / roofline numbers are never staged for the agent.
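The three-stage flow listed above can be pictured as a gated pipeline in which a later stage runs only if the earlier one passed. The following is a minimal sketch under that assumption, not the evaluator's actual code: `StageResult`, `run_stages`, and the stand-in stage callables are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StageResult:
    name: str
    passed: bool
    detail: str = ""


def run_stages(
    stages: List[Tuple[str, Callable[[], Tuple[bool, str]]]]
) -> List[StageResult]:
    """Run stages in order and stop at the first one that fails."""
    results: List[StageResult] = []
    for name, stage in stages:
        try:
            passed, detail = stage()
        except Exception as exc:  # a crash counts as a failed stage
            passed, detail = False, f"{type(exc).__name__}: {exc}"
        results.append(StageResult(name, passed, detail))
        if not passed:
            break
    return results


# Stand-in stages (assumptions): a real evaluator would compile the kernel,
# compare it against the PyTorch reference, and time it here.
demo = run_stages([
    ("compile", lambda: (True, "ok")),
    ("correctness", lambda: (False, "max abs error above tolerance")),
    ("performance", lambda: (True, "not reached in this demo")),
])
for result in demo:
    print(result.name, result.passed, result.detail)
```

In this demo the performance stage is never executed because correctness fails first, which is the behaviour the gating assumption implies.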
Everything runs inside a container: a clean runtime with PyTorch, GPU-accelerated kernels, Node 22, and the Claude Code / Codex CLIs, but no project code.
You git clone the repo inside the container (see Usage), so the code you run is
always your own checkout, with no host mounts.
We provide prebuilt images for both AMD and NVIDIA GPUs:
| Platform | Image |
|---|---|
| AMD (ROCm) | treinfra/atrex-bench:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.10.0 |
| NVIDIA (CUDA) | treinfra/atrex-bench:2.10.0-cuda12.8-cudnn9-devel |
1 · Pull or build the image for your GPU platform:
# AMD GPU (ROCm)
docker pull treinfra/atrex-bench:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.10.0
# or build from source:
docker build -t atrex-bench:rocm -f docker/Dockerfile.rocm .
# NVIDIA GPU (CUDA)
docker pull treinfra/atrex-bench:2.10.0-cuda12.8-cudnn9-devel
# or build from source:
docker build -t atrex-bench:cuda -f docker/Dockerfile.cuda .
2 · Start the container. Everything under Usage, including your API key, runs inside this shell:
# AMD GPU
docker run -it --rm \
--device=/dev/kfd --device=/dev/dri --group-add video \
treinfra/atrex-bench:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.10.0
# NVIDIA GPU
docker run -it --rm \
--gpus all \
treinfra/atrex-bench:2.10.0-cuda12.8-cudnn9-devel
Everything below runs inside the container you started above (Python ≥ 3.10 and PyTorch
already ship in the image). The example uses the ROCm image; on the CUDA image the commands are
the same, except you skip source /opt/venv/bin/activate (the CUDA image ships PyTorch in the system Python).
Point Claude Code at your API key. Append this to ~/.bashrc — fill in your own
ANTHROPIC_API_KEY (and set ANTHROPIC_BASE_URL only if you go through a gateway instead of
the official API):
case ":$PATH:" in
*":$HOME/.local/bin:"*) ;;
*) export PATH="$HOME/.local/bin:$PATH" ;;
esac
export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
export IS_SANDBOX=1
export ANTHROPIC_API_KEY="<YOUR_ANTHROPIC_API_KEY>"
# export ANTHROPIC_BASE_URL="<YOUR_API_BASE_URL>" # optional: custom gateway
export ANTHROPIC_MODEL="<YOUR_MODEL>"
claude() {
command claude --dangerously-skip-permissions --effort max "$@"
}
Reload it with source ~/.bashrc. (Prefer Codex? Export OPENAI_API_KEY instead and pass
--cli codex below.)
Generation and evaluation run as separate sessions (e.g. separate pods). The generation
agent has full filesystem access, so its pod must never hold the answer files (metadata.json,
roofline.json, the evaluator) — otherwise it can just read the targets. So you clone the repo
and strip those files in place before generating; evaluation uses its own fresh, full clone.
git clone the repo, run the cleanup script, then generate:
git clone https://github.com/alibaba/atrex-bench.git && cd atrex-bench
source /opt/venv/bin/activate # activate the bundled venv (ROCm image)
python scripts/cleanup_for_generation.py # strip the answer files in place
pip install -e . --no-deps --no-build-isolation # deps already ship in the image
python scripts/run_generate.py \
--operator attention_forward \
--backend flydsl \
--cli claude \
--output-dir outputs/attention_forward_flydsl \
--mirror-trace
cleanup_for_generation.py keeps only each operator's reference.py / input.py /
shapes.json, the prompt/ templates, and the generation runner; it deletes
metadata.json, roofline.json, configs/, the evaluator, tests/, the README, and the
.git history, so provenance and SOL targets cannot leak. It rewrites the checkout in
place — copy out your generated kernel before evaluating, and never run cleanup in an
evaluation session.
- Backends: triton, gluon, flydsl, cutedsl. CLI: --cli claude (default) or --cli codex.
- Each run writes generated_kernel.py plus a generation.json bundle and a trace sidecar under the output directory.
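Before handing a workspace to the agent, it can be worth confirming that none of the answer files named above are still present. The sketch below is a hypothetical pre-flight check, not the repository's cleanup script: `find_leaks` and the two name sets are assumptions drawn from the description above, and the check does not cover the evaluator sources or the README.

```python
from pathlib import Path
import tempfile

# Names the text above says cleanup removes (assumed, partial list).
ANSWER_FILES = {"metadata.json", "roofline.json"}
ANSWER_DIRS = {"configs", "tests", ".git"}  # checked at the top level only


def find_leaks(root: Path) -> list[str]:
    """Return relative paths of answer files or directories still under root."""
    leaks = []
    for name in ANSWER_DIRS:
        if (root / name).is_dir():
            leaks.append(name + "/")
    for path in root.rglob("*"):
        if path.is_file() and path.name in ANSWER_FILES:
            leaks.append(path.relative_to(root).as_posix())
    return sorted(leaks)


# Demo on a throwaway tree that still holds two things the agent must not see.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    operator_dir = root / "data" / "rms_norm"
    operator_dir.mkdir(parents=True)
    for name in ("reference.py", "input.py", "shapes.json", "roofline.json"):
        (operator_dir / name).write_text("")
    (root / "configs").mkdir()
    leaks = find_leaks(root)
    print(leaks)
```

An empty list is the expected result on a correctly stripped checkout; anything else suggests cleanup was skipped or the files were copied back in.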
Evaluation needs the full repo (metadata.json, roofline.json, configs/, and the
evaluator), so run it in a separate session with a fresh clone you do not strip:
git clone https://github.com/alibaba/atrex-bench.git && cd atrex-bench
source /opt/venv/bin/activate # activate the bundled venv (ROCm image)
pip install -e . --no-deps --no-build-isolation # deps already ship in the image
python scripts/run_eval.py \
--input path/to/attention_forward_flydsl/generated_kernel.py \
--reference-dir data/attention_forward \
--output results/attention_forward_flydsl \
--num-correctness-cases 5 \
--bench-iters 3
- --input: the candidate from the generate stage (a single file exposing class Model). Bring the generated_kernel.py you produced into this session and point --input at it.
- --reference-dir: an operator directory under data/.
- --num-correctness-cases / --bench-iters: correctness samples per shape (default 1) and timed perf iterations (default 100).
- The output directory archives every input artifact plus eval_result.json.
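When several generated kernels need evaluating, the invocation above can be assembled programmatically. This is a small sketch that uses only the flags documented above. The helper name `build_eval_command`, the `outputs/<operator>_<backend>/` layout, and the job list are assumptions for illustration.

```python
from pathlib import Path


def build_eval_command(kernel: Path, operator: str, backend: str,
                       results_root: str = "results") -> list[str]:
    """Assemble a run_eval.py invocation from the documented flags."""
    return [
        "python", "scripts/run_eval.py",
        "--input", kernel.as_posix(),
        "--reference-dir", f"data/{operator}",
        "--output", f"{results_root}/{operator}_{backend}",
    ]


# Hypothetical layout: one <operator>_<backend>/generated_kernel.py per run.
jobs = [("attention_forward", "flydsl"), ("rms_norm", "triton")]
commands = [
    build_eval_command(
        Path("outputs") / f"{op}_{backend}" / "generated_kernel.py", op, backend
    )
    for op, backend in jobs
]
for cmd in commands:
    print(" ".join(cmd))
    # To execute, pass cmd to subprocess.run(cmd, check=True) inside the
    # evaluation session, from the root of the full (unstripped) clone.
```

Building the argument list separately from running it keeps the commands easy to log and review before any GPU time is spent.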
Measure the torch.compile baseline instead of a candidate:
python scripts/run_eval.py --torch-compile --reference-dir data/attention_forward --output results/torch_compile
atrex-bench/
├── configs/ # hardware SKU profiles (roofline peaks)
├── data/ # one directory per operator (+ data/README.md)
├── prompt/ # generation-stage prompt templates
├── scripts/ # CLI entrypoints: generate, cleanup, eval, roofline, trace
├── src/atrex_bench/ # package: generation runner, evaluator, roofline
└── tests/ # unit + end-to-end tests
Each operator directory is self-describing:
data/<operator>/
├── reference.py # class Model(nn.Module) — definition only
├── input.py # _make_inputs(**kwargs) -> dict[str, Tensor]
├── shapes.json # shape spec keyed by id (init_kwargs + input_kwargs)
├── metadata.json # id, dtype, input/output dtypes, origin (hidden from agent)
└── roofline.json # cached W / Q / SOL_time_ms per device (hidden from agent)
See data/README.md for the full schema and the data-maintenance workflow.
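For readers new to roofline targets, here is a minimal sketch of how a speed-of-light time and an efficiency ratio could be computed from the cached quantities. It assumes W is the operator's work in FLOPs, Q is the number of bytes moved to and from memory, and efficiency is SOL time divided by measured time. These readings, the function names, and the peak numbers are illustrative assumptions; the benchmark's actual definitions live in its roofline module and the hardware profiles under configs/.

```python
def sol_time_ms(work_flops: float, bytes_moved: float,
                peak_flops_per_s: float, peak_bytes_per_s: float) -> float:
    """Classic roofline bound: the slower of the compute and memory limits."""
    compute_s = work_flops / peak_flops_per_s
    memory_s = bytes_moved / peak_bytes_per_s
    return max(compute_s, memory_s) * 1e3


def sol_efficiency(sol_ms: float, measured_ms: float) -> float:
    """Fraction of the speed-of-light bound a measured kernel achieves."""
    if measured_ms <= 0:
        raise ValueError("measured_ms must be positive")
    return sol_ms / measured_ms


# Made-up numbers for a memory-bound operator on a made-up device.
W = 2e9            # FLOPs (assumed meaning of W)
Q = 4e8            # bytes moved (assumed meaning of Q)
PEAK_FLOPS = 1e14  # FLOP/s, hypothetical
PEAK_BW = 2e12     # bytes/s, hypothetical

sol_ms = sol_time_ms(W, Q, PEAK_FLOPS, PEAK_BW)        # about 0.2 ms
efficiency = sol_efficiency(sol_ms, measured_ms=0.5)   # about 0.4
print(f"SOL {sol_ms:.3f} ms, efficiency {efficiency:.0%}")
```

Here the memory term dominates (about 0.2 ms versus 0.02 ms for compute), so the bound is set by bandwidth rather than arithmetic throughput.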
data/ ships 30 operator directories, each following the same five-file contract above.
Most are trace-derived (status: "trace_reference"); the rest are curated.
data/operator_importance.json holds trace-based prioritization scores.
Full operator list
| Operator | id | dtype | Upstream | Status |
|---|---|---|---|---|
| attention_forward | atrex_001 | bf16 | vllm.attention_forward_varlen | trace_reference |
| block_scaled_mm | atrex_002 | fp8_e4m3 | vllm.w8a8_triton_block_scaled_mm | curated |
| causal_conv1d | atrex_003 | bf16 | sglang.causal_conv1d_fn | trace_reference |
| chunk_delta_rule_output | atrex_004 | bf16 | sglang/vllm.chunk_fwd_o | trace_reference |
| chunk_gated_delta_rule_state | atrex_005 | bf16 | sglang/vllm.chunk_gated_delta_rule_fwd_h | trace_reference |
| fp8_blockscale_fused_moe | atrex_006 | fp8_e4m3 | aiter.fmoe_fp8_blockscale_g1u1 | trace_reference |
| fp8_dynamic_per_token_quant | atrex_007 | fp8_e4m3 | rtp-llm.dynamic_per_token_scaled_quant | trace_reference |
| fused_add_rms_norm | atrex_008 | bf16 | vllm.fused_add_rms_norm | trace_reference |
| fused_moe | atrex_009 | bf16 | vllm.fused_experts | curated |
| fused_qk_rmsnorm | atrex_010 | fp16 | rtp-llm.fusedQkRmsNorm | trace_reference |
| fused_qkv_rope | atrex_011 | fp16 | rtp-llm.add_fusedQKV_bias_transpose_prefill_kernel | trace_reference |
| fused_rmsnorm_quant | atrex_012 | fp8_e4m3 | aiter.rmsnorm2d_fwd_with_add_dynamicquant | trace_reference |
| gated_delta_rule_update | atrex_013 | bf16 | sglang/vllm.fused_sigmoid_gating_delta_rule_update | trace_reference |
| gated_rms_norm | atrex_014 | bf16 | sglang.rms_norm_gated | trace_reference |
| l2_norm | atrex_015 | bf16 | vllm.l2norm_fwd | trace_reference |
| layer_norm | atrex_016 | bf16 | vllm.layer_norm | trace_reference |
| linear_sigmoid_mul | atrex_017 | bf16 | sglang.sgl_kernel.fused_linear_sigmoid_mul | trace_reference |
| mla_decode_attention | atrex_018 | bf16 | aiter.mla_decode_stage1_asm_fwd | trace_reference |
| moe_align_block_size | atrex_019 | int32 | vllm.moe_align_block_size | trace_reference |
| moe_count_and_sort | atrex_020 | int32 | vllm.moe_count_and_sort_expert_tokens | trace_reference |
| moe_sum_reduce | atrex_021 | bf16 | sglang.moe_sum_reduce_triton | trace_reference |
| moe_topk_gating_softmax | atrex_022 | fp32 | vllm.moe_topk_gating_softmax | trace_reference |
| mrope | atrex_023 | bf16 | vllm/sglang.triton_mrope | trace_reference |
| paged_attention_decode | atrex_024 | bf16 | rtp-llm.paged_attention_rocm | trace_reference |
| per_token_group_quant_fp8 | atrex_025 | fp8_e4m3 | vllm.per_token_group_quant_fp8 | curated |
| reshape_and_cache | atrex_026 | bf16 | vllm.reshape_and_cache_flash | trace_reference |
| rms_norm | atrex_027 | bf16 | vllm.rms_norm | trace_reference |
| silu_and_mul | atrex_028 | bf16 | vllm.vllm_silu_and_mul | trace_reference |
| topk_filter | atrex_029 | fp32 | vllm / FlashInfer top-k masking | curated |
| unified_attention | atrex_030 | bf16 | vllm.unified_attention | curated |
When adding or reshaping operator data, keep shapes.json and roofline.json aligned and
refresh SOL_time_ms through scripts/roofline.py.
Licensed under the Apache License 2.0.
Copyright 2026 Alibaba Group.