Atrex-Bench

End-to-End Operator Generation Benchmark for Multiple DSLs

Production-trace driven · Executable PyTorch references · Source-aware metadata

Atrex-Bench measures whether an agent platform + model stack can turn a PyTorch reference into a DSL kernel that compiles, runs correctly, and approaches the achievable hardware peak (speed-of-light, SOL).

Every operator ships as a small, self-contained directory: an executable PyTorch Model, an input generator, a shape spec, and hidden source-aware metadata. The evaluator runs a candidate through three stages — compile → correctness → performance — and reports SOL efficiency against a cached roofline.
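
As a rough illustration of the SOL efficiency figure, the sketch below applies the standard roofline bound. It assumes that W is work in FLOPs and Q is memory traffic in bytes (the README only names the fields), and `peak_flops`, `peak_bandwidth`, and both function names are hypothetical. The evaluator's actual formula may differ.

```python
def sol_time_ms(work_flops: float, traffic_bytes: float,
                peak_flops: float, peak_bandwidth: float) -> float:
    """Roofline lower bound on runtime, in milliseconds.

    Assumption: W is FLOPs and Q is bytes moved; peaks are per second.
    """
    compute_s = work_flops / peak_flops        # time if compute-bound
    memory_s = traffic_bytes / peak_bandwidth  # time if bandwidth-bound
    return max(compute_s, memory_s) * 1e3


def sol_efficiency(sol_ms: float, measured_ms: float) -> float:
    """Fraction of speed-of-light achieved: 1.0 means the kernel hit the bound."""
    if measured_ms <= 0:
        raise ValueError("measured_ms must be positive")
    return sol_ms / measured_ms
```

Under these assumptions, a kernel measured at twice its SOL time scores an efficiency of 0.5.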

Highlights

30 operators, mostly derived from real production traces (vLLM / SGLang / AITER / rtp-llm).
4 DSL backends out of the box: Triton, Gluon, FlyDSL, CuteDSL.
Multi-vendor GPUs: the same operators, references, and evaluator run on both AMD (ROCm) and NVIDIA (CUDA), with prebuilt images for each.
Three-stage evaluator: compile, numerical correctness, and performance vs. roofline SOL.
Generation harness that drives an LLM CLI (Claude Code / Codex) inside a scoped workspace.
Leak-resistant by design: a one-shot cleanup strips the checkout to the agent-visible surface, and provenance / roofline numbers are never staged for the agent.
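
The three-stage flow can be pictured as a short-circuiting pipeline. This is a minimal sketch, not the evaluator's code: `StageResult` and `run_stages` are hypothetical names, and stopping at the first failed stage is an assumption about how the stages interact.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StageResult:
    stage: str
    passed: bool
    detail: str = ""


def run_stages(stages: list[tuple[str, Callable[[], bool]]]) -> list[StageResult]:
    """Run stages in order; stop after the first one that fails or raises."""
    results: list[StageResult] = []
    for name, check in stages:
        try:
            passed, detail = bool(check()), ""
        except Exception as exc:  # a crash counts as a failed stage
            passed, detail = False, f"{type(exc).__name__}: {exc}"
        results.append(StageResult(name, passed, detail))
        if not passed:
            break  # later stages are meaningless once one fails
    return results
```

A caller would pass three callables named compile, correctness, and performance; a candidate that fails correctness is never timed.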

Environment

Everything runs inside a container: a clean runtime with PyTorch, GPU-accelerated kernels, Node 22, and the Claude Code / Codex CLIs, but no project code. You clone the repo inside the container (see Usage), so the code you run is always your own checkout, with no host mounts.

We provide prebuilt images for both AMD and NVIDIA GPUs:

Platform	Image
AMD (ROCm)	`treinfra/atrex-bench:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.10.0`
NVIDIA (CUDA)	`treinfra/atrex-bench:2.10.0-cuda12.8-cudnn9-devel`

1 · Pull or build the image for your GPU platform:

# AMD GPU (ROCm)
docker pull treinfra/atrex-bench:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.10.0
# or build from source:
docker build -t atrex-bench:rocm -f docker/Dockerfile.rocm .

# NVIDIA GPU (CUDA)
docker pull treinfra/atrex-bench:2.10.0-cuda12.8-cudnn9-devel
# or build from source:
docker build -t atrex-bench:cuda -f docker/Dockerfile.cuda .

2 · Start the container. Every step under Usage, including setting your API key, runs inside this shell:

# AMD GPU
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri --group-add video \
  treinfra/atrex-bench:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.10.0

# NVIDIA GPU
docker run -it --rm \
  --gpus all \
  treinfra/atrex-bench:2.10.0-cuda12.8-cudnn9-devel

Usage

Everything below runs inside the container you started above (Python ≥ 3.10 and PyTorch already ship in the image). The example uses the ROCm image. On the CUDA image the commands are the same, except that you skip source /opt/venv/bin/activate, because the CUDA image ships PyTorch in the system Python.

1 · Configure the agent CLI

Point Claude Code at your API key. Append this to ~/.bashrc — fill in your own ANTHROPIC_API_KEY (and set ANTHROPIC_BASE_URL only if you go through a gateway instead of the official API):

case ":$PATH:" in
  *":$HOME/.local/bin:"*) ;;
  *) export PATH="$HOME/.local/bin:$PATH" ;;
esac
export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
export IS_SANDBOX=1
export ANTHROPIC_API_KEY="<YOUR_ANTHROPIC_API_KEY>"
# export ANTHROPIC_BASE_URL="<YOUR_API_BASE_URL>"   # optional: custom gateway
export ANTHROPIC_MODEL="<YOUR_MODEL>"
claude() {
  command claude --dangerously-skip-permissions --effort max "$@"
}

Reload it with source ~/.bashrc. (Prefer Codex? Export OPENAI_API_KEY instead and pass --cli codex below.)
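
Before launching a long generation run, it can help to confirm that the credential for the chosen CLI is actually set. The check below is a hypothetical preflight helper, not part of the repo; it relies only on the two variable names given above and treats an unfilled `<...>` placeholder as missing.

```python
import os
from typing import Mapping

# Credential each agent CLI expects, per the setup step above.
REQUIRED_KEY = {"claude": "ANTHROPIC_API_KEY", "codex": "OPENAI_API_KEY"}


def missing_credentials(cli: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the variable names that still need a real value for `cli`."""
    if cli not in REQUIRED_KEY:
        raise ValueError(f"unknown cli: {cli!r}")
    key = REQUIRED_KEY[cli]
    value = env.get(key, "")
    # An empty value or a leftover "<YOUR_...>" placeholder counts as missing.
    if not value or value.startswith("<"):
        return [key]
    return []
```

Calling `missing_credentials("claude")` in the container shell returns an empty list once ~/.bashrc has been reloaded with a real key.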

2 · Clone and run

Generation and evaluation run as separate sessions (e.g. separate pods). The generation agent has full filesystem access, so its pod must never hold the answer files (metadata.json, roofline.json, the evaluator) — otherwise it can just read the targets. So you clone the repo and strip those files in place before generating; evaluation uses its own fresh, full clone.

Generate

git clone the repo, run the cleanup script, then generate:

git clone https://github.com/alibaba/atrex-bench.git && cd atrex-bench
source /opt/venv/bin/activate                    # activate the bundled venv (ROCm image)
python scripts/cleanup_for_generation.py        # strip the answer files in place
pip install -e . --no-deps --no-build-isolation  # deps already ship in the image

python scripts/run_generate.py \
  --operator attention_forward \
  --backend flydsl \
  --cli claude \
  --output-dir outputs/attention_forward_flydsl \
  --mirror-trace

cleanup_for_generation.py keeps only each operator's reference.py / input.py / shapes.json, the prompt/ templates, and the generation runner; it deletes metadata.json, roofline.json, configs/, the evaluator, tests/, the README, and the .git history, so provenance and SOL targets cannot leak. It rewrites the checkout in place — copy out your generated kernel before evaluating, and never run cleanup in an evaluation session.
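
A quick way to double-check a stripped checkout is to scan it for the files named above. The helper below is an illustrative sketch and not part of the repo: `find_leaks` is a hypothetical name, and it covers only the items whose paths the README states (metadata.json, roofline.json, configs/, tests/, .git), not the evaluator or the README.

```python
from pathlib import Path

HIDDEN_FILES = {"metadata.json", "roofline.json"}  # per-operator answer files
HIDDEN_DIRS = {"configs", "tests", ".git"}         # top-level directories


def find_leaks(root: Path) -> list[str]:
    """List answer files or directories still present under `root`."""
    leaks: list[str] = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if path.is_file() and path.name in HIDDEN_FILES:
            leaks.append(rel.as_posix())
        elif path.is_dir() and len(rel.parts) == 1 and path.name in HIDDEN_DIRS:
            leaks.append(rel.as_posix() + "/")
    return leaks
```

An empty result from `find_leaks(Path("."))` after cleanup suggests the generation workspace holds none of the listed files.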

Backends: triton, gluon, flydsl, cutedsl. CLI: --cli claude (default) or --cli codex.
Each run writes generated_kernel.py plus a generation.json bundle and a trace sidecar under the output directory.
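
To run the same operator across every backend, the invocation above can be assembled programmatically. This sketch uses only the flags shown in the example; `generate_command` is a hypothetical helper, and the `outputs/<operator>_<backend>` naming simply mirrors that example.

```python
import shlex

BACKENDS = ("triton", "gluon", "flydsl", "cutedsl")
CLIS = ("claude", "codex")


def generate_command(operator: str, backend: str, cli: str = "claude") -> list[str]:
    """Build the argv for one scripts/run_generate.py invocation."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    if cli not in CLIS:
        raise ValueError(f"unknown cli: {cli!r}")
    return [
        "python", "scripts/run_generate.py",
        "--operator", operator,
        "--backend", backend,
        "--cli", cli,
        "--output-dir", f"outputs/{operator}_{backend}",
        "--mirror-trace",
    ]


if __name__ == "__main__":
    # Print one shell-ready command per backend for a single operator.
    for backend in BACKENDS:
        print(shlex.join(generate_command("attention_forward", backend)))
```

Each returned list can be passed directly to `subprocess.run` from inside the stripped checkout.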

Evaluate

Evaluation needs the full repo (metadata.json, roofline.json, configs/, and the evaluator), so run it in a separate session with a fresh clone you do not strip:

git clone https://github.com/alibaba/atrex-bench.git && cd atrex-bench
source /opt/venv/bin/activate                    # activate the bundled venv (ROCm image)
pip install -e . --no-deps --no-build-isolation  # deps already ship in the image

python scripts/run_eval.py \
  --input path/to/attention_forward_flydsl/generated_kernel.py \
  --reference-dir data/attention_forward \
  --output results/attention_forward_flydsl \
  --num-correctness-cases 5 \
  --bench-iters 3

--input — the candidate from the generate stage (a single file exposing class Model). Bring the generated_kernel.py you produced into this session and point --input at it.
--reference-dir — an operator directory under data/.
--num-correctness-cases / --bench-iters — correctness samples per shape (default 1) and timed perf iterations (default 100).
The output directory archives every input artifact plus eval_result.json.

Measure the torch.compile baseline instead of a candidate:

python scripts/run_eval.py --torch-compile --reference-dir data/attention_forward --output results/torch_compile

Repository Layout

atrex-bench/
├── configs/            # hardware SKU profiles (roofline peaks)
├── data/               # one directory per operator (+ data/README.md)
├── prompt/             # generation-stage prompt templates
├── scripts/            # CLI entrypoints: generate, cleanup, eval, roofline, trace
├── src/atrex_bench/    # package: generation runner, evaluator, roofline
└── tests/              # unit + end-to-end tests

Data Format

Each operator directory is self-describing:

data/<operator>/
├── reference.py    # class Model(nn.Module) — definition only
├── input.py        # _make_inputs(**kwargs) -> dict[str, Tensor]
├── shapes.json     # shape spec keyed by id (init_kwargs + input_kwargs)
├── metadata.json   # id, dtype, input/output dtypes, origin    (hidden from agent)
└── roofline.json   # cached W / Q / SOL_time_ms per device     (hidden from agent)

See data/README.md for the full schema and the data-maintenance workflow.
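
To make the shape spec concrete, the sketch below walks a spec laid out as described: keyed by shape id, each entry carrying `init_kwargs` and `input_kwargs`. The example ids and field values (`hidden_size`, `num_tokens`) are invented for illustration, and the full schema lives in data/README.md.

```python
import json
from typing import Any, Iterator

# Hypothetical shapes.json content; real operators define their own keys.
EXAMPLE_SHAPES = """
{
  "small": {"init_kwargs": {"hidden_size": 1024}, "input_kwargs": {"num_tokens": 16}},
  "large": {"init_kwargs": {"hidden_size": 4096}, "input_kwargs": {"num_tokens": 512}}
}
"""


def iter_shape_cases(spec_text: str) -> Iterator[tuple[str, dict[str, Any], dict[str, Any]]]:
    """Yield (shape_id, init_kwargs, input_kwargs) for each entry in the spec."""
    spec = json.loads(spec_text)
    for shape_id, entry in spec.items():
        yield shape_id, entry.get("init_kwargs", {}), entry.get("input_kwargs", {})


if __name__ == "__main__":
    for shape_id, init_kwargs, input_kwargs in iter_shape_cases(EXAMPLE_SHAPES):
        # Assumed usage: Model(**init_kwargs) and _make_inputs(**input_kwargs).
        print(shape_id, init_kwargs, input_kwargs)
```

Presumably a harness would build `Model(**init_kwargs)` from reference.py and feed it the dict returned by `_make_inputs(**input_kwargs)`, though the exact call convention is an assumption here.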

Operators

data/ ships 30 operator directories, each following the same five-file contract above. Most are trace-derived (status: "trace_reference"); the rest are curated. data/operator_importance.json holds trace-based prioritization scores.

Full operator list

Operator	id	dtype	Upstream	Status
`attention_forward`	`atrex_001`	`bf16`	`vllm.attention_forward_varlen`	`trace_reference`
`block_scaled_mm`	`atrex_002`	`fp8_e4m3`	`vllm.w8a8_triton_block_scaled_mm`	`curated`
`causal_conv1d`	`atrex_003`	`bf16`	`sglang.causal_conv1d_fn`	`trace_reference`
`chunk_delta_rule_output`	`atrex_004`	`bf16`	`sglang/vllm.chunk_fwd_o`	`trace_reference`
`chunk_gated_delta_rule_state`	`atrex_005`	`bf16`	`sglang/vllm.chunk_gated_delta_rule_fwd_h`	`trace_reference`
`fp8_blockscale_fused_moe`	`atrex_006`	`fp8_e4m3`	`aiter.fmoe_fp8_blockscale_g1u1`	`trace_reference`
`fp8_dynamic_per_token_quant`	`atrex_007`	`fp8_e4m3`	`rtp-llm.dynamic_per_token_scaled_quant`	`trace_reference`
`fused_add_rms_norm`	`atrex_008`	`bf16`	`vllm.fused_add_rms_norm`	`trace_reference`
`fused_moe`	`atrex_009`	`bf16`	`vllm.fused_experts`	`curated`
`fused_qk_rmsnorm`	`atrex_010`	`fp16`	`rtp-llm.fusedQkRmsNorm`	`trace_reference`
`fused_qkv_rope`	`atrex_011`	`fp16`	`rtp-llm.add_fusedQKV_bias_transpose_prefill_kernel`	`trace_reference`
`fused_rmsnorm_quant`	`atrex_012`	`fp8_e4m3`	`aiter.rmsnorm2d_fwd_with_add_dynamicquant`	`trace_reference`
`gated_delta_rule_update`	`atrex_013`	`bf16`	`sglang/vllm.fused_sigmoid_gating_delta_rule_update`	`trace_reference`
`gated_rms_norm`	`atrex_014`	`bf16`	`sglang.rms_norm_gated`	`trace_reference`
`l2_norm`	`atrex_015`	`bf16`	`vllm.l2norm_fwd`	`trace_reference`
`layer_norm`	`atrex_016`	`bf16`	`vllm.layer_norm`	`trace_reference`
`linear_sigmoid_mul`	`atrex_017`	`bf16`	`sglang.sgl_kernel.fused_linear_sigmoid_mul`	`trace_reference`
`mla_decode_attention`	`atrex_018`	`bf16`	`aiter.mla_decode_stage1_asm_fwd`	`trace_reference`
`moe_align_block_size`	`atrex_019`	`int32`	`vllm.moe_align_block_size`	`trace_reference`
`moe_count_and_sort`	`atrex_020`	`int32`	`vllm.moe_count_and_sort_expert_tokens`	`trace_reference`
`moe_sum_reduce`	`atrex_021`	`bf16`	`sglang.moe_sum_reduce_triton`	`trace_reference`
`moe_topk_gating_softmax`	`atrex_022`	`fp32`	`vllm.moe_topk_gating_softmax`	`trace_reference`
`mrope`	`atrex_023`	`bf16`	`vllm/sglang.triton_mrope`	`trace_reference`
`paged_attention_decode`	`atrex_024`	`bf16`	`rtp-llm.paged_attention_rocm`	`trace_reference`
`per_token_group_quant_fp8`	`atrex_025`	`fp8_e4m3`	`vllm.per_token_group_quant_fp8`	`curated`
`reshape_and_cache`	`atrex_026`	`bf16`	`vllm.reshape_and_cache_flash`	`trace_reference`
`rms_norm`	`atrex_027`	`bf16`	`vllm.rms_norm`	`trace_reference`
`silu_and_mul`	`atrex_028`	`bf16`	`vllm.vllm_silu_and_mul`	`trace_reference`
`topk_filter`	`atrex_029`	`fp32`	`vllm` / FlashInfer top-k masking	`curated`
`unified_attention`	`atrex_030`	`bf16`	`vllm.unified_attention`	`curated`

When adding or reshaping operator data, keep shapes.json and roofline.json aligned and refresh SOL_time_ms through scripts/roofline.py.

License

Licensed under the Apache License 2.0.
