
Add mixed-batch inference benchmark to perf_check#2050

Open
misko wants to merge 2 commits into main from mixed-perf-check

@misko misko commented Jun 24, 2026

Contributor

Summary

Adds a MixedPerfCheckRunner that extends perf_check to benchmark inference across varying batch sizes (4, 8, 16, 32, 64, 128, 256) over a diverse pool of systems spanning all 5 UMA tasks (oc20, omat, omol, odac, omc) at multiple size buckets.

  • Ground truth, computed once, cached on disk. Per-system fp64 predictions are generated using the existing BASELINE_SETTINGS and run_inference, then stored in mixed_baseline_cache.json keyed by checkpoint, pool signature, device, seed, and batch sizes. Subsequent runs reuse the cache.
  • Pre-materialized schedule, no adjacent duplicates. Batches are constructed up front via a deterministic round-robin through batch_sizes, so warmup and benchmark walk the same prepared sequence. The schedule also guarantees that no two consecutive batches share the same (size, sorted_indices) key, i.e. the same multiset of systems.
  • Per-system accuracy comparison. Each batched forward pass is split per-system (using natoms offsets) and compared to the cached fp64 baseline, so the accuracy measurement does not depend on batch composition.
  • Diverse pool via fake_dataset.generate_structures. Reuses the same generator the training benchmark uses, so size distributions match production specs for each UMA task.
  • Two-table report. Throughput per batch size (steps, total time, samples/s, atoms/s) plus per-system accuracy (energy abs error, force MAE, force max).
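To make the schedule bullet concrete, here is a minimal sketch of a round-robin schedule with the no-adjacent-duplicate check. It is not the PR's build_batch_schedule: the function name, its signature, the retry limit, and the choice to sample indices with replacement are all assumptions made for illustration.

```python
from itertools import cycle
import random


def sketch_batch_schedule(pool_size, batch_sizes, n_batches, seed=0):
    """Hypothetical stand-in for build_batch_schedule (not the PR's code)."""
    if pool_size <= 0 or not batch_sizes or n_batches <= 0:
        return []
    rng = random.Random(seed)   # seeded, so the schedule is reproducible
    sizes = cycle(batch_sizes)  # round-robin over the requested batch sizes
    schedule = []
    previous_key = None
    for _ in range(n_batches):
        size = next(sizes)
        for _attempt in range(100):  # assumed retry limit
            # Assumption: sample with replacement so a batch may exceed the pool.
            indices = [rng.randrange(pool_size) for _ in range(size)]
            key = (size, tuple(sorted(indices)))
            if key != previous_key:
                break
        else:
            raise ValueError("could not avoid an adjacent duplicate batch")
        schedule.append(indices)
        previous_key = key
    return schedule
```

Because the whole list is built before any forward pass, warmup and timed steps can iterate over the same object. With more than one batch size, adjacent batches already differ in size, so the duplicate check only matters when a single size is requested.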

Usage

# Default run
fairchem -c configs/uma/benchmark/perf_check/mixed_benchmark.yaml

# Compare against an optimized execution backend
fairchem -c configs/uma/benchmark/perf_check/mixed_benchmark.yaml \
  runner.inference_settings.execution_mode=umas_fast_gpu \
  runner.inference_settings.tf32=True

# Smaller pool for smoke runs
fairchem -c configs/uma/benchmark/perf_check/mixed_benchmark.yaml \
  runner.device=cpu 'runner.pool_size_buckets=[20,80]' runner.pool_n_per_bucket=1

Files

  • src/fairchem/core/components/benchmark/perf_check.py — adds MixedPerfCheckRunner, build_batch_schedule, run_mixed_inference, MixedInferenceResult, BatchTiming, format_mixed_report_table, _mixed_baseline_cache_key. Reuses existing _baseline_cache_key/_load_baseline_cache/_save_baseline_cache/run_inference/compare_results from the same file.
  • src/fairchem/core/components/benchmark/systems.py — adds SystemPool dataclass and get_diverse_benchmark_pool.
  • configs/uma/benchmark/perf_check/mixed_benchmark.yaml — Hydra config.
  • tests/core/components/benchmark/test_mixed_perf_check.py — 13 tests, all CPU, no real model weights required.

Test plan

  • pytest tests/core/components/benchmark/test_mixed_perf_check.py — 13/13 pass in ~15s
    • SystemPool signature stability and uniqueness invariant
    • get_diverse_benchmark_pool covers all requested UMA tasks and size buckets
    • Schedule determinism, round-robin order, index range, no-adjacent-duplicate invariant, empty-input handling
    • Cache key invalidation on pool change and on batch_sizes change
    • End-to-end runner with mocked predict unit: report written, return value matches disk, all batch sizes populated
    • Baseline cache reuse on second run (zero additional baseline calls)
    • Baseline cache invalidation when cache key is tampered
  • pytest tests/core/components/benchmark/ — full benchmark suite (22 passed, 2 skipped GPU-only smoke tests)
  • pre-commit run --files ... on every modified file
  • End-to-end run with a real UMA checkpoint on GPU (not run here; left to the reviewer/user. The mocked tests only confirm the harness wiring)
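The cache-key tests above check that changing the pool or batch_sizes invalidates the cached baseline. One way such a key could be derived is sketched below. This is an illustration only, not the PR's _mixed_baseline_cache_key: the function name, the argument types, and the use of SHA-256 over JSON are assumptions.

```python
import hashlib
import json


def sketch_cache_key(checkpoint, pool_signature, device, seed, batch_sizes):
    """Hypothetical key builder; the field list follows the PR summary."""
    payload = {
        "checkpoint": checkpoint,
        "pool_signature": pool_signature,
        "device": device,
        "seed": seed,
        "batch_sizes": list(batch_sizes),
    }
    # sort_keys gives a stable serialization, so equal inputs hash equally.
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

Any change to one of the five fields produces a different key, so a stale entry is never looked up under the new configuration.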

Extends perf_check with MixedPerfCheckRunner that benchmarks varying
batch sizes (4, 8, 16, 32, 64, 128, 256) across a diverse pool of
systems spanning all 5 UMA tasks (oc20, omat, omol, odac, omc).

Ground truth is computed once per pool entry at fp64 using the existing
BASELINE_SETTINGS and cached on disk, mirroring the singleton runner.
Batches are pre-materialized via a deterministic schedule that
round-robins through batch sizes and forbids adjacent duplicate batches,
so warmup and benchmark phases draw from the same prepared sequence.
Per-system predictions are split out of each batched forward pass and
compared against the cached baseline.
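As a rough illustration of the per-system split, the sketch below cuts a concatenated force array at the cumulative natoms offsets and computes the two force metrics named in the report. It uses NumPy rather than the runner's actual tensors, and both function names are hypothetical.

```python
import numpy as np


def split_forces_per_system(forces, natoms):
    """Split an (N_total, 3) force array into one array per system."""
    # Cumulative atom counts mark where each system ends in the batch.
    offsets = np.cumsum(natoms)[:-1]
    return np.split(forces, offsets, axis=0)


def force_errors(pred, ref):
    """Force MAE and max absolute error for a single system."""
    diff = np.abs(np.asarray(pred) - np.asarray(ref))
    return {"force_mae": float(diff.mean()), "force_max": float(diff.max())}
```

Each slice can then be compared against the cached baseline for that pool entry, whichever batch the system happened to appear in.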

The pool reuses benchmark.fake_dataset.generate_structures so size
distributions match the training-benchmark datasets.

Usage:
  fairchem -c configs/uma/benchmark/perf_check/mixed_benchmark.yaml
@meta-cla meta-cla Bot added the cla signed label Jun 24, 2026
@misko misko added the minor (Minor version release) and enhancement (New feature or request) labels Jun 24, 2026
@misko misko requested a review from mshuaibii June 24, 2026 21:07
@lbluque lbluque self-requested a review June 25, 2026 00:29

@lbluque lbluque left a comment

Contributor

lgtm
