Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
38f655c
Add all-to-all graph parallel communication primitives
May 2, 2026
530b685
Integrate A2A graph parallel into eSCN-MD backbone
May 2, 2026
f8c8388
Add A2A GP mode to MD consistency test
May 2, 2026
c45cb75
Harden A2A GP: multi-system guard, defensive assertions, expand test …
May 5, 2026
360f38c
Refactor build_gp_context and add allgather index exchange variant
Jun 3, 2026
3a06f49
Remove allgather index exchange variant, keep 2xA2A
Jun 3, 2026
45e218e
Remove all_to_all_collect_compiled (dead code)
Jun 5, 2026
13754ca
Replace _validate_gp_mappings with simple assertions
Jun 5, 2026
37f7393
Use torch._assert_async for GPU-only validation (no host sync)
Jun 5, 2026
1d6fefc
Add graph parallel verification doc from rgao_a2a_comms
Jun 5, 2026
b129bfa
Remove send_info precomputation (dead code with radius_pbc v2)
Jun 5, 2026
7558df4
Make all GPContext fields required (no optional defaults)
Jun 5, 2026
c20b024
Remove torchrun entrypoint; use fairchem CLI for multi-GPU tests
Jun 5, 2026
256a144
Add GPU/NCCL tests for A2A graph parallel primitives
Jun 5, 2026
7112ab8
Skip GPU/NCCL tests when fewer than 2 GPUs available
Jun 5, 2026
50ab32d
Use CI env var to skip multi-GPU tests instead of device_count check
Jun 5, 2026
af68f80
Add full-model GP correctness test (no-GP vs allgather vs A2A)
Jun 5, 2026
6168d23
Add GP correctness runner for fairchem CLI
Jun 5, 2026
1a87db0
Revert "Add GP correctness runner for fairchem CLI"
Jun 5, 2026
cc71ad7
Simplify build_gp_context: use remote_mask, rename needed_from_ranks_…
Jun 5, 2026
0467a4e
Remove redundant remote_mask filter in build_gp_context
Jun 5, 2026
88c776e
Update GPContext docstring with missing field descriptions
Jun 5, 2026
6dc9708
Slim GPContext to runtime-only fields
Jun 5, 2026
fb7ed9c
Increase Gloo process group init timeout from 10s to 120s
Jun 5, 2026
3e8059b
Remove AABB halo optimization from A2A graph parallel
Jun 7, 2026
83025be
Revert timeout_hr addition in slurm benchmark config
Jun 8, 2026
7b1e3b4
Merge main into rgao_a2a_model after rgao_a2a_comms landed
Jun 9, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
314 changes: 314 additions & 0 deletions src/fairchem/core/common/parallelism/graph_parallel_verfication.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,314 @@
# A2A Graph Parallel — Test & Benchmark Runbook

All commands are run-ready. Copy-paste from the fairchem repo root:

```bash
cd /home/rgao/fairchem
source .venv/bin/activate
```

---

## 1. Correctness Tests

### 1a. GPContext unit tests (CPU, <1s)

```bash
pytest tests/core/common/parallelism/test_graph_parallel.py::TestBuildGPContext -v
```

5 tests: `test_basic_context_building`, `test_global_to_local_mapping`,
`test_no_cross_partition_edges`, `test_edge_split_indices`, `test_edge_split_no_remote_edges`.

### 1b. Distributed A2A primitives (CPU/Gloo, ~10s)

```bash
pytest tests/core/common/parallelism/test_graph_parallel.py -v -k "not TestBuildGPContext"
```

8 tests: forward equivalence (2 graph topologies), backward gradient, multi-rank (2 and 3),
spatial partition, compiled-vs-autograd (index_split + spatial).

### 1c. A2A embedding correctness (CPU/Gloo, ~30s)

```bash
pytest tests/core/common/parallelism/test_a2a_correctness.py -v
```

8 tests: correctness at 100/500 atoms × 2 strategies, consistency across graph sizes × 2,
1536-dim embeddings × 2.

### 1d. Full-model GPU correctness (4+ GPUs, ~5 min)

```bash
pytest tests/core/units/mlip_unit/test_predict.py::test_full_model_gp_correctness -v
```

27 tests: 3 atom counts (10, 50, 100) × 9 configs (no-GP, allgather, A2A-spatial,
A2A-index_split at 1/2/4 workers). Compares energy/forces/stress against single-GPU
reference (tol: energy/stress 5e-4, forces 1e-4). Skipped on CI (`CI=true`).

### 1e. Predict pipeline + MD consistency (CPU, PR3 branch, ~2 min)

```bash
pytest tests/core/units/mlip_unit/test_predict.py::test_parallel_predict_unit_cpu -v -k "spatial or index_split"
pytest tests/core/units/mlip_unit/test_predict.py::test_parallel_predict_unit_gpu -v -k "spatial or index_split"
pytest tests/core/units/mlip_unit/test_predict.py::test_parallel_predict_unit_batch -v -k "spatial"
pytest tests/core/units/mlip_unit/test_predict.py::test_merge_mole_md_consistency -v -k "spatial or index_split"
```

### 1f. All correctness at once

```bash
pytest tests/core/common/parallelism/ -v
```

---

## 2. Regression Tests

### 2a. Existing tests must pass unchanged

```bash
pytest tests/core/common/test_gp_utils.py -v
pytest tests/core/models/uma/test_escn_md.py -v
pytest tests/core/models/uma/test_compile.py -v
pytest tests/core/components/test_uma_speed_benchmark.py -v
pytest tests/core/units/mlip_unit/test_predict.py::test_parallel_predict_unit_cpu -v -k "not spatial and not index_split"
```

### 2b. All regression at once

```bash
pytest tests/core/common/test_gp_utils.py tests/core/models/uma/test_escn_md.py tests/core/models/uma/test_compile.py tests/core/components/test_uma_speed_benchmark.py -v
```

---

## 3. Performance Benchmarks

All benchmarks use `InferenceBenchRunner` with FCC crystal systems (deterministic),
`timeiters=100`, `repeats=5`, turbo mode (compile=True, tf32=True).
Results written to `benchmark_results.json` in the run directory under `/checkpoint/ocp/rgao/speed/`.

### 3a. 8-GPU single-node (local, ~10 min each)

```bash
# BL baseline
fairchem -c configs/uma/speed/uma-speed.yaml \
job=local_8gpu \
runner.natoms_list=[8000,32000] \
runner.timeiters=100 \
runner.repeats=5

# A2A + spatial
fairchem -c configs/uma/speed/uma-speed.yaml \
job=local_8gpu \
runner.natoms_list=[8000,32000] \
runner.timeiters=100 \
runner.repeats=5 \
'+runner.overrides={backbone: {use_all_to_all_gp: true, gp_partition_strategy: spatial}}'
```

Expected: A2A within ±5% of BL (NVLink hides comm cost).

### 3b. 16-GPU 2-node (SLURM, ~15 min each)

```bash
# BL baseline
fairchem -c configs/uma/speed/uma-speed.yaml \
job=slurm \
job.scheduler.num_nodes=2 \
job.scheduler.ranks_per_node=8 \
job.graph_parallel_group_size=16 \
job.scheduler.slurm.qos=h200_dev \
job.scheduler.slurm.account=ocp \
runner.natoms_list=[16000,64000] \
runner.timeiters=100 \
runner.repeats=5

# A2A + spatial
fairchem -c configs/uma/speed/uma-speed.yaml \
job=slurm \
job.scheduler.num_nodes=2 \
job.scheduler.ranks_per_node=8 \
job.graph_parallel_group_size=16 \
job.scheduler.slurm.qos=h200_dev \
job.scheduler.slurm.account=ocp \
runner.natoms_list=[16000,64000] \
runner.timeiters=100 \
runner.repeats=5 \
'+runner.overrides={backbone: {use_all_to_all_gp: true, gp_partition_strategy: spatial}}'
```

Expected: A2A +2-3%.

### 3c. 32-GPU 4-node (SLURM, ~15 min each)

```bash
# BL baseline
fairchem -c configs/uma/speed/uma-speed.yaml \
job=slurm \
job.scheduler.num_nodes=4 \
job.scheduler.ranks_per_node=8 \
job.graph_parallel_group_size=32 \
job.scheduler.slurm.qos=h200_alignment_shared \
job.scheduler.slurm.account=ocp \
runner.natoms_list=[32000,128000] \
runner.timeiters=100 \
runner.repeats=5

# A2A + spatial
fairchem -c configs/uma/speed/uma-speed.yaml \
job=slurm \
job.scheduler.num_nodes=4 \
job.scheduler.ranks_per_node=8 \
job.graph_parallel_group_size=32 \
job.scheduler.slurm.qos=h200_alignment_shared \
job.scheduler.slurm.account=ocp \
runner.natoms_list=[32000,128000] \
runner.timeiters=100 \
runner.repeats=5 \
'+runner.overrides={backbone: {use_all_to_all_gp: true, gp_partition_strategy: spatial}}'
```

Expected: A2A +5-10%.

### 3d. 64-GPU 8-node — CANONICAL BENCHMARK (SLURM, ~30 min each)

```bash
# BL baseline
fairchem -c configs/uma/speed/uma-speed.yaml \
job=slurm \
job.scheduler.num_nodes=8 \
job.scheduler.ranks_per_node=8 \
job.graph_parallel_group_size=64 \
job.scheduler.slurm.qos=h200_alignment_shared \
job.scheduler.slurm.account=ocp \
runner.natoms_list=[64000,256000] \
runner.timeiters=100 \
runner.repeats=5

# A2A + spatial
fairchem -c configs/uma/speed/uma-speed.yaml \
job=slurm \
job.scheduler.num_nodes=8 \
job.scheduler.ranks_per_node=8 \
job.graph_parallel_group_size=64 \
job.scheduler.slurm.qos=h200_alignment_shared \
job.scheduler.slurm.account=ocp \
runner.natoms_list=[64000,256000] \
runner.timeiters=100 \
runner.repeats=5 \
'+runner.overrides={backbone: {use_all_to_all_gp: true, gp_partition_strategy: spatial}}'
```

Expected: A2A +10-20%.

### 3e. 128-GPU 16-node (SLURM, ~30 min each)

```bash
# BL baseline
fairchem -c configs/uma/speed/uma-speed.yaml \
job=slurm \
job.scheduler.num_nodes=16 \
job.scheduler.ranks_per_node=8 \
job.graph_parallel_group_size=128 \
job.scheduler.slurm.qos=h200_alignment_shared \
job.scheduler.slurm.account=ocp \
runner.natoms_list=[128000,512000] \
runner.timeiters=100 \
runner.repeats=5

# A2A + spatial
fairchem -c configs/uma/speed/uma-speed.yaml \
job=slurm \
job.scheduler.num_nodes=16 \
job.scheduler.ranks_per_node=8 \
job.graph_parallel_group_size=128 \
job.scheduler.slurm.qos=h200_alignment_shared \
job.scheduler.slurm.account=ocp \
runner.natoms_list=[128000,512000] \
runner.timeiters=100 \
runner.repeats=5 \
'+runner.overrides={backbone: {use_all_to_all_gp: true, gp_partition_strategy: spatial}}'
```

Expected: A2A +30-40%.

### 3f. 256-GPU 32-node (SLURM, ~45 min each)

```bash
# BL baseline
fairchem -c configs/uma/speed/uma-speed.yaml \
job=slurm \
job.scheduler.num_nodes=32 \
job.scheduler.ranks_per_node=8 \
job.graph_parallel_group_size=256 \
job.scheduler.slurm.qos=h200_alignment_shared \
job.scheduler.slurm.account=ocp \
runner.natoms_list=[256000,1024000] \
runner.timeiters=100 \
runner.repeats=5

# A2A + spatial
fairchem -c configs/uma/speed/uma-speed.yaml \
job=slurm \
job.scheduler.num_nodes=32 \
job.scheduler.ranks_per_node=8 \
job.graph_parallel_group_size=256 \
job.scheduler.slurm.qos=h200_alignment_shared \
job.scheduler.slurm.account=ocp \
runner.natoms_list=[256000,1024000] \
runner.timeiters=100 \
runner.repeats=5 \
'+runner.overrides={backbone: {use_all_to_all_gp: true, gp_partition_strategy: spatial}}'
```

Expected: A2A +60-80%.

---

## 4. Reading Results

```bash
# Find latest runs
ls -lt /checkpoint/ocp/rgao/speed/ | head -20

# Read a specific result
python3 -c "
import json, sys
with open(sys.argv[1]) as f:
d = json.load(f)
ws = d['config']['world_size']
for model, data in d['model_to_qps_data'].items():
print(f'Model: {model}, world_size={ws}')
for natoms, ns_day in data:
atoms_per_gpu = natoms // ws
print(f' {natoms:>7} atoms ({atoms_per_gpu}/gpu): {ns_day:.3f} ns/day')
" /checkpoint/ocp/rgao/speed/<RUN_DIR>/benchmark_results.json
```

---

## 5. Reference Results (H200 turbo)

### 4k atoms/rank (canonical):

| GPUs | BL ns/day | A2A ns/day | Speedup | BL WS Eff | A2A WS Eff |
|------|-----------|------------|---------|-----------|------------|
| 8 | 0.651 | 0.670 | +2.9% | 100% | 100% |
| 16 | 0.610 | 0.626 | +2.6% | 93.7% | 93.6% |
| 32 | 0.568 | 0.623 | +9.7% | 87.3% | 93.1% |
| 64 | 0.477 | 0.570 | +19.5% | 73.2% | 85.1% |
| 128 | 0.359 | 0.498 | +38.7% | 55.1% | 74.3% |
| 256 | 0.246 | 0.433 | +76.0% | 37.8% | 64.6% |

### 1k atoms/rank:

| GPUs | BL ns/day | A2A ns/day | Speedup |
|------|-----------|------------|---------|
| 8 | 2.088 | 2.010 | -3.7% |
| 16 | 2.043 | 1.990 | -2.6% |
| 32 | 1.769 | 1.848 | +4.5% |
| 64 | 1.488 | 1.626 | +9.3% |
Loading
Loading