Tests - Add LTP scripts to run module-level numerical tests by yzygitzh · Pull Request #79 · microsoft/ltp-megatron-lm

yzygitzh · 2025-07-26T08:27:59Z

Add LTP scripts to run module-level numerical tests, including:

Scripts to run and collect stats on different platforms, including NVIDIA H200 and AMD MI300X.
Script to compare stats between different platforms.

Copilot

Pull Request Overview

This PR adds LTP (Long-Term Performance) scripts to run module-level numerical tests across different hardware platforms, specifically targeting NVIDIA H200 and AMD MI300X GPUs. The scripts automate the collection of numerical test statistics and enable comparison between platforms to ensure computational consistency.

Scripts to execute and collect numerical test statistics on NVIDIA H200 and AMD MI300X platforms
Automated comparison functionality to analyze numerical differences between platforms
Support for running tests on multiple modules including attention, embedding, MLP, and others

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
run_numerical_tests_nvidia_h200_1n8g.sh	Script to run numerical tests on NVIDIA H200 platform with platform-specific environment setup
run_numerical_tests_amd_mi300x_1n8g.sh	Script to run numerical tests on AMD MI300X platform with ROCm and RCCL configurations
run_numerical_tests_platform_similarity.sh	Comparison script to analyze numerical similarity between NVIDIA H200 and AMD MI300X results

Comments suppressed due to low confidence (1)

tests/test_utils/ltp_scripts/run_numerical_tests_nvidia_h200_1n8g.sh:4

The version v1.1.4 for the grouped_gemm package may not exist. Please verify that this specific version tag exists in the repository before using it in the installation command.

pip install git+https://github.com/fanshiqing/grouped_gemm@v1.1.4

cp5555 · 2025-07-28T06:10:07Z

+  mkdir -p ${result_dir}/${1}/module_mean_and_std
+  for name in ${file_names}
+  do
+    for x in {0..19}


Should we make the number of runs configurable, e.g. the hard-coded 19?

cp5555 · 2025-07-28T06:11:39Z

+
+run_numerical_tests() {
+  # Get raw module test results
+  for x in {0..19}


For every occurrence of 19, should we use a parameter instead?
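A minimal sketch of what that parameterization could look like. `NUM_RUNS` and `run_indices` are hypothetical names, not taken from the PR. Note that bash brace expansion such as `{0..19}` is performed before variable expansion, so `{0..$n}` would not work; an explicit counter loop is used instead.

```shell
# Hypothetical parameter: NUM_RUNS is not a name from the PR.
# Defaults to 20 to match the original {0..19} range.
NUM_RUNS="${NUM_RUNS:-20}"

# Print the run indices 0 .. count-1, one per line.
run_indices() {
  idx=0
  while [ "$idx" -lt "$1" ]; do
    echo "$idx"
    idx=$((idx + 1))
  done
}

# Stand-in for the body of the original loops.
for x in $(run_indices "$NUM_RUNS"); do
  echo "run ${x} of ${NUM_RUNS}"
done
```

With this shape, something like `NUM_RUNS=5 bash <script>` would shorten a run without editing the script, and every loop that currently repeats `{0..19}` could share the same value.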

abuccts · 2025-07-29T00:13:10Z

+  # Calculate module mean and std
+  file_names=$(find ${result_dir}/${1}/module_test -type f -printf "%f\n" | sort | uniq)
+  mkdir -p ${result_dir}/${1}/module_mean_and_std
+  for name in ${file_names}
+  do
+    for x in {0..19}
+    do
+      echo "${result_dir}/${1}/module_test/${x}/${name}" >> ${result_dir}/${1}/module_mean_and_std/input_list.txt
+    done
+    python \
+      tests/numerical_tests/utils/module_mean_and_std.py \
+      --input-list ${result_dir}/${1}/module_mean_and_std/input_list.txt \
+      --output-mean-file ${result_dir}/${1}/module_mean_and_std/${name}.mean.pt \
+      --output-std-file ${result_dir}/${1}/module_mean_and_std/${name}.std.pt
+    rm ${result_dir}/${1}/module_mean_and_std/input_list.txt
+  done


Why not do the loop directly in a Python function? That would also avoid duplicating code between the AMD and NVIDIA shell scripts.
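The comment proposes moving the loop into Python. As a different, shell-only way to address just the duplication concern, the list-building step could live in one helper that both platform scripts source. This is a sketch under assumptions: `write_input_list`, its argument order, and the file name `out.pt` are hypothetical; only the path layout is taken from the diff above.

```shell
# Hypothetical shared helper, e.g. kept in a common file that both
# platform scripts source. Usage:
#   write_input_list RESULT_DIR MODULE NAME NUM_RUNS
# Prints one input path per run, following the layout in the diff:
#   RESULT_DIR/MODULE/module_test/RUN/NAME
# The function body runs in a subshell so its variables do not leak.
write_input_list() (
  result_dir="$1"
  module="$2"
  name="$3"
  num_runs="$4"
  x=0
  while [ "$x" -lt "$num_runs" ]; do
    echo "${result_dir}/${module}/module_test/${x}/${name}"
    x=$((x + 1))
  done
)

# Example call with a made-up file name; the caller would redirect the
# output to input_list.txt with ">" rather than appending line by line.
write_input_list ./numerical_test_results/nvidia_h200 attention out.pt 2
```

Writing the list with a single `>` redirect would also make the trailing `rm` of `input_list.txt` unnecessary for correctness, since a stale file could no longer be appended to.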

abuccts · 2025-07-29T00:13:17Z

+  # Calculate intra-module similarity
+  mkdir -p ${result_dir}/${1}/module_similarity
+  for name in ${file_names}
+  do
+    for x in {0..19}
+    do
+      for y in {0..19}


abuccts · 2025-07-29T00:13:57Z

+run_numerical_tests attention
+run_numerical_tests bda
+run_numerical_tests embedding
+run_numerical_tests layer_norm
+run_numerical_tests logits
+run_numerical_tests mlp
+run_numerical_tests rope


Add the script first and move these lines to the corresponding PRs?

abuccts · 2025-07-29T00:15:17Z

+  sleep 10
+}
+
+result_dir="./numerical_test_results/nvidia_h200"


Will there be any issues if two runs use the same directory? Maybe add the commit hash to the path.
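One way the commit-hash suggestion could be sketched. `make_result_dir` and the `<base>/<platform>/<commit>` layout are assumptions for illustration, not something the PR defines; `git rev-parse --verify --short HEAD` is the only external call.

```shell
# Hypothetical helper; the name and directory layout are not from the PR.
# Usage: make_result_dir BASE_DIR PLATFORM
# Prints BASE_DIR/PLATFORM/<short commit hash>, or .../unknown when the
# script is not run inside a git checkout (or git is unavailable).
# The function body runs in a subshell so its variables do not leak.
make_result_dir() (
  commit=$(git rev-parse --verify --short HEAD 2>/dev/null) || commit="unknown"
  echo "$1/$2/${commit}"
)

result_dir=$(make_result_dir ./numerical_test_results nvidia_h200)
echo "results would be written to: ${result_dir}"
```

Two runs from the same commit would still share a directory under this scheme; appending a timestamp or run ID would be one further option if that matters.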

abuccts · 2025-07-29T00:17:30Z

+    python \
+      tests/numerical_tests/utils/module_similarity.py \
+      --stats-a ${stats_dir_a}/${1}/module_mean_and_std/${name} \
+      --stats-b ${stats_dir_b}/${1}/module_mean_and_std/${name} \
+      --output-file ${result_dir}/${1}/module_similarity/${name}.json


What happens if there's a mismatch? There seems to be no assert in the code.
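A rough sketch of how the comparison script could fail loudly on a mismatch. The output format of `module_similarity.py` is not shown in this PR, so the sketch assumes a scalar similarity score has already been extracted from the JSON by some other step; `check_similarity`, the scores, and the threshold are all hypothetical.

```shell
# Hypothetical check: assumes a scalar similarity score per module has
# already been extracted; how to read it from the JSON is not shown here.
# Usage: check_similarity NAME SCORE THRESHOLD
# Returns 0 when SCORE >= THRESHOLD, otherwise reports to stderr and returns 1.
check_similarity() {
  # awk does the floating-point comparison; it exits 0 when the check passes.
  if awk -v s="$2" -v t="$3" 'BEGIN { exit !(s >= t) }'; then
    return 0
  fi
  echo "MISMATCH: $1 similarity $2 is below threshold $3" >&2
  return 1
}

# Made-up scores for illustration only.
failures=0
check_similarity attention 0.9999 0.999 || failures=$((failures + 1))
check_similarity mlp 0.95 0.999 || failures=$((failures + 1))
echo "failures: ${failures}"

# In a real script this would be followed by something like:
#   [ "$failures" -eq 0 ] || exit 1
```

Counting failures instead of exiting on the first one lets a single run report every mismatching module before the script returns a non-zero status.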

github-actions · 2025-09-27T18:22:20Z

Marking as stale. No activity in 60 days.

add ltp scripts

e2394c5

yzygitzh added the CI/CD label Jul 26, 2025

yzygitzh requested a review from a team as a code owner July 26, 2025 08:28

yzygitzh requested a review from Copilot July 28, 2025 03:25

Copilot AI reviewed Jul 28, 2025

Comment thread tests/test_utils/ltp_scripts/run_numerical_tests_nvidia_h200_1n8g.sh

Comment thread tests/test_utils/ltp_scripts/run_numerical_tests_amd_mi300x_1n8g.sh

Comment thread tests/test_utils/ltp_scripts/run_numerical_tests_platform_similarity.sh

cp5555 reviewed Jul 28, 2025

cp5555 approved these changes Jul 28, 2025

cp5555 requested a review from abuccts July 28, 2025 21:59

abuccts reviewed Jul 29, 2025

github-actions Bot added the stale label Sep 27, 2025

yzygitzh wants to merge 1 commit into dev from ziyue/pr-numerical-test-ltp

4 participants
