Tests - Add LTP scripts to run module-level numerical tests#79

Open
yzygitzh wants to merge 1 commit into dev from ziyue/pr-numerical-test-ltp

@yzygitzh

Contributor

Add LTP scripts to run module-level numerical tests, including:

  • Scripts to run and collect stats on different platforms, including NVIDIA H200 and AMD MI300X.
  • Script to compare stats between different platforms.

@yzygitzh yzygitzh added the CI/CD label Jul 26, 2025
@yzygitzh yzygitzh requested a review from a team as a code owner July 26, 2025 08:28
@yzygitzh yzygitzh requested a review from Copilot July 28, 2025 03:25

Copilot AI left a comment

Pull Request Overview

This PR adds LTP (Long-Term Performance) scripts to run module-level numerical tests across different hardware platforms, specifically targeting NVIDIA H200 and AMD MI300X GPUs. The scripts automate the collection of numerical test statistics and enable comparison between platforms to ensure computational consistency.

  • Scripts to execute and collect numerical test statistics on NVIDIA H200 and AMD MI300X platforms
  • Automated comparison functionality to analyze numerical differences between platforms
  • Support for running tests on multiple modules including attention, embedding, MLP, and others

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| run_numerical_tests_nvidia_h200_1n8g.sh | Script to run numerical tests on NVIDIA H200 platform with platform-specific environment setup |
| run_numerical_tests_amd_mi300x_1n8g.sh | Script to run numerical tests on AMD MI300X platform with ROCm and RCCL configurations |
| run_numerical_tests_platform_similarity.sh | Comparison script to analyze numerical similarity between NVIDIA H200 and AMD MI300X results |

Comments suppressed due to low confidence (1)

tests/test_utils/ltp_scripts/run_numerical_tests_nvidia_h200_1n8g.sh:4

  • The version v1.1.4 for the grouped_gemm package may not exist. Please verify that this specific version tag exists in the repository before using it in the installation command.
pip install git+https://github.com/fanshiqing/grouped_gemm@v1.1.4

Comment thread tests/test_utils/ltp_scripts/run_numerical_tests_amd_mi300x_1n8g.sh
mkdir -p ${result_dir}/${1}/module_mean_and_std
for name in ${file_names}
do
    for x in {0..19}

Should we make the number of runs configurable instead of hard-coding the upper bound (e.g., 19)?
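
One way to address this is sketched below. `NUM_RUNS` is a hypothetical environment variable that does not appear in the PR; its default of 20 matches the current `{0..19}` range. Bash performs brace expansion before variable expansion, so `{0..$n}` would not iterate as intended, which is why the sketch uses an arithmetic `for` loop instead.

```shell
#!/usr/bin/env bash
# NUM_RUNS is a hypothetical knob (not part of the PR); 20 runs matches {0..19}.
NUM_RUNS="${NUM_RUNS:-20}"

# Print the run indices 0 .. NUM_RUNS-1, one per line.
run_indices() {
    local x
    for ((x = 0; x < NUM_RUNS; x++)); do
        echo "${x}"
    done
}

for x in $(run_indices); do
    : # placeholder: the per-run test command from the script would go here
done
```

With this shape, a shorter smoke run could be requested as `NUM_RUNS=3 bash <script>` without editing the script.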


run_numerical_tests() {
    # Get raw module test results
    for x in {0..19}

For every hard-coded 19, should we use a parameter instead?

@cp5555 cp5555 requested a review from abuccts July 28, 2025 21:59
Comment on lines +57 to +72
# Calculate module mean and std
file_names=$(find ${result_dir}/${1}/module_test -type f -printf "%f\n" | sort | uniq)
mkdir -p ${result_dir}/${1}/module_mean_and_std
for name in ${file_names}
do
    for x in {0..19}
    do
        echo "${result_dir}/${1}/module_test/${x}/${name}" >> ${result_dir}/${1}/module_mean_and_std/input_list.txt
    done
    python \
        tests/numerical_tests/utils/module_mean_and_std.py \
        --input-list ${result_dir}/${1}/module_mean_and_std/input_list.txt \
        --output-mean-file ${result_dir}/${1}/module_mean_and_std/${name}.mean.pt \
        --output-std-file ${result_dir}/${1}/module_mean_and_std/${name}.std.pt
    rm ${result_dir}/${1}/module_mean_and_std/input_list.txt
done

Member

Why not do the loop directly in a Python function? That would also avoid duplicating code between the AMD and NVIDIA shell scripts.
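
The comment proposes moving the loop into Python. For comparison, a shell-only route to the same deduplication goal (a different technique from the one suggested) is a small helper that both platform scripts source. The function name `list_run_files` and the idea of a shared file are assumptions for illustration; neither appears in the PR.

```shell
#!/usr/bin/env bash
# Hypothetical shared helper, e.g. kept in one file sourced by both platform scripts.

# Emit one result path per run for a given module and file name.
# Arguments: result_dir, module, file name, number of runs.
list_run_files() {
    local result_dir="$1" module="$2" name="$3" num_runs="$4" x
    for ((x = 0; x < num_runs; x++)); do
        echo "${result_dir}/${module}/module_test/${x}/${name}"
    done
}

# Possible use inside the per-file loop (paths as in the PR):
# list_run_files "${result_dir}" "${1}" "${name}" 20 \
#     > "${result_dir}/${1}/module_mean_and_std/input_list.txt"
```

Writing the list with `>` rather than appending with `>>` overwrites it on each iteration, so the trailing `rm` of `input_list.txt` would no longer be needed for correctness.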

Comment on lines +73 to +79
# Calculate intra-module similarity
mkdir -p ${result_dir}/${1}/module_similarity
for name in ${file_names}
do
    for x in {0..19}
    do
        for y in {0..19}

Member

Same as above.

Comment on lines +95 to +101
run_numerical_tests attention
run_numerical_tests bda
run_numerical_tests embedding
run_numerical_tests layer_norm
run_numerical_tests logits
run_numerical_tests mlp
run_numerical_tests rope

Member

Could you add the script first and move these lines to the corresponding PRs?

sleep 10
}

result_dir="./numerical_test_results/nvidia_h200"

Member

Will there be any issues if two runs use the same directory? Maybe add a commit hash to the path.
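
A minimal sketch of the commit-hash idea follows. It assumes the script runs inside a git checkout and falls back to the literal `unknown` otherwise; the variable name `commit_hash` is hypothetical and not taken from the PR.

```shell
#!/usr/bin/env bash
# Short hash of the current commit; "unknown" if git or a checkout is unavailable.
commit_hash="$(git rev-parse --verify --short HEAD 2>/dev/null || echo unknown)"

# Same base path as in the script, with the hash appended as a subdirectory.
result_dir="./numerical_test_results/nvidia_h200/${commit_hash}"
echo "${result_dir}"
```

Two runs at the same commit would still share a directory, so a timestamp or run ID may also be worth appending if that case matters.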

Comment on lines +12 to +16
python \
    tests/numerical_tests/utils/module_similarity.py \
    --stats-a ${stats_dir_a}/${1}/module_mean_and_std/${name} \
    --stats-b ${stats_dir_b}/${1}/module_mean_and_std/${name} \
    --output-file ${result_dir}/${1}/module_similarity/${name}.json

Member

What happens if there's a mismatch? There seems to be no assert in the code.
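
The excerpt does not show what `module_similarity.py` writes or what should count as a numerical mismatch, so the sketch below covers only the structural case: stats present on one platform but missing on the other. It assumes that `--stats-a` and `--stats-b` are path prefixes to which `.mean.pt` and `.std.pt` are appended, as the mean/std step's output names suggest; `check_stats_pair` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Return non-zero (and report on stderr) if any expected stats file is missing.
# Assumes each prefix has a .mean.pt and a .std.pt file next to it.
check_stats_pair() {
    local prefix_a="$1" prefix_b="$2" f status=0
    for f in "${prefix_a}.mean.pt" "${prefix_a}.std.pt" \
             "${prefix_b}.mean.pt" "${prefix_b}.std.pt"; do
        if [[ ! -f "${f}" ]]; then
            echo "missing stats file: ${f}" >&2
            status=1
        fi
    done
    return "${status}"
}

# Possible use before the comparison call:
# check_stats_pair "${stats_dir_a}/${1}/module_mean_and_std/${name}" \
#                  "${stats_dir_b}/${1}/module_mean_and_std/${name}" || exit 1
```

A threshold check on the similarity values themselves would need to live wherever those values are computed, which this excerpt does not show.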

@github-actions

Marking as stale. No activity in 60 days.

@github-actions github-actions Bot added the stale label Sep 27, 2025

4 participants