fix(dsp): correct dspDestoryHandle to use cluster ID (MY_RANK % dsp_count) #7357

Merged: Cstandardlib merged 1 commit into deepmodeling:develop from Cstandardlib:fix/dsp-destroy-id on May 18, 2026

@Cstandardlib (Collaborator) commented May 18, 2026

Summary

Closes #7269
Fix DSP segfault during memory statistics on multi-node runs (issue #7269).

Details

dspDestoryHandle(GlobalV::MY_RANK) in driver_run.cpp:153 passes the raw MPI rank as the DSP cluster handle ID, whereas dspInitHandle() uses MY_RANK % dsp_count. When MY_RANK >= dsp_count, the call destroys a handle that was never initialized, which corrupts the heap. The corruption surfaces as a cfree() crash during Memory::print_all() at program exit.
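
The mismatch described above can be illustrated with a small model. This is only a sketch: `MockDspRuntime` and its methods are hypothetical names invented here, not part of ABACUS or the DSP runtime, and the bounds check in `destroy` exists only in the model so that the bad ID is reported instead of corrupting memory.

```cpp
#include <vector>

// Hypothetical stand-in for the DSP runtime: one handle slot per cluster.
// The real mtfunc::dspInitHandle / mtfunc::dspDestoryHandle are not modelled.
class MockDspRuntime {
public:
    // Assumes dsp_count > 0.
    explicit MockDspRuntime(int dsp_count) : initialized_(dsp_count, false) {}

    // Mirrors the init-side mapping described in the PR: MY_RANK % dsp_count.
    void init_for_rank(int my_rank) {
        initialized_[my_rank % count()] = true;
    }

    // Returns true only if cluster_id names a slot that was initialized.
    // This check is an addition of the model; an unchecked destroy on a
    // bad ID is what the PR identifies as the source of heap corruption.
    bool destroy(int cluster_id) {
        if (cluster_id < 0 || cluster_id >= count() || !initialized_[cluster_id]) {
            return false;
        }
        initialized_[cluster_id] = false;
        return true;
    }

    int count() const { return static_cast<int>(initialized_.size()); }

private:
    std::vector<bool> initialized_;
};
```

With `dsp_count = 4` and rank 5, `init_for_rank(5)` marks slot 1, so `destroy(5)` targets a slot that does not exist while `destroy(5 % 4)` succeeds.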

Change

Pass the cluster ID, not the raw MPI rank, when destroying the handle:

// Before (broken):
mtfunc::dspDestoryHandle(GlobalV::MY_RANK);

// After (fixed):
mtfunc::dspDestoryHandle(GlobalV::MY_RANK % PARAM.inp.dsp_count);
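
Because the same expression now has to appear at both the init and destroy sites, one option is to route both through a single helper. The following is a minimal sketch under that assumption; `dsp_cluster_id` is a hypothetical name, not a function in ABACUS, and the argument checks are an illustrative addition (a zero `dsp_count` would otherwise make the modulo undefined).

```cpp
#include <stdexcept>

// Hypothetical helper (not part of ABACUS): the single place where an MPI
// rank is mapped to a DSP cluster ID, so init and destroy cannot diverge.
inline int dsp_cluster_id(int my_rank, int dsp_count) {
    if (dsp_count <= 0) {
        // rank % 0 is undefined behaviour, so reject it explicitly.
        throw std::invalid_argument("dsp_count must be positive");
    }
    if (my_rank < 0) {
        throw std::invalid_argument("my_rank must be non-negative");
    }
    return my_rank % dsp_count;
}
```

For example, `dsp_cluster_id(5, 4)` yields 1, while a rank below `dsp_count` maps to itself.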

Note

Need to verify on DSP device.

…ount)

dspInitHandle uses MY_RANK % dsp_count but dspDestoryHandle used raw MY_RANK, causing heap corruption when MY_RANK >= dsp_count. Fixes issue deepmodeling#7269.
Copilot AI review requested due to automatic review settings May 18, 2026 02:20
Copilot AI left a comment

Pull request overview

This PR fixes DSP handle teardown by using the same cluster ID mapping used during DSP initialization, preventing invalid handle destruction on multi-node runs.

Changes:

  • Updates DSP finalization to call dspDestoryHandle() with GlobalV::MY_RANK % PARAM.inp.dsp_count.
  • Aligns DSP teardown with existing init and routing sites that already use modulo-based cluster IDs.
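
Keeping teardown aligned with init could also be enforced structurally by computing the cluster ID once and reusing it on destruction. The RAII wrapper below is a sketch of that idea, not what the PR implements: `ScopedDspHandle` is a hypothetical class, and the init and destroy operations are passed in as callbacks so the example stays self-contained rather than calling the real `mtfunc` functions.

```cpp
#include <functional>
#include <utility>

// Hypothetical RAII wrapper (not part of ABACUS). The cluster ID is computed
// once in the constructor and reused in the destructor, so the two calls
// always receive the same ID. Assumes dsp_count > 0 and my_rank >= 0.
class ScopedDspHandle {
public:
    ScopedDspHandle(int my_rank, int dsp_count,
                    const std::function<void(int)>& init,
                    std::function<void(int)> destroy)
        : cluster_id_(my_rank % dsp_count), destroy_(std::move(destroy)) {
        init(cluster_id_);
    }

    ~ScopedDspHandle() { destroy_(cluster_id_); }

    // Non-copyable: a copy would destroy the same handle twice.
    ScopedDspHandle(const ScopedDspHandle&) = delete;
    ScopedDspHandle& operator=(const ScopedDspHandle&) = delete;

    int cluster_id() const { return cluster_id_; }

private:
    int cluster_id_;
    std::function<void(int)> destroy_;
};
```

In real code the callbacks would presumably wrap the existing init and destroy calls; whether such a wrapper fits the lifetime of the DSP handle in driver_run.cpp is not something the PR discusses.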

@Cstandardlib Cstandardlib marked this pull request as draft May 18, 2026 02:25
@mohanchen added the "Refactor" (Refactor ABACUS codes), "The Absolute Zero" (Reduce the "entropy" of the code to 0), and "GPU & DCU & HPC" (GPU and DCU and HPC related any issues) labels and removed the "The Absolute Zero" label May 18, 2026
@Cstandardlib Cstandardlib marked this pull request as ready for review May 18, 2026 06:54
@Cstandardlib (Collaborator, Author) commented:

Verified on DSP.
Running with 8 DSP nodes no longer causes a segfault at program exit.

@Cstandardlib Cstandardlib merged commit 8e50659 into deepmodeling:develop May 18, 2026
19 checks passed
@Cstandardlib Cstandardlib deleted the fix/dsp-destroy-id branch May 18, 2026 06:57
Labels

GPU & DCU & HPC (GPU and DCU and HPC related any issues), Refactor (Refactor ABACUS codes)

Development

Successfully merging this pull request may close these issues.

[Bug] DSP Segmentation fault on MEMORY print for many nodes

3 participants