Fix MPS config-manager PROCESS_TO_SIGNAL target matching #1827

Open
justinas-wix wants to merge 1 commit into NVIDIA:main from justinas-wix:fix-mps-config-manager-signal-target

Conversation

justinas-wix commented Jun 2, 2026

Summary

Fixes mps-control-daemon-sidecar (config-manager) exiting when it sends SIGHUP after updating the MPS config symlink.

Problem

Helm sets PROCESS_TO_SIGNAL=/usr/bin/mps-control-daemon, but the container runs command: [mps-control-daemon], so argv[0] is mps-control-daemon.

findPidToSignal compares cmdline[0] to the target with an exact string match. With the values above the comparison never succeeds, so findPidToSignal returns a "no process found" error and the sidecar crashes (CrashLoopBackOff). This happens whenever SEND_SIGNAL runs after a real config update, including common startup races where the sidecar briefly applies DEFAULT_CONFIG before the node label is seen.
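To make the mismatch concrete, here is a minimal sketch of the comparison described above. It is not the actual findPidToSignal code: the helper names `argv0` and `exactMatch` are hypothetical, and the cmdline string is an assumed example of the NUL-separated payload Linux exposes in /proc/<pid>/cmdline.

```go
package main

import (
	"fmt"
	"strings"
)

// argv0 returns the first NUL-separated field of a /proc/<pid>/cmdline
// payload. Hypothetical helper, for illustration only.
func argv0(cmdline string) string {
	return strings.Split(cmdline, "\x00")[0]
}

// exactMatch mimics an exact string comparison between argv[0] and the
// configured target, which is the behavior the Problem section describes.
func exactMatch(cmdline, target string) bool {
	return argv0(cmdline) == target
}

func main() {
	// Assumed cmdline for a container started with command: [mps-control-daemon].
	cmdline := "mps-control-daemon\x00"

	// Target as set by Helm before this PR: the strings differ, so no match.
	fmt.Println(exactMatch(cmdline, "/usr/bin/mps-control-daemon"))

	// The bare name matches argv[0] exactly.
	fmt.Println(exactMatch(cmdline, "mps-control-daemon"))
}
```

The first comparison is the one that produces "no process found": the process exists, but its argv[0] carries no directory prefix.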

Fix

  • cmd/config-manager/main.go: match by exact string or by basename, so that mps-control-daemon and /usr/bin/mps-control-daemon are treated as the same target
  • Helm MPS DaemonSet: set PROCESS_TO_SIGNAL to mps-control-daemon
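The matching rule in the first bullet could look roughly like the sketch below. This is an illustration under assumptions, not the code from the PR: `matchesTarget` is a hypothetical name, and the empty-string guard is an added precaution because `filepath.Base("")` returns ".".

```go
package main

import (
	"fmt"
	"path/filepath"
)

// matchesTarget reports whether argv0 refers to the same program as target,
// either as an identical string or because both share the same basename.
// Hypothetical helper, for illustration only.
func matchesTarget(argv0, target string) bool {
	if argv0 == "" || target == "" {
		return false // avoid matching "." against "." for empty inputs
	}
	if argv0 == target {
		return true
	}
	return filepath.Base(argv0) == filepath.Base(target)
}

func main() {
	// Bare argv[0] against an absolute-path target, and the reverse.
	fmt.Println(matchesTarget("mps-control-daemon", "/usr/bin/mps-control-daemon"))
	fmt.Println(matchesTarget("/usr/bin/mps-control-daemon", "mps-control-daemon"))

	// An unrelated process name does not match.
	fmt.Println(matchesTarget("config-manager", "mps-control-daemon"))
}
```

One trade-off of basename matching is that a binary with the same name at a different path would also match. Within a single pod's shared process namespace that is usually acceptable, but it is worth keeping in mind if the helper is reused elsewhere.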

Related

Test plan

  • Deploy MPS control daemon on nodes with nvidia.com/device-plugin.config set to config-1 and config-2
  • Confirm mps-control-daemon-sidecar is 2/2 Ready after pod creation (no "no process found" error)
  • Verify readlink -f /config/config.yaml in mps-control-daemon-ctr matches the node label
  • Change node label between two valid configs and verify sidecar reloads without crash

copy-pr-bot Bot commented Jun 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

justinas-wix changed the title from "Fix MPS config-manager signal target in pod PID namespace" to "Fix MPS config-manager PROCESS_TO_SIGNAL target matching" on Jun 2, 2026
Match PROCESS_TO_SIGNAL against argv[0] basename so the sidecar can SIGHUP
mps-control-daemon when shareProcessNamespace is enabled (enableHostPID: false).
Use "mps-control-daemon" in the MPS control daemon Helm template instead of
"/usr/bin/mps-control-daemon".

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Justinas Koreiva <justinask@wix.com>
justinas-wix force-pushed the fix-mps-config-manager-signal-target branch from 34f3f81 to e432698 on June 2, 2026 13:31
@justinas-wix (Author)

Hey, I'm not sure who to tag for review, so I'm just trying:
@cdesiniotis @tariq1890

@tariq1890 (Contributor)

Thanks @justinas-wix! This looks good to me. Can you update your PR branch and also ensure your commits are signed?
