[cDAC] Copy all debuggee project-reference DLLs/PDBs into Helix symbols dir#128221
Conversation
|
Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag |
There was a problem hiding this comment.
Pull request overview
This PR updates the cDAC DumpTests Helix work item command script to stage all debuggee-produced managed binaries (and best-effort symbols) into the dumps/local/symbols/debuggees/<DebuggeeName>/ symbol search path, rather than copying only the primary <DebuggeeName>.dll.
Changes:
- Windows: copy
*.dllfrom each debuggee publish folder into the Helix symbols directory. - Windows: attempt to copy
*.pdb(best-effort) from each debuggee publish folder into the Helix symbols directory. - Unix: copy
*.dlland best-effort*.pdbfrom each debuggee publish folder into the Helix symbols directory.
…ls dir The Helix symbol-prep step in cdac-dump-helix.proj copies only `%(DebuggeeName)\%(DebuggeeName).dll` from each debuggee's publish output into the dumps/local/symbols/debuggees/<DebuggeeName>/ directory that ClrMD uses as a symbol search path. Any DLLs contributed by ProjectReferences (e.g. InterpreterStack's Trampoline.dll, which lives next to InterpreterStack.dll in the publish folder) are silently dropped, so when the cDAC opens an x-plat Windows dump and ClrMD has to reload that module's metadata from disk (Windows mini-dump heap mode does not capture read-only image pages), the load fails with VirtualReadException in EcmaMetadata_1.GetMetadataProvider. This manifested as 3 failing InterpreterStackDumpTests on every host running the windows-x64 and windows-arm64 dumps in the CdacXPlatDumpTests stage — linux dumps were unaffected because Linux core dumps include file-backed mappings, so ClrMD never had to fall back to disk. Switch the per-debuggee copy to `*.dll` (and best-effort `*.pdb`) from the debuggee publish folder so project-reference outputs are also captured. The publish output is framework-dependent (--no-self-contained) so only user DLLs/PDBs are present; the shared framework lives in the correlation payload and is not affected.
…issing PDBs
Windows: replace `copy ... 2>nul || (exit /b 0)` (which can terminate the
enclosing Helix batch script early) with `if exist ... copy ...` — a true
no-op when no PDB matches the wildcard.
Unix: replace `cp ... 2>/dev/null || true` and the previous attempt at a
`for ...; do ...; done` (which would have been split by the Arcade Helix
SDK SplitCommands on the bare `;`) with `find -maxdepth 1 -name *.pdb
-exec cp -t <dst> {} +` — naturally a no-op when find matches nothing, and
contains no semicolons.
- Windows: add `/Y` to both copy commands so a retried Helix work item that
reuses the same payload directory doesn't sit at an overwrite prompt.
- Unix: quote the find `-name` pattern (`'*.pdb'`) so the shell doesn't
pre-expand it before find runs, and replace GNU-only `cp -t <dest> {} +`
with the portable form `cp {} <dest>/ +` so this also works on
`TargetOS=osx` where `cp` is BSD and doesn't support `-t`.
ce7c930 to
ab086e0
Compare
The earlier `find ... -exec cp {} <dst>/ +` form failed in the
`runtime-diagnostics` pipeline on every Unix platform (linux-arm,
linux-arm64, linux-x64, osx-arm64, osx-x64) with a dash parser error:
/root/helix/work/correlation/scripts/.../execute.sh: 95:
Syntax error: word unexpected (expecting ")")
The Helix containers use `/bin/sh` -> dash, which doesn't tolerate the
shape of the `find` line as the Helix SDK ends up embedding it into
`execute.sh`.
Switch to the simpler `cp <src>/*.pdb <dst>/ 2>/dev/null || true`,
which is what the surrounding debuggee-invocation line already does
(see line ~123). The shell expands the glob; if no PDB matches, `cp`
prints to stderr (silenced) and exits non-zero (ignored). `|| true` is
safe in dash and does not affect the surrounding script's control flow.
This keeps the macOS / BSD portability the earlier review asked for
(no `cp -t`) without the shell-syntax issue.
|
Why do we need the PDBs? |
The interpreter test uses another DLL as a trampoline to get a stack that has multiple interpreter frames. This causes failures because the secondary DLLs pdb is not copied over. See the description for more info. |
What are you trying to read from the PDB (as opposed to the DLL)? |
The cdac and ClrMD read ECMA-335 metadata directly from the PE (`.dll`) — there is no managed source code / line-number assertion in these dump tests that would require `.pdb` files. The original code in PR dotnet#126385 (which set up this `symbols/` directory) intentionally only copied DLLs for the same reason. The actual bug this PR fixes is that `%(Identity)\%(Identity).dll` only picks up the primary DLL and silently drops any project-reference output that ends up in the same publish folder (e.g. `InterpreterStack/Trampoline.dll`). When the cdac later asked ClrMD to read metadata for `Trampoline.dll` from a Windows dump it couldn't find the file and threw `VirtualReadException`. Switching the glob to `*.dll` fixes that. The PDB-copy scaffolding I added in earlier commits has been the source of every CI failure on this PR (`|| (exit /b 0)` on Windows, `cp -t` portability, dash shell parsing) and is not needed. Drop it.
Ahh that is correct. We should be using the DLL only. Previously we weren't sending the other DLL. I'll narrow the scope of this. |
Note
Drafted with Copilot CLI.
Problem
The Helix symbol-prep step in
cdac-dump-helix.projcopies only<DebuggeeName>/<DebuggeeName>.dllfrom each debuggee's publish output into thedumps/local/symbols/debuggees/<DebuggeeName>/directory that ClrMD uses as a symbol search path. DLLs contributed byProjectReference(e.g.InterpreterStack'sTrampoline.dll, which lives next toInterpreterStack.dllin the publish folder) are silently dropped.When the cDAC opens an x-plat Windows dump and ClrMD has to reload that module's metadata from disk (Windows mini-dump heap mode does not capture read-only image pages), the load fails with
VirtualReadExceptioninEcmaMetadata_1.GetMetadataProvider:This manifested as 3 failing
InterpreterStackDumpTestson every host running thewindows-x64andwindows-arm64dumps in theCdacXPlatDumpTestsstage ofruntime-diagnostics. Linux dumps were unaffected because Linux core dumps include file-backed mappings, so ClrMD never had to fall back to disk.Fix
Broaden the per-debuggee copy step from
<DebuggeeName>.dllto*.dllso project-reference outputs (e.g.Trampoline.dll) are also captured. cDAC and ClrMD read ECMA-335 metadata directly from the PE, so only DLLs need to be copied — no PDBs are required.Also add
/Yto the Windowscopycommand to keep it non-interactive on a retry that reuses the same payload directory.The publish output is framework-dependent (
--no-self-contained) so only user DLLs are present; the shared framework lives in the correlation payload and is not affected by this change.Testing
Validation requires a manual
runtime-diagnosticsxplat pipeline run — local repro requires Helix infrastructure.Build 1421241 was an earlier validation run; a fresh run will be triggered on the latest commit before merging.