Commit 1e6e255

Max Charlamb and Copilot committed
[cdac] Stress framework cleanup + review feedback
Code changes:
* Eliminate fixed-size caps in the comparator pipeline:
  - MAX_COLLECTED_REFS -> SArray<StackRef> in the runtime promote callback, with EX_TRY/EX_CATCH around Append so an OOM doesn't escape the stack-walker callback's NOTHROW contract.
  - MAX_GROUPS / MAX_FRAMES -> SArray<FrameRefGroup> / SArray<FrameResult>.
  - MAX_REFS_PER_FRAME -> per-frame dispositions stored out-of-band in a shared SArray<RefDisposition>; cUsed/rUsed use NewArrayHolder. FrameResult drops the embedded 256-entry arrays (~512 bytes -> 80).
* CollectCdacStackRefs / CollectRuntimeStackRefs return HRESULT (was bool + side-channel logging / out-param overflow flag). Caller logs the hr in the [FAIL] line.
* Treat any cDAC/runtime collection failure (FAIL hr or RT OOM) as a hard [FAIL] rather than silently continuing with partial data.

Helix payload + harness:
* cdac-stress-helix.proj: drop the per-framework-version for-loop and CORE_ROOT/chmod plumbing. The Helix command is now a single `dotnet exec xunit.console.dll StressTests.dll` matching the cdac-dump-helix pattern.
* CdacStressTestBase.GetCoreRoot: walk HELIX_CORRELATION_PAYLOAD/shared/Microsoft.NETCore.App/<version>/ to discover CORE_ROOT.
* CdacStressTestBase: chmod +x corerun via File.SetUnixFileMode on non-Windows.

Doc fixes (from Copilot review feedback):
* Update `cDAC vs DAC vs runtime` references to `cDAC vs runtime` in runtime-diagnostics.yml + GcScanner.cs (runtime is the oracle).
* BasicCdacStressTests.cs: re-word `100% pass rate` summary (we tolerate [KNOWN_ISSUE]); fix stale build recipe.
* CdacStressTestBase.cs: re-word `GCStress may not have triggered` failure message (we use DOTNET_CdacStress now, not DOTNET_GCStress).
* known-issues.md: log-format example matches what cdacstress.cpp actually emits.

Comment hygiene in cdacstress.cpp:
* Trim file header, CDAC_LOG/CDAC_ERR doc blocks, forward-decl preambles, Phase A/B/C/D inline section banners.
* Drop FilterRefs passthrough wrapper.
* Net -90 lines in cdacstress.cpp comments + dead code; no functional changes from this section.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 93a1ccd commit 1e6e255
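
The commit message describes replacing fixed-size ref buffers with growable arrays while keeping the promote callback non-throwing, and reporting collection failure as an HRESULT. The cdacstress.cpp diff is not rendered on this page, so the following is only a rough standalone sketch of that pattern. It assumes `std::vector` in place of `SArray`, plain `try`/`catch` in place of `EX_TRY`/`EX_CATCH`, and hypothetical names (`RefCollector`, `OnPromote`, `FinishCollection`); it is not the runtime's code.

```cpp
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical stand-ins for illustration; not the runtime's real types.
using HResult = std::int32_t;
constexpr HResult kOk = 0;
constexpr HResult kFail = static_cast<HResult>(0x80004005u);        // E_FAIL
constexpr HResult kOutOfMemory = static_cast<HResult>(0x8007000Eu); // E_OUTOFMEMORY

struct StackRef {
    std::uintptr_t address;
    std::uintptr_t object;
    std::uint32_t flags;
};

struct RefCollector {
    std::vector<StackRef> refs; // grows on demand, so there is no fixed cap
    HResult status = kOk;       // the first failure is latched here
};

// Invoked once per reported ref. It must not throw, so an allocation failure
// is caught and recorded instead of propagating into the stack walker.
void OnPromote(RefCollector& collector, const StackRef& ref) noexcept {
    if (collector.status != kOk)
        return; // already failed; drop further refs
    try {
        collector.refs.push_back(ref);
    } catch (const std::bad_alloc&) {
        collector.status = kOutOfMemory;
    } catch (...) {
        collector.status = kFail;
    }
}

// Returns the latched status so the caller can report a hard failure
// instead of comparing against a partial ref set.
HResult FinishCollection(const RefCollector& collector) noexcept {
    return collector.status;
}
```

Latching the first failure keeps the callback cheap after an error and lets the caller decide how to log the result once the walk has finished.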

9 files changed

Lines changed: 259 additions & 426 deletions

eng/pipelines/runtime-diagnostics.yml

Lines changed: 1 addition & 1 deletion
@@ -321,7 +321,7 @@ extends:
 condition: always()

 #
-# cDAC GC Stress Tests — runs in-process cDAC vs DAC vs runtime stack-ref
+# cDAC GC Stress Tests — runs in-process cDAC vs runtime stack-ref
 # verification at GC stress points. Independent stage with its own build
 # so its status/failures don't get conflated with the dump tests.
 #

src/coreclr/inc/clrconfigvalues.h

Lines changed: 1 addition & 1 deletion
@@ -749,7 +749,7 @@ CONFIG_STRING_INFO(INTERNAL_PrestubHalt, W("PrestubHalt"), "")
 RETAIL_CONFIG_STRING_INFO(EXTERNAL_RestrictedGCStressExe, W("RestrictedGCStressExe"), "")
 RETAIL_CONFIG_DWORD_INFO(INTERNAL_CdacStressFailFast, W("CdacStressFailFast"), 0, "If nonzero, assert on cDAC/runtime GC ref mismatch during cDAC stress verification.")
 RETAIL_CONFIG_STRING_INFO(INTERNAL_CdacStressLogFile, W("CdacStressLogFile"), "Log file path for cDAC stress verification results.")
-RETAIL_CONFIG_DWORD_INFO(INTERNAL_CdacStress, W("CdacStress"), 0, "Enable cDAC stress verification. Bit flags: 0x1=alloc points.")
+RETAIL_CONFIG_DWORD_INFO(INTERNAL_CdacStress, W("CdacStress"), 0, "Enable cDAC stress verification. Bit flags: 0x1=alloc points, 0x200=verbose per-ref diagnostics.")
 CONFIG_DWORD_INFO(INTERNAL_ReturnSourceTypeForTesting, W("ReturnSourceTypeForTesting"), 0, "Allows returning the (internal only) source type of an IL to Native mapping for debugging purposes")
 RETAIL_CONFIG_DWORD_INFO(UNSUPPORTED_RSStressLog, W("RSStressLog"), 0, "Allows turning on logging for RS startup")
 CONFIG_DWORD_INFO(INTERNAL_SBDumpOnNewIndex, W("SBDumpOnNewIndex"), 0, "Used for Syncblock debugging. It's been a while since any of those have been used.")
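
The `CdacStress` value is a bit mask: the config description names `0x1` (alloc points) and `0x200` (verbose per-ref diagnostics). Below is a minimal sketch of decoding such a value. The enumerator and function names are hypothetical, and reading the text as hexadecimal is an assumption about how the setting is parsed; none of this is taken from the runtime.

```cpp
#include <cstdint>
#include <cstdlib>

// Bit values come from the CdacStress config description; names are hypothetical.
enum CdacStressFlags : std::uint32_t {
    kAllocPoints = 0x1,
    kVerbose = 0x200,
};

// Parses a config string such as "0x201" or "201" as hexadecimal (assumed).
// Returns 0 (stress disabled) for null, empty, or malformed input.
std::uint32_t ParseStressValue(const char* text) {
    if (text == nullptr || *text == '\0')
        return 0;
    char* end = nullptr;
    unsigned long value = std::strtoul(text, &end, 16);
    if (end == text || *end != '\0')
        return 0; // not a clean hex number
    return static_cast<std::uint32_t>(value);
}

constexpr bool VerifiesAtAllocPoints(std::uint32_t value) {
    return (value & kAllocPoints) != 0;
}

constexpr bool IsVerbose(std::uint32_t value) {
    return (value & kVerbose) != 0;
}
```

With this decoding, `0x001` selects alloc-point verification only, and `0x201` adds the verbose per-ref output on top of it.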

src/coreclr/vm/cdacstress.cpp

Lines changed: 187 additions & 389 deletions
Large diffs are not rendered by default.

src/native/managed/cdac/Microsoft.Diagnostics.DataContractReader.Contracts/Contracts/StackWalk/GC/GcScanner.cs

Lines changed: 1 addition & 1 deletion
@@ -332,7 +332,7 @@ private TargetPointer FindGCRefMap(TargetPointer indirection)
 /// </summary>
 /// <remarks>
 /// Not yet ported. Every call records a deferred frame so the stress harness
-/// buckets the resulting cDAC-vs-DAC diff at this frame as a known issue
+/// buckets the resulting cDAC-vs-runtime diff at this frame as a known issue
 /// rather than a real cDAC bug. Will be replaced with a real port once the
 /// signature- and ArgIterator-based ref enumeration lands.
 /// </remarks>

src/native/managed/cdac/tests/StressTests/BasicCdacStressTests.cs

Lines changed: 4 additions & 2 deletions
@@ -12,11 +12,13 @@ namespace Microsoft.Diagnostics.DataContractReader.Tests.GCStress;

 /// <summary>
 /// Runs each debuggee app under corerun with DOTNET_CdacStress=0x001 (ALLOC)
-/// and asserts that the cDAC stack reference verification achieves 100% pass rate.
+/// and asserts that the cDAC stack reference verification produces no
+/// `[FAIL]` results. `[KNOWN_ISSUE]` verifications (where the cDAC explicitly
+/// marks a frame as deferred via `RecordDeferredFrame`) are tolerated.
 /// </summary>
 /// <remarks>
 /// Prerequisites:
-/// - Build CoreCLR native + cDAC: build.cmd -subset clr.native+tools.cdac -c Debug -rc Checked -lc Release
+/// - Build CoreCLR + cDAC (Checked): build.cmd -subset clr.runtime+tools.cdac -c Checked
 /// - Generate core_root: src\tests\build.cmd Checked generatelayoutonly /p:LibrariesConfiguration=Release
 /// - Build debuggees: dotnet build this test project
 ///

src/native/managed/cdac/tests/StressTests/CdacStressTestBase.cs

Lines changed: 19 additions & 2 deletions
@@ -34,6 +34,7 @@ internal async Task<CdacStressResults> RunGCStressAsync(string debuggeeName, int
 string coreRoot = GetCoreRoot();
 string corerun = Path.Combine(coreRoot, OperatingSystem.IsWindows() ? "corerun.exe" : "corerun");
 Assert.True(File.Exists(corerun), $"corerun not found at '{corerun}'");
+
 string debuggeeDll = GetDebuggeePath(debuggeeName);
 // When running on Helix, write logs into HELIX_WORKITEM_UPLOAD_ROOT so
 // they're uploaded as work-item artifacts and visible via the Helix API.
@@ -123,7 +124,8 @@ internal static void AssertAllPassed(CdacStressResults results, string debuggeeN
 {
 Assert.True(results.TotalVerifications > 0,
 $"GC stress test '{debuggeeName}' produced zero verifications — " +
-"GCStress may not have triggered or cDAC may not be loaded.");
+"the cDAC stress framework may not be enabled (DOTNET_CdacStress unset, " +
+"or coreclr built without CDAC_STRESS).");

 if (results.Failed > 0)
 {
@@ -138,11 +140,26 @@ internal static void AssertAllPassed(CdacStressResults results, string debuggeeN

 private static string GetCoreRoot()
 {
-// Explicit override wins (typical in CI / when running under Helix).
+// Explicit override wins (typical when running locally with a custom layout).
 string? coreRoot = Environment.GetEnvironmentVariable("CORE_ROOT");
 if (!string.IsNullOrEmpty(coreRoot) && Directory.Exists(coreRoot))
 return coreRoot;

+// Helix layout: testhost is unpacked under HELIX_CORRELATION_PAYLOAD and
+// corerun lives in shared/Microsoft.NETCore.App/<version>/. Pick the
+// first version directory; the payload should contain exactly one.
+string? helixPayload = Environment.GetEnvironmentVariable("HELIX_CORRELATION_PAYLOAD");
+if (!string.IsNullOrEmpty(helixPayload))
+{
+string frameworkRoot = Path.Combine(helixPayload, "shared", "Microsoft.NETCore.App");
+if (Directory.Exists(frameworkRoot))
+{
+string? versionDir = Directory.EnumerateDirectories(frameworkRoot).FirstOrDefault();
+if (versionDir is not null)
+return versionDir;
+}
+}
+
 // Local fallback: derive from the repo's standard artifact layout.
 string os = OperatingSystem.IsWindows() ? "windows" : OperatingSystem.IsMacOS() ? "osx" : "linux";
 string arch = RuntimeInformation.ProcessArchitecture.ToString().ToLowerInvariant();

src/native/managed/cdac/tests/StressTests/cdac-stress-helix.proj

Lines changed: 8 additions & 2 deletions
@@ -58,11 +58,17 @@
 (matches what CdacStressTestBase.GetCoreRunPath expects).
 -->
 <Target Name="_CreateHelixWorkItems" BeforeTargets="CoreTest">
+<!--
+Run the xUnit suite directly. CdacStressTestBase.GetCoreRoot discovers
+shared/Microsoft.NETCore.App/<version>/ from HELIX_CORRELATION_PAYLOAD
+and the harness ensures corerun is executable, so no env-var setup or
+framework-version loop is needed here.
+-->
 <PropertyGroup Condition="'$(TargetOS)' == 'windows'">
-<_StressTestCommand>for /D %25%25V in (%25HELIX_CORRELATION_PAYLOAD%25\shared\Microsoft.NETCore.App\*) do (set &quot;CORE_ROOT=%25%25V&quot; &amp;&amp; %25HELIX_CORRELATION_PAYLOAD%25\dotnet.exe exec --runtimeconfig %25HELIX_WORKITEM_PAYLOAD%25\tests\Microsoft.Diagnostics.DataContractReader.StressTests.runtimeconfig.json --depsfile %25HELIX_WORKITEM_PAYLOAD%25\tests\Microsoft.Diagnostics.DataContractReader.StressTests.deps.json %25HELIX_WORKITEM_PAYLOAD%25\tests\xunit.console.dll %25HELIX_WORKITEM_PAYLOAD%25\tests\Microsoft.Diagnostics.DataContractReader.StressTests.dll -xml testResults.xml -nologo)</_StressTestCommand>
+<_StressTestCommand>%25HELIX_CORRELATION_PAYLOAD%25\dotnet.exe exec --runtimeconfig %25HELIX_WORKITEM_PAYLOAD%25\tests\Microsoft.Diagnostics.DataContractReader.StressTests.runtimeconfig.json --depsfile %25HELIX_WORKITEM_PAYLOAD%25\tests\Microsoft.Diagnostics.DataContractReader.StressTests.deps.json %25HELIX_WORKITEM_PAYLOAD%25\tests\xunit.console.dll %25HELIX_WORKITEM_PAYLOAD%25\tests\Microsoft.Diagnostics.DataContractReader.StressTests.dll -xml testResults.xml -nologo</_StressTestCommand>
 </PropertyGroup>
 <PropertyGroup Condition="'$(TargetOS)' != 'windows'">
-<_StressTestCommand>for d in $HELIX_CORRELATION_PAYLOAD/shared/Microsoft.NETCore.App/*/; do export CORE_ROOT=&quot;$d&quot;; chmod +x $CORE_ROOT/corerun; $HELIX_CORRELATION_PAYLOAD/dotnet exec --runtimeconfig $HELIX_WORKITEM_PAYLOAD/tests/Microsoft.Diagnostics.DataContractReader.StressTests.runtimeconfig.json --depsfile $HELIX_WORKITEM_PAYLOAD/tests/Microsoft.Diagnostics.DataContractReader.StressTests.deps.json $HELIX_WORKITEM_PAYLOAD/tests/xunit.console.dll $HELIX_WORKITEM_PAYLOAD/tests/Microsoft.Diagnostics.DataContractReader.StressTests.dll -xml testResults.xml -nologo; done</_StressTestCommand>
+<_StressTestCommand>$HELIX_CORRELATION_PAYLOAD/dotnet exec --runtimeconfig $HELIX_WORKITEM_PAYLOAD/tests/Microsoft.Diagnostics.DataContractReader.StressTests.runtimeconfig.json --depsfile $HELIX_WORKITEM_PAYLOAD/tests/Microsoft.Diagnostics.DataContractReader.StressTests.deps.json $HELIX_WORKITEM_PAYLOAD/tests/xunit.console.dll $HELIX_WORKITEM_PAYLOAD/tests/Microsoft.Diagnostics.DataContractReader.StressTests.dll -xml testResults.xml -nologo</_StressTestCommand>
 </PropertyGroup>

 <ItemGroup>

src/native/managed/cdac/tests/StressTests/known-issues.md

Lines changed: 35 additions & 25 deletions
@@ -16,7 +16,7 @@ When running `RunStressTests.ps1` (Checked, `DOTNET_CdacStress=0x001` =
 | `[FAIL]` | A real cDAC vs runtime discrepancy, or `GetStackReferences` failed at the API boundary. Investigate. |

 The native harness detects the deferred-frame sentinels emitted by the
-cDAC managed code and relabels per-frame diffs as `[FRAME_KNOWN_NIE]`
+cDAC managed code and relabels per-frame diffs as `[KNOWN_NIE]`
 in the structured log.

 ## Open buckets
@@ -25,40 +25,40 @@ in the structured log.

 `GcScanner.PromoteCallerStack` (in
 `src/native/managed/cdac/.../Contracts/StackWalk/GC/GcScanner.cs`)
-throws `NotImplementedException` deliberately. Producing correct
+is deliberately stubbed: instead of enumerating the caller's argument
+refs it records the frame as deferred and returns. Producing correct
 caller-argument layouts requires porting `ArgIterator` behind the
 `ICallingConvention` contract, which is a separate deferred work item.

 To prevent these deferred frames from masquerading as real cDAC bugs,
 the managed code records each deferred frame on the `GcScanContext`
 via `RecordDeferredFrame`, which emits a sentinel `StackRefData` entry
 with `GcScanFlags.CDAC_DEFERRED_FRAME` (0x40000000) set. The native
-stress harness strips these sentinels and re-classifies any DAC-only
-diff at a deferred Source address as `[FRAME_KNOWN_NIE]`, and the
+stress harness strips these sentinels and re-classifies any RT-only
+diff at a deferred Source address as `[KNOWN_NIE]`, and the
 whole verification as `[KNOWN_ISSUE]` rather than `[FAIL]`.

 Expected pattern in the log:

 ```
-[KNOWN_ISSUE] Thread=0x... IP=0x... cDAC=6 RT=7 (deferred frames: 1)
-[COMPARE cDAC-vs-RT]
-[FRAME_KNOWN_NIE] Source=0x... (<frame 0x...>): RT=1 (PromoteCallerStack deferred)
-[MATCH] All 6 refs matched
-[STACK_TRACE] (cDAC=6 RT=7)
-#0 System.AppContext.Setup(...) (cDAC=2 RT=2)
+[KNOWN_ISSUE] Thread=0x... IP=0x... cDAC=6 RT=7 frames=5 (match=4 mismatch=0 known_nie=1)
+Frame #4 <frame PrestubMethodFrame 0x...> [KNOWN_NIE] cDAC=0 RT=1 SP_cDAC=0x0 SP_RT=0x0
+[NIE(RT)] Addr=0x... Obj=0x... Flags=0x0 Reg=-1 Off=0
+[STACK_TRACE] (cDAC=6 RT=7 frames=5)
+#0 System.AppContext.Setup(...) (cDAC=2 RT=2)
 ...
-#N <frame 0x...> (cDAC=0 RT=1) <-- KNOWN_NIE
+#4 <frame PrestubMethodFrame 0x...> (cDAC=0 RT=1) <-- KNOWN_NIE (PromoteCallerStack deferred)
 ```

 Every JIT frame's count matches exactly; the only discrepancy is on
 the explicit transition Frame that `PromoteCallerStack` would scan.

 To re-enable: implement `ICallingConvention.PortableArgumentIterator`,
-then replace the `throw` in `PromoteCallerStack` with a call into the
-new contract. Once that lands, the previously-tracked
+then replace the `RecordDeferredFrame` stub in `PromoteCallerStack` with
+a call into the new contract. Once that lands, the previously-tracked
 `ELEMENT_TYPE_INTERNAL` (0x21) case in signature decoding will also
 need to be handled — that case currently isn't reachable because
-`PromoteCallerStack` short-circuits with `NotImplementedException`.
+`PromoteCallerStack` short-circuits without iterating the signature.

 ## Future work

@@ -70,18 +70,28 @@ need to be handled — that case currently isn't reachable because

 ## Log Format

-The stress log uses structured per-frame output with method-name
-resolution:
+Each verification emits a single header line followed by, on `[FAIL]` or
+`[KNOWN_ISSUE]`, a per-broken-frame block and a stack trace.

 ```
-[PASS] Thread=0x... IP=0x... cDAC=N RT=N
-[FAIL] Thread=0x... IP=0x... cDAC=N RT=M
-[COMPARE cDAC-vs-RT]
-[FRAME_DIFF] Source=0x... (MethodName): cDAC=X RT=Y
-[cDAC_ONLY] Addr=0x... Obj=0x... Flags=0x...
-[RT_ONLY] Addr=0x... Obj=0x... Flags=0x...
-[FRAME_cDAC_ONLY] Source=0x... (MethodName): cDAC=X
-[FRAME_RT_ONLY] Source=0x... (<frame 0x...>): RT=Y
-[STACK_TRACE] (cDAC=N RT=M)
+[PASS] Thread=0x... IP=0x... cDAC=N RT=N frames=N
+
+[KNOWN_ISSUE] Thread=0x... IP=0x... cDAC=N RT=M frames=N (match=N mismatch=N known_nie=N)
+Frame #i <frame TypeName 0x...> [KNOWN_NIE] cDAC=X RT=Y SP_cDAC=0x... SP_RT=0x...
+[NIE(RT)] Addr=0x... Obj=0x... Flags=0x... Reg=N Off=N
+[STACK_TRACE] (cDAC=N RT=M frames=N)
+#i MethodName (cDAC=X RT=Y)
+#i <frame TypeName 0x...> (cDAC=X RT=Y) <-- KNOWN_NIE (PromoteCallerStack deferred)
+
+[FAIL] Thread=0x... IP=0x... cDAC=N RT=M frames=N (match=N mismatch=N known_nie=N)
+Frame #i MethodName [MISMATCH] cDAC=X RT=Y SP_cDAC=0x... SP_RT=0x...
+[ONLY(cDAC)] Addr=0x... Obj=0x... Flags=0x... Reg=N Off=N
+[ONLY(RT)] Addr=0x... Obj=0x... Flags=0x... Reg=N Off=N
+[STACK_TRACE] (cDAC=N RT=M frames=N)
 #i MethodName (cDAC=X RT=Y) [<-- MISMATCH]
 ```
+
+Frames whose counts match are omitted from the per-frame block in
+concise mode; verbose mode (`DOTNET_CdacStress=0x201`) also emits the
+matched refs.
+

src/native/managed/cdac/tests/UnitTests/README.md

Lines changed: 3 additions & 3 deletions
@@ -5,11 +5,11 @@ a target process without needing a real runtime.

 For integration tests that exercise the cDAC against a real runtime, see:

-- [DumpTests](DumpTests/README.md) — validates cDAC contracts against crash dumps
+- [DumpTests](../DumpTests/README.md) — validates cDAC contracts against crash dumps
 produced by purpose-built debuggees.
-- [StressTests](StressTests/README.md) — in-process GC stress verification that
+- [StressTests](../StressTests/README.md) — in-process GC stress verification that
 compares cDAC stack-reference enumeration against the runtime's own GC root
-scanning at every GC stress trigger point.
+scanning at every wired stress trigger point (currently managed allocation).

 ## Building and running
