
radlink: perf + feature series — ICF/ICFSTATIC, GCTYPES, header-units, link-time + memory (rebased on dev, taken commits dropped)#842

Draft
honkstar1 wants to merge 40 commits into EpicGames:dev from honkstar1:radlink-pr-series


@honkstar1 honkstar1 commented Jun 22, 2026


What this is

The remaining radlink perf + feature contribution, rebased on latest dev with everything you've already taken dropped. Base: dev.

Kept as individual, reviewable, per-feature commits — cherry-pick whatever you want; nothing here is squashed and nothing here is something you already have.

40 commits, roughly:

  • ICF: /OPT:ICF identical COMDAT folding (code + read-only data), /OPT:ICFSTATIC (static/internal-linkage COMDATs), leader-keying, and a parallelized refine pipeline (persistent-worker region, dense-color SoA, reloc-weight load-balance, parallel fold-verify).
  • Type-GC: /OPT:GCTYPES (opt-in, default off) garbage-collects unreferenced CodeView types before PDB emit, using a frontier-worklist transitive closure.
  • C++ header-units: IFC debug-record resolution (0x1522) + ICF leader-keying so header-unit objects link and debug cleanly.
  • Link-time: memoize parsed COFF symbols per obj (kills the #1 hotspot), cache symbol interp (lib-search 65s→18s), skip redundant library re-searches, per-lib frontier cursor, parallelize section-contrib sort / make_code_view_input / cand_map build, pow2 mask-index hash caps, batched thread-pool wake.
  • Peak memory: release the ~1GB image buffer early, size the assigned-ti table by unique types (~3GB peak reclaim), slim/pack the parsed-symbol memo to 16B, decommit idle scratch, parallel COW-view release.
  • Diagnostics: env-gated RADLINK_PHASE_LOG per-phase micros (byte-neutral).

Output stays byte-identical to before unless the commit's whole point is to change bytes (the size / determinism work) — verified by relinking and cmp on DLL+PDB.

The cross-process shared thread pool stacks on top of this in #847 (dual-path).

honkstar1 and others added 28 commits June 24, 2026 21:56
get_cpu_features was the top main-thread hot spot (~5.6s for one Fortnite
link), 97% of it inside a single ATOMIC_LOAD(g_cpu_features). On MSVC,
blake3_dispatch.c defines ATOMIC_LOAD as _InterlockedOr(&x,0) -- a lock'd
RMW (full barrier) run on every BLAKE3 compress dispatch. The value is
written once and read-only after, so the barrier is pointless.

Enable BLAKE3's plain-load path (C11 _Atomic, a plain mov on x86) via build
flags only, leaving the vendored third_party/blake3 source untouched:
  /std:c11 /experimental:c11atomics -DBLAKE3_ATOMICS=1
Scoped to the radlink target. MSVC C11 atomics need both /std:c11 and
/experimental:c11atomics.

get_cpu_features: 5591ms -> 4ms (main thread).
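
For illustration only, a minimal C11 sketch of the two load shapes discussed above. This is not BLAKE3's or radlink's code; `g_features_sketch` and the function names are invented for the example. It assumes a write-once value that is only read afterwards.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a write-once CPU feature mask. */
static _Atomic uint32_t g_features_sketch;

/* Written once, early, before the hot path runs. */
static void features_publish(uint32_t mask) {
  assert(mask != 0);
  atomic_store(&g_features_sketch, mask);
}

/* Hot path: an atomic load, which x86 compilers emit as a plain mov. */
static uint32_t features_get(void) {
  return atomic_load(&g_features_sketch);
}

/* For contrast: OR with 0 returns the same value but is a locked
   read-modify-write on x86, the shape of the slow path described above. */
static uint32_t features_get_rmw(void) {
  return atomic_fetch_or(&g_features_sketch, 0u);
}
```

Both readers return the same value; only the cost per call differs.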

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
coff_read_symbol_name scans a cstr in the memory-mapped string table -- the
dominant, page-fault-bound cost of bulk symbol parsing. Many hot callers parse
a full symbol but only read scalar fields (value/section/storage_class/aux) to
interpret the symbol value; the name is never used.

Add name-skipping parse variants and route the interp-only paths through them:
  coff_parse_symbol{16,32}_no_name (coff_parse.c) -- and the full variants now
    call these + add the name, so the scalar logic lives in one place
  lnk_parsed_symbol_from_coff_symbol_idx_no_name (lnk_obj.c)
  lnk_interp_from_symbol / lnk_can_replace_symbol / lnk_on_symbol_replace
    (lnk_symbol_table.c) and the lnk_search_lib_task loop (lnk.c)

Where the name is still needed (lnk_search_lib) it uses the already-cached
LNK_Symbol.name instead of re-parsing. lnk_can_replace_symbol previously parsed
dst/src twice (full parse + a second parse for interp); collapsed to one
no-name parse each.

coff_parse_symbol32 on the main thread: 3922ms -> ~810ms (name-needed callers).
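
A minimal sketch of the split, assuming the classic 18-byte COFF symbol record and a little-endian host. The types and function names (`SymScalars`, `sym_parse_no_name`, `sym_parse`) are hypothetical simplifications, not radlink's real `COFF_ParsedSymbol` or parse functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified scalar view of a COFF symbol. */
typedef struct {
  uint32_t value;
  int16_t  section_number;
  uint16_t type;
  uint8_t  storage_class;
  uint8_t  aux_count;
} SymScalars;

typedef struct {
  SymScalars  s;
  const char *name;      /* points into the record or the string table */
  size_t      name_len;
} SymFull;

/* Scalar-only parse: reads bytes 8..17 of the record, never the string table. */
static SymScalars sym_parse_no_name(const uint8_t *rec) {
  SymScalars s;
  memcpy(&s.value, rec + 8, 4);
  memcpy(&s.section_number, rec + 12, 2);
  memcpy(&s.type, rec + 14, 2);
  s.storage_class = rec[16];
  s.aux_count = rec[17];
  return s;
}

/* Full parse = scalar parse + name, so the scalar logic lives in one place. */
static SymFull sym_parse(const uint8_t *rec, const char *strtab) {
  SymFull f;
  uint32_t zeroes, off;
  f.s = sym_parse_no_name(rec);
  memcpy(&zeroes, rec, 4);
  if (zeroes == 0) {                 /* long name: offset into the string table */
    memcpy(&off, rec + 4, 4);
    f.name = strtab + off;
    f.name_len = strlen(f.name);     /* the cstr scan the no-name path avoids */
  } else {                           /* short name: up to 8 bytes, NUL-padded */
    const void *z = memchr(rec, 0, 8);
    f.name = (const char *)rec;
    f.name_len = z ? (size_t)((const uint8_t *)z - rec) : 8;
  }
  return f;
}
```

Callers that only need value/section/storage class would call the no-name variant and skip the string-table access entirely.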

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnk_fixup_cv_type_indices did two open-addressing probes per type-index
reference: lnk_leaf_hash_table_search (leaf_ref -> canonical bucket), then
lnk_assigned_type_ht_search (canonical bucket -> assigned type index, via a
second hash table keyed by leaf-ref content). Both are cache-miss-bound and this
ran across every type-index reference in every obj.

Store the assigned type index directly on the leaf hash table: add a ti_arr
parallel to bucket_arr. lnk_assign_type_indices_task writes ti = min+i into the
leaf's bucket slot (each unique leaf owns a distinct slot, so worker writes
never collide), and the new lnk_leaf_hash_table_search_ti recovers it in one
probe. Removes the entire assigned_type_hts table and its build pass; deletes
the now-dead lnk_leaf_hash_table_search and lnk_assigned_type_ht_search.

Correctness: deduplicated leaves share the same ghash (debug_h value), so the
fixup query and the assign-time canonical bucket hash to the same slot.

lnk_fixup_cv_type_indices (main thread): 1445ms -> 390ms inclusive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Input obj/lib files are mapped copy-on-write (PAGE_WRITECOPY/FILE_MAP_COPY) so
the linker can patch them in place. Pages touched during linking become
private-dirty; at process exit the kernel reclaims them in single-threaded
address-space rundown -- ~3s of lingering process time after the last thread
exits for a large (Fortnite-scale) link.

After all outputs are written and inputs are no longer read (post image-write
join), unmap the whole-file CoW views in parallel on the thread pool. The same
reclaim work then runs multi-threaded, off the serial post-exit path:
measured ~34s of aggregate UnmapViewOfFile CPU collapsing to ~0.55s wall, and
the post-exit process tail dropping from ~3s to ~0.5s.

Only the is_thin whole-file views are swept (lib-member substrings and linkgen
arena data are skipped), and only in the copy-on-write (read-only) mapping mode
-- read-write-shared mapping would flush dirty pages back to the input files on
unmap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two micro-optimizations on hot parse helpers (profiled as the largest
aggregate-CPU functions in a Fortnite link):

- lnk_obj_section_from_sect_idx: split out a _no_name variant that skips the
  section-name string-table lookup (coff_name_from_section_header). The full
  variant now reuses it + adds the name. lnk_raw_directives_from_obj iterated
  every section of every obj building the full section struct just to test a
  flag, computing the name on ~all sections though only .drectve needs it -- now
  uses the no-name variant and resolves the name only inside the LnkInfo branch.

- lnk_parsed_symbol_from_coff_symbol_idx (+_no_name): return the coff_parse_*
  result directly instead of zero-initializing a local and assigning to it,
  removing a redundant ~48-byte COFF_ParsedSymbol copy + zero-init per call (RVO).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnk_leaf_hash_table_search_ti (#2 radlink hotspot, ~48% of its self-time) spent
its time in lnk_match_leaf_ref, which is just a_hash==b_hash but fetches the
bucket's hash via lnk_hash_from_leaf_ref -> input->debug_h_arr[obj].v[leaf], a
scattered cache miss per probe step.

Add LNK_LeafHashTable.hash_arr (parallel to bucket_arr), populated at bucket
claim/update (both lnk_populate_leaf_ht and lnk_leaf_dedup_task) with the leaf's
debug_h hash. search_ti now matches via hash_arr[idx] == hash -- no deref. Exact
equivalent (match is pure hash compare); same value across same-hash updates.

Gated 65/65 linker torture (ghash_basic/match_debug_t, determ_test, p2r_determinism).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The image buffer was push'd on the shared link arena and only reclaimed in the
single-threaded process rundown at exit -- a multi-second kernel page-reclaim tail
(observed: one thread 100% in-kernel, zero user frames).

Allocate it as a standalone reserve_memory/commit_memory region and
release_memory() it the instant the background image-write thread joins (image is
on disk, no later reader). VirtualFree(MEM_RELEASE) returns fast; the kernel zeroes
the ~1GB on its background thread, overlapping the parallel input-view release +
exit instead of blocking rundown. Discard early so the kernel cleans up while the
app still runs -- don't defer to exit.

Gated 65/65 linker torture (determ_test + p2r_determinism: image correct).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hotspot)

Parse each COFF symbol once in lnk_obj_initer into LNK_Obj.parsed_symbols;
lnk_parsed_symbol_from_coff_symbol_idx[_no_name] becomes an array index instead of
re-decoding the mmapped symbol table on every access (it was the #1 hotspot,
lnk_parsed_symbol_from_coff_symbol_idx). All symbol-value patch sites (weak-replace,
COMDAT-leader, regular/common fixups) write obj->parsed_symbols[idx], decoupling
symbol values from the input mapping.

Extracted from the entangled WIP commit 8dc5fa6 (parsed-symbol memo + .rgd staging);
this is the memo half only -- no .rgd. Gated separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…im ~3GB peak)

The CV type-index fixup (05af760) had stored the assigned type index in arrays parallel
to the dedup hash table (leaf_ht), whose cap is the TOTAL pre-dedup leaf count summed
over all objs. On large links that ti_arr (+ the hash_arr added for deref-free probing)
added ~3GB to peak working set vs the prior unique-sized assigned_type_ht.

Split the two concerns:
 - LNK_LeafHashTable: just {cap, bucket_arr} for dedup (total-sized, as before / unavoidable).
 - LNK_AssignedTiHash {cap, ti_arr, hash_arr}: hash -> assigned ti, sized to the UNIQUE
   (post-dedup) leaf count. Built in lnk_assign_type_indices_task by hashing each unique
   leaf into its own slot (atomic claim; unique leaves have distinct hashes since dedup is
   by hash). search_ti probes it in one deref-free pass (occupant hash stored on the slot),
   exactly as before -- just on a table sized by unique instead of total.

Keeps the single-probe fixup speed of 05af760; removes its peak-memory regression.
ti==0 marks empty (assigned ti is always >= CV_MinComplexTypeIndex). Builds clean,
95 linker torture PASS, 0 fail.
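
A rough sketch of a hash -> assigned-ti table with the occupant hash stored on the slot, so a probe never dereferences the leaf. `TiTable` and its functions are hypothetical names, not the actual `LNK_AssignedTiHash` code. The sketch is single-threaded (the commit uses an atomic claim) and assumes the table is never completely full.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: parallel arrays, ti == 0 marks an empty slot. */
typedef struct {
  uint64_t  cap;
  uint64_t *hash_arr;  /* occupant hash per slot */
  uint32_t *ti_arr;    /* assigned type index per slot */
} TiTable;

static TiTable ti_table_make(uint64_t cap) {
  TiTable t;
  assert(cap > 0);
  t.cap = cap;
  t.hash_arr = calloc(cap, sizeof *t.hash_arr);
  t.ti_arr = calloc(cap, sizeof *t.ti_arr);
  assert(t.hash_arr && t.ti_arr);
  return t;
}

/* Linear probing; each unique hash ends up owning exactly one slot. */
static void ti_table_put(TiTable *t, uint64_t hash, uint32_t ti) {
  assert(ti != 0);
  for (uint64_t i = hash % t->cap;; i = (i + 1) % t->cap) {
    if (t->ti_arr[i] == 0 || t->hash_arr[i] == hash) {
      t->hash_arr[i] = hash;
      t->ti_arr[i] = ti;
      return;
    }
  }
}

/* One pass, pure hash compare per step; returns 0 when the hash is absent. */
static uint32_t ti_table_search(const TiTable *t, uint64_t hash) {
  for (uint64_t i = hash % t->cap;; i = (i + 1) % t->cap) {
    if (t->ti_arr[i] == 0) return 0;
    if (t->hash_arr[i] == hash) return t->ti_arr[i];
  }
}
```

Sizing `cap` from the unique leaf count rather than the total is what keeps these two arrays small.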

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
LNK_Obj.parsed_symbols memoized the full COFF_ParsedSymbol per symbol -- including the
16B String8 name -- sized by total symbol count, held to exit. Store a slim
LNK_ParsedSymbolLite (every field except name, ~24B vs ~40B) and re-decode the name from
the read-only symbol record in the named accessor only. The hot _no_name path
(can_replace/GC/resolution) and all symbol-value patching never touch the name, so they
stay fully memoized; only the named/push path pays a re-decode (cold relative to total).

FN no-rrt full link: peak commit 50.5GB -> 49.1GB (-1.4GB), wall flat (~8-9s no-debug),
valid 5.68GB PDB, 95/0 linker torture.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Store the COFF symbol record as a U32 byte-offset into obj->data instead of an 8B
pointer, and value as U32 (COFF symbol value is U32). Struct goes 24B -> 16B (no
padding): raw_symbol_off(4) value(4) section_number(4) type(2) storage_class(1) aux(1).
The named/no_name accessors reconstruct the pointer as obj->data.str + off (obj->data ==
input->data, so the offset is stable). Sized by total symbol count -> 8B/sym off peak.

FN no-rrt full link: peak commit 49.1GB -> 48.4GB (-0.67GB), wall flat, valid 5.68GB
PDB, 95/0 linker torture.
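
The layout described above can be sketched as follows. The field names follow the commit text; the struct name and helper functions are hypothetical, and the size check assumes a typical ABI with 4-byte alignment for `uint32_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packed memo entry: 4 + 4 + 4 + 2 + 1 + 1 = 16 bytes. */
typedef struct {
  uint32_t raw_symbol_off;  /* byte offset of the COFF record within the obj data */
  uint32_t value;
  uint32_t section_number;
  uint16_t type;
  uint8_t  storage_class;
  uint8_t  aux;
} ParsedSymbolLite;

_Static_assert(sizeof(ParsedSymbolLite) == 16, "expected 16 bytes with no padding");

/* Store the record as an offset instead of an 8-byte pointer. */
static ParsedSymbolLite symbol_lite_make(const uint8_t *obj_data, const uint8_t *rec,
                                         uint32_t value, uint32_t section_number) {
  ParsedSymbolLite s = {0};
  assert(rec >= obj_data);
  s.raw_symbol_off = (uint32_t)(rec - obj_data);
  s.value = value;
  s.section_number = section_number;
  return s;
}

/* Reconstruct the record pointer on demand from base + offset. */
static const uint8_t *symbol_lite_record(const uint8_t *obj_data, const ParsedSymbolLite *s) {
  return obj_data + s->raw_symbol_off;
}
```

The offset stays valid as long as the base pointer refers to the same mapping the offset was computed against.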

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After type merging, prune any merged TPI/IPI record not transitively reachable
from a surviving symbol. Roots are the type indices referenced by the symbols
that survive /OPT:REF (plus inlinee call-site types); the type graph is then
closed over and everything unreached is dropped before the streams are written.
Runs only under /OPT:REF -- it is the debug-info analogue of dead-section
stripping, and is otherwise transparent (no visible type is removed).

Implementation notes:
  - parallel transitive closure (bulk-synchronous rounds, atomic mark/expand)
  - fwdref<->definition pairing via a per-unique-name ring so a live forward
    reference keeps its definition (and vice-versa)
  - compaction is in place with the remap kept in scratch, so peak memory is
    unchanged

Numbers (UnrealEditorFortnite-Engine.dll, /OPT:REF /OPT:ICF, hashing NONE):
  PDB  5315 -> 5081 MB   (type-GC alone: -234 MB)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…arallelized

Fold byte-identical COMDAT sections whose relocations point at equivalent
targets, iterated to a fixpoint, then redirect each group's followers at their
shared symbol-table node so every reference resolves to one leader and /OPT:REF
collects the now-unreferenced follower sections (and their associated
.pdata/.xdata/.debug$S). Mirrors link.exe /OPT:ICF.

  - equivalence: round-0 key from content + reloc structure + non-candidate
    target identity; refine by candidate targets' colors until the partition is
    stable; a final byte-compare + per-reloc target-color check guards folds
    (no hash-collision can produce a bad fold)
  - folds code AND read-only data (vtables, const tables, string literals);
    folding identical read-only data lets the functions that reference it fold
    too (cascade)
  - fully parallel: candidate collection (count -> exact alloc -> fill),
    content hashing + reloc-target resolution, refinement, and final grouping
    via a parallel LSD radix sort (8-bit digits). A flat open-addressing map
    with an avalanche-scrambled key avoids O(n^2) probing on UE-scale inputs.

Only externally-defined COMDATs are folded (the follower redirects through its
symbol). Static/internal-linkage folding is intentionally out of scope here.
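
As a toy illustration of the round-0 key plus refine-to-fixpoint idea (not the commit's implementation): the sketch below uses O(n^2) pairwise comparison where the commit uses hashing and a parallel radix sort. It omits the final byte-compare guard, and `Cand`, `icf_refine`, and the fixed limits are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ICF_MAX_RELOCS 2
#define ICF_MAX_CANDS  64

typedef struct {
  uint32_t content;                  /* stands in for a hash of the section bytes */
  uint32_t reloc_count;
  uint32_t target[ICF_MAX_RELOCS];   /* candidate index each reloc points at */
} Cand;

/* Same current color, same reloc count, and reloc targets of the same color. */
static int icf_same_class(const Cand *c, const uint32_t *color, uint32_t a, uint32_t b) {
  if (color[a] != color[b] || c[a].reloc_count != c[b].reloc_count) return 0;
  for (uint32_t r = 0; r < c[a].reloc_count; r++)
    if (color[c[a].target[r]] != color[c[b].target[r]]) return 0;
  return 1;
}

/* Fills color[] (a class is named by its lowest member index); returns class count. */
static uint32_t icf_refine(const Cand *c, uint32_t n, uint32_t *color) {
  uint32_t next[ICF_MAX_CANDS], classes = 0, prev;
  assert(n <= ICF_MAX_CANDS);
  /* round 0: key on content + reloc structure only */
  for (uint32_t i = 0; i < n; i++) {
    uint32_t j = 0;
    while (c[j].content != c[i].content || c[j].reloc_count != c[i].reloc_count) j++;
    color[i] = j;
    classes += (j == i);
  }
  /* refine by target colors until no class splits */
  do {
    prev = classes;
    classes = 0;
    for (uint32_t i = 0; i < n; i++) {
      uint32_t j = 0;
      while (!icf_same_class(c, color, j, i)) j++;   /* j == i always matches */
      next[i] = j;
      classes += (j == i);
    }
    memcpy(color, next, n * sizeof *color);
  } while (classes != prev);
  return classes;
}
```

Because each round only ever splits classes, an unchanged class count means the partition is stable.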

Numbers (UnrealEditorFortnite-Engine.dll, vs /OPT:NOICF, hashing NONE):
  .text  727 -> 643 MiB   (-84)
  .rdata 218 -> 194 MiB   (-24)
  PDB   5562 -> 5081 MB   (-481)
  DLL    999 -> 882 MB    (-117)
  link    33 -> 20 s      (-13; less to relocate and emit downstream)

This commit also adds the shared parallel radix-sort helper
(lnk_radix_sort_u64_pairs) used here and by the PDB GSI/PSI sort.
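
For reference, a serial sketch of an LSD radix sort over (key, value) pairs with 8-bit digits. It shows the technique only; the real `lnk_radix_sort_u64_pairs` is parallel and its signature is not shown in this PR text, so `Pair` and `sketch_radix_sort_pairs` are assumed names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint64_t key; uint64_t val; } Pair;

/* Stable LSD radix sort by key: 8 passes of one byte each, ping-ponging buffers. */
static void sketch_radix_sort_pairs(Pair *a, size_t n) {
  Pair *tmp = malloc(n * sizeof *tmp);
  Pair *src = a, *dst = tmp;
  assert(tmp || n == 0);
  for (unsigned shift = 0; shift < 64; shift += 8) {
    size_t count[256] = {0}, sum = 0;
    /* histogram of the current digit */
    for (size_t i = 0; i < n; i++) count[(src[i].key >> shift) & 0xFF]++;
    /* exclusive prefix sum turns counts into start offsets */
    for (size_t d = 0; d < 256; d++) { size_t c = count[d]; count[d] = sum; sum += c; }
    /* stable scatter into the other buffer */
    for (size_t i = 0; i < n; i++) dst[count[(src[i].key >> shift) & 0xFF]++] = src[i];
    Pair *t = src; src = dst; dst = t;
  }
  /* 8 passes is an even number of swaps, so the sorted data ends up back in a */
  free(tmp);
}
```

Stability is what makes the per-byte passes compose into a full 64-bit ordering, and it also keeps equal keys in input order.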

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ize GSI/PSI sort

DBI section contributions:
  - after sorting, merge contiguous contributions that share (section, module,
    flags), absorbing the alignment-padding gaps between them. On UE-scale
    input this collapses ~12.5M contribution records to ~2.0M and shrinks the
    DBI stream 367 -> 72 MB, with no change to the address map.

GSI/PSI publics sort:
  - the comparators got element-stable tiebreakers (sort by record offset /
    dereferenced symbol identity, not by slot pointer) so the median-of-9
    quicksort cannot degrade to O(n^2) on the large runs of equal-address /
    equal-name records that ICF now produces.
  - gsi_record_sort_by_sc returns a radix-sorted permutation index (via the
    shared lnk_radix_sort_u64_pairs) and the PSI address map is built from it,
    replacing a comparator sort that stalled multiple seconds on ICF-heavy links.

Numbers (UnrealEditorFortnite-Engine.dll):
  DBI stream  367 -> 72 MB
  removes a multi-second GSI/PSI sort stall on ICF-folded inputs
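
A hedged sketch of the contribution merge. The `Contrib` layout and the `max_gap` parameter are assumptions made for this example; the commit does not state how it bounds the padding gap it absorbs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified section contribution record. */
typedef struct {
  uint32_t section, module, flags;
  uint32_t off, size;
} Contrib;

/* In-place merge of contribs already sorted by (section, off).
   Neighbours sharing (section, module, flags) are fused when the gap
   between them is at most max_gap bytes (assumed padding bound). */
static size_t merge_contribs(Contrib *c, size_t n, uint32_t max_gap) {
  size_t out = 0;
  for (size_t i = 0; i < n; i++) {
    if (out > 0) {
      Contrib *p = &c[out - 1];
      uint32_t end = p->off + p->size;
      if (p->section == c[i].section && p->module == c[i].module &&
          p->flags == c[i].flags && c[i].off >= end && c[i].off - end <= max_gap) {
        p->size = c[i].off + c[i].size - p->off;  /* extend over gap + next contrib */
        continue;
      }
    }
    c[out++] = c[i];
  }
  return out;  /* new record count */
}
```

The merged record covers the same address range as the originals, which is why the address map is unaffected.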

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The closure re-scanned every merged type each round (O(rounds * total types)) to
find marked-but-unexpanded leaves. In a full-link trace of UnrealEditorFortnite
that round-rescan dominated the type-GC: lnk_gc_expand_task ~13.3 s of CPU.

Replace it with a frontier worklist: the atomic mark now gates a single append per
leaf, and each round expands only the slice newly marked by the previous round, so
total work is O(reachable types) instead of O(rounds * total types). Drops the
per-round `expanded` bitmap and full-array sweeps.

Output is unchanged -- same reachable set, PDB byte-identical (5081 MB on the same
input), and debugger-fidelity checks (addr->symbol 100%, core types resolve) match
the pre-change build.
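
A serial sketch of the frontier-worklist closure, assuming the type graph is given in CSR form (`edge_first`/`edges` are hypothetical names). The commit's version is parallel with an atomic mark; this only shows why the work becomes proportional to the reachable set.

```c
#include <assert.h>
#include <stdint.h>

/* Marks everything reachable from roots; returns the number of marked types.
   worklist must have room for n entries; mark must be zeroed by the caller. */
static uint32_t gc_mark_reachable(const uint32_t *edge_first, const uint32_t *edges,
                                  uint32_t n, const uint32_t *roots, uint32_t root_count,
                                  uint8_t *mark, uint32_t *worklist) {
  uint32_t count = 0, begin = 0;
  for (uint32_t r = 0; r < root_count; r++) {
    assert(roots[r] < n);
    if (!mark[roots[r]]) { mark[roots[r]] = 1; worklist[count++] = roots[r]; }
  }
  while (begin < count) {          /* one iteration == one round */
    uint32_t end = count;          /* slice newly marked by the previous round */
    for (uint32_t w = begin; w < end; w++) {
      uint32_t t = worklist[w];
      for (uint32_t e = edge_first[t]; e < edge_first[t + 1]; e++) {
        uint32_t ref = edges[e];
        assert(ref < n);
        /* the mark gates a single append per type */
        if (!mark[ref]) { mark[ref] = 1; worklist[count++] = ref; }
      }
    }
    begin = end;
  }
  return count;
}
```

Each type is appended once and expanded once, so no round ever rescans the full type array.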

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-parse

lnk_opt_icf re-parsed every candidate section serially
(lnk_coff_relocs_from_section_header per candidate) just to size the flattened
reloc-target arrays. Move the per-candidate reloc count into lnk_icf_fill_task,
which is already parallel and has the section in hand, so lnk_opt_icf only does
a cheap serial prefix sum for reloc_first. No output change (PDB/.text/.pdata
identical).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The frontier mark did an interlocked op on every reference edge. Add a plain
non-atomic check first (the mark bit only ever goes 0->1, so a stale "already
set" read is safe), so the interlocked op runs once per leaf at its 0->1
transition instead of once per edge. Output identical (PDB 5081 MB).
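
The check-then-interlock pattern, sketched with C11 atomics. `mark_once` is a hypothetical name, and a relaxed atomic load stands in for the commit's plain read, since that is the portable way to express it in C.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Returns 1 only for the caller that flips the mark 0 -> 1. */
static int mark_once(_Atomic uint8_t *mark) {
  /* cheap early-out: the bit only ever goes 0 -> 1, so "already set" is final */
  if (atomic_load_explicit(mark, memory_order_relaxed)) return 0;
  /* the interlocked op runs only near the 0 -> 1 transition */
  return atomic_exchange_explicit(mark, 1, memory_order_relaxed) == 0;
}
```

A stale read of 0 is also harmless: it just falls through to the exchange, which decides the winner.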

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Type-GC prunes CodeView types not referenced by any surviving symbol. That is a
PDB-size win, but it removes types a debugger can still legitimately cast to in the
watch window (the reachable-from-symbols set is a subset of the castable-type set) --
which is why the same approach was reverted before after users reported losing the
ability to cast in the watch. So gate it behind /OPT:GCTYPES, default OFF; only opt in
when the smaller PDB is worth the reduced castable-type set.

Numbers (UnrealEditorFortnite-Engine.dll): default PDB 5315 MB; with /OPT:GCTYPES
5081 MB (-234). LINK ok, no RelocationAgainstRemovedSection either way.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnk_sort_contribs_task ran one serial radsort per chunk. A section's contribs live
in a single chunk sized to the whole section, so the merged .text chunk (millions of
entries) was sorted serially on one worker while every other thread idled -- the
straggler that stretched the "Sort Section Contribs" phase.

Sort chunks >= 64K entries with the parallel radix sort (key = Compose64Bit(obj_idx,
obj_sect_idx), which is unique per contrib so the order matches the comparator) using
all threads, before the per-chunk task pass handles the small remainder. Output is
unchanged: section sizes identical, byte diff within the pre-existing relink noise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ICF refinement re-densified all ~N candidates every round, radix-sorting
the full set each iteration to a fixpoint. But a candidate alone in its
equivalence class can never split or merge again -- its color is final.

Track an active set of only the candidates still sharing a class with
another, and re-densify just that set each round (ids drawn from an
ever-increasing base so they never collide with the colors already
finalized for singletons). The per-round sort shrinks from all candidates
to those that still have a content+reloc twin, and converged classes drop
out as they fragment into singletons.

Output is unchanged -- relinking UnrealEditorFortnite-Engine.dll is
byte-identical to the prior ICF (same folds, same size) and reproducible
across runs. On that link the refinement loop drops from ~888ms to ~765ms
(first round prunes ~2.1M of 3.98M candidates to singletons immediately).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ved symbols

lnk_search_lib_task scans the entire search-chunk symbol set once per library
(2733 dispatches on the UE editor link). A symbol that started Undefined/Weak
stays in search_chunks even after a definition resolves it, so every later
library pass re-parsed it (lnk_ref_from_symbol +
lnk_parsed_symbol_from_coff_symbol_idx) just to recompute its interp and skip
it. That parse faults the COFF symbol record out of the mmap'd obj, and the
profile showed those two lines at ~65% of the task and a matching wall of
page-fault kernel time.

The interp is already computed once in lnk_symbol_table_push_; cache it on
LNK_Symbol and read it in the hot loop. Only genuinely Weak symbols still parse
(for the weak-extension characteristics). The hash-trie node always points at
the current leader symbol, so the cached interp reflects the resolved state.

Output unchanged (byte-identical DLL+PDB, reproducible). Lib-search dispatch
wall drops ~18% (~1.5s -> ~1.2s on the UE editor link) with a larger drop in
aggregate CPU and page-fault traffic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnk_link_inputs resolves libraries to a fixpoint: an outer pass loops over
every library, and each library is re-searched (a full tp_for_parallel over
all workers, scanning every undefined/weak symbol in search_chunks) once per
drained input batch until nothing new resolves. On the UE editor link that is
~2733 dispatches, each waking and joining ~60 workers -- and the phase is
barrier-bound, so that wake/join is the cost, not the scan.

Most re-searches are redundant: search_chunks only grows during the loop
(symbols are never removed until the end) and member-queue dedup is idempotent,
so a re-search can only queue new members if the undefined/weak symbol set grew
or anti-dep searching was just enabled since this library was last searched.
Stamp each LNK_Lib with the search_chunks symbol count + anti-dep mode at its
last search and skip the dispatch when neither changed.

~24% fewer dispatches (2733 -> ~2089) and ~0.2s of wake/join wall-time removed.
Output is byte-identical and reproducible (which dispatch coalesces is timing
dependent, but a skipped one provably queued nothing, so the result is
unchanged -- verified relink-twice byte-identical across many runs).
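
The stamp check can be sketched like this; `LibStamp` and `lib_needs_search` are hypothetical names, and the real `LNK_Lib` fields are not shown in this PR text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-library stamp of the state at its last search. */
typedef struct {
  int      searched;            /* searched at least once? */
  uint64_t stamp_symbol_count;  /* search_chunks symbol count at last search */
  int      stamp_anti_dep;      /* anti-dep mode at last search */
} LibStamp;

/* Returns 1 if the library must be searched again, updating the stamp;
   returns 0 when neither input changed, so a re-search could queue nothing. */
static int lib_needs_search(LibStamp *lib, uint64_t symbol_count, int anti_dep) {
  if (lib->searched && lib->stamp_symbol_count == symbol_count &&
      lib->stamp_anti_dep == anti_dep) {
    return 0;
  }
  lib->searched = 1;
  lib->stamp_symbol_count = symbol_count;
  lib->stamp_anti_dep = anti_dep;
  return 1;
}
```

The skip is safe only because the symbol set grows monotonically during the loop, as the commit message argues.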

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
External ICF folds a follower by redirecting its defining symbol to the
leader (Fnode->symbol = Lnode->symbol); /OPT:REF then finds the follower
unreferenced and dead-strips it along with its associated .pdata/.xdata.
Static COMDATs have no external symbol -- they're reached only by
section-relative relocs to a static symbol that resolves to themselves --
so that path can't fold them, leaving a large .text gap vs MSVC.

Add an opt-in static fold. lnk_icf_section_kind now returns candidates for
static COMDATs too; they join the same content+reloc-equivalence classes.
Leader selection prefers a non-static member so an external is never folded
into a static leader. A static follower records a per-section icf_fold map
(LNK_Obj.icf_fold: follower section -> leader obj/section) instead of a
symbol redirect.

The /OPT:REF mark-live walk consults that map: when a reference (or an
associative-section walk) reaches a folded static follower, it marks the
LEADER section live cross-obj and enqueues the leader's relocs/associated
sections instead -- so the follower dead-strips, taking its .pdata/.xdata
with it. Crucially the redirect is applied in the associated-section walk
too: folded static .text are often associative COMDATs (EH funclets/thunks)
pulled in via associated_sections[], and keeping those followers live was
the prior attempt's bug (~250K malformed .pdata + nondeterminism).
lnk_set_icf_static_leader_contribs_task then redirects folded followers'
sect_map entries to the leader's contrib so any residual reloc resolves to
the identical leader.

Gated behind /OPT:ICFSTATIC (default off): runtime-unvalidated, like
/OPT:GCTYPES. Default ICF output is byte-identical to before.

On UnrealEditorFortnite-Engine.dll with /OPT:ICFSTATIC: DLL 925 -> 844 MB
(.text 643.1 -> 594.3 MiB, .rdata 194.9 -> 169.2 MiB), output reproducible
(relink byte-identical), .pdata clean (1,740,509 records, 0 malformed,
0 out-of-order), link exits 0. No runtime validation performed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve MSVC C++ header-unit IFC debug records (LF_IFC_RECORD 0x1522) by
merging the .ifc .msvc.trait.debug-records CodeView stream and redirecting
each record to its real type -> fixes VS debugger AV stepping into header-
unit code (BAD 921->0, 44/44 MSVC name-match, live debug verified).

ICF: key non-candidate Regular COMDAT reloc targets by their resolved
COMDAT leader instead of per-obj (input_idx,section) -> folds identical
funcs referencing per-obj-dup COMDATs (.text 623->536 MiB, -82.56 MiB,
deterministic, .pdata valid).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…vor-emit), dense-color SoA+U32, reloc-weight load-balance, parallel fold-verify + IFC scan

optG: ~35s->22-23s warm on monolithic UnrealEditorFortnite-Engine link, all
determinism-verified (PDB byte-identical to fixes bar /BREPRO GUID). Stacks
on header-unit IFC + ICF-keying fixes (69ac14f).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…llel prefix-sum)

The per-round serial group-scan in lnk_icf_dense_colors_active (walk sorted
keys, assign a dense color id per distinct-key group, mark size>=2 survivors,
exclusive-prefix their emit slots) was the remaining refine-window sawtooth
bottleneck (inlined into lnk_link_image self). Replace it with a 3-phase
parallel prefix sum:

  mark  (parallel): per sorted position compute boundary (run start) and keep
                    (run size>=2 survivor) bits -- each reads only sk[k-1..k+1]
                    so it is chunk-local -- plus per-chunk local boundary/keep/
                    surviving-class counts.
  prefix (serial):  tiny exclusive prefix over the worker_count chunk totals
                    (NOT over n) -> each chunk's exclusive class-id / emit-slot
                    base; sum surviving classes.
  apply (parallel): each chunk re-derives color_at[]/out_slot[] from its base.

Determinism: the per-position values are a pure prefix of independent
per-position bits, byte-identical to the old serial running counter regardless
of how tp_divide_work splits the chunks; next_active survivors are emitted in
ascending sorted-color order via the prefix slots. The existing parallel color
scatter + survivor emit are unchanged. Verified: linked DLL byte-identical to
the optG canonical (modulo the /BREPRO GUID block), self-deterministic across
relinks and reproducible across recompiles; PDB dia_types BAD=0. A gated
ICF_SCAN_SELFCHECK build asserts the parallel scan matches the serial scan
byte-for-byte.
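
A sketch of the mark / prefix / apply structure. The chunks run one after another here, so it demonstrates the data flow and the split-independence claim rather than any speedup; all names are hypothetical. `out_slot` is only meaningful at survivor positions.

```c
#include <assert.h>
#include <stdint.h>

#define SCAN_MAX_CHUNKS 8

/* 1 if position k starts a run of equal keys. */
static uint32_t scan_is_boundary(const uint64_t *sk, uint32_t k) {
  return k == 0 || sk[k] != sk[k - 1];
}

/* 1 if position k belongs to a run of size >= 2 (a survivor). */
static uint32_t scan_is_keep(const uint64_t *sk, uint32_t n, uint32_t k) {
  return (k > 0 && sk[k] == sk[k - 1]) || (k + 1 < n && sk[k] == sk[k + 1]);
}

static void group_scan_chunked(const uint64_t *sk, uint32_t n, uint32_t chunks,
                               uint32_t *color_at, uint32_t *out_slot) {
  uint32_t bnd[SCAN_MAX_CHUNKS] = {0}, keep[SCAN_MAX_CHUNKS] = {0};
  uint32_t per, bsum = 0, ksum = 0;
  assert(chunks >= 1 && chunks <= SCAN_MAX_CHUNKS);
  per = (n + chunks - 1) / chunks;
  /* mark: per-chunk boundary/keep counts; each bit reads only sk[k-1..k+1] */
  for (uint32_t c = 0; c < chunks; c++) {
    uint32_t lo = c * per < n ? c * per : n;
    uint32_t hi = lo + per < n ? lo + per : n;
    for (uint32_t k = lo; k < hi; k++) {
      bnd[c] += scan_is_boundary(sk, k);
      keep[c] += scan_is_keep(sk, n, k);
    }
  }
  /* prefix: tiny serial exclusive scan over the chunk totals, not over n */
  for (uint32_t c = 0; c < chunks; c++) {
    uint32_t b = bnd[c], kp = keep[c];
    bnd[c] = bsum; keep[c] = ksum;
    bsum += b; ksum += kp;
  }
  /* apply: each chunk re-derives its values from its exclusive base */
  for (uint32_t c = 0; c < chunks; c++) {
    uint32_t lo = c * per < n ? c * per : n;
    uint32_t hi = lo + per < n ? lo + per : n;
    uint32_t b = bnd[c], slot = keep[c];
    for (uint32_t k = lo; k < hi; k++) {
      b += scan_is_boundary(sk, k);
      color_at[k] = b - 1;          /* dense id of the run containing k */
      out_slot[k] = slot;           /* exclusive prefix of the keep bits */
      slot += scan_is_keep(sk, n, k);
    }
  }
}
```

Running it with one chunk and with several chunks should give identical arrays, which is the determinism property the commit relies on.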

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… region

Collapse the ICF color-refine round loop (~18 rounds x ~10 tp_for_parallel
phases = ~126 fork-join cycles) into a SINGLE tp_for_parallel region of
worker_count participants (lnk_icf_refine_region_task). Every former phase
boundary becomes a barrier_wait(tp->barrier) inside the region; the tiny serial
glue (radix per-pass 256xW prefix, group-scan W-chunk prefix, convergence +
buffer swap) runs on worker 0 behind a barrier while the others wait. This kills
the thread-pool wake->work->sleep sawtooth that dominated lnk_link_image self
time without changing any task body.

Order is preserved exactly: ranges are rebuilt each round into preallocated
buffers (in-place lnk_icf_divide_by_reloc / tp_divide_work), all phase math is
byte-identical to the per-round path (refine, gather, LSD radix, scan mark/apply,
color scatter, survivor emit). Per-round scratch is preallocated once to
cand_count and reused; the radix double-buffer / pass-count / pointer swap are
driven by worker 0 across barriers. 1:1 worker<->task is guaranteed because every
participant blocks on the first barrier before any can steal a second task.

Determinism: relinked UnrealEditorFortnite-Engine.dll is byte-identical to a
freshly-built e32b662 (group-scan) DLL except the /BREPRO GUID block and the
known pre-existing offset-261 export-dir Size field. ICF fold counts unchanged
(folded 10082697 of 19045185 into 5661141 classes); dia_types BAD=0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er) atop persistent region

lnk_icf_divide_by_reloc_into: replace two O(n) random gathers (cands[active[i]].reloc_count,
cache-miss-bound, x~18 rounds = ~5.6s serial, found via no-inline trace) with O(worker_count)
count-based split. Work-split only -> output byte-identical (verified: combo == canonical, 0 diffs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
honkstar1 and others added 12 commits June 24, 2026 22:10
…re-sort, ~3-4s)

The persistent ICF refine region re-sorts the entire ~10.65M-candidate active
set every round until a round splits nothing. Measured churn (live UE Engine
link): the partition is ~99.99% stable by round ~7; the last ~11 rounds each
still pay the full ~10.65M-key sort just to resolve a few hundred cumulative
splits. That tail is ~3-4s of pure COMPUTE waste.

Run the persistent region for a bounded warm-up (8 rounds, where churn is large
and the full parallel sort wins), then hand the still-active set to a dirty-class
worklist (Hopcroft "process only what can change"):
 - Per-color member slab; the work unit is a CLASS.
 - Each round re-keys ONLY members of dirty classes against a FROZEN colors[]
   snapshot (Jacobi commit at the round barrier -- reading mid-round colors would
   over-refine past the unique coarsest stable partition and may never terminate;
   this is the trap that hung the prior arm).
 - A class can only split if its own color or a reloc target's color changed;
   after a split, enqueue the split class's referrers via a CSR reverse
   target->referrers index. Dirty set empty => fixpoint.
 - Per-class re-key uses a serial pool-free sort (classes are tiny; the prior
   arm re-entered the thread pool thousands of times per round -> hang).

Both paths compute the unique coarsest stable partition, so the final colors[]
partition is identical; ids are renamed but every fold consumer (group-by-color,
identity-keyed leader election, colors[a]==colors[b] verify) depends only on the
equivalence relation. Output DLL byte-identical to canonical (bar /BREPRO GUID).

Hang-guard: hard 40-round cap + non-progress (dirty set not shrinking) detector;
on trip, AssertAlways -> never ships a spinning binary. Gated -DICF_WORKLIST_SELFCHECK=1
build runs BOTH the worklist and the uncapped region each link and AssertAlways
the partitions are identical (verified clean across the whole UE link).

Verified on UE Editor Fortnite Engine.dll:
 - worklist tail: 10 rounds (next_dirty 14->0), handoff active=10654290
 - fold count EXACT: folded 10082697 of 19045185 into 5661141 classes
 - self-cmp GUID-only (17 bytes); vs canonical a597eb6 GUID-only (26 bytes, all
   in the debug-dir/GUID window, zero diffs elsewhere)
 - dia_types: UDTs=2559825 children=39928767 BAD=0
 - ICF_WORKLIST_SELFCHECK: partition identical to full region every link

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stacks worklist refine (1692a9d) + parallel lnk_apply_ifc_debug_records discovery
scan (766deb7e). Byte-identical to canonical; BAD=0; 0x1522->0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The worklist's CSR reverse-index (rev_adj = U32 x candidate-edge-count) costs ~10GB
peak on the UE link -- net-negative on the page-fault-bound critical path vs the
~3-4s of tail re-sorts it saves. region_cap=64 -> region converges (~19 rounds),
worklist handoff skipped, no reverse-index. Re-enable by lowering region_cap.
Size wins intact (.text 536MB), sound, links clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ng insert

lnk_opt_icf built the (input_idx,sn)->cand_idx+1 lookup serially via
lnk_icf_map_put; Superluminal showed ~642ms there, cache-miss-bound on the
scrambled-slot keys[] probe. Parallelize the insert loop over the thread pool.

Keys are UNIQUE (1:1 map) and the map is read only by key afterward
(lnk_icf_map_get, post-barrier), so insert ORDER is output-neutral. New
lnk_icf_map_put_atomic claims each empty slot with a CAS EMPTY->key
(ins_atomic_u64_eval_cond_assign); only the CAS winner writes vals[slot].
Load factor <=0.5 (lnk_icf_map_make oversizes cap>=capacity*2). Output
byte-identical to canonical bar /BREPRO GUID (35 bytes, 2 RSDS records).
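
A minimal sketch of a CAS-claimed insert into an open-addressing map, using C11 atomics in place of `ins_atomic_u64_eval_cond_assign`. `AtomicMap`, `map_scramble` (a splitmix64-style mixer chosen for the example), and the other names are assumptions. It relies on keys being unique and never equal to the empty marker.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define MAP_EMPTY UINT64_MAX   /* all-0xFF marks an unclaimed slot */

typedef struct {
  uint64_t          cap;   /* power of two, at least 2x the key count */
  _Atomic uint64_t *keys;
  uint32_t         *vals;
} AtomicMap;

/* Avalanche mixer so nearby keys land in scattered slots. */
static uint64_t map_scramble(uint64_t x) {
  x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ull;
  x ^= x >> 27; x *= 0x94d049bb133111ebull;
  return x ^ (x >> 31);
}

static AtomicMap map_make(uint64_t cap_pow2) {
  AtomicMap m;
  assert(cap_pow2 && (cap_pow2 & (cap_pow2 - 1)) == 0);
  m.cap = cap_pow2;
  m.keys = malloc(cap_pow2 * sizeof *m.keys);
  m.vals = calloc(cap_pow2, sizeof *m.vals);
  assert(m.keys && m.vals);
  for (uint64_t i = 0; i < cap_pow2; i++) atomic_init(&m.keys[i], MAP_EMPTY);
  return m;
}

/* Claim the first empty slot on the probe path with a CAS EMPTY -> key. */
static void map_put_atomic(AtomicMap *m, uint64_t key, uint32_t val) {
  assert(key != MAP_EMPTY);
  for (uint64_t i = map_scramble(key) & (m->cap - 1);; i = (i + 1) & (m->cap - 1)) {
    uint64_t expected = MAP_EMPTY;
    if (atomic_compare_exchange_strong(&m->keys[i], &expected, key)) {
      m->vals[i] = val;   /* only the CAS winner writes the value */
      return;
    }
    /* keys are unique, so a failed CAS means another key owns this slot */
  }
}

/* Read side, intended to run after all inserts have completed; 0 = not found. */
static uint32_t map_get(const AtomicMap *m, uint64_t key) {
  for (uint64_t i = map_scramble(key) & (m->cap - 1);; i = (i + 1) & (m->cap - 1)) {
    uint64_t k = atomic_load(&m->keys[i]);
    if (k == key) return m->vals[i];
    if (k == MAP_EMPTY) return 0;
  }
}
```

Which slot a key lands in can vary with thread timing, but lookups by key return the same value either way, which is why insert order does not affect output.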

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit ff632aab8f81b3de37eb2a0a77c10845ce019e4c)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…path

cand_map keys fill: LNK_ICF_EMPTY is all-0xFF -> single MemorySet vs scalar per-U64
store loop (256MB serial first-touch on the monolithic link). image_fill_task: direct
copy fast-path for single-data-node contribs (the vast majority), skipping the
list-walk + cursor on the hot 739MB image write. Byte-identical (27B: chksum+GUID).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…merge dedup probes)

leaf_ht + assigned_ti caps rounded via u64_up_to_pow2 so the bucket index is
hash & (cap-1) instead of hash % cap -- removes a 64-bit DIV from the densest
type-dedup/fixup probe loops (lnk_leaf_hash_table_search_ti, lnk_leaf_dedup_task,
lnk_hash_debug_t_task, assigned-ti pass). Byte-identical (back-to-back A/B: 29B
chksum+GUID, control 17B; the load factor stays <=~0.65).
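
The index change amounts to the following; the helper names are hypothetical (the commit's `u64_up_to_pow2` is not reproduced here), and the round-up assumes inputs no larger than 2^63.

```c
#include <assert.h>
#include <stdint.h>

/* Round up to the next power of two (returns 1 for 0 and 1). */
static uint64_t sketch_up_to_pow2(uint64_t x) {
  if (x <= 1) return 1;
  x -= 1;
  x |= x >> 1;  x |= x >> 2;  x |= x >> 4;
  x |= x >> 8;  x |= x >> 16; x |= x >> 32;
  return x + 1;
}

/* For a power-of-two cap, hash & (cap - 1) equals hash % cap without a divide. */
static uint64_t sketch_slot(uint64_t hash, uint64_t cap_pow2) {
  assert(cap_pow2 && (cap_pow2 & (cap_pow2 - 1)) == 0);
  return hash & (cap_pow2 - 1);
}
```

Rounding the cap up can only lower the load factor, at the cost of up to 2x table size.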

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… already-searched symbols)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After merge-types reaches the per-thread scratch high-water (~9GB of tctx
scratch arenas stay committed but idle through the PDB peak), release the
committed-but-unused scratch pages back to the OS before the PDB build
re-grows them. Drops recorded peak working set on the monolithic
UnrealEditorFortnite-Engine.dll link.

- arena_decommit_unused(): decommit committed pages strictly above each
  block's live pos in the active chain, and the unused bodies of free-list
  blocks (keeping the header page). Reservation kept; push path re-commits
  on demand, so reuse is transparent and output byte-identical.
- tctx_scratch_decommit(): decommit the calling thread's two equipped
  scratch arenas.
- lnk_scratch_decommit_worker + tp_for_parallel(worker_count) with an
  in-task barrier: every worker (worker 0 IS the main thread) decommits its
  own scratch exactly once between lnk_merge_types and lnk_build_pdb.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ostic, byte-neutral)

When RADLINK_PHASE_LOG is set, lnk_log_timers writes machine-parseable raw
per-phase microseconds (Image/PDB/RDI/Lib/Debug + TOTAL) to that path, for
automated perf A/B. Env-unset -> identical code path, DLL/PDB byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@honkstar1 honkstar1 force-pushed the radlink-pr-series branch from d70c7b1 to bd71560 Compare June 25, 2026 06:01
@honkstar1 honkstar1 changed the title from "radlink: link-time perf, peak-memory, /OPT:ICF[STATIC], /OPT:GCTYPES, C++ header-units" to the current title Jun 25, 2026