radlink: perf + feature series — ICF/ICFSTATIC, GCTYPES, header-units, link-time + memory (rebased on dev, taken commits dropped) by honkstar1 · Pull Request #842 · EpicGames/raddebugger

honkstar1 · 2026-06-22T05:02:35Z

What this is

The remaining radlink perf + feature contribution, rebased on latest dev with everything you've already taken dropped. Base: dev.

Kept as individual, reviewable, per-feature commits — cherry-pick whatever you want; nothing here is squashed and nothing here is something you already have.

40 commits, roughly:

ICF: /OPT:ICF identical COMDAT folding (code + read-only data), /OPT:ICFSTATIC (static/internal-linkage COMDATs), leader-keying, and a parallelized refine pipeline (persistent-worker region, dense-color SoA, reloc-weight load-balance, parallel fold-verify).
Type-GC: /OPT:GCTYPES (opt-in, default off) GCs unreferenced CodeView types before PDB emit, using a frontier-worklist transitive closure.
C++ header-units: IFC debug-record resolution (0x1522) + ICF leader-keying so header-unit objects link and debug cleanly.
Link-time: memoize parsed COFF symbols per obj (kills the #1 hotspot), cache symbol interp (lib-search ~65s -> ~18s), skip redundant library re-searches, per-lib frontier cursor, parallelize section-contrib sort / make_code_view_input / cand_map build, pow2 mask-index hash caps, batched thread-pool wake.
Peak memory: release the ~1GB image buffer early, size assigned-ti table by unique types (~3GB peak reclaim), slim/pack the parsed-symbol memo to 16B, decommit idle scratch, parallel COW-view release.
Diagnostics: env-gated RADLINK_PHASE_LOG per-phase micros (byte-neutral).

Output stays byte-identical to before unless the commit's whole point is to change bytes (the size / determinism work) — verified by relinking and cmp on DLL+PDB.

The cross-process shared thread pool stacks on top of this in #847 (dual-path).

get_cpu_features was the top main-thread hot spot (~5.6s for one Fortnite link), 97% of it inside a single ATOMIC_LOAD(g_cpu_features). On MSVC, blake3_dispatch.c defines ATOMIC_LOAD as _InterlockedOr(&x,0) -- a lock'd RMW (full barrier) run on every BLAKE3 compress dispatch. The value is written once and read-only after, so the barrier is pointless.

Enable BLAKE3's plain-load path (C11 _Atomic, a plain mov on x86) via build flags only, leaving the vendored third_party/blake3 source untouched:

    /std:c11 /experimental:c11atomics -DBLAKE3_ATOMICS=1

Scoped to the radlink target. MSVC C11 atomics need both /std:c11 and /experimental:c11atomics.

get_cpu_features: 5591ms -> 4ms (main thread).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
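To make the cost difference concrete, here is a minimal sketch in portable C11 of the two ways to read a write-once word. The names (g_features, features_publish, features_load_rmw, features_load_plain) are hypothetical and are not taken from BLAKE3 or radlink; the sketch only illustrates the pattern the commit describes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a write-once CPU feature word. */
static _Atomic uint32_t g_features;

/* Publish the detected features once, e.g. during startup. */
void features_publish(uint32_t v) {
    atomic_store_explicit(&g_features, v, memory_order_release);
}

/* Read through a read-modify-write: OR with 0 returns the current value,
   but it is still a locked RMW on x86, paid on every call. */
uint32_t features_load_rmw(void) {
    return atomic_fetch_or(&g_features, 0u);
}

/* Read through an atomic load, which is an ordinary mov on x86. */
uint32_t features_load_plain(void) {
    return atomic_load_explicit(&g_features, memory_order_acquire);
}
```

Both readers return the same value; the second one simply avoids the locked instruction on the hot path.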

coff_read_symbol_name scans a cstr in the memory-mapped string table -- the dominant, page-fault-bound cost of bulk symbol parsing. Many hot callers parse a full symbol but only read scalar fields (value/section/storage_class/aux) to interpret the symbol value; the name is never used. Add name-skipping parse variants and route the interp-only paths through them: coff_parse_symbol{16,32}_no_name (coff_parse.c) -- and the full variants now call these + add the name, so the scalar logic lives in one place lnk_parsed_symbol_from_coff_symbol_idx_no_name (lnk_obj.c) lnk_interp_from_symbol / lnk_can_replace_symbol / lnk_on_symbol_replace (lnk_symbol_table.c) and the lnk_search_lib_task loop (lnk.c) Where the name is still needed (lnk_search_lib) it uses the already-cached LNK_Symbol.name instead of re-parsing. lnk_can_replace_symbol previously parsed dst/src twice (full parse + a second parse for interp); collapsed to one no-name parse each. coff_parse_symbol32 on the main thread: 3922ms -> ~810ms (name-needed callers). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lnk_fixup_cv_type_indices did two open-addressing probes per type-index reference: lnk_leaf_hash_table_search (leaf_ref -> canonical bucket), then lnk_assigned_type_ht_search (canonical bucket -> assigned type index, via a second hash table keyed by leaf-ref content). Both are cache-miss-bound and this ran across every type-index reference in every obj. Store the assigned type index directly on the leaf hash table: add a ti_arr parallel to bucket_arr. lnk_assign_type_indices_task writes ti = min+i into the leaf's bucket slot (each unique leaf owns a distinct slot, so worker writes never collide), and the new lnk_leaf_hash_table_search_ti recovers it in one probe. Removes the entire assigned_type_hts table and its build pass; deletes the now-dead lnk_leaf_hash_table_search and lnk_assigned_type_ht_search. Correctness: deduplicated leaves share the same ghash (debug_h value), so the fixup query and the assign-time canonical bucket hash to the same slot. lnk_fixup_cv_type_indices (main thread): 1445ms -> 390ms inclusive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Input obj/lib files are mapped copy-on-write (PAGE_WRITECOPY/FILE_MAP_COPY) so the linker can patch them in place. Pages touched during linking become private-dirty; at process exit the kernel reclaims them in single-threaded address-space rundown -- ~3s of lingering process time after the last thread exits for a large (Fortnite-scale) link. After all outputs are written and inputs are no longer read (post image-write join), unmap the whole-file CoW views in parallel on the thread pool. The same reclaim work then runs multi-threaded, off the serial post-exit path: measured ~34s of aggregate UnmapViewOfFile CPU collapsing to ~0.55s wall, and the post-exit process tail dropping from ~3s to ~0.5s. Only the is_thin whole-file views are swept (lib-member substrings and linkgen arena data are skipped), and only in the copy-on-write (read-only) mapping mode -- read-write-shared mapping would flush dirty pages back to the input files on unmap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two micro-optimizations on hot parse helpers (profiled as the largest aggregate-CPU functions in a Fortnite link):

- lnk_obj_section_from_sect_idx: split out a _no_name variant that skips the section-name string-table lookup (coff_name_from_section_header). The full variant now reuses it + adds the name. lnk_raw_directives_from_obj iterated every section of every obj building the full section struct just to test a flag, computing the name on ~all sections though only .drectve needs it -- now uses the no-name variant and resolves the name only inside the LnkInfo branch.
- lnk_parsed_symbol_from_coff_symbol_idx (+_no_name): return the coff_parse_* result directly instead of zero-initializing a local and assigning to it, removing a redundant ~48-byte COFF_ParsedSymbol copy + zero-init per call (RVO).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lnk_leaf_hash_table_search_ti (roughly the #2 radlink hotspot, ~48% of its self-time) spent its time in lnk_match_leaf_ref, which is just a_hash==b_hash but fetches the bucket's hash via lnk_hash_from_leaf_ref -> input->debug_h_arr[obj].v[leaf], a scattered cache miss per probe step.

Add LNK_LeafHashTable.hash_arr (parallel to bucket_arr), populated at bucket claim/update (both lnk_populate_leaf_ht and lnk_leaf_dedup_task) with the leaf's debug_h hash. search_ti now matches via hash_arr[idx] == hash -- no deref. Exact equivalent (match is pure hash compare); same value across same-hash updates.

Gated 65/65 linker torture (ghash_basic/match_debug_t, determ_test, p2r_determinism).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The image buffer was push'd on the shared link arena and only reclaimed in the single-threaded process rundown at exit -- a multi-second kernel page-reclaim tail (observed: one thread 100% in-kernel, zero user frames). Allocate it as a standalone reserve_memory/commit_memory region and release_memory() it the instant the background image-write thread joins (image is on disk, no later reader). VirtualFree(MEM_RELEASE) returns fast; the kernel zeroes the ~1GB on its background thread, overlapping the parallel input-view release + exit instead of blocking rundown. Discard early so the kernel cleans up while the app still runs -- don't defer to exit. Gated 65/65 linker torture (determ_test + p2r_determinism: image correct). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…hotspot)

Parse each COFF symbol once in lnk_obj_initer into LNK_Obj.parsed_symbols; lnk_parsed_symbol_from_coff_symbol_idx[_no_name] becomes an array index instead of re-decoding the mmapped symbol table on every access (it was the #1 hotspot, lnk_parsed_symbol_from_coff_symbol_idx). All symbol-value patch sites (weak-replace, COMDAT-leader, regular/common fixups) write obj->parsed_symbols[idx], decoupling symbol values from the input mapping.

Extracted from the entangled WIP commit 8dc5fa6 (parsed-symbol memo + .rgd staging); this is the memo half only -- no .rgd. Gated separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…im ~3GB peak) The CV type-index fixup (05af760) had stored the assigned type index in arrays parallel to the dedup hash table (leaf_ht), whose cap is the TOTAL pre-dedup leaf count summed over all objs. On large links that ti_arr (+ the hash_arr added for deref-free probing) added ~3GB to peak working set vs the prior unique-sized assigned_type_ht. Split the two concerns: - LNK_LeafHashTable: just {cap, bucket_arr} for dedup (total-sized, as before / unavoidable). - LNK_AssignedTiHash {cap, ti_arr, hash_arr}: hash -> assigned ti, sized to the UNIQUE (post-dedup) leaf count. Built in lnk_assign_type_indices_task by hashing each unique leaf into its own slot (atomic claim; unique leaves have distinct hashes since dedup is by hash). search_ti probes it in one deref-free pass (occupant hash stored on the slot), exactly as before -- just on a table sized by unique instead of total. Keeps the single-probe fixup speed of 05af760; removes its peak-memory regression. ti==0 marks empty (assigned ti is always >= CV_MinComplexTypeIndex). Builds clean, 95 linker torture PASS, 0 fail. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
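A minimal single-threaded sketch of a hash -> assigned-ti table in the spirit of the one described above: parallel ti_arr/hash_arr arrays, ti == 0 as the empty marker, and a lookup that compares the hash stored on the slot. The type and function names (TiHash, ti_hash_make, ti_hash_insert, ti_hash_search) and the sizing policy are assumptions for illustration; the commit's version claims slots atomically from worker threads, which this sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t  cap;       /* power of two, sized from the UNIQUE leaf count */
    uint32_t *ti_arr;    /* assigned type index per slot, 0 = empty */
    uint64_t *hash_arr;  /* hash of the slot's occupant */
} TiHash;

/* Assumed sizing policy: at least 2x the unique count, rounded to pow2. */
TiHash ti_hash_make(uint64_t unique_count) {
    TiHash t;
    t.cap = 2;
    while (t.cap < unique_count * 2) t.cap <<= 1;
    t.ti_arr   = calloc(t.cap, sizeof *t.ti_arr);
    t.hash_arr = calloc(t.cap, sizeof *t.hash_arr);
    assert(t.ti_arr != NULL && t.hash_arr != NULL);
    return t;
}

/* Insert one unique leaf; hashes are assumed distinct (dedup is by hash). */
void ti_hash_insert(TiHash *t, uint64_t hash, uint32_t ti) {
    assert(ti != 0);  /* 0 is reserved as the empty marker */
    for (uint64_t i = hash & (t->cap - 1);; i = (i + 1) & (t->cap - 1)) {
        if (t->ti_arr[i] == 0) {
            t->hash_arr[i] = hash;
            t->ti_arr[i]   = ti;
            return;
        }
    }
}

/* One linear probe sequence; returns 0 when the hash is not present. */
uint32_t ti_hash_search(const TiHash *t, uint64_t hash) {
    for (uint64_t i = hash & (t->cap - 1);; i = (i + 1) & (t->cap - 1)) {
        if (t->ti_arr[i] == 0)      return 0;
        if (t->hash_arr[i] == hash) return t->ti_arr[i];
    }
}
```

Because the table is sized from the post-dedup count, its footprint tracks the number of unique leaves rather than the total leaf count.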

LNK_Obj.parsed_symbols memoized the full COFF_ParsedSymbol per symbol -- including the 16B String8 name -- sized by total symbol count, held to exit. Store a slim LNK_ParsedSymbolLite (every field except name, ~24B vs ~40B) and re-decode the name from the read-only symbol record in the named accessor only. The hot _no_name path (can_replace/GC/resolution) and all symbol-value patching never touch the name, so they stay fully memoized; only the named/push path pays a re-decode (cold relative to total). FN no-rrt full link: peak commit 50.5GB -> 49.1GB (-1.4GB), wall flat (~8-9s no-debug), valid 5.68GB PDB, 95/0 linker torture. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Store the COFF symbol record as a U32 byte-offset into obj->data instead of an 8B pointer, and value as U32 (COFF symbol value is U32). Struct goes 24B -> 16B (no padding): raw_symbol_off(4) value(4) section_number(4) type(2) storage_class(1) aux(1). The named/no_name accessors reconstruct the pointer as obj->data.str + off (obj->data == input->data, so the offset is stable). Sized by total symbol count -> 8B/sym off peak. FN no-rrt full link: peak commit 49.1GB -> 48.4GB (-0.67GB), wall flat, valid 5.68GB PDB, 95/0 linker torture. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
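The 16-byte layout listed above can be sketched as a plain C struct. The struct name ParsedSymbolLite, the helper symbol_record_ptr, and the reading of aux as an aux-symbol count are assumptions for illustration; the field order and widths follow the commit text.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t raw_symbol_off;  /* byte offset of the COFF record in the obj data */
    uint32_t value;           /* COFF symbol value is 32 bits wide */
    uint32_t section_number;
    uint16_t type;
    uint8_t  storage_class;
    uint8_t  aux;             /* assumed: number of aux symbol records */
} ParsedSymbolLite;

/* 4 + 4 + 4 + 2 + 1 + 1 = 16, and every field is naturally aligned. */
_Static_assert(sizeof(ParsedSymbolLite) == 16, "expected 16 bytes, no padding");

/* Rebuild the record pointer from the stable base of the obj data. */
const uint8_t *symbol_record_ptr(const uint8_t *obj_data,
                                 const ParsedSymbolLite *s) {
    return obj_data + s->raw_symbol_off;
}
```

Ordering the fields from widest to narrowest is what keeps the compiler from inserting padding.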

After type merging, prune any merged TPI/IPI record not transitively reachable from a surviving symbol. Roots are the type indices referenced by the symbols that survive /OPT:REF (plus inlinee call-site types); the type graph is then closed over and everything unreached is dropped before the streams are written.

Runs only under /OPT:REF -- it is the debug-info analogue of dead-section stripping, and is otherwise transparent (no visible type is removed).

Implementation notes:
- parallel transitive closure (bulk-synchronous rounds, atomic mark/expand)
- fwdref<->definition pairing via a per-unique-name ring so a live forward reference keeps its definition (and vice-versa)
- compaction is in place with the remap kept in scratch, so peak memory is unchanged

Numbers (UnrealEditorFortnite-Engine.dll, /OPT:REF /OPT:ICF, hashing NONE):
  PDB 5315 -> 5081 MB (type-GC alone: -234 MB)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…arallelized

Fold byte-identical COMDAT sections whose relocations point at equivalent targets, iterated to a fixpoint, then redirect each group's followers at their shared symbol-table node so every reference resolves to one leader and /OPT:REF collects the now-unreferenced follower sections (and their associated .pdata/.xdata/.debug$S). Mirrors link.exe /OPT:ICF.

- equivalence: round-0 key from content + reloc structure + non-candidate target identity; refine by candidate targets' colors until the partition is stable; a final byte-compare + per-reloc target-color check guards folds (no hash-collision can produce a bad fold)
- folds code AND read-only data (vtables, const tables, string literals); folding identical read-only data lets the functions that reference it fold too (cascade)
- fully parallel: candidate collection (count -> exact alloc -> fill), content hashing + reloc-target resolution, refinement, and final grouping via a parallel LSD radix sort (8-bit digits). A flat open-addressing map with an avalanche-scrambled key avoids O(n^2) probing on UE-scale inputs.

Only externally-defined COMDATs are folded (the follower redirects through its symbol). Static/internal-linkage folding is intentionally out of scope here.

Numbers (UnrealEditorFortnite-Engine.dll, vs /OPT:NOICF, hashing NONE):
  .text   727 ->  643 MiB (-84)
  .rdata  218 ->  194 MiB (-24)
  PDB    5562 -> 5081 MB  (-481)
  DLL     999 ->  882 MB  (-117)
  link     33 ->   20 s   (-13; less to relocate and emit downstream)

This commit also adds the shared parallel radix-sort helper (lnk_radix_sort_u64_pairs) used here and by the PDB GSI/PSI sort.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
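For readers unfamiliar with the sort mentioned above, this is a serial sketch of an LSD radix sort over 64-bit key/value pairs with 8-bit digits. It is not the parallel lnk_radix_sort_u64_pairs helper; the Pair type and radix_sort_pairs name are hypothetical, and the sketch only shows the per-pass count, prefix, scatter structure that such a helper would parallelize.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint64_t key; uint64_t val; } Pair;

/* Stable LSD radix sort by key: one byte per pass, 8 passes for 64 bits. */
void radix_sort_pairs(Pair *a, size_t n) {
    if (n == 0) return;
    Pair *tmp = malloc(n * sizeof *tmp);
    assert(tmp != NULL);
    Pair *src = a, *dst = tmp;
    for (unsigned shift = 0; shift < 64; shift += 8) {
        size_t count[256] = {0};
        /* Histogram of the current digit. */
        for (size_t i = 0; i < n; i++) count[(src[i].key >> shift) & 0xFF]++;
        /* Exclusive prefix sum turns counts into starting offsets. */
        size_t sum = 0;
        for (size_t d = 0; d < 256; d++) {
            size_t c = count[d];
            count[d] = sum;
            sum += c;
        }
        /* Stable scatter into the other buffer. */
        for (size_t i = 0; i < n; i++)
            dst[count[(src[i].key >> shift) & 0xFF]++] = src[i];
        Pair *t = src; src = dst; dst = t;
    }
    /* An even number of passes leaves the sorted data back in `a`. */
    free(tmp);
}
```

Stability matters here: pairs with equal keys keep their input order, which is what makes a radix-sorted grouping reproducible.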

…ize GSI/PSI sort

DBI section contributions:
- after sorting, merge contiguous contributions that share (section, module, flags), absorbing the alignment-padding gaps between them. On UE-scale input this collapses ~12.5M contribution records to ~2.0M and shrinks the DBI stream 367 -> 72 MB, with no change to the address map.

GSI/PSI publics sort:
- the comparators got element-stable tiebreakers (sort by record offset / dereferenced symbol identity, not by slot pointer) so the median-of-9 quicksort cannot degrade to O(n^2) on the large runs of equal-address / equal-name records that ICF now produces.
- gsi_record_sort_by_sc returns a radix-sorted permutation index (via the shared lnk_radix_sort_u64_pairs) and the PSI address map is built from it, replacing a comparator sort that stalled multiple seconds on ICF-heavy links.

Numbers (UnrealEditorFortnite-Engine.dll):
  DBI stream 367 -> 72 MB
  removes a multi-second GSI/PSI sort stall on ICF-folded inputs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
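The contribution merge can be sketched as a single in-place pass over a sorted array. The Contrib layout, the merge_contribs name, and the max_gap parameter are assumptions made for this example (the commit only says that alignment-padding gaps are absorbed); this is an illustration, not the PR's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t off;      /* start offset inside the section */
    uint32_t size;
    uint16_t section;
    uint16_t module;
    uint32_t flags;
} Contrib;

/* Input must be sorted by (section, off). Neighbours that share
   (section, module, flags) and are separated by at most max_gap bytes are
   merged in place, absorbing the gap. Returns the new record count. */
size_t merge_contribs(Contrib *c, size_t n, uint32_t max_gap) {
    if (n == 0) return 0;
    size_t out = 0;  /* index of the record currently being grown */
    for (size_t i = 1; i < n; i++) {
        Contrib *prev = &c[out];
        uint32_t prev_end = prev->off + prev->size;
        int same = c[i].section == prev->section &&
                   c[i].module  == prev->module  &&
                   c[i].flags   == prev->flags;
        if (same && c[i].off >= prev_end && c[i].off - prev_end <= max_gap) {
            prev->size = c[i].off + c[i].size - prev->off;  /* extend over gap */
        } else {
            c[++out] = c[i];  /* start a new output record */
        }
    }
    return out + 1;
}
```

The merged record covers the same address range as its inputs plus the padding between them, so address lookups still land in the same module.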

The closure re-scanned every merged type each round (O(rounds * total types)) to find marked-but-unexpanded leaves. In a full-link trace of UnrealEditorFortnite that round-rescan dominated the type-GC: lnk_gc_expand_task ~13.3 s of CPU. Replace it with a frontier worklist: the atomic mark now gates a single append per leaf, and each round expands only the slice newly marked by the previous round, so total work is O(reachable types) instead of O(rounds * total types). Drops the per-round `expanded` bitmap and full-array sweeps. Output is unchanged -- same reachable set, PDB byte-identical (5081 MB on the same input), and debugger-fidelity checks (addr->symbol 100%, core types resolve) match the pre-change build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
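A serial sketch of the frontier-worklist idea described above, on a graph stored in CSR form. The function name mark_reachable and the CSR arrays (first, adj) are hypothetical; the actual change is parallel and uses an atomic mark, which this sketch replaces with a plain byte array.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Edges of node v are adj[first[v] .. first[v+1]). mark[] receives 1 for
   every node reachable from the roots. Returns the number of marked nodes. */
size_t mark_reachable(const uint32_t *first, const uint32_t *adj,
                      uint32_t node_count,
                      const uint32_t *roots, size_t root_count,
                      uint8_t *mark) {
    /* Each node is appended at most once, so node_count slots suffice. */
    uint32_t *work = malloc((node_count ? node_count : 1) * sizeof *work);
    assert(work != NULL);
    size_t count = 0;
    memset(mark, 0, node_count);
    for (size_t i = 0; i < root_count; i++) {
        if (!mark[roots[i]]) { mark[roots[i]] = 1; work[count++] = roots[i]; }
    }
    size_t round_begin = 0;
    while (round_begin < count) {
        size_t round_end = count;  /* slice marked during the previous round */
        for (size_t i = round_begin; i < round_end; i++) {
            uint32_t v = work[i];
            for (uint32_t e = first[v]; e < first[v + 1]; e++) {
                uint32_t t = adj[e];
                /* The mark gates a single append per node. */
                if (!mark[t]) { mark[t] = 1; work[count++] = t; }
            }
        }
        round_begin = round_end;
    }
    free(work);
    return count;
}
```

Each round touches only the nodes marked in the previous round, so total work is proportional to the reachable nodes and their edges rather than to rounds times all nodes.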

…-parse

lnk_opt_icf re-parsed every candidate section serially (lnk_coff_relocs_from_section_header per candidate) just to size the flattened reloc-target arrays. Move the per-candidate reloc count into lnk_icf_fill_task -- which is already parallel and has the section in hand -- so lnk_opt_icf only does a cheap serial prefix sum for reloc_first. No output change (PDB/.text/.pdata identical).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The frontier mark did an interlocked op on every reference edge. Add a plain non-atomic check first (the mark bit only ever goes 0->1, so a stale "already set" read is safe), so the interlocked op runs once per leaf at its 0->1 transition instead of once per edge. Output identical (PDB 5081 MB). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
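The check-before-interlocked pattern can be sketched with C11 atomics. The name mark_once is hypothetical, and a relaxed atomic load stands in for the commit's plain non-atomic read so that the example stays free of data races in portable C.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Returns true only for the caller that flips the mark from 0 to 1. */
bool mark_once(atomic_uchar *mark) {
    /* Cheap read first: the mark only ever goes 0 -> 1, so observing 1 is
       always final and the expensive operation can be skipped. */
    if (atomic_load_explicit(mark, memory_order_relaxed) != 0) return false;
    /* Several threads may get here at once; exactly one exchange sees the
       old value 0 and wins. */
    return atomic_exchange_explicit(mark, 1, memory_order_relaxed) == 0;
}
```

A stale read of 0 only costs one extra exchange, so the fast path never produces a wrong answer.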

Type-GC prunes CodeView types not referenced by any surviving symbol. That is a PDB-size win, but it removes types a debugger can still legitimately cast to in the watch window (the reachable-from-symbols set is a subset of the castable-type set) -- which is why the same approach was reverted before after users reported losing the ability to cast in the watch. So gate it behind /OPT:GCTYPES, default OFF; only opt in when the smaller PDB is worth the reduced castable-type set. Numbers (UnrealEditorFortnite-Engine.dll): default PDB 5315 MB; with /OPT:GCTYPES 5081 MB (-234). LINK ok, no RelocationAgainstRemovedSection either way. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lnk_sort_contribs_task ran one serial radsort per chunk. A section's contribs live in a single chunk sized to the whole section, so the merged .text chunk (millions of entries) was sorted serially on one worker while every other thread idled -- the straggler that stretched the "Sort Section Contribs" phase. Sort chunks >= 64K entries with the parallel radix sort (key = Compose64Bit(obj_idx, obj_sect_idx), which is unique per contrib so the order matches the comparator) using all threads, before the per-chunk task pass handles the small remainder. Output is unchanged: section sizes identical, byte diff within the pre-existing relink noise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ICF refinement re-densified all ~N candidates every round, radix-sorting the full set each iteration to a fixpoint. But a candidate alone in its equivalence class can never split or merge again -- its color is final. Track an active set of only the candidates still sharing a class with another, and re-densify just that set each round (ids drawn from an ever-increasing base so they never collide with the colors already finalized for singletons). The per-round sort shrinks from all candidates to those that still have a content+reloc twin, and converged classes drop out as they fragment into singletons. Output is unchanged -- relinking UnrealEditorFortnite-Engine.dll is byte-identical to the prior ICF (same folds, same size) and reproducible across runs. On that link the refinement loop drops from ~888ms to ~765ms (first round prunes ~2.1M of 3.98M candidates to singletons immediately). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ved symbols

lnk_search_lib_task scans the entire search-chunk symbol set once per library (2733 dispatches on the UE editor link). A symbol that started Undefined/Weak stays in search_chunks even after a definition resolves it, so every later library pass re-parsed it -- lnk_ref_from_symbol + lnk_parsed_symbol_from_coff_symbol_idx -- just to recompute its interp and skip it. That parse faults the COFF symbol record out of the mmap'd obj, and the profile showed those two lines at ~65% of the task and a matching wall of page-fault kernel time.

The interp is already computed once in lnk_symbol_table_push_; cache it on LNK_Symbol and read it in the hot loop. Only genuinely Weak symbols still parse (for the weak-extension characteristics). The hash-trie node always points at the current leader symbol, so the cached interp reflects the resolved state.

Output unchanged (byte-identical DLL+PDB, reproducible). Lib-search dispatch wall drops ~18% (~1.5s -> ~1.2s on the UE editor link) with a larger drop in aggregate CPU and page-fault traffic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lnk_link_inputs resolves libraries to a fixpoint: an outer pass loops over every library, and each library is re-searched (a full tp_for_parallel over all workers, scanning every undefined/weak symbol in search_chunks) once per drained input batch until nothing new resolves. On the UE editor link that is ~2733 dispatches, each waking and joining ~60 workers -- and the phase is barrier-bound, so that wake/join is the cost, not the scan. Most re-searches are redundant: search_chunks only grows during the loop (symbols are never removed until the end) and member-queue dedup is idempotent, so a re-search can only queue new members if the undefined/weak symbol set grew or anti-dep searching was just enabled since this library was last searched. Stamp each LNK_Lib with the search_chunks symbol count + anti-dep mode at its last search and skip the dispatch when neither changed. ~24% fewer dispatches (2733 -> ~2089) and ~0.2s of wake/join wall-time removed. Output is byte-identical and reproducible (which dispatch coalesces is timing dependent, but a skipped one provably queued nothing, so the result is unchanged -- verified relink-twice byte-identical across many runs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
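The stamp-and-skip check amounts to a few comparisons per library. In this sketch the LibStamp struct and lib_needs_search function are hypothetical stand-ins for state the commit keeps on LNK_Lib; the real field names are not given in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     searched;           /* false until the first search */
    uint64_t last_symbol_count;  /* search_chunks symbol count at last search */
    bool     last_anti_dep;      /* anti-dep mode at last search */
} LibStamp;

/* Returns true when the library must be searched again, and refreshes the
   stamp in that case. Returns false when neither input has changed, which
   means a new search could not queue anything new. */
bool lib_needs_search(LibStamp *lib, uint64_t symbol_count, bool anti_dep) {
    bool changed = !lib->searched ||
                   symbol_count != lib->last_symbol_count ||
                   anti_dep != lib->last_anti_dep;
    if (changed) {
        lib->searched = true;
        lib->last_symbol_count = symbol_count;
        lib->last_anti_dep = anti_dep;
    }
    return changed;
}
```

The check relies on the property stated above: the symbol set only grows during the loop, so an unchanged count implies an unchanged set.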

External ICF folds a follower by redirecting its defining symbol to the leader (Fnode->symbol = Lnode->symbol); /OPT:REF then finds the follower unreferenced and dead-strips it along with its associated .pdata/.xdata. Static COMDATs have no external symbol -- they're reached only by section-relative relocs to a static symbol that resolves to themselves -- so that path can't fold them, leaving a large .text gap vs MSVC. Add an opt-in static fold. lnk_icf_section_kind now returns candidates for static COMDATs too; they join the same content+reloc-equivalence classes. Leader selection prefers a non-static member so an external is never folded into a static leader. A static follower records a per-section icf_fold map (LNK_Obj.icf_fold: follower section -> leader obj/section) instead of a symbol redirect. The /OPT:REF mark-live walk consults that map: when a reference (or an associative-section walk) reaches a folded static follower, it marks the LEADER section live cross-obj and enqueues the leader's relocs/associated sections instead -- so the follower dead-strips, taking its .pdata/.xdata with it. Crucially the redirect is applied in the associated-section walk too: folded static .text are often associative COMDATs (EH funclets/thunks) pulled in via associated_sections[], and keeping those followers live was the prior attempt's bug (~250K malformed .pdata + nondeterminism). lnk_set_icf_static_leader_contribs_task then redirects folded followers' sect_map entries to the leader's contrib so any residual reloc resolves to the identical leader. Gated behind /OPT:ICFSTATIC (default off): runtime-unvalidated, like /OPT:GCTYPES. Default ICF output is byte-identical to before. On UnrealEditorFortnite-Engine.dll with /OPT:ICFSTATIC: DLL 925 -> 844 MB (.text 643.1 -> 594.3 MiB, .rdata 194.9 -> 169.2 MiB), output reproducible (relink byte-identical), .pdata clean (1,740,509 records, 0 malformed, 0 out-of-order), link exits 0. No runtime validation performed. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Resolve MSVC C++ header-unit IFC debug records (LF_IFC_RECORD 0x1522) by merging the .ifc .msvc.trait.debug-records CodeView stream and redirecting each record to its real type -> fixes VS debugger AV stepping into header-unit code (BAD 921->0, 44/44 MSVC name-match, live debug verified).

ICF: key non-candidate Regular COMDAT reloc targets by their resolved COMDAT leader instead of per-obj (input_idx,section) -> folds identical funcs referencing per-obj-dup COMDATs (.text 623->536 MiB, -82.56 MiB, deterministic, .pdata valid).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…vor-emit), dense-color SoA+U32, reloc-weight load-balance, parallel fold-verify + IFC scan optG: ~35s->22-23s warm on monolithic UnrealEditorFortnite-Engine link, all determinism-verified (PDB byte-identical to fixes bar /BREPRO GUID). Stacks on header-unit IFC + ICF-keying fixes (69ac14f). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…llel prefix-sum)

The per-round serial group-scan in lnk_icf_dense_colors_active (walk sorted keys, assign a dense color id per distinct-key group, mark size>=2 survivors, exclusive-prefix their emit slots) was the remaining refine-window sawtooth bottleneck (inlined into lnk_link_image self). Replace it with a 3-phase parallel prefix sum:

- mark (parallel): per sorted position compute boundary (run start) and keep (run size>=2 survivor) bits -- each reads only sk[k-1..k+1] so it is chunk-local -- plus per-chunk local boundary/keep/surviving-class counts.
- prefix (serial): tiny exclusive prefix over the worker_count chunk totals (NOT over n) -> each chunk's exclusive class-id / emit-slot base; sum surviving classes.
- apply (parallel): each chunk re-derives color_at[]/out_slot[] from its base.

Determinism: the per-position values are a pure prefix of independent per-position bits, byte-identical to the old serial running counter regardless of how tp_divide_work splits the chunks; next_active survivors are emitted in ascending sorted-color order via the prefix slots. The existing parallel color scatter + survivor emit are unchanged.

Verified: linked DLL byte-identical to the optG canonical (modulo the /BREPRO GUID block), self-deterministic across relinks and reproducible across recompiles; PDB dia_types BAD=0. A gated ICF_SCAN_SELFCHECK build asserts the parallel scan matches the serial scan byte-for-byte.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
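The mark / prefix / apply split can be sketched serially by walking fixed chunks in three separate loops; each chunk loop body is what a worker would run. The sketch covers only the dense color id per sorted position (not the keep bits or emit slots), and the names dense_colors_serial, dense_colors_chunked and the CHUNKS constant are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNKS 4  /* hypothetical stand-in for worker_count */

/* A position starts a new run if it is first or differs from its left
   neighbour. Depends only on sk[k-1] and sk[k], so it is chunk-local. */
static uint32_t is_boundary(const uint64_t *sk, size_t k) {
    return (k == 0 || sk[k] != sk[k - 1]) ? 1u : 0u;
}

/* Reference: one running counter over the whole sorted key array. */
void dense_colors_serial(const uint64_t *sk, size_t n, uint32_t *color_at) {
    uint32_t color = 0;
    for (size_t k = 0; k < n; k++) {
        if (k > 0 && sk[k] != sk[k - 1]) color++;
        color_at[k] = color;
    }
}

/* Same result computed in three phases over CHUNKS independent ranges. */
void dense_colors_chunked(const uint64_t *sk, size_t n, uint32_t *color_at) {
    size_t   lo[CHUNKS + 1];
    uint32_t total[CHUNKS], base[CHUNKS];
    for (size_t c = 0; c <= CHUNKS; c++) lo[c] = n * c / CHUNKS;

    /* mark: each chunk counts its own run starts. */
    for (size_t c = 0; c < CHUNKS; c++) {
        total[c] = 0;
        for (size_t k = lo[c]; k < lo[c + 1]; k++) total[c] += is_boundary(sk, k);
    }
    /* prefix: exclusive sum over CHUNKS totals, not over n. */
    uint32_t sum = 0;
    for (size_t c = 0; c < CHUNKS; c++) { base[c] = sum; sum += total[c]; }

    /* apply: each chunk re-derives its colors from its base. */
    for (size_t c = 0; c < CHUNKS; c++) {
        uint32_t run = base[c];
        for (size_t k = lo[c]; k < lo[c + 1]; k++) {
            run += is_boundary(sk, k);
            color_at[k] = run - 1;  /* position 0 is a boundary, so run >= 1 */
        }
    }
}
```

Because every per-position bit is independent of the chunking, the chunked result matches the serial counter no matter where the chunk boundaries fall.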

… region Collapse the ICF color-refine round loop (~18 rounds x ~10 tp_for_parallel phases = ~126 fork-join cycles) into a SINGLE tp_for_parallel region of worker_count participants (lnk_icf_refine_region_task). Every former phase boundary becomes a barrier_wait(tp->barrier) inside the region; the tiny serial glue (radix per-pass 256xW prefix, group-scan W-chunk prefix, convergence + buffer swap) runs on worker 0 behind a barrier while the others wait. This kills the thread-pool wake->work->sleep sawtooth that dominated lnk_link_image self time without changing any task body. Order is preserved exactly: ranges are rebuilt each round into preallocated buffers (in-place lnk_icf_divide_by_reloc / tp_divide_work), all phase math is byte-identical to the per-round path (refine, gather, LSD radix, scan mark/apply, color scatter, survivor emit). Per-round scratch is preallocated once to cand_count and reused; the radix double-buffer / pass-count / pointer swap are driven by worker 0 across barriers. 1:1 worker<->task is guaranteed because every participant blocks on the first barrier before any can steal a second task. Determinism: relinked UnrealEditorFortnite-Engine.dll is byte-identical to a freshly-built e32b662 (group-scan) DLL except the /BREPRO GUID block and the known pre-existing offset-261 export-dir Size field. ICF fold counts unchanged (folded 10082697 of 19045185 into 5661141 classes); dia_types BAD=0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…er) atop persistent region lnk_icf_divide_by_reloc_into: replace two O(n) random gathers (cands[active[i]].reloc_count, cache-miss-bound, x~18 rounds = ~5.6s serial, found via no-inline trace) with O(worker_count) count-based split. Work-split only -> output byte-identical (verified: combo == canonical, 0 diffs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…re-sort, ~3-4s)

The persistent ICF refine region re-sorts the entire ~10.65M-candidate active set every round until a round splits nothing. Measured churn (live UE Engine link): the partition is ~99.99% stable by round ~7; the last ~11 rounds each still pay the full ~10.65M-key sort just to resolve a few hundred cumulative splits. That tail is ~3-4s of pure COMPUTE waste.

Run the persistent region for a bounded warm-up (8 rounds, where churn is large and the full parallel sort wins), then hand the still-active set to a dirty-class worklist (Hopcroft "process only what can change"):

- Per-color member slab; the work unit is a CLASS.
- Each round re-keys ONLY members of dirty classes against a FROZEN colors[] snapshot (Jacobi commit at the round barrier -- reading mid-round colors would over-refine past the unique coarsest stable partition and may never terminate; this is the trap that hung the prior arm).
- A class can only split if its own color or a reloc target's color changed; after a split, enqueue the split class's referrers via a CSR reverse target->referrers index. Dirty set empty => fixpoint.
- Per-class re-key uses a serial pool-free sort (classes are tiny; the prior arm re-entered the thread pool thousands of times per round -> hang).

Both paths compute the unique coarsest stable partition, so the final colors[] partition is identical; ids are renamed but every fold consumer (group-by-color, identity-keyed leader election, colors[a]==colors[b] verify) depends only on the equivalence relation. Output DLL byte-identical to canonical (bar /BREPRO GUID).

Hang-guard: hard 40-round cap + non-progress (dirty set not shrinking) detector; on trip, AssertAlways -> never ships a spinning binary. Gated -DICF_WORKLIST_SELFCHECK=1 build runs BOTH the worklist and the uncapped region each link and AssertAlways the partitions are identical (verified clean across the whole UE link).

Verified on UE Editor Fortnite Engine.dll:
- worklist tail: 10 rounds (next_dirty 14->0), handoff active=10654290
- fold count EXACT: folded 10082697 of 19045185 into 5661141 classes
- self-cmp GUID-only (17 bytes); vs canonical a597eb6 GUID-only (26 bytes, all in the debug-dir/GUID window, zero diffs elsewhere)
- dia_types: UDTs=2559825 children=39928767 BAD=0
- ICF_WORKLIST_SELFCHECK: partition identical to full region every link

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Stacks worklist refine (1692a9d) + parallel lnk_apply_ifc_debug_records discovery scan (766deb7e). Byte-identical to canonical; BAD=0; 0x1522->0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The worklist's CSR reverse-index (rev_adj = U32 x candidate-edge-count) costs ~10GB peak on the UE link -- net-negative on the page-fault-bound critical path vs the ~3-4s of tail re-sorts it saves. region_cap=64 -> region converges (~19 rounds), worklist handoff skipped, no reverse-index. Re-enable by lowering region_cap. Size wins intact (.text 536MB), sound, links clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ng insert lnk_opt_icf built the (input_idx,sn)->cand_idx+1 lookup serially via lnk_icf_map_put; Superluminal showed ~642ms there, cache-miss-bound on the scrambled-slot keys[] probe. Parallelize the insert loop over the thread pool. Keys are UNIQUE (1:1 map) and the map is read only by key afterward (lnk_icf_map_get, post-barrier), so insert ORDER is output-neutral. New lnk_icf_map_put_atomic claims each empty slot with a CAS EMPTY->key (ins_atomic_u64_eval_cond_assign); only the CAS winner writes vals[slot]. Load factor <=0.5 (lnk_icf_map_make oversizes cap>=capacity*2). Output byte-identical to canonical bar /BREPRO GUID (35 bytes, 2 RSDS records). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit ff632aab8f81b3de37eb2a0a77c10845ce019e4c)
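A sketch of the CAS-claim insert using C11 atomics. The Map type, function names, and the splitmix64-style scramble are assumptions for this example; the PR's lnk_icf_map_put_atomic and ins_atomic_u64_eval_cond_assign are not reproduced here. Values are stored as index+1 so that 0 can mean "not found".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define MAP_EMPTY UINT64_MAX  /* all-0xFF key marks an unclaimed slot */

typedef struct {
    uint64_t          cap;   /* power of two, at least 2x the key count */
    _Atomic uint64_t *keys;
    uint32_t         *vals;  /* written only by the thread that won the slot */
} Map;

/* Assumed scramble (splitmix64 finalizer) to spread clustered keys. */
static uint64_t scramble(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ull;
    x ^= x >> 27; x *= 0x94d049bb133111ebull;
    x ^= x >> 31;
    return x;
}

Map map_make(uint64_t key_count) {
    Map m;
    m.cap = 2;
    while (m.cap < key_count * 2) m.cap <<= 1;  /* load factor <= 0.5 */
    m.keys = malloc(m.cap * sizeof *m.keys);
    m.vals = calloc(m.cap, sizeof *m.vals);
    assert(m.keys != NULL && m.vals != NULL);
    for (uint64_t i = 0; i < m.cap; i++) atomic_init(&m.keys[i], MAP_EMPTY);
    return m;
}

/* Safe to call concurrently as long as every key is inserted once. */
void map_put_atomic(Map *m, uint64_t key, uint32_t val) {
    assert(key != MAP_EMPTY);
    for (uint64_t i = scramble(key) & (m->cap - 1);; i = (i + 1) & (m->cap - 1)) {
        uint64_t expected = MAP_EMPTY;
        if (atomic_compare_exchange_strong(&m->keys[i], &expected, key)) {
            m->vals[i] = val;  /* only the CAS winner writes the value */
            return;
        }
        /* Slot already holds a different key (keys are unique): probe on. */
    }
}

/* Call only after all inserts have finished (e.g. past a barrier). */
uint32_t map_get(const Map *m, uint64_t key) {
    for (uint64_t i = scramble(key) & (m->cap - 1);; i = (i + 1) & (m->cap - 1)) {
        uint64_t k = atomic_load(&m->keys[i]);
        if (k == key)       return m->vals[i];
        if (k == MAP_EMPTY) return 0;
    }
}
```

Which slot a key lands in can vary with thread timing, but lookups are by key only, so the insert order does not affect what readers see.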

…arse

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…path cand_map keys fill: LNK_ICF_EMPTY is all-0xFF -> single MemorySet vs scalar per-U64 store loop (256MB serial first-touch on the monolithic link). image_fill_task: direct copy fast-path for single-data-node contribs (the vast majority), skipping the list-walk + cursor on the hot 739MB image write. Byte-identical (27B: chksum+GUID). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…merge dedup probes)

leaf_ht + assigned_ti caps rounded via u64_up_to_pow2 so the bucket index is hash & (cap-1) instead of hash % cap -- removes a 64-bit DIV from the densest type-dedup/fixup probe loops (lnk_leaf_hash_table_search_ti, lnk_leaf_dedup_task, lnk_hash_debug_t_task, assigned-ti pass).

Byte-identical (back-to-back A/B: 29B chksum+GUID, control 17B; the load factor stays <=~0.65).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
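The mask-index trick works because, for a power-of-two capacity, `hash & (cap - 1)` equals `hash % cap`. A small sketch under that assumption; `up_to_pow2` and `bucket_index` are illustrative stand-ins, not the actual u64_up_to_pow2 implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Round up to the next power of two (returns 1 for 0 and 1).
   Assumes x <= 2^63; larger inputs would overflow to 0. */
static uint64_t up_to_pow2(uint64_t x) {
    if (x <= 1) return 1;
    x -= 1;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    return x + 1;
}

/* Bucket index without a division; cap must be a power of two. */
static uint64_t bucket_index(uint64_t hash, uint64_t cap) {
    assert(cap != 0 && (cap & (cap - 1)) == 0);
    return hash & (cap - 1);
}
```

Rounding the capacity up lowers the load factor, never raises it, so the cost is extra table memory rather than longer probe chains.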


After merge-types reaches the per-thread scratch high-water (~9GB of tctx scratch arenas stay committed but idle through the PDB peak), release the committed-but-unused scratch pages back to the OS before the PDB build re-grows them. Drops recorded peak working set on the monolithic UnrealEditorFortnite-Engine.dll link.

- arena_decommit_unused(): decommit committed pages strictly above each block's live pos in the active chain, and the unused bodies of free-list blocks (keeping the header page). Reservation kept; push path re-commits on demand, so reuse is transparent and output byte-identical.
- tctx_scratch_decommit(): decommit the calling thread's two equipped scratch arenas.
- lnk_scratch_decommit_worker + tp_for_parallel(worker_count) with an in-task barrier: every worker (worker 0 IS the main thread) decommits its own scratch exactly once between lnk_merge_types and lnk_build_pdb.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
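The "pages strictly above the live pos" rule comes down to page-aligned range arithmetic. The sketch below computes only that range and is platform-neutral; `DecommitRange`, `align_up`, and `unused_committed_range` are hypothetical names, and the block layout (byte offsets `pos` and `committed` from the block base) is an assumption rather than the arena's real structure.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t offset; /* first byte to decommit, relative to the block base */
    uint64_t size;   /* number of bytes to decommit (0 means nothing to do) */
} DecommitRange;

static uint64_t align_up(uint64_t x, uint64_t a) {
    assert(a != 0 && (a & (a - 1)) == 0); /* page size is a power of two */
    return (x + a - 1) & ~(a - 1);
}

/* pos:       live bytes in the block (everything below pos must stay mapped)
   committed: bytes currently committed in the block (page-aligned)
   Returns the whole pages that lie entirely above pos. */
static DecommitRange unused_committed_range(uint64_t pos, uint64_t committed,
                                            uint64_t page_size) {
    DecommitRange r;
    r.offset = align_up(pos, page_size); /* keep the page that holds pos */
    r.size = committed > r.offset ? committed - r.offset : 0;
    return r;
}
```

On Windows the resulting range would typically be handed to `VirtualFree` with `MEM_DECOMMIT`, which keeps the address reservation so a later push can commit the pages again.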


…ostic, byte-neutral)

When RADLINK_PHASE_LOG is set, lnk_log_timers writes machine-parseable raw per-phase microseconds (Image/PDB/RDI/Lib/Debug + TOTAL) to that path, for automated perf A/B. Env-unset -> identical code path, DLL/PDB byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
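An environment-gated log of this kind can be sketched as follows. The `Phase` struct, `phase_log_write`, and the one-pair-per-line "name micros" format are assumptions made for illustration; the source does not specify the file layout that lnk_log_timers actually emits.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    const char *name; /* e.g. "Image", "PDB", "TOTAL" */
    uint64_t micros;  /* raw elapsed microseconds */
} Phase;

/* Returns 0 when logging is disabled (path unset or empty),
   1 on success, -1 on I/O failure. */
static int phase_log_write(const char *path, const Phase *phases, size_t count) {
    if (path == NULL || path[0] == '\0') return 0; /* gate: do nothing */
    assert(phases != NULL || count == 0);
    FILE *f = fopen(path, "w");
    if (f == NULL) return -1;
    int ok = 1;
    for (size_t i = 0; i < count; i++) {
        /* Hypothetical format: one "name micros" pair per line. */
        if (fprintf(f, "%s %" PRIu64 "\n", phases[i].name, phases[i].micros) < 0) {
            ok = 0;
        }
    }
    if (fclose(f) != 0) ok = 0;
    return ok ? 1 : -1;
}
```

A caller would pass `getenv("RADLINK_PHASE_LOG")` (from `<stdlib.h>`) as `path`. When the variable is unset the function returns before touching the filesystem, which is what keeps the disabled path free of side effects.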

honkstar1 force-pushed the radlink-pr-series branch from 8b67cdb to d70c7b1 Compare June 22, 2026 05:20

honkstar1 mentioned this pull request Jun 24, 2026

radlink: cross-process shared thread pool — dual-path (default=upstream barrier, opt-in governor) #847

Open

honkstar1 and others added 28 commits June 24, 2026 21:56

honkstar1 and others added 12 commits June 24, 2026 22:10

radlink: consolidate ICF worklist + parallel IFC 0x1522 scan

c234f3d

Stacks worklist refine (1692a9d) + parallel lnk_apply_ifc_debug_records discovery scan (766deb7e). Byte-identical to canonical; BAD=0; 0x1522->0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

radlink ICF: parallelize apply_ifc nonblob_complete set merge + ifc p…

3cf511c

…arse

radlink perf: parallelize make_code_view_input serial setup loops

1ba8b8f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

radlink perf: per-lib frontier cursor in lib search (skip re-scanning…

580ad44

… already-searched symbols) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

radlink perf: free leaf bucket_arr probe tables before merge-types peak

20eb8a1

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

honkstar1 force-pushed the radlink-pr-series branch from d70c7b1 to bd71560 Compare June 25, 2026 06:01

honkstar1 mentioned this pull request Jun 25, 2026

radlink: reproducibility, link-time, memory, and output-size pass #839

Closed

honkstar1 changed the title ~~radlink: link-time perf, peak-memory, /OPT:ICF[STATIC], /OPT:GCTYPES, C++ header-units~~ radlink: perf + feature series — ICF/ICFSTATIC, GCTYPES, header-units, link-time + memory (rebased on dev, taken commits dropped) Jun 25, 2026

honkstar1 wants to merge 40 commits into EpicGames:dev from honkstar1:radlink-pr-series
