radlink: reproducibility, link-time, memory, and output-size pass by honkstar1 · Pull Request #839 · EpicGames/raddebugger

honkstar1 · 2026-06-19T16:33:55Z

Summary

A pass over radlink covering, roughly in order: build reproducibility, link time, peak memory, and output size (.text / .rdata / PDB). Everything is measured on UnrealEditorFortnite-Engine.dll (a large UE editor module) against a fixed set of object inputs, comparing to MSVC link.exe where relevant.

Every change keeps the linked output byte-identical unless its whole point is to change the bytes (the size and determinism commits) — verified by relinking and cmp-ing the DLL+PDB. Each commit message carries its own rationale and numbers; this is a per-area overview.

Reproducibility

Import-table determinism. Two data races in the parallel library search made radlink output non-reproducible: relinking the same inputs differed by ~2.78M bytes. Both were in the generated DLL import objs — a misindexed dedup flag (lnk_queue_lib_member used the wrong lib's member_idx) that made jump-thunk emission race, and link->imports being appended in nondeterministic discovery-round order. After the fix, relinking is byte-identical except the intentional PE timestamp + PDB GUID; with /BREPRO (timestamp 0) plus the default content-hash GUID, the DLL and PDB are bit-identical across runs.

This one commit is also up as #840 on its own, for independent review/merge ahead of the rest of this branch.

Link time

Throughput work across the hot phases, each byte-identical:

Memoize parsed COFF symbols per obj: kills the #1 link hotspot (repeated symbol re-parse).
Cache symbol interp on the symbol — the library search re-parsed (and page-faulted) every resolved symbol on every lib pass just to recompute its interp; now a cached field read. Lib-search CPU ~65s → ~18s aggregate.
Skip redundant library re-searches — the resolution fixpoint re-dispatched a full parallel search per lib even when nothing new resolved (~2733 dispatches → ~2089), removing wasted worker wake/join.
Parallelize the section-contrib sort — the merged .text chunk (millions of entries) was radix-sorted on one worker; split big chunks across all threads. Sort phase ~300ms → ~145ms.
ICF refine: skip converged classes — re-densify only the still-ambiguous candidate set each round instead of all ~4M.
Plus: O(1) symbol ref-list merge (tail pointer), batched thread-pool wake, C11 atomics so BLAKE3 skips the locked CPU-feature probe, single-probe CV type-index fixup, leaner symbol/section parse, count ICF reloc slices in the parallel fill rather than a serial re-parse.

Peak memory

Size the assigned-ti table by unique types, not total — reclaims ~3GB peak on this link.
Release the ~1GB image buffer early so reclaim overlaps the run; release copy-on-write input views in parallel before exit.
Slim + pack the parsed-symbol memo (LNK_ParsedSymbolLite to 16B, decode name on demand).

Output size

All on UnrealEditorFortnite-Engine.dll, /OPT:REF /OPT:ICF, type-name hashing NONE.

/OPT:ICF identical COMDAT folding (code + read-only data), parallelized. Fold byte-identical COMDATs whose relocations point at equivalent targets (iterated to a fixpoint), redirecting followers through the existing COMDAT symlink so /OPT:REF collects them. Folding identical read-only data lets the functions referencing it fold too (cascade).

| vs `/OPT:NOICF` | before | after | Δ |
|---|---|---|---|
| `.text` | 727 MiB | 643 MiB | −84 MiB |
| `.rdata` | 218 MiB | 194 MiB | −24 MiB |
| PDB | 5562 MB | 5081 MB | −481 MB |
| DLL | 999 MB | 882 MB | −117 MB |

Garbage-collect unreferenced CodeView types, gated behind /OPT:GCTYPES (default off). After type merging, drop any TPI/IPI record not transitively reachable from a surviving symbol (frontier-worklist closure). Default-off because the rad owner noted GC'd types can break casting in the watch window; opt-in keeps that out of the default path.

| (with `/OPT:GCTYPES`) | before | after | Δ |
|---|---|---|---|
| PDB | 5315 MB | 5081 MB | −234 MB |

Coalesce DBI section contributions; stabilize + parallelize GSI/PSI sort. Merge contiguous same-(section, module, flags) contributions; give the GSI/PSI comparators element-stable tiebreakers so the sort can't go quadratic on the equal-address/equal-name runs ICF produces.

| | before | after |
|---|---|---|
| DBI stream | 367 MB | 72 MB |
| contribution records | ~12.5M | ~2.0M |

Combined size result

| `UnrealEditorFortnite-Engine.dll` | radlink (this branch) | MSVC `link.exe` |
|---|---|---|
| PDB | 5081 MB | 5421 MB |
| DLL | 882 MB | 724 MB |
| `.text` | 643 MiB | 524 MiB |

PDB is smaller than MSVC's. The DLL is still ~158 MB larger, almost entirely .text: MSVC folds more because it also folds static / internal-linkage COMDATs — deliberately out of scope here (see below).

Also on this branch

Accept GNU ar (.a) archives as lib input: a standalone feature, also up as #836 (radlink: accept GNU ar (.a) archives as lib input).

Validation

No runtime test harness was available for the editor, so changes were validated structurally + by reproducibility:

relink-twice cmp (byte-identical DLL+PDB) gates every non-size change and proves the determinism fix;
link exits 0, no RelocationAgainstRemovedSection or similar;
.pdata scanned for malformed RUNTIME_FUNCTION records (begin/end RVAs in range) — 0 bad on every output;
output section sizes + PDB validity checked; module/section inputs confirmed identical to the MSVC reference, so size deltas are folding, not codegen;
debugger fidelity (names/types/lines) spot-checked via DIA/dbghelp against the MSVC PDB.
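The .pdata check in the list above could look roughly like the following sketch. The record layout is the standard x64 RUNTIME_FUNCTION triple of RVAs; the function name `count_bad_runtime_functions`, the `RuntimeFunction` typedef, and the use of the image size as the upper bound are illustrative assumptions, not the validator actually used for this PR.

```c
#include <stddef.h>
#include <stdint.h>

/* One x64 .pdata entry: three 32-bit RVAs. */
typedef struct {
    uint32_t begin_rva;   /* first byte of the function */
    uint32_t end_rva;     /* one past the last byte of the function */
    uint32_t unwind_rva;  /* RVA of the unwind info */
} RuntimeFunction;

/* Hypothetical checker: counts records whose code range is empty or
   inverted, or whose RVAs fall outside the image. */
size_t count_bad_runtime_functions(const RuntimeFunction *recs, size_t count,
                                   uint32_t image_size)
{
    size_t bad = 0;
    for (size_t i = 0; i < count; i++) {
        int ok = recs[i].begin_rva < recs[i].end_rva &&
                 recs[i].end_rva <= image_size &&
                 recs[i].unwind_rva < image_size;
        if (!ok) {
            bad++;
        }
    }
    return bad;
}
```

A result of 0 on every output corresponds to the "0 bad" figure quoted above.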

Out of scope / follow-ups

Static-linkage ICF (/OPT:ICFSTATIC): would close most of the remaining .text gap, but folding a static follower leaves its associated .pdata/.xdata with begin/end RVAs that don't survive the redirect, producing malformed unwind records. Left out until that's fixed and runtime-validated.
The link is increasingly page-fault bound (streaming the multi-GB mmap'd input working set, with kernel working-set-lock contention across workers). Read-only input mapping and bulk prefetch were both tried and measured as neutral-to-negative; cutting it further needs to touch less data (e.g. lazier CodeView parsing), not a flag.

🤖 Generated with Claude Code

get_cpu_features was the top main-thread hot spot (~5.6s for one Fortnite link), 97% of it inside a single ATOMIC_LOAD(g_cpu_features). On MSVC, blake3_dispatch.c defines ATOMIC_LOAD as _InterlockedOr(&x,0) -- a lock'd RMW (full barrier) run on every BLAKE3 compress dispatch. The value is written once and read-only after, so the barrier is pointless.

Enable BLAKE3's plain-load path (C11 _Atomic, a plain mov on x86) via build flags only, leaving the vendored third_party/blake3 source untouched:

    /std:c11 /experimental:c11atomics -DBLAKE3_ATOMICS=1

Scoped to the radlink target. MSVC C11 atomics need both /std:c11 and /experimental:c11atomics.

get_cpu_features: 5591ms -> 4ms (main thread).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coff_read_symbol_name scans a cstr in the memory-mapped string table -- the dominant, page-fault-bound cost of bulk symbol parsing. Many hot callers parse a full symbol but only read scalar fields (value/section/storage_class/aux) to interpret the symbol value; the name is never used. Add name-skipping parse variants and route the interp-only paths through them: coff_parse_symbol{16,32}_no_name (coff_parse.c) -- and the full variants now call these + add the name, so the scalar logic lives in one place lnk_parsed_symbol_from_coff_symbol_idx_no_name (lnk_obj.c) lnk_interp_from_symbol / lnk_can_replace_symbol / lnk_on_symbol_replace (lnk_symbol_table.c) and the lnk_search_lib_task loop (lnk.c) Where the name is still needed (lnk_search_lib) it uses the already-cached LNK_Symbol.name instead of re-parsing. lnk_can_replace_symbol previously parsed dst/src twice (full parse + a second parse for interp); collapsed to one no-name parse each. coff_parse_symbol32 on the main thread: 3922ms -> ~810ms (name-needed callers). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
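To make the split concrete, here is a minimal sketch of a name-skipping parse with the full parse layered on top. It follows the classic 18-byte COFF symbol record (8-byte name field, then value, section number, type, storage class, aux count) and assumes a little-endian host; the bigobj record with a 32-bit section number is not covered. All type and function names (`SymScalars`, `SymFull`, `parse_symbol_no_name`, `parse_symbol`) are hypothetical and are not the radlink functions named in the commit.

```c
#include <stdint.h>
#include <string.h>

/* Scalar fields only: everything the interp-only callers need. */
typedef struct {
    uint32_t value;
    int16_t  section_number;
    uint16_t type;
    uint8_t  storage_class;
    uint8_t  aux_count;
} SymScalars;

typedef struct {
    SymScalars  s;
    const char *name;      /* points into the record or the string table */
    size_t      name_len;
} SymFull;

/* Reads the fixed-offset scalars; never touches the string table. */
SymScalars parse_symbol_no_name(const uint8_t *rec)
{
    SymScalars s;
    memcpy(&s.value, rec + 8, 4);
    memcpy(&s.section_number, rec + 12, 2);
    memcpy(&s.type, rec + 14, 2);
    s.storage_class = rec[16];
    s.aux_count = rec[17];
    return s;
}

/* Full parse = scalar parse + name decode, so the scalar logic lives once. */
SymFull parse_symbol(const uint8_t *rec, const char *string_table)
{
    SymFull f;
    uint32_t zeroes;
    f.s = parse_symbol_no_name(rec);
    memcpy(&zeroes, rec, 4);
    if (zeroes == 0) {
        /* Long name: bytes 4..7 hold an offset into the string table. */
        uint32_t off;
        memcpy(&off, rec + 4, 4);
        f.name = string_table + off;
        f.name_len = strlen(f.name);  /* the scan the no-name path avoids */
    } else {
        /* Short name: up to 8 bytes inline, NUL-padded. */
        const void *nul = memchr(rec, 0, 8);
        f.name = (const char *)rec;
        f.name_len = nul ? (size_t)((const uint8_t *)nul - rec) : 8;
    }
    return f;
}
```

Callers that only need value/section/storage class use the first function and never fault in the string-table page.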

lnk_on_symbol_replace merged ref lists by walking the destination's singly linked refs list to its tail on every merge. Across repeated COMDAT merges into one accumulating leader this is O(n^2) and was 96% of the function. Add a refs_tail pointer to LNK_Symbol so the append is O(1): src->refs_tail->next = dst->refs; src->refs_tail = dst->refs_tail; maintained at all ref-list write sites (lnk_make_symbol, the null_symbol and import-stub sites in lnk.c). Order and head identity are preserved exactly, so this is a pure perf change: the head node stays the primary ref, and interior order is irrelevant (every multi-ref consumer sorts). lnk_on_symbol_replace (main thread): 1306ms -> 161ms exclusive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
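The tail-pointer idea can be sketched in isolation as follows. `Ref`, `Sym`, `sym_push_ref`, and `sym_merge_refs` are hypothetical names, and the dst/src roles here are chosen for the sketch rather than copied from lnk_on_symbol_replace; the point is only that keeping a tail pointer turns the merge into two pointer writes.

```c
#include <stddef.h>

typedef struct Ref {
    struct Ref *next;
    int         id;
} Ref;

typedef struct {
    Ref *refs;       /* head: stays the primary ref across merges */
    Ref *refs_tail;  /* last node, maintained at every write site */
} Sym;

/* Append one ref at the tail in O(1). */
void sym_push_ref(Sym *s, Ref *r)
{
    r->next = NULL;
    if (s->refs_tail) {
        s->refs_tail->next = r;
    } else {
        s->refs = r;
    }
    s->refs_tail = r;
}

/* Splice src's whole list onto the end of dst's list in O(1).
   dst's head is unchanged; src is left empty. */
void sym_merge_refs(Sym *dst, Sym *src)
{
    if (!src->refs) {
        return;
    }
    if (dst->refs_tail) {
        dst->refs_tail->next = src->refs;
    } else {
        dst->refs = src->refs;
    }
    dst->refs_tail = src->refs_tail;
    src->refs = NULL;
    src->refs_tail = NULL;
}
```

Without `refs_tail`, each merge would have to walk the accumulated list to find its end, which is where the quadratic cost came from.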

tp_for_parallel woke workers with a loop of single semaphore_drop calls -- one ReleaseSemaphore syscall per worker (twice in shared mode). The main thread spent ~3.3s in ReleaseSemaphore over a Fortnite link.

Add semaphore_drop_n(sem, count) (a single ReleaseSemaphore(h, count, 0) on Windows; a loop on POSIX) and wake all drop_count workers in one call.

Two details keep the batched release correct:

- Wake the full drop_count (NOT drop_count-1). The main thread runs as worker 0, but tasks that nest a tp_broadcast_ barrier span all workers; under-waking by one leaves that barrier a participant short and deadlocks on small dispatches.
- Give the exec/task semaphores 2x max-count headroom. A single batched release can land while up to worker_count-1 previously-woken workers have not yet re-taken their permit, so the count can transiently approach 2*worker_count; a tight max would make ReleaseSemaphore fail outright and deadlock at the next barrier.

ReleaseSemaphore (main thread): 3274ms -> 940ms.

The capped cstr length scan was a byte-by-byte loop. Switch to memchr, which is SIMD-accelerated in the CRT. Speeds up every capped-cstr scan in the codebase; notably cv_name_from_symbol (CodeView symbol-name scan during GSI build). cv_name_from_symbol (main thread): 1098ms -> 201ms. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
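A minimal sketch of the memchr-based scan, assuming a helper shaped like the one below (`cstr_length_capped` is a hypothetical name; the actual function touched by this commit is not named in the message):

```c
#include <stddef.h>
#include <string.h>

/* Length of the string at s, looking at no more than cap bytes.
   Returns cap when no NUL terminator is found within the cap. */
size_t cstr_length_capped(const char *s, size_t cap)
{
    const char *nul = memchr(s, 0, cap);
    return nul ? (size_t)(nul - s) : cap;
}
```

The behavior matches a byte-by-byte loop; the speedup comes entirely from the C runtime's vectorized memchr.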

lnk_fixup_cv_type_indices did two open-addressing probes per type-index reference: lnk_leaf_hash_table_search (leaf_ref -> canonical bucket), then lnk_assigned_type_ht_search (canonical bucket -> assigned type index, via a second hash table keyed by leaf-ref content). Both are cache-miss-bound and this ran across every type-index reference in every obj. Store the assigned type index directly on the leaf hash table: add a ti_arr parallel to bucket_arr. lnk_assign_type_indices_task writes ti = min+i into the leaf's bucket slot (each unique leaf owns a distinct slot, so worker writes never collide), and the new lnk_leaf_hash_table_search_ti recovers it in one probe. Removes the entire assigned_type_hts table and its build pass; deletes the now-dead lnk_leaf_hash_table_search and lnk_assigned_type_ht_search. Correctness: deduplicated leaves share the same ghash (debug_h value), so the fixup query and the assign-time canonical bucket hash to the same slot. lnk_fixup_cv_type_indices (main thread): 1445ms -> 390ms inclusive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Input obj/lib files are mapped copy-on-write (PAGE_WRITECOPY/FILE_MAP_COPY) so the linker can patch them in place. Pages touched during linking become private-dirty; at process exit the kernel reclaims them in single-threaded address-space rundown -- ~3s of lingering process time after the last thread exits for a large (Fortnite-scale) link. After all outputs are written and inputs are no longer read (post image-write join), unmap the whole-file CoW views in parallel on the thread pool. The same reclaim work then runs multi-threaded, off the serial post-exit path: measured ~34s of aggregate UnmapViewOfFile CPU collapsing to ~0.55s wall, and the post-exit process tail dropping from ~3s to ~0.5s. Only the is_thin whole-file views are swept (lib-member substrings and linkgen arena data are skipped), and only in the copy-on-write (read-only) mapping mode -- read-write-shared mapping would flush dirty pages back to the input files on unmap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two micro-optimizations on hot parse helpers (profiled as the largest aggregate-CPU functions in a Fortnite link):

- lnk_obj_section_from_sect_idx: split out a _no_name variant that skips the section-name string-table lookup (coff_name_from_section_header). The full variant now reuses it + adds the name. lnk_raw_directives_from_obj iterated every section of every obj building the full section struct just to test a flag, computing the name on ~all sections though only .drectve needs it -- now uses the no-name variant and resolves the name only inside the LnkInfo branch.
- lnk_parsed_symbol_from_coff_symbol_idx (+_no_name): return the coff_parse_* result directly instead of zero-initializing a local and assigning to it, removing a redundant ~48-byte COFF_ParsedSymbol copy + zero-init per call (RVO).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lnk_leaf_hash_table_search_ti (roughly the #2 radlink hotspot, ~48% of its self-time) spent its time in lnk_match_leaf_ref, which is just a_hash==b_hash but fetches the bucket's hash via lnk_hash_from_leaf_ref -> input->debug_h_arr[obj].v[leaf], a scattered cache miss per probe step. Add LNK_LeafHashTable.hash_arr (parallel to bucket_arr), populated at bucket claim/update (both lnk_populate_leaf_ht and lnk_leaf_dedup_task) with the leaf's debug_h hash. search_ti now matches via hash_arr[idx] == hash -- no deref. Exact equivalent (match is pure hash compare); same value across same-hash updates. Gated 65/65 linker torture (ghash_basic/match_debug_t, determ_test, p2r_determinism). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The image buffer was push'd on the shared link arena and only reclaimed in the single-threaded process rundown at exit -- a multi-second kernel page-reclaim tail (observed: one thread 100% in-kernel, zero user frames). Allocate it as a standalone reserve_memory/commit_memory region and release_memory() it the instant the background image-write thread joins (image is on disk, no later reader). VirtualFree(MEM_RELEASE) returns fast; the kernel zeroes the ~1GB on its background thread, overlapping the parallel input-view release + exit instead of blocking rundown. Discard early so the kernel cleans up while the app still runs -- don't defer to exit. Gated 65/65 linker torture (determ_test + p2r_determinism: image correct). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…hotspot) Parse each COFF symbol once in lnk_obj_initer into LNK_Obj.parsed_symbols; lnk_parsed_symbol_from_coff_symbol_idx[_no_name] becomes an array index instead of re-decoding the mmapped symbol table on every access (it was the #1 hotspot, lnk_parsed_symbol_from_coff_symbol_idx). All symbol-value patch sites (weak-replace, COMDAT-leader, regular/common fixups) write obj->parsed_symbols[idx], decoupling symbol values from the input mapping. Extracted from the entangled WIP commit 8dc5fa6 (parsed-symbol memo + .rgd staging); this is the memo half only -- no .rgd. Gated separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…im ~3GB peak) The CV type-index fixup (05af760) had stored the assigned type index in arrays parallel to the dedup hash table (leaf_ht), whose cap is the TOTAL pre-dedup leaf count summed over all objs. On large links that ti_arr (+ the hash_arr added for deref-free probing) added ~3GB to peak working set vs the prior unique-sized assigned_type_ht. Split the two concerns: - LNK_LeafHashTable: just {cap, bucket_arr} for dedup (total-sized, as before / unavoidable). - LNK_AssignedTiHash {cap, ti_arr, hash_arr}: hash -> assigned ti, sized to the UNIQUE (post-dedup) leaf count. Built in lnk_assign_type_indices_task by hashing each unique leaf into its own slot (atomic claim; unique leaves have distinct hashes since dedup is by hash). search_ti probes it in one deref-free pass (occupant hash stored on the slot), exactly as before -- just on a table sized by unique instead of total. Keeps the single-probe fixup speed of 05af760; removes its peak-memory regression. ti==0 marks empty (assigned ti is always >= CV_MinComplexTypeIndex). Builds clean, 95 linker torture PASS, 0 fail. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
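A single-threaded sketch of the unique-sized hash -> ti table described here. The field names follow the commit message ({cap, ti_arr, hash_arr}, ti==0 as the empty marker), but the sizing policy (power of two, at most half full), linear probing, and the function names are assumptions, and the atomic slot claim used by the parallel build is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t  cap;       /* power of two, sized from the UNIQUE leaf count */
    uint32_t *ti_arr;    /* assigned type index; 0 marks an empty slot */
    uint64_t *hash_arr;  /* occupant's hash, stored on the slot */
} AssignedTiHash;

/* Returns 1 on success, 0 on allocation failure. */
int ti_hash_init(AssignedTiHash *t, uint64_t unique_count)
{
    uint64_t cap = 16;
    while (cap < unique_count * 2) {
        cap <<= 1;  /* keep the table at most half full */
    }
    t->cap = cap;
    t->ti_arr = calloc(cap, sizeof *t->ti_arr);
    t->hash_arr = calloc(cap, sizeof *t->hash_arr);
    return t->ti_arr != NULL && t->hash_arr != NULL;
}

/* Serial insert; terminates because the table is never full. */
void ti_hash_insert(AssignedTiHash *t, uint64_t hash, uint32_t ti)
{
    assert(ti != 0);  /* 0 is reserved for "empty" */
    for (uint64_t i = hash & (t->cap - 1);; i = (i + 1) & (t->cap - 1)) {
        if (t->ti_arr[i] == 0) {
            t->hash_arr[i] = hash;
            t->ti_arr[i] = ti;
            return;
        }
    }
}

/* One probe sequence, comparing against the hash stored on the slot,
   so no pointer into the input is dereferenced. Returns 0 if absent. */
uint32_t ti_hash_search(const AssignedTiHash *t, uint64_t hash)
{
    for (uint64_t i = hash & (t->cap - 1);; i = (i + 1) & (t->cap - 1)) {
        if (t->ti_arr[i] == 0) {
            return 0;
        }
        if (t->hash_arr[i] == hash) {
            return t->ti_arr[i];
        }
    }
}
```

Because the capacity tracks the post-dedup count, the two parallel arrays no longer scale with the total number of input leaves.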

g_input_type_map mapped o/obj/lib/rlib/res/rrt but not .a, so clang/meson-built GNU ar archives (e.g. ThirdParty libdav1d.a) hit Error(002) 'unknown file format'. rlib (also GNU ar) already routes to LNK_Input_Lib and parses, so map .a the same. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

LNK_Obj.parsed_symbols memoized the full COFF_ParsedSymbol per symbol -- including the 16B String8 name -- sized by total symbol count, held to exit. Store a slim LNK_ParsedSymbolLite (every field except name, ~24B vs ~40B) and re-decode the name from the read-only symbol record in the named accessor only. The hot _no_name path (can_replace/GC/resolution) and all symbol-value patching never touch the name, so they stay fully memoized; only the named/push path pays a re-decode (cold relative to total). FN no-rrt full link: peak commit 50.5GB -> 49.1GB (-1.4GB), wall flat (~8-9s no-debug), valid 5.68GB PDB, 95/0 linker torture. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Store the COFF symbol record as a U32 byte-offset into obj->data instead of an 8B pointer, and value as U32 (COFF symbol value is U32). Struct goes 24B -> 16B (no padding): raw_symbol_off(4) value(4) section_number(4) type(2) storage_class(1) aux(1). The named/no_name accessors reconstruct the pointer as obj->data.str + off (obj->data == input->data, so the offset is stable). Sized by total symbol count -> 8B/sym off peak. FN no-rrt full link: peak commit 49.1GB -> 48.4GB (-0.67GB), wall flat, valid 5.68GB PDB, 95/0 linker torture. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
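The 16-byte layout listed above can be written down and checked at compile time. The struct below uses the field names and widths from the commit message; the typedef name and the `raw_symbol_ptr` helper are illustrative rather than copied from the radlink source.

```c
#include <stdint.h>

typedef struct {
    uint32_t raw_symbol_off;  /* byte offset of the COFF record in obj data */
    uint32_t value;           /* COFF symbol value is 32 bits wide */
    uint32_t section_number;
    uint16_t type;
    uint8_t  storage_class;
    uint8_t  aux;
} ParsedSymbolLite;

/* 4 + 4 + 4 + 2 + 1 + 1 = 16, with natural alignment and no padding. */
_Static_assert(sizeof(ParsedSymbolLite) == 16, "expected a 16-byte record");

/* Rebuild the record pointer from the stored offset. */
const uint8_t *raw_symbol_ptr(const uint8_t *obj_data, const ParsedSymbolLite *s)
{
    return obj_data + s->raw_symbol_off;
}
```

Swapping the 8-byte pointer for a 4-byte offset is what removes the padding and brings the record from 24 to 16 bytes.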

After type merging, prune any merged TPI/IPI record not transitively reachable from a surviving symbol. Roots are the type indices referenced by the symbols that survive /OPT:REF (plus inlinee call-site types); the type graph is then closed over and everything unreached is dropped before the streams are written. Runs only under /OPT:REF -- it is the debug-info analogue of dead-section stripping, and is otherwise transparent (no visible type is removed).

Implementation notes:

- parallel transitive closure (bulk-synchronous rounds, atomic mark/expand)
- fwdref<->definition pairing via a per-unique-name ring so a live forward reference keeps its definition (and vice-versa)
- compaction is in place with the remap kept in scratch, so peak memory is unchanged

Numbers (UnrealEditorFortnite-Engine.dll, /OPT:REF /OPT:ICF, hashing NONE):

PDB 5315 -> 5081 MB (type-GC alone: -234 MB)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…arallelized

Fold byte-identical COMDAT sections whose relocations point at equivalent targets, iterated to a fixpoint, then redirect each group's followers at their shared symbol-table node so every reference resolves to one leader and /OPT:REF collects the now-unreferenced follower sections (and their associated .pdata/.xdata/.debug$S). Mirrors link.exe /OPT:ICF.

- equivalence: round-0 key from content + reloc structure + non-candidate target identity; refine by candidate targets' colors until the partition is stable; a final byte-compare + per-reloc target-color check guards folds (no hash-collision can produce a bad fold)
- folds code AND read-only data (vtables, const tables, string literals); folding identical read-only data lets the functions that reference it fold too (cascade)
- fully parallel: candidate collection (count -> exact alloc -> fill), content hashing + reloc-target resolution, refinement, and final grouping via a parallel LSD radix sort (8-bit digits). A flat open-addressing map with an avalanche-scrambled key avoids O(n^2) probing on UE-scale inputs.

Only externally-defined COMDATs are folded (the follower redirects through its symbol). Static/internal-linkage folding is intentionally out of scope here.

Numbers (UnrealEditorFortnite-Engine.dll, vs /OPT:NOICF, hashing NONE):

.text 727 -> 643 MiB (-84)
.rdata 218 -> 194 MiB (-24)
PDB 5562 -> 5081 MB (-481)
DLL 999 -> 882 MB (-117)
link 33 -> 20 s (-13; less to relocate and emit downstream)

This commit also adds the shared parallel radix-sort helper (lnk_radix_sort_u64_pairs) used here and by the PDB GSI/PSI sort.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ize GSI/PSI sort

DBI section contributions:

- after sorting, merge contiguous contributions that share (section, module, flags), absorbing the alignment-padding gaps between them. On UE-scale input this collapses ~12.5M contribution records to ~2.0M and shrinks the DBI stream 367 -> 72 MB, with no change to the address map.

GSI/PSI publics sort:

- the comparators got element-stable tiebreakers (sort by record offset / dereferenced symbol identity, not by slot pointer) so the median-of-9 quicksort cannot degrade to O(n^2) on the large runs of equal-address / equal-name records that ICF now produces.
- gsi_record_sort_by_sc returns a radix-sorted permutation index (via the shared lnk_radix_sort_u64_pairs) and the PSI address map is built from it, replacing a comparator sort that stalled multiple seconds on ICF-heavy links.

Numbers (UnrealEditorFortnite-Engine.dll):

DBI stream 367 -> 72 MB
removes a multi-second GSI/PSI sort stall on ICF-folded inputs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
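The contribution merge can be sketched as an in-place pass over the sorted array. The `Contrib` layout, the function name, and the `max_gap` parameter (standing in for "this gap is only alignment padding") are assumptions made for illustration; the commit does not spell out how padding gaps are recognized.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t section;
    uint16_t module;
    uint32_t flags;
    uint32_t offset;  /* offset within the section */
    uint32_t size;
} Contrib;

/* Merges runs of neighbors that share (section, module, flags) and are
   separated by at most max_gap bytes. Input must already be sorted by
   (section, offset). Returns the new record count. */
size_t coalesce_contribs(Contrib *c, size_t count, uint32_t max_gap)
{
    size_t out = 0;
    if (count == 0) {
        return 0;
    }
    for (size_t i = 1; i < count; i++) {
        Contrib *prev = &c[out];
        uint32_t prev_end = prev->offset + prev->size;
        int same_key = c[i].section == prev->section &&
                       c[i].module == prev->module &&
                       c[i].flags == prev->flags;
        if (same_key && c[i].offset >= prev_end &&
            c[i].offset - prev_end <= max_gap) {
            /* Extend prev over the gap and the next record. */
            prev->size = c[i].offset + c[i].size - prev->offset;
        } else {
            c[++out] = c[i];
        }
    }
    return out + 1;
}
```

Every address still maps to the same (section, module) pair after the merge, which is why the address map is unaffected.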

The closure re-scanned every merged type each round (O(rounds * total types)) to find marked-but-unexpanded leaves. In a full-link trace of UnrealEditorFortnite that round-rescan dominated the type-GC: lnk_gc_expand_task ~13.3 s of CPU. Replace it with a frontier worklist: the atomic mark now gates a single append per leaf, and each round expands only the slice newly marked by the previous round, so total work is O(reachable types) instead of O(rounds * total types). Drops the per-round `expanded` bitmap and full-array sweeps. Output is unchanged -- same reachable set, PDB byte-identical (5081 MB on the same input), and debugger-fidelity checks (addr->symbol 100%, core types resolve) match the pre-change build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
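A serial sketch of the frontier-worklist closure, assuming the type graph is given in compressed adjacency form (`edge_first` has node_count + 1 entries; `edges` holds the referenced node ids). The representation and the name `mark_reachable` are assumptions, and the real pass is parallel with an atomic mark; this only shows why total work is proportional to the reachable set.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Marks every node reachable from roots; returns how many were marked.
   marks must have node_count entries and is fully overwritten. */
size_t mark_reachable(uint32_t node_count, const uint32_t *edge_first,
                      const uint32_t *edges, const uint32_t *roots,
                      size_t root_count, uint8_t *marks)
{
    uint32_t *worklist = malloc((node_count ? node_count : 1) * sizeof *worklist);
    size_t head = 0, tail = 0;
    if (!worklist) {
        return 0;
    }
    for (uint32_t i = 0; i < node_count; i++) {
        marks[i] = 0;
    }
    for (size_t i = 0; i < root_count; i++) {
        if (!marks[roots[i]]) {
            marks[roots[i]] = 1;          /* the mark gates a single append */
            worklist[tail++] = roots[i];
        }
    }
    while (head < tail) {
        size_t round_end = tail;          /* frontier = slice from last round */
        for (; head < round_end; head++) {
            uint32_t n = worklist[head];
            for (uint32_t e = edge_first[n]; e < edge_first[n + 1]; e++) {
                uint32_t target = edges[e];
                if (!marks[target]) {
                    marks[target] = 1;
                    worklist[tail++] = target;
                }
            }
        }
    }
    free(worklist);
    return tail;  /* each reachable node was appended exactly once */
}
```

No round ever sweeps the full node array, so unreachable types cost nothing beyond the initial clear.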

…-parse lnk_opt_icf re-parsed every candidate section serially (lnk_coff_relocs_from_section_header per candidate) just to size the flattened reloc-target arrays. Move the per-candidate reloc count into lnk_icf_fill_task -- which is already parallel and has the section in hand -- so lnk_opt_icf only does a cheap serial prefix sum for reloc_first. No output change (PDB/.text/.pdata identical). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The frontier mark did an interlocked op on every reference edge. Add a plain non-atomic check first (the mark bit only ever goes 0->1, so a stale "already set" read is safe), so the interlocked op runs once per leaf at its 0->1 transition instead of once per edge. Output identical (PDB 5081 MB). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
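The check-before-interlocked pattern, sketched with C11 atomics. The commit describes a plain non-atomic read; this sketch substitutes a relaxed atomic load, which compiles to an ordinary load on x86 while staying well-defined in portable C. `mark_once` and the bitmap layout are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Sets bit idx in a shared bitmap. Returns true only for the one caller
   that performs the 0 -> 1 transition, so that caller alone appends the
   leaf to the frontier. */
bool mark_once(_Atomic uint32_t *words, uint32_t idx)
{
    _Atomic uint32_t *w = &words[idx / 32];
    uint32_t bit = 1u << (idx % 32);

    /* Cheap path: the bit only ever goes 0 -> 1, so seeing it set is
       always a valid reason to skip the read-modify-write. */
    if (atomic_load_explicit(w, memory_order_relaxed) & bit) {
        return false;
    }

    /* Slow path: the fetch_or decides the winner if several threads race
       past the check above. */
    uint32_t prev = atomic_fetch_or_explicit(w, bit, memory_order_relaxed);
    return (prev & bit) == 0;
}
```

A stale read of "not set" is also harmless: it only sends the thread down the slow path, where the atomic result is authoritative.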

Type-GC prunes CodeView types not referenced by any surviving symbol. That is a PDB-size win, but it removes types a debugger can still legitimately cast to in the watch window (the reachable-from-symbols set is a subset of the castable-type set) -- which is why the same approach was reverted before after users reported losing the ability to cast in the watch. So gate it behind /OPT:GCTYPES, default OFF; only opt in when the smaller PDB is worth the reduced castable-type set. Numbers (UnrealEditorFortnite-Engine.dll): default PDB 5315 MB; with /OPT:GCTYPES 5081 MB (-234). LINK ok, no RelocationAgainstRemovedSection either way. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two data races in the parallel library search made radlink output nonreproducible: relinking the same inputs produced ~2.1M differing bytes (every reloc against an import could shift). Both originate in the generated DLL import objs (".idata"), whose symbol values (IAT slots, jump thunks) are laid out in the order the imports appear -- so any nondeterminism in the import set or order propagates to every call site that references an imported function.

1. Import member order. The parallel lib search appends discovered import members to link->imports in worker/round completion order, which is nondeterministic. Sort them into a stable total order (by link_symbol, member_idx tie-break) before generating the import objs.
2. Misindexed dedup flag. When a second reference to an already-queued import is found, lnk_queue_lib_member OR'd LinkedRegular/LinkedImp into import_member_infos[member_idx] -- but member_idx indexes the *currently searched* lib, not the import's lib. It must be is_queued_import->member_idx. The wrong (race-determined) slot got flagged, so whether an import emitted a jump thunk varied run to run, changing the import obj's symbol count.

After both fixes, relinking UnrealEditorFortnite-Engine.dll is byte-identical except for the 20 bytes of intentional PE timestamp and PDB GUID/age (verified: 2,110,092 -> 20 differing bytes; output size unchanged).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
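Fix 1 amounts to imposing a total order that does not depend on discovery order. A minimal sketch, assuming each import carries its link symbol as a C string plus its member index (the `ImportMember` struct and function names are hypothetical, and the actual key types in radlink may differ):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char *link_symbol;
    uint32_t    member_idx;
} ImportMember;

/* Order by link_symbol, then member_idx. As long as that pair is unique
   per import, the result is the same for every input permutation. */
static int import_member_compare(const void *a, const void *b)
{
    const ImportMember *x = a;
    const ImportMember *y = b;
    int c = strcmp(x->link_symbol, y->link_symbol);
    if (c != 0) {
        return c;
    }
    return (x->member_idx > y->member_idx) - (x->member_idx < y->member_idx);
}

void sort_imports(ImportMember *imports, size_t count)
{
    qsort(imports, count, sizeof *imports, import_member_compare);
}
```

qsort itself is not a stable sort, which is why the tie-break has to make the comparator a strict total order rather than relying on input position.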

lnk_sort_contribs_task ran one serial radsort per chunk. A section's contribs live in a single chunk sized to the whole section, so the merged .text chunk (millions of entries) was sorted serially on one worker while every other thread idled -- the straggler that stretched the "Sort Section Contribs" phase. Sort chunks >= 64K entries with the parallel radix sort (key = Compose64Bit(obj_idx, obj_sect_idx), which is unique per contrib so the order matches the comparator) using all threads, before the per-chunk task pass handles the small remainder. Output is unchanged: section sizes identical, byte diff within the pre-existing relink noise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
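For reference, a serial sketch of sorting (key, value) pairs by a composed 64-bit key with an LSD radix sort over 8-bit digits. `compose_64`, `Pair`, and `radix_sort_pairs` are illustrative stand-ins for Compose64Bit and the shared helper; the version in this PR splits the work across threads, which is not shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t key;
    uint64_t value;
} Pair;

/* High half dominates, so integer order on the key equals lexicographic
   order on (hi, lo). */
uint64_t compose_64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Stable LSD radix sort, 8 passes of 8 bits. Returns 1 on success,
   0 if the scratch buffer could not be allocated. */
int radix_sort_pairs(Pair *items, size_t count)
{
    Pair *tmp = malloc((count ? count : 1) * sizeof *tmp);
    Pair *src = items;
    Pair *dst = tmp;
    if (!tmp) {
        return 0;
    }
    for (unsigned shift = 0; shift < 64; shift += 8) {
        size_t offsets[256] = {0};
        size_t sum = 0;
        for (size_t i = 0; i < count; i++) {
            offsets[(src[i].key >> shift) & 0xFF]++;     /* histogram */
        }
        for (size_t d = 0; d < 256; d++) {               /* exclusive prefix sum */
            size_t c = offsets[d];
            offsets[d] = sum;
            sum += c;
        }
        for (size_t i = 0; i < count; i++) {             /* stable scatter */
            dst[offsets[(src[i].key >> shift) & 0xFF]++] = src[i];
        }
        Pair *swap = src;
        src = dst;
        dst = swap;
    }
    /* An even number of passes leaves the sorted data back in items. */
    free(tmp);
    return 1;
}
```

Because the composed key is unique per contribution, the radix order and the comparator order cannot disagree on ties.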

ICF refinement re-densified all ~N candidates every round, radix-sorting the full set each iteration to a fixpoint. But a candidate alone in its equivalence class can never split or merge again -- its color is final. Track an active set of only the candidates still sharing a class with another, and re-densify just that set each round (ids drawn from an ever-increasing base so they never collide with the colors already finalized for singletons). The per-round sort shrinks from all candidates to those that still have a content+reloc twin, and converged classes drop out as they fragment into singletons. Output is unchanged -- relinking UnrealEditorFortnite-Engine.dll is byte-identical to the prior ICF (same folds, same size) and reproducible across runs. On that link the refinement loop drops from ~888ms to ~765ms (first round prunes ~2.1M of 3.98M candidates to singletons immediately). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ved symbols lnk_search_lib_task scans the entire search-chunk symbol set once per library (2733 dispatches on the UE editor link). A symbol that started Undefined/Weak stays in search_chunks even after a definition resolves it, so every later library pass re-parsed it -- lnk_ref_from_symbol + lnk_parsed_symbol_from_coff_symbol_idx -- just to recompute its interp and skip it. That parse faults the COFF symbol record out of the mmap'd obj, and the profile showed those two lines at ~65% of the task and a matching wall of page-fault kernel time. The interp is already computed once in lnk_symbol_table_push_; cache it on LNK_Symbol and read it in the hot loop. Only genuinely Weak symbols still parse (for the weak-extension characteristics). The hash-trie node always points at the current leader symbol, so the cached interp reflects the resolved state. Output unchanged (byte-identical DLL+PDB, reproducible). Lib-search dispatch wall drops ~18% (~1.5s -> ~1.2s on the UE editor link) with a larger drop in aggregate CPU and page-fault traffic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lnk_link_inputs resolves libraries to a fixpoint: an outer pass loops over every library, and each library is re-searched (a full tp_for_parallel over all workers, scanning every undefined/weak symbol in search_chunks) once per drained input batch until nothing new resolves. On the UE editor link that is ~2733 dispatches, each waking and joining ~60 workers -- and the phase is barrier-bound, so that wake/join is the cost, not the scan. Most re-searches are redundant: search_chunks only grows during the loop (symbols are never removed until the end) and member-queue dedup is idempotent, so a re-search can only queue new members if the undefined/weak symbol set grew or anti-dep searching was just enabled since this library was last searched. Stamp each LNK_Lib with the search_chunks symbol count + anti-dep mode at its last search and skip the dispatch when neither changed. ~24% fewer dispatches (2733 -> ~2089) and ~0.2s of wake/join wall-time removed. Output is byte-identical and reproducible (which dispatch coalesces is timing dependent, but a skipped one provably queued nothing, so the result is unchanged -- verified relink-twice byte-identical across many runs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
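The skip test boils down to comparing two stamped values per library. A minimal sketch, with a hypothetical `LibSearchStamp` struct standing in for the fields added to LNK_Lib and an extra `searched_once` flag (an assumption) so the first search always runs:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t last_symbol_count;  /* search_chunks symbol count at last search */
    bool     last_anti_dep;      /* anti-dep mode at last search */
    bool     searched_once;
} LibSearchStamp;

/* Returns true if the library must be searched again, and updates the
   stamp in that case. A dispatch is skipped only when neither the
   undefined/weak symbol set nor the anti-dep mode has changed. */
bool lib_needs_search(LibSearchStamp *stamp, uint64_t symbol_count, bool anti_dep)
{
    if (stamp->searched_once &&
        stamp->last_symbol_count == symbol_count &&
        stamp->last_anti_dep == anti_dep) {
        return false;
    }
    stamp->searched_once = true;
    stamp->last_symbol_count = symbol_count;
    stamp->last_anti_dep = anti_dep;
    return true;
}
```

Using the symbol count as the change detector relies on the property stated above: search_chunks only grows during the loop, so an unchanged count means an unchanged set.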

honkstar1 · 2026-06-25T06:01:43Z

Closing in favor of #842. This was an older umbrella branch; every commit here (COFF-symbol memoization, cached symbol interp, redundant lib re-search skip, type-GC + /OPT:GCTYPES, /OPT:ICF, import-determinism, GNU-ar input, section-contrib coalesce/sort, etc.) is absorbed or superseded by the refreshed #842, which is rebased on latest dev with already-taken commits dropped and kept as clean per-feature commits for cherry-picking. Continue review there.

honkstar1 and others added 13 commits June 16, 2026 20:06

honkstar1 mentioned this pull request Jun 19, 2026

radlink: link-time performance pass (~1.9x on a Fortnite link) #830

Closed

honkstar1 and others added 14 commits June 19, 2026 16:59

honkstar1 force-pushed the perf/radlink-link-time-plus branch from 1bb78ed to 9223d30 Compare June 20, 2026 00:00

honkstar1 changed the title ~~radlink: link-time, memory, and output-size optimizations~~ radlink: reproducibility, link-time, memory, and output-size pass Jun 20, 2026

ryanfleury force-pushed the dev branch from a9dd68c to 4f899ac Compare June 20, 2026 03:12

honkstar1 closed this Jun 25, 2026

honkstar1 wants to merge 27 commits into EpicGames:dev from honkstar1:perf/radlink-link-time-plus
