radlink: reproducibility, link-time, memory, and output-size pass#839
Closed
honkstar1 wants to merge 27 commits into
get_cpu_features was the top main-thread hot spot (~5.6s for one Fortnite link), 97% of it inside a single ATOMIC_LOAD(g_cpu_features). On MSVC, blake3_dispatch.c defines ATOMIC_LOAD as _InterlockedOr(&x,0) -- a lock'd RMW (full barrier) run on every BLAKE3 compress dispatch. The value is written once and read-only after, so the barrier is pointless.
Enable BLAKE3's plain-load path (C11 _Atomic, a plain mov on x86) via build flags only, leaving the vendored third_party/blake3 source untouched:
  /std:c11 /experimental:c11atomics -DBLAKE3_ATOMICS=1
Scoped to the radlink target. MSVC C11 atomics need both /std:c11 and /experimental:c11atomics.
get_cpu_features: 5591ms -> 4ms (main thread).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
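A minimal sketch of the write-once, read-many pattern described above, using portable C11 atomics. The names `g_features`, `FEATURES_UNDEFINED`, `detect_features`, and `get_features` are hypothetical stand-ins and are not taken from the BLAKE3 source; the point is only that the read path is an atomic load, which on x86 compiles to a plain `mov`, with no lock-prefixed read-modify-write.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical sentinel meaning "not detected yet". */
#define FEATURES_UNDEFINED (1 << 30)

/* Hypothetical cached feature mask: written once, read many times. */
static _Atomic int g_features = FEATURES_UNDEFINED;

/* Hypothetical detection routine; a real one would query CPUID. */
static int detect_features(void) { return 0x2A; }

static int get_features(void) {
  /* Atomic load: no read-modify-write on the hot path. */
  int f = atomic_load(&g_features);
  if (f == FEATURES_UNDEFINED) {
    f = detect_features();
    assert(f != FEATURES_UNDEFINED);
    /* Racing first callers all store the same value, so the race is benign. */
    atomic_store(&g_features, f);
  }
  return f;
}
```

With MSVC this form needs the two flags named in the commit message; GCC and Clang accept it as-is in C11 mode.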
coff_read_symbol_name scans a cstr in the memory-mapped string table -- the
dominant, page-fault-bound cost of bulk symbol parsing. Many hot callers parse
a full symbol but only read scalar fields (value/section/storage_class/aux) to
interpret the symbol value; the name is never used.
Add name-skipping parse variants and route the interp-only paths through them:
coff_parse_symbol{16,32}_no_name (coff_parse.c) -- and the full variants now
call these + add the name, so the scalar logic lives in one place
lnk_parsed_symbol_from_coff_symbol_idx_no_name (lnk_obj.c)
lnk_interp_from_symbol / lnk_can_replace_symbol / lnk_on_symbol_replace
(lnk_symbol_table.c) and the lnk_search_lib_task loop (lnk.c)
Where the name is still needed (lnk_search_lib) it uses the already-cached
LNK_Symbol.name instead of re-parsing. lnk_can_replace_symbol previously parsed
dst/src twice (full parse + a second parse for interp); collapsed to one
no-name parse each.
coff_parse_symbol32 on the main thread: 3922ms -> ~810ms (name-needed callers).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
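The split between a name-skipping parse and a full parse can be sketched as below. This is an illustration under simplifying assumptions, not the radlink code: `RawSym`, `ParsedSym`, `parse_sym_no_name`, and `parse_sym` are hypothetical, and the name is modeled as an 8-byte inline field rather than a string-table lookup.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified symbol record: inline name plus scalar fields. */
typedef struct {
  char     name[8];
  uint32_t value;
  int16_t  section;
  uint8_t  storage_class;
  uint8_t  aux_count;
} RawSym;

typedef struct {
  const char *name;      /* left NULL by the no-name variant */
  size_t      name_len;
  uint32_t    value;
  int32_t     section;
  uint8_t     storage_class;
  uint8_t     aux_count;
} ParsedSym;

/* Scalar-only parse: never touches the name bytes. */
static ParsedSym parse_sym_no_name(const RawSym *raw) {
  assert(raw != NULL);
  ParsedSym p = {0};
  p.value         = raw->value;
  p.section       = raw->section;
  p.storage_class = raw->storage_class;
  p.aux_count     = raw->aux_count;
  return p;
}

/* Full parse: reuses the scalar logic, then adds the (costly) name scan. */
static ParsedSym parse_sym(const RawSym *raw) {
  ParsedSym p = parse_sym_no_name(raw);
  const char *nul = memchr(raw->name, 0, sizeof raw->name);
  p.name     = raw->name;
  p.name_len = nul ? (size_t)(nul - raw->name) : sizeof raw->name;
  return p;
}
```

Keeping the scalar logic in one function means callers that only interpret the symbol value pay nothing for the name, and the two variants cannot drift apart.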
lnk_on_symbol_replace merged ref lists by walking the destination's singly linked refs list to its tail on every merge. Across repeated COMDAT merges into one accumulating leader this is O(n^2) and was 96% of the function.
Add a refs_tail pointer to LNK_Symbol so the append is O(1):
  src->refs_tail->next = dst->refs;
  src->refs_tail = dst->refs_tail;
maintained at all ref-list write sites (lnk_make_symbol, the null_symbol and import-stub sites in lnk.c).
Order and head identity are preserved exactly, so this is a pure perf change: the head node stays the primary ref, and interior order is irrelevant (every multi-ref consumer sorts).
lnk_on_symbol_replace (main thread): 1306ms -> 161ms exclusive.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
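As a rough illustration of the tail-pointer idea, here is a generic O(1) list splice. The types `Ref` and `Sym` and both helper functions are hypothetical and much smaller than LNK_Symbol; the sketch also clears the donor list, which the original may or may not do.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Ref { struct Ref *next; int id; } Ref;

/* Hypothetical symbol: head and tail of a singly linked ref list. */
typedef struct { Ref *refs; Ref *refs_tail; } Sym;

/* Every write site must keep refs_tail in sync with the list. */
static void sym_push_ref(Sym *s, Ref *r) {
  assert(r != NULL);
  r->next = NULL;
  if (s->refs == NULL) s->refs = r;
  else                 s->refs_tail->next = r;
  s->refs_tail = r;
}

/* Splice all of `from`'s refs onto the end of `into` without walking. */
static void sym_append_refs(Sym *into, Sym *from) {
  if (from->refs == NULL) return;
  if (into->refs == NULL) into->refs = from->refs;
  else                    into->refs_tail->next = from->refs;
  into->refs_tail = from->refs_tail;
  from->refs = from->refs_tail = NULL;
}
```

The head of `into` is unchanged by the splice, which is the property the commit relies on when it says head identity is preserved.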
tp_for_parallel woke workers with a loop of single semaphore_drop calls -- one ReleaseSemaphore syscall per worker (twice in shared mode). The main thread spent ~3.3s in ReleaseSemaphore over a Fortnite link.
Add semaphore_drop_n(sem, count) (a single ReleaseSemaphore(h, count, 0) on Windows; a loop on POSIX) and wake all drop_count workers in one call.
Two details keep the batched release correct:
- Wake the full drop_count (NOT drop_count-1). The main thread runs as worker 0, but tasks that nest a tp_broadcast_ barrier span all workers; under-waking by one leaves that barrier a participant short and deadlocks on small dispatches.
- Give the exec/task semaphores 2x max-count headroom. A single batched release can land while up to worker_count-1 previously-woken workers have not yet re-taken their permit, so the count can transiently approach 2*worker_count; a tight max would make ReleaseSemaphore fail outright and deadlock at the next barrier.
ReleaseSemaphore (main thread): 3274ms -> 940ms.
The capped cstr length scan was a byte-by-byte loop. Switch to memchr, which is SIMD-accelerated in the CRT. Speeds up every capped-cstr scan in the codebase; notably cv_name_from_symbol (CodeView symbol-name scan during GSI build). cv_name_from_symbol (main thread): 1098ms -> 201ms. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
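A minimal sketch of a capped C-string length scan built on `memchr`. The function name `cstr_len_capped` is hypothetical; the actual helper in the codebase is not named in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the string at `s`, looking at no more than `cap` bytes.
   Returns `cap` when no terminator is found within the cap. */
static size_t cstr_len_capped(const char *s, size_t cap) {
  assert(s != NULL || cap == 0);
  if (cap == 0) return 0;
  const char *nul = memchr(s, 0, cap);   /* typically vectorized by the C runtime */
  return nul ? (size_t)(nul - s) : cap;
}
```

The behavior matches a byte-by-byte loop with the same cap; only the scan itself is delegated to the library.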
lnk_fixup_cv_type_indices did two open-addressing probes per type-index reference: lnk_leaf_hash_table_search (leaf_ref -> canonical bucket), then lnk_assigned_type_ht_search (canonical bucket -> assigned type index, via a second hash table keyed by leaf-ref content). Both are cache-miss-bound and this ran across every type-index reference in every obj.
Store the assigned type index directly on the leaf hash table: add a ti_arr parallel to bucket_arr. lnk_assign_type_indices_task writes ti = min+i into the leaf's bucket slot (each unique leaf owns a distinct slot, so worker writes never collide), and the new lnk_leaf_hash_table_search_ti recovers it in one probe.
Removes the entire assigned_type_hts table and its build pass; deletes the now-dead lnk_leaf_hash_table_search and lnk_assigned_type_ht_search.
Correctness: deduplicated leaves share the same ghash (debug_h value), so the fixup query and the assign-time canonical bucket hash to the same slot.
lnk_fixup_cv_type_indices (main thread): 1445ms -> 390ms inclusive.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Input obj/lib files are mapped copy-on-write (PAGE_WRITECOPY/FILE_MAP_COPY) so the linker can patch them in place. Pages touched during linking become private-dirty; at process exit the kernel reclaims them in single-threaded address-space rundown -- ~3s of lingering process time after the last thread exits for a large (Fortnite-scale) link. After all outputs are written and inputs are no longer read (post image-write join), unmap the whole-file CoW views in parallel on the thread pool. The same reclaim work then runs multi-threaded, off the serial post-exit path: measured ~34s of aggregate UnmapViewOfFile CPU collapsing to ~0.55s wall, and the post-exit process tail dropping from ~3s to ~0.5s. Only the is_thin whole-file views are swept (lib-member substrings and linkgen arena data are skipped), and only in the copy-on-write (read-only) mapping mode -- read-write-shared mapping would flush dirty pages back to the input files on unmap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two micro-optimizations on hot parse helpers (profiled as the largest aggregate-CPU functions in a Fortnite link):
- lnk_obj_section_from_sect_idx: split out a _no_name variant that skips the section-name string-table lookup (coff_name_from_section_header). The full variant now reuses it + adds the name. lnk_raw_directives_from_obj iterated every section of every obj building the full section struct just to test a flag, computing the name on ~all sections though only .drectve needs it -- now uses the no-name variant and resolves the name only inside the LnkInfo branch.
- lnk_parsed_symbol_from_coff_symbol_idx (+_no_name): return the coff_parse_* result directly instead of zero-initializing a local and assigning to it, removing a redundant ~48-byte COFF_ParsedSymbol copy + zero-init per call (RVO).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnk_leaf_hash_table_search_ti (~#2 radlink hotspot, ~48% of its self-time) spent its time in lnk_match_leaf_ref, which is just a_hash==b_hash but fetches the bucket's hash via lnk_hash_from_leaf_ref -> input->debug_h_arr[obj].v[leaf], a scattered cache miss per probe step.
Add LNK_LeafHashTable.hash_arr (parallel to bucket_arr), populated at bucket claim/update (both lnk_populate_leaf_ht and lnk_leaf_dedup_task) with the leaf's debug_h hash. search_ti now matches via hash_arr[idx] == hash -- no deref. Exact equivalent (match is pure hash compare); same value across same-hash updates.
Gated 65/65 linker torture (ghash_basic/match_debug_t, determ_test, p2r_determinism).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The image buffer was push'd on the shared link arena and only reclaimed in the single-threaded process rundown at exit -- a multi-second kernel page-reclaim tail (observed: one thread 100% in-kernel, zero user frames). Allocate it as a standalone reserve_memory/commit_memory region and release_memory() it the instant the background image-write thread joins (image is on disk, no later reader). VirtualFree(MEM_RELEASE) returns fast; the kernel zeroes the ~1GB on its background thread, overlapping the parallel input-view release + exit instead of blocking rundown. Discard early so the kernel cleans up while the app still runs -- don't defer to exit. Gated 65/65 linker torture (determ_test + p2r_determinism: image correct). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hotspot)
Parse each COFF symbol once in lnk_obj_initer into LNK_Obj.parsed_symbols; lnk_parsed_symbol_from_coff_symbol_idx[_no_name] becomes an array index instead of re-decoding the mmapped symbol table on every access (it was the #1 hotspot, lnk_parsed_symbol_from_coff_symbol_idx). All symbol-value patch sites (weak-replace, COMDAT-leader, regular/common fixups) write obj->parsed_symbols[idx], decoupling symbol values from the input mapping.
Extracted from the entangled WIP commit 8dc5fa6 (parsed-symbol memo + .rgd staging); this is the memo half only -- no .rgd. Gated separately.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…im ~3GB peak)
The CV type-index fixup (05af760) had stored the assigned type index in arrays parallel to the dedup hash table (leaf_ht), whose cap is the TOTAL pre-dedup leaf count summed over all objs. On large links that ti_arr (+ the hash_arr added for deref-free probing) added ~3GB to peak working set vs the prior unique-sized assigned_type_ht.
Split the two concerns:
- LNK_LeafHashTable: just {cap, bucket_arr} for dedup (total-sized, as before / unavoidable).
- LNK_AssignedTiHash {cap, ti_arr, hash_arr}: hash -> assigned ti, sized to the UNIQUE (post-dedup) leaf count. Built in lnk_assign_type_indices_task by hashing each unique leaf into its own slot (atomic claim; unique leaves have distinct hashes since dedup is by hash). search_ti probes it in one deref-free pass (occupant hash stored on the slot), exactly as before -- just on a table sized by unique instead of total.
Keeps the single-probe fixup speed of 05af760; removes its peak-memory regression. ti==0 marks empty (assigned ti is always >= CV_MinComplexTypeIndex).
Builds clean, 95 linker torture PASS, 0 fail.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
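The hash -> assigned-ti table described above can be sketched as a small open-addressing map with the occupant hash stored next to the value, so a probe never dereferences the leaf. This is a single-threaded illustration under stated assumptions: the type and function names are hypothetical, the sizing factor is arbitrary, and the atomic slot claim used by the parallel build is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a {cap, ti_arr, hash_arr} table. */
typedef struct {
  uint64_t  cap;
  uint32_t *ti_arr;    /* 0 marks an empty slot */
  uint64_t *hash_arr;  /* occupant's hash, compared without any deref */
} AssignedTiHash;

static AssignedTiHash ti_hash_make(uint64_t unique_count) {
  AssignedTiHash t;
  t.cap      = unique_count * 2 + 1;   /* assumed headroom so probes terminate */
  t.ti_arr   = calloc(t.cap, sizeof *t.ti_arr);
  t.hash_arr = calloc(t.cap, sizeof *t.hash_arr);
  assert(t.ti_arr != NULL && t.hash_arr != NULL);
  return t;
}

/* Linear probing; assumes hashes are distinct and the table is not full. */
static void ti_hash_insert(AssignedTiHash *t, uint64_t hash, uint32_t ti) {
  assert(ti != 0);
  for (uint64_t i = hash % t->cap; ; i = (i + 1) % t->cap) {
    if (t->ti_arr[i] == 0) {
      t->hash_arr[i] = hash;
      t->ti_arr[i]   = ti;
      return;
    }
  }
}

/* One pass over the probe sequence; returns 0 when the hash is absent. */
static uint32_t ti_hash_search(const AssignedTiHash *t, uint64_t hash) {
  for (uint64_t i = hash % t->cap; t->ti_arr[i] != 0; i = (i + 1) % t->cap) {
    if (t->hash_arr[i] == hash) return t->ti_arr[i];
  }
  return 0;
}
```

Using ti == 0 as the empty marker avoids a separate occupancy array, which works only because every assigned index is nonzero.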
g_input_type_map mapped o/obj/lib/rlib/res/rrt but not .a, so clang/meson-built GNU ar archives (e.g. ThirdParty libdav1d.a) hit Error(002) 'unknown file format'. rlib (also GNU ar) already routes to LNK_Input_Lib and parses, so map .a the same. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
LNK_Obj.parsed_symbols memoized the full COFF_ParsedSymbol per symbol -- including the 16B String8 name -- sized by total symbol count, held to exit. Store a slim LNK_ParsedSymbolLite (every field except name, ~24B vs ~40B) and re-decode the name from the read-only symbol record in the named accessor only. The hot _no_name path (can_replace/GC/resolution) and all symbol-value patching never touch the name, so they stay fully memoized; only the named/push path pays a re-decode (cold relative to total). FN no-rrt full link: peak commit 50.5GB -> 49.1GB (-1.4GB), wall flat (~8-9s no-debug), valid 5.68GB PDB, 95/0 linker torture. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Store the COFF symbol record as a U32 byte-offset into obj->data instead of an 8B pointer, and value as U32 (COFF symbol value is U32). Struct goes 24B -> 16B (no padding): raw_symbol_off(4) value(4) section_number(4) type(2) storage_class(1) aux(1). The named/no_name accessors reconstruct the pointer as obj->data.str + off (obj->data == input->data, so the offset is stable). Sized by total symbol count -> 8B/sym off peak. FN no-rrt full link: peak commit 49.1GB -> 48.4GB (-0.67GB), wall flat, valid 5.68GB PDB, 95/0 linker torture. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
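The 16-byte layout listed above can be checked at compile time. The struct below mirrors the field order and widths given in the commit message, but the type name `ParsedSymbolLite` and the helper `raw_symbol_ptr` are hypothetical, and `obj_data` stands in for obj->data.str.

```c
#include <assert.h>
#include <stdint.h>

/* Fields ordered widest-first so the struct packs to 16 bytes with no padding. */
typedef struct {
  uint32_t raw_symbol_off;   /* byte offset of the raw record in the obj data */
  uint32_t value;
  uint32_t section_number;
  uint16_t type;
  uint8_t  storage_class;
  uint8_t  aux;
} ParsedSymbolLite;

_Static_assert(sizeof(ParsedSymbolLite) == 16, "expected a 16-byte record");

/* Rebuild the record pointer from the stored offset. */
static const uint8_t *raw_symbol_ptr(const uint8_t *obj_data, const ParsedSymbolLite *s) {
  assert(obj_data != 0 && s != 0);
  return obj_data + s->raw_symbol_off;
}
```

A 32-bit offset is enough only if a single obj's data stays under 4 GiB, which this sketch assumes.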
After type merging, prune any merged TPI/IPI record not transitively reachable
from a surviving symbol. Roots are the type indices referenced by the symbols
that survive /OPT:REF (plus inlinee call-site types); the type graph is then
closed over and everything unreached is dropped before the streams are written.
Runs only under /OPT:REF -- it is the debug-info analogue of dead-section
stripping, and is otherwise transparent (no visible type is removed).
Implementation notes:
- parallel transitive closure (bulk-synchronous rounds, atomic mark/expand)
- fwdref<->definition pairing via a per-unique-name ring so a live forward
reference keeps its definition (and vice-versa)
- compaction is in place with the remap kept in scratch, so peak memory is
unchanged
Numbers (UnrealEditorFortnite-Engine.dll, /OPT:REF /OPT:ICF, hashing NONE):
PDB 5315 -> 5081 MB (type-GC alone: -234 MB)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…arallelized
Fold byte-identical COMDAT sections whose relocations point at equivalent
targets, iterated to a fixpoint, then redirect each group's followers at their
shared symbol-table node so every reference resolves to one leader and /OPT:REF
collects the now-unreferenced follower sections (and their associated
.pdata/.xdata/.debug$S). Mirrors link.exe /OPT:ICF.
- equivalence: round-0 key from content + reloc structure + non-candidate
target identity; refine by candidate targets' colors until the partition is
stable; a final byte-compare + per-reloc target-color check guards folds
(no hash-collision can produce a bad fold)
- folds code AND read-only data (vtables, const tables, string literals);
folding identical read-only data lets the functions that reference it fold
too (cascade)
- fully parallel: candidate collection (count -> exact alloc -> fill),
content hashing + reloc-target resolution, refinement, and final grouping
via a parallel LSD radix sort (8-bit digits). A flat open-addressing map
with an avalanche-scrambled key avoids O(n^2) probing on UE-scale inputs.
Only externally-defined COMDATs are folded (the follower redirects through its
symbol). Static/internal-linkage folding is intentionally out of scope here.
Numbers (UnrealEditorFortnite-Engine.dll, vs /OPT:NOICF, hashing NONE):
.text 727 -> 643 MiB (-84)
.rdata 218 -> 194 MiB (-24)
PDB 5562 -> 5081 MB (-481)
DLL 999 -> 882 MB (-117)
link 33 -> 20 s (-13; less to relocate and emit downstream)
This commit also adds the shared parallel radix-sort helper
(lnk_radix_sort_u64_pairs) used here and by the PDB GSI/PSI sort.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
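For reference, an LSD radix sort over 64-bit key/value pairs with 8-bit digits looks roughly like this. It is a serial sketch only: `Pair` and `radix_sort_pairs` are hypothetical names, and the parallel helper named in the commit (lnk_radix_sort_u64_pairs) presumably splits the counting and scatter steps across workers, which is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint64_t key; uint64_t val; } Pair;

/* Stable ascending sort by key: eight counting passes, one per byte. */
static void radix_sort_pairs(Pair *a, size_t n) {
  if (n < 2) return;
  Pair *tmp = malloc(n * sizeof *tmp);
  assert(tmp != NULL);
  Pair *src = a, *dst = tmp;
  for (int shift = 0; shift < 64; shift += 8) {
    size_t count[257] = {0};
    /* Histogram of the current digit, stored one slot to the right. */
    for (size_t i = 0; i < n; i++) count[((src[i].key >> shift) & 0xFF) + 1]++;
    /* Prefix sum turns counts into the start offset of each digit. */
    for (int d = 0; d < 256; d++) count[d + 1] += count[d];
    /* Stable scatter into the other buffer. */
    for (size_t i = 0; i < n; i++) dst[count[(src[i].key >> shift) & 0xFF]++] = src[i];
    Pair *t = src; src = dst; dst = t;
  }
  /* Eight passes means an even number of swaps: the result is back in `a`. */
  free(tmp);
}
```

Because each pass is stable, pairs with equal keys keep their input order, which is what makes the sorted result independent of comparator quirks.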
…ize GSI/PSI sort
DBI section contributions:
- after sorting, merge contiguous contributions that share (section, module,
flags), absorbing the alignment-padding gaps between them. On UE-scale
input this collapses ~12.5M contribution records to ~2.0M and shrinks the
DBI stream 367 -> 72 MB, with no change to the address map.
GSI/PSI publics sort:
- the comparators got element-stable tiebreakers (sort by record offset /
dereferenced symbol identity, not by slot pointer) so the median-of-9
quicksort cannot degrade to O(n^2) on the large runs of equal-address /
equal-name records that ICF now produces.
- gsi_record_sort_by_sc returns a radix-sorted permutation index (via the
shared lnk_radix_sort_u64_pairs) and the PSI address map is built from it,
replacing a comparator sort that stalled multiple seconds on ICF-heavy links.
Numbers (UnrealEditorFortnite-Engine.dll):
DBI stream 367 -> 72 MB
removes a multi-second GSI/PSI sort stall on ICF-folded inputs
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
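The contribution merge can be illustrated with a single in-place pass over sorted records. `Contrib` and `coalesce_contribs` are hypothetical and far simpler than the DBI structures; the sketch assumes the input is already sorted by (section, offset) and that any gap between neighbouring records of the same key is alignment padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified section contribution. */
typedef struct {
  uint16_t section;
  uint16_t module;
  uint32_t flags;
  uint32_t off;
  uint32_t size;
} Contrib;

/* Merge neighbours sharing (section, module, flags); returns the new count. */
static size_t coalesce_contribs(Contrib *c, size_t n) {
  if (n == 0) return 0;
  size_t out = 0;
  for (size_t i = 1; i < n; i++) {
    Contrib *prev = &c[out];
    /* Precondition: sorted and non-overlapping within a section. */
    assert(c[i].section != prev->section || c[i].off >= prev->off + prev->size);
    if (c[i].section == prev->section && c[i].module == prev->module &&
        c[i].flags == prev->flags) {
      /* Extend the previous record over the gap and the new record. */
      prev->size = c[i].off + c[i].size - prev->off;
    } else {
      c[++out] = c[i];
    }
  }
  return out + 1;
}
```

Every address that was covered before the merge is still covered by a record with the same module, so an address lookup returns the same answer.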
The closure re-scanned every merged type each round (O(rounds * total types)) to find marked-but-unexpanded leaves. In a full-link trace of UnrealEditorFortnite that round-rescan dominated the type-GC: lnk_gc_expand_task ~13.3 s of CPU. Replace it with a frontier worklist: the atomic mark now gates a single append per leaf, and each round expands only the slice newly marked by the previous round, so total work is O(reachable types) instead of O(rounds * total types). Drops the per-round `expanded` bitmap and full-array sweeps. Output is unchanged -- same reachable set, PDB byte-identical (5081 MB on the same input), and debugger-fidelity checks (addr->symbol 100%, core types resolve) match the pre-change build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
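A serial sketch of a frontier-worklist closure, to make the O(reachable) claim concrete. The graph is given in CSR form; `mark_reachable` and its parameters are hypothetical, and the parallel version's atomic mark and per-round task split are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Edges of node i are edge[first[i] .. first[i+1]). Returns the number of
   reachable nodes and sets mark[i] = 1 for each of them. */
static size_t mark_reachable(const uint32_t *first, const uint32_t *edge,
                             uint32_t node_count,
                             const uint32_t *roots, size_t root_count,
                             uint8_t *mark) {
  /* Each node is appended at most once, so node_count slots suffice. */
  uint32_t *work = malloc((node_count ? node_count : 1) * sizeof *work);
  assert(work != NULL);
  size_t end = 0;
  for (size_t i = 0; i < root_count; i++) {
    if (!mark[roots[i]]) { mark[roots[i]] = 1; work[end++] = roots[i]; }
  }
  size_t begin = 0;
  while (begin < end) {
    /* One round: expand only the slice marked by the previous round. */
    size_t round_end = end;
    for (; begin < round_end; begin++) {
      uint32_t n = work[begin];
      for (uint32_t e = first[n]; e < first[n + 1]; e++) {
        uint32_t t = edge[e];
        if (!mark[t]) { mark[t] = 1; work[end++] = t; }
      }
    }
  }
  free(work);
  return end;
}
```

No pass ever scans unmarked nodes, so the cost is proportional to the reachable nodes plus their outgoing edges.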
…-parse
lnk_opt_icf re-parsed every candidate section serially (lnk_coff_relocs_from_section_header per candidate) just to size the flattened reloc-target arrays. Move the per-candidate reloc count into lnk_icf_fill_task -- which is already parallel and has the section in hand -- so lnk_opt_icf only does a cheap serial prefix sum for reloc_first. No output change (PDB/.text/.pdata identical).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The frontier mark did an interlocked op on every reference edge. Add a plain non-atomic check first (the mark bit only ever goes 0->1, so a stale "already set" read is safe), so the interlocked op runs once per leaf at its 0->1 transition instead of once per edge. Output identical (PDB 5081 MB). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
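The check-before-interlock pattern can be written in portable C11 as a relaxed load followed by an exchange. The commit describes a plain non-atomic read; a relaxed atomic load is used here as the portable equivalent, and relaxed ordering is an assumption that relies on a barrier between rounds. The name `try_mark` is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true only for the caller that performs the 0 -> 1 transition. */
static bool try_mark(atomic_uchar *mark) {
  assert(mark != NULL);
  /* Cheap read first: the bit only goes 0 -> 1, so seeing 1 is always final. */
  if (atomic_load_explicit(mark, memory_order_relaxed) != 0) return false;
  /* Possibly still 0: one thread wins the exchange and owns the append. */
  return atomic_exchange_explicit(mark, 1, memory_order_relaxed) == 0;
}
```

A stale read of 0 only costs one extra exchange; it can never produce a second winner.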
Type-GC prunes CodeView types not referenced by any surviving symbol. That is a PDB-size win, but it removes types a debugger can still legitimately cast to in the watch window (the reachable-from-symbols set is a subset of the castable-type set) -- which is why the same approach was previously reverted, after users reported losing the ability to cast in the watch window.
So gate it behind /OPT:GCTYPES, default OFF; only opt in when the smaller PDB is worth the reduced castable-type set.
Numbers (UnrealEditorFortnite-Engine.dll): default PDB 5315 MB; with /OPT:GCTYPES 5081 MB (-234). LINK ok, no RelocationAgainstRemovedSection either way.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two data races in the parallel library search made radlink output
nonreproducible: relinking the same inputs produced ~2.1M differing
bytes (every reloc against an import could shift).
Both originate in the generated DLL import objs (".idata"), whose
symbol values (IAT slots, jump thunks) are laid out in the order the
imports appear -- so any nondeterminism in the import set or order
propagates to every call site that references an imported function.
1. Import member order. The parallel lib search appends discovered
import members to link->imports in worker/round completion order,
which is nondeterministic. Sort them into a stable total order
(by link_symbol, member_idx tie-break) before generating the
import objs.
2. Misindexed dedup flag. When a second reference to an already-queued
import is found, lnk_queue_lib_member OR'd LinkedRegular/LinkedImp
into import_member_infos[member_idx] -- but member_idx indexes the
*currently searched* lib, not the import's lib. It must be
is_queued_import->member_idx. The wrong (race-determined) slot got
flagged, so whether an import emitted a jump thunk varied run to
run, changing the import obj's symbol count.
After both fixes, relinking UnrealEditorFortnite-Engine.dll is
byte-identical except for the 20 bytes of intentional PE timestamp
and PDB GUID/age (verified: 2,110,092 -> 20 differing bytes; output
size unchanged).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
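Fix 1 amounts to sorting the discovered imports under a total order before use. A minimal sketch, assuming link_symbol can be compared as a C string (the real field may be a different type); `Import`, `import_compar`, and `sort_imports` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified import entry. */
typedef struct {
  const char *link_symbol;
  uint32_t    member_idx;
} Import;

/* Total order: symbol name first, member index as the tie-break. */
static int import_compar(const void *a, const void *b) {
  const Import *x = a, *y = b;
  int c = strcmp(x->link_symbol, y->link_symbol);
  if (c != 0) return c;
  return (x->member_idx > y->member_idx) - (x->member_idx < y->member_idx);
}

/* With no two entries comparing equal, the result does not depend on the
   order in which workers discovered the imports. */
static void sort_imports(Import *imports, size_t count) {
  assert(imports != NULL || count == 0);
  if (count > 1) qsort(imports, count, sizeof *imports, import_compar);
}
```

qsort is not a stable sort, so the tie-break is what makes the order fully determined.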
lnk_sort_contribs_task ran one serial radsort per chunk. A section's contribs live in a single chunk sized to the whole section, so the merged .text chunk (millions of entries) was sorted serially on one worker while every other thread idled -- the straggler that stretched the "Sort Section Contribs" phase. Sort chunks >= 64K entries with the parallel radix sort (key = Compose64Bit(obj_idx, obj_sect_idx), which is unique per contrib so the order matches the comparator) using all threads, before the per-chunk task pass handles the small remainder. Output is unchanged: section sizes identical, byte diff within the pre-existing relink noise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ICF refinement re-densified all ~N candidates every round, radix-sorting the full set each iteration to a fixpoint. But a candidate alone in its equivalence class can never split or merge again -- its color is final. Track an active set of only the candidates still sharing a class with another, and re-densify just that set each round (ids drawn from an ever-increasing base so they never collide with the colors already finalized for singletons). The per-round sort shrinks from all candidates to those that still have a content+reloc twin, and converged classes drop out as they fragment into singletons. Output is unchanged -- relinking UnrealEditorFortnite-Engine.dll is byte-identical to the prior ICF (same folds, same size) and reproducible across runs. On that link the refinement loop drops from ~888ms to ~765ms (first round prunes ~2.1M of 3.98M candidates to singletons immediately). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ved symbols
lnk_search_lib_task scans the entire search-chunk symbol set once per library (2733 dispatches on the UE editor link). A symbol that started Undefined/Weak stays in search_chunks even after a definition resolves it, so every later library pass re-parsed it -- lnk_ref_from_symbol + lnk_parsed_symbol_from_coff_symbol_idx -- just to recompute its interp and skip it. That parse faults the COFF symbol record out of the mmap'd obj, and the profile showed those two lines at ~65% of the task and a matching wall of page-fault kernel time.
The interp is already computed once in lnk_symbol_table_push_; cache it on LNK_Symbol and read it in the hot loop. Only genuinely Weak symbols still parse (for the weak-extension characteristics). The hash-trie node always points at the current leader symbol, so the cached interp reflects the resolved state.
Output unchanged (byte-identical DLL+PDB, reproducible). Lib-search dispatch wall drops ~18% (~1.5s -> ~1.2s on the UE editor link) with a larger drop in aggregate CPU and page-fault traffic.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnk_link_inputs resolves libraries to a fixpoint: an outer pass loops over every library, and each library is re-searched (a full tp_for_parallel over all workers, scanning every undefined/weak symbol in search_chunks) once per drained input batch until nothing new resolves. On the UE editor link that is ~2733 dispatches, each waking and joining ~60 workers -- and the phase is barrier-bound, so that wake/join is the cost, not the scan. Most re-searches are redundant: search_chunks only grows during the loop (symbols are never removed until the end) and member-queue dedup is idempotent, so a re-search can only queue new members if the undefined/weak symbol set grew or anti-dep searching was just enabled since this library was last searched. Stamp each LNK_Lib with the search_chunks symbol count + anti-dep mode at its last search and skip the dispatch when neither changed. ~24% fewer dispatches (2733 -> ~2089) and ~0.2s of wake/join wall-time removed. Output is byte-identical and reproducible (which dispatch coalesces is timing dependent, but a skipped one provably queued nothing, so the result is unchanged -- verified relink-twice byte-identical across many runs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Force-pushed from 1bb78ed to 9223d30
Author
Closing in favor of #842. This was an older umbrella branch; every commit here (COFF-symbol memoization, cached symbol interp, redundant lib re-search skip, type-GC + /OPT:GCTYPES, /OPT:ICF, import-determinism, GNU-ar input, section-contrib coalesce/sort, etc.) is absorbed or superseded by the refreshed #842, which is rebased on latest dev, drops the already-taken commits, and keeps clean per-feature commits for cherry-picking. Continue review there.
Summary
A pass over radlink covering, roughly in order: build reproducibility, link time, peak memory, and output size (.text / .rdata / PDB). Everything is measured on UnrealEditorFortnite-Engine.dll (a large UE editor module) against a fixed set of object inputs, comparing to MSVC link.exe where relevant.
Every change keeps the linked output byte-identical unless its whole point is to change the bytes (the size and determinism commits), verified by relinking and cmp-ing the DLL+PDB. Each commit message carries its own rationale and numbers; this is a per-area overview.

Reproducibility
Import-table determinism. Two data races in the parallel library search made radlink output non-reproducible: relinking the same inputs differed by ~2.78M bytes. Both were in the generated DLL import objs: a misindexed dedup flag (lnk_queue_lib_member used the wrong lib's member_idx) that made jump-thunk emission race, and link->imports being appended in nondeterministic discovery-round order. After the fix, relinking is byte-identical except the intentional PE timestamp + PDB GUID; with /BREPRO (timestamp 0) plus the default content-hash GUID, the DLL and PDB are bit-identical across runs.

Link time
Throughput work across the hot phases, each byte-identical:
- … .text chunk (millions of entries) was radix-sorted on one worker; split big chunks across all threads. Sort phase ~300ms → ~145ms.

Peak memory
- … LNK_ParsedSymbolLite to 16B, decode name on demand).

Output size
All on UnrealEditorFortnite-Engine.dll, /OPT:REF /OPT:ICF, type-name hashing NONE.
/OPT:ICF identical COMDAT folding (code + read-only data), parallelized. Fold byte-identical COMDATs whose relocations point at equivalent targets (iterated to a fixpoint), redirecting followers through the existing COMDAT symlink so /OPT:REF collects them. Folding identical read-only data lets the functions referencing it fold too (cascade).
[Table: sizes vs /OPT:NOICF; rows .text, .rdata]
Garbage-collect unreferenced CodeView types, gated behind /OPT:GCTYPES (default off). After type merging, drop any TPI/IPI record not transitively reachable from a surviving symbol (frontier-worklist closure). Default-off because the rad owner noted GC'd types can break casting in the watch window; opt-in keeps that out of the default path.
[Table: … /OPT:GCTYPES)]
Coalesce DBI section contributions; stabilize + parallelize GSI/PSI sort. Merge contiguous same-(section, module, flags) contributions; give the GSI/PSI comparators element-stable tiebreakers so the sort can't go quadratic on the equal-address/equal-name runs ICF produces.

Combined size result
[Table: UnrealEditorFortnite-Engine.dll compared with link.exe; rows include .text]
PDB is smaller than MSVC's. The DLL is still ~158 MB larger, almost entirely .text: MSVC folds more because it also folds static / internal-linkage COMDATs, which is deliberately out of scope here (see below).

Also on this branch
- … ar (.a) archives as lib input: a standalone feature, also up as #836 (radlink: accept GNU ar (.a) archives as lib input).

Validation
No runtime test harness was available for the editor, so changes were validated structurally + by reproducibility:
- … cmp (byte-identical DLL+PDB) gates every non-size change and proves the determinism fix;
- … RelocationAgainstRemovedSection or similar;
- .pdata scanned for malformed RUNTIME_FUNCTION records (begin/end RVAs in range): 0 bad on every output;

Out of scope / follow-ups
- … /OPT:ICFSTATIC): would close most of the remaining .text gap, but folding a static follower leaves its associated .pdata/.xdata with begin/end RVAs that don't survive the redirect, producing malformed unwind records. Left out until that's fixed and runtime-validated.

🤖 Generated with Claude Code