radlink: reproducibility, link-time, memory, and output-size pass #839

Closed
honkstar1 wants to merge 27 commits into EpicGames:dev from honkstar1:perf/radlink-link-time-plus

@honkstar1 honkstar1 commented Jun 19, 2026


Summary

A pass over radlink covering, roughly in order: build reproducibility, link time, peak memory, and output size (.text / .rdata / PDB). Everything is measured on UnrealEditorFortnite-Engine.dll (a large UE editor module) against a fixed set of object inputs, comparing to MSVC link.exe where relevant.

Every change keeps the linked output byte-identical unless its whole point is to change the bytes (the size and determinism commits) — verified by relinking and cmp-ing the DLL+PDB. Each commit message carries its own rationale and numbers; this is a per-area overview.

Reproducibility

Import-table determinism. Two data races in the parallel library search made radlink output non-reproducible: relinking the same inputs produced ~2.1M differing bytes (2,110,092 in the measured run). Both were in the generated DLL import objs: a misindexed dedup flag (lnk_queue_lib_member used the wrong lib's member_idx) that made jump-thunk emission race, and link->imports being appended in nondeterministic discovery-round order. After the fix, relinking is byte-identical except for the intentional PE timestamp and PDB GUID (20 bytes); with /BREPRO (timestamp 0) plus the default content-hash GUID, the DLL and PDB are bit-identical across runs.

This one commit is also up as #840 on its own, for independent review/merge ahead of the rest of this branch.

Link time

Throughput work across the hot phases, each byte-identical:

  • Memoize parsed COFF symbols per obj — kills the #1 link hotspot (repeated symbol re-parse).
  • Cache symbol interp on the symbol — the library search re-parsed (and page-faulted) every resolved symbol on every lib pass just to recompute its interp; now a cached field read. Lib-search CPU ~65s → ~18s aggregate.
  • Skip redundant library re-searches — the resolution fixpoint re-dispatched a full parallel search per lib even when nothing new resolved (~2733 dispatches → ~2089), removing wasted worker wake/join.
  • Parallelize the section-contrib sort — the merged .text chunk (millions of entries) was radix-sorted on one worker; split big chunks across all threads. Sort phase ~300ms → ~145ms.
  • ICF refine: skip converged classes — re-densify only the still-ambiguous candidate set each round instead of all ~4M.
  • Plus: O(1) symbol ref-list merge (tail pointer), batched thread-pool wake, C11 atomics so BLAKE3 skips the locked CPU-feature probe, single-probe CV type-index fixup, leaner symbol/section parse, count ICF reloc slices in the parallel fill rather than a serial re-parse.

Peak memory

  • Size the assigned-ti table by unique types, not total — reclaims ~3GB peak on this link.
  • Release the ~1GB image buffer early so reclaim overlaps the run; release copy-on-write input views in parallel before exit.
  • Slim + pack the parsed-symbol memo (LNK_ParsedSymbolLite to 16B, decode name on demand).

Output size

All on UnrealEditorFortnite-Engine.dll, /OPT:REF /OPT:ICF, type-name hashing NONE.

/OPT:ICF identical COMDAT folding (code + read-only data), parallelized. Fold byte-identical COMDATs whose relocations point at equivalent targets (iterated to a fixpoint), redirecting followers through the existing COMDAT symlink so /OPT:REF collects them. Folding identical read-only data lets the functions referencing it fold too (cascade).

| vs /OPT:NOICF | before | after | Δ |
|---|---|---|---|
| .text | 727 MiB | 643 MiB | −84 MiB |
| .rdata | 218 MiB | 194 MiB | −24 MiB |
| PDB | 5562 MB | 5081 MB | −481 MB |
| DLL | 999 MB | 882 MB | −117 MB |

Garbage-collect unreferenced CodeView types, gated behind /OPT:GCTYPES (default off). After type merging, drop any TPI/IPI record not transitively reachable from a surviving symbol (frontier-worklist closure). Default-off because the rad owner noted GC'd types can break casting in the watch window; opt-in keeps that out of the default path.

| (with /OPT:GCTYPES) | before | after | Δ |
|---|---|---|---|
| PDB | 5315 MB | 5081 MB | −234 MB |

Coalesce DBI section contributions; stabilize + parallelize GSI/PSI sort. Merge contiguous same-(section, module, flags) contributions; give the GSI/PSI comparators element-stable tiebreakers so the sort can't go quadratic on the equal-address/equal-name runs ICF produces.

| | before | after |
|---|---|---|
| DBI stream | 367 MB | 72 MB |
| contribution records | ~12.5M | ~2.0M |

Combined size result

| UnrealEditorFortnite-Engine.dll | radlink (this branch) | MSVC link.exe |
|---|---|---|
| PDB | 5081 MB | 5421 MB |
| DLL | 882 MB | 724 MB |
| .text | 643 MiB | 524 MiB |

PDB is smaller than MSVC's. The DLL is still ~158 MB larger, almost entirely .text: MSVC folds more because it also folds static / internal-linkage COMDATs — deliberately out of scope here (see below).

Also on this branch

Validation

No runtime test harness was available for the editor, so changes were validated structurally + by reproducibility:

  • relink-twice cmp (byte-identical DLL+PDB) gates every non-size change and proves the determinism fix;
  • link exits 0, no RelocationAgainstRemovedSection or similar;
  • .pdata scanned for malformed RUNTIME_FUNCTION records (begin/end RVAs in range) — 0 bad on every output;
  • output section sizes + PDB validity checked; module/section inputs confirmed identical to the MSVC reference, so size deltas are folding, not codegen;
  • debugger fidelity (names/types/lines) spot-checked via DIA/dbghelp against the MSVC PDB.

Out of scope / follow-ups

  • Static-linkage ICF (/OPT:ICFSTATIC): would close most of the remaining .text gap, but folding a static follower leaves its associated .pdata/.xdata with begin/end RVAs that don't survive the redirect, producing malformed unwind records. Left out until that's fixed and runtime-validated.
  • The link is increasingly page-fault bound (streaming the multi-GB mmap'd input working set, with kernel working-set-lock contention across workers). Read-only input mapping and bulk prefetch were both tried and measured as neutral-to-negative; cutting it further needs to touch less data (e.g. lazier CodeView parsing), not a flag.

🤖 Generated with Claude Code

honkstar1 and others added 13 commits June 16, 2026 20:06
get_cpu_features was the top main-thread hot spot (~5.6s for one Fortnite
link), 97% of it inside a single ATOMIC_LOAD(g_cpu_features). On MSVC,
blake3_dispatch.c defines ATOMIC_LOAD as _InterlockedOr(&x,0) -- a lock'd
RMW (full barrier) run on every BLAKE3 compress dispatch. The value is
written once and read-only after, so the barrier is pointless.

Enable BLAKE3's plain-load path (C11 _Atomic, a plain mov on x86) via build
flags only, leaving the vendored third_party/blake3 source untouched:
  /std:c11 /experimental:c11atomics -DBLAKE3_ATOMICS=1
Scoped to the radlink target. MSVC C11 atomics need both /std:c11 and
/experimental:c11atomics.

get_cpu_features: 5591ms -> 4ms (main thread).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
coff_read_symbol_name scans a cstr in the memory-mapped string table -- the
dominant, page-fault-bound cost of bulk symbol parsing. Many hot callers parse
a full symbol but only read scalar fields (value/section/storage_class/aux) to
interpret the symbol value; the name is never used.

Add name-skipping parse variants and route the interp-only paths through them:
  coff_parse_symbol{16,32}_no_name (coff_parse.c) -- and the full variants now
    call these + add the name, so the scalar logic lives in one place
  lnk_parsed_symbol_from_coff_symbol_idx_no_name (lnk_obj.c)
  lnk_interp_from_symbol / lnk_can_replace_symbol / lnk_on_symbol_replace
    (lnk_symbol_table.c) and the lnk_search_lib_task loop (lnk.c)

Where the name is still needed (lnk_search_lib) it uses the already-cached
LNK_Symbol.name instead of re-parsing. lnk_can_replace_symbol previously parsed
dst/src twice (full parse + a second parse for interp); collapsed to one
no-name parse each.

coff_parse_symbol32 on the main thread: 3922ms -> ~810ms (name-needed callers).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnk_on_symbol_replace merged ref lists by walking the destination's singly
linked refs list to its tail on every merge. Across repeated COMDAT merges
into one accumulating leader this is O(n^2) and was 96% of the function.

Add a refs_tail pointer to LNK_Symbol so the append is O(1):
  src->refs_tail->next = dst->refs;
  src->refs_tail       = dst->refs_tail;
maintained at all ref-list write sites (lnk_make_symbol, the null_symbol and
import-stub sites in lnk.c). Order and head identity are preserved exactly, so
this is a pure perf change: the head node stays the primary ref, and interior
order is irrelevant (every multi-ref consumer sorts).

lnk_on_symbol_replace (main thread): 1306ms -> 161ms exclusive.
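
A minimal sketch of the tail-pointer splice, assuming a singly linked ref list. `Ref`, `Sym`, `sym_push_ref`, and `sym_merge_refs` are hypothetical stand-ins, not the actual LNK_Symbol layout or radlink functions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the linker's ref node and symbol. */
typedef struct Ref { struct Ref *next; int id; } Ref;
typedef struct Sym { Ref *refs; Ref *refs_tail; } Sym;

/* Append one ref and keep refs_tail in sync; every write site must do this. */
static void sym_push_ref(Sym *s, Ref *r) {
  r->next = NULL;
  if (s->refs == NULL) { s->refs = r; } else { s->refs_tail->next = r; }
  s->refs_tail = r;
}

/* Splice all of `from`'s refs onto the end of `into` in O(1), no list walk. */
static void sym_merge_refs(Sym *into, Sym *from) {
  if (from->refs == NULL) return;
  if (into->refs == NULL) {
    into->refs = from->refs;
  } else {
    assert(into->refs_tail != NULL && into->refs_tail->next == NULL);
    into->refs_tail->next = from->refs;
  }
  into->refs_tail = from->refs_tail;
  from->refs = from->refs_tail = NULL;
}
```

The head of `into` is untouched by the splice, which is the property the commit relies on for "head node stays the primary ref".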

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tp_for_parallel woke workers with a loop of single semaphore_drop calls -- one
ReleaseSemaphore syscall per worker (twice in shared mode). The main thread spent
~3.3s in ReleaseSemaphore over a Fortnite link.

Add semaphore_drop_n(sem, count) (a single ReleaseSemaphore(h, count, 0) on
Windows; a loop on POSIX) and wake all drop_count workers in one call.

Two details keep the batched release correct:
 - Wake the full drop_count (NOT drop_count-1). The main thread runs as worker 0,
   but tasks that nest a tp_broadcast_ barrier span all workers; under-waking by
   one leaves that barrier a participant short and deadlocks on small dispatches.
 - Give the exec/task semaphores 2x max-count headroom. A single batched release
   can land while up to worker_count-1 previously-woken workers have not yet
   re-taken their permit, so the count can transiently approach 2*worker_count; a
   tight max would make ReleaseSemaphore fail outright and deadlock at the next
   barrier.

ReleaseSemaphore (main thread): 3274ms -> 940ms.
The capped cstr length scan was a byte-by-byte loop. Switch to memchr, which is
SIMD-accelerated in the CRT. Speeds up every capped-cstr scan in the codebase;
notably cv_name_from_symbol (CodeView symbol-name scan during GSI build).

cv_name_from_symbol (main thread): 1098ms -> 201ms.
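
For illustration, a capped scan built on memchr might look like the following. `cstr_len_capped` is a hypothetical name, not the helper in the codebase:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of a C string stored in at most `cap` bytes.
   Returns `cap` when no NUL terminator is found within the cap. */
static size_t cstr_len_capped(const char *s, size_t cap) {
  assert(s != NULL);
  const char *nul = memchr(s, 0, cap);  /* vectorized in most C runtimes */
  return nul ? (size_t)(nul - s) : cap;
}
```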

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnk_fixup_cv_type_indices did two open-addressing probes per type-index
reference: lnk_leaf_hash_table_search (leaf_ref -> canonical bucket), then
lnk_assigned_type_ht_search (canonical bucket -> assigned type index, via a
second hash table keyed by leaf-ref content). Both are cache-miss-bound and this
ran across every type-index reference in every obj.

Store the assigned type index directly on the leaf hash table: add a ti_arr
parallel to bucket_arr. lnk_assign_type_indices_task writes ti = min+i into the
leaf's bucket slot (each unique leaf owns a distinct slot, so worker writes
never collide), and the new lnk_leaf_hash_table_search_ti recovers it in one
probe. Removes the entire assigned_type_hts table and its build pass; deletes
the now-dead lnk_leaf_hash_table_search and lnk_assigned_type_ht_search.

Correctness: deduplicated leaves share the same ghash (debug_h value), so the
fixup query and the assign-time canonical bucket hash to the same slot.

lnk_fixup_cv_type_indices (main thread): 1445ms -> 390ms inclusive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Input obj/lib files are mapped copy-on-write (PAGE_WRITECOPY/FILE_MAP_COPY) so
the linker can patch them in place. Pages touched during linking become
private-dirty; at process exit the kernel reclaims them in single-threaded
address-space rundown -- ~3s of lingering process time after the last thread
exits for a large (Fortnite-scale) link.

After all outputs are written and inputs are no longer read (post image-write
join), unmap the whole-file CoW views in parallel on the thread pool. The same
reclaim work then runs multi-threaded, off the serial post-exit path:
measured ~34s of aggregate UnmapViewOfFile CPU collapsing to ~0.55s wall, and
the post-exit process tail dropping from ~3s to ~0.5s.

Only the is_thin whole-file views are swept (lib-member substrings and linkgen
arena data are skipped), and only in the copy-on-write (read-only) mapping mode
-- read-write-shared mapping would flush dirty pages back to the input files on
unmap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two micro-optimizations on hot parse helpers (profiled as the largest
aggregate-CPU functions in a Fortnite link):

- lnk_obj_section_from_sect_idx: split out a _no_name variant that skips the
  section-name string-table lookup (coff_name_from_section_header). The full
  variant now reuses it + adds the name. lnk_raw_directives_from_obj iterated
  every section of every obj building the full section struct just to test a
  flag, computing the name on ~all sections though only .drectve needs it -- now
  uses the no-name variant and resolves the name only inside the LnkInfo branch.

- lnk_parsed_symbol_from_coff_symbol_idx (+_no_name): return the coff_parse_*
  result directly instead of zero-initializing a local and assigning to it,
  removing a redundant ~48-byte COFF_ParsedSymbol copy + zero-init per call (RVO).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnk_leaf_hash_table_search_ti (~#2 radlink hotspot, ~48% of its self-time) spent
its time in lnk_match_leaf_ref, which is just a_hash==b_hash but fetches the
bucket's hash via lnk_hash_from_leaf_ref -> input->debug_h_arr[obj].v[leaf], a
scattered cache miss per probe step.

Add LNK_LeafHashTable.hash_arr (parallel to bucket_arr), populated at bucket
claim/update (both lnk_populate_leaf_ht and lnk_leaf_dedup_task) with the leaf's
debug_h hash. search_ti now matches via hash_arr[idx] == hash -- no deref. Exact
equivalent (match is pure hash compare); same value across same-hash updates.

Gated 65/65 linker torture (ghash_basic/match_debug_t, determ_test, p2r_determinism).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The image buffer was push'd on the shared link arena and only reclaimed in the
single-threaded process rundown at exit -- a multi-second kernel page-reclaim tail
(observed: one thread 100% in-kernel, zero user frames).

Allocate it as a standalone reserve_memory/commit_memory region and
release_memory() it the instant the background image-write thread joins (image is
on disk, no later reader). VirtualFree(MEM_RELEASE) returns fast; the kernel zeroes
the ~1GB on its background thread, overlapping the parallel input-view release +
exit instead of blocking rundown. Discard early so the kernel cleans up while the
app still runs -- don't defer to exit.

Gated 65/65 linker torture (determ_test + p2r_determinism: image correct).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hotspot)

Parse each COFF symbol once in lnk_obj_initer into LNK_Obj.parsed_symbols;
lnk_parsed_symbol_from_coff_symbol_idx[_no_name] becomes an array index instead of
re-decoding the mmapped symbol table on every access (it was the #1 hotspot,
lnk_parsed_symbol_from_coff_symbol_idx). All symbol-value patch sites (weak-replace,
COMDAT-leader, regular/common fixups) write obj->parsed_symbols[idx], decoupling
symbol values from the input mapping.

Extracted from the entangled WIP commit 8dc5fa6 (parsed-symbol memo + .rgd staging);
this is the memo half only -- no .rgd. Gated separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…im ~3GB peak)

The CV type-index fixup (05af760) had stored the assigned type index in arrays parallel
to the dedup hash table (leaf_ht), whose cap is the TOTAL pre-dedup leaf count summed
over all objs. On large links that ti_arr (+ the hash_arr added for deref-free probing)
added ~3GB to peak working set vs the prior unique-sized assigned_type_ht.

Split the two concerns:
 - LNK_LeafHashTable: just {cap, bucket_arr} for dedup (total-sized, as before / unavoidable).
 - LNK_AssignedTiHash {cap, ti_arr, hash_arr}: hash -> assigned ti, sized to the UNIQUE
   (post-dedup) leaf count. Built in lnk_assign_type_indices_task by hashing each unique
   leaf into its own slot (atomic claim; unique leaves have distinct hashes since dedup is
   by hash). search_ti probes it in one deref-free pass (occupant hash stored on the slot),
   exactly as before -- just on a table sized by unique instead of total.

Keeps the single-probe fixup speed of 05af760; removes its peak-memory regression.
ti==0 marks empty (assigned ti is always >= CV_MinComplexTypeIndex). Builds clean,
95 linker torture PASS, 0 fail.
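
A single-threaded sketch of a hash -> assigned-ti table of this shape, with the occupant hash stored on the slot and ti == 0 as the empty marker. The `TiHash` type and function names are hypothetical, the probe is plain linear probing, and the atomic slot claim used by the parallel build is left out:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
  uint64_t  cap;       /* slot count, derived from the unique leaf count */
  uint32_t *ti_arr;    /* assigned type index; 0 marks an empty slot     */
  uint64_t *hash_arr;  /* occupant hash, stored on the slot              */
} TiHash;

static TiHash ti_hash_alloc(uint64_t unique_count) {
  TiHash ht;
  ht.cap      = unique_count * 2 + 1;  /* headroom: probes always reach an empty slot */
  ht.ti_arr   = calloc(ht.cap, sizeof *ht.ti_arr);
  ht.hash_arr = calloc(ht.cap, sizeof *ht.hash_arr);
  assert(ht.ti_arr != NULL && ht.hash_arr != NULL);
  return ht;
}

/* Insert a unique hash; callers insert at most unique_count entries. */
static void ti_hash_insert(TiHash *ht, uint64_t hash, uint32_t ti) {
  assert(ti != 0);
  uint64_t i = hash % ht->cap;
  while (ht->ti_arr[i] != 0) i = (i + 1) % ht->cap;
  ht->hash_arr[i] = hash;
  ht->ti_arr[i]   = ti;
}

/* One probe sequence, comparing against the hash stored on each slot. */
static uint32_t ti_hash_search(const TiHash *ht, uint64_t hash) {
  for (uint64_t i = hash % ht->cap; ht->ti_arr[i] != 0; i = (i + 1) % ht->cap) {
    if (ht->hash_arr[i] == hash) return ht->ti_arr[i];
  }
  return 0;  /* not found */
}
```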

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
g_input_type_map mapped o/obj/lib/rlib/res/rrt but not .a, so clang/meson-built
GNU ar archives (e.g. ThirdParty libdav1d.a) hit Error(002) 'unknown file format'.
rlib (also GNU ar) already routes to LNK_Input_Lib and parses, so map .a the same.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
honkstar1 and others added 14 commits June 19, 2026 16:59
LNK_Obj.parsed_symbols memoized the full COFF_ParsedSymbol per symbol -- including the
16B String8 name -- sized by total symbol count, held to exit. Store a slim
LNK_ParsedSymbolLite (every field except name, ~24B vs ~40B) and re-decode the name from
the read-only symbol record in the named accessor only. The hot _no_name path
(can_replace/GC/resolution) and all symbol-value patching never touch the name, so they
stay fully memoized; only the named/push path pays a re-decode (cold relative to total).

FN no-rrt full link: peak commit 50.5GB -> 49.1GB (-1.4GB), wall flat (~8-9s no-debug),
valid 5.68GB PDB, 95/0 linker torture.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Store the COFF symbol record as a U32 byte-offset into obj->data instead of an 8B
pointer, and value as U32 (COFF symbol value is U32). Struct goes 24B -> 16B (no
padding): raw_symbol_off(4) value(4) section_number(4) type(2) storage_class(1) aux(1).
The named/no_name accessors reconstruct the pointer as obj->data.str + off (obj->data ==
input->data, so the offset is stable). Sized by total symbol count -> 8B/sym off peak.

FN no-rrt full link: peak commit 49.1GB -> 48.4GB (-0.67GB), wall flat, valid 5.68GB
PDB, 95/0 linker torture.
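
As a sketch of the 16-byte layout listed above (field order and widths taken from this message; the type and accessor names here are hypothetical), a compile-time size check keeps the packing from regressing:

```c
#include <assert.h>
#include <stdint.h>

/* Field order follows the commit message; widest fields first, so no padding. */
typedef struct {
  uint32_t raw_symbol_off;  /* byte offset of the COFF symbol record in obj data */
  uint32_t value;
  uint32_t section_number;
  uint16_t type;
  uint8_t  storage_class;
  uint8_t  aux;
} ParsedSymbolLite;

static_assert(sizeof(ParsedSymbolLite) == 16, "ParsedSymbolLite must stay 16 bytes");

/* Rebuild the record pointer from the stored offset. */
static const uint8_t *raw_symbol_ptr(const uint8_t *obj_data, const ParsedSymbolLite *s) {
  return obj_data + s->raw_symbol_off;
}
```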

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After type merging, prune any merged TPI/IPI record not transitively reachable
from a surviving symbol. Roots are the type indices referenced by the symbols
that survive /OPT:REF (plus inlinee call-site types); the type graph is then
closed over and everything unreached is dropped before the streams are written.
Runs only under /OPT:REF -- it is the debug-info analogue of dead-section
stripping, and is otherwise transparent (no visible type is removed).

Implementation notes:
  - parallel transitive closure (bulk-synchronous rounds, atomic mark/expand)
  - fwdref<->definition pairing via a per-unique-name ring so a live forward
    reference keeps its definition (and vice-versa)
  - compaction is in place with the remap kept in scratch, so peak memory is
    unchanged

Numbers (UnrealEditorFortnite-Engine.dll, /OPT:REF /OPT:ICF, hashing NONE):
  PDB  5315 -> 5081 MB   (type-GC alone: -234 MB)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…arallelized

Fold byte-identical COMDAT sections whose relocations point at equivalent
targets, iterated to a fixpoint, then redirect each group's followers at their
shared symbol-table node so every reference resolves to one leader and /OPT:REF
collects the now-unreferenced follower sections (and their associated
.pdata/.xdata/.debug$S). Mirrors link.exe /OPT:ICF.

  - equivalence: round-0 key from content + reloc structure + non-candidate
    target identity; refine by candidate targets' colors until the partition is
    stable; a final byte-compare + per-reloc target-color check guards folds
    (no hash-collision can produce a bad fold)
  - folds code AND read-only data (vtables, const tables, string literals);
    folding identical read-only data lets the functions that reference it fold
    too (cascade)
  - fully parallel: candidate collection (count -> exact alloc -> fill),
    content hashing + reloc-target resolution, refinement, and final grouping
    via a parallel LSD radix sort (8-bit digits). A flat open-addressing map
    with an avalanche-scrambled key avoids O(n^2) probing on UE-scale inputs.

Only externally-defined COMDATs are folded (the follower redirects through its
symbol). Static/internal-linkage folding is intentionally out of scope here.
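
The refinement idea can be sketched serially as below. This is an illustration only: `Candidate`, `same_signature`, and `icf_partition` are hypothetical names, classes are found with a quadratic scan instead of a radix sort and hash map, every relocation target is assumed to be another candidate, and the final byte-compare guard is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_RELOCS 4

typedef struct {
  uint64_t content;             /* stand-in for a hash of the section bytes */
  int      reloc_count;
  int      target[MAX_RELOCS];  /* candidate index each relocation points at */
} Candidate;

/* Same current color, and every relocation points at same-colored targets. */
static int same_signature(const Candidate *c, const int *color, int a, int b) {
  if (color[a] != color[b] || c[a].reloc_count != c[b].reloc_count) return 0;
  for (int r = 0; r < c[a].reloc_count; r++) {
    if (color[c[a].target[r]] != color[c[b].target[r]]) return 0;
  }
  return 1;
}

/* Writes a class id per candidate into color[]; returns the class count. */
static int icf_partition(const Candidate *c, int n, int *color) {
  int *next = malloc((size_t)(n > 0 ? n : 1) * sizeof *next);
  assert(next != NULL);
  int count = 0;
  for (int i = 0; i < n; i++) {  /* round 0: content + reloc structure */
    color[i] = -1;
    for (int j = 0; j < i && color[i] < 0; j++) {
      if (c[j].content == c[i].content && c[j].reloc_count == c[i].reloc_count) color[i] = color[j];
    }
    if (color[i] < 0) color[i] = count++;
  }
  for (;;) {  /* refine by target colors until the partition stops splitting */
    int next_count = 0;
    for (int i = 0; i < n; i++) {
      next[i] = -1;
      for (int j = 0; j < i && next[i] < 0; j++) {
        if (same_signature(c, color, i, j)) next[i] = next[j];
      }
      if (next[i] < 0) next[i] = next_count++;
    }
    memcpy(color, next, (size_t)n * sizeof *next);
    if (next_count == count) break;
    count = next_count;
  }
  free(next);
  return count;
}
```

Because each round's signature includes the candidate's own color, classes can only split, so an unchanged class count means the partition is stable.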

Numbers (UnrealEditorFortnite-Engine.dll, vs /OPT:NOICF, hashing NONE):
  .text  727 -> 643 MiB   (-84)
  .rdata 218 -> 194 MiB   (-24)
  PDB   5562 -> 5081 MB   (-481)
  DLL    999 -> 882 MB    (-117)
  link    33 -> 20 s      (-13; less to relocate and emit downstream)

This commit also adds the shared parallel radix-sort helper
(lnk_radix_sort_u64_pairs) used here and by the PDB GSI/PSI sort.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ize GSI/PSI sort

DBI section contributions:
  - after sorting, merge contiguous contributions that share (section, module,
    flags), absorbing the alignment-padding gaps between them. On UE-scale
    input this collapses ~12.5M contribution records to ~2.0M and shrinks the
    DBI stream 367 -> 72 MB, with no change to the address map.

GSI/PSI publics sort:
  - the comparators got element-stable tiebreakers (sort by record offset /
    dereferenced symbol identity, not by slot pointer) so the median-of-9
    quicksort cannot degrade to O(n^2) on the large runs of equal-address /
    equal-name records that ICF now produces.
  - gsi_record_sort_by_sc returns a radix-sorted permutation index (via the
    shared lnk_radix_sort_u64_pairs) and the PSI address map is built from it,
    replacing a comparator sort that stalled multiple seconds on ICF-heavy links.

Numbers (UnrealEditorFortnite-Engine.dll):
  DBI stream  367 -> 72 MB
  removes a multi-second GSI/PSI sort stall on ICF-folded inputs
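
A minimal sketch of the contribution coalescing, assuming the input is already sorted by (section, offset). The `Contrib` struct and `coalesce_contribs` are hypothetical, not the PDB writer's actual types:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
  uint16_t sec;
  uint16_t mod;
  uint32_t flags;
  uint32_t off;
  uint32_t size;
} Contrib;

/* Merge neighbours sharing (sec, mod, flags) in place; returns the new count.
   Input must be sorted by (sec, off) with no overlaps. */
static size_t coalesce_contribs(Contrib *c, size_t n) {
  if (n == 0) return 0;
  size_t out = 0;
  for (size_t i = 1; i < n; i++) {
    Contrib *last = &c[out];
    assert(c[i].sec != last->sec || c[i].off >= last->off + last->size);
    if (c[i].sec == last->sec && c[i].mod == last->mod && c[i].flags == last->flags) {
      /* Extend the previous record over the padding gap and this record. */
      last->size = c[i].off + c[i].size - last->off;
    } else {
      c[++out] = c[i];
    }
  }
  return out + 1;
}
```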

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The closure re-scanned every merged type each round (O(rounds * total types)) to
find marked-but-unexpanded leaves. In a full-link trace of UnrealEditorFortnite
that round-rescan dominated the type-GC: lnk_gc_expand_task ~13.3 s of CPU.

Replace it with a frontier worklist: the atomic mark now gates a single append per
leaf, and each round expands only the slice newly marked by the previous round, so
total work is O(reachable types) instead of O(rounds * total types). Drops the
per-round `expanded` bitmap and full-array sweeps.

Output is unchanged -- same reachable set, PDB byte-identical (5081 MB on the same
input), and debugger-fidelity checks (addr->symbol 100%, core types resolve) match
the pre-change build.
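
A serial sketch of the frontier-worklist closure, assuming the type graph is available in CSR form. `TypeGraph` and `gc_mark_reachable` are hypothetical names, and the real pass runs the rounds in parallel with an atomic mark:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Edges of type t are edge[first[t]] .. edge[first[t + 1] - 1]. */
typedef struct {
  uint32_t        count;
  const uint32_t *first;  /* count + 1 entries */
  const uint32_t *edge;
} TypeGraph;

/* Marks every type reachable from the roots; returns how many were reached.
   `marked` must hold `count` zeroed bytes on entry. */
static uint32_t gc_mark_reachable(const TypeGraph *g, const uint32_t *roots,
                                  uint32_t root_count, uint8_t *marked) {
  uint32_t *work = malloc((g->count ? g->count : 1) * sizeof *work);
  assert(work != NULL);
  uint32_t end = 0;
  for (uint32_t i = 0; i < root_count; i++) {
    if (!marked[roots[i]]) { marked[roots[i]] = 1; work[end++] = roots[i]; }
  }
  /* Each round expands only the slice appended by the previous round. */
  uint32_t begin = 0;
  while (begin < end) {
    uint32_t round_end = end;
    for (; begin < round_end; begin++) {
      uint32_t t = work[begin];
      for (uint32_t e = g->first[t]; e < g->first[t + 1]; e++) {
        uint32_t dst = g->edge[e];
        /* The mark gates a single append, so each type is expanded once. */
        if (!marked[dst]) { marked[dst] = 1; work[end++] = dst; }
      }
    }
  }
  free(work);
  return end;
}
```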

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-parse

lnk_opt_icf re-parsed every candidate section serially
(lnk_coff_relocs_from_section_header per candidate) just to size the flattened reloc-target arrays.
Move the per-candidate reloc count into lnk_icf_fill_task -- which is already
parallel and has the section in hand -- so lnk_opt_icf only does a cheap serial
prefix sum for reloc_first. No output change (PDB/.text/.pdata identical).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The frontier mark did an interlocked op on every reference edge. Add a plain
non-atomic check first (the mark bit only ever goes 0->1, so a stale "already
set" read is safe), so the interlocked op runs once per leaf at its 0->1
transition instead of once per edge. Output identical (PDB 5081 MB).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Type-GC prunes CodeView types not referenced by any surviving symbol. That is a
PDB-size win, but it removes types a debugger can still legitimately cast to in the
watch window (the reachable-from-symbols set is a subset of the castable-type set) --
which is why the same approach was previously reverted, after users reported losing the
ability to cast in the watch window. So gate it behind /OPT:GCTYPES, default OFF; only opt in
when the smaller PDB is worth the reduced castable-type set.

Numbers (UnrealEditorFortnite-Engine.dll): default PDB 5315 MB; with /OPT:GCTYPES
5081 MB (-234). LINK ok, no RelocationAgainstRemovedSection either way.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two data races in the parallel library search made radlink output
nonreproducible: relinking the same inputs produced ~2.1M differing
bytes (every reloc against an import could shift).

Both originate in the generated DLL import objs (".idata"), whose
symbol values (IAT slots, jump thunks) are laid out in the order the
imports appear -- so any nondeterminism in the import set or order
propagates to every call site that references an imported function.

1. Import member order. The parallel lib search appends discovered
   import members to link->imports in worker/round completion order,
   which is nondeterministic. Sort them into a stable total order
   (by link_symbol, member_idx tie-break) before generating the
   import objs.

2. Misindexed dedup flag. When a second reference to an already-queued
   import is found, lnk_queue_lib_member OR'd LinkedRegular/LinkedImp
   into import_member_infos[member_idx] -- but member_idx indexes the
   *currently searched* lib, not the import's lib. It must be
   is_queued_import->member_idx. The wrong (race-determined) slot got
   flagged, so whether an import emitted a jump thunk varied run to
   run, changing the import obj's symbol count.

After both fixes, relinking UnrealEditorFortnite-Engine.dll is
byte-identical except for the 20 bytes of intentional PE timestamp
and PDB GUID/age (verified: 2,110,092 -> 20 differing bytes; output
size unchanged).
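
A sketch of the stable total order from fix 1, assuming link_symbol can be compared as a NUL-terminated name. `ImportMember`, `import_member_compare`, and `sort_imports` are hypothetical; the real key type may differ:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  const char *link_symbol;  /* assumed: symbol name as a C string */
  uint32_t    member_idx;
} ImportMember;

/* Total order: name first, member_idx breaks ties. */
static int import_member_compare(const void *pa, const void *pb) {
  const ImportMember *a = pa;
  const ImportMember *b = pb;
  int c = strcmp(a->link_symbol, b->link_symbol);
  if (c != 0) return c;
  return (a->member_idx > b->member_idx) - (a->member_idx < b->member_idx);
}

/* Discovery order no longer matters once the array is sorted. */
static void sort_imports(ImportMember *m, size_t n) {
  assert(m != NULL || n == 0);
  if (n > 1) qsort(m, n, sizeof *m, import_member_compare);
}
```

Because the comparator never reports two distinct (name, member_idx) pairs as equal, the result does not depend on qsort being stable.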

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnk_sort_contribs_task ran one serial radsort per chunk. A section's contribs live
in a single chunk sized to the whole section, so the merged .text chunk (millions of
entries) was sorted serially on one worker while every other thread idled -- the
straggler that stretched the "Sort Section Contribs" phase.

Sort chunks >= 64K entries with the parallel radix sort (key = Compose64Bit(obj_idx,
obj_sect_idx), which is unique per contrib so the order matches the comparator) using
all threads, before the per-chunk task pass handles the small remainder. Output is
unchanged: section sizes identical, byte diff within the pre-existing relink noise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ICF refinement re-densified all ~N candidates every round, radix-sorting
the full set each iteration to a fixpoint. But a candidate alone in its
equivalence class can never split or merge again -- its color is final.

Track an active set of only the candidates still sharing a class with
another, and re-densify just that set each round (ids drawn from an
ever-increasing base so they never collide with the colors already
finalized for singletons). The per-round sort shrinks from all candidates
to those that still have a content+reloc twin, and converged classes drop
out as they fragment into singletons.

Output is unchanged -- relinking UnrealEditorFortnite-Engine.dll is
byte-identical to the prior ICF (same folds, same size) and reproducible
across runs. On that link the refinement loop drops from ~888ms to ~765ms
(first round prunes ~2.1M of 3.98M candidates to singletons immediately).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ved symbols

lnk_search_lib_task scans the entire search-chunk symbol set once per library
(2733 dispatches on the UE editor link). A symbol that started Undefined/Weak
stays in search_chunks even after a definition resolves it, so every later
library pass re-parsed it -- lnk_ref_from_symbol +
lnk_parsed_symbol_from_coff_symbol_idx -- just to recompute its interp and skip it. That parse faults the
COFF symbol record out of the mmap'd obj, and the profile showed those two
lines at ~65% of the task and a matching wall of page-fault kernel time.

The interp is already computed once in lnk_symbol_table_push_; cache it on
LNK_Symbol and read it in the hot loop. Only genuinely Weak symbols still parse
(for the weak-extension characteristics). The hash-trie node always points at
the current leader symbol, so the cached interp reflects the resolved state.

Output unchanged (byte-identical DLL+PDB, reproducible). Lib-search dispatch
wall drops ~18% (~1.5s -> ~1.2s on the UE editor link) with a larger drop in
aggregate CPU and page-fault traffic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnk_link_inputs resolves libraries to a fixpoint: an outer pass loops over
every library, and each library is re-searched (a full tp_for_parallel over
all workers, scanning every undefined/weak symbol in search_chunks) once per
drained input batch until nothing new resolves. On the UE editor link that is
~2733 dispatches, each waking and joining ~60 workers -- and the phase is
barrier-bound, so that wake/join is the cost, not the scan.

Most re-searches are redundant: search_chunks only grows during the loop
(symbols are never removed until the end) and member-queue dedup is idempotent,
so a re-search can only queue new members if the undefined/weak symbol set grew
or anti-dep searching was just enabled since this library was last searched.
Stamp each LNK_Lib with the search_chunks symbol count + anti-dep mode at its
last search and skip the dispatch when neither changed.

~24% fewer dispatches (2733 -> ~2089) and ~0.2s of wake/join wall-time removed.
Output is byte-identical and reproducible (which dispatch coalesces is timing
dependent, but a skipped one provably queued nothing, so the result is
unchanged -- verified relink-twice byte-identical across many runs).
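
The per-library stamp check might be sketched as follows. `LibSearchStamp` and `lib_needs_search` are hypothetical names; the real state lives on LNK_Lib:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* What the search inputs looked like the last time this library was searched. */
typedef struct {
  bool     searched;
  uint64_t symbol_count;  /* search_chunks symbol count at last search */
  bool     anti_dep;      /* anti-dep mode at last search */
} LibSearchStamp;

/* Returns true (and updates the stamp) when a dispatch could queue new members. */
static bool lib_needs_search(LibSearchStamp *stamp, uint64_t symbol_count, bool anti_dep) {
  /* The symbol set only grows during the loop, so the count never decreases. */
  assert(!stamp->searched || symbol_count >= stamp->symbol_count);
  if (stamp->searched && stamp->symbol_count == symbol_count && stamp->anti_dep == anti_dep) {
    return false;  /* nothing changed since the last search: skip the dispatch */
  }
  stamp->searched     = true;
  stamp->symbol_count = symbol_count;
  stamp->anti_dep     = anti_dep;
  return true;
}
```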

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@honkstar1 honkstar1 force-pushed the perf/radlink-link-time-plus branch from 1bb78ed to 9223d30 Compare June 20, 2026 00:00
@honkstar1 honkstar1 changed the title radlink: link-time, memory, and output-size optimizations radlink: reproducibility, link-time, memory, and output-size pass Jun 20, 2026
@honkstar1

Author

Closing in favor of #842. This was an older umbrella branch; every commit here (COFF-symbol memoization, cached symbol interp, redundant lib re-search skip, type-GC + /OPT:GCTYPES, /OPT:ICF, import-determinism, GNU-ar input, section-contrib coalesce/sort, etc.) is absorbed or superseded by the refreshed #842, which is rebased on latest dev with already-taken commits dropped and kept as clean per-feature commits for cherry-picking. Continue review there.

@honkstar1 honkstar1 closed this Jun 25, 2026