radlink: perf + feature series — ICF/ICFSTATIC, GCTYPES, header-units, link-time + memory (rebased on dev, taken commits dropped)#842
Draft
honkstar1 wants to merge 40 commits into
8b67cdb to d70c7b1
get_cpu_features was the top main-thread hot spot (~5.6s for one Fortnite link),
97% of it inside a single ATOMIC_LOAD(g_cpu_features). On MSVC,
blake3_dispatch.c defines ATOMIC_LOAD as _InterlockedOr(&x,0) -- a lock'd RMW
(full barrier) run on every BLAKE3 compress dispatch. The value is written once
and read-only after, so the barrier is pointless.
Enable BLAKE3's plain-load path (C11 _Atomic, a plain mov on x86) via build
flags only, leaving the vendored third_party/blake3 source untouched:
/std:c11 /experimental:c11atomics -DBLAKE3_ATOMICS=1
Scoped to the radlink target. MSVC C11 atomics need both /std:c11 and
/experimental:c11atomics.
get_cpu_features: 5591ms -> 4ms (main thread).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
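To make the difference concrete, here is a minimal sketch of the two read styles for a write-once value, written with portable C11 atomics rather than the MSVC intrinsic. The names g_features, detect_features, load_rmw and load_relaxed are hypothetical and are not taken from BLAKE3 or radlink.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical write-once feature mask; 0 means "not detected yet". */
static _Atomic uint32_t g_features;

/* Stand-in for real CPUID probing: always returns the same nonzero mask. */
static uint32_t detect_features(void) { return 0x2Au; }

/* Barrier-style read: OR-ing with 0 is a read-modify-write, so on x86 it
   compiles to a lock-prefixed instruction even though nothing changes. */
static uint32_t load_rmw(void) { return atomic_fetch_or(&g_features, 0u); }

/* Relaxed read: no ordering is requested, so x86 can use a plain load. */
static uint32_t load_relaxed(void) {
  return atomic_load_explicit(&g_features, memory_order_relaxed);
}

static uint32_t get_features(void) {
  uint32_t f = load_relaxed();
  if (f == 0) {
    f = detect_features();
    assert(f != 0); /* 0 is reserved for "unknown" in this sketch */
    /* Racing writers all store the same value, so relaxed is enough here. */
    atomic_store_explicit(&g_features, f, memory_order_relaxed);
  }
  return f;
}
```

Both loads return the same value; only the cost per call differs, which is why the change can be made with build flags alone.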
coff_read_symbol_name scans a cstr in the memory-mapped string table -- the
dominant, page-fault-bound cost of bulk symbol parsing. Many hot callers parse
a full symbol but only read scalar fields (value/section/storage_class/aux) to
interpret the symbol value; the name is never used.
Add name-skipping parse variants and route the interp-only paths through them:
coff_parse_symbol{16,32}_no_name (coff_parse.c) -- and the full variants now
call these + add the name, so the scalar logic lives in one place
lnk_parsed_symbol_from_coff_symbol_idx_no_name (lnk_obj.c)
lnk_interp_from_symbol / lnk_can_replace_symbol / lnk_on_symbol_replace
(lnk_symbol_table.c) and the lnk_search_lib_task loop (lnk.c)
Where the name is still needed (lnk_search_lib) it uses the already-cached
LNK_Symbol.name instead of re-parsing. lnk_can_replace_symbol previously parsed
dst/src twice (full parse + a second parse for interp); collapsed to one
no-name parse each.
coff_parse_symbol32 on the main thread: 3922ms -> ~810ms (name-needed callers).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
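A minimal sketch of the scalar/name split described above, assuming the classic 18-byte COFF symbol record and a little-endian host. ParsedSym, parse_sym_no_name and parse_sym are illustrative names, not the radlink functions, and the name handling is simplified to at most 8 characters.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
  uint32_t value;
  int16_t  section_number;
  uint16_t type;
  uint8_t  storage_class;
  uint8_t  aux_count;
  char     name[9]; /* filled only by the full variant */
} ParsedSym;

/* Scalar fields only: reads bytes 8..17 and never touches the string table. */
static ParsedSym parse_sym_no_name(const uint8_t *rec) {
  ParsedSym s;
  memset(&s, 0, sizeof s);
  memcpy(&s.value, rec + 8, 4);
  memcpy(&s.section_number, rec + 12, 2);
  memcpy(&s.type, rec + 14, 2);
  s.storage_class = rec[16];
  s.aux_count = rec[17];
  return s;
}

/* Full variant: the scalar parse plus name resolution, so the scalar logic
   lives in one place. */
static ParsedSym parse_sym(const uint8_t *rec, const char *strtab) {
  ParsedSym s = parse_sym_no_name(rec);
  uint32_t zeroes, offset;
  memcpy(&zeroes, rec, 4);
  memcpy(&offset, rec + 4, 4);
  if (zeroes != 0) {
    memcpy(s.name, rec, 8);              /* inline short name */
  } else {
    assert(strtab != NULL);
    strncpy(s.name, strtab + offset, 8); /* sketch: long names are truncated */
  }
  s.name[8] = '\0';
  return s;
}
```

Callers that only interpret the symbol value call the no-name variant and skip the string-table scan entirely.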
lnk_fixup_cv_type_indices did two open-addressing probes per type-index reference: lnk_leaf_hash_table_search (leaf_ref -> canonical bucket), then lnk_assigned_type_ht_search (canonical bucket -> assigned type index, via a second hash table keyed by leaf-ref content). Both are cache-miss-bound and this ran across every type-index reference in every obj. Store the assigned type index directly on the leaf hash table: add a ti_arr parallel to bucket_arr. lnk_assign_type_indices_task writes ti = min+i into the leaf's bucket slot (each unique leaf owns a distinct slot, so worker writes never collide), and the new lnk_leaf_hash_table_search_ti recovers it in one probe. Removes the entire assigned_type_hts table and its build pass; deletes the now-dead lnk_leaf_hash_table_search and lnk_assigned_type_ht_search. Correctness: deduplicated leaves share the same ghash (debug_h value), so the fixup query and the assign-time canonical bucket hash to the same slot. lnk_fixup_cv_type_indices (main thread): 1445ms -> 390ms inclusive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
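The single-probe idea can be sketched with a small open-addressing table that keeps the assigned type index in an array parallel to the slots. This is an illustration under simplifying assumptions (fixed capacity, hash value 0 reserved for "empty", the hash stored directly on the slot, table never full); LeafTable, probe, insert_leaf, assign_ti and search_ti are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CAP = 16 }; /* hypothetical fixed capacity, power of two */

typedef struct {
  uint64_t hash_arr[CAP]; /* 0 = empty slot in this sketch */
  uint32_t ti_arr[CAP];   /* assigned type index, parallel to hash_arr */
} LeafTable;

/* Linear probe: returns the slot holding `hash`, or the first empty slot.
   Assumes the table is never full. */
static size_t probe(const LeafTable *t, uint64_t hash) {
  assert(hash != 0);
  size_t i = (size_t)(hash & (CAP - 1));
  while (t->hash_arr[i] != 0 && t->hash_arr[i] != hash) i = (i + 1) & (CAP - 1);
  return i;
}

static void insert_leaf(LeafTable *t, uint64_t hash) {
  t->hash_arr[probe(t, hash)] = hash;
}

/* Each unique leaf owns one slot, so writing its index there cannot collide. */
static void assign_ti(LeafTable *t, uint64_t hash, uint32_t ti) {
  size_t i = probe(t, hash);
  assert(t->hash_arr[i] == hash);
  t->ti_arr[i] = ti;
}

/* One probe sequence recovers the assigned index; 0 means "not found". */
static uint32_t search_ti(const LeafTable *t, uint64_t hash) {
  size_t i = probe(t, hash);
  return t->hash_arr[i] == hash ? t->ti_arr[i] : 0;
}
```

The fixup path then needs only search_ti, instead of one lookup to find the canonical bucket and a second lookup in a separate table.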
Input obj/lib files are mapped copy-on-write (PAGE_WRITECOPY/FILE_MAP_COPY) so the linker can patch them in place. Pages touched during linking become private-dirty; at process exit the kernel reclaims them in single-threaded address-space rundown -- ~3s of lingering process time after the last thread exits for a large (Fortnite-scale) link. After all outputs are written and inputs are no longer read (post image-write join), unmap the whole-file CoW views in parallel on the thread pool. The same reclaim work then runs multi-threaded, off the serial post-exit path: measured ~34s of aggregate UnmapViewOfFile CPU collapsing to ~0.55s wall, and the post-exit process tail dropping from ~3s to ~0.5s. Only the is_thin whole-file views are swept (lib-member substrings and linkgen arena data are skipped), and only in the copy-on-write (read-only) mapping mode -- read-write-shared mapping would flush dirty pages back to the input files on unmap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two micro-optimizations on hot parse helpers (profiled as the largest
aggregate-CPU functions in a Fortnite link):
- lnk_obj_section_from_sect_idx: split out a _no_name variant that skips the
section-name string-table lookup (coff_name_from_section_header). The full
variant now reuses it + adds the name. lnk_raw_directives_from_obj iterated
every section of every obj building the full section struct just to test a
flag, computing the name on ~all sections though only .drectve needs it --
now uses the no-name variant and resolves the name only inside the LnkInfo
branch.
- lnk_parsed_symbol_from_coff_symbol_idx (+_no_name): return the coff_parse_*
result directly instead of zero-initializing a local and assigning to it,
removing a redundant ~48-byte COFF_ParsedSymbol copy + zero-init per call
(RVO).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnk_leaf_hash_table_search_ti (roughly the #2 radlink hotspot, ~48% of its
self-time) spent its time in lnk_match_leaf_ref, which is just a_hash==b_hash
but fetches the bucket's hash via lnk_hash_from_leaf_ref ->
input->debug_h_arr[obj].v[leaf], a scattered cache miss per probe step.
Add LNK_LeafHashTable.hash_arr (parallel to bucket_arr), populated at bucket
claim/update (both lnk_populate_leaf_ht and lnk_leaf_dedup_task) with the
leaf's debug_h hash. search_ti now matches via hash_arr[idx] == hash -- no
deref. Exactly equivalent (the match is a pure hash compare); same value across
same-hash updates.
Gated 65/65 linker torture (ghash_basic/match_debug_t, determ_test,
p2r_determinism).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The image buffer was push'd on the shared link arena and only reclaimed in the single-threaded process rundown at exit -- a multi-second kernel page-reclaim tail (observed: one thread 100% in-kernel, zero user frames). Allocate it as a standalone reserve_memory/commit_memory region and release_memory() it the instant the background image-write thread joins (image is on disk, no later reader). VirtualFree(MEM_RELEASE) returns fast; the kernel zeroes the ~1GB on its background thread, overlapping the parallel input-view release + exit instead of blocking rundown. Discard early so the kernel cleans up while the app still runs -- don't defer to exit. Gated 65/65 linker torture (determ_test + p2r_determinism: image correct). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hotspot)
Parse each COFF symbol once in lnk_obj_initer into LNK_Obj.parsed_symbols;
lnk_parsed_symbol_from_coff_symbol_idx[_no_name] becomes an array index instead
of re-decoding the mmapped symbol table on every access (it was the #1 hotspot,
lnk_parsed_symbol_from_coff_symbol_idx). All symbol-value patch sites
(weak-replace, COMDAT-leader, regular/common fixups) write
obj->parsed_symbols[idx], decoupling symbol values from the input mapping.
Extracted from the entangled WIP commit 8dc5fa6 (parsed-symbol memo + .rgd
staging); this is the memo half only -- no .rgd. Gated separately.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…im ~3GB peak)
The CV type-index fixup (05af760) had stored the assigned type index in arrays
parallel to the dedup hash table (leaf_ht), whose cap is the TOTAL pre-dedup
leaf count summed over all objs. On large links that ti_arr (+ the hash_arr
added for deref-free probing) added ~3GB to peak working set vs the prior
unique-sized assigned_type_ht.
Split the two concerns:
- LNK_LeafHashTable: just {cap, bucket_arr} for dedup (total-sized, as before /
unavoidable).
- LNK_AssignedTiHash {cap, ti_arr, hash_arr}: hash -> assigned ti, sized to the
UNIQUE (post-dedup) leaf count. Built in lnk_assign_type_indices_task by
hashing each unique leaf into its own slot (atomic claim; unique leaves have
distinct hashes since dedup is by hash).
search_ti probes it in one deref-free pass (occupant hash stored on the slot),
exactly as before -- just on a table sized by unique instead of total. Keeps
the single-probe fixup speed of 05af760; removes its peak-memory regression.
ti==0 marks empty (assigned ti is always >= CV_MinComplexTypeIndex).
Builds clean, 95 linker torture PASS, 0 fail.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
LNK_Obj.parsed_symbols memoized the full COFF_ParsedSymbol per symbol -- including the 16B String8 name -- sized by total symbol count, held to exit. Store a slim LNK_ParsedSymbolLite (every field except name, ~24B vs ~40B) and re-decode the name from the read-only symbol record in the named accessor only. The hot _no_name path (can_replace/GC/resolution) and all symbol-value patching never touch the name, so they stay fully memoized; only the named/push path pays a re-decode (cold relative to total). FN no-rrt full link: peak commit 50.5GB -> 49.1GB (-1.4GB), wall flat (~8-9s no-debug), valid 5.68GB PDB, 95/0 linker torture. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Store the COFF symbol record as a U32 byte-offset into obj->data instead of an 8B pointer, and value as U32 (COFF symbol value is U32). Struct goes 24B -> 16B (no padding): raw_symbol_off(4) value(4) section_number(4) type(2) storage_class(1) aux(1). The named/no_name accessors reconstruct the pointer as obj->data.str + off (obj->data == input->data, so the offset is stable). Sized by total symbol count -> 8B/sym off peak. FN no-rrt full link: peak commit 49.1GB -> 48.4GB (-0.67GB), wall flat, valid 5.68GB PDB, 95/0 linker torture. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
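The 16-byte layout listed above can be checked at compile time. The sketch below uses the field order from the commit message; the struct name SymLite and the accessor sym_record are illustrative stand-ins, not the radlink definitions.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
  uint32_t raw_symbol_off; /* byte offset of the COFF record in the obj data */
  uint32_t value;
  uint32_t section_number;
  uint16_t type;
  uint8_t  storage_class;
  uint8_t  aux;
} SymLite;

/* 4 + 4 + 4 + 2 + 1 + 1 = 16, and the field order leaves no padding. */
_Static_assert(sizeof(SymLite) == 16, "SymLite should pack to 16 bytes");

/* Rebuild the record pointer from the stable base plus the stored offset. */
static const uint8_t *sym_record(const uint8_t *obj_data, const SymLite *s) {
  assert(obj_data != 0);
  return obj_data + s->raw_symbol_off;
}
```

Swapping the 8-byte pointer for a 4-byte offset is what removes the alignment padding; a U32 offset also limits each obj's data to 4 GiB.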
After type merging, prune any merged TPI/IPI record not transitively reachable
from a surviving symbol. Roots are the type indices referenced by the symbols
that survive /OPT:REF (plus inlinee call-site types); the type graph is then
closed over and everything unreached is dropped before the streams are written.
Runs only under /OPT:REF -- it is the debug-info analogue of dead-section
stripping, and is otherwise transparent (no visible type is removed).
Implementation notes:
- parallel transitive closure (bulk-synchronous rounds, atomic mark/expand)
- fwdref<->definition pairing via a per-unique-name ring so a live forward
reference keeps its definition (and vice-versa)
- compaction is in place with the remap kept in scratch, so peak memory is
unchanged
Numbers (UnrealEditorFortnite-Engine.dll, /OPT:REF /OPT:ICF, hashing NONE):
PDB 5315 -> 5081 MB (type-GC alone: -234 MB)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…arallelized
Fold byte-identical COMDAT sections whose relocations point at equivalent
targets, iterated to a fixpoint, then redirect each group's followers at their
shared symbol-table node so every reference resolves to one leader and /OPT:REF
collects the now-unreferenced follower sections (and their associated
.pdata/.xdata/.debug$S). Mirrors link.exe /OPT:ICF.
- equivalence: round-0 key from content + reloc structure + non-candidate
target identity; refine by candidate targets' colors until the partition is
stable; a final byte-compare + per-reloc target-color check guards folds
(no hash-collision can produce a bad fold)
- folds code AND read-only data (vtables, const tables, string literals);
folding identical read-only data lets the functions that reference it fold
too (cascade)
- fully parallel: candidate collection (count -> exact alloc -> fill),
content hashing + reloc-target resolution, refinement, and final grouping
via a parallel LSD radix sort (8-bit digits). A flat open-addressing map
with an avalanche-scrambled key avoids O(n^2) probing on UE-scale inputs.
Only externally-defined COMDATs are folded (the follower redirects through its
symbol). Static/internal-linkage folding is intentionally out of scope here.
Numbers (UnrealEditorFortnite-Engine.dll, vs /OPT:NOICF, hashing NONE):
.text 727 -> 643 MiB (-84)
.rdata 218 -> 194 MiB (-24)
PDB 5562 -> 5081 MB (-481)
DLL 999 -> 882 MB (-117)
link 33 -> 20 s (-13; less to relocate and emit downstream)
This commit also adds the shared parallel radix-sort helper
(lnk_radix_sort_u64_pairs) used here and by the PDB GSI/PSI sort.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
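The refinement loop can be illustrated on a toy input. This is a serial, quadratic sketch under strong simplifications: each candidate has a content key and at most one reloc target (-1 for none), every target is itself a candidate, and a class is named by the index of its first member. The real pass is parallel, uses a radix sort, and guards folds with a final byte compare; same_sig and icf_refine are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_N = 64 }; /* sketch limit on candidate count */

/* Two candidates agree if their own colors match and their reloc targets
   (if any) currently have the same color. */
static int same_sig(const uint32_t *color, const int *target, int a, int b) {
  if (color[a] != color[b]) return 0;
  if ((target[a] < 0) != (target[b] < 0)) return 0;
  return target[a] < 0 || color[target[a]] == color[target[b]];
}

/* Refines color[] to a fixpoint and returns the number of classes. */
static int icf_refine(const uint32_t *content, const int *target, int n,
                      uint32_t *color) {
  assert(n > 0 && n <= MAX_N);
  uint32_t next[MAX_N];
  int classes = 0, prev;
  /* Round 0: group by content key alone. */
  for (int i = 0; i < n; i++) {
    int j = 0;
    while (content[j] != content[i]) j++; /* stops at j == i at the latest */
    color[i] = (uint32_t)j;
    classes += (j == i);
  }
  /* Later rounds: split classes whose members point at differently colored
     targets. Colors are read from the previous round only. */
  do {
    prev = classes;
    classes = 0;
    for (int i = 0; i < n; i++) {
      int j = 0;
      while (!same_sig(color, target, i, j)) j++; /* stops at j == i */
      next[i] = (uint32_t)j;
      classes += (j == i);
    }
    for (int i = 0; i < n; i++) color[i] = next[i];
  } while (classes != prev); /* rounds only split, so equal counts = stable */
  return classes;
}
```

With contents {1,1,2,2,1,3} and targets {2,3,-1,-1,5,-1}, candidates 0 and 1 stay together because their targets 2 and 3 are themselves equivalent, while candidate 4 splits off because it points at a different class.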
…ize GSI/PSI sort
DBI section contributions:
- after sorting, merge contiguous contributions that share (section, module,
flags), absorbing the alignment-padding gaps between them. On UE-scale
input this collapses ~12.5M contribution records to ~2.0M and shrinks the
DBI stream 367 -> 72 MB, with no change to the address map.
GSI/PSI publics sort:
- the comparators got element-stable tiebreakers (sort by record offset /
dereferenced symbol identity, not by slot pointer) so the median-of-9
quicksort cannot degrade to O(n^2) on the large runs of equal-address /
equal-name records that ICF now produces.
- gsi_record_sort_by_sc returns a radix-sorted permutation index (via the
shared lnk_radix_sort_u64_pairs) and the PSI address map is built from it,
replacing a comparator sort that stalled multiple seconds on ICF-heavy links.
Numbers (UnrealEditorFortnite-Engine.dll):
DBI stream 367 -> 72 MB
removes a multi-second GSI/PSI sort stall on ICF-folded inputs
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
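The contribution merge can be sketched as a single in-place pass over the sorted array. The Contrib layout and the max_gap parameter are assumptions made for illustration; the commit message does not say how the padding gap is bounded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
  uint16_t section;
  uint16_t module;
  uint32_t flags;
  uint32_t off;
  uint32_t size;
} Contrib;

/* Merges in place; input must be sorted by (section, off) and must not
   overlap. Returns the new record count. */
static size_t merge_contribs(Contrib *c, size_t n, uint32_t max_gap) {
  if (n == 0) return 0;
  size_t out = 0;
  for (size_t i = 1; i < n; i++) {
    Contrib *cur = &c[out];
    const Contrib *nx = &c[i];
    uint32_t end = cur->off + cur->size;
    assert(nx->section != cur->section || nx->off >= end);
    int same = nx->section == cur->section && nx->module == cur->module &&
               nx->flags == cur->flags;
    if (same && nx->off - end <= max_gap) {
      /* Extend the current range over the gap and the next contribution. */
      cur->size = nx->off + nx->size - cur->off;
    } else {
      c[++out] = *nx;
    }
  }
  return out + 1;
}
```

Because merged ranges only grow over padding between records with the same key, every address that resolved to a contribution before still resolves to the same (section, module, flags) after the merge.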
The closure re-scanned every merged type each round (O(rounds * total types)) to find marked-but-unexpanded leaves. In a full-link trace of UnrealEditorFortnite that round-rescan dominated the type-GC: lnk_gc_expand_task ~13.3 s of CPU. Replace it with a frontier worklist: the atomic mark now gates a single append per leaf, and each round expands only the slice newly marked by the previous round, so total work is O(reachable types) instead of O(rounds * total types). Drops the per-round `expanded` bitmap and full-array sweeps. Output is unchanged -- same reachable set, PDB byte-identical (5081 MB on the same input), and debugger-fidelity checks (addr->symbol 100%, core types resolve) match the pre-change build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
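A serial sketch of the frontier worklist, assuming the type graph is available in CSR form (first[] and edges[] are hypothetical inputs, as is the function name). Each type is appended to the worklist exactly once, at the moment it is first marked, and each round expands only the slice appended by the previous round.

```c
#include <assert.h>
#include <stdint.h>

/* References of type t are edges[first[t] .. first[t+1]).
   `worklist` must have room for type_count entries.
   Returns the number of reachable types. */
static uint32_t gc_mark_reachable(const uint32_t *first, const uint32_t *edges,
                                  uint32_t type_count,
                                  const uint32_t *roots, uint32_t root_count,
                                  uint8_t *mark, uint32_t *worklist) {
  uint32_t head = 0, tail = 0;
  for (uint32_t t = 0; t < type_count; t++) mark[t] = 0;
  for (uint32_t i = 0; i < root_count; i++) {
    assert(roots[i] < type_count);
    if (!mark[roots[i]]) { mark[roots[i]] = 1; worklist[tail++] = roots[i]; }
  }
  while (head < tail) {
    uint32_t round_end = tail; /* frontier produced by the previous round */
    for (; head < round_end; head++) {
      uint32_t t = worklist[head];
      for (uint32_t e = first[t]; e < first[t + 1]; e++) {
        uint32_t ref = edges[e];
        if (!mark[ref]) { mark[ref] = 1; worklist[tail++] = ref; }
      }
    }
  }
  return tail;
}
```

Total work is proportional to the reachable types and their edges, with no per-round sweep over the whole type array.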
…-parse lnk_opt_icf re-parsed every candidate section serially (lnk_coff_relocs_from_ section_header per candidate) just to size the flattened reloc-target arrays. Move the per-candidate reloc count into lnk_icf_fill_task -- which is already parallel and has the section in hand -- so lnk_opt_icf only does a cheap serial prefix sum for reloc_first. No output change (PDB/.text/.pdata identical). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The frontier mark did an interlocked op on every reference edge. Add a plain non-atomic check first (the mark bit only ever goes 0->1, so a stale "already set" read is safe), so the interlocked op runs once per leaf at its 0->1 transition instead of once per edge. Output identical (PDB 5081 MB). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
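A minimal sketch of the check-then-interlock pattern using C11 atomics. Portable C cannot mix a truly non-atomic read with atomic writes, so the cheap check here is a relaxed atomic load, which is an ordinary load on x86; try_mark is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true only for the single caller that flips bit `idx` from 0 to 1. */
static bool try_mark(_Atomic uint32_t *words, uint32_t idx) {
  assert(words != 0);
  _Atomic uint32_t *w = &words[idx / 32];
  uint32_t bit = 1u << (idx % 32);
  /* Cheap check first: bits only ever go 0 -> 1, so seeing "set" is always
     correct. Seeing a stale "clear" just falls through to the slow path. */
  if (atomic_load_explicit(w, memory_order_relaxed) & bit) return false;
  uint32_t old = atomic_fetch_or_explicit(w, bit, memory_order_relaxed);
  return (old & bit) == 0;
}
```

The interlocked OR still decides the winner, so exactly one caller appends the leaf to the frontier; the pre-check only removes the locked instruction from edges that reach an already-marked leaf.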
Type-GC prunes CodeView types not referenced by any surviving symbol. That is a PDB-size win, but it removes types a debugger can still legitimately cast to in the watch window (the reachable-from-symbols set is a subset of the castable-type set) -- which is why the same approach was reverted before after users reported losing the ability to cast in the watch. So gate it behind /OPT:GCTYPES, default OFF; only opt in when the smaller PDB is worth the reduced castable-type set. Numbers (UnrealEditorFortnite-Engine.dll): default PDB 5315 MB; with /OPT:GCTYPES 5081 MB (-234). LINK ok, no RelocationAgainstRemovedSection either way. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnk_sort_contribs_task ran one serial radsort per chunk. A section's contribs live in a single chunk sized to the whole section, so the merged .text chunk (millions of entries) was sorted serially on one worker while every other thread idled -- the straggler that stretched the "Sort Section Contribs" phase. Sort chunks >= 64K entries with the parallel radix sort (key = Compose64Bit(obj_idx, obj_sect_idx), which is unique per contrib so the order matches the comparator) using all threads, before the per-chunk task pass handles the small remainder. Output is unchanged: section sizes identical, byte diff within the pre-existing relink noise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
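For reference, here is a serial sketch of an LSD radix sort with 8-bit digits over (key, value) pairs, using a composed 64-bit key. It shows the technique only; the parallel version in the PR is not reproduced, and Pair, compose64 and radix_sort_pairs are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t key; uint64_t val; } Pair;

/* High word sorts first, low word breaks ties. */
static uint64_t compose64(uint32_t hi, uint32_t lo) {
  return ((uint64_t)hi << 32) | lo;
}

/* Stable LSD radix sort, eight passes of 8 bits each. `tmp` must hold n
   pairs. Eight passes means an even number of buffer swaps, so the sorted
   result ends up back in `a`. */
static void radix_sort_pairs(Pair *a, Pair *tmp, size_t n) {
  assert(a != tmp);
  Pair *src = a, *dst = tmp;
  for (unsigned shift = 0; shift < 64; shift += 8) {
    size_t count[256] = {0};
    for (size_t i = 0; i < n; i++) count[(src[i].key >> shift) & 0xFF]++;
    size_t sum = 0;
    for (int d = 0; d < 256; d++) { /* exclusive prefix: digit -> first slot */
      size_t c = count[d];
      count[d] = sum;
      sum += c;
    }
    for (size_t i = 0; i < n; i++) {
      dst[count[(src[i].key >> shift) & 0xFF]++] = src[i];
    }
    Pair *t = src; src = dst; dst = t;
  }
}
```

When keys are unique, as the commit states for (obj_idx, obj_sect_idx), the sorted order is fully determined by the keys and cannot depend on how the work was split.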
ICF refinement re-densified all ~N candidates every round, radix-sorting the full set each iteration to a fixpoint. But a candidate alone in its equivalence class can never split or merge again -- its color is final. Track an active set of only the candidates still sharing a class with another, and re-densify just that set each round (ids drawn from an ever-increasing base so they never collide with the colors already finalized for singletons). The per-round sort shrinks from all candidates to those that still have a content+reloc twin, and converged classes drop out as they fragment into singletons. Output is unchanged -- relinking UnrealEditorFortnite-Engine.dll is byte-identical to the prior ICF (same folds, same size) and reproducible across runs. On that link the refinement loop drops from ~888ms to ~765ms (first round prunes ~2.1M of 3.98M candidates to singletons immediately). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ved symbols lnk_search_lib_task scans the entire search-chunk symbol set once per library (2733 dispatches on the UE editor link). A symbol that started Undefined/Weak stays in search_chunks even after a definition resolves it, so every later library pass re-parsed it -- lnk_ref_from_symbol + lnk_parsed_symbol_from_coff _symbol_idx -- just to recompute its interp and skip it. That parse faults the COFF symbol record out of the mmap'd obj, and the profile showed those two lines at ~65% of the task and a matching wall of page-fault kernel time. The interp is already computed once in lnk_symbol_table_push_; cache it on LNK_Symbol and read it in the hot loop. Only genuinely Weak symbols still parse (for the weak-extension characteristics). The hash-trie node always points at the current leader symbol, so the cached interp reflects the resolved state. Output unchanged (byte-identical DLL+PDB, reproducible). Lib-search dispatch wall drops ~18% (~1.5s -> ~1.2s on the UE editor link) with a larger drop in aggregate CPU and page-fault traffic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnk_link_inputs resolves libraries to a fixpoint: an outer pass loops over every library, and each library is re-searched (a full tp_for_parallel over all workers, scanning every undefined/weak symbol in search_chunks) once per drained input batch until nothing new resolves. On the UE editor link that is ~2733 dispatches, each waking and joining ~60 workers -- and the phase is barrier-bound, so that wake/join is the cost, not the scan. Most re-searches are redundant: search_chunks only grows during the loop (symbols are never removed until the end) and member-queue dedup is idempotent, so a re-search can only queue new members if the undefined/weak symbol set grew or anti-dep searching was just enabled since this library was last searched. Stamp each LNK_Lib with the search_chunks symbol count + anti-dep mode at its last search and skip the dispatch when neither changed. ~24% fewer dispatches (2733 -> ~2089) and ~0.2s of wake/join wall-time removed. Output is byte-identical and reproducible (which dispatch coalesces is timing dependent, but a skipped one provably queued nothing, so the result is unchanged -- verified relink-twice byte-identical across many runs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
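The stamp check can be sketched as a small predicate. LibStamp and lib_needs_search are hypothetical names; the only inputs taken from the description are the search-set symbol count and the anti-dep mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
  uint64_t stamp_symbol_count; /* search-set size at the last search */
  bool     stamp_anti_dep;     /* anti-dep mode at the last search */
  bool     searched_once;
} LibStamp;

/* Returns true if the library must be searched again, and records the new
   stamp when it does. */
static bool lib_needs_search(LibStamp *lib, uint64_t symbol_count, bool anti_dep) {
  /* The search set only grows during the loop, so the count is a valid
     change detector. */
  assert(!lib->searched_once || symbol_count >= lib->stamp_symbol_count);
  if (lib->searched_once &&
      lib->stamp_symbol_count == symbol_count &&
      lib->stamp_anti_dep == anti_dep) {
    return false; /* nothing new could be queued: skip the dispatch */
  }
  lib->stamp_symbol_count = symbol_count;
  lib->stamp_anti_dep = anti_dep;
  lib->searched_once = true;
  return true;
}
```

A skipped search is one whose inputs are identical to the previous search of that library, which is why skipping it cannot change the result.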
External ICF folds a follower by redirecting its defining symbol to the leader
(Fnode->symbol = Lnode->symbol); /OPT:REF then finds the follower unreferenced
and dead-strips it along with its associated .pdata/.xdata. Static COMDATs have
no external symbol -- they're reached only by section-relative relocs to a
static symbol that resolves to themselves -- so that path can't fold them,
leaving a large .text gap vs MSVC.
Add an opt-in static fold. lnk_icf_section_kind now returns candidates for
static COMDATs too; they join the same content+reloc-equivalence classes.
Leader selection prefers a non-static member so an external is never folded
into a static leader. A static follower records a per-section icf_fold map
(LNK_Obj.icf_fold: follower section -> leader obj/section) instead of a symbol
redirect.
The /OPT:REF mark-live walk consults that map: when a reference (or an
associative-section walk) reaches a folded static follower, it marks the LEADER
section live cross-obj and enqueues the leader's relocs/associated sections
instead -- so the follower dead-strips, taking its .pdata/.xdata with it.
Crucially the redirect is applied in the associated-section walk too: folded
static .text are often associative COMDATs (EH funclets/thunks) pulled in via
associated_sections[], and keeping those followers live was the prior attempt's
bug (~250K malformed .pdata + nondeterminism).
lnk_set_icf_static_leader_contribs_task then redirects folded followers'
sect_map entries to the leader's contrib so any residual reloc resolves to the
identical leader.
Gated behind /OPT:ICFSTATIC (default off): runtime-unvalidated, like
/OPT:GCTYPES. Default ICF output is byte-identical to before.
On UnrealEditorFortnite-Engine.dll with /OPT:ICFSTATIC: DLL 925 -> 844 MB
(.text 643.1 -> 594.3 MiB, .rdata 194.9 -> 169.2 MiB), output reproducible
(relink byte-identical), .pdata clean (1,740,509 records, 0 malformed,
0 out-of-order), link exits 0. No runtime validation performed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve MSVC C++ header-unit IFC debug records (LF_IFC_RECORD 0x1522) by merging the .ifc .msvc.trait.debug-records CodeView stream and redirecting each record to its real type -> fixes VS debugger AV stepping into header- unit code (BAD 921->0, 44/44 MSVC name-match, live debug verified). ICF: key non-candidate Regular COMDAT reloc targets by their resolved COMDAT leader instead of per-obj (input_idx,section) -> folds identical funcs referencing per-obj-dup COMDATs (.text 623->536 MiB, -82.56 MiB, deterministic, .pdata valid). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…vor-emit), dense-color SoA+U32, reloc-weight load-balance, parallel fold-verify + IFC scan optG: ~35s->22-23s warm on monolithic UnrealEditorFortnite-Engine link, all determinism-verified (PDB byte-identical to fixes bar /BREPRO GUID). Stacks on header-unit IFC + ICF-keying fixes (69ac14f). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…llel prefix-sum)
The per-round serial group-scan in lnk_icf_dense_colors_active (walk sorted
keys, assign a dense color id per distinct-key group, mark size>=2 survivors,
exclusive-prefix their emit slots) was the remaining refine-window sawtooth
bottleneck (inlined into lnk_link_image self). Replace it with a 3-phase
parallel prefix sum:
mark (parallel): per sorted position compute boundary (run start) and keep
(run size>=2 survivor) bits -- each reads only sk[k-1..k+1]
so it is chunk-local -- plus per-chunk local boundary/keep/
surviving-class counts.
prefix (serial): tiny exclusive prefix over the worker_count chunk totals
(NOT over n) -> each chunk's exclusive class-id / emit-slot
base; sum surviving classes.
apply (parallel): each chunk re-derives color_at[]/out_slot[] from its base.
Determinism: the per-position values are a pure prefix of independent
per-position bits, byte-identical to the old serial running counter regardless
of how tp_divide_work splits the chunks; next_active survivors are emitted in
ascending sorted-color order via the prefix slots. The existing parallel color
scatter + survivor emit are unchanged. Verified: linked DLL byte-identical to
the optG canonical (modulo the /BREPRO GUID block), self-deterministic across
relinks and reproducible across recompiles; PDB dia_types BAD=0. A gated
ICF_SCAN_SELFCHECK build asserts the parallel scan matches the serial scan
byte-for-byte.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
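The mark / prefix / apply structure can be sketched for the class-id part of the scan. The chunks are processed in a plain loop here; in the real code each chunk would be a worker. colors_serial is the reference running counter, colors_chunked is the three-phase version, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_CHUNKS = 64 };

/* A position starts a new run if it is first or differs from its neighbor. */
static int is_boundary(const uint64_t *sk, size_t k) {
  return k == 0 || sk[k] != sk[k - 1];
}

/* Reference: serial running counter over the sorted keys. */
static void colors_serial(const uint64_t *sk, size_t n, uint32_t *color_at) {
  uint32_t next = 0;
  for (size_t k = 0; k < n; k++) {
    next += (uint32_t)is_boundary(sk, k);
    color_at[k] = next - 1;
  }
}

static size_t chunk_start(size_t c, size_t per, size_t n) {
  return c * per < n ? c * per : n;
}

static void colors_chunked(const uint64_t *sk, size_t n, size_t chunks,
                           uint32_t *color_at) {
  assert(chunks >= 1 && chunks <= MAX_CHUNKS);
  uint32_t base[MAX_CHUNKS];
  size_t per = (n + chunks - 1) / chunks;
  /* mark: count run starts per chunk (each chunk reads only sk[k-1..k]) */
  for (size_t c = 0; c < chunks; c++) {
    uint32_t cnt = 0;
    for (size_t k = chunk_start(c, per, n); k < chunk_start(c + 1, per, n); k++)
      cnt += (uint32_t)is_boundary(sk, k);
    base[c] = cnt;
  }
  /* prefix: exclusive sum over the chunk totals, not over n */
  uint32_t sum = 0;
  for (size_t c = 0; c < chunks; c++) {
    uint32_t cnt = base[c];
    base[c] = sum;
    sum += cnt;
  }
  /* apply: each chunk re-derives its ids from its own base */
  for (size_t c = 0; c < chunks; c++) {
    uint32_t next = base[c];
    for (size_t k = chunk_start(c, per, n); k < chunk_start(c + 1, per, n); k++) {
      next += (uint32_t)is_boundary(sk, k);
      color_at[k] = next - 1;
    }
  }
}
```

Because each id is a prefix count of per-position bits, the result does not depend on where the chunk borders fall, which is the determinism argument made above.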
… region Collapse the ICF color-refine round loop (~18 rounds x ~10 tp_for_parallel phases = ~126 fork-join cycles) into a SINGLE tp_for_parallel region of worker_count participants (lnk_icf_refine_region_task). Every former phase boundary becomes a barrier_wait(tp->barrier) inside the region; the tiny serial glue (radix per-pass 256xW prefix, group-scan W-chunk prefix, convergence + buffer swap) runs on worker 0 behind a barrier while the others wait. This kills the thread-pool wake->work->sleep sawtooth that dominated lnk_link_image self time without changing any task body. Order is preserved exactly: ranges are rebuilt each round into preallocated buffers (in-place lnk_icf_divide_by_reloc / tp_divide_work), all phase math is byte-identical to the per-round path (refine, gather, LSD radix, scan mark/apply, color scatter, survivor emit). Per-round scratch is preallocated once to cand_count and reused; the radix double-buffer / pass-count / pointer swap are driven by worker 0 across barriers. 1:1 worker<->task is guaranteed because every participant blocks on the first barrier before any can steal a second task. Determinism: relinked UnrealEditorFortnite-Engine.dll is byte-identical to a freshly-built e32b662 (group-scan) DLL except the /BREPRO GUID block and the known pre-existing offset-261 export-dir Size field. ICF fold counts unchanged (folded 10082697 of 19045185 into 5661141 classes); dia_types BAD=0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er) atop persistent region lnk_icf_divide_by_reloc_into: replace two O(n) random gathers (cands[active[i]].reloc_count, cache-miss-bound, x~18 rounds = ~5.6s serial, found via no-inline trace) with O(worker_count) count-based split. Work-split only -> output byte-identical (verified: combo == canonical, 0 diffs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…re-sort, ~3-4s)
The persistent ICF refine region re-sorts the entire ~10.65M-candidate active
set every round until a round splits nothing. Measured churn (live UE Engine
link): the partition is ~99.99% stable by round ~7; the last ~11 rounds each
still pay the full ~10.65M-key sort just to resolve a few hundred cumulative
splits. That tail is ~3-4s of pure COMPUTE waste.
Run the persistent region for a bounded warm-up (8 rounds, where churn is large
and the full parallel sort wins), then hand the still-active set to a
dirty-class worklist (Hopcroft "process only what can change"):
- Per-color member slab; the work unit is a CLASS.
- Each round re-keys ONLY members of dirty classes against a FROZEN colors[]
snapshot (Jacobi commit at the round barrier -- reading mid-round colors
would over-refine past the unique coarsest stable partition and may never
terminate; this is the trap that hung the prior arm).
- A class can only split if its own color or a reloc target's color changed;
after a split, enqueue the split class's referrers via a CSR reverse
target->referrers index. Dirty set empty => fixpoint.
- Per-class re-key uses a serial pool-free sort (classes are tiny; the prior
arm re-entered the thread pool thousands of times per round -> hang).
Both paths compute the unique coarsest stable partition, so the final colors[]
partition is identical; ids are renamed but every fold consumer
(group-by-color, identity-keyed leader election, colors[a]==colors[b] verify)
depends only on the equivalence relation. Output DLL byte-identical to
canonical (bar /BREPRO GUID).
Hang-guard: hard 40-round cap + non-progress (dirty set not shrinking)
detector; on trip, AssertAlways -> never ships a spinning binary.
Gated -DICF_WORKLIST_SELFCHECK=1 build runs BOTH the worklist and the uncapped
region each link and AssertAlways the partitions are identical (verified clean
across the whole UE link).
Verified on UE Editor Fortnite Engine.dll:
- worklist tail: 10 rounds (next_dirty 14->0), handoff active=10654290
- fold count EXACT: folded 10082697 of 19045185 into 5661141 classes
- self-cmp GUID-only (17 bytes); vs canonical a597eb6 GUID-only (26 bytes, all
in the debug-dir/GUID window, zero diffs elsewhere)
- dia_types: UDTs=2559825 children=39928767 BAD=0
- ICF_WORKLIST_SELFCHECK: partition identical to full region every link
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stacks worklist refine (1692a9d) + parallel lnk_apply_ifc_debug_records discovery scan (766deb7e). Byte-identical to canonical; BAD=0; 0x1522->0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The worklist's CSR reverse-index (rev_adj = U32 x candidate-edge-count) costs ~10GB peak on the UE link -- net-negative on the page-fault-bound critical path vs the ~3-4s of tail re-sorts it saves. region_cap=64 -> region converges (~19 rounds), worklist handoff skipped, no reverse-index. Re-enable by lowering region_cap. Size wins intact (.text 536MB), sound, links clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ng insert lnk_opt_icf built the (input_idx,sn)->cand_idx+1 lookup serially via lnk_icf_map_put; Superluminal showed ~642ms there, cache-miss-bound on the scrambled-slot keys[] probe. Parallelize the insert loop over the thread pool. Keys are UNIQUE (1:1 map) and the map is read only by key afterward (lnk_icf_map_get, post-barrier), so insert ORDER is output-neutral. New lnk_icf_map_put_atomic claims each empty slot with a CAS EMPTY->key (ins_atomic_u64_eval_cond_assign); only the CAS winner writes vals[slot]. Load factor <=0.5 (lnk_icf_map_make oversizes cap>=capacity*2). Output byte-identical to canonical bar /BREPRO GUID (35 bytes, 2 RSDS records). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit ff632aab8f81b3de37eb2a0a77c10845ce019e4c)
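A sketch of the CAS-claim insert using C11 atomics in place of the codebase's own intrinsic wrapper. Map, map_slot, map_put_atomic and map_get are hypothetical names, and the mixer in map_slot is an arbitrary multiply-xorshift chosen for the example, since the source does not specify one. Values are stored as index+1 so that 0 can mean "absent", matching the cand_idx+1 convention mentioned above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_EMPTY UINT64_MAX /* all-0xFF sentinel for an unclaimed slot */

typedef struct {
  _Atomic uint64_t *keys;
  uint32_t *vals;
  size_t cap; /* power of two, at least 2x the element count */
} Map;

static size_t map_slot(uint64_t key, size_t cap) {
  key ^= key >> 33;
  key *= 0xff51afd7ed558ccdULL; /* arbitrary odd multiplier for scrambling */
  key ^= key >> 33;
  return (size_t)key & (cap - 1);
}

/* Safe to call from many threads as long as every key is inserted once. */
static void map_put_atomic(Map *m, uint64_t key, uint32_t val) {
  assert(key != MAP_EMPTY);
  for (size_t i = map_slot(key, m->cap);; i = (i + 1) & (m->cap - 1)) {
    uint64_t expected = MAP_EMPTY;
    if (atomic_compare_exchange_strong(&m->keys[i], &expected, key)) {
      m->vals[i] = val; /* only the CAS winner writes the value */
      return;
    }
    assert(expected != key); /* keys are unique in this sketch */
  }
}

/* Must run only after all inserts have completed (post-barrier). */
static uint32_t map_get(const Map *m, uint64_t key) {
  for (size_t i = map_slot(key, m->cap);; i = (i + 1) & (m->cap - 1)) {
    uint64_t k = atomic_load(&m->keys[i]);
    if (k == key) return m->vals[i];
    if (k == MAP_EMPTY) return 0;
  }
}
```

Which slot a key lands in can vary with insert order when probes collide, but lookups go by key, so the values read back are the same for any interleaving.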
…path cand_map keys fill: LNK_ICF_EMPTY is all-0xFF -> single MemorySet vs scalar per-U64 store loop (256MB serial first-touch on the monolithic link). image_fill_task: direct copy fast-path for single-data-node contribs (the vast majority), skipping the list-walk + cursor on the hot 739MB image write. Byte-identical (27B: chksum+GUID). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…merge dedup probes) leaf_ht + assigned_ti caps rounded via u64_up_to_pow2 so the bucket index is hash & (cap-1) instead of hash % cap -- removes a 64-bit DIV from the densest type-dedup/fixup probe loops (lnk_leaf_hash_table_search_ti, lnk_leaf_dedup_task, lnk_hash_debug_t_task, assigned-ti pass). Byte-identical (back-to-back A/B: 29B chksum+GUID, control 17B; the load factor stays <=~0.65). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
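The mask-index trick in isolation, as a sketch: round the capacity up to a power of two once, then replace the modulo with a bitwise AND. up_to_pow2 and bucket_index are illustrative names standing in for the helper named above.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of two >= x, for 1 <= x <= 2^63. */
static uint64_t up_to_pow2(uint64_t x) {
  assert(x != 0 && x <= (UINT64_C(1) << 63));
  x--;
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  x |= x >> 32;
  return x + 1;
}

/* For a power-of-two cap, hash & (cap - 1) equals hash % cap without a DIV. */
static uint64_t bucket_index(uint64_t hash, uint64_t cap) {
  assert(cap != 0 && (cap & (cap - 1)) == 0);
  return hash & (cap - 1);
}
```

Rounding up can grow the table to nearly twice the requested size, which lowers the load factor; the trade is extra memory for a cheaper probe.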
… already-searched symbols) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After merge-types reaches the per-thread scratch high-water (~9GB of tctx
scratch arenas stay committed but idle through the PDB peak), release the
committed-but-unused scratch pages back to the OS before the PDB build re-grows
them. Drops recorded peak working set on the monolithic
UnrealEditorFortnite-Engine.dll link.
- arena_decommit_unused(): decommit committed pages strictly above each block's
live pos in the active chain, and the unused bodies of free-list blocks
(keeping the header page). Reservation kept; push path re-commits on demand,
so reuse is transparent and output byte-identical.
- tctx_scratch_decommit(): decommit the calling thread's two equipped scratch
arenas.
- lnk_scratch_decommit_worker + tp_for_parallel(worker_count) with an in-task
barrier: every worker (worker 0 IS the main thread) decommits its own scratch
exactly once between lnk_merge_types and lnk_build_pdb.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ostic, byte-neutral) When RADLINK_PHASE_LOG is set, lnk_log_timers writes machine-parseable raw per-phase microseconds (Image/PDB/RDI/Lib/Debug + TOTAL) to that path, for automated perf A/B. Env-unset -> identical code path, DLL/PDB byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d70c7b1 to bd71560
What this is
The remaining radlink perf + feature contribution, rebased on latest dev with everything you've already taken dropped. Base: dev.
Kept as individual, reviewable, per-feature commits; cherry-pick whatever you want. Nothing here is squashed and nothing here is something you already have.
40 commits, roughly:
- /OPT:ICF identical COMDAT folding (code + read-only data), /OPT:ICFSTATIC (static/internal-linkage COMDATs), leader-keying, and a parallelized refine pipeline (persistent-worker region, dense-color SoA, reloc-weight load-balance, parallel fold-verify).
- /OPT:GCTYPES (opt-in, default off): GC unreferenced CodeView types before PDB emit, frontier-worklist transitive closure.
- … 0x1522) + ICF leader-keying so header-unit objects link and debug cleanly.
- … 65s→18s), skip redundant library re-searches, per-lib frontier cursor, parallelize section-contrib sort / make_code_view_input / cand_map build, pow2 mask-index hash caps, batched thread-pool wake.
- RADLINK_PHASE_LOG per-phase micros (byte-neutral).
Output stays byte-identical to before unless the commit's whole point is to change bytes (the size / determinism work), verified by relinking and cmp on DLL+PDB.
The cross-process shared thread pool stacks on top of this in #847 (dual-path).