fix(riscv64): relax out-of-range branches + frame-slot overlap (#1666) #1671
Merged
RISC-V B-type conditional branches reach only +-4 KiB and J-type jumps only +-1 MiB, so a function whose encoded body exceeds those ranges could not encode: the patcher rejected the overflowing displacement instead of relaxing it. A large stack frame (>2 KiB offsets expanding into multi-instruction sequences) inflates a body past 4 KiB and overflows a conditional branch spanning it. Add branch relaxation to the riscv64 encoder. A conditional past +-4 KiB becomes its inverted guard (a short branch skipping the trampoline) plus a jal to the real target; a jal past +-1 MiB becomes an auipc+jalr pc-relative trampoline. The relaxation runs as a per-function fixpoint after the body is emitted: each growth opens a text gap that slides the rest of the function down and fixes the block / fixup / symbol / relocation tables, so each rescan remeasures against the grown layout. The new shared encode.insert_text owns the ISA-general gap mechanism; encode.block_offset is made public so the pass can resolve targets. Closes #1666
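The fixpoint described above can be sketched in a few lines. This is a simplified model, not the encoder's actual code: `Inst`, `layout`, and `relax` are hypothetical names, every instruction starts at 4 bytes, and both relaxed forms (inverted guard + `jal`, and `auipc`+`jalr`) are modeled only as growth from 4 to 8 bytes. It ignores the table fixups that `encode.insert_text` performs and assumes the `jal` of a relaxed conditional is itself within +-1 MiB.

```python
from dataclasses import dataclass
from typing import List, Optional

# Reach of the two pc-relative forms (displacements are even byte counts).
B_MIN, B_MAX = -4096, 4094                  # B-type conditional branch
J_MIN, J_MAX = -(1 << 20), (1 << 20) - 2    # J-type jal


@dataclass
class Inst:
    kind: str                     # "op", "bcond", or "jal"
    target: Optional[int] = None  # index of the target instruction, if a branch
    size: int = 4                 # encoded size in bytes
    relaxed: bool = False


def layout(insts: List[Inst]) -> List[int]:
    """Byte offset of every instruction, plus one final entry for the end."""
    offsets, pc = [], 0
    for inst in insts:
        offsets.append(pc)
        pc += inst.size
    offsets.append(pc)
    return offsets


def out_of_range(inst: Inst, disp: int) -> bool:
    if inst.kind == "bcond":
        return not (B_MIN <= disp <= B_MAX)
    if inst.kind == "jal":
        return not (J_MIN <= disp <= J_MAX)
    return False


def relax(insts: List[Inst]) -> int:
    """Grow out-of-range branches until a rescan finds nothing left to grow.

    Returns the number of branches that were relaxed.
    """
    grown = 0
    while True:
        offsets = layout(insts)  # remeasure against the grown layout
        for i, inst in enumerate(insts):
            if inst.target is None or inst.relaxed:
                continue
            if out_of_range(inst, offsets[inst.target] - offsets[i]):
                # bcond -> inverted guard + jal; jal -> auipc + jalr.
                # In this model either site grows from 4 to 8 bytes.
                inst.size, inst.relaxed = 8, True
                grown += 1
                break  # the layout changed, so rescan from the top
        else:
            return grown
```

The loop terminates because each pass relaxes at most one branch and a relaxed branch is never revisited. The rescan matters when one growth pushes a second, previously in-range branch over the limit, which is why a single pass is not enough.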
Any riscv64 function with an address-taken local or a spill / aggregate slot (frame.size > 0) corrupted its own saved return address and segfaulted at runtime. The prologue pins s0 at the frame top, with the 16-byte ra / s0 record and the callee-saved areas occupying the bytes immediately below it, but the shared frame phase assigns local slot offsets just below the frame pointer (the x86 model). On riscv64 those landed on top of the saved record - e.g. a [256]u8 buffer at s0-256 while ra was saved at s0-8, so zeroing the buffer clobbered ra. Bias the s0-relative slot offset down by frame_reserved_top (16 + callee-saved GP + FP bytes) so locals land below the reserved region. Localized to frame_slot_offset, mirrored in the assembly printer; s0 stays at the frame top so incoming stack-argument access is unchanged. It byte-encoded fine, so byte-verify never caught it and the register-only freestanding fixtures never exercised it.
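The overlap and the bias can be checked with a little arithmetic. The sketch below reuses the names `frame_reserved_top` and `frame_slot_offset` from the commit, but the signatures, the `RA_S0_RECORD` constant, and the `overlaps_reserved` helper are assumptions made for illustration, not the backend's API. Offsets are s0-relative and negative (below the frame top).

```python
RA_S0_RECORD = 16  # bytes holding the saved ra / s0 pair just below s0


def frame_reserved_top(callee_saved_gp_bytes: int, callee_saved_fp_bytes: int) -> int:
    """Bytes immediately below s0 that the prologue owns."""
    return RA_S0_RECORD + callee_saved_gp_bytes + callee_saved_fp_bytes


def frame_slot_offset(shared_offset: int, reserved_top: int) -> int:
    """Bias the shared frame phase's (negative) offset below the reserved region."""
    return shared_offset - reserved_top


def overlaps_reserved(slot_offset: int, slot_size: int, reserved_top: int) -> bool:
    """True if [slot_offset, slot_offset + slot_size) reaches into [-reserved_top, 0)."""
    return slot_offset + slot_size > -reserved_top


reserved = frame_reserved_top(0, 0)            # only the ra / s0 record
unbiased = -256                                # [256]u8 placed by the x86-style model
biased = frame_slot_offset(unbiased, reserved)
```

With no callee-saved registers, the unbiased buffer spans s0-256 up to s0 and so covers the saved `ra` at s0-8; the biased buffer spans s0-272 up to s0-16 and stops exactly at the reserved region.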
Extend the freestanding rv64 fixture with two probes folded into the qemu exit code, so a regression in either fix changes the asserted code:
- parse_probe: a deterministic parse over a [2048]u8 stack buffer. The large frame inflates the encoded body past 4 KiB and leaves a conditional branch out of the B-type +-4 KiB range, exercising long-branch relaxation (#1666). verify.sh asserts the inverted-guard + jal sequence is present.
- stack_probe: a small stack-local buffer written then read back, exercising frame-slot addressing (#1670).

The result matches the unrelaxed x86_64 / aarch64 build (exit code 68).

verify.sh: disassemble with --mattr=+m,+f,+d so the <unknown> guard does not false-positive on valid M / F / D words (the backend emits no .riscv.attributes ISA string yet), and read grep inputs from here-strings so a large disassembly does not trip a SIGPIPE under pipefail.
…h-relax # Conflicts: # test/riscv64/src/main.mach # test/riscv64/verify.sh
A 32-bit bitwise and / or / xor encoded an illegal instruction and faulted with SIGILL at runtime (surfaced building std crypto / bignum, which is u32-heavy). encode_alu3 selected the major opcode through alu_opcode(width), which returns OP_32 (the RV64 word group) at 32-bit width - but RISC-V defines no andw / orw / xorw, so the word matched no encoding (funct3 6/7/4 with funct7 0 in OP_32 is reserved). Bitwise ops have no word form: they are bit-for-bit identical at any width, and a bitwise op of two consistently extended 32-bit operands yields a consistently extended result. Thread a word_form flag through encode_alu3 so mul keeps its mulw word form while and / or / xor always encode as the full-register OP. add / sub / shifts / mul / div already use valid word forms, so only the logical ops were affected. Verified: a std sha256 / sha512 / bignum consumer cross-compiled for linux-riscv64 now runs under qemu to the same result as the native build, and a 32-bit bitwise probe in the fixture matches the native value.
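A minimal sketch of the opcode selection follows, assuming a simplified `encode_alu3` signature and a hypothetical `ALU3` table; the real encoder's types and operand handling will differ. The R-type field layout and the `OP` / `OP_32` major opcodes are taken from the RISC-V base ISA.

```python
OP, OP_32 = 0x33, 0x3B  # RV64 major opcodes: full-register group, word group

# name: (funct7, funct3, has_word_form)
ALU3 = {
    "add": (0x00, 0, True),
    "sub": (0x20, 0, True),
    "mul": (0x01, 0, True),
    "xor": (0x00, 4, False),
    "or":  (0x00, 6, False),
    "and": (0x00, 7, False),
}


def encode_alu3(name: str, rd: int, rs1: int, rs2: int, width: int) -> int:
    """Encode an R-type ALU op; only ops with a word form may use OP_32."""
    funct7, funct3, word_form = ALU3[name]
    opcode = OP_32 if (width == 32 and word_form) else OP
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def sext32(x: int) -> int:
    """Sign-extend the low 32 bits of x, as a 32-bit value held in a 64-bit register."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


A0, A1, A2 = 10, 11, 12  # register numbers of a0, a1, a2
and32 = encode_alu3("and", A0, A1, A2, 32)
mul32 = encode_alu3("mul", A0, A1, A2, 32)
```

Here `and32` comes out identical to the 64-bit encoding of `and a0, a1, a2`, while `mul32` still selects the word group as `mulw a0, a1, a2`. The `sext32` helper illustrates why that is safe: a bitwise op on two sign-extended 32-bit values is itself the sign extension of the 32-bit result.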
Closes #1666
Closes #1670
Closes #1672
Fixes the riscv64 backend so arbitrary, large, real std code byte-encodes and runs correctly under qemu. Three distinct backend gaps were found by driving a real std workload (`dns_nameserver`, crypto sha256/sha512, bignum, large match chains, >2 KiB frames) to a clean cross-compile + qemu run. Each is its own commit, cross-checked against the unrelaxed x86_64 build. Merged `origin/dev` (RV64A inline-asm #1669) in; the fixture exercises those atomics too.

Gap 1 - no long-branch relaxation (#1666)
- `error: encode: riscv64 branch displacement exceeds the +-4KiB range` on any function whose encoded body exceeds the B-type +-4 KiB reach (trigger: `std.system.os.linux.shared.dns_nameserver`, a `var buf:[2048]u8` frame whose >2 KiB stack offsets expand into multi-instruction sequences, inflating the body past 4 KiB).
- Add branch relaxation to the riscv64 encoder. A conditional past +-4 KiB becomes its inverted guard (a short branch skipping the trampoline) plus a `jal` to the real target; a `jal` past +-1 MiB becomes an `auipc t0,%hi ; jalr x0,t0,%lo` pc-relative trampoline (full 32-bit reach). It runs as a per-function fixpoint after the body is emitted: each growth opens a text gap via the new shared `encode.insert_text`, which slides the rest of the function down and fixes the block / fixup / symbol / relocation tables, so each rescan remeasures against the grown layout. `encode.block_offset` is made public so the pass can resolve targets; `patch_branch_riscv64` learns the `auipc`+`jalr` form. The inverted guard's skip is purely local and resolved in the pass; the trampoline's real-target displacement is left to pass 2 through the repointed fixup.

Gap 2 - frame slots overlap the saved ra/s0 record (#1670)
- Any riscv64 function with an address-taken local or a spill / aggregate slot (`frame.size > 0`) corrupts its own saved return address and segfaults at runtime (SIGSEGV, NULL deref). It byte-encodes fine, so byte-verify never caught it, and the register-only freestanding fixtures never exercised it.
- The prologue pins `s0` at the frame top, with the 16-byte ra/s0 record and the callee-saved areas occupying the bytes immediately below it. The shared frame phase assigns local slot offsets just below the frame pointer (the x86 model: locals immediately below fp), so on riscv64 those land on top of the saved record - e.g. `var buf:[256]u8` got `s0-256` while `ra` was saved at `s0-8`, so zeroing the buffer clobbered `ra`.
- Bias the s0-relative slot offset down by `frame_reserved_top` (16 + callee-saved GP + FP bytes) so locals land below the reserved region. Localized to `frame_slot_offset`, mirrored in the assembly printer. `s0` stays at the frame top, so incoming stack-argument access is unchanged.

Gap 3 - 32-bit and/or/xor encode illegal word-group ops (#1672)
- A 32-bit bitwise `and`/`or`/`xor` encodes an illegal instruction and faults with SIGILL at runtime (surfaced building std crypto / bignum, which are u32-heavy). It byte-encodes without error.
- `encode_alu3` selected the major opcode through `alu_opcode(width)`, which returns `OP_32` (the RV64 word group) at 32-bit width - but RISC-V defines no `andw`/`orw`/`xorw` (the word `funct3` 6/7/4 with `funct7` 0 is reserved). `add`/`sub`/`mul`/ shifts have valid word forms; only the logical ops do not.
- Thread a `word_form` flag through `encode_alu3` so `mul` keeps its `mulw` word form while `and`/`or`/`xor` always encode as the full-register `OP`. A bitwise op of two consistently extended 32-bit operands yields a consistently extended result, so the full-register form is correct at every width.

Verification
- Fixture (`test/riscv64`): extends the freestanding rv64 fixture with a >4 KiB `parse_probe` (forces long-branch relaxation - verify.sh asserts the inverted-guard + `jal` sequence), a `stack_probe` (stack-local frame slot), and a `bitmix` (32-bit bitwise word-group). Folded with the existing const-shift and RV64A atomics probes into a qemu exit code (70) asserted by `verify.sh`; the code matches the unrelaxed x86_64 build, so a regression in any fix changes it. `verify.sh` now disassembles with `--mattr=+m,+f,+d,+a` (the backend emits no `.riscv.attributes` ISA string yet, so objdump must be told the extensions) and reads grep inputs from here-strings (a large disassembly tripped a SIGPIPE under pipefail).
- std consumers cross-compiled for `linux-riscv64` and run under qemu, each matching the native x86_64 result: sha256 / sha512 / bignum / wide mul-div-rem; math.bits (clz/ctz/popcount/rotate/byteswap/bitreverse) / math (min/max/abs/log2/sat_*) / fnv1a / RV64D float (add/mul/div/cvt/cmp) / 32-bit div-rem word forms; and a heavy combined sha256-over-4KiB x8 + bignum workload.
- Fixpoint (`mach build . -o a && a build . -o b && b build . -o c && cmp b c` -> b == c byte-identical).
- `mach test .` passes (666 passed, 0 failed). x86_64 and aarch64 unaffected (the fixpoint + existing cross lanes).

Minor follow-up observation (not in scope here, does not affect qemu execution): the riscv64 object writer emits no `.riscv.attributes` ISA-string section, so external tools default to base RV64I and render valid M/F/D/A words as `<unknown>` unless `--mattr` is passed.

🤖 Generated with Claude Code