fix(riscv64): relax out-of-range branches + frame-slot overlap (#1666)#1671

Merged: octalide merged 5 commits into dev from fix/1666-riscv64-branch-relax on Jun 27, 2026

@octalide octalide commented Jun 27, 2026

Closes #1666
Closes #1670
Closes #1672

Fixes the riscv64 backend so arbitrary, large, real std code byte-encodes and runs correctly under qemu. Three distinct backend gaps were found by driving a real std workload (dns_nameserver, crypto sha256/sha512, bignum, large match chains, >2KiB frames) to a clean cross-compile + qemu run. Each is its own commit, cross-checked against the unrelaxed x86_64 build. Merged origin/dev (RV64A inline-asm #1669) in; the fixture exercises those atomics too.

Gap 1 - no long-branch relaxation (#1666)

  • Symptom: error: encode: riscv64 branch displacement exceeds the +-4KiB range on any function whose encoded body exceeds the B-type +-4 KiB reach (trigger: std.system.os.linux.shared.dns_nameserver, a var buf:[2048]u8 frame whose >2 KiB stack offsets expand into multi-instruction sequences, inflating the body past 4 KiB).
  • Root cause: RISC-V B-type conditional branches reach only +-4 KiB and J-type jumps only +-1 MiB. The pass-2 patcher rejected an out-of-range displacement instead of relaxing the branch.
  • Fix: branch relaxation in the riscv64 encoder. A conditional past +-4 KiB becomes its inverted guard (a short branch skipping the trampoline) + a jal to the real target; a jal past +-1 MiB becomes an auipc t0,%hi ; jalr x0,t0,%lo pc-relative trampoline (full 32-bit reach). It runs as a per-function fixpoint after the body is emitted: each growth opens a text gap via the new shared encode.insert_text, which slides the rest of the function down and fixes the block / fixup / symbol / relocation tables, so each rescan remeasures against the grown layout. encode.block_offset is made public so the pass can resolve targets; patch_branch_riscv64 learns the auipc+jalr form. The inverted guard's skip is purely local and resolved in the pass; the trampoline's real-target displacement is left to pass 2 through the repointed fixup.
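The two relaxed forms described above can be sketched as follows. This is a minimal illustration, not the backend's code: `relax_cond`, `relax_jal`, and `split_hi_lo` are hypothetical names, displacements are byte offsets relative to the original instruction's own pc, and the output is assembly-like text rather than encoded words. The ranges are the B-type and J-type immediate ranges from the base ISA.

```python
# Hypothetical helpers illustrating the relaxed instruction sequences.
B_MIN, B_MAX = -4096, 4094                    # B-type: 13-bit signed, even
J_MIN, J_MAX = -(1 << 20), (1 << 20) - 2      # J-type: 21-bit signed, even

# Each conditional branch and its logical inverse.
INVERT = {"beq": "bne", "bne": "beq", "blt": "bge",
          "bge": "blt", "bltu": "bgeu", "bgeu": "bltu"}


def split_hi_lo(disp):
    """Split a pc-relative displacement into auipc (upper 20) and jalr (low 12) parts.

    The +0x800 rounding compensates for jalr sign-extending its 12-bit immediate.
    """
    hi = (disp + 0x800) >> 12
    lo = disp - (hi << 12)
    return hi, lo


def relax_cond(op, rs1, rs2, disp):
    """Conditional branch: keep it if in range, else inverted guard + jal."""
    if B_MIN <= disp <= B_MAX:
        return [f"{op} {rs1}, {rs2}, {disp}"]
    # The guard skips the 4-byte jal (pc+8). The jal sits at pc+4, so its
    # displacement to the real target is 4 bytes shorter.
    return [f"{INVERT[op]} {rs1}, {rs2}, 8", f"jal x0, {disp - 4}"]


def relax_jal(disp):
    """Unconditional jump: keep jal if in range, else auipc + jalr via t0."""
    if J_MIN <= disp <= J_MAX:
        return [f"jal x0, {disp}"]
    hi, lo = split_hi_lo(disp)
    return [f"auipc t0, {hi}", f"jalr x0, {lo}(t0)"]


print(relax_cond("beq", "a0", "a1", 5000))
print(relax_jal(1 << 21))
```

The guard's skip distance is a constant, which is why it can be resolved locally in the pass, while the trampoline's displacement depends on the final layout.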

Gap 2 - frame slots overlap the saved ra/s0 record (#1670)

  • Symptom: any riscv64 function with an address-taken local or a spill / aggregate slot (frame.size > 0) corrupts its own saved return address and segfaults at runtime (SIGSEGV, NULL deref). It byte-encodes fine, so byte-verify never caught it, and the register-only freestanding fixtures never exercised it.
  • Root cause: the prologue pins s0 at the frame top, with the 16-byte ra/s0 record and the callee-saved areas occupying the bytes immediately below it. The shared frame phase assigns local slot offsets just below the frame pointer (the x86 model: locals immediately below fp), so on riscv64 those land on top of the saved record - e.g. var buf:[256]u8 got s0-256 while ra was saved at s0-8, so zeroing the buffer clobbered ra.
  • Fix: bias the s0-relative slot offset down by frame_reserved_top (16 + callee-saved GP + FP bytes) so locals land below the reserved region. Localized to frame_slot_offset, mirrored in the assembly printer. s0 stays at the frame top, so incoming stack-argument access is unchanged.
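The overlap and the bias can be checked with a little interval arithmetic. The sketch below assumes the shared frame phase hands out a slot as a positive distance below the frame pointer; that representation, the function signatures, and the `overlaps` helper are illustrative assumptions, and only the names `frame_slot_offset` and `frame_reserved_top` come from the description above.

```python
def frame_reserved_top(callee_saved_gp_bytes, callee_saved_fp_bytes):
    """Bytes directly below s0 that hold ra/s0 plus callee-saved registers."""
    return 16 + callee_saved_gp_bytes + callee_saved_fp_bytes


def frame_slot_offset(distance_below_fp, reserved_top):
    """s0-relative offset of a local slot, biased below the reserved region."""
    return -(distance_below_fp + reserved_top)


def overlaps(a_lo, a_hi, b_lo, b_hi):
    """True if half-open byte ranges [a_lo, a_hi) and [b_lo, b_hi) intersect."""
    return a_lo < b_hi and b_lo < a_hi


# The [256]u8 example: ra is saved at s0-8, i.e. bytes [-8, 0) relative to s0.
size = 256
ra_lo, ra_hi = -8, 0

# Unbiased (x86 model): the buffer starts at s0-256 and runs up to s0.
old = -size
print(overlaps(old, old + size, ra_lo, ra_hi))   # buffer covers saved ra

# Biased: with no callee-saved registers the reserved region is 16 bytes.
new = frame_slot_offset(size, frame_reserved_top(0, 0))
print(new, overlaps(new, new + size, -16, 0))    # buffer ends at s0-16
```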

Gap 3 - 32-bit and/or/xor encode illegal word-group ops (#1672)

  • Symptom: a 32-bit bitwise and / or / xor encodes an illegal instruction and faults with SIGILL at runtime (surfaced building std crypto / bignum, which are u32-heavy). It byte-encodes without error.
  • Root cause: encode_alu3 selected the major opcode through alu_opcode(width), which returns OP_32 (the RV64 word group) at 32-bit width - but RISC-V defines no andw / orw / xorw (in OP_32, funct3 7 / 6 / 4 with funct7 0 is reserved). add / sub / mul / shifts have valid word forms; only the logical ops do not.
  • Fix: thread a word_form flag through encode_alu3 so mul keeps its mulw word form while and / or / xor always encode as the full-register OP. A bitwise op of two consistently extended 32-bit operands yields a consistently extended result, so the full-register form is correct at every width.
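A small R-type encoder makes the opcode choice concrete. This is a sketch under assumptions: the signature below is hypothetical and only borrows the name `encode_alu3`, the table covers a handful of ops, and the `word_form` decision is modelled as a set lookup rather than a threaded flag. The field layout and the OP / OP_32 opcode values are those of the base ISA.

```python
OP, OP_32 = 0x33, 0x3B   # major opcodes: full-register group, RV64 word group

# op -> (funct3, funct7)
FUNCT = {
    "add": (0, 0x00), "sub": (0, 0x20), "mul": (0, 0x01),
    "xor": (4, 0x00), "or": (6, 0x00), "and": (7, 0x00),
}

# Ops in this table that have a *w variant (addw, subw, mulw).
HAS_WORD_FORM = {"add", "sub", "mul"}


def encode_alu3(op, rd, rs1, rs2, width):
    """Encode a three-register ALU op; logical ops never use OP_32."""
    funct3, funct7 = FUNCT[op]
    word_form = width == 32 and op in HAS_WORD_FORM
    opcode = OP_32 if word_form else OP
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


a0, a1 = 10, 11
print(hex(encode_alu3("and", a0, a0, a1, 32)))   # and a0, a0, a1
print(hex(encode_alu3("mul", a0, a0, a1, 32)))   # mulw a0, a0, a1
```

With this selection a 32-bit `and` produces the same word as a 64-bit `and`, while a 32-bit `mul` still lands in the word group.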

Verification

  • Regression fixture (test/riscv64): extends the freestanding rv64 fixture with a >4 KiB parse_probe (forces long-branch relaxation - verify.sh asserts the inverted-guard + jal sequence), a stack_probe (stack-local frame slot), and a bitmix (32-bit bitwise word-group). Folded with the existing const-shift and RV64A atomics probes into a qemu exit code (70) asserted by verify.sh; the code matches the unrelaxed x86_64 build, so a regression in any fix changes it. verify.sh now disassembles with --mattr=+m,+f,+d,+a (the backend emits no .riscv.attributes ISA string yet, so objdump must be told the extensions) and reads grep inputs from here-strings (a large disassembly tripped a SIGPIPE under pipefail).
  • Real std consumers cross-compiled for linux-riscv64 and run under qemu, each matching the native x86_64 result: sha256 / sha512 / bignum / wide mul-div-rem; math.bits (clz/ctz/popcount/rotate/byteswap/bitreverse) / math (min/max/abs/log2/sat_*) / fnv1a / RV64D float (add/mul/div/cvt/cmp) / 32-bit div-rem word forms; and a heavy combined sha256-over-4KiB x8 + bignum workload.
  • Self-host fixpoint holds (mach build . -o a && a build . -o b && b build . -o c && cmp b c -> b == c byte-identical).
  • mach test . passes (666 passed, 0 failed). x86_64 and aarch64 unaffected (the fixpoint + existing cross lanes).

Minor follow-up observation (not in scope here, does not affect qemu execution): the riscv64 object writer emits no .riscv.attributes ISA-string section, so external tools default to base RV64I and render valid M/F/D/A words as <unknown> unless --mattr is passed.

🤖 Generated with Claude Code

octalide added 5 commits June 27, 2026 00:04
RISC-V B-type conditional branches reach only +-4 KiB and J-type jumps
only +-1 MiB, so a function whose encoded body exceeds those ranges could
not encode: the patcher rejected the overflowing displacement instead of
relaxing it. A large stack frame (>2 KiB offsets expanding into
multi-instruction sequences) inflates a body past 4 KiB and overflows a
conditional branch spanning it.

Add branch relaxation to the riscv64 encoder. A conditional past +-4 KiB
becomes its inverted guard (a short branch skipping the trampoline) plus a
jal to the real target; a jal past +-1 MiB becomes an auipc+jalr
pc-relative trampoline. The relaxation runs as a per-function fixpoint
after the body is emitted: each growth opens a text gap that slides the
rest of the function down and fixes the block / fixup / symbol /
relocation tables, so each rescan remeasures against the grown layout. The
new shared encode.insert_text owns the ISA-general gap mechanism;
encode.block_offset is made public so the pass can resolve targets.

Closes #1666
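The reason a fixpoint is needed is that growing one branch can push another out of range. The toy model below is a sketch only: it recomputes offsets from scratch on each pass instead of sliding text with encode.insert_text, it handles only the conditional case, and the item representation and the name `relax_fixpoint` are assumptions made for illustration.

```python
B_MIN, B_MAX = -4096, 4094   # B-type conditional branch reach in bytes


def relax_fixpoint(items):
    """Grow out-of-range conditional branches until the layout is stable.

    items: list of ("insn", nbytes) or ("br", target_index); a target index
    equal to len(items) means the end of the function.
    Returns (set of relaxed branch indices, final offsets incl. end offset).
    """
    relaxed = set()
    while True:
        # Measure the current layout: a relaxed branch is guard + jal (8 bytes).
        offsets, pc = [], 0
        for i, (kind, arg) in enumerate(items):
            offsets.append(pc)
            pc += (8 if i in relaxed else 4) if kind == "br" else arg
        offsets.append(pc)

        grew = False
        for i, (kind, arg) in enumerate(items):
            if kind == "br" and i not in relaxed:
                disp = offsets[arg] - offsets[i]
                if not (B_MIN <= disp <= B_MAX):
                    relaxed.add(i)      # grow-only, so the loop terminates
                    grew = True
                    break               # rescan against the grown layout
        if not grew:
            return relaxed, offsets


# Branch 0 is in range at first (4092 bytes) but is pushed out of range
# once branch 1 has been relaxed and grown by 4 bytes.
items = [("br", 3), ("br", 5), ("insn", 4084), ("insn", 4), ("insn", 4)]
print(relax_fixpoint(items))
```

Because a branch is only ever grown and never shrunk, the number of passes is bounded by the number of branches in the function.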
Any riscv64 function with an address-taken local or a spill / aggregate
slot (frame.size > 0) corrupted its own saved return address and
segfaulted at runtime. The prologue pins s0 at the frame top, with the
16-byte ra / s0 record and the callee-saved areas occupying the bytes
immediately below it, but the shared frame phase assigns local slot
offsets just below the frame pointer (the x86 model). On riscv64 those
landed on top of the saved record - e.g. a [256]u8 buffer at s0-256 while
ra was saved at s0-8, so zeroing the buffer clobbered ra.

Bias the s0-relative slot offset down by frame_reserved_top (16 +
callee-saved GP + FP bytes) so locals land below the reserved region.
Localized to frame_slot_offset, mirrored in the assembly printer; s0 stays
at the frame top so incoming stack-argument access is unchanged. The bug
byte-encoded fine, so byte-verify never caught it, and the register-only
freestanding fixtures never exercised it.
Extend the freestanding rv64 fixture with two probes folded into the qemu
exit code, so a regression in either fix changes the asserted code:

- parse_probe: a deterministic parse over a [2048]u8 stack buffer. The
  large frame inflates the encoded body past 4KiB and leaves a conditional
  branch out of the B-type +-4KiB range, exercising long-branch relaxation
  (#1666). verify.sh asserts the inverted-guard + jal sequence is present.
- stack_probe: a small stack-local buffer written then read back,
  exercising frame-slot addressing (#1670).

The result matches the unrelaxed x86_64 / aarch64 build (exit code 68).

verify.sh: disassemble with --mattr=+m,+f,+d so the <unknown> guard does
not false-positive on valid M / F / D words (the backend emits no
.riscv.attributes ISA string yet), and read grep inputs from here-strings
so a large disassembly does not trip a SIGPIPE under pipefail.
…h-relax

# Conflicts:
#	test/riscv64/src/main.mach
#	test/riscv64/verify.sh
A 32-bit bitwise and / or / xor encoded an illegal instruction and faulted
with SIGILL at runtime (surfaced building std crypto / bignum, which are
u32-heavy). encode_alu3 selected the major opcode through alu_opcode(width),
which returns OP_32 (the RV64 word group) at 32-bit width - but RISC-V
defines no andw / orw / xorw, so the word matched no encoding (funct3 7/6/4
with funct7 0 in OP_32 is reserved).

Bitwise ops have no word form: they are bit-for-bit identical at any width,
and a bitwise op of two consistently extended 32-bit operands yields a
consistently extended result. Thread a word_form flag through encode_alu3 so
mul keeps its mulw word form while and / or / xor always encode as the
full-register OP. add / sub / shifts / mul / div already use valid word
forms, so only the logical ops were affected.
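The extension argument can be checked exhaustively over a few edge values. The sketch below models 64-bit registers as Python integers; `sext32`, `zext32`, and `check` are illustrative helpers, not backend functions. It verifies that applying the full-register op to two consistently extended operands gives the same register value as extending the 32-bit result.

```python
import operator


def sext32(x):
    """64-bit register image of a 32-bit value, sign-extended."""
    x &= 0xFFFFFFFF
    return x | 0xFFFFFFFF00000000 if x & 0x80000000 else x


def zext32(x):
    """64-bit register image of a 32-bit value, zero-extended."""
    return x & 0xFFFFFFFF


def check(ext, samples):
    """True if full-register and/or/xor preserves the extension `ext`."""
    for a in samples:
        for b in samples:
            for op in (operator.and_, operator.or_, operator.xor):
                full = op(ext(a), ext(b))          # 64-bit op on extended inputs
                narrow = op(a, b) & 0xFFFFFFFF     # the intended 32-bit result
                if full != ext(narrow):
                    return False
    return True


SAMPLES = [0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF, 0xDEADBEEF]
print(check(sext32, SAMPLES), check(zext32, SAMPLES))
```

The property holds because each upper bit of an extended operand is a copy of either bit 31 or zero, and a bitwise op acts on every bit position independently.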

Verified: a std sha256 / sha512 / bignum consumer cross-compiled for
linux-riscv64 now runs under qemu to the same result as the native build,
and a 32-bit bitwise probe in the fixture matches the native value.
@octalide octalide marked this pull request as ready for review June 27, 2026 04:33
@octalide octalide merged commit 5afdbdc into dev Jun 27, 2026
10 checks passed
@octalide octalide deleted the fix/1666-riscv64-branch-relax branch June 27, 2026 04:34