
vlseg<NF>e<EEW> with NF > 1 and EEW != EW8 corrupts vd+1..vd+nf on subsequent reads #453

@sjalloq

Description

Full disclosure: the following bug report and the related PR were authored by Claude Code. I've been using Claude to try to learn more about Ara, and a simple test program failed; the model managed to debug it and suggest a fix.

I tried to rerun all tests, but quite a few hang due to other bugs present on master. It would be good to define a regression suite that actually passes, so users could check that their changes haven't regressed the current state.

Summary

vlseg<NF>e<EEW>.v vd, (rs1) writes the data into vd, vd+1, ..., vd+nf correctly, but a subsequent read of vd+1..vd+nf at the same
EEW returns byte-packed garbage.

The cause is the per-vreg EEW tracker (eew_d[]) in
hardware/src/ara_dispatcher.sv: it only ever tags the base vd,
leaving vd+1..vd+nf at the reset-default EW8. The next instruction
that reads them sees an eew_vs* != old_eew_vs* mismatch and
triggers an on-read EEW shuffle that stripes the bytes across lanes.

Configuration where reproduced

  • Branch: main at 826f57c ([ci] [toolchain] Fix verilator update)
  • nr_lanes = 4
  • vlen = 4096
  • Verilator 5.048
  • riscv64-unknown-elf-gcc 13.2.0

Reproducer

apps/seg_test/main.c (added in accompanying PR):

asm volatile ("vsetivli x0, 4, e16, m1, ta, ma");
asm volatile ("vlseg4e16.v v8, (%0)" :: "r"(in_buf) : "memory");
asm volatile ("vse16.v v8,  (%0)"   :: "r"(v8_out)  : "memory");
asm volatile ("vse16.v v9,  (%0)"   :: "r"(v9_out)  : "memory");
asm volatile ("vse16.v v10, (%0)"   :: "r"(v10_out) : "memory");
asm volatile ("vse16.v v11, (%0)"   :: "r"(v11_out) : "memory");

in_buf[4*e + s] = base[e] + s for elements e = 0..3 and fields
s = 0..3, where base = {1,11,21,31}; i.e. in_buf =
{1,2,3,4, 11,12,13,14, 21,22,23,24, 31,32,33,34}.

Build and run on Verilator:

cd apps     && make bin/seg_test
cd hardware && make verilate && make simv app=seg_test

Observed (Ara, broken)

element  v8  v9   v10  v11
  [0]    1   3074 3331 3588
  [1]    11  8214 8471 8728
  [2]    21  0    0    0
  [3]    31  0    0    0
*** seg_test FAIL: 12 mismatches ***

Decoding the v9 column:

index   value    low byte                     high byte
v9[0]   0x0C02   0x02 (LSB of in_buf[1]=2)    0x0C (LSB of in_buf[5]=12)
v9[1]   0x2016   0x16 (LSB of in_buf[9]=22)   0x20 (LSB of in_buf[13]=32)
v9[2]   0x0000   unwritten                    unwritten
v9[3]   0x0000   unwritten                    unwritten

Same byte-packing pattern for v10 and v11 (using elements 2/6/10/14
and 3/7/11/15 respectively). v8 is correct.

Expected (Spike, same ELF passes)

Building with make bin/seg_test.spike and running via
make spike-run-seg_test:

element  v8  v9  v10 v11
  [0]    1   2   3   4
  [1]    11  12  13  14
  [2]    21  22  23  24
  [3]    31  32  33  34
*** seg_test PASS ***

Spike does not maintain a per-vreg EEW tracker, so the on-read
shuffle does not exist and the data is read out cleanly.

Root cause

hardware/src/ara_dispatcher.sv around line 3700:

// Update the EEW
if (ara_req_valid_d && ara_req.use_vd && ara_req_ready_i) begin
  unique case (ara_req.emul)
    LMUL_1: begin
      for (int i = 0; i < 1; i++) begin
        eew_d[ara_req.vd + i]       = ara_req.vtype.vsew;
        eew_valid_d[ara_req.vd + i] = 1'b1;
      end
    end
    LMUL_2: begin ... 2 ... end
    LMUL_4: begin ... 4 ... end
    LMUL_8: begin ... 8 ... end
    default: begin ... 1 ... end
  endcase
end

For vlseg4e16.v v8 the original instruction has vd = 8,
emul = LMUL_1, nf = 3. The segment_sequencer downstream emits 16
micro-ops ((nf+1) * vl = 4 * 4) and each pulses ara_req_valid_d,
but the loop above only ever updates eew_d[8] (the base vd of
the original request). eew_d[9], eew_d[10], eew_d[11] stay at
the reset-default EW8 (line 201:
eew_q <= '{default: rvv_pkg::EW8}).

When the subsequent vse16.v v9 is dispatched:

ara_req.old_eew_vs1 = eew_q[insn.vmem_type.rd];   // = eew_q[9] = EW8

The eew_vs1 = EW16 vs old_eew_vs1 = EW8 mismatch triggers an
on-read shuffle that converts the bytes from the EW8 layout it
never actually had into the EW16 layout being requested,
byte-packing the data across lanes.

instruction   source vreg   eew_vs1   old_eew_vs1
vse16.v v8    8             EW16      EW16   ← tagged correctly
vse16.v v9    9             EW16      EW8    ← never tagged, default
vse16.v v10   10            EW16      EW8
vse16.v v11   11            EW16      EW8

I confirmed via FST trace that every load micro-op's writeback
lands on the correct lane with the correct address and byte enable
(ldu_result_addr=0x080..0x0b0, be=0x03, wdata correct), so the
data is in the VRF correctly. The corruption happens at the next
read, not at the load.

Fix

Scale the tracker update loop by a factor of (nf+1) for segment
loads/stores, so that every destination vreg gets tagged. Patch on
hardware/src/ara_dispatcher.sv (single hunk, replacing the case
statement above):

 if (ara_req_valid_d && ara_req.use_vd && ara_req_ready_i) begin
+  automatic int unsigned regs_per_emul;
+  automatic int unsigned regs_to_tag;
+  automatic logic        is_seg_mem_op;
   unique case (ara_req.emul)
-    LMUL_1: begin
-      for (int i = 0; i < 1; i++) begin
-        eew_d[ara_req.vd + i]       = ara_req.vtype.vsew;
-        eew_valid_d[ara_req.vd + i] = 1'b1;
-      end
-    end
-    LMUL_2: begin ... 2 ... end
-    LMUL_4: begin ... 4 ... end
-    LMUL_8: begin ... 8 ... end
-    default: begin ... 1 ... end
+    LMUL_1:  regs_per_emul = 1;
+    LMUL_2:  regs_per_emul = 2;
+    LMUL_4:  regs_per_emul = 4;
+    LMUL_8:  regs_per_emul = 8;
+    default: regs_per_emul = 1; // EMUL < 1
   endcase
+  is_seg_mem_op = (ara_req.op inside {VLE, VLSE, VLXE, VSE, VSSE, VSXE})
+               && (ara_req.nf != 3'b000);
+  regs_to_tag = is_seg_mem_op ? regs_per_emul * (int'(ara_req.nf) + 1)
+                              : regs_per_emul;
+  for (int i = 0; i < 32; i++) begin
+    if (i < regs_to_tag) begin
+      eew_d[ara_req.vd + i]       = ara_req.vtype.vsew;
+      eew_valid_d[ara_req.vd + i] = 1'b1;
+    end
+  end
 end

After the fix apps/seg_test PASSes on the same configuration.

The matching segment-store ops (VSE/VSSE/VSXE with nf != 0) do
not actually write vd, so they don't strictly need the extended
tag. They are kept in the inside {...} set for symmetry; drop
them from the set if you prefer to keep the loop tight.

Why the existing rv64uv segment tests pass

apps/riscv-tests/isa/rv64uv/vlseg.c does test vlseg4e16.v (line
113) and would otherwise hit this bug. It passes because every test
case calls VCLEAR(vN) on each destination register before the
segment load:

VSET(4, e16, m1);
VCLEAR(v1);   // -> vmv.v.i v1, 0  at e16
VCLEAR(v2);   // -> vmv.v.i v2, 0  at e16
VCLEAR(v3);   // -> vmv.v.i v3, 0  at e16
VCLEAR(v4);   // -> vmv.v.i v4, 0  at e16
asm volatile("vlseg4e16.v v1, (%0)" ::"r"(INP1));
VCMP_U16(23, v1, ...);

VCLEAR performs vmv.v.i v_reg, 0 at the test's current vsew,
which side-effects the EEW tracker into the matching state for
v_reg. By the time vlseg4e16.v v1 runs, the tracker for v1, v2,
v3, v4 is already EW16. The buggy load only re-tags v1 (the base
vd), but v2, v3, v4 were already EW16 from the VCLEAR, so the
follow-up VCMP_U16 reads see no tracker mismatch and no shuffle.

Test code that does not pre-tag the destination registers (the
common case for real workloads) surfaces the bug.

Suggested regression coverage

apps/seg_test/main.c deliberately does NOT VCLEAR v8..v11 before
the load, so it would catch a regression of this fix. It could be
added to the rv64uv test set as a complement to the existing
VCLEAR-prefixed tests.
