
vlseg<NF>e<EEW> with NF > 1 and EEW != EW8 corrupts vd+1..vd+nf on subsequent reads #453

@sjalloq

Description

Full disclosure: the following bug report and the related PR were authored by Claude Code. I've been using Claude to try to learn more about Ara, and a simple test program failed; the model managed to debug it and suggest a fix.

I tried to rerun all tests, but quite a few hang due to other bugs present on master. It would be good to define a regression suite that actually passes, so users could check that their changes haven't regressed the current state.

Summary

vlseg<NF>e<EEW>.v vd, (rs1) writes the data into vd, vd+1, ..., vd+nf correctly, but a subsequent read of vd+1..vd+nf at the same
EEW returns byte-packed garbage.

The cause is the per-vreg EEW tracker (eew_d[]) in
hardware/src/ara_dispatcher.sv: it only ever tags the base vd,
leaving vd+1..vd+nf at the reset-default EW8. The next instruction
that reads them sees an eew_vs* != old_eew_vs* mismatch and
triggers an on-read EEW shuffle that stripes the bytes across lanes.

Configuration where reproduced

  • Branch: main at 826f57c ([ci] [toolchain] Fix verilator update)
  • nr_lanes = 4
  • vlen = 4096
  • Verilator 5.048
  • riscv64-unknown-elf-gcc 13.2.0

Reproducer

apps/seg_test/main.c (added in accompanying PR):

asm volatile ("vsetivli x0, 4, e16, m1, ta, ma");
asm volatile ("vlseg4e16.v v8, (%0)" :: "r"(in_buf) : "memory");
asm volatile ("vse16.v v8,  (%0)"   :: "r"(v8_out)  : "memory");
asm volatile ("vse16.v v9,  (%0)"   :: "r"(v9_out)  : "memory");
asm volatile ("vse16.v v10, (%0)"   :: "r"(v10_out) : "memory");
asm volatile ("vse16.v v11, (%0)"   :: "r"(v11_out) : "memory");

in_buf[4*e + s] = base[e] + s for elements e = 0..3 and fields
s = 0..3, where base = {1,11,21,31}; i.e. in_buf =
{1,2,3,4, 11,12,13,14, 21,22,23,24, 31,32,33,34}.

Build and run on Verilator:

cd apps     && make bin/seg_test
cd hardware && make verilate && make simv app=seg_test

Observed (Ara, broken)

element  v8  v9   v10  v11
  [0]    1   3074 3331 3588
  [1]    11  8214 8471 8728
  [2]    21  0    0    0
  [3]    31  0    0    0
*** seg_test FAIL: 12 mismatches ***

Decoding the v9 column:

index   value    low byte                     high byte
v9[0]   0x0C02   0x02 (LSB of in_buf[1]=2)    0x0C (LSB of in_buf[5]=12)
v9[1]   0x2016   0x16 (LSB of in_buf[9]=22)   0x20 (LSB of in_buf[13]=32)
v9[2]   0x0000   unwritten                    unwritten
v9[3]   0x0000   unwritten                    unwritten

Same byte-packing pattern for v10 and v11 (using elements 2/6/10/14
and 3/7/11/15 respectively). v8 is correct.

Expected (Spike, same ELF passes)

Building with make bin/seg_test.spike and running via
make spike-run-seg_test:

element  v8  v9  v10 v11
  [0]    1   2   3   4
  [1]    11  12  13  14
  [2]    21  22  23  24
  [3]    31  32  33  34
*** seg_test PASS ***

Spike does not maintain a per-vreg EEW tracker, so the on-read
shuffle does not exist and the data is read out cleanly.

Root cause

hardware/src/ara_dispatcher.sv around line 3700:

// Update the EEW
if (ara_req_valid_d && ara_req.use_vd && ara_req_ready_i) begin
  unique case (ara_req.emul)
    LMUL_1: begin
      for (int i = 0; i < 1; i++) begin
        eew_d[ara_req.vd + i]       = ara_req.vtype.vsew;
        eew_valid_d[ara_req.vd + i] = 1'b1;
      end
    end
    LMUL_2: begin ... 2 ... end
    LMUL_4: begin ... 4 ... end
    LMUL_8: begin ... 8 ... end
    default: begin ... 1 ... end
  endcase
end

For vlseg4e16.v v8 the original instruction has vd = 8,
emul = LMUL_1, nf = 3. The segment_sequencer downstream emits 16
micro-ops ((nf+1) * vl = 4 * 4) and each pulses ara_req_valid_d,
but the loop above only ever updates eew_d[8] (the base vd of
the original request). eew_d[9], eew_d[10], eew_d[11] stay at
the reset-default EW8 (line 201:
eew_q <= '{default: rvv_pkg::EW8}).

When the subsequent vse16.v v9 is dispatched:

ara_req.old_eew_vs1 = eew_q[insn.vmem_type.rd];   // = eew_q[9] = EW8

The eew_vs1 = EW16 vs old_eew_vs1 = EW8 mismatch triggers an
on-read shuffle that converts the bytes from the EW8 layout it
never actually had into the EW16 layout being requested,
byte-packing the data across lanes.

instruction   source vreg   eew_vs1   old_eew_vs1
vse16.v v8    8             EW16      EW16   ← tagged correctly
vse16.v v9    9             EW16      EW8    ← never tagged, default
vse16.v v10   10            EW16      EW8
vse16.v v11   11            EW16      EW8

I confirmed via FST trace that every load micro-op's writeback
lands on the correct lane with the correct address and byte enable
(ldu_result_addr=0x080..0x0b0, be=0x03, wdata correct), so the
data is in the VRF correctly. The corruption happens at the next
read, not at the load.

Fix

Scale the tracker update loop by a factor of (nf+1) for segment
loads/stores, so that every destination vreg gets tagged. Patch on
hardware/src/ara_dispatcher.sv (single hunk, replacing the case
statement above):

 if (ara_req_valid_d && ara_req.use_vd && ara_req_ready_i) begin
+  automatic int unsigned regs_per_emul;
+  automatic int unsigned regs_to_tag;
+  automatic logic        is_seg_mem_op;
   unique case (ara_req.emul)
-    LMUL_1: begin
-      for (int i = 0; i < 1; i++) begin
-        eew_d[ara_req.vd + i]       = ara_req.vtype.vsew;
-        eew_valid_d[ara_req.vd + i] = 1'b1;
-      end
-    end
-    LMUL_2: begin ... 2 ... end
-    LMUL_4: begin ... 4 ... end
-    LMUL_8: begin ... 8 ... end
-    default: begin ... 1 ... end
+    LMUL_1:  regs_per_emul = 1;
+    LMUL_2:  regs_per_emul = 2;
+    LMUL_4:  regs_per_emul = 4;
+    LMUL_8:  regs_per_emul = 8;
+    default: regs_per_emul = 1; // EMUL < 1
   endcase
+  is_seg_mem_op = (ara_req.op inside {VLE, VLSE, VLXE, VSE, VSSE, VSXE})
+               && (ara_req.nf != 3'b000);
+  regs_to_tag = is_seg_mem_op ? regs_per_emul * (int'(ara_req.nf) + 1)
+                              : regs_per_emul;
+  for (int i = 0; i < 32; i++) begin
+    if (i < regs_to_tag) begin
+      eew_d[ara_req.vd + i]       = ara_req.vtype.vsew;
+      eew_valid_d[ara_req.vd + i] = 1'b1;
+    end
+  end
 end

After the fix apps/seg_test PASSes on the same configuration.

The matching segment-store ops (VSE/VSSE/VSXE with nf != 0) do
not actually write vd, so they don't strictly need the extended
tag. They are kept in the inside {...} set for symmetry; drop
them from the set if you prefer to keep the loop tight.

Why the existing rv64uv segment tests pass

apps/riscv-tests/isa/rv64uv/vlseg.c does test vlseg4e16.v (line
113) and would otherwise hit this bug. It passes because every test
case calls VCLEAR(vN) on each destination register before the
segment load:

VSET(4, e16, m1);
VCLEAR(v1);   // -> vmv.v.i v1, 0  at e16
VCLEAR(v2);   // -> vmv.v.i v2, 0  at e16
VCLEAR(v3);   // -> vmv.v.i v3, 0  at e16
VCLEAR(v4);   // -> vmv.v.i v4, 0  at e16
asm volatile("vlseg4e16.v v1, (%0)" ::"r"(INP1));
VCMP_U16(23, v1, ...);

VCLEAR performs vmv.v.i v_reg, 0 at the test's current vsew,
which side-effects the EEW tracker into the matching state for
v_reg. By the time vlseg4e16.v v1 runs, the tracker for v1, v2,
v3, v4 is already EW16. The buggy load only re-tags v1 (the base
vd), but v2, v3, v4 were already EW16 from the VCLEAR, so the
follow-up VCMP_U16 reads see no tracker mismatch and no shuffle.

Test code that does not pre-tag the destination registers (the
common case for real workloads) surfaces the bug.

Suggested regression coverage

apps/seg_test/main.c deliberately does NOT VCLEAR v8..v11 before
the load, so it would catch a regression of this fix. It could be
added to the rv64uv test set as a complement to the existing
VCLEAR-prefixed tests.
