Full disclosure: the following bug report and related PR are authored by Claude Code. I've been using Claude to try to learn more about Ara, and a simple test program failed. The model managed to debug it and suggest a fix.
I tried to rerun all tests, but quite a few hang due to other bugs present on master. It would be good if you could define a regression suite that actually passes, so users could test that their changes haven't regressed the current state.
Summary
vlseg<NF>e<EEW>.v vd, (rs1) writes the data into vd, vd+1, ..., vd+nf correctly, but a subsequent read of vd+1..vd+nf at the same EEW returns byte-packed garbage.
The cause is the per-vreg EEW tracker (eew_d[]) in
hardware/src/ara_dispatcher.sv: it only ever tags the base vd,
leaving vd+1..vd+nf at the reset-default EW8. The next instruction
that reads them sees an eew_vs* != old_eew_vs* mismatch and
triggers an on-read EEW shuffle that stripes the bytes across lanes.
Configuration where reproduced
- Branch: main at 826f57c ([ci] [toolchain] Fix verilator update)
- nr_lanes = 4
- vlen = 4096
- Verilator 5.048
- riscv64-unknown-elf-gcc 13.2.0
Reproducer
apps/seg_test/main.c (added in accompanying PR):
asm volatile ("vsetivli x0, 4, e16, m1, ta, ma");
asm volatile ("vlseg4e16.v v8, (%0)" :: "r"(in_buf) : "memory");
asm volatile ("vse16.v v8, (%0)" :: "r"(v8_out) : "memory");
asm volatile ("vse16.v v9, (%0)" :: "r"(v9_out) : "memory");
asm volatile ("vse16.v v10, (%0)" :: "r"(v10_out) : "memory");
asm volatile ("vse16.v v11, (%0)" :: "r"(v11_out) : "memory");
in_buf[4*e + f] = base[e] + f for element indices e = 0..3 and fields f = 0..3, where base = {1,11,21,31}; i.e. in_buf = {1,2,3,4, 11,12,13,14, 21,22,23,24, 31,32,33,34}.
Build and run on Verilator:
cd apps && make bin/seg_test
cd hardware && make verilate && make simv app=seg_test
Observed (Ara, broken)
element v8 v9 v10 v11
[0] 1 3074 3331 3588
[1] 11 8214 8471 8728
[2] 21 0 0 0
[3] 31 0 0 0
*** seg_test FAIL: 12 mismatches ***
Decoding the v9 column:

| index | value | low byte | high byte |
|---|---|---|---|
| v9[0] | 0x0C02 | 0x02 (LSB of in_buf[1]=2) | 0x0C (LSB of in_buf[5]=12) |
| v9[1] | 0x2016 | 0x16 (LSB of in_buf[9]=22) | 0x20 (LSB of in_buf[13]=32) |
| v9[2] | 0x0000 | unwritten | unwritten |
| v9[3] | 0x0000 | unwritten | unwritten |
Same byte-packing pattern for v10 and v11 (using elements 2/6/10/14
and 3/7/11/15 respectively). v8 is correct.
Expected (Spike, same ELF passes)
Building with make bin/seg_test.spike and running via
make spike-run-seg_test:
element v8 v9 v10 v11
[0] 1 2 3 4
[1] 11 12 13 14
[2] 21 22 23 24
[3] 31 32 33 34
*** seg_test PASS ***
Spike does not maintain a per-vreg EEW tracker, so the on-read
shuffle does not exist and the data is read out cleanly.
Root cause
hardware/src/ara_dispatcher.sv around line 3700:
// Update the EEW
if (ara_req_valid_d && ara_req.use_vd && ara_req_ready_i) begin
unique case (ara_req.emul)
LMUL_1: begin
for (int i = 0; i < 1; i++) begin
eew_d[ara_req.vd + i] = ara_req.vtype.vsew;
eew_valid_d[ara_req.vd + i] = 1'b1;
end
end
LMUL_2: begin ... 2 ... end
LMUL_4: begin ... 4 ... end
LMUL_8: begin ... 8 ... end
default: begin ... 1 ... end
endcase
end
For vlseg4e16.v v8 the original instruction has vd = 8,
emul = LMUL_1, nf = 3. The segment_sequencer downstream emits 16
micro-ops ((nf+1) * vl = 4 * 4) and each pulses ara_req_valid_d,
but the loop above only ever updates eew_d[8] (the base vd of
the original request). eew_d[9], eew_d[10], eew_d[11] stay at
the reset-default EW8 (line 201:
eew_q <= '{default: rvv_pkg::EW8}).
When the subsequent vse16.v v9 is dispatched:
ara_req.old_eew_vs1 = eew_q[insn.vmem_type.rd]; // = eew_q[9] = EW8
The eew_vs1 = EW16 vs old_eew_vs1 = EW8 mismatch triggers an
on-read shuffle that converts the bytes from the EW8 layout it
never actually had into the EW16 layout being requested,
byte-packing the data across lanes.
| vse16 | source vreg | eew_vs1 | old_eew_vs1 |
|---|---|---|---|
| vse16.v v8 | 8 | EW16 | EW16 ← tagged correctly |
| vse16.v v9 | 9 | EW16 | EW8 ← never tagged, default |
| vse16.v v10 | 10 | EW16 | EW8 |
| vse16.v v11 | 11 | EW16 | EW8 |
I confirmed via FST trace that every load micro-op's writeback
lands on the correct lane with the correct address and byte enable
(ldu_result_addr=0x080..0x0b0, be=0x03, wdata correct), so the
data is in the VRF correctly. The corruption happens at the next
read, not at the load.
Fix
Extend the tracker update loop to cover (nf+1) register groups for segment loads/stores, so all destination vregs get tagged. Patch on
hardware/src/ara_dispatcher.sv (single hunk, replacing the case
statement above):
if (ara_req_valid_d && ara_req.use_vd && ara_req_ready_i) begin
+ automatic int unsigned regs_per_emul;
+ automatic int unsigned regs_to_tag;
+ automatic logic is_seg_mem_op;
unique case (ara_req.emul)
- LMUL_1: begin
- for (int i = 0; i < 1; i++) begin
- eew_d[ara_req.vd + i] = ara_req.vtype.vsew;
- eew_valid_d[ara_req.vd + i] = 1'b1;
- end
- end
- LMUL_2: begin ... 2 ... end
- LMUL_4: begin ... 4 ... end
- LMUL_8: begin ... 8 ... end
- default: begin ... 1 ... end
+ LMUL_1: regs_per_emul = 1;
+ LMUL_2: regs_per_emul = 2;
+ LMUL_4: regs_per_emul = 4;
+ LMUL_8: regs_per_emul = 8;
+ default: regs_per_emul = 1; // EMUL < 1
endcase
+ is_seg_mem_op = (ara_req.op inside {VLE, VLSE, VLXE, VSE, VSSE, VSXE})
+ && (ara_req.nf != 3'b000);
+ regs_to_tag = is_seg_mem_op ? regs_per_emul * (int'(ara_req.nf) + 1)
+ : regs_per_emul;
+ for (int i = 0; i < 32; i++) begin
+ if (i < regs_to_tag) begin
+ eew_d[ara_req.vd + i] = ara_req.vtype.vsew;
+ eew_valid_d[ara_req.vd + i] = 1'b1;
+ end
+ end
end
After the fix, apps/seg_test PASSes on the same configuration.
The matching segment-store ops (VSE/VSSE/VSXE with nf != 0) do
not actually write vd, so they don't strictly need the extended
tag. They are kept in the inside {...} set for symmetry; drop
them from the set if you prefer to keep the loop tight.
Why the existing rv64uv segment tests pass
apps/riscv-tests/isa/rv64uv/vlseg.c does test vlseg4e16.v (line
113) and would otherwise hit this bug. It passes because every test
case calls VCLEAR(vN) on each destination register before the
segment load:
VSET(4, e16, m1);
VCLEAR(v1); // -> vmv.v.i v1, 0 at e16
VCLEAR(v2); // -> vmv.v.i v2, 0 at e16
VCLEAR(v3); // -> vmv.v.i v3, 0 at e16
VCLEAR(v4); // -> vmv.v.i v4, 0 at e16
asm volatile("vlseg4e16.v v1, (%0)" ::"r"(INP1));
VCMP_U16(23, v1, ...);
VCLEAR performs vmv.v.i v_reg, 0 at the test's current vsew,
which side-effects the EEW tracker into the matching state for
v_reg. By the time vlseg4e16.v v1 runs, the tracker for v1, v2,
v3, v4 is already EW16. The buggy load only re-tags v1 (the base
vd), but v2, v3, v4 were already EW16 from the VCLEAR, so the
follow-up VCMP_U16 reads see no tracker mismatch and no shuffle.
Test code that does not pre-tag the destination registers (the
common case for real workloads) surfaces the bug.
Suggested regression coverage
apps/seg_test/main.c deliberately does NOT VCLEAR v8..v11 before
the load, so it would catch a regression of this fix. It could be
added to the rv64uv test set as a complement to the existing
VCLEAR-prefixed tests.