Skip to content

Deeper SQ (MAX_QP_WR=16) + honor rnr_retry=7 as infinite RNR retry + fix CI test suite#3

Open
ruck314 wants to merge 6 commits into
FilMarini:masterfrom
ruck314:feat/deeper-sq-rnr-infinite
Open

Deeper SQ (MAX_QP_WR=16) + honor rnr_retry=7 as infinite RNR retry + fix CI test suite#3
ruck314 wants to merge 6 commits into
FilMarini:masterfrom
ruck314:feat/deeper-sq-rnr-infinite

Conversation

@ruck314

@ruck314 ruck314 commented Jun 18, 2026

Copy link
Copy Markdown

Two independent changes from running this core in a SLAC KCU105 10GbE RoCEv2 build (SEND-with-immediate datapath), plus a CI test-suite fix. Draft for discussion — the commits are independent and can be reviewed/taken separately.

fix: honor rnr_retry=7 as infinite RNR retry

RNR retry was gated only by disableRetryCntReg (set when max_retry_cnt == 7). A QP configured with rnr_retry == 7 (IBV "infinite") but a finite retry_count would still exhaust rnrCntReg and fault. This adds a dedicated disableRnrCntReg (set when max_rnr_cnt == INFINITE_RETRY) gating both the rnrCntReg decrement and the RNR retry-limit-exceeded check, mirroring the regular-retry path.

feat: raise MAX_QP_WR 4 -> 16

A SEND-only datapath is bandwidth-delay-product limited: with ~4 SENDs in flight against the SEND->completion latency, throughput saturates well below line rate. Raising MAX_QP_WR to 16 deepens the SQ pending-WR window. On a KCU105 10GbE build this lifts RDMA-SEND from ~8.1 to ~9.75 Gb/s (line rate) at 4096 B PMTU, with PRBS integrity intact and timing closed. MAX_QP_WR stays a power of 2; derived sizes (MAX_CQE, MAX_PENDING_WORK_COMP_NUM, ...) scale with it. The depth value could be made a parameter if preferred.

test: fix CI test suite

The BlueSim CI (run.sh -> Makefile.test) did not compile/run on this branch, so the two changes above were landing without test coverage. This restores the registered suite to green and revives the cocotb AXI-Stream integration test.

  • Build/CI: add src/ to the bsc package search path for test builds (Makefile.base); actions/checkout@v4 with submodules: recursive so the blue-wrapper submodule (required on the bsc path) is fetched on the runner.
  • SQ test-harness fixes to match the current source: single-element SGLs (MAX_SGE == 1), 4-arg mkRespHandleSQ, mkWorkCompGenSQ no longer waits on a DMA-write response, and dropped references to the removed request-ingress pipe (reqPktPipeOut / statusRQ).
  • Skip RQ-dependent tests: the receive-queue datapath is disabled in the source, so TestQueuePair, TestReqHandleRQ, TestTransportLayer, and the request-ingress-only cases are de-registered in Makefile.test (sources preserved, with revival instructions, for when RQ is re-enabled).
  • cocotb: use LogicArray.to_unsigned() instead of the deprecated .integer getter (cocotb 2.x); add test/cocotb/requirements.txt documenting the Python deps (incl. bitstring) for the manually-run cocotb flow.

ruck314 added 3 commits June 18, 2026 08:57
The RNR retry counter was gated only by disableRetryCntReg (the regular
retry-count infinite flag, set when max_retry_cnt == 7). A QP configured
with rnr_retry == 7 (IBV "infinite") but a finite retry_count would still
exhaust rnrCntReg and fault. Add a dedicated disableRnrCntReg, set when
max_rnr_cnt == INFINITE_RETRY, gating both the rnrCntReg decrement and the
RNR retry-limit-exceeded check, mirroring the regular-retry path.
A SEND-only RDMA datapath is bandwidth-delay-product limited: with ~4 SENDs
in flight against the SEND->completion latency, throughput saturates well
below line rate. Raising MAX_QP_WR to 16 deepens the SQ pending-WR window;
on a KCU105 10GbE RoCEv2 build this lifts RDMA-SEND from ~8.1 to ~9.75 Gb/s
(line rate) at 4096 B PMTU. MAX_QP_WR stays a power of 2; derived sizes
(MAX_CQE, MAX_PENDING_WORK_COMP_NUM, ...) scale with it.
The BlueSim CI (run.sh -> Makefile.test) failed to compile and run on this
branch. Fix it so the full registered suite passes, and get the cocotb
AXI-Stream integration test running again.

Build/CI:
- Makefile.base: add src/ to the bsc package search path for test builds
  (TESTSRCDIR); test benches run from test/ and could not find any core
  package. Keep the src/ build path (SRCDIRFLAGS) lib-only to avoid a
  duplicate-path error during Verilog generation.
- .github/workflows/ci.yml: checkout@v4 with submodules: recursive so the
  blue-wrapper submodule (required on the bsc path) is fetched on the runner.

SQ test-harness fixes (match current source signatures/behavior):
- Utils4Test / TestInputPktHandle: drop references to the removed request
  ingress pipe (reqPktPipeOut); keep response + CNP checks.
- SimGenRdmaReqResp: use contextSQ.statusSQ (statusRQ was removed).
- TestSendQ / TestPayloadGen: build single-element SGLs (MAX_SGE == 1).
- TestRespHandleSQ: mkRespHandleSQ is now 4-arg; drain the response payload
  with a sink (SQ DMA-write/permission path removed) and drop the no-longer
  observable read/atomic payload comparison.
- TestWorkCompGen: mkWorkCompGenSQ no longer waits on a DMA-write response;
  drive wcWaitDmaResp false and stop feeding the removed payloadConResp port.

Skip tests that depend on the disabled receive-queue (RQ) datapath, with
sources preserved for when RQ is re-enabled:
- De-register TestQueuePair, TestReqHandleRQ, TestTransportLayer, and the
  request-ingress-only SimExtractRdmaHeaderPayload from Makefile.test.
- Trim TestInputPktHandle to the CNP test and TestWorkCompGen to its SQ cases;
  stub the RQ work-completion test body.

cocotb:
- TestAxiSTransportLayer: use LogicArray.to_unsigned() instead of the
  deprecated .integer getter (cocotb 2.x).
- Add test/cocotb/requirements.txt documenting the Python deps (incl.
  bitstring) for the manually-run cocotb flow.
@ruck314 ruck314 changed the title Deeper SQ (MAX_QP_WR=16) + honor rnr_retry=7 as infinite RNR retry Deeper SQ (MAX_QP_WR=16) + honor rnr_retry=7 as infinite RNR retry + fix CI test suite Jun 22, 2026
ruck314 added 3 commits June 24, 2026 08:48
Under sustained/nested RNR backpressure the SQ retry FSM could wedge
permanently: a second RNR/NAK arriving during RNR_WAIT re-runs initRetry,
re-entering RETRY_HANDLE_ST_START_PRE_RETRY while the pending-WR scan FIFO
is still parked in PRE_SCAN_MODE. The old 2-way isScanDone branch then
issued preScanRestart(), whose implicit guard is inScanMode -- unsatisfiable
in PRE_SCAN_MODE -- so the rule could never fire again. Dispatch and
completions stalled until a full QP/source drain.

startPreRetry now selects the scan command by exact mode: FIFOF_MODE ->
preScanStart, SCAN_MODE -> preScanRestart, PRE_SCAN_MODE -> no-op (the
pre-scan is already armed at the head). Expose ScanCntrl.isScanMode() to
distinguish the SCAN/PRE_SCAN cases.

Adds TestSqRnrThroughput: drives a continuous WR stream through a sustained
RNR window and asserts post-RNR throughput recovers to >=80% of baseline
(pre-fix it stays at 0%, a permanent stall). Registered in Makefile.test.
The pending-WR scan FIFO (mkScanFIFOF) cleared its output queue scanOutQ
unconditionally on every fifoMode cycle. On a normal scanDone -> FIFOF_MODE
transition the up-to-2 replayed work requests still buffered in scanOutQ (its
mkFIFOF depth) had not yet been drained by mkReqGenSQ, so they were flushed and
never reached the wire. Each go-back-N retry therefore replayed itemCnt-2
packets instead of itemCnt, leaving a PSN gap that the peer SEQ_ERR-NAKs,
triggering another truncated retry -> a self-perpetuating ~2x retransmit
treadmill after RNR backpressure (the sticky half-throughput latch).

Gate scanOutQ.clear to fire only when (re)arming a fresh scan (the
preScanStart branch). The abort paths (stopScan, preScanRestart) already clear
scanOutQ explicitly in scanModeStateChange, so this was the only unconditional
clear, and it was the one dropping legitimate scanDone leftovers.

Full Bluesim regression suite passes (76/76, 0 Error/ImmAssert), including
mkTestScanFIFOF and the mkTestSqRnrThroughput nested-RNR deadlock guard.
…receiver)

The existing mkTestSqRnrThroughput responder is a blind echo-ACK: it ACKs every
WR-end at that packet's own PSN, so when the buggy SQ drops the last <=2 WRs of a
retry replay and jumps ahead, the responder ACKs the higher-PSN WR-end and the SQ
treats it as a cumulative ACK that silently absorbs the dropped WRs. That test
therefore PASSES on the scanOutQ.clear-buggy code and does not cover the
replay-tail-drop fixed in 2cbc90e.

Add mkTestSqTailDropCase (+ mkTestSqTailDropNoRnrCase self-check) driving the same
mkSQ with a STRICT in-order RC receiver: ACK only the in-order WR-end at the
expected PSN, SEQ_ERR-NAK a forward PSN gap (one coalesced NAK per episode, never
cumulatively ACK past a hole), re-ACK duplicates. Single-packet WRs make PSN index
the WR directly (the hardware 4096B-SEND regime). A transient RNR burst kicks off
one retry; the dropped replay tail then leaves a gap the receiver NAKs, forcing
another truncated retry -> the self-perpetuating ~2x retransmit treadmill.

Verdict = steady-state TX/comp (req-pkts/work-comps, last bucket); PASS <= 130%.
Verified discriminating: pristine code -> TX/comp=200%, ImmAssert FAIL; with the
fix -> TX/comp=100% PASS. No-RNR self-check -> 100% (strict receiver does not
itself manufacture the gap). Full Bluesim suite 78/78, 0 Error/ImmAssert.
@ruck314 ruck314 marked this pull request as ready for review June 25, 2026 01:57
@ruck314

ruck314 commented Jun 25, 2026

Copy link
Copy Markdown
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant