Deeper SQ (MAX_QP_WR=16) + honor rnr_retry=7 as infinite RNR retry + fix CI test suite#3
Open
ruck314 wants to merge 6 commits into
Open
Deeper SQ (MAX_QP_WR=16) + honor rnr_retry=7 as infinite RNR retry + fix CI test suite#3ruck314 wants to merge 6 commits into
ruck314 wants to merge 6 commits into
Conversation
The RNR retry counter was gated only by disableRetryCntReg (the regular retry-count infinite flag, set when max_retry_cnt == 7). A QP configured with rnr_retry == 7 (IBV "infinite") but a finite retry_count would still exhaust rnrCntReg and fault. Add a dedicated disableRnrCntReg, set when max_rnr_cnt == INFINITE_RETRY, gating both the rnrCntReg decrement and the RNR retry-limit-exceeded check, mirroring the regular-retry path.
A SEND-only RDMA datapath is bandwidth-delay-product limited: with ~4 SENDs in flight against the SEND->completion latency, throughput saturates well below line rate. Raising MAX_QP_WR to 16 deepens the SQ pending-WR window; on a KCU105 10GbE RoCEv2 build this lifts RDMA-SEND from ~8.1 to ~9.75 Gb/s (line rate) at 4096 B PMTU. MAX_QP_WR stays a power of 2; derived sizes (MAX_CQE, MAX_PENDING_WORK_COMP_NUM, ...) scale with it.
The BlueSim CI (run.sh -> Makefile.test) failed to compile and run on this branch. Fix it so the full registered suite passes, and get the cocotb AXI-Stream integration test running again. Build/CI: - Makefile.base: add src/ to the bsc package search path for test builds (TESTSRCDIR); test benches run from test/ and could not find any core package. Keep the src/ build path (SRCDIRFLAGS) lib-only to avoid a duplicate-path error during Verilog generation. - .github/workflows/ci.yml: checkout@v4 with submodules: recursive so the blue-wrapper submodule (required on the bsc path) is fetched on the runner. SQ test-harness fixes (match current source signatures/behavior): - Utils4Test / TestInputPktHandle: drop references to the removed request ingress pipe (reqPktPipeOut); keep response + CNP checks. - SimGenRdmaReqResp: use contextSQ.statusSQ (statusRQ was removed). - TestSendQ / TestPayloadGen: build single-element SGLs (MAX_SGE == 1). - TestRespHandleSQ: mkRespHandleSQ is now 4-arg; drain the response payload with a sink (SQ DMA-write/permission path removed) and drop the no-longer observable read/atomic payload comparison. - TestWorkCompGen: mkWorkCompGenSQ no longer waits on a DMA-write response; drive wcWaitDmaResp false and stop feeding the removed payloadConResp port. Skip tests that depend on the disabled receive-queue (RQ) datapath, with sources preserved for when RQ is re-enabled: - De-register TestQueuePair, TestReqHandleRQ, TestTransportLayer, and the request-ingress-only SimExtractRdmaHeaderPayload from Makefile.test. - Trim TestInputPktHandle to the CNP test and TestWorkCompGen to its SQ cases; stub the RQ work-completion test body. cocotb: - TestAxiSTransportLayer: use LogicArray.to_unsigned() instead of the deprecated .integer getter (cocotb 2.x). - Add test/cocotb/requirements.txt documenting the Python deps (incl. bitstring) for the manually-run cocotb flow.
Under sustained/nested RNR backpressure the SQ retry FSM could wedge permanently: a second RNR/NAK arriving during RNR_WAIT re-runs initRetry, re-entering RETRY_HANDLE_ST_START_PRE_RETRY while the pending-WR scan FIFO is still parked in PRE_SCAN_MODE. The old 2-way isScanDone branch then issued preScanRestart(), whose implicit guard is inScanMode -- unsatisfiable in PRE_SCAN_MODE -- so the rule could never fire again. Dispatch and completions stalled until a full QP/source drain. startPreRetry now selects the scan command by exact mode: FIFOF_MODE -> preScanStart, SCAN_MODE -> preScanRestart, PRE_SCAN_MODE -> no-op (the pre-scan is already armed at the head). Expose ScanCntrl.isScanMode() to distinguish the SCAN/PRE_SCAN cases. Adds TestSqRnrThroughput: drives a continuous WR stream through a sustained RNR window and asserts post-RNR throughput recovers to >=80% of baseline (pre-fix it stays at 0%, a permanent stall). Registered in Makefile.test.
The pending-WR scan FIFO (mkScanFIFOF) cleared its output queue scanOutQ unconditionally on every fifoMode cycle. On a normal scanDone -> FIFOF_MODE transition the up-to-2 replayed work requests still buffered in scanOutQ (its mkFIFOF depth) had not yet been drained by mkReqGenSQ, so they were flushed and never reached the wire. Each go-back-N retry therefore replayed itemCnt-2 packets instead of itemCnt, leaving a PSN gap that the peer SEQ_ERR-NAKs, triggering another truncated retry -> a self-perpetuating ~2x retransmit treadmill after RNR backpressure (the sticky half-throughput latch). Gate scanOutQ.clear to fire only when (re)arming a fresh scan (the preScanStart branch). The abort paths (stopScan, preScanRestart) already clear scanOutQ explicitly in scanModeStateChange, so this was the only unconditional clear, and it was the one dropping legitimate scanDone leftovers. Full Bluesim regression suite passes (76/76, 0 Error/ImmAssert), including mkTestScanFIFOF and the mkTestSqRnrThroughput nested-RNR deadlock guard.
…receiver) The existing mkTestSqRnrThroughput responder is a blind echo-ACK: it ACKs every WR-end at that packet's own PSN, so when the buggy SQ drops the last <=2 WRs of a retry replay and jumps ahead, the responder ACKs the higher-PSN WR-end and the SQ treats it as a cumulative ACK that silently absorbs the dropped WRs. That test therefore PASSES on the scanOutQ.clear-buggy code and does not cover the replay-tail-drop fixed in 2cbc90e. Add mkTestSqTailDropCase (+ mkTestSqTailDropNoRnrCase self-check) driving the same mkSQ with a STRICT in-order RC receiver: ACK only the in-order WR-end at the expected PSN, SEQ_ERR-NAK a forward PSN gap (one coalesced NAK per episode, never cumulatively ACK past a hole), re-ACK duplicates. Single-packet WRs make PSN index the WR directly (the hardware 4096B-SEND regime). A transient RNR burst kicks off one retry; the dropped replay tail then leaves a gap the receiver NAKs, forcing another truncated retry -> the self-perpetuating ~2x retransmit treadmill. Verdict = steady-state TX/comp (req-pkts/work-comps, last bucket); PASS <= 130%. Verified discriminating: pristine code -> TX/comp=200%, ImmAssert FAIL; with the fix -> TX/comp=100% PASS. No-RNR self-check -> 100% (strict receiver does not itself manufacture the gap). Full Bluesim suite 78/78, 0 Error/ImmAssert.
Author
|
@FilMarini I confirmed this changes worked using the following PRs: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Two independent changes from running this core in a SLAC KCU105 10GbE RoCEv2 build (SEND-with-immediate datapath), plus a CI test-suite fix. Draft for discussion — the commits are independent and can be reviewed/taken separately.
fix: honor rnr_retry=7 as infinite RNR retry
RNR retry was gated only by
disableRetryCntReg(set whenmax_retry_cnt == 7). A QP configured withrnr_retry == 7(IBV "infinite") but a finiteretry_countwould still exhaustrnrCntRegand fault. This adds a dedicateddisableRnrCntReg(set whenmax_rnr_cnt == INFINITE_RETRY) gating both thernrCntRegdecrement and the RNR retry-limit-exceeded check, mirroring the regular-retry path.feat: raise MAX_QP_WR 4 -> 16
A SEND-only datapath is bandwidth-delay-product limited: with ~4 SENDs in flight against the SEND->completion latency, throughput saturates well below line rate. Raising
MAX_QP_WRto 16 deepens the SQ pending-WR window. On a KCU105 10GbE build this lifts RDMA-SEND from ~8.1 to ~9.75 Gb/s (line rate) at 4096 B PMTU, with PRBS integrity intact and timing closed.MAX_QP_WRstays a power of 2; derived sizes (MAX_CQE,MAX_PENDING_WORK_COMP_NUM, ...) scale with it. The depth value could be made a parameter if preferred.test: fix CI test suite
The BlueSim CI (
run.sh->Makefile.test) did not compile/run on this branch, so the two changes above were landing without test coverage. This restores the registered suite to green and revives the cocotb AXI-Stream integration test.src/to the bsc package search path for test builds (Makefile.base);actions/checkout@v4withsubmodules: recursiveso theblue-wrappersubmodule (required on the bsc path) is fetched on the runner.MAX_SGE == 1), 4-argmkRespHandleSQ,mkWorkCompGenSQno longer waits on a DMA-write response, and dropped references to the removed request-ingress pipe (reqPktPipeOut/statusRQ).TestQueuePair,TestReqHandleRQ,TestTransportLayer, and the request-ingress-only cases are de-registered inMakefile.test(sources preserved, with revival instructions, for when RQ is re-enabled).LogicArray.to_unsigned()instead of the deprecated.integergetter (cocotb 2.x); addtest/cocotb/requirements.txtdocumenting the Python deps (incl.bitstring) for the manually-run cocotb flow.