Skip to content

fix(staking): harden markValidatorForDeletion (jailed-delete race + non-jailed divide-by-zero)#16

Open
vitaly-andr wants to merge 1 commit into
gonka-ai:release/v0.53.xfrom
vitaly-andr:fix/markvalidator-jailed-cometbft-lag
Open

fix(staking): harden markValidatorForDeletion (jailed-delete race + non-jailed divide-by-zero)#16
vitaly-andr wants to merge 1 commit into
gonka-ai:release/v0.53.xfrom
vitaly-andr:fix/markvalidator-jailed-cometbft-lag

Conversation

@vitaly-andr

@vitaly-andr vitaly-andr commented May 29, 2026

Copy link
Copy Markdown

markValidatorForDeletion mishandled validator removal in both branches. Same function, same defect class (zeroing state in a way that breaks downstream consumers before the record is gone), so fixed together.

1. Jailed branch — chain halt

The jailed branch called deleteValidatorInternal, removing the validator record and the ValidatorByConsAddr index immediately. CometBFT keeps the validator in LastCommit for ValidatorUpdateDelay blocks (effective 2-block lag — state/execution.go next-next-height plus the NextValidators double-buffer). During that window slashing.BeginBlocker → IsValidatorJailed → GetValidatorByConsAddr fails with ErrNoValidatorFound, halting the chain.

2. Non-jailed branch — divide-by-zero panic

The non-jailed branch zeroed DelegatorShares alongside Tokens. TokensFromShares (validator.go:308) computes shares × Tokens / DelegatorShares, so a MsgUnjail (or any other TokensFromShares caller) landing in the same block as the marking — before DeleteZeroPowerValidators removes the record on the next block — hits a divide-by-zero panic.

Fix

Both paths now zero Tokens but keep DelegatorShares (and, for the jailed path, keep the record + index). Power becomes zero, the existing DeleteZeroPowerValidators defer-path removes the record on a later block — after LastValidatorPower clears and CometBFT has dropped the validator — and TokensFromShares returns 0 cleanly.

Tests

  • TestMarkValidatorForDeletion_JailedRemoval_KeepsValidatorReachable
  • TestMarkValidatorForDeletion_NonJailedRemoval_KeepsDelegatorShares

Both fail on current release/v0.53.x and pass with the fix.

Context

Surfaced while implementing simulation/fuzz testing for inference-chain (gonka-ai/gonka#982); filed with a standalone reproducer as gonka-ai/gonka#1205.

Independent of and complementary to GON-191 (#14, @0xgonka) — a different function (markValidatorForDeletion) than GON-191's (filterBasedOnExisting / deleteValidatorInternal). The two were verified to compose without conflict (full simulation pass with both patches, height=501); coordination on #14: #14 (comment). In that thread @0xgonka confirmed the non-jailed divide-by-zero (#2 above) and recommended fixing it in this PR (identical fix shape).

cc @gmorgachev — fork maintainer and author of markValidatorForDeletion (both branches changed here: the jailed delete in cefc31d3a8e, the non-jailed DelegatorShares zeroing in 0d28c0a1e2e)

markValidatorForDeletion mishandled validator removal in both branches:

1. Jailed branch deleted the validator record and the ValidatorByConsAddr
   index immediately. CometBFT keeps the validator in LastCommit for
   ValidatorUpdateDelay blocks (effective 2-block lag), and
   slashing.BeginBlocker -> GetValidatorByConsAddr then fails with
   ErrNoValidatorFound, halting the chain.

2. Non-jailed branch zeroed DelegatorShares alongside Tokens.
   TokensFromShares divides by DelegatorShares, so a MsgUnjail (or any
   other TokensFromShares caller) in the same block as the marking --
   before DeleteZeroPowerValidators removes the record on the next block --
   hits a divide-by-zero panic.

Both paths now zero Tokens but keep DelegatorShares (and, for the jailed
path, keep the record + index): power becomes zero, the existing
DeleteZeroPowerValidators defer-path removes the record on a later block
after CometBFT has dropped it, and TokensFromShares returns 0 cleanly.

Adds TestMarkValidatorForDeletion_JailedRemoval_KeepsValidatorReachable
and TestMarkValidatorForDeletion_NonJailedRemoval_KeepsDelegatorShares;
both fail on current release/v0.53.x and pass with the fix.

The non-jailed divide-by-zero was confirmed by @0xgonka (GON-191 author)
on gonka-ai#14 as worth fixing in the same PR (identical fix
shape).

Surfaced via gonka-ai/gonka#982 (inference-chain simulation/fuzz testing);
filed with a standalone reproducer as gonka-ai/gonka#1205.

Signed-off-by: Vitaly Andrianov <vitaly.andr@gmail.com>
@vitaly-andr vitaly-andr force-pushed the fix/markvalidator-jailed-cometbft-lag branch from ad4a98b to 46ca57b Compare May 29, 2026 09:08
@vitaly-andr vitaly-andr changed the title fix(staking): keep jailed validator reachable during CometBFT update lag fix(staking): harden markValidatorForDeletion (jailed-delete race + non-jailed divide-by-zero) May 29, 2026
vitaly-andr added a commit to vitaly-andr/gonka that referenced this pull request May 29, 2026
…zzing, secondary-op factories

Completes gonka-ai#982 Phase 3 (x/inference hardening) on top of the second-wave ops:

- Custom invariants via RegisterInvariants (bank-backs-positive-balance
  solvency, no-stuck-voting, effective-epoch-fresh,
  active-invalidations-ref-live) + unit tests.
- Store decoders for the simulation decode harness (Participant / Inference /
  KeySet / Uint64 / Int64 / legacy-params, hex fallback) + tests.
- Aggressive parameter-boundary fuzzing (MutateSimParams): economic params to
  Validate() boundaries, widened EpochParams, corner profiles. Slashing-
  activation levers held at defaults — active slashing drains the Tokens=1 sim
  substrate.
- Three real secondary-op factories (SubmitUnitOfComputePriceProposal,
  SubmitHardwareDiff, SubmitNewUnfundedParticipant) replacing NoOp stubs.
- Data-driven weight tuning; a model-weight transient-cache test helper.
- docs/simulation.md: Phase-3 status, fuzzing design, invariant list, sweep.

The full 500-block sim run is green only with the production fixes it
surfaced: gonka-ai#1275 (VOTING-timeout refund), gonka-ai#1276
(revalidation membership guard), and gonka-ai/cosmos-sdk#16
(markValidatorForDeletion). Verified combined: sim-full 500x200 seed=99 PASS,
app-hash f76df5cd….

The post-run bank-backs invariant holds in the canonical 500x200 run
(transient intra-epoch under-backing heals via debt-recovery by settlement);
the asymmetric refund-debit drift it surfaces under shorter/aggressive runs is
documented as a design question in gonka-ai#1273
(TestRefundInvalidation_LeavesAccountingDrift).

Signed-off-by: Vitaly Andrianov <vitaly.andr@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant