
redundancy strategy fetch drain #5448

@nugaon

Description


Context

Summary

runStrategy uses defer cancel() with a for range c loop to drive chunk fetches. When the loop exits early (enough chunks fetched or too many failed), cancel() fires but goroutines already inside g.fetcher.Get continue running. These ghost goroutines from a failed DATA strategy mutate shared atomic counters g.failedCnt and g.fetchedCnt concurrently with the subsequent RACE strategy, corrupting its accounting and causing it to terminate prematurely with errStrategyFailed even when enough parity chunks are available. Additionally, for range c never terminates on its own since c is never closed — the function can only exit via early returns.

Expected behavior

Each strategy's fetch goroutines must fully settle their side effects on g.failedCnt and g.fetchedCnt before the next strategy reads them. The strategies in prefetch (L172–L206) run sequentially — DATA → RACE — and shared counter state must be consistent at each strategy boundary.

Actual behavior

If the goroutine drain is omitted:

  1. DATA strategy returns early with errStrategyFailed (allowedErrs = 0, first failure)
  2. cancel() fires, but goroutines already inside g.fetcher.Get(fctx, ...) are still running
  3. RACE strategy starts with allowedErrs = g.parityCnt
  4. DATA's cancelled flying goroutines fail their network calls → each calls g.failedCnt.Add(1) (L148)
  5. RACE checks g.failedCnt.Load() > int32(allowedErrs) at L264; failedCnt is still inflated by DATA's ghost goroutines, so RACE returns errStrategyFailed prematurely

Goroutines blocked on waits[i] (singleflight wait, L159–L163) return ctx.Err() without touching failedCnt, so only flying goroutines contribute to this race. On a high-latency or congested network, multiple DATA goroutines can be in-flight simultaneously, making the race likely.

Steps to reproduce

The current code at L242–L267 reads:

ctx, cancel := context.WithCancel(context.Background())
defer cancel()

for _, i := range m {
    go func(i int) {
        c <- g.fetch(ctx, i, false)
    }(i)
}

for range c {
    if g.fetchedCnt.Load() >= int32(g.shardCnt) {
        return nil
    }
    if g.failedCnt.Load() > int32(allowedErrs) {
        return errStrategyFailed
    }
}

To trigger the race:

  1. Configure a node with erasure coding enabled (redundancy level > 0)
  2. Upload a file such that some data shards become unavailable in the network
  3. Attempt retrieval — DATA strategy fails, RACE strategy starts
  4. DATA's cancelled flying goroutines call g.failedCnt.Add(1) concurrently with RACE's check of failedCnt > allowedErrs
  5. RACE incorrectly returns errStrategyFailed despite sufficient parity being available

Possible solution

Replace defer cancel() + for range c with an explicit drain gated by a completed counter. The counter is load-bearing: the main loop and the defer both receive from c (buffered to len(m)), so without it they would together attempt len(m) + completed receives against only len(m) sends, and the deferred drain would deadlock.

ctx, cancel := context.WithCancel(context.Background())
completed := 0
defer func() {
    cancel()
    // drain remaining goroutines to ensure g.failedCnt/g.fetchedCnt are
    // fully settled before the next strategy or recover() reads them.
    // 'completed' tracks how many were already consumed by the loop below.
    for range len(m) - completed {
        <-c
    }
}()

AI Disclosure

  • This issue contains suggestions and text generated by an LLM.
  • I have reviewed the AI generated content thoroughly.
  • I possess the technical expertise to responsibly review the AI generated content mentioned in this issue.
