⚠️ Support requests in an issue-format will be closed immediately. For support, go to Swarm's Discord.
Context
v2.7.1 (latest stable), runStrategy (L208–L270)
Summary
runStrategy uses defer cancel() with a for range c loop to drive chunk fetches. When the loop exits early (enough chunks fetched or too many failed), cancel() fires but goroutines already inside g.fetcher.Get continue running. These ghost goroutines from a failed DATA strategy mutate shared atomic counters g.failedCnt and g.fetchedCnt concurrently with the subsequent RACE strategy, corrupting its accounting and causing it to terminate prematurely with errStrategyFailed even when enough parity chunks are available. Additionally, for range c never terminates on its own since c is never closed — the function can only exit via early returns.
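To illustrate the underlying pattern, here is a minimal standalone Go sketch (not the Bee code; the counter and channel names are illustrative): cancelling the context does not stop goroutines that are already inside their blocking call, so their counter updates land after the caller has moved on.

ctr := new(atomic.Int32)
c := make(chan error, 4)
ctx, cancel := context.WithCancel(context.Background())
for range 4 {
    go func() {
        <-ctx.Done() // stands in for a network call that only fails once the context is cancelled
        ctr.Add(1)   // this update can land at any time after cancel() below
        c <- ctx.Err()
    }()
}
cancel()
// Anything that runs after cancel() and reads ctr (the next strategy, in runStrategy's terms)
// races with the Add(1) calls above unless c is drained first.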
Expected behavior
Each strategy's fetch goroutines must fully settle their side effects on g.failedCnt and g.fetchedCnt before the next strategy reads them. The strategies in prefetch (L172–L206) run sequentially — DATA → RACE — and shared counter state must be consistent at each strategy boundary.
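Equivalently, each strategy should end with a barrier before the next one reads the counters (a sketch of the invariant only, assuming a sync.WaitGroup; the fix proposed below drains the result channel instead):

var wg sync.WaitGroup
for _, i := range m {
    wg.Add(1)
    go func(i int) {
        defer wg.Done()
        c <- g.fetch(ctx, i, false)
    }(i)
}
defer func() {
    cancel()  // stop outstanding fetches
    wg.Wait() // ...and let their g.failedCnt/g.fetchedCnt updates settle before returning
}()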
Actual behavior
If the goroutine drain is omitted:
- DATA strategy returns early with errStrategyFailed (allowedErrs = 0, first failure)
- cancel() fires, but goroutines already inside g.fetcher.Get(fctx, ...) are still running
- RACE strategy starts with allowedErrs = g.parityCnt
- DATA's cancelled flying goroutines fail their network calls → each calls g.failedCnt.Add(1) (L148)
- RACE checks g.failedCnt.Load() > int32(allowedErrs) at L264 — failedCnt is inflated by DATA's ghost goroutines — and returns errStrategyFailed prematurely
Goroutines blocked on waits[i] (singleflight wait, L159–L163) return ctx.Err() without touching failedCnt, so only flying goroutines contribute to this race. On a high-latency or congested network, multiple DATA goroutines can be in-flight simultaneously, making the race likely.
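Schematically, the two paths look like this (a sketch of the behavior described above, not the Bee source; waitCh and netGet stand in for the singleflight wait and g.fetcher.Get):

// Waiting goroutine: parked on the singleflight wait when cancel() fires.
select {
case <-waitCh:
    // another goroutine delivered the chunk
case <-ctx.Done():
    return ctx.Err() // returns without touching g.failedCnt
}

// Flying goroutine: already inside the network call when cancel() fires.
if err := netGet(ctx); err != nil {
    g.failedCnt.Add(1) // the cancelled fetch is still recorded as a failure
    return err
}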
Steps to reproduce
The current code at L242–L267 reads:
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
for _, i := range m {
    go func(i int) {
        c <- g.fetch(ctx, i, false)
    }(i)
}
for range c {
    if g.fetchedCnt.Load() >= int32(g.shardCnt) {
        return nil
    }
    if g.failedCnt.Load() > int32(allowedErrs) {
        return errStrategyFailed
    }
}
To trigger the race:
- Configure a node with erasure coding enabled (redundancy level > 0)
- Upload a file such that some data shards become unavailable in the network
- Attempt retrieval — DATA strategy fails, RACE strategy starts
- DATA's cancelled flying goroutines call g.failedCnt.Add(1) concurrently with RACE's check of failedCnt > allowedErrs
- RACE incorrectly returns errStrategyFailed despite sufficient parity being available
Possible solution
Replace defer cancel() + for range c with an explicit drain using a completed counter. The counter is load-bearing: the main loop and the defer both consume from c (buffered to len(m)). Without it, the defer would read a full len(m) values on top of the completed values the loop already consumed, for a total of len(m) + completed reads against a channel that only ever receives len(m) sends: deadlock.
ctx, cancel := context.WithCancel(context.Background())
completed := 0
defer func() {
    cancel()
    // drain remaining goroutines to ensure g.failedCnt/g.fetchedCnt are
    // fully settled before the next strategy or recover() reads them.
    // 'completed' tracks how many were already consumed by the loop below.
    for range len(m) - completed {
        <-c
    }
}()
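For completeness, a sketch of how the rest of runStrategy would tie in: the spawn loop and the consuming loop are the current ones, and only the completed++ increment is new.

for _, i := range m {
    go func(i int) {
        c <- g.fetch(ctx, i, false)
    }(i)
}
for range c {
    completed++ // consumed here, so the deferred drain reads only the remaining len(m) - completed
    if g.fetchedCnt.Load() >= int32(g.shardCnt) {
        return nil
    }
    if g.failedCnt.Load() > int32(allowedErrs) {
        return errStrategyFailed
    }
}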
AI Disclosure