
redundancy strategy fetch drain #5448

@nugaon

Description


Context

Summary

runStrategy uses defer cancel() with a for range c loop to drive chunk fetches. When the loop exits early (enough chunks fetched or too many failed), cancel() fires but goroutines already inside g.fetcher.Get continue running. These ghost goroutines from a failed DATA strategy mutate shared atomic counters g.failedCnt and g.fetchedCnt concurrently with the subsequent RACE strategy, corrupting its accounting and causing it to terminate prematurely with errStrategyFailed even when enough parity chunks are available. Additionally, for range c never terminates on its own since c is never closed — the function can only exit via early returns.

Expected behavior

Each strategy's fetch goroutines must fully settle their side effects on g.failedCnt and g.fetchedCnt before the next strategy reads them. The strategies in prefetch (L172–L206) run sequentially — DATA → RACE — and shared counter state must be consistent at each strategy boundary.

Actual behavior

If the goroutine drain is omitted:

  1. DATA strategy returns early with errStrategyFailed (allowedErrs = 0, first failure)
  2. cancel() fires, but goroutines already inside g.fetcher.Get(fctx, ...) are still running
  3. RACE strategy starts with allowedErrs = g.parityCnt
  4. DATA's cancelled flying goroutines fail their network calls → each calls g.failedCnt.Add(1) (L148)
  5. RACE checks g.failedCnt.Load() > int32(allowedErrs) at L264; failedCnt is still inflated by DATA's ghost goroutines, so RACE returns errStrategyFailed prematurely

Goroutines blocked on waits[i] (singleflight wait, L159–L163) return ctx.Err() without touching failedCnt, so only flying goroutines contribute to this race. On a high-latency or congested network, multiple DATA goroutines can be in-flight simultaneously, making the race likely.

Steps to reproduce

The current code at L242–L267 reads:

ctx, cancel := context.WithCancel(context.Background())
defer cancel()

for _, i := range m {
    go func(i int) {
        c <- g.fetch(ctx, i, false)
    }(i)
}

for range c {
    if g.fetchedCnt.Load() >= int32(g.shardCnt) {
        return nil
    }
    if g.failedCnt.Load() > int32(allowedErrs) {
        return errStrategyFailed
    }
}

To trigger the race:

  1. Configure a node with erasure coding enabled (redundancy level > 0)
  2. Upload a file such that some data shards become unavailable in the network
  3. Attempt retrieval — DATA strategy fails, RACE strategy starts
  4. DATA's cancelled flying goroutines call g.failedCnt.Add(1) concurrently with RACE's check of failedCnt > allowedErrs
  5. RACE incorrectly returns errStrategyFailed despite sufficient parity being available

Possible solution

Replace defer cancel() + for range c with an explicit drain gated by a completed counter. The counter is load-bearing: the main loop and the defer both receive from c (buffered to len(m)), so without it they would together attempt len(m) + completed receives against only len(m) sends, and the deferred drain would deadlock.

ctx, cancel := context.WithCancel(context.Background())
completed := 0
defer func() {
    cancel()
    // drain remaining goroutines to ensure g.failedCnt/g.fetchedCnt are
    // fully settled before the next strategy or recover() reads them.
    // 'completed' tracks how many were already consumed by the loop below.
    for range len(m) - completed {
        <-c
    }
}()

AI Disclosure

  • This issue contains suggestions and text generated by an LLM.
  • I have reviewed the AI generated content thoroughly.
  • I possess the technical expertise to responsibly review the AI generated content mentioned in this issue.
