Shape storage init races with draining instance during rolling deploy (shared bind-mount)

## Summary

During a rolling production deploy, two Electric instances run **concurrently on the same host** and bind-mount the **same** data directory (`/var/electric/<stack_id>/...`). The shape storage layer assumes it is the sole owner of that directory, so the **starting** instance runs destructive init (`File.rm_rf`, `cleanup_all!`) on directories the **draining** instance still has open — producing a burst of `:enoent` crashes and at least one hard NimblePool contract violation.

This was observed in production on 2026-06-17 across two hosts (`ampere`, `faraday`) within the `13:39:58–14:08:33Z` "Deploy production stack" window. Five distinct Sentry issues fired, all tracing to the same root cause.

There are three independent ways to enter the destructive path, ordered from "most innocent starting state" to "least":

1. **Version / schema / OTP-release change — the *common* deploy case, healthy db.** Per the doc on `ShapeStatus.initialize/1` (`shape_status.ex:45-53`), *"a change to `@version`, to `ShapeDb.Connection`'s `@schema_version`, **or to the OTP release**, will result in an empty database."* `validate_existing_shapes` → `count_shapes` then returns **0**, which trips the `valid_shape_count == 0` branch (`shape_status.ex:64-68`) and calls `cleanup_all!` — wiping the whole `<stack_id>` dir the draining instance is actively writing into. A rolling deploy is *exactly* when the release changes, so this is close to the default deploy, not an edge case. The db was perfectly healthy.
2. **Empty / freshly-created db.** Same `valid_shape_count == 0 → cleanup_all!` branch.
3. **Any open failure *or* integrity-check failure.** `delete_corrupt_db` (`rm_rf` of the shape-db parent dir) fires on the integrity-check branch (`connection.ex:271-278`) **and on *any* `Sqlite3.open` error** (`connection.ex:284-289`).

Once any one of these fires, it is self-sustaining and runs on healthy data: instance A's `rm_rf` makes instance B's next file op fail with `:enoent`, which (via path 3) triggers *another* wipe, which breaks A, and so on — the two instances ping-pong wiping + recreating the same dir. Corruption is just one of several ways to *enter* the loop; the loop itself does not need a corrupt db to keep going. This is why the fix must be **coordination + controlled shutdown (an ownership lock), not blind `:enoent` tolerance** — see the proposed fix below.

## Deployment context

Electric Cloud runs each logical instance as an ECS service on EC2 with `deploymentMinimumHealthyPercent: 100` / `maximumPercent: 200`. The data dir is a **host bind-mount** (`/mnt/nvme/electric/<instance>` → container `/var/electric`) attached to the EC2 instance — it is *not* a per-task volume. So during a rolling deploy the old and new tasks **share the same on-disk directory tree**, keyed on the logical instance name, for the whole overlap window. The new task waits on the upstream replication-slot advisory lock before going read-write, but **storage init/cleanup is not gated by that lock** — it runs eagerly at boot.

## Root cause — destructive init assumes sole ownership

**1. Startup cleanup wipes the entire stack dir.**
`lib/electric/shape_cache/shape_status.ex:58-68`: on every boot, if `valid_shape_count == 0` (true when the SQLite db was just deleted or found corrupt, and equally after a version, schema, or OTP-release change, or on a freshly created db), it calls `Storage.cleanup_all!()`:

```elixir
if valid_shape_count == 0 do
  # delete any orphaned shape data
  stack_id
  |> Electric.ShapeCache.Storage.for_stack()
  |> Electric.ShapeCache.Storage.cleanup_all!()
```

`lib/electric/shape_cache/pure_file_storage.ex:231` — `cleanup_all!/1` deletes the whole `base_path` (= `/var/electric/<stack_id>/`) and `tmp_dir`, including the `log/` and `metadata/` directories the draining instance still holds open.

**2. SQLite "recovery" deletes the shape-db parent dir.**
`lib/electric/shape_cache/shape_status/shape_db/connection.ex:294-298`:

```elixir
defp delete_corrupt_db(db_path) do
  with dir = Path.dirname(db_path),
       {:ok, _} <- File.rm_rf(dir),   # nukes shapes/meta/shape-db/
       :ok <- File.mkdir_p(dir) do
    :ok
  end
end
```

Called from `open_with_recovery/4` on **any** open failure *or* integrity-check failure (`connection.ex:276, 287`). During overlap, the two instances ping-pong wiping and recreating the same dir, so each one both starves and corrupts the other.
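
Because the wipe currently fires on any open failure, it may help to look at what is actually on disk before treating a failure as corruption. The following is a minimal sketch under stated assumptions, not Electric's code: `OpenFailureSketch`, `classify/1`, and `wipe_allowed?/2` are hypothetical names, and the `owner?` flag stands in for whatever ownership signal the instance ends up with. It uses only `File` and `Path` from the standard library.

```elixir
defmodule OpenFailureSketch do
  @moduledoc """
  Hypothetical helper: inspects the filesystem after a failed open so that
  "directory vanished" is not treated the same as "db file is corrupt".
  """

  # Returns :dir_missing, :db_missing, or :db_present for the given db path.
  def classify(db_path) do
    dir = Path.dirname(db_path)

    cond do
      # The parent dir is gone: most likely a peer removed it, not corruption.
      not File.dir?(dir) -> :dir_missing
      # The dir exists but the db file does not: nothing to wipe.
      not File.exists?(db_path) -> :db_missing
      # The file is there and open still failed: corruption is plausible.
      true -> :db_present
    end
  end

  # Only the owner of the directory may wipe, and only when a db file exists.
  def wipe_allowed?(db_path, owner?) when is_boolean(owner?) do
    owner? and classify(db_path) == :db_present
  end
end
```

The check is itself a check-then-act sequence, so on its own it only narrows the window; it is meant to sit behind an ownership guard rather than replace one.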

**3. NimblePool contract violation turns an open failure into an unhandled crash.**
When recovery exhausts its attempts (`connection.ex:258-260`):

```elixir
defp open_with_recovery(db_path, _pool_state, _opts, 0) do
  Logger.error("Unable to create database at #{db_path}")   # <- ELECTRIC-8XK
  {:error, "failed to open #{db_path}"}
end
```

`init_worker/1` (`connection.ex:157-162`) propagates that `{:error, _}` verbatim:

```elixir
def init_worker(pool_state) do
  with {:ok, conn} <- init_worker_for_pool(pool_state) do
    ...
  end
end
```

But NimblePool requires `{:ok, worker, pool_state} | {:async, fun, pool_state}` — so `{:error, _}` raises `RuntimeError: unexpected return from ...init_worker/1` (ELECTRIC-8XN). The pooled path's `Query.prepare!/2` similarly raises `ShapeDb.Error: prepare_stmts failed: {:error, 1}` (ELECTRIC-8XQ — a 492-event/sec crash-storm).

## Observed Sentry issues (electricsql-04, all 2026-06-17)

| Issue | Symptom | Origin |
|---|---|---|
| ELECTRIC-8XK | `Unable to create database at …` | `open_with_recovery/4` exhausted |
| ELECTRIC-8XN | `RuntimeError: unexpected return from …init_worker/1` (got `{:error, "failed to open …sqlite"}`) | NimblePool contract violation (same event as 8XK) |
| ELECTRIC-8XQ | `ShapeDb.Error: prepare_stmts failed: {:error, 1}` — 492 events/sec, escalating | `Query.prepare!/2` against wiped/contended db |
| ELECTRIC-8XP | `File.write!` `…/metadata/last_persisted_txn_offset.bin.tmp: :enoent` | `PureFileStorage.WriteLoop` writing into a deleted dir |
| ELECTRIC-8XM | `File.open!` `…/log/log.latest.0.jsonfile.bin: :enoent` | `WriteLoop.ensure_json_file_open/2` into a deleted dir |

## Proposed fix (layered)

**Primary — gate destructive storage init on directory ownership, not just the Postgres slot.** The starting instance must not run `cleanup_all!` / `delete_corrupt_db` (or open the storage for write) on a shared `<stack_id>` dir until the previous owner has released it. Suggested: an on-disk lock acquired via `flock(2)` on a per-stack lock file, held by the active instance; defer all destructive/write-mode storage setup until the lock is held. This extends the existing replication-slot handoff to cover the filesystem.
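
As a rough illustration of the gating idea, the sketch below uses an exclusively created lock file instead of `flock(2)`, since the Elixir standard library does not expose `flock`. This is a swapped-in technique with a weaker guarantee: a crashed owner leaves a stale lock file behind, which a real implementation would have to handle. `StackDirLock` and its functions are hypothetical names, not part of Electric. The lock file is placed next to the stack dir rather than inside it, so that a wipe of the dir does not remove the lock.

```elixir
defmodule StackDirLock do
  @moduledoc """
  Hypothetical sketch of a per-stack ownership lock based on exclusive file
  creation. Not flock(2): a stale lock survives a crash of its owner.
  """

  # The lock lives beside the stack dir, e.g. ".../<stack_id>.lock".
  def lock_path(stack_dir), do: String.trim_trailing(stack_dir, "/") <> ".lock"

  # Returns {:ok, lock_path} when this instance becomes the owner,
  # {:error, :held} when another instance already owns the directory.
  def acquire(stack_dir) do
    path = lock_path(stack_dir)

    # :exclusive makes the open fail with :eexist if the file already exists,
    # so at most one caller can create it.
    case File.open(path, [:write, :exclusive]) do
      {:ok, io} ->
        # Record the OS pid of the owner to aid debugging.
        IO.write(io, System.pid() <> "\n")
        File.close(io)
        {:ok, path}

      {:error, :eexist} ->
        {:error, :held}

      {:error, reason} ->
        {:error, reason}
    end
  end

  # Called by the draining instance once it has stopped writing.
  def release(path), do: File.rm(path)
end
```

In this shape, the starting instance would call `StackDirLock.acquire/1` at boot and defer `cleanup_all!` / `delete_corrupt_db` and any write-mode setup while it gets `{:error, :held}`; the draining instance would call `release/1` as the last step of its shutdown.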

**Secondary — fail safe even if a race slips through:**
- `init_worker/1` must satisfy the NimblePool contract: on open failure, retry via `{:async, …}`/backoff or raise a typed, handled error — never return bare `{:error, _}` (fixes 8XN).
- `delete_corrupt_db/1` should not `rm_rf` a directory that a peer may be using; at minimum guard with the ownership lock above, and distinguish "db genuinely corrupt" from "dir pulled out from under me / open failed transiently."
- `PureFileStorage` write path (`File.open!`/`File.write!`, `write_loop.ex` `ensure_json_file_open/2`): when the dir was removed during a known draining/handoff, stop cleanly instead of crash-looping. The existing `shape_gone?/1` check (`pure_file_storage.ex:1408`) is racy.
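
The first bullet could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the actual `connection.ex` code: `ShapeDbWorkerSketch`, `OpenError`, and the `open_fun` key in `pool_state` are hypothetical, and the retry count and delays are arbitrary. The only grounded constraint is the one stated above: `init_worker/1` returns `{:ok, worker, pool_state}` or `{:async, fun, pool_state}`, never a bare `{:error, _}`.

```elixir
defmodule ShapeDbWorkerSketch do
  @moduledoc "Hypothetical init_worker/1 that never returns {:error, _}."

  defmodule OpenError do
    # Typed error raised once retries are exhausted.
    defexception [:message]
  end

  # pool_state is assumed to carry a zero-arity open_fun returning
  # {:ok, conn} or {:error, reason}.
  def init_worker(%{open_fun: open_fun} = pool_state) do
    case open_fun.() do
      {:ok, conn} ->
        {:ok, conn, pool_state}

      {:error, _reason} ->
        # Hand the retry to the pool's async path instead of returning an error.
        {:async, fn -> open_with_backoff(open_fun, 3, 50) end, pool_state}
    end
  end

  # Retries with exponential backoff; returns the conn or raises OpenError.
  defp open_with_backoff(open_fun, attempts_left, delay_ms) do
    case open_fun.() do
      {:ok, conn} ->
        conn

      {:error, reason} when attempts_left <= 1 ->
        raise OpenError, message: "failed to open shape db: #{inspect(reason)}"

      {:error, _reason} ->
        Process.sleep(delay_ms)
        open_with_backoff(open_fun, attempts_left - 1, delay_ms * 2)
    end
  end
end
```

How the pool reacts when the async function raises should be checked against the NimblePool documentation; the point of the sketch is only that the failure surfaces as a named error with a reason rather than as `unexpected return from ...init_worker/1`.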

**Do not** simply make these ops retry/ignore `:enoent` — that would mask genuine on-disk corruption. The fix is coordination + controlled shutdown, not blind tolerance.

## Notes

- Related to #4595 (AsyncDeleter resilient boot) — both concern boot-time storage cleanup robustness, but this is a *concurrency/ownership* bug specific to the rolling-deploy overlap, not a full-disk bug. `cleanup_all!` reaches `AsyncDeleter.delete`, so the two interact.
- The triggering deploy ("Deploy new Electric, hibernate-then-suspend, GC tweaks") also shipped GC/hibernate changes touching storage lifecycle — worth confirming those paths don't independently call `cleanup_all!`/`delete_corrupt_db` on live stacks (a candidate explanation for the standalone 8XQ burst at 14:49Z, which falls outside the deploy window).

🤖 Generated with [Claude Code](https://claude.com/claude-code)


