Shape storage init races with draining instance during rolling deploy (shared bind-mount) #4637

Description

@erik-the-implementer

Summary

During a rolling production deploy, two Electric instances run concurrently on the same host and bind-mount the same data directory (/var/electric/<stack_id>/...). The shape storage layer assumes it is the sole owner of that directory, so the starting instance runs destructive init (File.rm_rf, cleanup_all!) on directories the draining instance still has open — producing a burst of :enoent crashes and at least one hard NimblePool contract violation.

This was observed in production on 2026-06-17 across two hosts (ampere, faraday) within the 13:39:58–14:08:33Z "Deploy production stack" window. Five distinct Sentry issues fired, all tracing to the same root cause.

There are three independent ways to enter the destructive path, ordered from "most innocent starting state" to "least":

  1. Version / schema / OTP-release change: the common deploy case, with a healthy db. Per the doc on ShapeStatus.initialize/1 (shape_status.ex:45-53), "a change to @version, to ShapeDb.Connection's @schema_version, or to the OTP release, will result in an empty database." validate_existing_shapes → count_shapes then returns 0, which trips the valid_shape_count == 0 branch (shape_status.ex:64-68) and calls cleanup_all!, wiping the whole <stack_id> dir the draining instance is actively writing into. A rolling deploy is exactly when the release changes, so this is close to the default deploy, not an edge case. The db was perfectly healthy.
  2. Empty / freshly-created db. Same valid_shape_count == 0 → cleanup_all! branch.
  3. Any open failure or integrity-check failure. delete_corrupt_db (rm_rf of the shape-db parent dir) fires on the integrity-check branch (connection.ex:271-278) and on any Sqlite3.open error (connection.ex:284-289).

Once any one of these fires, it is self-sustaining and runs on healthy data: instance A's rm_rf makes instance B's next file op fail with :enoent, which (via path 3) triggers another wipe, which breaks A, and so on — the two instances ping-pong wiping + recreating the same dir. Corruption is just one of several ways to enter the loop; the loop itself does not need a corrupt db to keep going. This is why the fix must be coordination + controlled shutdown (an ownership lock), not blind :enoent tolerance — see the proposed fix below.
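As a rough summary of the three entry conditions above, the sketch below maps what a starting instance observes at boot to the destructive action it would take. The module name, the map keys, and the returned atoms are all hypothetical; this is a model of the behaviour described in this issue, not code from the repository.

```elixir
defmodule DestructivePathSketch do
  # `boot` is a hypothetical summary of what the starting instance sees.
  # Clause order mirrors the description: open and integrity failures are
  # handled while opening the db, the shape count is checked afterwards.

  # Path 3: any Sqlite3.open error leads to delete_corrupt_db.
  def entry(%{open: {:error, _reason}}), do: {:delete_corrupt_db, :open_failed}

  # Path 3: a failed integrity check leads to delete_corrupt_db.
  def entry(%{integrity: :failed}), do: {:delete_corrupt_db, :integrity_check_failed}

  # Paths 1 and 2: an empty db (release change or fresh db) leads to cleanup_all!.
  def entry(%{valid_shape_count: 0}), do: {:cleanup_all, :no_valid_shapes}

  # Anything else boots without touching existing data.
  def entry(_boot), do: :none
end
```

Note that none of the clauses looks at whether another instance is still using the directory, which is the gap the proposed fix addresses.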

Deployment context

Electric Cloud runs each logical instance as an ECS service on EC2 with deploymentMinimumHealthyPercent: 100 / maximumPercent: 200. The data dir is a host bind-mount (/mnt/nvme/electric/<instance> → container /var/electric) attached to the EC2 instance — it is not a per-task volume. So during a rolling deploy the old and new tasks share the same on-disk directory tree, keyed on the logical instance name, for the whole overlap window. The new task waits on the upstream replication-slot advisory lock before going read-write, but storage init/cleanup is not gated by that lock — it runs eagerly at boot.

Root cause — destructive init assumes sole ownership

1. Startup cleanup wipes the entire stack dir.
lib/electric/shape_cache/shape_status.ex:58-68 — on every boot, if valid_shape_count == 0 (which is true when the SQLite db was just deleted/corrupt), it calls Storage.cleanup_all!():

if valid_shape_count == 0 do
  # delete any orphaned shape data
  stack_id
  |> Electric.ShapeCache.Storage.for_stack()
  |> Electric.ShapeCache.Storage.cleanup_all!()

lib/electric/shape_cache/pure_file_storage.ex:231: cleanup_all!/1 deletes the whole base_path (= /var/electric/<stack_id>/) and tmp_dir, including the log/ and metadata/ directories the draining instance still holds open.

2. SQLite "recovery" deletes the shape-db parent dir.
lib/electric/shape_cache/shape_status/shape_db/connection.ex:294-298:

defp delete_corrupt_db(db_path) do
  with dir = Path.dirname(db_path),
       {:ok, _} <- File.rm_rf(dir),   # nukes shapes/meta/shape-db/
       :ok <- File.mkdir_p(dir) do
    :ok
  end
end

Called from open_with_recovery/4 on any open failure or integrity-check failure (connection.ex:276, 287). During overlap, the two instances ping-pong wiping + recreating the same dir, each starving and corrupting the other.

3. NimblePool contract violation turns an open failure into an unhandled crash.
When recovery exhausts its attempts (connection.ex:258-260):

defp open_with_recovery(db_path, _pool_state, _opts, 0) do
  Logger.error("Unable to create database at #{db_path}")   # <- ELECTRIC-8XK
  {:error, "failed to open #{db_path}"}
end

init_worker/1 (connection.ex:157-162) propagates that {:error, _} verbatim:

def init_worker(pool_state) do
  with {:ok, conn} <- init_worker_for_pool(pool_state) do
    ...
  end
end

But NimblePool requires {:ok, worker, pool_state} | {:async, fun, pool_state} — so {:error, _} raises RuntimeError: unexpected return from ...init_worker/1 (ELECTRIC-8XN). The pooled path's Query.prepare!/2 similarly raises ShapeDb.Error: prepare_stmts failed: {:error, 1} (ELECTRIC-8XQ — a 492-event/sec crash-storm).
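One way to keep init_worker/1 inside that contract is to move the open into the {:async, fun, pool_state} form and have the function either return a worker state or raise a typed error. The sketch below is a minimal illustration under stated assumptions: OpenRetrySketch, OpenError, open_with_backoff!/3, and the :open_fun key in pool_state are hypothetical names, and the open function is injected so the example runs without SQLite or NimblePool. How a raise inside the async function surfaces to the pool should be checked against the NimblePool documentation.

```elixir
defmodule OpenRetrySketch do
  defmodule OpenError do
    # Typed error raised once all attempts are used up.
    defexception [:reason, :attempts]

    @impl true
    def message(%{reason: reason, attempts: attempts}) do
      "open failed after #{attempts} attempt(s): #{inspect(reason)}"
    end
  end

  # open_fun is any zero-arity function returning {:ok, conn} | {:error, reason}.
  # Retries with a linearly growing delay, then raises OpenError.
  def open_with_backoff!(open_fun, max_attempts, delay_ms \\ 50)
      when is_function(open_fun, 0) and max_attempts > 0 do
    do_open(open_fun, 1, max_attempts, delay_ms)
  end

  defp do_open(open_fun, attempt, max_attempts, delay_ms) do
    case open_fun.() do
      {:ok, conn} ->
        conn

      {:error, reason} when attempt >= max_attempts ->
        raise OpenError, reason: reason, attempts: attempt

      {:error, _reason} ->
        Process.sleep(delay_ms * attempt)
        do_open(open_fun, attempt + 1, max_attempts, delay_ms)
    end
  end

  # Shape of an init_worker/1 that never returns a bare {:error, _}:
  # the tuple is one of the two forms NimblePool accepts, and the
  # function returns the worker state when the open succeeds.
  def init_worker(%{open_fun: open_fun} = pool_state) do
    {:async, fn -> open_with_backoff!(open_fun, 3) end, pool_state}
  end
end
```

Retrying here only makes sense once destructive recovery is gated on ownership; otherwise each retry can still race with the peer instance.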

Observed Sentry issues (electricsql-04, all 2026-06-17)

| Issue | Symptom | Origin |
|---|---|---|
| ELECTRIC-8XK | Unable to create database at … | open_with_recovery/4 exhausted |
| ELECTRIC-8XN | RuntimeError: unexpected return from …init_worker/1 (got {:error, "failed to open …sqlite"}) | NimblePool contract violation (same event as 8XK) |
| ELECTRIC-8XQ | ShapeDb.Error: prepare_stmts failed: {:error, 1} (492 events/sec, escalating) | Query.prepare!/2 against wiped/contended db |
| ELECTRIC-8XP | File.write! …/metadata/last_persisted_txn_offset.bin.tmp: :enoent | PureFileStorage.WriteLoop writing into a deleted dir |
| ELECTRIC-8XM | File.open! …/log/log.latest.0.jsonfile.bin: :enoent | WriteLoop.ensure_json_file_open/2 into a deleted dir |

Proposed fix (layered)

Primary — gate destructive storage init on directory ownership, not just the Postgres slot. The starting instance must not run cleanup_all! / delete_corrupt_db (or open the storage for write) on a shared <stack_id> dir until the previous owner has released it. Suggested: an on-disk lock acquired via flock(2) on a per-stack lock file, held by the active instance; defer all destructive/write-mode storage setup until the lock is held. This extends the existing replication-slot handoff to cover the filesystem.
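The gating itself can be sketched in a few lines. As far as I know, Elixir's standard library does not expose flock(2), so the example below substitutes a different primitive: a lock file created with O_EXCL semantics via the :exclusive open mode. That stand-in leaves a stale file behind if the holder dies, which is exactly what flock(2) avoids, so a real implementation would presumably need a NIF or port. StackLockSketch, its function names, and the lock path are hypothetical.

```elixir
defmodule StackLockSketch do
  # Tries to become the owner of a stack directory by creating the lock
  # file exclusively. Returns {:error, :held} if it already exists.
  def acquire(lock_path) do
    case File.open(lock_path, [:write, :exclusive]) do
      {:ok, io} ->
        # Record the OS pid of the holder for debugging.
        IO.write(io, "#{System.pid()}\n")
        File.close(io)
        {:ok, lock_path}

      {:error, :eexist} ->
        {:error, :held}

      {:error, reason} ->
        {:error, reason}
    end
  end

  # Called by the draining instance once it has stopped writing.
  def release(lock_path), do: File.rm(lock_path)

  # Runs destructive_init only if this instance now owns the lock;
  # otherwise reports that init must be deferred. The lock stays held
  # after a successful init, for the lifetime of the active instance.
  def init_if_owner(lock_path, destructive_init) when is_function(destructive_init, 0) do
    case acquire(lock_path) do
      {:ok, _path} -> {:ok, destructive_init.()}
      {:error, :held} -> {:deferred, :previous_owner_active}
      {:error, reason} -> {:error, reason}
    end
  end
end
```

The lock file would have to live outside the tree that cleanup_all! removes, otherwise the owner would delete its own lock during init.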

Secondary — fail safe even if a race slips through:

  • init_worker/1 must satisfy the NimblePool contract: on open failure, retry via {:async, …}/backoff or raise a typed, handled error — never return bare {:error, _} (fixes 8XN).
  • delete_corrupt_db/1 should not rm_rf a directory that a peer may be using; at minimum guard with the ownership lock above, and distinguish "db genuinely corrupt" from "dir pulled out from under me / open failed transiently."
  • PureFileStorage write path (File.open!/File.write!, write_loop.ex ensure_json_file_open/2): when the dir was removed during a known draining/handoff, stop cleanly instead of crash-looping. The existing shape_gone?/1 check (pure_file_storage.ex:1408) is racy.

Do not simply make these ops retry/ignore :enoent — that would mask genuine on-disk corruption. The fix is coordination + controlled shutdown, not blind tolerance.
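The distinction between tolerance and controlled shutdown can be made concrete. In the sketch below, a missing directory leads to a clean stop only when the writer already knows it is in a handoff; in every other case :enoent is still reported as a storage error. WriteFailureSketch and the handoff? flag are hypothetical, and the non-raising File.write/2 stands in for the File.write!/File.open! calls named above.

```elixir
defmodule WriteFailureSketch do
  # Writes data and maps the outcome to a process-level decision.
  # ctx.handoff? is a hypothetical flag meaning "this instance knows it is
  # draining and a successor may have taken over the directory".
  def write(path, data, ctx) do
    path |> File.write(data) |> classify(ctx)
  end

  # Successful write: keep going.
  def classify(:ok, _ctx), do: :continue

  # Directory vanished during a known handoff: stop without crashing.
  def classify({:error, :enoent}, %{handoff?: true}), do: {:stop, :normal}

  # Any other failure, including :enoent outside a handoff, is surfaced.
  def classify({:error, reason}, _ctx), do: {:stop, {:storage_error, reason}}
end
```

A GenServer-style writer could return these tuples directly from its callbacks, so an expected handoff exits with reason :normal while unexpected failures still reach the supervisor and Sentry.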

Notes

  • Related to #4595, "AsyncDeleter deadlocks on full disk: crashes on boot with misleading :enoent, never reclaims trash" (AsyncDeleter resilient boot). Both concern boot-time storage cleanup robustness, but this is a concurrency/ownership bug specific to the rolling-deploy overlap, not a full-disk bug. cleanup_all! reaches AsyncDeleter.delete, so the two interact.
  • The triggering deploy ("Deploy new Electric, hibernate-then-suspend, GC tweaks") also shipped GC/hibernate changes touching storage lifecycle — worth confirming those paths don't independently call cleanup_all!/delete_corrupt_db on live stacks (a candidate explanation for the standalone 8XQ burst at 14:49Z, which falls outside the deploy window).

🤖 Generated with Claude Code
