During a rolling production deploy, two Electric instances run concurrently on the same host and bind-mount the same data directory (/var/electric/<stack_id>/...). The shape storage layer assumes it is the sole owner of that directory, so the starting instance runs destructive init (File.rm_rf, cleanup_all!) on directories the draining instance still has open — producing a burst of :enoent crashes and at least one hard NimblePool contract violation.
This was observed in production on 2026-06-17 across two hosts (ampere, faraday) within the 13:39:58–14:08:33Z "Deploy production stack" window. Five distinct Sentry issues fired, all tracing to the same root cause.
There are three independent ways to enter the destructive path, ordered from "most innocent starting state" to "least":
1. Version / schema / OTP-release change (the common deploy case, healthy db). Per the doc on ShapeStatus.initialize/1 (shape_status.ex:45-53), "a change to @version, to ShapeDb.Connection's @schema_version, or to the OTP release, will result in an empty database." validate_existing_shapes → count_shapes then returns 0, which trips the valid_shape_count == 0 branch (shape_status.ex:64-68) and calls cleanup_all!, wiping the whole <stack_id> dir the draining instance is actively writing into. A rolling deploy is exactly when the release changes, so this is close to the default deploy, not an edge case. The db was perfectly healthy.
2. … valid_shape_count == 0 → cleanup_all! branch.
3. Any open failure or integrity-check failure. delete_corrupt_db (rm_rf of the shape-db parent dir) fires on the integrity-check branch (connection.ex:271-278) and on any Sqlite3.open error (connection.ex:284-289).
Once any one of these fires, it is self-sustaining and runs on healthy data: instance A's rm_rf makes instance B's next file op fail with :enoent, which (via path 3) triggers another wipe, which breaks A, and so on — the two instances ping-pong wiping + recreating the same dir. Corruption is just one of several ways to enter the loop; the loop itself does not need a corrupt db to keep going. This is why the fix must be coordination + controlled shutdown (an ownership lock), not blind :enoent tolerance — see the proposed fix below.
Deployment context
Electric Cloud runs each logical instance as an ECS service on EC2 with deploymentMinimumHealthyPercent: 100 / maximumPercent: 200. The data dir is a host bind-mount (/mnt/nvme/electric/<instance> → container /var/electric) attached to the EC2 instance — it is not a per-task volume. So during a rolling deploy the old and new tasks share the same on-disk directory tree, keyed on the logical instance name, for the whole overlap window. The new task waits on the upstream replication-slot advisory lock before going read-write, but storage init/cleanup is not gated by that lock — it runs eagerly at boot.
Root cause — destructive init assumes sole ownership
1. Startup cleanup wipes the entire stack dir. lib/electric/shape_cache/shape_status.ex:58-68: on every boot, if valid_shape_count == 0 (true when the SQLite db was just deleted or corrupt, and also when a version, schema, or OTP-release change produced an empty db), it calls Storage.cleanup_all!():
    if valid_shape_count == 0 do
      # delete any orphaned shape data
      stack_id
      |> Electric.ShapeCache.Storage.for_stack()
      |> Electric.ShapeCache.Storage.cleanup_all!()
    end
lib/electric/shape_cache/pure_file_storage.ex:231 — cleanup_all!/1 deletes the whole base_path (= /var/electric/<stack_id>/) and tmp_dir, including the log/ and metadata/ directories the draining instance still holds open.
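The effect on the draining instance can be reproduced in miniature with nothing but Elixir's File module. This is only a sketch: the temp directory, the `metadata/` layout and the `offset.bin.tmp` name are placeholders rather than Electric's real paths, and the two "instances" are simulated sequentially in a single process.

```elixir
# Hypothetical reproduction; paths and file names are illustrative only.
base = Path.join(System.tmp_dir!(), "stack-demo-#{System.unique_integer([:positive])}")
meta = Path.join(base, "metadata")
File.mkdir_p!(meta)

# The "draining" instance writes successfully while the tree exists.
:ok = File.write(Path.join(meta, "offset.bin.tmp"), "0_0")

# The "starting" instance assumes sole ownership and wipes the whole tree.
{:ok, _removed} = File.rm_rf(base)

# The draining instance's next write now targets a missing directory.
result = File.write(Path.join(meta, "offset.bin.tmp"), "0_1")
IO.inspect(result)
```

The second write returns `{:error, :enoent}`; the bang variants used on the real write path (File.write!, File.open!) would raise instead, which matches the crashes described in this report.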
2. SQLite "recovery" deletes the shape-db parent dir. delete_corrupt_db (lib/electric/shape_cache/shape_status/shape_db/connection.ex:294-298) is called from open_with_recovery/4 on any open failure or integrity-check failure (connection.ex:276, 287). During overlap, the two instances ping-pong, wiping and recreating the same dir, so each starves and corrupts the other.
3. NimblePool contract violation turns an open failure into an unhandled crash.
When recovery exhausts its attempts (connection.ex:258-260):
    defp open_with_recovery(db_path, _pool_state, _opts, 0) do
      Logger.error("Unable to create database at #{db_path}") # <- ELECTRIC-8XK
      {:error, "failed to open #{db_path}"}
    end
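init_worker/1 (connection.ex:157-162) propagates that {:error, _} verbatim. But NimblePool requires {:ok, worker, pool_state} | {:async, fun, pool_state}, so {:error, _} raises RuntimeError: unexpected return from ...init_worker/1 (ELECTRIC-8XN). The pooled path's Query.prepare!/2 similarly raises ShapeDb.Error: prepare_stmts failed: {:error, 1} (ELECTRIC-8XQ, a 492-event/sec crash-storm).
Observed Sentry issues (electricsql-04, all 2026-06-17)
Unable to create database at … | open_with_recovery/4 exhausted
RuntimeError: unexpected return from …init_worker/1 (got {:error, "failed to open …sqlite"})
ShapeDb.Error: prepare_stmts failed: {:error, 1} (492 events/sec, escalating) | Query.prepare!/2 against wiped/contended db
File.write! …/metadata/last_persisted_txn_offset.bin.tmp: :enoent | PureFileStorage.WriteLoop writing into a deleted dir
File.open! …/log/log.latest.0.jsonfile.bin: :enoent | WriteLoop.ensure_json_file_open/2 into a deleted dir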
Proposed fix (layered)
Primary — gate destructive storage init on directory ownership, not just the Postgres slot. The starting instance must not run cleanup_all! / delete_corrupt_db (or open the storage for write) on a shared <stack_id> dir until the previous owner has released it. Suggested: an on-disk lock acquired via flock(2) on a per-stack lock file, held by the active instance; defer all destructive/write-mode storage setup until the lock is held. This extends the existing replication-slot handoff to cover the filesystem.
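One way to picture the ownership gate is the sketch below. It is not flock(2): Elixir's standard library has no flock binding, so this swaps in exclusive file creation (`File.open/2` with `:exclusive`), which has a known weakness that flock does not, namely that a crashed holder leaves a stale lock file behind. The module name, the `<stack_dir>.lock` location (kept outside the stack dir so a wipe cannot remove it) and the return shapes are all assumptions made for illustration.

```elixir
defmodule StackDirLock do
  @moduledoc """
  Hypothetical ownership lock for a shared stack directory.
  Uses exclusive file creation rather than flock(2); stale locks are not handled.
  """

  # Try to become the owner of `stack_dir`.
  # Returns {:ok, lock_path}, {:error, :held}, or {:error, reason}.
  def acquire(stack_dir) do
    # The lock lives next to the stack dir, not inside it.
    lock_path = stack_dir <> ".lock"
    File.mkdir_p!(Path.dirname(lock_path))

    case File.open(lock_path, [:write, :exclusive]) do
      {:ok, io} ->
        # Record the OS pid of the holder for debugging.
        IO.binwrite(io, System.pid())
        File.close(io)
        {:ok, lock_path}

      {:error, :eexist} ->
        {:error, :held}

      {:error, reason} ->
        {:error, reason}
    end
  end

  # Give up ownership so the next instance can proceed.
  def release(lock_path), do: File.rm(lock_path)

  # Run `destructive_init` only once ownership is held; otherwise defer.
  def init_if_owner(stack_dir, destructive_init) when is_function(destructive_init, 0) do
    case acquire(stack_dir) do
      {:ok, lock_path} -> {:ok, lock_path, destructive_init.()}
      {:error, :held} -> {:deferred, :previous_owner_still_active}
      {:error, reason} -> {:error, reason}
    end
  end
end
```

A starting instance would call `init_if_owner/2` with its cleanup as the function argument and retry later on `{:deferred, _}`; the draining instance would call `release/1` as the last step of shutdown.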
Secondary — fail safe even if a race slips through:
init_worker/1 must satisfy the NimblePool contract: on open failure, retry via {:async, …}/backoff or raise a typed, handled error — never return bare {:error, _} (fixes 8XN).
delete_corrupt_db/1 should not rm_rf a directory that a peer may be using; at minimum guard with the ownership lock above, and distinguish "db genuinely corrupt" from "dir pulled out from under me / open failed transiently."
PureFileStorage write path (File.open!/File.write!, write_loop.ex ensure_json_file_open/2): when the dir was removed during a known draining/handoff, stop cleanly instead of crash-looping. The existing shape_gone?/1 check (pure_file_storage.ex:1408) is racy.
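For the first item in this list, a contract-conforming init_worker/1 might look roughly like the sketch below. It does not depend on NimblePool; it only mirrors the callback's documented return shape `{:async, fun, pool_state}`. The `OpenError` exception, the `pool_state` keys (`:db_path`, `:open_fun`, `:attempts`, `:backoff_ms`) and the retry policy are assumptions, and how a raise inside the async function should be supervised is left open here.

```elixir
defmodule ShapeDbWorkerSketch do
  # Hypothetical typed error; not part of Electric or NimblePool.
  defmodule OpenError do
    defexception [:path, :reason]

    @impl true
    def message(%{path: path, reason: reason}),
      do: "failed to open #{path}: #{inspect(reason)}"
  end

  # Shaped like NimblePool's init_worker/1: returns {:async, fun, pool_state}
  # and never a bare {:error, _}. The fun returns the worker state.
  def init_worker(pool_state) do
    {:async, fn -> open_with_backoff(pool_state, pool_state.attempts) end, pool_state}
  end

  # Attempts exhausted: raise a typed error instead of returning a tuple.
  defp open_with_backoff(%{db_path: path}, 0),
    do: raise(OpenError, path: path, reason: :retries_exhausted)

  # Try to open; on failure wait `backoff_ms` and try again.
  defp open_with_backoff(%{db_path: path, open_fun: open_fun} = state, attempts_left) do
    case open_fun.(path) do
      {:ok, conn} ->
        conn

      {:error, _reason} ->
        Process.sleep(state.backoff_ms)
        open_with_backoff(state, attempts_left - 1)
    end
  end
end
```

Injecting `open_fun` keeps the sketch testable without SQLite; in the real module it would be whatever open_with_recovery/4 currently wraps.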
Do not simply make these ops retry/ignore :enoent — that would mask genuine on-disk corruption. The fix is coordination + controlled shutdown, not blind tolerance.
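The distinction between controlled shutdown and blind tolerance can be sketched as a small wrapper around a write. Everything named here is hypothetical: `write_or_stop/3`, the `handing_off?` flag (standing in for however the real system would learn that a peer now owns the directory) and the return tuples are illustrations, not existing Electric APIs.

```elixir
defmodule WritePathSketch do
  # Write `data` to `path`, treating a missing directory as a clean stop
  # only when a handoff is known to be in progress.
  def write_or_stop(path, data, handing_off?) do
    case File.write(path, data) do
      :ok ->
        :ok

      {:error, :enoent} when handing_off? ->
        # Directory removed during a known handoff: shut down in an orderly way.
        {:stop, :handoff}

      {:error, reason} ->
        # Anything else, including an unexpected :enoent, stays a loud failure.
        {:error, {:storage_failure, path, reason}}
    end
  end
end
```

The point of the guard is that `:enoent` is never swallowed on its own; without the handoff signal it is reported like any other storage failure.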
Notes
… cleanup_all! reaches AsyncDeleter.delete, so the two interact.
The triggering deploy ("Deploy new Electric, hibernate-then-suspend, GC tweaks") also shipped GC/hibernate changes touching storage lifecycle. It is worth confirming those paths don't independently call cleanup_all!/delete_corrupt_db on live stacks (a candidate explanation for the standalone 8XQ burst at 14:49Z, which falls outside the deploy window).