Summary
In Electric Cloud production (observed across sync-service 1.6.0–1.6.10) we see a steady stream of errors of the form:
Error processing replication event (NNNms retry budget left): ...
Error dispatching replication event (NNNms retry budget left): ** (exit) exited in: Electric.Replication.ShapeLogCollector.handle_event_async(...)
** (EXIT) no process: the process is not alive ...
Investigation shows these are one mechanism with several underlying gaps:
- The hot replication path (
ShapeLogCollector → Partitions → Inspector) does synchronous Postgres introspection but only handles one error value, {:error, :connection_not_available}. Every other legal return of the inspector contract crashes the ShapeLogCollector.
- The SLC crash takes down the whole
Electric.Shapes.Supervisor (:one_for_all), and while it restarts (which includes shape restore, potentially seconds to minutes), ReplicationClient retries the pending event every 50 ms, logging a full error — including the entire inspected event payload — on every attempt. A single incident can emit up to ~12,000 error log lines (10-minute budget / 50 ms), which flood logs and any error tracker attached to the logger.
Root causes (line refs at current main, 9b360dd9d)
1. Partitions.handle_relation is missing clauses for :table_not_found and generic {:error, reason}
packages/sync-service/lib/electric/shapes/partitions.ex — the case Inspector.load_relation_info(...) in handle_relation/2 (~L126) only matches {:ok, %{parent: ...}}, {:ok, _}, and {:error, :connection_not_available}.
But the inspector legitimately returns:
- bare
:table_not_found — e.g. when a table is dropped/renamed (or dropped+recreated with a new OID) between the WAL record being written and Electric introspecting it, such as while replaying backlog after a reconnect. Observed: {:case_clause, :table_not_found} crashes. Note add_shape/3 in the same module does handle :table_not_found (~L67), so this looks like an omission.
- in-band DB errors: the relation query stringifies errors (
{:error, Exception.message(err)} in direct_inspector.ex ~L97), and the column query returns raw {:error, %Postgrex.Error{}}. Observed crashes include {:case_clause, {:error, "ERROR 53200 (out_of_memory) ..."}}, {:case_clause, {:error, %Postgrex.Error{postgres: %{code: :query_canceled}}}} (source DB statement_timeout), and {:case_clause, {:error, "ssl recv: closed"}}. I.e. any hiccup on the customer's database crashes Electric's replication pipeline.
2. pk_cols_of_relation leaks bare :table_not_found into Utils.map_while_ok
shape_log_collector.ex pk_cols_of_relation/2 is a with over Inspector.load_relation_oid/load_column_info; a bare :table_not_found falls through and reaches map_while_ok's mapper case (utils.ex:84), whose contract is {:ok, _} | {:error, _} → CaseClauseError. Same trigger scenario as above, different crash site.
3. Shape-restore path crashes on :connection_not_available
shape_log_collector.ex handle_continue(:restore_shapes) (~L282) does a bare match:
{:ok, partitions} = Partitions.add_shape(partitions, shape_handle, shape)
add_shape returns {:error, :connection_not_available} when the pool is unavailable, producing a badmatch crash during restart — i.e. exactly when the DB is likely still unhappy after one of the crashes above, turning one crash into a restart loop. The other add_shape call site (in the shape-registration update path) handles this value gracefully.
4. Connection-class in-band errors bypass the :connection_not_available classification
EtsInspector's wrap_in_db_errors only converts raised DBConnection.ConnectionError/checkout exits into {:error, :connection_not_available}. Connection errors returned in-band as tuples (e.g. "ssl recv: closed") flow through unclassified and hit either handle_relation's case_clause or add_shape's catch-all raise RuntimeError, "Unable to introspect table ..." (partitions.ex ~L75).
Amplification: an error logged every 50 ms, with the full event payload
replication_client.ex:
@event_retry_delay 50
@max_event_retry_time 10 * 60_000
Both retry paths (apply_event's catch for dispatch failures, and the :DOWN handler for handler crashes) log at error level on every attempt before re-scheduling 50 ms later. Until #4105 (1.6.0) these retries were completely silent (catch _, _ with no logging), so the pendulum swung from zero visibility to up to 20 error lines/second for up to 10 minutes per event.
Two extra problems with the log line itself:
- The dispatch-path message embeds
Exception.format(...) of the exit, which includes the full inspected %TransactionFragment{} — i.e. complete customer row data ends up in logs (and in any log-capturing error tracker) on every retry.
- Because the interpolated payload differs per event, error trackers can't group these — each incident shows up as a new issue.
Suggested shape of a fix for the logging: log the first failure at error (or warning), subsequent retries at debug or periodically (e.g. every 5–10 s), always log budget exhaustion at error; and log the event identity (xid/LSN/relation) rather than the full inspected payload.
Suggested fixes for the crashes
Partitions.handle_relation: handle :table_not_found like add_shape does, and decide an explicit policy for other {:error, _} (retryable vs. skip) instead of crashing.
pk_cols_of_relation: normalize bare :table_not_found before it reaches map_while_ok.
- Restore path: handle
{:error, :connection_not_available} from add_shape (retry/defer) instead of the bare {:ok, _} = match.
EtsInspector/DirectInspector: classify connection-class in-band errors (TLS closed, etc.) as :connection_not_available so the existing recovery path applies.
Context
Summary
In Electric Cloud production (observed across sync-service 1.6.0–1.6.10) we see a steady stream of errors of the form:
Investigation shows these are one mechanism with several underlying gaps:
ShapeLogCollector→Partitions→Inspector) does synchronous Postgres introspection but only handles one error value,{:error, :connection_not_available}. Every other legal return of the inspector contract crashes theShapeLogCollector.Electric.Shapes.Supervisor(:one_for_all), and while it restarts (which includes shape restore, potentially seconds to minutes),ReplicationClientretries the pending event every 50 ms, logging a full error — including the entire inspected event payload — on every attempt. A single incident can emit up to ~12,000 error log lines (10-minute budget / 50 ms), which flood logs and any error tracker attached to the logger.Root causes (line refs at current main,
9b360dd9d)1.
Partitions.handle_relationis missing clauses for:table_not_foundand generic{:error, reason}packages/sync-service/lib/electric/shapes/partitions.ex— thecase Inspector.load_relation_info(...)inhandle_relation/2(~L126) only matches{:ok, %{parent: ...}},{:ok, _}, and{:error, :connection_not_available}.But the inspector legitimately returns:
:table_not_found— e.g. when a table is dropped/renamed (or dropped+recreated with a new OID) between the WAL record being written and Electric introspecting it, such as while replaying backlog after a reconnect. Observed:{:case_clause, :table_not_found}crashes. Noteadd_shape/3in the same module does handle:table_not_found(~L67), so this looks like an omission.{:error, Exception.message(err)}indirect_inspector.ex~L97), and the column query returns raw{:error, %Postgrex.Error{}}. Observed crashes include{:case_clause, {:error, "ERROR 53200 (out_of_memory) ..."}},{:case_clause, {:error, %Postgrex.Error{postgres: %{code: :query_canceled}}}}(source DBstatement_timeout), and{:case_clause, {:error, "ssl recv: closed"}}. I.e. any hiccup on the customer's database crashes Electric's replication pipeline.2.
pk_cols_of_relationleaks bare:table_not_foundintoUtils.map_while_okshape_log_collector.expk_cols_of_relation/2is awithoverInspector.load_relation_oid/load_column_info; a bare:table_not_foundfalls through and reachesmap_while_ok's mappercase(utils.ex:84), whose contract is{:ok, _} | {:error, _}→CaseClauseError. Same trigger scenario as above, different crash site.3. Shape-restore path crashes on
:connection_not_availableshape_log_collector.exhandle_continue(:restore_shapes)(~L282) does a bare match:add_shapereturns{:error, :connection_not_available}when the pool is unavailable, producing abadmatchcrash during restart — i.e. exactly when the DB is likely still unhappy after one of the crashes above, turning one crash into a restart loop. The otheradd_shapecall site (in the shape-registration update path) handles this value gracefully.4. Connection-class in-band errors bypass the
:connection_not_availableclassificationEtsInspector'swrap_in_db_errorsonly converts raisedDBConnection.ConnectionError/checkout exits into{:error, :connection_not_available}. Connection errors returned in-band as tuples (e.g."ssl recv: closed") flow through unclassified and hit eitherhandle_relation'scase_clauseoradd_shape's catch-allraise RuntimeError, "Unable to introspect table ..."(partitions.ex~L75).Amplification: an error logged every 50 ms, with the full event payload
replication_client.ex:Both retry paths (
apply_event'scatchfor dispatch failures, and the:DOWNhandler for handler crashes) log aterrorlevel on every attempt before re-scheduling 50 ms later. Until #4105 (1.6.0) these retries were completely silent (catch _, _with no logging), so the pendulum swung from zero visibility to up to 20 error lines/second for up to 10 minutes per event.Two extra problems with the log line itself:
Exception.format(...)of the exit, which includes the fullinspected%TransactionFragment{}— i.e. complete customer row data ends up in logs (and in any log-capturing error tracker) on every retry.Suggested shape of a fix for the logging: log the first failure at
error(orwarning), subsequent retries atdebugor periodically (e.g. every 5–10 s), always log budget exhaustion aterror; and log the event identity (xid/LSN/relation) rather than the full inspected payload.Suggested fixes for the crashes
Partitions.handle_relation: handle:table_not_foundlikeadd_shapedoes, and decide an explicit policy for other{:error, _}(retryable vs. skip) instead of crashing.pk_cols_of_relation: normalize bare:table_not_foundbefore it reachesmap_while_ok.{:error, :connection_not_available}fromadd_shape(retry/defer) instead of the bare{:ok, _} =match.EtsInspector/DirectInspector: classify connection-class in-band errors (TLS closed, etc.) as:connection_not_availableso the existing recovery path applies.Context
ReplicationClientonShapeLogCollectormissing #3097 (1.1.10, silent); per-attempt logging added in the async redesign in Fix replication connection drops from wal_sender_timeout during backpressure #4105 (1.6.0).9b360dd9d); the only post-1.6.7 change to the involved files is telemetry (feat(sync-service): record per-transaction fragment wall-time #4504).