Skip to content

Unhandled inspector errors crash ShapeLogCollector; ReplicationClient then logs a full error every 50ms for up to 10 minutes #4563

Description

@erik-the-implementer

Summary

In Electric Cloud production (observed across sync-service 1.6.0–1.6.10) we see a steady stream of errors of the form:

Error processing replication event (NNNms retry budget left): ...
Error dispatching replication event (NNNms retry budget left): ** (exit) exited in: Electric.Replication.ShapeLogCollector.handle_event_async(...)
    ** (EXIT) no process: the process is not alive ...

Investigation shows these are one mechanism with several underlying gaps:

  1. The hot replication path (ShapeLogCollectorPartitionsInspector) does synchronous Postgres introspection but only handles one error value, {:error, :connection_not_available}. Every other legal return of the inspector contract crashes the ShapeLogCollector.
  2. The SLC crash takes down the whole Electric.Shapes.Supervisor (:one_for_all), and while it restarts (which includes shape restore, potentially seconds to minutes), ReplicationClient retries the pending event every 50 ms, logging a full error — including the entire inspected event payload — on every attempt. A single incident can emit up to ~12,000 error log lines (10-minute budget / 50 ms), which flood logs and any error tracker attached to the logger.

Root causes (line refs at current main, 9b360dd9d)

1. Partitions.handle_relation is missing clauses for :table_not_found and generic {:error, reason}

packages/sync-service/lib/electric/shapes/partitions.ex — the case Inspector.load_relation_info(...) in handle_relation/2 (~L126) only matches {:ok, %{parent: ...}}, {:ok, _}, and {:error, :connection_not_available}.

But the inspector legitimately returns:

  • bare :table_not_found — e.g. when a table is dropped/renamed (or dropped+recreated with a new OID) between the WAL record being written and Electric introspecting it, such as while replaying backlog after a reconnect. Observed: {:case_clause, :table_not_found} crashes. Note add_shape/3 in the same module does handle :table_not_found (~L67), so this looks like an omission.
  • in-band DB errors: the relation query stringifies errors ({:error, Exception.message(err)} in direct_inspector.ex ~L97), and the column query returns raw {:error, %Postgrex.Error{}}. Observed crashes include {:case_clause, {:error, "ERROR 53200 (out_of_memory) ..."}}, {:case_clause, {:error, %Postgrex.Error{postgres: %{code: :query_canceled}}}} (source DB statement_timeout), and {:case_clause, {:error, "ssl recv: closed"}}. I.e. any hiccup on the customer's database crashes Electric's replication pipeline.

2. pk_cols_of_relation leaks bare :table_not_found into Utils.map_while_ok

shape_log_collector.ex pk_cols_of_relation/2 is a with over Inspector.load_relation_oid/load_column_info; a bare :table_not_found falls through and reaches map_while_ok's mapper case (utils.ex:84), whose contract is {:ok, _} | {:error, _}CaseClauseError. Same trigger scenario as above, different crash site.

3. Shape-restore path crashes on :connection_not_available

shape_log_collector.ex handle_continue(:restore_shapes) (~L282) does a bare match:

{:ok, partitions} = Partitions.add_shape(partitions, shape_handle, shape)

add_shape returns {:error, :connection_not_available} when the pool is unavailable, producing a badmatch crash during restart — i.e. exactly when the DB is likely still unhappy after one of the crashes above, turning one crash into a restart loop. The other add_shape call site (in the shape-registration update path) handles this value gracefully.

4. Connection-class in-band errors bypass the :connection_not_available classification

EtsInspector's wrap_in_db_errors only converts raised DBConnection.ConnectionError/checkout exits into {:error, :connection_not_available}. Connection errors returned in-band as tuples (e.g. "ssl recv: closed") flow through unclassified and hit either handle_relation's case_clause or add_shape's catch-all raise RuntimeError, "Unable to introspect table ..." (partitions.ex ~L75).

Amplification: an error logged every 50 ms, with the full event payload

replication_client.ex:

@event_retry_delay 50
@max_event_retry_time 10 * 60_000

Both retry paths (apply_event's catch for dispatch failures, and the :DOWN handler for handler crashes) log at error level on every attempt before re-scheduling 50 ms later. Until #4105 (1.6.0) these retries were completely silent (catch _, _ with no logging), so the pendulum swung from zero visibility to up to 20 error lines/second for up to 10 minutes per event.

Two extra problems with the log line itself:

  • The dispatch-path message embeds Exception.format(...) of the exit, which includes the full inspected %TransactionFragment{} — i.e. complete customer row data ends up in logs (and in any log-capturing error tracker) on every retry.
  • Because the interpolated payload differs per event, error trackers can't group these — each incident shows up as a new issue.

Suggested shape of a fix for the logging: log the first failure at error (or warning), subsequent retries at debug or periodically (e.g. every 5–10 s), always log budget exhaustion at error; and log the event identity (xid/LSN/relation) rather than the full inspected payload.

Suggested fixes for the crashes

  • Partitions.handle_relation: handle :table_not_found like add_shape does, and decide an explicit policy for other {:error, _} (retryable vs. skip) instead of crashing.
  • pk_cols_of_relation: normalize bare :table_not_found before it reaches map_while_ok.
  • Restore path: handle {:error, :connection_not_available} from add_shape (retry/defer) instead of the bare {:ok, _} = match.
  • EtsInspector/DirectInspector: classify connection-class in-band errors (TLS closed, etc.) as :connection_not_available so the existing recovery path applies.

Context

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions