Unhandled inspector errors crash ShapeLogCollector; ReplicationClient then logs a full error every 50ms for up to 10 minutes

## Summary

In Electric Cloud production (observed across sync-service 1.6.0–1.6.10) we see a steady stream of errors of the form:

```
Error processing replication event (NNNms retry budget left): ...
Error dispatching replication event (NNNms retry budget left): ** (exit) exited in: Electric.Replication.ShapeLogCollector.handle_event_async(...)
    ** (EXIT) no process: the process is not alive ...
```

Investigation shows these are one mechanism with several underlying gaps:

1. The hot replication path (`ShapeLogCollector` → `Partitions` → `Inspector`) does synchronous Postgres introspection but only handles one error value, `{:error, :connection_not_available}`. Every other legal return of the inspector contract crashes the `ShapeLogCollector`.
2. The SLC crash takes down the whole `Electric.Shapes.Supervisor` (`:one_for_all`), and while it restarts (which includes shape restore, potentially seconds to minutes), `ReplicationClient` retries the pending event **every 50 ms**, logging a full error — including the entire inspected event payload — on every attempt. A single incident can emit up to ~12,000 error log lines (10-minute budget / 50 ms), which flood logs and any error tracker attached to the logger.

## Root causes (line refs at current main, `9b360dd9d`)

### 1. `Partitions.handle_relation` is missing clauses for `:table_not_found` and generic `{:error, reason}`

`packages/sync-service/lib/electric/shapes/partitions.ex` — the `case Inspector.load_relation_info(...)` in `handle_relation/2` (~L126) only matches `{:ok, %{parent: ...}}`, `{:ok, _}`, and `{:error, :connection_not_available}`.

But the inspector legitimately returns:

- bare `:table_not_found` — e.g. when a table is dropped/renamed (or dropped+recreated with a new OID) between the WAL record being written and Electric introspecting it, such as while replaying backlog after a reconnect. Observed: `{:case_clause, :table_not_found}` crashes. Note `add_shape/3` in the same module *does* handle `:table_not_found` (~L67), so this looks like an omission.
- in-band DB errors: the relation query stringifies errors (`{:error, Exception.message(err)}` in `direct_inspector.ex` ~L97), and the column query returns raw `{:error, %Postgrex.Error{}}`. Observed crashes include `{:case_clause, {:error, "ERROR 53200 (out_of_memory) ..."}}`, `{:case_clause, {:error, %Postgrex.Error{postgres: %{code: :query_canceled}}}}` (source DB `statement_timeout`), and `{:case_clause, {:error, "ssl recv: closed"}}`. I.e. any hiccup on the *customer's* database crashes Electric's replication pipeline.

### 2. `pk_cols_of_relation` leaks bare `:table_not_found` into `Utils.map_while_ok`

`shape_log_collector.ex` `pk_cols_of_relation/2` is a `with` over `Inspector.load_relation_oid/load_column_info`; a bare `:table_not_found` falls through and reaches `map_while_ok`'s mapper `case` (`utils.ex:84`), whose contract is `{:ok, _} | {:error, _}` → `CaseClauseError`. Same trigger scenario as above, different crash site.

### 3. Shape-restore path crashes on `:connection_not_available`

`shape_log_collector.ex` `handle_continue(:restore_shapes)` (~L282) does a bare match:

```elixir
{:ok, partitions} = Partitions.add_shape(partitions, shape_handle, shape)
```

`add_shape` returns `{:error, :connection_not_available}` when the pool is unavailable, producing a `badmatch` crash *during restart* — i.e. exactly when the DB is likely still unhappy after one of the crashes above, turning one crash into a restart loop. The other `add_shape` call site (in the shape-registration update path) handles this value gracefully.

### 4. Connection-class in-band errors bypass the `:connection_not_available` classification

`EtsInspector`'s `wrap_in_db_errors` only converts *raised* `DBConnection.ConnectionError`/checkout exits into `{:error, :connection_not_available}`. Connection errors returned in-band as tuples (e.g. `"ssl recv: closed"`) flow through unclassified and hit either `handle_relation`'s `case_clause` or `add_shape`'s catch-all `raise RuntimeError, "Unable to introspect table ..."` (`partitions.ex` ~L75).

## Amplification: an error logged every 50 ms, with the full event payload

`replication_client.ex`:

```elixir
@event_retry_delay 50
@max_event_retry_time 10 * 60_000
```

Both retry paths (`apply_event`'s `catch` for dispatch failures, and the `:DOWN` handler for handler crashes) log at `error` level **on every attempt** before re-scheduling 50 ms later. Until #4105 (1.6.0) these retries were completely silent (`catch _, _` with no logging), so the pendulum swung from zero visibility to up to 20 error lines/second for up to 10 minutes per event.

Two extra problems with the log line itself:

- The dispatch-path message embeds `Exception.format(...)` of the exit, which includes the full `inspect`ed `%TransactionFragment{}` — i.e. complete customer row data ends up in logs (and in any log-capturing error tracker) on every retry.
- Because the interpolated payload differs per event, error trackers can't group these — each incident shows up as a new issue.

Suggested shape of a fix for the logging: log the first failure at `error` (or `warning`), subsequent retries at `debug` or periodically (e.g. every 5–10 s), always log budget exhaustion at `error`; and log the event identity (xid/LSN/relation) rather than the full inspected payload.

## Suggested fixes for the crashes

- `Partitions.handle_relation`: handle `:table_not_found` like `add_shape` does, and decide an explicit policy for other `{:error, _}` (retryable vs. skip) instead of crashing.
- `pk_cols_of_relation`: normalize bare `:table_not_found` before it reaches `map_while_ok`.
- Restore path: handle `{:error, :connection_not_available}` from `add_shape` (retry/defer) instead of the bare `{:ok, _} =` match.
- `EtsInspector`/`DirectInspector`: classify connection-class in-band errors (TLS closed, etc.) as `:connection_not_available` so the existing recovery path applies.

## Context

- Retry loop introduced in #3097 (1.1.10, silent); per-attempt logging added in the async redesign in #4105 (1.6.0).
- All of the above verified present at current main (`9b360dd9d`); the only post-1.6.7 change to the involved files is telemetry (#4504).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Unhandled inspector errors crash ShapeLogCollector; ReplicationClient then logs a full error every 50ms for up to 10 minutes #4563

Summary

Root causes (line refs at current main, `9b360dd9d`)

1. `Partitions.handle_relation` is missing clauses for `:table_not_found` and generic `{:error, reason}`

2. `pk_cols_of_relation` leaks bare `:table_not_found` into `Utils.map_while_ok`

3. Shape-restore path crashes on `:connection_not_available`

4. Connection-class in-band errors bypass the `:connection_not_available` classification

Amplification: an error logged every 50 ms, with the full event payload

Suggested fixes for the crashes

Context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Unhandled inspector errors crash ShapeLogCollector; ReplicationClient then logs a full error every 50ms for up to 10 minutes #4563

Description

Summary

Root causes (line refs at current main, 9b360dd9d)

1. Partitions.handle_relation is missing clauses for :table_not_found and generic {:error, reason}

2. pk_cols_of_relation leaks bare :table_not_found into Utils.map_while_ok

3. Shape-restore path crashes on :connection_not_available

4. Connection-class in-band errors bypass the :connection_not_available classification

Amplification: an error logged every 50 ms, with the full event payload

Suggested fixes for the crashes

Context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Root causes (line refs at current main, `9b360dd9d`)

1. `Partitions.handle_relation` is missing clauses for `:table_not_found` and generic `{:error, reason}`

2. `pk_cols_of_relation` leaks bare `:table_not_found` into `Utils.map_while_ok`

3. Shape-restore path crashes on `:connection_not_available`

4. Connection-class in-band errors bypass the `:connection_not_available` classification