Flaky test: PublicationManagerTest "handles relation tracker restart" races the supervisor restart

## Summary

The test `component restarts handles relation tracker restart` in `packages/sync-service/test/electric/replication/publication_manager_test.exs:641` is flaky. It fails intermittently with a `no process` exit when calling `PublicationManager.remove_shape/2` after restarting the `RelationTracker`.

Observed on an unrelated PR ([#4562](https://github.com/electric-sql/electric/pull/4562), a telemetry-only change), [CI run 27678953105](https://github.com/electric-sql/electric/actions/runs/27678953105/job/81861197105), seed `839006`:

```
1) test component restarts handles relation tracker restart (Electric.Replication.PublicationManagerTest)
   test/electric/replication/publication_manager_test.exs:641
   ** (exit) exited in: GenServer.call({:via, Registry, {:"Electric.ProcessRegistry:...",
        {Electric.Replication.PublicationManager.RelationTracker, nil}}}, {:remove_shape, "..."}, 5000)
      ** (EXIT) no process: the process is not alive or there's no process currently associated
        with the given name, possibly because its application isn't started
   code: PublicationManager.remove_shape(ctx.stack_id, shape_handle)
   stacktrace:
     (elixir 1.19.5) lib/gen_server.ex:1135: GenServer.call/3
     (electric 1.7.2) lib/electric/replication/publication_manager/relation_tracker.ex:77: ...RelationTracker.remove_shape/2
     test/electric/replication/publication_manager_test.exs:660: (test)
```

## Root cause

The test is a timing race:

```elixir
# Stop the RelationTracker - supervisor will restart it
relation_tracker_name = PublicationManager.RelationTracker.name(ctx.stack_id)
GenServer.stop(relation_tracker_name)                 # 653: async restart by supervisor

# After restart, the publication manager should repopulate from ShapeStatus.
# The publication should still have the relation.
assert_pub_tables(ctx, [ctx.relation], 2_000)         # 657: passes IMMEDIATELY

# Verify we can still remove the shape after the restart
PublicationManager.remove_shape(ctx.stack_id, shape_handle)  # 660: (EXIT) no process
assert_pub_tables(ctx, [], 2_000)
```

The `assert_pub_tables` on line 657 is intended to act as the "wait for restart" barrier, but it isn't one. The relation was already in the publication before the restart and the restart never removes it, so the assertion is satisfied **immediately** — well before the supervisor has restarted the `RelationTracker` and re-registered it in `Electric.ProcessRegistry`. Line 660 then does a `GenServer.call` to the via-Registry name and hits `no process` because the new tracker hasn't registered yet.

This subsystem's restart tests have a history of flakiness (see #3573 "Fix flaky restart test"); this is the same class of bug. The test was added in #3359.

## Draft fix

Explicitly wait for the restarted process to be registered (and finished restoring) before issuing the next call, rather than relying on a no-op publication assertion. `RelationTracker` already exposes `wait_for_restore/2`, but it does a `GenServer.call` to the via-name and would itself hit `no process` if called before re-registration — so we first poll until the process is alive.

```elixir
test "handles relation tracker restart", ctx do
  shape = generate_shape(ctx.relation_with_oid, @where_clause_1)

  {:ok, shape_handle} = Electric.ShapeCache.ShapeStatus.add_shape(ctx.stack_id, shape)
  assert :ok = PublicationManager.add_shape(ctx.stack_id, shape_handle, shape)
  assert_pub_tables(ctx, [ctx.relation])

  # Stop the RelationTracker - supervisor will restart it
  relation_tracker_name = PublicationManager.RelationTracker.name(ctx.stack_id)
  GenServer.stop(relation_tracker_name)

  # After restart, the publication manager should repopulate from ShapeStatus.
  # The publication should still have the relation.
  assert_pub_tables(ctx, [ctx.relation], 2_000)

  # Wait for the supervisor to actually restart the RelationTracker and finish
  # restoring before issuing further calls. The assert_pub_tables above passes
  # immediately (the relation is never removed), so it does NOT guarantee the
  # new process has registered yet — without this, remove_shape/2 races the
  # restart and hits the via-Registry name before the process exists.
  wait_until_registered(relation_tracker_name, 2_000)
  PublicationManager.RelationTracker.wait_for_restore(ctx.stack_id, timeout: 2_000)

  # Verify we can still remove the shape after the restart
  PublicationManager.remove_shape(ctx.stack_id, shape_handle)
  assert_pub_tables(ctx, [], 2_000)
end
```

Helper:

```elixir
defp wait_until_registered(name, timeout, start_time \\ :erlang.monotonic_time(:millisecond)) do
  case GenServer.whereis(name) do
    pid when is_pid(pid) ->
      pid

    nil ->
      if :erlang.monotonic_time(:millisecond) - start_time < timeout do
        Process.sleep(10)
        wait_until_registered(name, timeout, start_time)
      else
        flunk("RelationTracker was not re-registered within #{timeout}ms after restart")
      end
  end
end
```

Note: `GenServer.stop/1` is synchronous on the old pid, but `GenServer.whereis/1` can still briefly return the dying pid until the registry entry is cleared and the new one registered — a short `Process.sleep` before the first `whereis` poll, or a `Process.monitor`/`:DOWN` wait on the old pid before polling, makes the wait fully robust. Either refinement is fine; the key point is to stop treating the line-657 assertion as a restart barrier.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Flaky test: PublicationManagerTest "handles relation tracker restart" races the supervisor restart #4606

Summary

Root cause

Draft fix

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Flaky test: PublicationManagerTest "handles relation tracker restart" races the supervisor restart #4606

Description

Summary

Root cause

Draft fix

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions