Summary
The test component restarts handles relation tracker restart in packages/sync-service/test/electric/replication/publication_manager_test.exs:641 is flaky. It fails intermittently with a no process exit when calling PublicationManager.remove_shape/2 after restarting the RelationTracker.
Observed on an unrelated PR (#4562, a telemetry-only change), CI run 27678953105, seed 839006:
1) test component restarts handles relation tracker restart (Electric.Replication.PublicationManagerTest)
test/electric/replication/publication_manager_test.exs:641
** (exit) exited in: GenServer.call({:via, Registry, {:"Electric.ProcessRegistry:...",
{Electric.Replication.PublicationManager.RelationTracker, nil}}}, {:remove_shape, "..."}, 5000)
** (EXIT) no process: the process is not alive or there's no process currently associated
with the given name, possibly because its application isn't started
code: PublicationManager.remove_shape(ctx.stack_id, shape_handle)
stacktrace:
(elixir 1.19.5) lib/gen_server.ex:1135: GenServer.call/3
(electric 1.7.2) lib/electric/replication/publication_manager/relation_tracker.ex:77: ...RelationTracker.remove_shape/2
test/electric/replication/publication_manager_test.exs:660: (test)
Root cause
The test is a timing race:
# Stop the RelationTracker - supervisor will restart it
relation_tracker_name = PublicationManager.RelationTracker.name(ctx.stack_id)
GenServer.stop(relation_tracker_name) # 653: async restart by supervisor
# After restart, the publication manager should repopulate from ShapeStatus.
# The publication should still have the relation.
assert_pub_tables(ctx, [ctx.relation], 2_000) # 657: passes IMMEDIATELY
# Verify we can still remove the shape after the restart
PublicationManager.remove_shape(ctx.stack_id, shape_handle) # 660: (EXIT) no process
assert_pub_tables(ctx, [], 2_000)
The assert_pub_tables on line 657 is intended to act as the "wait for restart" barrier, but it isn't one. The relation was already in the publication before the restart and the restart never removes it, so the assertion is satisfied immediately — well before the supervisor has restarted the RelationTracker and re-registered it in Electric.ProcessRegistry. Line 660 then does a GenServer.call to the via-Registry name and hits no process because the new tracker hasn't registered yet.
This subsystem's restart tests have a history of flakiness (see #3573 "Fix flaky restart test"); this is the same class of bug. The test was added in #3359.
Draft fix
Explicitly wait for the restarted process to be registered (and finished restoring) before issuing the next call, rather than relying on a no-op publication assertion. RelationTracker already exposes wait_for_restore/2, but it does a GenServer.call to the via-name and would itself hit no process if called before re-registration — so we first poll until the process is alive.
test "handles relation tracker restart", ctx do
shape = generate_shape(ctx.relation_with_oid, @where_clause_1)
{:ok, shape_handle} = Electric.ShapeCache.ShapeStatus.add_shape(ctx.stack_id, shape)
assert :ok = PublicationManager.add_shape(ctx.stack_id, shape_handle, shape)
assert_pub_tables(ctx, [ctx.relation])
# Stop the RelationTracker - supervisor will restart it
relation_tracker_name = PublicationManager.RelationTracker.name(ctx.stack_id)
GenServer.stop(relation_tracker_name)
# After restart, the publication manager should repopulate from ShapeStatus.
# The publication should still have the relation.
assert_pub_tables(ctx, [ctx.relation], 2_000)
# Wait for the supervisor to actually restart the RelationTracker and finish
# restoring before issuing further calls. The assert_pub_tables above passes
# immediately (the relation is never removed), so it does NOT guarantee the
# new process has registered yet — without this, remove_shape/2 races the
# restart and hits the via-Registry name before the process exists.
wait_until_registered(relation_tracker_name, 2_000)
PublicationManager.RelationTracker.wait_for_restore(ctx.stack_id, timeout: 2_000)
# Verify we can still remove the shape after the restart
PublicationManager.remove_shape(ctx.stack_id, shape_handle)
assert_pub_tables(ctx, [], 2_000)
end
Helper:
defp wait_until_registered(name, timeout, start_time \\ :erlang.monotonic_time(:millisecond)) do
case GenServer.whereis(name) do
pid when is_pid(pid) ->
pid
nil ->
if :erlang.monotonic_time(:millisecond) - start_time < timeout do
Process.sleep(10)
wait_until_registered(name, timeout, start_time)
else
flunk("RelationTracker was not re-registered within #{timeout}ms after restart")
end
end
end
Note: GenServer.stop/1 is synchronous on the old pid, but GenServer.whereis/1 can still briefly return the dying pid until the registry entry is cleared and the new one registered — a short Process.sleep before the first whereis poll, or a Process.monitor/:DOWN wait on the old pid before polling, makes the wait fully robust. Either refinement is fine; the key point is to stop treating the line-657 assertion as a restart barrier.
Summary
The test
component restarts handles relation tracker restartinpackages/sync-service/test/electric/replication/publication_manager_test.exs:641is flaky. It fails intermittently with ano processexit when callingPublicationManager.remove_shape/2after restarting theRelationTracker.Observed on an unrelated PR (#4562, a telemetry-only change), CI run 27678953105, seed
839006:Root cause
The test is a timing race:
The
assert_pub_tableson line 657 is intended to act as the "wait for restart" barrier, but it isn't one. The relation was already in the publication before the restart and the restart never removes it, so the assertion is satisfied immediately — well before the supervisor has restarted theRelationTrackerand re-registered it inElectric.ProcessRegistry. Line 660 then does aGenServer.callto the via-Registry name and hitsno processbecause the new tracker hasn't registered yet.This subsystem's restart tests have a history of flakiness (see #3573 "Fix flaky restart test"); this is the same class of bug. The test was added in #3359.
Draft fix
Explicitly wait for the restarted process to be registered (and finished restoring) before issuing the next call, rather than relying on a no-op publication assertion.
RelationTrackeralready exposeswait_for_restore/2, but it does aGenServer.callto the via-name and would itself hitno processif called before re-registration — so we first poll until the process is alive.Helper:
Note:
GenServer.stop/1is synchronous on the old pid, butGenServer.whereis/1can still briefly return the dying pid until the registry entry is cleared and the new one registered — a shortProcess.sleepbefore the firstwhereispoll, or aProcess.monitor/:DOWNwait on the old pid before polling, makes the wait fully robust. Either refinement is fine; the key point is to stop treating the line-657 assertion as a restart barrier.