
fix: terminate label_propagation on bipartite / weight-symmetric graphs#1456

Open
kaihirota wants to merge 1 commit into
getzep:main from
kaihirota:fix/label-propagation-non-convergence

Conversation


@kaihirota kaihirota commented Apr 30, 2026

Summary

The synchronous label propagation loop in
graphiti_core.utils.maintenance.community_operations.label_propagation (and its
duplicate in graphiti_core.driver.operations.graph_utils) reads labels from a
snapshot and writes to a parallel map, then swaps. On bipartite or
weight-symmetric graphs this is mathematically prone to oscillation: two label
assignments can flip-flop between iterations because no node ever sees a
neighbour's updated label within the same pass. The loop has no iteration cap,
so client.build_communities() never returns on certain inputs.

This PR closes four separate reports of the same root cause (#402, #1355, #1397, #1400).

Reproduction: a 5-node weighted undirected graph

$$E = \{(X, Y, 2),\ (X, Z, 2),\ (Y, A, 1),\ (Z, B, 1)\}$$

— a shape that arises naturally from add_episode over rich, dated prose with
a hub entity and competing role claims — drives the synchronous LPA into an
indefinite oscillation. A py-spy stack trace of a hung run shows the main
thread spending 100% of a CPU in label_propagation for over an hour with no
forward progress.
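
The failure mode can be demonstrated outside the library. The sketch below (illustrative, not graphiti's code) runs synchronous LPA on the reproduction graph: because every node votes using the previous iteration's snapshot, the two sides of the hub flip in lockstep and the assignment settles into a period-2 cycle instead of a fixed point.

```python
from collections import defaultdict

# The reproduction graph: X-Y(2), X-Z(2), Y-A(1), Z-B(1), undirected.
edges = [("X", "Y", 2), ("X", "Z", 2), ("Y", "A", 1), ("Z", "B", 1)]
neighbors = defaultdict(list)
for u, v, w in edges:
    neighbors[u].append((v, w))
    neighbors[v].append((u, w))

labels = {n: n for n in neighbors}  # each node starts in its own community

def sync_step(old):
    """One synchronous pass: every node votes using the *old* snapshot."""
    new = {}
    for node in sorted(old):
        weight = defaultdict(float)
        for nbr, w in neighbors[node]:
            weight[old[nbr]] += w
        best = max(weight.values())
        # asymmetric-style tiebreak: smallest label among the heaviest
        new[node] = min(l for l, w in weight.items() if w == best)
    return new

cycle_len = None
seen = {}
state = labels
for i in range(20):
    state = sync_step(state)
    key = tuple(sorted(state.items()))
    if key in seen:
        cycle_len = i - seen[key]  # period of the oscillation
        break
    seen[key] = i

print(f"oscillation: period-{cycle_len} cycle, never converges")
```

Without an iteration cap, the real loop keeps cycling through these two states forever, which is exactly the hang observed in the py-spy trace.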

Fix:

  • Switch to asynchronous LPA: read and write to community_map in place so
    later nodes within the same iteration see earlier nodes' fresh labels. Async
    LPA still has no convergence proof on adversarial inputs, but it converges in
    practice on graphs that arise from realistic add_episode corpora.
  • Add a max_iterations=100 keyword arg as the hard termination guarantee.
    When the cap is reached, log a warning so callers can see that the returned
    clustering may not be a fixed point.
  • Replace the asymmetric tiebreak with a two-rule deterministic procedure: if
    the current community is among the candidates of highest total weight, keep
    it (self-stickiness — preserves the previous tiebreak's intent of avoiding
    gratuitous label churn on stable graphs); otherwise pick the smallest
    community ID among highest-weight candidates. Output is reproducible across
    runs.
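
The three fixes combine into a shape like the following. This is a minimal sketch of the approach, not the PR's actual diff; the signature of the real label_propagation differs, and the neighbor-map input here is a stand-in for graphiti's internal projection.

```python
from collections import defaultdict
import logging

logger = logging.getLogger(__name__)

def label_propagation(neighbors, max_iterations=100):
    """Asynchronous LPA sketch. `neighbors` maps node -> [(neighbor, weight)].

    Labels are read and written in one map, so later nodes in the same
    pass see fresh labels; `max_iterations` is the hard termination cap.
    """
    community_map = {node: i for i, node in enumerate(sorted(neighbors))}
    for _ in range(max_iterations):
        changed = False
        for node in sorted(neighbors):
            weight = defaultdict(float)
            for nbr, w in neighbors[node]:
                weight[community_map[nbr]] += w
            if not weight:
                continue  # isolated node keeps its own community
            best = max(weight.values())
            candidates = {c for c, w in weight.items() if w == best}
            current = community_map[node]
            # Rule 1 (self-stickiness): keep the current community on a tie.
            # Rule 2: otherwise take the smallest ID among the heaviest.
            new_label = current if current in candidates else min(candidates)
            if new_label != current:
                community_map[node] = new_label
                changed = True
        if not changed:
            break  # fixed point reached
    else:
        logger.warning(
            "label_propagation hit max_iterations; result may not be a fixed point"
        )
    return community_map

# Usage on the reproduction graph: terminates and merges all five nodes.
edges = [("X", "Y", 2), ("X", "Z", 2), ("Y", "A", 1), ("Z", "B", 1)]
nbrs = defaultdict(list)
for u, v, w in edges:
    nbrs[u].append((v, w))
    nbrs[v].append((u, w))
communities = label_propagation(nbrs)
```

On the oscillating 5-node graph this converges in three passes, and the deterministic tiebreak makes repeated runs produce identical output.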

The new keyword arg has a default, so all existing call sites (the four
database drivers under
graphiti_core/driver/{neo4j,falkordb,kuzu,neptune}/operations/graph_ops.py
and the legacy fallback in community_operations.get_community_clusters)
continue to work unchanged.

Tests at tests/utils/maintenance/test_community_operations.py cover the
oscillating regression case (asserts termination + node coverage), output
invariants (every input UUID appears exactly once), singleton nodes, an
isolated node alongside a connected component, complete graph collapse,
two-disjoint-component preservation, and run-to-run determinism. Each test is
parametrized over both copies of label_propagation so any future divergence
between them is caught.
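
The parametrization strategy can be sketched in plain Python. The stand-in functions below are hypothetical placeholders; the real tests import label_propagation from the two graphiti_core modules and run each invariant against both.

```python
def lpa_copy_a(nodes):
    # stand-in for community_operations.label_propagation
    return {n: min(nodes) for n in nodes}

def lpa_copy_b(nodes):
    # stand-in for the duplicate in driver.operations.graph_utils
    return {n: min(nodes) for n in nodes}

IMPLEMENTATIONS = [lpa_copy_a, lpa_copy_b]

def check_every_uuid_appears_exactly_once(label_propagation):
    nodes = ["uuid-1", "uuid-2", "uuid-3"]
    result = label_propagation(nodes)
    assert sorted(result) == sorted(nodes)

def check_run_to_run_determinism(label_propagation):
    nodes = ["uuid-1", "uuid-2", "uuid-3"]
    assert label_propagation(nodes) == label_propagation(nodes)

# Every invariant runs against every copy, so a future edit to one
# copy that breaks an invariant in the other cannot pass unnoticed.
checked = 0
for impl in IMPLEMENTATIONS:
    check_every_uuid_appears_exactly_once(impl)
    check_run_to_run_determinism(impl)
    checked += 1
```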

Relationship to #1388

#1388 is an alternative fix for the same bug, also using async LPA but with an
oscillation/cycle detector instead of a hard iteration cap, and a larger diff
(+734/-69 vs +334/-45 here). This PR offers a smaller, more targeted
alternative: async LPA + max_iterations=100 cap + deterministic tiebreak,
with parametrized tests over both copies of label_propagation. Maintainers
should pick whichever shape they prefer — happy to close this in favor of
#1388, or vice versa.

Type of Change

  • Bug fix
  • New feature
  • Performance improvement
  • Documentation/Tests

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • All existing tests pass

Breaking Changes

  • This PR contains breaking changes

Checklist

  • Code follows project style guidelines (make lint passes)
  • Self-review completed
  • Documentation updated where necessary
  • No secrets or sensitive information committed

Related Issues

Closes #402
Closes #1355
Closes #1397
Closes #1400

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>