Non-Deterministic K-Means Clustering Due to Worker Restart #2358

@jucor

Summary

The Clojure math worker produces different clustering results for the same conversation depending on whether it was restarted between computations. This creates an inconsistent user experience and makes testing and validation extremely difficult.

Root Cause

The clustering algorithm uses "warm-start" initialization from previous computations, but the state required for this warm-start is partially ephemeral (in-memory only) and not fully persisted to the database.
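To make the warm-start dependence concrete, here is a minimal Python sketch (not the Clojure worker's code). It assumes plain Lloyd's k-means where a cold start takes the first k points as centers; the `kmeans` function, its `last_centers` parameter, and the four data points are all made up for illustration. The same data converges to two different clusterings depending only on whether previous centers are available.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def dist2(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points: Sequence[Point], k: int,
           last_centers: Optional[Sequence[Point]] = None,
           max_iter: int = 20) -> List[int]:
    """Lloyd's algorithm; returns one cluster label per point.

    Warm start: begin from last_centers when they are available.
    Cold start: otherwise begin from the first k points (an assumption here).
    """
    centers = list(last_centers) if last_centers else list(points[:k])
    labels: List[int] = []
    for _ in range(max_iter):
        # Assign every point to its nearest center (ties go to the lower index).
        labels = [min(range(k), key=lambda i: dist2(p, centers[i])) for p in points]
        new_centers = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            new_centers.append(centroid(members) if members else centers[i])
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels


# Four participants at the corners of a 4 x 2 rectangle (made-up data).
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 2.0), (4.0, 2.0)]

cold = kmeans(points, k=2)
# Hypothetical centers left over from an earlier computation.
warm = kmeans(points, k=2, last_centers=[(2.0, 0.0), (2.0, 2.0)])

print(cold)  # [0, 1, 0, 1]
print(warm)  # [0, 0, 1, 1]
```

Both runs are converged k-means solutions for identical input; only the starting centers differ. That is the situation a restart creates when the previous centers are lost.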

What Gets Persisted vs Lost

Persisted to math_main table:

  • ✅ base-clusters (100 base clusters with participant members)
  • ✅ group-clusters (final chosen clustering, e.g., k=2)
  • ✅ Other math results (PCA, repness, etc.)

NOT Persisted (in-memory only):

  • group-clusterings - Map of ALL k-value clusterings: {2: [...], 3: [...], 4: [...], 5: [...]}

Source:

(-> conv
    (utils/hash-map-subset #{:math_tick :raw-rating-mat :rating-mat :lastVoteTimestamp :mod-out :mod-in :zid :pca :in-conv :n :n-cmts :group-clusters :base-clusters :repness :group-votes :subgroup-clusters :subgroup-votes :subgroup-repness :group-aware-consensus :comment-priorities :meta-tids})
    (assoc :last-vote-timestamp (get conv :lastVoteTimestamp)
           :last-mod-timestamp (get conv :lastModTimestamp))
    ; Make sure there is an empty named matrix to operate on
    (assoc :raw-rating-mat (nm/named-matrix))
    ; Update the base clusters to be unfolded
    (update :base-clusters clust/unfold-clusters)
    ; Make sure in-conv is a set
    (update :in-conv set)
    (update :mod-out set)
    (update :mod-in set)
    (update :meta-tids set)))
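The effect of that key whitelist can be sketched in Python. This assumes `utils/hash-map-subset` behaves like a plain key filter (similar to `select-keys`); the `hash_map_subset` function, the abbreviated `PERSISTED_KEYS` set, and the sample conversation below are illustrative, not taken from the codebase.

```python
from typing import Any, Dict, Set

# Abbreviated stand-in for the whitelist above; "group-clusterings" is absent.
PERSISTED_KEYS: Set[str] = {"zid", "pca", "base-clusters", "group-clusters", "repness"}


def hash_map_subset(m: Dict[str, Any], keys: Set[str]) -> Dict[str, Any]:
    """Keep only the whitelisted keys (assumed behaviour of utils/hash-map-subset)."""
    return {k: v for k, v in m.items() if k in keys}


# Made-up in-memory conversation state after a first computation.
conv_in_memory = {
    "zid": 1,
    "base-clusters": ["b0", "b1", "b2"],
    "group-clusters": ["g0", "g1"],                      # the chosen k=2 result
    "group-clusterings": {2: ["g0", "g1"], 3: ["g0", "g1", "g2"]},
}

# What survives a round trip through the whitelist.
conv_after_restart = hash_map_subset(conv_in_memory, PERSISTED_KEYS)

print(sorted(conv_after_restart))  # ['base-clusters', 'group-clusters', 'zid']
```

Any key missing from the whitelist is dropped silently, so nothing fails at load time; the loss only shows up later as a cold start.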

The Problem

Scenario A: Worker NOT Restarted (Consistent Behavior)

; After first computation, in-memory state:
conv = {
  :base-clusters [...]           ; persisted to DB
  :group-clusterings {            ; ← in memory only!
    2 [...], 3 [...], 4 [...], 5 [...]
  }
  :group-clusters [...]          ; persisted to DB (e.g., k=2 winner)
}

; Second computation (new votes arrive):
; Line 438 (see below): :last-clusters (last-clusterings k)
; - For k=2: Uses previous k=2 clustering ✓
; - For k=3: Uses previous k=3 clustering ✓
; - For k=4: Uses previous k=4 clustering ✓
; - For k=5: Uses previous k=5 clustering ✓

Result: All k-values warm-started → stable, consistent clusters

Scenario B: Worker WAS Restarted (Inconsistent Behavior)

; After restart, loaded from DB only:
conv = {
  :base-clusters [...]           ; ✓ loaded from DB
  :group-clusterings nil          ; ✗ NOT in DB!
  :group-clusters [...]          ; loaded but not used for multi-k
}

; Second computation (same votes as Scenario A):
; Line 438 (see below): :last-clusters (last-clusterings k)
; - Returns nil for ALL k values!
; - Falls back to first-k initialization

Result: All k-values cold-started → DIFFERENT CLUSTERS than Scenario A

Impact

1. User Experience

  • Group memberships change arbitrarily after worker restarts
  • Participant assignments flip between groups
  • Breaks continuity of conversation visualization
  • Confusing for moderators tracking opinion groups over time

2. Testing & Validation

  • Cannot reliably test Python implementation against Clojure
  • real_data/ math_blobs have unknown provenance (warm or cold started?)
  • No "ground truth" to compare against
  • Regression tests fail unpredictably

3. Production Issues

  • Different results between dev (frequent restarts) and prod (long-running)
  • Deployments cause clustering changes even with no new votes
  • Scaling events (pod restarts) affect clustering

Evidence

Code Locations

Where :last-clusters is used:

  1. Base clusters:

    :last-clusters (:base-clusters conv)

    • ✅ Works: base-clusters IS in database
  2. Group clusters:

    (when-let [last-clusterings (:group-clusterings conv)]
      (last-clusterings k))

    • ❌ Fails after restart: group-clusterings NOT in database
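A rough Python equivalent of the group-cluster lookup shows why it degrades silently rather than erroring. The `last_clusters_for` function and both sample maps are hypothetical; the only claim is that a missing `group-clusterings` entry yields "no previous clustering" for every k.

```python
from typing import Any, Dict, Optional


def last_clusters_for(conv: Dict[str, Any], k: int) -> Optional[list]:
    """Previous clustering for this k, or None when there is nothing to warm-start from."""
    last_clusterings = conv.get("group-clusterings")
    if not last_clusterings:
        return None  # key missing entirely, e.g. after a restart
    return last_clusterings.get(k)  # may still be None if this k was never clustered


# Long-running worker: all k-value clusterings are still in memory.
in_memory = {"group-clusterings": {2: ["g0", "g1"], 3: ["g0", "g1", "g2"]}}
# Restarted worker: only the persisted winner was reloaded.
after_restart = {"group-clusters": ["g0", "g1"]}

print(last_clusters_for(in_memory, 3))      # ['g0', 'g1', 'g2']
print(last_clusters_for(after_restart, 3))  # None
```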

Database schema:

polis/server/schema.sql

Lines 640 to 645 in 93e3274

CREATE TABLE public.math_main (
    zid integer NOT NULL,
    math_env character varying(999) NOT NULL,
    data jsonb NOT NULL,
    last_vote_timestamp bigint NOT NULL,
    caching_tick bigint DEFAULT 0 NOT NULL,

  • Only ONE row per conversation and math environment (UNIQUE (zid, math_env))
  • No versioning - overwrites previous data

What gets written:

(-> conv
    (utils/hash-map-subset #{:math_tick :raw-rating-mat :rating-mat :lastVoteTimestamp :mod-out :mod-in :zid :pca :in-conv :n :n-cmts :group-clusters :base-clusters :repness :group-votes :subgroup-clusters :subgroup-votes :subgroup-repness :group-aware-consensus :comment-priorities :meta-tids})
    (assoc :last-vote-timestamp (get conv :lastVoteTimestamp)
           :last-mod-timestamp (get conv :lastModTimestamp))
    ; Make sure there is an empty named matrix to operate on
    (assoc :raw-rating-mat (nm/named-matrix))
    ; Update the base clusters to be unfolded
    (update :base-clusters clust/unfold-clusters)
    ; Make sure in-conv is a set
    (update :in-conv set)
    (update :mod-out set)
    (update :mod-in set)
    (update :meta-tids set)))

  • Note: group-clusterings is NOT included in the persisted subset!

Empirical Test Results

Testing Python (cold-start always) vs Clojure reference data:

  • Biodiversity: 21% Jaccard similarity (expected >95%)
  • VW: 19% Jaccard similarity (expected >95%)

The Clojure data was likely generated with warm-started clusters from unknown previous state.
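For reference, one way to compute such a score is sketched below. The issue does not record how the Jaccard figures were aggregated, so the best-match averaging in `mean_best_match_jaccard` is an assumption, and the participant ids are made up.

```python
from typing import Iterable, Sequence, Set


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """|A ∩ B| / |A ∪ B| for two sets of participant ids."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty groups are treated as identical
    return len(a & b) / len(a | b)


def mean_best_match_jaccard(reference: Sequence[Set[int]],
                            candidate: Sequence[Set[int]]) -> float:
    """Average, over reference groups, of the best Jaccard against any candidate group."""
    return sum(max(jaccard(r, c) for c in candidate) for r in reference) / len(reference)


reference = [{1, 2, 3, 4}, {5, 6, 7, 8}]   # e.g. groups from a Clojure math blob
same = [{5, 6, 7, 8}, {1, 2, 3, 4}]        # same groups, different order
split = [{1, 2, 5, 6}, {3, 4, 7, 8}]       # a different local optimum

print(mean_best_match_jaccard(reference, same))   # 1.0
print(mean_best_match_jaccard(reference, split))  # 0.333...
```

Matching each group to its best counterpart makes the score independent of group numbering, which matters because cluster ids carry no meaning across implementations.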

Proposed Solutions

Note: These solutions are for the Clojure codebase only. Since we are actively migrating to Python (which doesn't have this bug), fixing this in Clojure is low priority unless the Clojure math worker will continue running in production for an extended period.

Option 1: Persist group-clusterings (Recommended)

Change: Include group-clusterings in the persisted state at

(utils/hash-map-subset #{:math_tick :raw-rating-mat :rating-mat :lastVoteTimestamp :mod-out :mod-in :zid :pca :in-conv :n :n-cmts :group-clusters :base-clusters :repness :group-votes :subgroup-clusters :subgroup-votes :subgroup-repness :group-aware-consensus :comment-priorities :meta-tids})

Add :group-clusterings to the hash-map-subset call.

Pros:

  • Minimal code change
  • Preserves warm-start behavior consistently
  • Fixes non-determinism immediately

Cons:

  • Increases math_main.data size (~4x more cluster data)
  • One-time migration needed for existing conversations
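One detail worth checking before adopting Option 1: `group-clusterings` is keyed by integer k, and JSON object keys are always strings. The Python sketch below shows the round trip; the sample data and the `restore_k_keys` helper are hypothetical, and the Clojure loader would presumably need a comparable key conversion next to its existing `(update :in-conv set)` fix-ups.

```python
import json
from typing import Any, Dict

# Made-up state: member lists stand in for the real cluster structures.
conv = {
    "group-clusters": [[0, 1], [2, 3]],
    "group-clusterings": {2: [[0, 1], [2, 3]], 3: [[0], [1], [2, 3]]},
}

# Serialize as it would be stored in a JSON column, then load it back.
blob = json.dumps(conv)
loaded = json.loads(blob)

print(list(loaded["group-clusterings"]))   # ['2', '3']  (string keys)
print(loaded["group-clusterings"].get(2))  # None: integer lookup now misses


def restore_k_keys(clusterings: Dict[str, Any]) -> Dict[int, Any]:
    """Convert the k keys back to integers after loading."""
    return {int(k): v for k, v in clusterings.items()}


restored = restore_k_keys(loaded["group-clusterings"])
print(restored[2])  # [[0, 1], [2, 3]]
```

Without the conversion, a lookup by integer k would miss even though the data is in the database, and the worker would still cold-start.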

Option 2: Remove Warm-Start for Group Clustering

Change: Always use first-k initialization for group clustering at

(when-let [last-clusterings (:group-clusterings conv)]
  (last-clusterings k))

Remove the :last-clusters parameter from the clusters/kmeans call.

Pros:

  • Fully deterministic behavior
  • No database changes needed
  • Matches Python implementation

Cons:

  • Loses stability/continuity benefit of warm-start
  • Groups may change more between updates
  • May affect user experience
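A small Python sketch of what first-k initialization could look like. It assumes "first-k" means taking the first k distinct rows in input order; the `first_k_centers` function and sample rows are illustrative only.

```python
from typing import List, Sequence, Tuple

Row = Tuple[float, ...]


def first_k_centers(rows: Sequence[Sequence[float]], k: int) -> List[Row]:
    """Pick the first k distinct rows, in input order, as initial centers."""
    centers: List[Row] = []
    for row in rows:
        candidate = tuple(row)
        if candidate not in centers:
            centers.append(candidate)
        if len(centers) == k:
            break
    return centers


rows = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

print(first_k_centers(rows, 2))                  # [(0.0, 0.0), (1.0, 1.0)]
print(first_k_centers(list(reversed(rows)), 2))  # [(2.0, 2.0), (1.0, 1.0)]
```

The result depends only on the input rows and their order, not on earlier runs. The second call shows the remaining caveat: determinism across restarts also requires that rows are presented in a stable order.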

Option 3: Hybrid Approach

Keep warm-start for base clusters (persisted), remove for group clusters (ephemeral).

Pros:

  • Best of both worlds
  • Base clusters stay stable (most important)
  • Group selection deterministic

Cons:

  • Asymmetric behavior (may be confusing)
  • Still changes current behavior

Recommendation

For Clojure (if still in production): Implement Option 1 - persist group-clusterings to the database.

Rationale:

  1. Preserves existing warm-start benefits
  2. Makes behavior consistent across restarts
  3. Small storage cost is worth the reliability
  4. Aligns with original design intent (warm-start for continuity)

For Python migration: This bug does not exist in the Python implementation, which uses deterministic cold-start initialization. Priority should be completing the Python migration rather than fixing legacy Clojure code.

Additional improvement (if fixing Clojure): Add versioning to math_main table for historical comparison and debugging.

References

  • Clojure clustering logic:
    :last-clusters
    ; A little pedantic here in case no clustering yet for this k
    (when-let [last-clusterings (:group-clusterings conv)]
      (last-clusterings k))
  • Conv manager persistence:
    (defn write-conv-updates!
      [{:as conv-man :keys [postgres]} {:as updated-conv :keys [zid]} math-tick]
      ;; TODO Really need to extract these writes so that mod updates do what they're supposed to! And also run in async/thread for better parallelism
      ; Format and upload main results
      (async/thread
        (doseq [[prep-fn upload-fn] [[prep-main db/upload-math-main] ; main math results, for client
                                     [prep-bidToPid db/upload-math-bidtopid] ; bidtopid mapping, for server
                                     [prep-ptpt-stats db/upload-math-ptptstats]]]
          (->> updated-conv
               prep-fn
               (upload-fn postgres zid math-tick)))
        (log/info "Finished uploading math results for zid:" zid)))
    (defn restructure-json-conv
      [conv]
      (-> conv
          (utils/hash-map-subset #{:math_tick :raw-rating-mat :rating-mat :lastVoteTimestamp :mod-out :mod-in :zid :pca :in-conv :n :n-cmts :group-clusters :base-clusters :repness :group-votes :subgroup-clusters :subgroup-votes :subgroup-repness :group-aware-consensus :comment-priorities :meta-tids})
          (assoc :last-vote-timestamp (get conv :lastVoteTimestamp)
                 :last-mod-timestamp (get conv :lastModTimestamp))
          ; Make sure there is an empty named matrix to operate on
          (assoc :raw-rating-mat (nm/named-matrix))
          ; Update the base clusters to be unfolded
          (update :base-clusters clust/unfold-clusters)
          ; Make sure in-conv is a set
          (update :in-conv set)
          (update :mod-out set)
          (update :mod-in set)
          (update :meta-tids set)))
  • Database schema:

    polis/server/schema.sql

    Lines 640 to 645 in 93e3274

    CREATE TABLE public.math_main (
        zid integer NOT NULL,
        math_env character varying(999) NOT NULL,
        data jsonb NOT NULL,
        last_vote_timestamp bigint NOT NULL,
        caching_tick bigint DEFAULT 0 NOT NULL,
