Non-Deterministic K-Means Clustering Due to Worker Restart
Summary
The Clojure math worker produces different clustering results for the same conversation depending on whether it was restarted between computations. This creates an inconsistent user experience and makes testing and validation extremely difficult.
Root Cause
The clustering algorithm uses "warm-start" initialization from previous computations, but the state required for this warm-start is partially ephemeral (in-memory only) and not fully persisted to the database.
What Gets Persisted vs Lost
Persisted to math_main table:
- ✅ base-clusters (100 base clusters with participant members)
- ✅ group-clusters (final chosen clustering, e.g., k=2)
- ✅ Other math results (PCA, repness, etc.)
NOT Persisted (in-memory only):
- ❌ group-clusterings - Map of ALL k-value clusterings: {2: [...], 3: [...], 4: [...], 5: [...]}
Source:
    (-> conv
        (utils/hash-map-subset #{:math_tick :raw-rating-mat :rating-mat :lastVoteTimestamp :mod-out :mod-in :zid :pca :in-conv :n :n-cmts :group-clusters :base-clusters :repness :group-votes :subgroup-clusters :subgroup-votes :subgroup-repness :group-aware-consensus :comment-priorities :meta-tids})
        (assoc :last-vote-timestamp (get conv :lastVoteTimestamp)
               :last-mod-timestamp (get conv :lastModTimestamp))
        ; Make sure there is an empty named matrix to operate on
        (assoc :raw-rating-mat (nm/named-matrix))
        ; Update the base clusters to be unfolded
        (update :base-clusters clust/unfold-clusters)
        ; Make sure in-conv is a set
        (update :in-conv set)
        (update :mod-out set)
        (update :mod-in set)
        (update :meta-tids set)))
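To make the gap concrete, here is a minimal sketch of what a key subset does to the conversation map. It assumes utils/hash-map-subset behaves like clojure.core/select-keys (an assumption; the helper itself is not shown in this issue), and it uses an abbreviated key set and made-up cluster data.

```clojure
;; Hypothetical stand-in for utils/hash-map-subset, assumed to act like select-keys.
(defn hash-map-subset [m ks]
  (select-keys m ks))

;; Abbreviated version of the key set above. Note: no :group-clusterings.
(def persisted-keys #{:zid :base-clusters :group-clusters})

;; Made-up in-memory conversation state.
(def conv
  {:zid               1
   :base-clusters     [{:id 0 :members [1 2]}]
   :group-clusters    [{:id 0 :members [0]}]
   :group-clusterings {2 [{:id 0 :members [0]}]
                       3 [{:id 0 :members [0]}]}})

(def reloaded (hash-map-subset conv persisted-keys))

(println (contains? reloaded :group-clusterings)) ; false
(println (:group-clusterings reloaded))           ; nil
```

Everything outside the key set is silently dropped, which is why the per-k map is missing after a reload.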
The Problem
Scenario A: Worker NOT Restarted (Consistent Behavior)
; After first computation, in-memory state:
conv = {
:base-clusters [...] ; persisted to DB
:group-clusterings { ; ← in memory only!
2 [...], 3 [...], 4 [...], 5 [...]
}
:group-clusters [...] ; persisted to DB (e.g., k=2 winner)
}
; Second computation (new votes arrive):
; Line 438 (see below): :last-clusters (last-clusterings k)
; - For k=2: Uses previous k=2 clustering ✓
; - For k=3: Uses previous k=3 clustering ✓
; - For k=4: Uses previous k=4 clustering ✓
; - For k=5: Uses previous k=5 clustering ✓
Result: All k-values warm-started → stable, consistent clusters
Scenario B: Worker WAS Restarted (Inconsistent Behavior)
; After restart, loaded from DB only:
conv = {
:base-clusters [...] ; ✓ loaded from DB
:group-clusterings nil ; ✗ NOT in DB!
:group-clusters [...] ; loaded but not used for multi-k
}
; Second computation (same votes as Scenario A):
; Line 438 (see below): :last-clusters (last-clusterings k)
; - Returns nil for ALL k values!
; - Falls back to first-k initialization
Result: All k-values cold-started → DIFFERENT CLUSTERS than Scenario A
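Why a cold start can land on different clusters: k-means only converges to a local optimum, and which one it reaches depends on the initial centers. The toy 1-D sketch below is not the Polis implementation. kmeans-1d and its helpers are hypothetical, and "first-k initialization" is assumed here to mean taking the first k data points as the initial centers.

```clojure
(defn nearest
  "Index of the center closest to x (squared distance)."
  [centers x]
  (apply min-key
         #(let [d (- x (nth centers %))] (* d d))
         (range (count centers))))

(defn assign
  "Map from center index to the points assigned to that center."
  [centers xs]
  (group-by #(nearest centers %) xs))

(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn step
  "One k-means update; a center with no points keeps its old position."
  [centers xs]
  (let [groups (assign centers xs)]
    (vec (map-indexed (fn [i c] (if-let [g (get groups i)] (mean g) c))
                      centers))))

(defn kmeans-1d
  "Iterate to a fixed point and return the clusters as a set of sets."
  [xs init-centers]
  (loop [centers (mapv double init-centers)]
    (let [next-centers (step centers xs)]
      (if (= next-centers centers)
        (set (map set (vals (assign centers xs))))
        (recur next-centers)))))

(def points [0 1 10 11 20 21])

;; Warm start: centers carried over from a previous run.
(def warm (kmeans-1d points [0.5 10.5 20.5]))
;; Cold start: first-k initialization (assumed meaning, see above).
(def cold (kmeans-1d points (take 3 points)))

(println (= warm #{#{0 1} #{10 11} #{20 21}}))   ; true
(println (= cold #{#{0} #{1} #{10 11 20 21}}))   ; true
(println (= warm cold))                          ; false
```

Both runs see identical data and both converge, yet they settle on different partitions. That is the same effect a restart has when the previous per-k clusterings are no longer available.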
Impact
1. User Experience
- Group memberships change arbitrarily after worker restarts
- Participant assignments flip between groups
- Breaks continuity of conversation visualization
- Confusing for moderators tracking opinion groups over time
2. Testing & Validation
- Cannot reliably test Python implementation against Clojure
- real_data/ math_blobs have unknown provenance (warm or cold started?)
- No "ground truth" to compare against
- Regression tests fail unpredictably
3. Production Issues
- Different results between dev (frequent restarts) and prod (long-running)
- Deployments cause clustering changes even with no new votes
- Scaling events (pod restarts) affect clustering
Evidence
Code Locations
Where :last-clusters is used:
- Base clusters:
      :last-clusters (:base-clusters conv)
  - ✅ Works: base-clusters IS in database
- Group clusters:
      (when-let [last-clusterings (:group-clusterings conv)]
        (last-clusterings k))
  - ❌ Fails after restart: group-clusterings NOT in database
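The group-cluster lookup above can be exercised in isolation. In this sketch, last-clusters-for is a hypothetical wrapper around the same when-let form, and the conversation maps hold placeholder keywords instead of real clusters.

```clojure
(defn last-clusters-for
  "Previous clustering for k, or nil when none is available."
  [conv k]
  (when-let [last-clusterings (:group-clusterings conv)]
    (last-clusterings k)))

;; Placeholder state for a worker that has been running since the last computation.
(def long-running-conv
  {:group-clusters    [:k2-a :k2-b]
   :group-clusterings {2 [:k2-a :k2-b]
                       3 [:k3-a :k3-b :k3-c]}})

;; After a restart only the persisted keys come back.
(def restarted-conv (dissoc long-running-conv :group-clusterings))

(defn init-mode [conv k]
  (if (last-clusters-for conv k) :warm-start :cold-start))

(doseq [k [2 3 4]]
  (println k (init-mode long-running-conv k) (init-mode restarted-conv k)))
;; 2 :warm-start :cold-start
;; 3 :warm-start :cold-start
;; 4 :cold-start :cold-start
```

The k=4 row shows the case the original comment guards against (no clustering yet for this k). After a restart, every k falls into that case.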
Database schema (excerpt):
    CREATE TABLE public.math_main (
        zid integer NOT NULL,
        math_env character varying(999) NOT NULL,
        data jsonb NOT NULL,
        last_vote_timestamp bigint NOT NULL,
        caching_tick bigint DEFAULT 0 NOT NULL,
- Only ONE row per conversation (UNIQUE (zid, math_env))
- No versioning - overwrites previous data
What gets written:
    (-> conv
        (utils/hash-map-subset #{:math_tick :raw-rating-mat :rating-mat :lastVoteTimestamp :mod-out :mod-in :zid :pca :in-conv :n :n-cmts :group-clusters :base-clusters :repness :group-votes :subgroup-clusters :subgroup-votes :subgroup-repness :group-aware-consensus :comment-priorities :meta-tids})
        (assoc :last-vote-timestamp (get conv :lastVoteTimestamp)
               :last-mod-timestamp (get conv :lastModTimestamp))
        ; Make sure there is an empty named matrix to operate on
        (assoc :raw-rating-mat (nm/named-matrix))
        ; Update the base clusters to be unfolded
        (update :base-clusters clust/unfold-clusters)
        ; Make sure in-conv is a set
        (update :in-conv set)
        (update :mod-out set)
        (update :mod-in set)
        (update :meta-tids set)))
- Note: group-clusterings is NOT included in the persisted subset!
Empirical Test Results
Testing Python (cold-start always) vs Clojure reference data:
- Biodiversity: 21% Jaccard similarity (expected >95%)
- VW: 19% Jaccard similarity (expected >95%)
The Clojure data was likely generated with warm-started clusters from unknown previous state.
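The issue does not state how the Jaccard similarity was aggregated over clusters. As one plausible reading, the sketch below scores each cluster in one clustering by its best Jaccard match in the other and averages the scores. The function names and the toy membership data are hypothetical.

```clojure
(require '[clojure.set :as set])

(defn jaccard
  "Size of the intersection divided by size of the union; 1.0 for two empty sets."
  [a b]
  (let [a (set a)
        b (set b)
        u (set/union a b)]
    (if (empty? u)
      1.0
      (/ (count (set/intersection a b)) (double (count u))))))

(defn clustering-similarity
  "Mean, over the clusters in xs, of the best Jaccard match among the clusters in ys."
  [xs ys]
  (/ (reduce + (map (fn [x] (apply max (map #(jaccard x %) ys))) xs))
     (count xs)))

;; Toy participant-id memberships for two clusterings of the same conversation.
(def reference [[1 2 3] [4 5 6]])
(def same-groups-reordered [[4 5 6] [1 2 3]])
(def reshuffled [[1 4] [2 3 5 6]])

(println (jaccard [1 2 3] [2 3 4]))                               ; 0.5
(println (clustering-similarity reference same-groups-reordered)) ; 1.0
(println (< (clustering-similarity reference reshuffled) 0.95))   ; true
```

Matching on membership rather than on cluster index keeps the score from being penalized when two runs produce the same groups under different labels.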
Proposed Solutions
Note: These solutions are for the Clojure codebase only. Since we are actively migrating to Python (which doesn't have this bug), fixing this in Clojure is low priority unless the Clojure math worker will continue running in production for an extended period.
Option 1: Persist group-clusterings (Recommended)
Change: Include group-clusterings in the persisted state at
    (utils/hash-map-subset #{:math_tick :raw-rating-mat :rating-mat :lastVoteTimestamp :mod-out :mod-in :zid :pca :in-conv :n :n-cmts :group-clusters :base-clusters :repness :group-votes :subgroup-clusters :subgroup-votes :subgroup-repness :group-aware-consensus :comment-priorities :meta-tids})
Add :group-clusterings to the hash-map-subset call.
Pros:
- Minimal code change
- Preserves warm-start behavior consistently
- Fixes non-determinism immediately
Cons:
- Increases math_main.data size (~4x more cluster data)
- One-time migration needed for existing conversations
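A sketch of Option 1 under stated assumptions: select-keys stands in for utils/hash-map-subset, the key set is abbreviated, and simulate-json-round-trip is a hypothetical stand-in for writing to and reading from the jsonb column. One thing worth checking before relying on the one-line change is that JSON object keys are strings, so the integer k keys of group-clusterings would probably need to be restored on load. Otherwise (last-clusterings k) may still return nil.

```clojure
(def persisted-keys
  ;; Abbreviated; the only change for Option 1 is the added :group-clusterings.
  #{:zid :base-clusters :group-clusters :group-clusterings})

(defn simulate-json-round-trip
  "Hypothetical stand-in for the jsonb write/read: object keys become strings."
  [clusterings]
  (into {} (map (fn [[k v]] [(str k) v])) clusterings))

(defn restore-k-keys
  "Turn string or keyword keys such as \"2\" or :2 back into integers."
  [clusterings]
  (into {} (map (fn [[k v]] [(Long/parseLong (name k)) v])) clusterings))

;; Placeholder conversation state; :scratch represents any non-persisted key.
(def conv
  {:zid               1
   :base-clusters     [:b0 :b1]
   :group-clusters    [:g0 :g1]
   :group-clusterings {2 [:g0 :g1] 3 [:h0 :h1 :h2]}
   :scratch           :not-persisted})

(def from-db
  (-> (select-keys conv persisted-keys)
      (update :group-clusterings simulate-json-round-trip)))

(def restored (update from-db :group-clusterings restore-k-keys))

(println (get-in from-db [:group-clusterings 2]))  ; nil, the keys are now "2" and "3"
(println (get-in restored [:group-clusterings 2])) ; [:g0 :g1]
```

Whether the real load path needs a step like restore-k-keys depends on how the worker parses the stored JSON, which this issue does not show.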
Option 2: Remove Warm-Start for Group Clustering
Change: Always use first-k initialization for group clustering at
    (when-let [last-clusterings (:group-clusterings conv)]
      (last-clusterings k))
Remove the :last-clusters parameter from the clusters/kmeans call.
Pros:
- Fully deterministic behavior
- No database changes needed
- Matches Python implementation
Cons:
- Loses stability/continuity benefit of warm-start
- Groups may change more between updates
- May affect user experience
Option 3: Hybrid Approach
Keep warm-start for base clusters (persisted) and remove it for group clusters (ephemeral).
Pros:
- Best of both worlds
- Base clusters stay stable (most important)
- Group selection deterministic
Cons:
- Asymmetric behavior (may be confusing)
- Still changes current behavior
Recommendation
For Clojure (if still in production): Implement Option 1 - persist group-clusterings to database.
Rationale:
- Preserves existing warm-start benefits
- Makes behavior consistent across restarts
- Small storage cost is worth the reliability
- Aligns with original design intent (warm-start for continuity)
For Python migration: This bug does not exist in the Python implementation, which uses deterministic cold-start initialization. Priority should be completing the Python migration rather than fixing legacy Clojure code.
Additional improvement (if fixing Clojure): Add versioning to the math_main table for historical comparison and debugging.
References
- Clojure clustering logic:
      :last-clusters
      ; A little pedantic here in case no clustering yet for this k
      (when-let [last-clusterings (:group-clusterings conv)]
        (last-clusterings k))
- Conv manager persistence:
      (defn write-conv-updates!
        [{:as conv-man :keys [postgres]} {:as updated-conv :keys [zid]} math-tick]
        ;; TODO Really need to extract these writes so that mod updates do what they're supposed to! And also run in async/thread for better parallelism
        ; Format and upload main results
        (async/thread
          (doseq [[prep-fn upload-fn] [[prep-main db/upload-math-main] ; main math results, for client
                                       [prep-bidToPid db/upload-math-bidtopid] ; bidtopid mapping, for server
                                       [prep-ptpt-stats db/upload-math-ptptstats]]]
            (->> updated-conv
                 prep-fn
                 (upload-fn postgres zid math-tick)))
          (log/info "Finished uploading math results for zid:" zid)))

      (defn restructure-json-conv
        [conv]
        (-> conv
            (utils/hash-map-subset #{:math_tick :raw-rating-mat :rating-mat :lastVoteTimestamp :mod-out :mod-in :zid :pca :in-conv :n :n-cmts :group-clusters :base-clusters :repness :group-votes :subgroup-clusters :subgroup-votes :subgroup-repness :group-aware-consensus :comment-priorities :meta-tids})
            (assoc :last-vote-timestamp (get conv :lastVoteTimestamp)
                   :last-mod-timestamp (get conv :lastModTimestamp))
            ; Make sure there is an empty named matrix to operate on
            (assoc :raw-rating-mat (nm/named-matrix))
            ; Update the base clusters to be unfolded
            (update :base-clusters clust/unfold-clusters)
            ; Make sure in-conv is a set
            (update :in-conv set)
            (update :mod-out set)
            (update :mod-in set)
            (update :meta-tids set)))
- Database schema (excerpt):
      CREATE TABLE public.math_main (
          zid integer NOT NULL,
          math_env character varying(999) NOT NULL,
          data jsonb NOT NULL,
          last_vote_timestamp bigint NOT NULL,
          caching_tick bigint DEFAULT 0 NOT NULL,
Source locations (commit 93e3274):
- Persisted key subset: polis/math/src/polismath/conv_man.clj, lines 173 to 185 (the hash-map-subset call is line 174)
- :last-clusters for base clusters: polis/math/src/polismath/math/conversation.clj, line 406
- :last-clusters for group clusters: polis/math/src/polismath/math/conversation.clj, lines 438 to 439 (lines 436 to 439 with the surrounding comment)
- Conv manager persistence: polis/math/src/polismath/conv_man.clj, lines 158 to 185
- Database schema: polis/server/schema.sql, lines 640 to 645