# Non-Deterministic K-Means Clustering Due to Worker Restart

## Summary

The Clojure math worker produces **different clustering results for the same conversation** depending on whether it was restarted between computations. This creates an inconsistent user experience and makes testing and validation extremely difficult.

## Root Cause

The clustering algorithm uses "warm-start" initialization from previous computations, but the state required for this warm-start is **partially ephemeral** (in-memory only) and not fully persisted to the database.

### What Gets Persisted vs Lost

**Persisted to `math_main` table:**
- ✅ `base-clusters` (100 base clusters with participant members)
- ✅ `group-clusters` (final chosen clustering, e.g., k=2)
- ✅ Other math results (PCA, repness, etc.)

**NOT Persisted (in-memory only):**
- ❌ `group-clusterings` - Map of ALL k-value clusterings: `{2: [...], 3: [...], 4: [...], 5: [...]}`

Source: https://github.com/compdemocracy/polis/blob/93e327447b3d2d2017b86fa23f009aee6b92c29b/math/src/polismath/conv_man.clj#L173-L185
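
To make the gap concrete, here is a minimal Python sketch of a save/reload round trip. The key list is copied from the `hash-map-subset` call at the link above; the `hash_map_subset` and `simulate_restart` helpers and the toy values are illustrative assumptions, not the worker's actual code.

```python
# Keys copied from the hash-map-subset call in conv_man.clj (pinned commit above).
PERSISTED_KEYS = {
    "math_tick", "raw-rating-mat", "rating-mat", "lastVoteTimestamp",
    "mod-out", "mod-in", "zid", "pca", "in-conv", "n", "n-cmts",
    "group-clusters", "base-clusters", "repness", "group-votes",
    "subgroup-clusters", "subgroup-votes", "subgroup-repness",
    "group-aware-consensus", "comment-priorities", "meta-tids",
}


def hash_map_subset(conv, keys):
    """Hypothetical stand-in for utils/hash-map-subset: keep only the listed keys."""
    return {k: v for k, v in conv.items() if k in keys}


def simulate_restart(conv):
    """Model a restart: only the persisted subset of the state comes back."""
    return hash_map_subset(conv, PERSISTED_KEYS)


# Toy in-memory state after a first computation (cluster contents are made up).
conv = {
    "zid": 1,
    "base-clusters": [{"id": 0, "members": [10, 11]}],
    "group-clusters": [{"id": 0, "members": [0]}],
    "group-clusterings": {2: ["..."], 3: ["..."], 4: ["..."], 5: ["..."]},
}

reloaded = simulate_restart(conv)
print(sorted(reloaded))                   # ['base-clusters', 'group-clusters', 'zid']
print(reloaded.get("group-clusterings"))  # None
```

Under these assumptions, everything except `group-clusterings` survives the round trip, which is the asymmetry the rest of this issue depends on.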

## The Problem

### Scenario A: Worker NOT Restarted (Consistent Behavior)

```clojure
; After first computation, in-memory state:
conv = {
  :base-clusters [...]           ; persisted to DB
  :group-clusterings {            ; ← in memory only!
    2 [...], 3 [...], 4 [...], 5 [...]
  }
  :group-clusters [...]          ; persisted to DB (e.g., k=2 winner)
}

; Second computation (new votes arrive):
; Line 438 (see below): :last-clusters (last-clusterings k)
; - For k=2: Uses previous k=2 clustering ✓
; - For k=3: Uses previous k=3 clustering ✓
; - For k=4: Uses previous k=4 clustering ✓
; - For k=5: Uses previous k=5 clustering ✓
```

**Result:** All k-values warm-started → stable, consistent clusters

### Scenario B: Worker WAS Restarted (Inconsistent Behavior)

```clojure
; After restart, loaded from DB only:
conv = {
  :base-clusters [...]           ; ✓ loaded from DB
  :group-clusterings nil          ; ✗ NOT in DB!
  :group-clusters [...]          ; loaded but not used for multi-k
}

; Second computation (same votes as Scenario A):
; Line 438 (see below): :last-clusters (last-clusterings k)
; - Returns nil for ALL k values!
; - Falls back to first-k initialization
```

**Result:** All k-values cold-started → **DIFFERENT CLUSTERS** than Scenario A
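
The difference between the two scenarios can be reproduced on toy data. The sketch below is a plain Lloyd's k-means in Python, not the Clojure `clusters/kmeans` implementation. The `last_centers` argument plays the role of `:last-clusters`, "first-k" is assumed to mean "use the first k points as initial centers", and both the data and the previous centers are invented for illustration.

```python
from math import dist


def kmeans(points, k, last_centers=None, max_iters=100):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    # Warm start if previous centers are available, otherwise "first-k" cold start.
    centers = list(last_centers) if last_centers else list(points[:k])
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels


# Three well-separated clumps on a line, clustered with k=2.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,), (20.0,), (21.0,), (22.0,)]

# Scenario A: hypothetical centers left in memory by a previous computation.
warm = kmeans(points, 2, last_centers=[(6.0,), (21.0,)])
# Scenario B: after a restart there is no previous state, so first-k is used.
cold = kmeans(points, 2)

print(warm)  # [0, 0, 0, 0, 0, 0, 1, 1, 1]
print(cold)  # [0, 0, 0, 1, 1, 1, 1, 1, 1]
```

Both runs converge, but to different local optima: the middle clump joins the left group in the warm run and the right group in the cold run, with identical input data.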

## Impact

### 1. User Experience
- Group memberships change arbitrarily after worker restarts
- Participant assignments flip between groups
- Breaks continuity of conversation visualization
- Confusing for moderators tracking opinion groups over time

### 2. Testing & Validation
- Cannot reliably test Python implementation against Clojure
- `real_data/` math_blobs have unknown provenance (warm or cold started?)
- No "ground truth" to compare against
- Regression tests fail unpredictably

### 3. Production Issues
- Different results between dev (frequent restarts) and prod (long-running)
- Deployments cause clustering changes even with no new votes
- Scaling events (pod restarts) affect clustering

## Evidence

### Code Locations

**Where `:last-clusters` is used:**

1. **Base clusters**:   https://github.com/compdemocracy/polis/blob/93e327447b3d2d2017b86fa23f009aee6b92c29b/math/src/polismath/math/conversation.clj#L406
   - ✅ Works: `base-clusters` IS in database

2. **Group clusters**: https://github.com/compdemocracy/polis/blob/93e327447b3d2d2017b86fa23f009aee6b92c29b/math/src/polismath/math/conversation.clj#L438-L439
   - ❌ Fails after restart: `group-clusterings` NOT in database

**Database schema**: https://github.com/compdemocracy/polis/blob/93e327447b3d2d2017b86fa23f009aee6b92c29b/server/schema.sql#L640-L645
- Only ONE row per conversation (`UNIQUE (zid, math_env)`)
- No versioning - overwrites previous data

**What gets written**: https://github.com/compdemocracy/polis/blob/93e327447b3d2d2017b86fa23f009aee6b92c29b/math/src/polismath/conv_man.clj#L173-L185
- Note: `group-clusterings` is NOT included in the persisted subset!
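
The single-row behavior of `math_main` can be illustrated with a small Python sketch. It uses an in-memory SQLite table in place of Postgres and a reduced column set. The `upload_math_main` function is a hypothetical stand-in, since the SQL behind `db/upload-math-main` is not shown in this issue. Replace-on-conflict is assumed because it matches the "overwrites previous data" observation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE math_main (
        zid INTEGER NOT NULL,
        math_env TEXT NOT NULL,
        data TEXT NOT NULL,
        UNIQUE (zid, math_env)
    )
    """
)


def upload_math_main(zid, math_env, data):
    """Hypothetical upload: replace the existing row for (zid, math_env)."""
    conn.execute(
        "INSERT OR REPLACE INTO math_main (zid, math_env, data) VALUES (?, ?, ?)",
        (zid, math_env, data),
    )


upload_math_main(1, "prod", '{"math_tick": 1}')
upload_math_main(1, "prod", '{"math_tick": 2}')

rows = conn.execute("SELECT zid, math_env, data FROM math_main").fetchall()
print(rows)  # [(1, 'prod', '{"math_tick": 2}')]
```

Only the latest result is retrievable, so a stored blob cannot be traced back to the state it was warm-started from.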

### Empirical Test Results

Comparing the Python implementation (always cold-started) against Clojure reference data:
- Biodiversity: **21% Jaccard similarity** (expected >95%)
- VW: **19% Jaccard similarity** (expected >95%)

The Clojure data was likely generated with warm-started clusters from an unknown previous state.
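
This issue does not state how the Jaccard figures were computed. One label-independent way to compare two clusterings, shown here purely as an assumption, is the Jaccard index over co-clustered participant pairs. The function names and the example labels are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Set of index pairs (i, j), i < j, that share a cluster."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pair_jaccard(labels_a, labels_b):
    """Jaccard index of the co-clustered pair sets; 1.0 means identical partitions."""
    a, b = co_clustered_pairs(labels_a), co_clustered_pairs(labels_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Two partitions of nine participants that disagree on the middle three.
warm = [0, 0, 0, 0, 0, 0, 1, 1, 1]
cold = [0, 0, 0, 1, 1, 1, 1, 1, 1]
score = pair_jaccard(warm, cold)
print(round(score, 3))  # 0.333

# Renaming the groups does not change the score.
print(pair_jaccard([0, 0, 1], [1, 1, 0]))  # 1.0
```

A pair-based measure avoids having to match group IDs between runs, which matters here because group numbering is not guaranteed to be stable across cold and warm starts.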

## Proposed Solutions

> **Note:** These solutions are for the **Clojure codebase only**. Since we are actively migrating to Python (which doesn't have this bug), fixing this in Clojure is **low priority** unless the Clojure math worker will continue running in production for an extended period.

### Option 1: Persist `group-clusterings` (Recommended)

**Change:** Include `group-clusterings` in the persisted state at https://github.com/compdemocracy/polis/blob/93e327447b3d2d2017b86fa23f009aee6b92c29b/math/src/polismath/conv_man.clj#L174

Add `:group-clusterings` to the `hash-map-subset` call.

**Pros:**
- Minimal code change
- Preserves warm-start behavior consistently
- Fixes non-determinism immediately

**Cons:**
- Increases `math_main.data` size (~4x more cluster data)
- One-time migration needed for existing conversations
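
One detail to check when persisting a map keyed by k into a JSON column (`math_main.data` is `jsonb`): JSON object keys are always strings. The Python sketch below shows the effect with the standard `json` module. The `load_clusterings` helper and the toy clusterings are hypothetical, and whether the Clojure load path needs an equivalent conversion has not been verified here.

```python
import json


def load_clusterings(blob):
    """Hypothetical loader: restore integer k keys after a JSON round trip."""
    return {int(k): v for k, v in json.loads(blob).items()}


# Toy clusterings keyed by k (member lists are made up).
group_clusterings = {2: [[0, 1], [2]], 3: [[0], [1], [2]]}

blob = json.dumps(group_clusterings)

naive = json.loads(blob)
print(naive.get(2))     # None, because the keys came back as "2" and "3"

restored = load_clusterings(blob)
print(restored.get(2))  # [[0, 1], [2]]
```

If the lookup by k silently returns nothing after a reload, the worker would cold-start exactly as it does today, so a round-trip test for this key is worth adding alongside the change.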

### Option 2: Remove Warm-Start for Group Clustering

**Change:** Always use first-k initialization for group clustering at https://github.com/compdemocracy/polis/blob/93e327447b3d2d2017b86fa23f009aee6b92c29b/math/src/polismath/math/conversation.clj#L438-L439

Remove the `:last-clusters` parameter from the `clusters/kmeans` call.

**Pros:**
- Fully deterministic behavior
- No database changes needed
- Matches Python implementation

**Cons:**
- Loses stability/continuity benefit of warm-start
- Groups may change more between updates
- May affect user experience

### Option 3: Hybrid Approach

Keep warm-start for base clusters (persisted) and remove it for group clusters (ephemeral).

**Pros:**
- Keeps warm-start only where its state survives a restart
- Base clusters stay stable (most important)
- Group selection deterministic

**Cons:**
- Asymmetric behavior (may be confusing)
- Still changes current behavior

## Recommendation

**For Clojure (if still in production):** Implement Option 1 and persist `group-clusterings` to the database.

**Rationale:**
1. Preserves existing warm-start benefits
2. Makes behavior consistent across restarts
3. Small storage cost is worth the reliability
4. Aligns with original design intent (warm-start for continuity)

**For Python migration:** This bug does not exist in the Python implementation, which uses deterministic cold-start initialization. Priority should be completing the Python migration rather than fixing legacy Clojure code.

**Additional improvement (if fixing Clojure):** Add versioning to `math_main` table for historical comparison and debugging.

## References
- Clojure clustering logic: https://github.com/compdemocracy/polis/blob/93e327447b3d2d2017b86fa23f009aee6b92c29b/math/src/polismath/math/conversation.clj#L436-L439
- Conv manager persistence: https://github.com/compdemocracy/polis/blob/93e327447b3d2d2017b86fa23f009aee6b92c29b/math/src/polismath/conv_man.clj#L158-L185
- Database schema: https://github.com/compdemocracy/polis/blob/93e327447b3d2d2017b86fa23f009aee6b92c29b/server/schema.sql#L640-L645


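Referenced schema excerpt (`server/schema.sql#L640-L645`):

```sql
CREATE TABLE public.math_main (
    zid integer NOT NULL,
    math_env character varying(999) NOT NULL,
    data jsonb NOT NULL,
    last_vote_timestamp bigint NOT NULL,
    caching_tick bigint DEFAULT 0 NOT NULL,
    -- excerpt ends here; the rest of the definition is not shown
```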

Referenced snippet (`math/src/polismath/math/conversation.clj#L436-L439`):

```clojure
:last-clusters
  ; A little pedantic here in case no clustering yet for this k
  (when-let [last-clusterings (:group-clusterings conv)]
    (last-clusterings k))
```

Referenced snippet (`math/src/polismath/conv_man.clj#L158-L185`):

```clojure
(defn write-conv-updates!
  [{:as conv-man :keys [postgres]} {:as updated-conv :keys [zid]} math-tick]
  ;; TODO Really need to extract these writes so that mod updates do what they're supposed to! And also run in async/thread for better parallelism
  ; Format and upload main results
  (async/thread
    (doseq [[prep-fn upload-fn] [[prep-main db/upload-math-main] ; main math results, for client
                                 [prep-bidToPid db/upload-math-bidtopid] ; bidtopid mapping, for server
                                 [prep-ptpt-stats db/upload-math-ptptstats]]]
      (->> updated-conv
           prep-fn
           (upload-fn postgres zid math-tick)))
    (log/info "Finished uploading math results for zid:" zid)))

(defn restructure-json-conv
  [conv]
  (-> conv
      (utils/hash-map-subset #{:math_tick :raw-rating-mat :rating-mat :lastVoteTimestamp :mod-out :mod-in :zid :pca :in-conv :n :n-cmts :group-clusters :base-clusters :repness :group-votes :subgroup-clusters :subgroup-votes :subgroup-repness :group-aware-consensus :comment-priorities :meta-tids})
      (assoc :last-vote-timestamp (get conv :lastVoteTimestamp)
             :last-mod-timestamp (get conv :lastModTimestamp))
      ; Make sure there is an empty named matrix to operate on
      (assoc :raw-rating-mat (nm/named-matrix))
      ; Update the base clusters to be unfolded
      (update :base-clusters clust/unfold-clusters)
      ; Make sure in-conv is a set
      (update :in-conv set)
      (update :mod-out set)
      (update :mod-in set)
      (update :meta-tids set)))
```
