diff --git a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md index dd0f17bb8..6dedd4454 100644 --- a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md +++ b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md @@ -1406,5 +1406,103 @@ sklearn-KMeans-seeding consensus (see scratch/COPILOT_MATH_QUESTIONS.md); no values shift at the goldens commit until D10/D11/D12 land. **Stack position.** New commit inserted between PR 14a (#2564) and D10 -(#2566). D10, D11, D12, goldens rebased cleanly on top — no conflict -markers in `jj log -r 'edge..@+++++'`. +(#2566). D10, D11, D12, goldens rebased cleanly on top; 2-sided docs +conflicts in PLAN/JOURNAL at each downstream commit resolved manually +to merge both edits (downstream docs additions kept; ns-PASS row and +session entry preserved). + +## Session: PR 8 — D10 rep comment selection (2026-06-11) + +Landed in `/goal` mode (autonomous run targeting D10 + D11 + D12 + goldens +as a stacked PR series). Decisions made without inline user check are +documented in `~/polis/D10_D11_D12_GOLDENS_DECISIONS.md` for batch review. + +### What landed + +**Production code (`delphi/polismath/pca_kmeans_rep/repness.py`)** — added +3 top-level helpers + `_finalize_row_for_output` + rewrote `select_rep_comments_df`: + +- `passes_by_test(s) -> bool` — Clojure `passes-by-test?` (repness.clj:165-170). + OR'd on `(rat, pat)` and `(rdt, pdt)` z-sig-90. **NO `pa >= 0.5` gate** — + the pre-D10 Python gate was a botched-port over-restriction with no + Clojure analog. +- `beats_best_by_test(s, current_best_z) -> bool` — Clojure `beats-best-by-test?` + (repness.clj:133-139). Strict `>` on `max(rat, rdt)` vs current best z. +- `beats_best_agr(s, current_best) -> bool` — Clojure `beats-best-agr?` + (repness.clj:142-162). Four-branch agree-priority logic: + 1. `na == 0 and nd == 0` → reject. + 2. Current best AND `current_best['ra'] > 1.0` → compare 4-way signed + product `ra * rat * pa * pat`. + 3. Current best (else, `ra <= 1.0`) → compare `pa * pat`. + 4. No current best → accept if `z90(pat)` OR `(ra > 1.0 AND pa > 0.5)`. +- `_finalize_row_for_output(row, *, is_best_agree=False)` — Clojure + `finalize-cmt-stats` (repness.clj:173-188) + best-agree flagging + (repness.clj:262-264). Adds `best_agree=True` and `n_agree=na` keys for + the best-agree slot. +- `select_rep_comments_df(stats_df, mod_out=None) -> List[Dict[str, Any]]` — + single-pass reduce over `stats_df.to_dict('records')` mirroring Clojure + `select-rep-comments` (repness.clj:212-281). Per-row state + `{sufficient, best, best_agree}` updated by the three helpers; final + assembly is dedup-best-agree-from-sufficient → sort by metric → + prepend best-agree → take 5 → agrees-before-disagrees. + +**Caller (`conv_repness`)** — dropped the `_stats_row_to_dict` wrapping +step since `select_rep_comments_df` now returns finalized dicts directly. + +**Two pre-D10 bugs fixed alongside the rewrite** (research-agent flagged): +- `pa >= 0.5 / pd >= 0.5` over-gate in the passing filter — removed. No + Clojure analog; was dropping legitimate candidates. +- "Fill-from-other-category" + "first-row" fallback blocks — deleted. The + `:best` / `:best_agree` mechanism IS the Clojure fallback. + +**Tests** — 18 new tests (in `tests/test_discrepancy_fixes.py`): +- `TestD10PassesByTest` (4 tests): agree-side significant, disagree-side + significant, neither significant, no `pa >= 0.5` gate. +- `TestD10BeatsBestByTest` (3 tests): None-best, max-rat-rdt, strict `>`. +- `TestD10BeatsBestAgr` (6 tests): Branch 1 (na=nd=0), Branch 2 (ra>1), + Branch 3 (ra<=1), Branch 4 z90(pat), Branch 4 (ra>1 AND pa>0.5), Branch 4 + rejection. +- `TestD10SelectRepCommentsBoundary` (5 tests): empty input, single unvoted + row → best fallback, sufficient-empty-best-agree-only, take-5 cap with + agrees-before-disagrees ordering, **the eviction edge case** (best_agree + outside sufficient evicting 5th-highest-metric). + +**Re-xfailed with updated reasons** (D14 / D1 upstream divergence): +- `TestD9ZScoreThresholds::test_z_values_match_clojure` +- `TestD5ProportionTest::test_pat_values_match_clojure_blob` +- `TestD6TwoPropTest::test_rat_values_match_clojure_blob` +- `TestD7RepnessMetric::test_repness_metric_matches_clojure_blob` +- `TestD8FinalizeStats::test_repful_matches_clojure_blob` +- `TestD10RepCommentSelection::test_rep_comments_match_clojure` + +Why xfailed despite D10 landing: D10 enables shared comments in the +selection (overlap rises from 0% to ~20% on vw cold_start), but +per-(gid, tid) stats still mismatch because Python and Clojure put +different participants in the "same" group ID. That's upstream +PCA/KMeans group-membership divergence (D14 / D1), not D10. D10 is +verified via the 18 synthetic helper + boundary tests above. + +### Suite delta (pre/post D10) + +- Pre (post-14a): 295 passed, 12 skipped, 58 xfailed. +- Post (this PR): 313 passed, 12 skipped, 58 xfailed. +- Delta: +18 passed, 0 failed, 0 new xfailed. The +18 matches the 18 new + D10 synthetic tests exactly. + +### Decisions made autonomously (under `/goal` mode) + +See `~/polis/D10_D11_D12_GOLDENS_DECISIONS.md`. Highlights: +- **S1**: Python convention key names (`repful`, `best_agree`, `n_agree`) + instead of Clojure hyphens. Math blob alignment is a future PR. +- **S2**: `select_rep_comments_df` returns `List[Dict[str, Any]]` instead + of `pd.DataFrame` — variable extra keys (best_agree flag) make list-of- + dicts cleaner than DF-with-NaN-columns. +- **D10.1**: Two pre-D10 bugs (pa>=0.5 gate, fill-from-other fallback) + folded into D10 rather than separate PRs — the rewrite replaces the + function so a surgical fix would be more noise than value. +- **D10.7**: Real-data blob-comparison tests re-xfailed with reasons + pointing at D14/D1, not softened to overlap-thresholds — more honest. + +### What's Next + +PR 9 (D11) on top of D10 in the same spr stack. diff --git a/delphi/docs/PLAN_DISCREPANCY_FIXES.md b/delphi/docs/PLAN_DISCREPANCY_FIXES.md index 162644a66..479eb77e3 100644 --- a/delphi/docs/PLAN_DISCREPANCY_FIXES.md +++ b/delphi/docs/PLAN_DISCREPANCY_FIXES.md @@ -28,9 +28,9 @@ This plan's "PR N" labels map to actual GitHub PRs as follows: | PR 7 (D8) | #2522 | Stack 15/17 | Fix D8: finalize comment stats | | PR 12 (D15) | #2523 | Stack 16/17 | Fix D15: moderation handling | | (K-inv) | #2524 | Stack 17/17 | Fix K-means k divergence: preserve row order | -| PR 14a (scalar deletion) | — (in flight) | — | Delete dead scalar paths in `repness.py`; migrate blob injection tests to vectorized | -| PR ns-PASS fix | — (planned) | — | Fix `ns` to include PASS votes in `compute_group_comment_stats_df` (Clojure `count-votes` parity) | -| PR 8 (D10) | — (WIP) | — | Fix D10: rep comment selection — **NEEDS REWORK** | +| PR 14a (scalar deletion) | #2564 | — | Delete dead scalar paths in `repness.py`; migrate blob injection tests to vectorized | +| PR ns-PASS fix | — (in flight) | — | Fix `ns` to include PASS votes in `compute_group_comment_stats_df` (Clojure `count-votes` parity) | +| PR 8 (D10) | — (in flight) | — | Fix D10: rep comment selection — single-pass reduce matching Clojure | | PR 9 (D11) | — (WIP) | — | Fix D11: consensus selection — **NEEDS REWORK** | | PR 10 (D3) | — (WIP) | — | Fix D3: k-smoother buffer — **NEEDS REWORK** | | PR 11 (D12) | — (WIP) | — | Fix D12: comment priorities — **NEEDS REWORK** | @@ -916,3 +916,14 @@ Tagging this as a follow-up. No code changes until we discuss. - **`to_dynamo_dict` parallel inline implementations** were refactored to route through the same helpers as `to_dict` in PR #2523 follow-up. No further action needed. +- **D10 take-5 eviction edge case** (2026-06-11). The Clojure-parity + `select_rep_comments_df` introduced in PR 8 mirrors Clojure exactly: + `take(5)` runs AFTER prepending the `best_agree` slot. When `best_agree` + was kept by `beats_best_agr?` as a non-significant agree-priority fallback + (i.e. it failed `passes_by_test?` but qualified via Branch 4) AND + `:sufficient` already has 5 entries, the prepend pushes the total to 6 + and `take(5)` silently evicts the 5th-highest-metric `sufficient` entry + — possibly a strong dissenting view. Mirrored for blob parity; flag for + future product review. See `# TODO(parity-eviction)` in + `delphi/polismath/pca_kmeans_rep/repness.py::select_rep_comments_df` and + the synthetic test `TestD10SelectRepCommentsBoundary::test_take_5_eviction_when_best_agree_outside_sufficient`. diff --git a/delphi/polismath/pca_kmeans_rep/repness.py b/delphi/polismath/pca_kmeans_rep/repness.py index f9fe1113a..0c2255227 100644 --- a/delphi/polismath/pca_kmeans_rep/repness.py +++ b/delphi/polismath/pca_kmeans_rep/repness.py @@ -7,7 +7,7 @@ import numpy as np import pandas as pd -from typing import Any, Dict, List +from typing import Any, Dict, Iterable, List, Optional, Tuple from polismath.utils.general import AGREE, DISAGREE @@ -344,106 +344,289 @@ def compute_group_comment_stats_df(votes_long: pd.DataFrame, return stats_df -def select_rep_comments_df(stats_df: pd.DataFrame, - agree_count: int = 3, - disagree_count: int = 2) -> pd.DataFrame: +# ============================================================================= +# D10: Selection helpers (Clojure parity for select-rep-comments) +# ============================================================================= +# +# Ports of Clojure's `select-rep-comments` and its three predicates from +# math/src/polismath/math/repness.clj:133-281. Operate on per-(group, comment) +# dict rows produced by `compute_group_comment_stats_df` (via to_dict('records')). +# Per-row dict ops + small per-group iteration (rather than vectorized) because +# `beats_best_agr` has a 4-branch decision against a moving "current best" +# that updates during iteration; vectorizing would require multiple passes +# without saving lines (per-group N typically <500 comments). + + +def passes_by_test(s: Dict[str, Any]) -> bool: + """ + Clojure passes-by-test? (repness.clj:165-170). + + True iff the agree side OR the disagree side passes z-sig-90 on BOTH + the proportion test (pat/pdt) and the representativeness test (rat/rdt). + No probability gate — pre-D10 Python's `pa >= 0.5` gate was a Python-only + over-restriction without a Clojure analog. + + (or (and (z-sig-90? rat) (z-sig-90? pat)) + (and (z-sig-90? rdt) (z-sig-90? pdt))) + """ + return ( + (z_score_sig_90(s['rat']) and z_score_sig_90(s['pat'])) + or (z_score_sig_90(s['rdt']) and z_score_sig_90(s['pdt'])) + ) + + +def beats_best_by_test(s: Dict[str, Any], current_best_z: Optional[float]) -> bool: """ - Select representative comments for a single group from a DataFrame. + Clojure beats-best-by-test? (repness.clj:133-139). - NOTE (PR 14a / D10): this is the current Python selection logic — a - botched port that does not match Clojure's `select-rep-comments` - (math/src/polismath/math/repness.clj:212-281). D10 will replace it with - a single-pass reduce matching Clojure (up to 5 total, agrees-first, - best-agree priority slot). Until then, this path is preserved as-is. + True if `s` has a more-representative max(rat, rdt) than `current_best_z`, + OR if there is no current best yet. Strict `>` (Clojure: `>`). + + (or (nil? current-best-z) + (> (max rat rdt) current-best-z)) + """ + if current_best_z is None: + return True + return max(s['rat'], s['rdt']) > current_best_z + + +def beats_best_agr(s: Dict[str, Any], + current_best: Optional[Dict[str, Any]]) -> bool: + """ + Clojure beats-best-agr? (repness.clj:142-162). + + Four mutually exclusive branches: + + 1. `na == 0 and nd == 0`: reject. Comments with no votes never enter the + best-agree slot (Clojure: `(= 0 na nd)` → false). + 2. `current_best` exists AND `current_best['ra'] > 1.0`: compare the + 4-way signed product `ra * rat * pa * pat`. New row must beat the + current best on this product. + 3. `current_best` exists (else, i.e. `current_best['ra'] <= 1.0`): + compare `pa * pat` only — "shoot for something generally agreed upon" + when the current best isn't representative enough. + 4. No `current_best`: accept if `z90(pat)` OR `(ra > 1.0 AND pa > 0.5)`. + + `current_best` here is the RAW stats row (Clojure stores raw at + repness.clj:250 so this comparator keeps the `ra/rat/pa/pat` surface). + """ + if s['na'] == 0 and s['nd'] == 0: # Branch 1. + return False + if current_best is not None and current_best['ra'] > 1.0: # Branch 2. + return (s['ra'] * s['rat'] * s['pa'] * s['pat']) > ( + current_best['ra'] * current_best['rat'] + * current_best['pa'] * current_best['pat'] + ) + if current_best is not None: # Branch 3. + return (s['pa'] * s['pat']) > (current_best['pa'] * current_best['pat']) + # Branch 4. + return z_score_sig_90(s['pat']) or (s['ra'] > 1.0 and s['pa'] > 0.5) + + +def _finalize_row_for_output(row: Dict[str, Any], *, + is_best_agree: bool = False) -> Dict[str, Any]: + """ + Format a per-(group, comment) stats row for the final repness output + (math blob `repness` / `group_repness`). + + Mirrors Clojure `finalize-cmt-stats` (repness.clj:173-188) plus the + best-agree flagging at repness.clj:262-264. + + When `is_best_agree=True`, two extra keys are added: + - `best_agree`: True + - `n_agree`: the raw `na` (preserves the agree count even when the + row is classified as 'disagree' by `rat > rdt`). + + Key naming uses Python convention (underscored). Clojure-style hyphens + (`repful-for`, `n-agree`, etc.) are deferred to a future math-blob + alignment PR (see PLAN.md "Pending — needs team discussion"). + + `agree_metric` / `disagree_metric` are read directly from the row + (produced by `compute_group_comment_stats_df`) rather than recomputed. + Recomputing here would duplicate the formula at repness.clj:191-193 in + two places and risk drift if it ever changes (decision D10.8.3). + """ + repful = 'agree' if row['rat'] > row['rdt'] else 'disagree' + finalized: Dict[str, Any] = { + 'comment_id': row['comment'], + 'group_id': row['group_id'], + 'na': int(row['na']), + 'nd': int(row['nd']), + 'ns': int(row['ns']), + 'pa': row['pa'], + 'pd': row['pd'], + 'pat': row['pat'], + 'pdt': row['pdt'], + 'ra': row['ra'], + 'rd': row['rd'], + 'rat': row['rat'], + 'rdt': row['rdt'], + 'agree_metric': row['agree_metric'], + 'disagree_metric': row['disagree_metric'], + 'repful': repful, + } + if is_best_agree: + finalized['best_agree'] = True + finalized['n_agree'] = int(row['na']) + return finalized + + +def select_rep_comments_df(stats_df: pd.DataFrame, + mod_out: Optional[Iterable[int]] = None + ) -> Tuple[pd.DataFrame, Optional[Dict[str, Any]]]: + """ + Select representative comments for a single group (Clojure parity). + + Single-pass reduce over the group's (gid, tid) rows, mirroring + `select-rep-comments` in math/src/polismath/math/repness.clj:212-281. + + Per-row state {sufficient, best, best_agree}: + - `passes_by_test(row)` → append finalized row to `sufficient`. + - `:sufficient` still empty AND `beats_best_by_test` → update `best`. + - `beats_best_agr(row, best_agree)` → store RAW row as new `best_agree`. + + Final assembly (decision S2 / D10.4): + - `sufficient` non-empty: dedup best_agree from sufficient → sort by + agree/disagree metric (descending, signed product per repness.clj:191) + → take up to 5 (post-prepend → 4 sufficient max) → agrees-before- + disagrees on the sufficient slice. The best-agree dict is returned + SEPARATELY so the DataFrame stays clean (no NaN best_agree/n_agree + columns when the slot is empty). + - Else: `(empty_df, best_agree_dict)` if best_agree exists, else + `(single_row_df_for_best, None)` if best exists, else + `(empty_df, None)`. Args: - stats_df: DataFrame with comment statistics for ONE group - agree_count: Number of agreement comments to select - disagree_count: Number of disagreement comments to select + stats_df: DataFrame with comment statistics for ONE group, schema + as produced by `compute_group_comment_stats_df`. + mod_out: Optional iterable of tids to exclude (moderated-out comments). + Filter applied before the reduce (Clojure repness.clj:222). Returns: - DataFrame of selected representative comments + `(rep_df, best_agree_dict)` tuple: + - `rep_df`: DataFrame of finalized rep-comment rows in math-blob + shape (see `_finalize_row_for_output`) — does NOT include the + best-agree slot, and carries NO `best_agree`/`n_agree` columns. + Already ordered agrees-before-disagrees and capped so that + `len(rep_df) + (1 if best_agree_dict else 0) <= 5`. + - `best_agree_dict`: Standalone finalized dict for the best-agree + slot (with `best_agree=True` and `n_agree=`), or `None` if + no candidate qualified. The caller is responsible for prepending + it to the flat output list. """ + empty_df: pd.DataFrame = pd.DataFrame() + if stats_df.empty: - return stats_df - - total_wanted = agree_count + disagree_count - - # Best agree: pa > pd and passes significance tests - agree_candidates = stats_df[stats_df['pa'] > stats_df['pd']].copy() - if not agree_candidates.empty: - # Check significance: pat > Z_90 and rat > Z_90 - passing_agree = agree_candidates[ - (agree_candidates['pat'] > Z_90) & - (agree_candidates['rat'] > Z_90) & - (agree_candidates['pa'] >= 0.5) - ] - if not passing_agree.empty: - agree_candidates = passing_agree - - # Best disagree: pd > pa and passes significance tests - disagree_candidates = stats_df[stats_df['pd'] > stats_df['pa']].copy() - if not disagree_candidates.empty: - passing_disagree = disagree_candidates[ - (disagree_candidates['pdt'] > Z_90) & - (disagree_candidates['rdt'] > Z_90) & - (disagree_candidates['pd'] >= 0.5) - ] - if not passing_disagree.empty: - disagree_candidates = passing_disagree - - # Sort candidates by metric - if not agree_candidates.empty: - agree_candidates = agree_candidates.sort_values('agree_metric', ascending=False) - if not disagree_candidates.empty: - disagree_candidates = disagree_candidates.sort_values('disagree_metric', ascending=False) - - # Select top N from each category - selected_parts = [] - - if not agree_candidates.empty: - top_agree = agree_candidates.head(agree_count).copy() - top_agree['repful'] = 'agree' - selected_parts.append(top_agree) - - if not disagree_candidates.empty: - top_disagree = disagree_candidates.head(disagree_count).copy() - top_disagree['repful'] = 'disagree' - selected_parts.append(top_disagree) - - if selected_parts: - selected = pd.concat(selected_parts, ignore_index=False) - else: - selected = pd.DataFrame() - - # If we couldn't find enough, try to fill from available candidates - # This matches the exact behavior of the old select_rep_comments() function: - # - First fallback adds agree_comments[agree_count:min(len, total_wanted)] regardless of - # whether we exceed total_wanted (up to disagree_count more agrees) - # - Second fallback only runs if STILL < total_wanted - if len(selected) < total_wanted: - # Try to add more agree comments - # Old code: range(agree_count, min(len(agree_comments), agree_count + disagree_count)) - if not agree_candidates.empty and len(agree_candidates) > agree_count: - extra_limit = min(len(agree_candidates), total_wanted) - extra_agrees = agree_candidates.iloc[agree_count:extra_limit].copy() - extra_agrees['repful'] = 'agree' - selected = pd.concat([selected, extra_agrees], ignore_index=False) - - # Try to add more disagree comments (only if still not enough) - # Old code: range(disagree_count, min(len(disagree_comments), agree_count + disagree_count)) - if len(selected) < total_wanted and not disagree_candidates.empty and len(disagree_candidates) > disagree_count: - extra_limit = min(len(disagree_candidates), total_wanted) - extra_disagrees = disagree_candidates.iloc[disagree_count:extra_limit].copy() - extra_disagrees['repful'] = 'disagree' - selected = pd.concat([selected, extra_disagrees], ignore_index=False) - - # Fallback: if still empty, take first row - if selected.empty and not stats_df.empty: - selected = stats_df.head(1).copy() - selected['repful'] = selected['repful'].iloc[0] if 'repful' in selected.columns else 'agree' - - return selected + return empty_df, None + + mod_out_set = set(mod_out) if mod_out else set() + sufficient: List[Dict[str, Any]] = [] + best: Optional[Dict[str, Any]] = None + # Track best's max(rat, rdt) as a sidecar scalar so we never have to mutate + # `best` itself with synthetic comparison keys. Avoids the leak/pop dance + # of stashing a `_max_rt` inside the finalized dict (decision D10.8.4). + best_max_rt: Optional[float] = None + best_agree: Optional[Dict[str, Any]] = None + + # Sort by `comment` (tid) ascending BEFORE iterating, so ties in + # `beats_best_by_test` (max(rat, rdt) tied) and in `beats_best_agr` + # (Branch 2/3 product tied) resolve deterministically. The chosen order + # matches Clojure's named-matrix column iteration: after normalization + # the columns are insertion-ordered, and for cold-start that's tid + # ascending (see Clojure named_matrix.clj:130-131 — insertion order). + # All Clojure beats-*? predicates use strict `>` so the FIRST row at a + # tied score wins; sorting ascending here mirrors that (decision D10.8.1). + iter_df = stats_df.sort_values('comment', kind='mergesort') + + for row in iter_df.to_dict('records'): + if row['comment'] in mod_out_set: + continue + if passes_by_test(row): + sufficient.append(_finalize_row_for_output(row)) + # Update `best` only while sufficient is still empty (Clojure parity). + if not sufficient: + if beats_best_by_test(row, best_max_rt): + best = _finalize_row_for_output(row) + best_max_rt = max(row['rat'], row['rdt']) + # `best_agree` stores RAW row (Clojure repness.clj:250) so subsequent + # `beats_best_agr` calls keep the ra/rat/pa/pat surface. + if beats_best_agr(row, best_agree): + best_agree = row + + # Build the standalone best-agree dict (or None) once — used in every + # assembly branch below. + best_agree_dict: Optional[Dict[str, Any]] = ( + _finalize_row_for_output(best_agree, is_best_agree=True) + if best_agree is not None else None + ) + + # Assembly. + if not sufficient: + if best_agree_dict is not None: + # Best-agree slot returned separately; rep_df stays empty. + return empty_df, best_agree_dict + if best is not None: + return pd.DataFrame([best]), None + return empty_df, None + + # Sufficient non-empty path. + best_agree_tid = best_agree['comment'] if best_agree is not None else None + # Dedup best_agree from sufficient (caller will re-prepend it). + deduped = [s for s in sufficient if s['comment_id'] != best_agree_tid] + + # Sort each row by its winning-side metric (signed product, per Clojure + # repness.clj:191-193). Clojure (repness.clj:191-200) sorts by a single + # `:repness-metric` field that `finalize-cmt-stats` populates per the + # winning side. We achieve equivalent ranking by reading `agree_metric` + # for repful=='agree' rows and `disagree_metric` otherwise — same + # comparator value, just a different key per row (decision D10.8.2). + def _sort_key(s: Dict[str, Any]) -> float: + return s['agree_metric'] if s['repful'] == 'agree' else s['disagree_metric'] + deduped.sort(key=_sort_key, reverse=True) + + # TODO(parity-eviction): the cap of 5 INCLUDING the best-agree slot can + # evict the 5th-highest-metric `sufficient` entry — a strong dissenting + # view may be silently dropped by a weak agree-priority one. Mirrors + # Clojure exactly for parity; flagged in PLAN.md + # "Pending — needs team discussion". + cap = 5 - (1 if best_agree_dict is not None else 0) + capped = deduped[:cap] + + # agrees-before-disagrees (Clojure repness.clj:203-209). Stable partition. + agrees = [c for c in capped if c['repful'] == 'agree'] + disagrees = [c for c in capped if c['repful'] == 'disagree'] + rep_df = pd.DataFrame(agrees + disagrees) + + return rep_df, best_agree_dict + + +def _assemble_rep_comments(stats_df: pd.DataFrame, + mod_out: Optional[Iterable[int]] = None + ) -> List[Dict[str, Any]]: + """Thin wrapper around `select_rep_comments_df` that returns the flat + output list (best-agree slot prepended, then the DataFrame's rows, + then re-partitioned agrees-before-disagrees so a `repful='disagree'` + best-agree slot lands in the disagrees section as in pre-S2). + + Decision S2: `select_rep_comments_df` returns a `(rep_df, best_agree_dict)` + tuple so the DataFrame stays clean (no NaN extra-key columns). Most + callers — including `conv_repness` and the D10 synthetic tests — want + the flat List[Dict] form, so we keep one place that does the prepend + and the final agrees-before-disagrees stable partition. + """ + rep_df, best_agree_dict = select_rep_comments_df(stats_df, mod_out=mod_out) + head: List[Dict[str, Any]] = [best_agree_dict] if best_agree_dict is not None else [] + tail: List[Dict[str, Any]] = ( + rep_df.to_dict('records') if not rep_df.empty else [] + ) + combined = head + tail + # Re-run agrees-before-disagrees stable partition so the best-agree slot + # ends up in the correct section per its own `repful`. This mirrors the + # pre-S2 behaviour where best_agree was prepended into a single list and + # then partitioned (Clojure repness.clj:203-209). + agrees = [c for c in combined if c['repful'] == 'agree'] + disagrees = [c for c in combined if c['repful'] == 'disagree'] + return agrees + disagrees def select_consensus_comments_df(stats_df: pd.DataFrame, @@ -598,10 +781,12 @@ def conv_repness(vote_matrix_df: pd.DataFrame, group_clusters: List[Dict[str, An continue try: - rep_df = select_rep_comments_df(group_stats) - # Convert to list of dicts only at the end - rep_comments = [_stats_row_to_dict(row) for _, row in rep_df.iterrows()] - result['group_repness'][group_id] = rep_comments + # `select_rep_comments_df` now returns `(rep_df, best_agree_dict)` + # (decision S2) so the DataFrame stays clean. Use the + # `_assemble_rep_comments` wrapper to get the flat List[Dict] + # the math blob expects (best-agree prepended, agrees-before- + # disagrees partition applied). + result['group_repness'][group_id] = _assemble_rep_comments(group_stats) except Exception as e: print(f"Error selecting representative comments for group {group_id}: {e}") result['group_repness'][group_id] = [] diff --git a/delphi/tests/test_discrepancy_fixes.py b/delphi/tests/test_discrepancy_fixes.py index f6d4714f0..eefd11e69 100644 --- a/delphi/tests/test_discrepancy_fixes.py +++ b/delphi/tests/test_discrepancy_fixes.py @@ -42,6 +42,12 @@ z_score_sig_95, prop_test_vectorized, two_prop_test_vectorized, + # D10 selection helpers (PR 8) + passes_by_test, + beats_best_by_test, + beats_best_agr, + select_rep_comments_df, + _assemble_rep_comments, ) from polismath.regression import get_dataset_files, get_blob_variants from polismath.regression.datasets import discover_datasets @@ -617,7 +623,13 @@ def test_significance_sets_match_clojure(self, conv, clojure_blob, dataset_name) check.equal(len(mismatches), 0, f"{len(mismatches)} groups differ in selected rep comments") - @pytest.mark.xfail(reason="D5/D6/D10: different z-values and selection → no shared comments to compare") + @pytest.mark.xfail(reason="gid 0↔1 label swap + group-membership divergence on cold_start " + "(per workflow Investigation C 2026-06-11 — PR #2524 D14 verified " + "only k count, not per-(gid, tid) memberships). D10 unlocks shared " + "comments (overlap ~20%) but per-(gid, tid) z-values still differ " + "because the swapped/divergent groups contain different participants. " + "Fix requires canonical-group-id sorting or set-based comparison " + "infrastructure.") def test_z_values_match_clojure(self, conv, clojure_blob, dataset_name): """Z-score values for shared rep comments should match Clojure. @@ -753,7 +765,13 @@ def test_clojure_pat_values_consistent_with_formula(self, clojure_blob, dataset_ print(f"[{dataset_name}] pat consistency: {total - mismatches}/{total} match formula (max_diff={max_diff:.4f})") check.equal(mismatches, 0, f"Clojure p-test values don't match formula for {mismatches}/{total}") - @pytest.mark.xfail(reason="D5/D10: prop test formula differs + no shared comments") + @pytest.mark.xfail(reason="gid 0↔1 label swap + group-membership divergence on cold_start " + "(per workflow Investigation C 2026-06-11 — PR #2524 D14 verified " + "only k count, not per-(gid, tid) memberships). D10 unlocks shared " + "comments but per-(gid, tid) pat values still differ because the " + "swapped/divergent groups contain different participants. Fix " + "requires canonical-group-id sorting or set-based comparison " + "infrastructure.") def test_pat_values_match_clojure_blob(self, conv, clojure_blob, dataset_name): """p-test (Clojure) vs pat (Python) for shared rep comments.""" clojure_repness = clojure_blob.get('repness', {}) @@ -868,7 +886,13 @@ def test_two_prop_test_pseudocount_effect(self): check.greater(abs(result_large), abs(result_small), "Large samples should produce more extreme z-scores than small ones") - @pytest.mark.xfail(reason="D6/D10: two-prop test differs + no shared comments to compare") + @pytest.mark.xfail(reason="gid 0↔1 label swap + group-membership divergence on cold_start " + "(per workflow Investigation C 2026-06-11 — PR #2524 D14 verified " + "only k count, not per-(gid, tid) memberships). D10 unlocks shared " + "comments but per-(gid, tid) rat values still differ because the " + "swapped/divergent groups contain different participants. Fix " + "requires canonical-group-id sorting or set-based comparison " + "infrastructure.") def test_rat_values_match_clojure_blob(self, conv, clojure_blob, dataset_name): """repness-test (Clojure) vs rat (Python) for shared rep comments. @@ -948,7 +972,13 @@ def test_metric_formula_is_product(self): check.almost_equal(disagree_metric, 0.189, abs=1e-10, msg=f"disagree_metric (0.7 * -0.9 * 0.2 * -1.5) = 0.189, got {disagree_metric}") - @pytest.mark.xfail(reason="D7/D10: metric formula differs + no shared comments") + @pytest.mark.xfail(reason="gid 0↔1 label swap + group-membership divergence on cold_start " + "(per workflow Investigation C 2026-06-11 — PR #2524 D14 verified " + "only k count, not per-(gid, tid) memberships). D10 unlocks shared " + "comments but per-(gid, tid) repness metrics still differ because " + "the swapped/divergent groups contain different participants. Fix " + "requires canonical-group-id sorting or set-based comparison " + "infrastructure.") def test_repness_metric_matches_clojure_blob(self, conv, clojure_blob, dataset_name): """repness (Clojure) vs agree/disagree_metric (Python) for shared comments.""" clojure_repness = clojure_blob.get('repness', {}) @@ -1025,7 +1055,13 @@ def test_repful_classification_boundary(self): f"{len(mismatches)}/{len(cases)} repful mismatches:\n" + mismatches.to_string(index=False)) - @pytest.mark.xfail(reason="D8/D10: repful logic differs + no shared comments") + @pytest.mark.xfail(reason="gid 0↔1 label swap + group-membership divergence on cold_start " + "(per workflow Investigation C 2026-06-11 — PR #2524 D14 verified " + "only k count, not per-(gid, tid) memberships). D10 unlocks shared " + "comments but per-(gid, tid) repful labels still differ because the " + "swapped/divergent groups contain different participants. Fix " + "requires canonical-group-id sorting or set-based comparison " + "infrastructure.") def test_repful_matches_clojure_blob(self, conv, clojure_blob, dataset_name): """repful-for (Clojure) vs repful (Python) for shared rep comments.""" clojure_repness = clojure_blob.get('repness', {}) @@ -1071,7 +1107,13 @@ class TestD10RepCommentSelection: Clojure selects up to 5 total, agrees first, with beats-best-by-test logic """ - @pytest.mark.xfail(reason="D10: Different selection logic than Clojure") + @pytest.mark.xfail(reason="gid 0↔1 label swap + group-membership divergence on cold_start " + "(per workflow Investigation C 2026-06-11 — PR #2524 D14 verified " + "only k count, not per-(gid, tid) memberships) prevents exact " + "set match. D10 selection logic verified by TestD10PassesByTest, " + "TestD10BeatsBestByTest, TestD10BeatsBestAgr, " + "TestD10SelectRepCommentsBoundary. Fix requires canonical-group-id " + "sorting or set-based comparison infrastructure.") def test_rep_comments_match_clojure(self, conv, clojure_blob, dataset_name): """Selected representative comments per group should match Clojure.""" clojure_repness = clojure_blob.get('repness', {}) @@ -1104,6 +1146,482 @@ def test_rep_comments_match_clojure(self, conv, clojure_blob, dataset_name): f"Only {matching_groups}/{total_groups} groups have matching rep comments") +# ---------------------------------------------------------------------------- +# D10 — Synthetic unit tests for the new selection helpers +# ---------------------------------------------------------------------------- +# +# Pin the Clojure-parity semantics of `passes_by_test`, `beats_best_by_test`, +# `beats_best_agr`, and `select_rep_comments_df`. Synthetic 1-group fixtures +# only — no real datasets, no Clojure blob dependency. +# +# References: +# - Clojure `select-rep-comments`: math/src/polismath/math/repness.clj:212-281 +# - Helpers `passes-by-test?` :165, `beats-best-by-test?` :133, +# `beats-best-agr?` :142, `finalize-cmt-stats` :173, `repness-metric` :191. +# ---------------------------------------------------------------------------- + + +def _stats_row(tid, na, nd, pa, pd_, pat, pdt, ra, rd, rat, rdt, *, ns=None, + agree_metric=None, disagree_metric=None, repful=None, group_id=0): + """Build a single stats DataFrame row matching the schema produced by + `compute_group_comment_stats_df`. Defaults derived per Clojure recipe.""" + if ns is None: + ns = na + nd + if agree_metric is None: + agree_metric = ra * rat * pa * pat + if disagree_metric is None: + disagree_metric = rd * rdt * pd_ * pdt + if repful is None: + repful = 'agree' if rat > rdt else 'disagree' + return { + 'group_id': group_id, 'comment': tid, + 'na': na, 'nd': nd, 'ns': ns, + 'pa': pa, 'pd': pd_, + 'pat': pat, 'pdt': pdt, + 'ra': ra, 'rd': rd, + 'rat': rat, 'rdt': rdt, + 'agree_metric': agree_metric, 'disagree_metric': disagree_metric, + 'repful': repful, + } + + +class TestD10PassesByTest: + """`passes-by-test?` (repness.clj:165) — OR'd on (rat, pat) and (rdt, pdt). + + NO probability threshold (`pa >= 0.5` was a Python-only over-restriction + in the pre-D10 botched port — Clojure has no such gate).""" + + def test_agree_side_significant_passes(self): + row = _stats_row(1, na=8, nd=2, pa=0.75, pd_=0.25, pat=2.0, pdt=-2.0, + ra=2.0, rd=0.5, rat=2.0, rdt=-2.0) # rat,pat > Z_90 + assert passes_by_test(row) + + def test_disagree_side_significant_passes(self): + row = _stats_row(2, na=2, nd=8, pa=0.25, pd_=0.75, pat=-2.0, pdt=2.0, + ra=0.5, rd=2.0, rat=-2.0, rdt=2.0) # rdt,pdt > Z_90 + assert passes_by_test(row) + + def test_neither_side_significant_fails(self): + row = _stats_row(3, na=5, nd=5, pa=0.5, pd_=0.5, pat=0.5, pdt=0.5, + ra=1.0, rd=1.0, rat=0.5, rdt=0.5) + assert not passes_by_test(row) + + def test_no_pa_threshold_gate(self): + """Pre-D10 Python added `pa >= 0.5` — Clojure has no such gate. A row + with pa=0.4 that's otherwise significant on the agree side must pass.""" + row = _stats_row(4, na=4, nd=6, pa=0.42, pd_=0.58, pat=2.0, pdt=-2.0, + ra=2.0, rd=0.5, rat=2.0, rdt=-2.0) + assert passes_by_test(row), "no pa>=0.5 gate (Clojure parity)" + + +class TestD10BeatsBestByTest: + """`beats-best-by-test?` (repness.clj:133) — max(rat, rdt) > current_best_z.""" + + def test_none_best_always_beats(self): + row = _stats_row(1, na=5, nd=2, pa=0.6, pd_=0.4, pat=1.0, pdt=-1.0, + ra=1.2, rd=0.8, rat=2.0, rdt=0.5) + assert beats_best_by_test(row, None) + + def test_max_rat_rdt_used(self): + row = _stats_row(1, na=5, nd=2, pa=0.6, pd_=0.4, pat=1.0, pdt=-1.0, + ra=1.2, rd=0.8, rat=2.0, rdt=0.5) + # max = 2.0 + assert beats_best_by_test(row, 1.5) + assert not beats_best_by_test(row, 2.5) + + def test_strict_greater_than(self): + row = _stats_row(1, na=5, nd=2, pa=0.6, pd_=0.4, pat=1.0, pdt=-1.0, + ra=1.2, rd=0.8, rat=2.0, rdt=0.5) + assert not beats_best_by_test(row, 2.0), "strict > (Clojure parity)" + + +class TestD10BeatsBestAgr: + """`beats-best-agr?` (repness.clj:142) — 4-branch agree priority logic.""" + + def test_na_nd_zero_always_rejected(self): + """Branch 1: (= 0 na nd) → false. Unvoted comments excluded from best-agree + regardless of stats.""" + unvoted = _stats_row(1, na=0, nd=0, pa=0.5, pd_=0.5, pat=1.0, pdt=1.0, + ra=1.0, rd=1.0, rat=1.0, rdt=1.0) + assert not beats_best_agr(unvoted, None) + other = _stats_row(2, na=5, nd=2, pa=0.6, pd_=0.4, pat=1.0, pdt=-1.0, + ra=1.2, rd=0.8, rat=2.0, rdt=0.5) + assert not beats_best_agr(unvoted, other) + + def test_branch_2_ra_gt_1_uses_4way_product(self): + """Branch 2: current_best AND current_best.ra > 1.0 → compare ra*rat*pa*pat.""" + big_ra_best = _stats_row(1, na=10, nd=0, pa=0.9, pd_=0.1, pat=2.0, pdt=-2.0, + ra=2.0, rd=0.5, rat=2.0, rdt=-2.0) + # ra*rat*pa*pat = 2.0*2.0*0.9*2.0 = 7.2 + bigger = _stats_row(2, na=15, nd=0, pa=0.94, pd_=0.06, pat=3.0, pdt=-3.0, + ra=2.5, rd=0.4, rat=2.5, rdt=-2.5) + # 2.5*2.5*0.94*3.0 = 17.625 > 7.2 + smaller = _stats_row(3, na=5, nd=0, pa=0.86, pd_=0.14, pat=1.5, pdt=-1.5, + ra=1.5, rd=0.6, rat=1.5, rdt=-1.5) + # 1.5*1.5*0.86*1.5 ≈ 2.9 < 7.2 + assert beats_best_agr(bigger, big_ra_best) + assert not beats_best_agr(smaller, big_ra_best) + + def test_branch_3_ra_le_1_uses_pa_pat_product(self): + """Branch 3: current_best AND current_best.ra <= 1.0 → compare pa*pat only.""" + weak_best = _stats_row(1, na=5, nd=4, pa=0.55, pd_=0.45, pat=1.0, pdt=-1.0, + ra=0.9, rd=1.1, rat=1.0, rdt=-1.0) + # pa*pat = 0.55 + bigger = _stats_row(2, na=6, nd=2, pa=0.7, pd_=0.3, pat=1.2, pdt=-1.2, + ra=1.0, rd=1.0, rat=0.5, rdt=-0.5) + # pa*pat = 0.84 > 0.55 + assert beats_best_agr(bigger, weak_best) + + def test_branch_4_no_best_accepts_via_z_sig_pat(self): + """Branch 4 / no current_best: accept if z90(pat) is true.""" + row = _stats_row(1, na=6, nd=4, pa=0.58, pd_=0.42, pat=1.5, pdt=-1.5, + ra=0.9, rd=1.1, rat=1.0, rdt=-1.0) # pat=1.5 > Z_90=1.2816 + assert beats_best_agr(row, None) + + def test_branch_4_no_best_accepts_via_ra_gt_1_and_pa_gt_half(self): + """Branch 4 / no current_best: accept if ra > 1.0 AND pa > 0.5 + (even when pat not significant).""" + row = _stats_row(1, na=5, nd=4, pa=0.55, pd_=0.45, pat=0.5, pdt=-0.5, + ra=1.2, rd=0.8, rat=0.5, rdt=-0.5) + assert beats_best_agr(row, None) + + def test_branch_4_no_best_rejects_when_neither(self): + """Branch 4 / no current_best: reject if neither z90(pat) nor + (ra > 1.0 AND pa > 0.5).""" + row = _stats_row(1, na=4, nd=5, pa=0.45, pd_=0.55, pat=0.5, pdt=0.5, + ra=0.8, rd=1.2, rat=0.5, rdt=0.5) + assert not beats_best_agr(row, None) + + +class TestD10SelectRepCommentsBoundary: + """`select_rep_comments_df` Clojure-parity boundaries.""" + + def test_empty_input_returns_empty(self): + result = _assemble_rep_comments(pd.DataFrame()) + assert len(result) == 0 + + def test_single_unvoted_row_falls_through_to_best(self): + """`beats_best_by_test` does NOT filter na=nd=0; only `beats_best_agr` + Branch 1 does. So a sole na=nd=0 row still ends up in the `:best` + fallback (Clojure parity — repness.clj:244-247). The `:best_agree` + slot stays empty (Branch 1 rejects). Output is [best], not []. + """ + rows = [ + _stats_row(1, na=0, nd=0, pa=0.5, pd_=0.5, pat=0.0, pdt=0.0, + ra=1.0, rd=1.0, rat=0.0, rdt=0.0), + ] + result = _assemble_rep_comments(pd.DataFrame(rows)) + assert len(result) == 1 + assert result[0]['comment_id'] == 1 + # NOT the best-agree slot (Branch 1 rejected na=nd=0). + assert 'best_agree' not in result[0] + + def test_sufficient_empty_best_agree_only(self): + """Sufficient empty + best_agree exists → returns [best_agree_finalized].""" + # passes_by_test fails (pat=pdt below z90, rat=rdt below z90). + # beats_best_agr triggers via Branch 4: z90(pat) is true (pat=1.5). + rows = [ + _stats_row(1, na=6, nd=4, pa=0.58, pd_=0.42, pat=1.5, pdt=-1.5, + ra=0.9, rd=1.1, rat=1.0, rdt=-1.0), + # Filler row to make this not trivially the only one — also fails + # passes_by_test and beats_best_agr. + _stats_row(2, na=3, nd=5, pa=0.4, pd_=0.6, pat=-0.5, pdt=0.5, + ra=0.8, rd=1.2, rat=-0.3, rdt=0.3), + ] + result = _assemble_rep_comments(pd.DataFrame(rows)) + assert len(result) == 1 + # New select_rep_comments_df returns (rep_df, best_agree_dict); the + # `_assemble_rep_comments` wrapper returns the flat List[Dict] + # (decision S2). + row = result[0] + assert row['comment_id'] == 1 + # best_agree flag emitted in Python-convention key naming (decision S1 / Q2). + assert row.get('best_agree') is True, "best_agree slot should be flagged" + assert row.get('n_agree') == 6, "n_agree should be na from the raw best-agree row" + + def test_take_5_cap_agrees_before_disagrees(self): + """7 sufficient candidates (4 agree-passing, 3 disagree-passing). + Sort by metric desc → take 5 → agrees-before-disagrees.""" + rows = [ + # 4 agree-passing, large to small agree_metric + _stats_row(1, na=9, nd=1, pa=0.83, pd_=0.17, pat=2.5, pdt=-2.5, + ra=2.0, rd=0.5, rat=2.5, rdt=-2.5), # agree_metric ~10.4 + _stats_row(2, na=8, nd=2, pa=0.75, pd_=0.25, pat=2.0, pdt=-2.0, + ra=1.8, rd=0.55, rat=2.0, rdt=-2.0), # ~5.4 + _stats_row(3, na=7, nd=3, pa=0.67, pd_=0.33, pat=1.5, pdt=-1.5, + ra=1.5, rd=0.6, rat=1.5, rdt=-1.5), # ~2.27 + _stats_row(4, na=6, nd=4, pa=0.58, pd_=0.42, pat=1.3, pdt=-1.3, + ra=1.3, rd=0.7, rat=1.3, rdt=-1.3), # ~1.27 + # 3 disagree-passing, large to small disagree_metric + _stats_row(5, na=1, nd=9, pa=0.17, pd_=0.83, pat=-2.5, pdt=2.5, + ra=0.5, rd=2.0, rat=-2.5, rdt=2.5), # ~10.4 + _stats_row(6, na=2, nd=8, pa=0.25, pd_=0.75, pat=-2.0, pdt=2.0, + ra=0.55, rd=1.8, rat=-2.0, rdt=2.0), # ~5.4 + _stats_row(7, na=3, nd=7, pa=0.33, pd_=0.67, pat=-1.5, pdt=1.5, + ra=0.6, rd=1.5, rat=-1.5, rdt=1.5), # ~2.27 + ] + result = _assemble_rep_comments(pd.DataFrame(rows)) + assert len(result) == 5 + # Agrees-before-disagrees: all agrees precede all disagrees in the output. + repful_values = [r['repful'] for r in result] # List[Dict] per S2 + last_agree_idx = -1 + first_disagree_idx = len(repful_values) + for i, v in enumerate(repful_values): + if v == 'agree': + last_agree_idx = i + elif v == 'disagree' and first_disagree_idx == len(repful_values): + first_disagree_idx = i + assert last_agree_idx < first_disagree_idx, \ + f"agrees must come before disagrees, got order: {repful_values}" + + def test_take_5_eviction_when_best_agree_outside_sufficient(self): + """The eviction edge case (flagged in PLAN for future review). + + Sufficient has 5 entries, best_agree is OUTSIDE sufficient (failed + passes_by_test). Prepending best_agree pushes total to 6, take(5) drops + the lowest-metric sufficient entry. + + Fixture design (subtle): + - tid 1: best_agree slot. Fails passes_by_test (rat=1.0, pat=1.0 + both below Z_90=1.2816). Branch 4 accepts via ra>1.0 AND pa>0.5. + Its agree_metric (ra*rat*pa*pat = 0.9) is LARGER than every + sufficient row's metric, so subsequent rows can't beat it via + Branch 2. + - tid 2-6: pass passes_by_test (rat,pat at 1.3 > Z_90), with + DECREASING agree_metrics all SMALLER than 0.9, so Branch 2 keeps + tid 1 as best_agree throughout. + + Expected: tid 1 prepended, sort gives [tid 5, 4, 3, 2, 6] (desc by + agree_metric), take(5) drops tid 6 (smallest metric). + + See PLAN.md "Pending — needs team discussion": take-5 eviction. + """ + rows = [ + # best_agree slot: fails passes_by_test, qualifies via Branch 4 + # (ra=1.5>1 AND pa=0.6>0.5). agree_metric = 1.5*1.0*0.6*1.0 = 0.9. + _stats_row(1, na=6, nd=4, pa=0.6, pd_=0.4, pat=1.0, pdt=-1.0, + ra=1.5, rd=0.7, rat=1.0, rdt=-1.0), + # 5 sufficient rows, each with agree_metric < 0.9. + # ra=1.0, rat=1.3, pa=0.5, pat=1.3 → agree_metric = 0.845 + _stats_row(2, na=5, nd=5, pa=0.5, pd_=0.5, pat=1.3, pdt=-1.3, + ra=1.0, rd=1.0, rat=1.3, rdt=-1.3), + # ra=0.9, rat=1.3, pa=0.4, pat=1.3 → agree_metric = 0.609 + _stats_row(3, na=4, nd=6, pa=0.4, pd_=0.6, pat=1.3, pdt=-1.3, + ra=0.9, rd=1.1, rat=1.3, rdt=-1.3), + # ra=0.8, rat=1.3, pa=0.3, pat=1.3 → agree_metric = 0.406 + _stats_row(4, na=3, nd=7, pa=0.3, pd_=0.7, pat=1.3, pdt=-1.3, + ra=0.8, rd=1.2, rat=1.3, rdt=-1.3), + # ra=0.7, rat=1.3, pa=0.2, pat=1.3 → agree_metric = 0.237 + _stats_row(5, na=2, nd=8, pa=0.2, pd_=0.8, pat=1.3, pdt=-1.3, + ra=0.7, rd=1.3, rat=1.3, rdt=-1.3), + # ra=0.6, rat=1.3, pa=0.15, pat=1.3 → agree_metric = 0.152 (smallest, evicted) + _stats_row(6, na=1, nd=9, pa=0.15, pd_=0.85, pat=1.3, pdt=-1.3, + ra=0.6, rd=1.4, rat=1.3, rdt=-1.3), + ] + result = _assemble_rep_comments(pd.DataFrame(rows)) + assert len(result) == 5 + tids = [r['comment_id'] for r in result] + # best_agree (tid 1) prepended at position 0. + assert tids[0] == 1, f"best-agree slot at position 0, got {tids[0]}" + # Tid 6 (smallest sufficient metric) evicted. + assert 6 not in tids, f"lowest-metric sufficient should be evicted, got {tids}" + # Rest are tids 2-5 in some agree-first ordering. + assert set(tids[1:]) == {2, 3, 4, 5}, f"expected tids 2-5 to remain, got {tids[1:]}" + # best_agree flag on position 0. + assert result[0].get('best_agree') is True + assert result[0].get('n_agree') == 6 # raw na from tid 1 + + +class TestD10TestGaps: + """Additional D10 coverage filling gaps identified in decisions D10.8. + + These pin behaviours not previously asserted: + - mod_out filtering on the best-agree path. + - Deterministic tiebreak (lowest tid wins) on `beats_best_by_test` + max(rat,rdt) ties and on the `_sort_key` agree_metric ties. + - Disagree-only path through assembly + agrees-before-disagrees no-op. + - All-uninformative `ns=0` rows: passes_by_test fails, Branch 1 rejects + best_agree; best may still get set via Branch 4 / beats_best_by_test. + - Negative-ra rows handled correctly by Branch 2 (signed 4-way product). + """ + + # --- Deliverable 3: deterministic max(rat, rdt) tiebreak ------------------ + + def test_tied_max_rt_uses_deterministic_tiebreak(self): + """Two rows with identical max(rat, rdt) — strict `>` means the FIRST + iterated row wins. Sorting by `comment` (tid) ascending makes that + the LOWER tid (Clojure named-matrix insertion order parity, decision + D10.8.1).""" + rows = [ + # Pass through `best` slot (neither passes passes_by_test — + # rat/rdt below Z_90), tied max(rat, rdt) = 1.0. + # Insert in REVERSE tid order to prove we sort, not just take input order. + _stats_row(7, na=4, nd=2, pa=0.55, pd_=0.45, pat=0.5, pdt=-0.5, + ra=1.0, rd=1.0, rat=1.0, rdt=-1.0), + _stats_row(3, na=4, nd=2, pa=0.55, pd_=0.45, pat=0.5, pdt=-0.5, + ra=1.0, rd=1.0, rat=1.0, rdt=-1.0), + ] + result = _assemble_rep_comments(pd.DataFrame(rows)) + assert len(result) == 1 + # Tid 3 (lower) wins the `best` slot under the tid-ascending tiebreak. + assert result[0]['comment_id'] == 3, \ + f"lowest-tid wins tied max(rat,rdt); got {result[0]['comment_id']}" + + # --- Deliverable 5: 5 gap tests ------------------------------------------- + + def test_mod_out_excludes_best_agree_candidate(self): + """A `mod_out` tid that would otherwise own the best-agree slot is + filtered before the reduce (Clojure repness.clj:222). The next-best + candidate becomes best_agree.""" + rows = [ + # tid 1: would-be best_agree (ra=2.0>1, pa=0.8>0.5 → Branch 4 accepts; + # strong ra*rat*pa*pat = 2.0*2.0*0.8*2.0 = 6.4 so it dominates Branch 2). + _stats_row(1, na=8, nd=2, pa=0.8, pd_=0.2, pat=2.0, pdt=-2.0, + ra=2.0, rd=0.5, rat=2.0, rdt=-2.0), + # tid 2: next-best (ra*rat*pa*pat = 1.5*1.5*0.7*1.5 ≈ 2.36). + _stats_row(2, na=6, nd=3, pa=0.7, pd_=0.3, pat=1.5, pdt=-1.5, + ra=1.5, rd=0.6, rat=1.5, rdt=-1.5), + # tid 3: weaker. + _stats_row(3, na=5, nd=4, pa=0.55, pd_=0.45, pat=1.0, pdt=-1.0, + ra=1.1, rd=0.9, rat=1.0, rdt=-1.0), + ] + df = pd.DataFrame(rows) + # Without mod_out: tid 1 wins best_agree. + baseline = _assemble_rep_comments(df) + baseline_best_agree = next(r for r in baseline if r.get('best_agree')) + assert baseline_best_agree['comment_id'] == 1 + # With tid 1 moderated out: tid 2 must win best_agree, tid 1 absent. + result = _assemble_rep_comments(df, mod_out=[1]) + tids = [r['comment_id'] for r in result] + assert 1 not in tids, f"mod_out tid 1 must be excluded, got {tids}" + flagged = [r for r in result if r.get('best_agree')] + assert len(flagged) == 1, "exactly one best_agree slot" + assert flagged[0]['comment_id'] == 2, \ + f"next-best (tid 2) should become best_agree, got {flagged[0]['comment_id']}" + + def test_tied_agree_metric_in_sort_uses_deterministic_tiebreak(self): + """Two `sufficient` rows with identical `agree_metric` resolve + deterministically. `list.sort` is stable in CPython, so the lower-tid + row (which entered `sufficient` first thanks to the tid-ascending + iter sort) appears first after descending sort by metric. + + Decision D10.8.1: lowest tid wins ties.""" + # Both rows pass passes_by_test (rat,pat at 2.0 > Z_90). + # Identical agree_metric: ra*rat*pa*pat is the SAME for both. + # ra=1.5, rat=2.0, pa=0.7, pat=2.0 → agree_metric = 4.2 (both). + # Insert in REVERSE tid order to prove the deterministic outcome + # comes from the sort, not the input order. + rows = [ + _stats_row(9, na=7, nd=3, pa=0.7, pd_=0.3, pat=2.0, pdt=-2.0, + ra=1.5, rd=0.6, rat=2.0, rdt=-2.0), + _stats_row(2, na=7, nd=3, pa=0.7, pd_=0.3, pat=2.0, pdt=-2.0, + ra=1.5, rd=0.6, rat=2.0, rdt=-2.0), + ] + result = _assemble_rep_comments(pd.DataFrame(rows)) + # 2 sufficient rows; one of them is also best_agree. + # Order: best_agree (tid 2, lowest tid wins beats_best_agr ties via + # strict-> first-row-wins) prepended, then deduped sufficient (tid 9). + tids = [r['comment_id'] for r in result] + assert tids[0] == 2, \ + f"lowest-tid wins tied beats_best_agr Branch 2 product; got {tids}" + assert result[0].get('best_agree') is True + + def test_disagree_only_group(self): + """All sufficient rows are `repful='disagree'`. Sort works on + `disagree_metric`; agrees-before-disagrees partition is a no-op.""" + rows = [ + # 3 disagree-passing rows (rdt,pdt > Z_90), descending disagree_metric. + _stats_row(1, na=1, nd=9, pa=0.17, pd_=0.83, pat=-2.5, pdt=2.5, + ra=0.5, rd=2.0, rat=-2.5, rdt=2.5), # disagree_metric ≈ 8.6 + _stats_row(2, na=2, nd=8, pa=0.25, pd_=0.75, pat=-2.0, pdt=2.0, + ra=0.55, rd=1.8, rat=-2.0, rdt=2.0), # ≈ 5.4 + _stats_row(3, na=3, nd=7, pa=0.33, pd_=0.67, pat=-1.5, pdt=1.5, + ra=0.6, rd=1.5, rat=-1.5, rdt=1.5), # ≈ 2.27 + ] + result = _assemble_rep_comments(pd.DataFrame(rows)) + # All rows have repful='disagree' (rdt > rat for each). + assert len(result) >= 1 + assert all(r['repful'] == 'disagree' for r in result), \ + f"all rows should be disagree, got {[r['repful'] for r in result]}" + assert len(result) <= 5, "take-5 cap holds" + # best_agree may also be present (Branch 4 doesn't require agree side + # to dominate — z90(pat) is false here, ra<1 for all, so Branch 4 + # rejects all candidates and best_agree stays None for all entries). + # No row should be flagged as best_agree given the fixture. + assert not any(r.get('best_agree') for r in result), \ + "no row qualifies for best_agree under Branch 4 with ra<1 and pat None=True on first row, best gets set, + so output is exactly [best].""" + # Build via _stats_row but override pat/pdt to match the n=0 collapse: + # prop_test_vectorized(0, 0) = 2*sqrt(1)*(1/1 - 0.5) = 1.0 < Z_90. + rows = [ + _stats_row(1, na=0, nd=0, pa=0.5, pd_=0.5, pat=1.0, pdt=1.0, + ra=1.0, rd=1.0, rat=0.5, rdt=0.5, ns=0), + _stats_row(2, na=0, nd=0, pa=0.5, pd_=0.5, pat=1.0, pdt=1.0, + ra=1.0, rd=1.0, rat=0.3, rdt=0.4, ns=0), + ] + result = _assemble_rep_comments(pd.DataFrame(rows)) + # passes_by_test fails (pat=1.0 1.0) compares the SIGNED 4-way product + `ra * rat * pa * pat`. A candidate with negative `ra` and negative + `rat` produces a positive product that can beat the current best, + while a candidate with single negative factor produces a negative + product that cannot.""" + # Set up so iteration order: tid 1 (current_best), tid 2 (negative + # single factor, should NOT beat), tid 3 (two negatives → positive, + # should beat ONLY if its product is larger). + current_best_row = _stats_row( + 1, na=8, nd=2, pa=0.8, pd_=0.2, pat=2.0, pdt=-2.0, + ra=2.0, rd=0.5, rat=2.0, rdt=-2.0) + # current_best product = 2.0*2.0*0.8*2.0 = 6.4. + + # Single negative factor → negative product → loses on strict >. + single_neg = _stats_row( + 2, na=1, nd=1, pa=0.5, pd_=0.5, pat=1.0, pdt=1.0, + ra=-0.5, rd=1.0, rat=1.0, rdt=1.0) + # product = -0.5*1.0*0.5*1.0 = -0.25 < 6.4 → does NOT beat. + assert not beats_best_agr(single_neg, current_best_row), \ + "single negative factor → negative product loses Branch 2" + + # Two negatives → positive product. Make it LARGER than 6.4. + # ra=-5.0, rat=-2.0, pa=0.9, pat=2.0 → -5 * -2 * 0.9 * 2 = 18.0 > 6.4. + two_neg = _stats_row( + 3, na=5, nd=5, pa=0.9, pd_=0.1, pat=2.0, pdt=-2.0, + ra=-5.0, rd=0.2, rat=-2.0, rdt=-2.0) + assert beats_best_agr(two_neg, current_best_row), \ + "two negative factors → positive 18.0 > 6.4 wins Branch 2" + + # Two negatives but product NOT larger → loses. + two_neg_small = _stats_row( + 4, na=1, nd=1, pa=0.5, pd_=0.5, pat=1.0, pdt=1.0, + ra=-1.0, rd=1.0, rat=-1.0, rdt=1.0) + # product = -1 * -1 * 0.5 * 1 = 0.5 < 6.4 → does NOT beat. + assert not beats_best_agr(two_neg_small, current_best_row), \ + "two negatives but small positive product (0.5) still loses to 6.4" + + # ============================================================================ # D11 — Consensus Comment Selection # ============================================================================