Skip to content

[Review only — do not merge] RankSolrPropagator hardening (already on develop)#299

Open
mdorf wants to merge 4 commits into
rank-propagator-hardening-basefrom
rank-propagator-hardening
Open

[Review only — do not merge] RankSolrPropagator hardening (already on develop)#299
mdorf wants to merge 4 commits into
rank-propagator-hardening-basefrom
rank-propagator-hardening

Conversation

@mdorf

@mdorf mdorf commented Jun 23, 2026

Copy link
Copy Markdown
Member

Purpose — review only, do not merge

This is a retrospective review PR. These four commits were pushed straight to develop during live staging iteration on the rank propagator (issue ncbo/ncbo_cron#132), after the original PR #298 had already merged. They are already on develop; this PR exists only to give the team a single, reviewable diff of the hardening work and a place to comment. The base branch (rank-propagator-hardening-base) is pinned at the commit just before the series, so the diff is exactly the propagator and its test. Do not merge — merging changes nothing on develop and only churns throwaway branches. Close after review; the two rank-propagator-hardening* branches can then be deleted.

What these commits do

All four touch only lib/ontologies_linked_data/services/rank_solr_propagator.rb and its test:

  • 6bfd79d8 — progress logging (per-ontology counts + every-50k progress, flushed), commitWithin, and skip-unchanged (per-acronym last-propagated rank tracked in Redis so steady-state weekly runs skip stable ontologies).
  • c3a52dfd — fix SolrCloud "distributed update stalled" 500s: stop committing mid-stream (commitWithin paused replicas and backed up the forwarding queue); send batches with commit: false and commit once between ontologies; add retry-with-backoff so a transient stall is survived; batch size tunable via RANK_SOLR_BATCH_SIZE.
  • c36ef753 — make backpressure observable: retries log at WARN with a greppable BACKPRESSURE marker (ERROR when exhausted), and the run summary reports Solr retries: N (0 = the retry path never ran).
  • e559fe9c — broaden retry coverage to the connection/timeout families (RSolr::Error::ConnectionRefused, Errno::ECONNREFUSED/ECONNRESET/ETIMEDOUT, Net::*Timeout, SocketError); a momentary ConnectionRefused had slipped past the original RSolr::Error::Http-only list and aborted a run.

Context

mdorf added 4 commits June 22, 2026 20:40
…anged

Large ontologies previously produced no output until fully done and
piled up uncommitted Solr updates, so the job looked hung and slowed
down over time. This adds:

- per-ontology and per-batch progress logging (flushed), with an
  up-front doc count, so activity is visible during big ontologies
- commitWithin (60s) instead of commit:false, bounding Solr's
  transaction log and making partial runs durable; final hard commit
  is retained
- batch size 1000 -> 5000 to cut round-trips
- skip-unchanged: ontologies whose rank is unchanged since the last
  propagation (tracked per acronym in Redis) are skipped; force: true
  re-propagates everything (e.g. after a collection rebuild)

Refs ncbo/ncbo_cron#132
The weekly run stalled on staging with HTTP 500 'distributed update
stalled' errors. Cause: commitWithin issued soft commits *during* the
update stream, pausing replicas while the leader's forwarding queue
backed up. Fixes:

- drop commitWithin; send batches with commit:false and issue one
  commit *between* ontologies, when no updates are in flight
- retry transient Solr errors (stalls, timeouts) with exponential
  backoff instead of aborting the run; combined with the skip cache an
  unrecoverable failure resumes where it left off
- batch size default 2500, overridable via RANK_SOLR_BATCH_SIZE for
  staging tuning without a deploy

Refs ncbo/ncbo_cron#132
… summary

Retries were logged at INFO, buried among progress lines. Now each
retry logs at WARN with a greppable BACKPRESSURE marker (ERROR when
retries are exhausted), and the final summary reports 'Solr retries: N'
— 0 means the retry/backoff path never ran (clean environment), and a
non-zero count is flagged at WARN with a hint to lower
RANK_SOLR_BATCH_SIZE. Adds tests that force a transient stall (asserting
recovery, the warning, and the count) and that a clean run reports 0.

Refs ncbo/ncbo_cron#132
…rPropagator

A momentary ConnectionRefused (Solr stayed up; transient connection
hiccup) aborted a live run because with_retry only caught
RSolr::Error::Http and a few Errno types. In rsolr's hierarchy
ConnectionRefused < Errno::ECONNREFUSED, a separate family from Http,
so it was never retried. Broaden the rescue list to cover the
connection/timeout families (ConnectionRefused, ECONNREFUSED,
ETIMEDOUT, Net::*Timeout, SocketError, ...) so transient blips are
waited out and surface as BACKPRESSURE/retry-count instead of killing
the run. Test now raises a real RSolr::Error::ConnectionRefused.

Refs ncbo/ncbo_cron#132
@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 86.30137% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.06%. Comparing base (e1349d0) to head (e559fe9).

Files with missing lines Patch % Lines
...ogies_linked_data/services/rank_solr_propagator.rb 86.30% 10 Missing ⚠️
Additional details and impacted files
@@                        Coverage Diff                         @@
##           rank-propagator-hardening-base     #299      +/-   ##
==================================================================
+ Coverage                           81.03%   81.06%   +0.02%     
==================================================================
  Files                                 101      101              
  Lines                                6840     6902      +62     
==================================================================
+ Hits                                 5543     5595      +52     
- Misses                               1297     1307      +10     
Flag Coverage Δ
ag 80.86% <86.30%> (-0.09%) ⬇️
fs 80.91% <86.30%> (-0.10%) ⬇️
gd 80.87% <86.30%> (+0.02%) ⬆️
unittests 81.06% <86.30%> (+0.02%) ⬆️
vo 80.99% <86.30%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant