[Review only — do not merge] RankSolrPropagator hardening (already on develop)#299
Open
mdorf wants to merge 4 commits into
…anged

Large ontologies previously produced no output until fully done and piled up uncommitted Solr updates, so the job looked hung and slowed down over time. This adds:
- per-ontology and per-batch progress logging (flushed), with an up-front doc count, so activity is visible during big ontologies
- commitWithin (60s) instead of commit:false, bounding Solr's transaction log and making partial runs durable; final hard commit is retained
- batch size 1000 -> 5000 to cut round-trips
- skip-unchanged: ontologies whose rank is unchanged since the last propagation (tracked per acronym in Redis) are skipped; force: true re-propagates everything (e.g. after a collection rebuild)

Refs ncbo/ncbo_cron#132
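The skip-unchanged check described in this commit could look roughly like the sketch below. It is an illustration only: `RankSkipCache` and `ontologies_to_propagate` are hypothetical names, and a plain Hash stands in for the Redis hash so the example runs without a server. The real key layout and method names in the propagator may differ.

```ruby
# Hypothetical sketch: a Hash stands in for the Redis hash that maps
# acronym => last-propagated rank. None of these names come from the PR.
class RankSkipCache
  def initialize(store = {})
    @store = store
  end

  # True when the rank matches what was last propagated for this acronym.
  # Ranks are compared as strings because Redis stores values as strings.
  def unchanged?(acronym, rank)
    @store[acronym] == rank.to_s
  end

  # Record the rank; intended to be called only after the ontology's
  # documents were sent successfully.
  def record(acronym, rank)
    @store[acronym] = rank.to_s
  end
end

# Returns the acronyms that still need propagation.
# force: true ignores the cache (e.g. after a collection rebuild).
def ontologies_to_propagate(ranks, cache, force: false)
  ranks.reject { |acronym, rank| !force && cache.unchanged?(acronym, rank) }.keys
end

cache = RankSkipCache.new
cache.record("SNOMEDCT", 0.95)
ranks = { "SNOMEDCT" => 0.95, "GO" => 0.8 }

ontologies_to_propagate(ranks, cache)              # => ["GO"]
ontologies_to_propagate(ranks, cache, force: true) # => ["SNOMEDCT", "GO"]
```

Recording the rank only after a successful send is what would let an interrupted run pick up where it stopped: ontologies that never finished are simply not in the cache.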
The weekly run stalled on staging with HTTP 500 'distributed update stalled' errors. Cause: commitWithin issued soft commits *during* the update stream, pausing replicas while the leader's forwarding queue backed up. Fixes:
- drop commitWithin; send batches with commit:false and issue one commit *between* ontologies, when no updates are in flight
- retry transient Solr errors (stalls, timeouts) with exponential backoff instead of aborting the run; combined with the skip cache an unrecoverable failure resumes where it left off
- batch size default 2500, overridable via RANK_SOLR_BATCH_SIZE for staging tuning without a deploy

Refs ncbo/ncbo_cron#132
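A minimal sketch of the three fixes above, under stated assumptions: `TransientSolrError` is a hypothetical stand-in for whatever the propagator treats as transient, the `solr` object is a duck-typed client exposing `add(batch, commit: false)` and `commit` (not rsolr's exact signatures), and the retry limit and base delay are illustrative values, not the ones in the PR.

```ruby
# Hypothetical stand-in for the transient errors Solr can raise.
class TransientSolrError < StandardError; end

# Batch size from the environment, falling back to the default named
# in the commit message (2500).
def batch_size(env = ENV)
  Integer(env.fetch("RANK_SOLR_BATCH_SIZE", "2500"))
end

# Retries the block on transient errors, doubling the delay each attempt.
# max_retries and base_delay are assumed values. The sleeper is injectable
# so the backoff can be exercised without actually sleeping.
def with_retry(max_retries: 4, base_delay: 1.0, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    yield
  rescue TransientSolrError
    attempt += 1
    raise if attempt > max_retries
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Sends each ontology's docs in uncommitted batches, then commits once
# after the ontology's last batch, when no updates are in flight.
def propagate(docs_by_ontology, solr, size: batch_size)
  docs_by_ontology.each do |_acronym, docs|
    docs.each_slice(size) do |batch|
      with_retry { solr.add(batch, commit: false) }
    end
    with_retry { solr.commit }
  end
end
```

Reading the batch size from the environment on each run is what makes it tunable on staging without a deploy, for example `RANK_SOLR_BATCH_SIZE=1000`.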
… summary

Retries were logged at INFO, buried among progress lines. Now each retry logs at WARN with a greppable BACKPRESSURE marker (ERROR when retries are exhausted), and the final summary reports 'Solr retries: N'; 0 means the retry/backoff path never ran (clean environment), and a non-zero count is flagged at WARN with a hint to lower RANK_SOLR_BATCH_SIZE. Adds tests that force a transient stall (asserting recovery, the warning, and the count) and that a clean run reports 0.

Refs ncbo/ncbo_cron#132
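The logging behaviour described here might be sketched as follows, using Ruby's standard Logger. `RetryTracker` and `TransientStall` are hypothetical names, the exact message wording is assumed, and the backoff sleep is left out to keep the focus on observability.

```ruby
require "logger"
require "stringio"

# Hypothetical stand-in for a transient Solr stall.
class TransientStall < StandardError; end

# Counts retries and logs them with a greppable BACKPRESSURE marker.
class RetryTracker
  attr_reader :retries

  def initialize(logger, max_retries: 3)
    @logger = logger
    @max_retries = max_retries
    @retries = 0
  end

  # Runs the block, retrying on TransientStall up to max_retries times.
  def run
    attempt = 0
    begin
      yield
    rescue TransientStall => e
      attempt += 1
      if attempt > @max_retries
        @logger.error("BACKPRESSURE retries exhausted after #{@max_retries}: #{e.message}")
        raise
      end
      @retries += 1
      @logger.warn("BACKPRESSURE retry #{attempt}/#{@max_retries}: #{e.message}")
      retry
    end
  end

  # Final summary: INFO on a clean run, WARN with a tuning hint otherwise.
  def log_summary
    if @retries.zero?
      @logger.info("Solr retries: 0")
    else
      @logger.warn("Solr retries: #{@retries} (consider lowering RANK_SOLR_BATCH_SIZE)")
    end
  end
end

# Usage: force one stall, then recover.
io = StringIO.new
tracker = RetryTracker.new(Logger.new(io))
calls = 0
tracker.run do
  calls += 1
  raise TransientStall, "distributed update stalled" if calls == 1
end
tracker.log_summary
```

With a fixed marker in every retry line, `grep BACKPRESSURE` on the job log is enough to tell whether a run hit the retry path at all.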
…rPropagator

A momentary ConnectionRefused (Solr stayed up; transient connection hiccup) aborted a live run because with_retry only caught RSolr::Error::Http and a few Errno types. In rsolr's hierarchy ConnectionRefused < Errno::ECONNREFUSED, a separate family from Http, so it was never retried. Broaden the rescue list to cover the connection/timeout families (ConnectionRefused, ECONNREFUSED, ETIMEDOUT, Net::*Timeout, SocketError, ...) so transient blips are waited out and surface as BACKPRESSURE/retry-count instead of killing the run. Test now raises a real RSolr::Error::ConnectionRefused.

Refs ncbo/ncbo_cron#132
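The class-hierarchy point can be reproduced without the rsolr gem. In the sketch below, `FakeRSolr` is a hypothetical stand-in that mirrors only the two parent relationships named in the commit message (Http as its own family, ConnectionRefused under Errno::ECONNREFUSED). The two lists are illustrative, and for simplicity the narrow list contains only the Http stand-in; the propagator's actual lists may contain more or different classes.

```ruby
require "net/protocol" # Net::OpenTimeout, Net::ReadTimeout
require "socket"       # SocketError

# Hypothetical stand-in mirroring the hierarchy described above;
# this is not the rsolr gem itself.
module FakeRSolr
  module Error
    class Http < RuntimeError; end
    class ConnectionRefused < ::Errno::ECONNREFUSED; end
  end
end

# Simplified narrow list: only the Http family.
NARROW_RETRYABLE = [FakeRSolr::Error::Http].freeze

# Broadened list covering the connection/timeout families as well.
BROAD_RETRYABLE = [
  FakeRSolr::Error::Http,
  FakeRSolr::Error::ConnectionRefused,
  Errno::ECONNREFUSED, Errno::ECONNRESET, Errno::ETIMEDOUT,
  Net::OpenTimeout, Net::ReadTimeout,
  SocketError
].freeze

# Returns :ok if the block succeeds and :retried if it raises something
# in the list; any other error propagates to the caller.
def attempt_with(list)
  yield
  :ok
rescue *list
  :retried
end

# Caught by the broad list, which includes the Errno family.
attempt_with(BROAD_RETRYABLE) { raise FakeRSolr::Error::ConnectionRefused } # => :retried

# With the narrow list the same error escapes, which would abort a run.
begin
  attempt_with(NARROW_RETRYABLE) { raise FakeRSolr::Error::ConnectionRefused }
rescue Errno::ECONNREFUSED
  # reached: the Http-only list did not match
end
```

Because `rescue` matches by ancestry, listing Errno::ECONNREFUSED alone would already cover the ConnectionRefused subclass; naming both keeps the intent explicit.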
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## rank-propagator-hardening-base #299 +/- ##
==================================================================
+ Coverage 81.03% 81.06% +0.02%
==================================================================
Files 101 101
Lines 6840 6902 +62
==================================================================
+ Hits 5543 5595 +52
- Misses 1297 1307 +10
Purpose — review only, do not merge
This is a retrospective review PR. These four commits were pushed straight to develop during live staging iteration on the rank propagator (issue ncbo/ncbo_cron#132), after the original PR #298 had already merged. They are already on develop; this PR exists only to give the team a single, reviewable diff of the hardening work and a place to comment. The base branch (rank-propagator-hardening-base) is pinned at the commit just before the series, so the diff is exactly the propagator and its test. Do not merge: merging changes nothing on develop and only churns throwaway branches. Close after review; the two rank-propagator-hardening* branches can then be deleted.
What these commits do
All four touch only lib/ontologies_linked_data/services/rank_solr_propagator.rb and its test:
- 6bfd79d8: progress logging (per-ontology counts + every-50k progress, flushed), commitWithin, and skip-unchanged (per-acronym last-propagated rank tracked in Redis so steady-state weekly runs skip stable ontologies).
- c3a52dfd: fix SolrCloud "distributed update stalled" 500s: stop committing mid-stream (commitWithin paused replicas and backed up the forwarding queue); send batches with commit: false and commit once between ontologies; add retry-with-backoff so a transient stall is survived; batch size tunable via RANK_SOLR_BATCH_SIZE.
- c36ef753: make backpressure observable: retries log at WARN with a greppable BACKPRESSURE marker (ERROR when exhausted), and the run summary reports Solr retries: N (0 = the retry path never ran).
- e559fe9c: broaden retry coverage to the connection/timeout families (RSolr::Error::ConnectionRefused, Errno::ECONNREFUSED/ECONNRESET/ETIMEDOUT, Net::*Timeout, SocketError); a momentary ConnectionRefused had slipped past the original RSolr::Error::Http-only list and aborted a run.
Context