Summary
Add an always-on, per-field coherence read-outcome diagnostic to the
unequal-rank multi-source reader, so that when a bounds cylinder appears to
report infrequently we can tell why — a coherence problem (reads being
rejected) vs. a slow upstream sender (no new data) — and quantify how often
multi-source reads actually straddle a publish.
Follow-on to the flexible (unequal) rank assignments work and its per-field
coherence policy (see doc/designs/flexible_rank_assignments.md §Coherence;
strict-coherence decision for BEST_XHAT/RECENT_XHATS in #741, DUALS in
the merged phase-3a work).
Motivation
Under unequal ranks, a cylinder assembles each per-scenario field from several
peer ranks via the overlap map. The per-field coherence policy decides what
happens when those sources disagree on write_id:
- strict (DUALS, BEST_XHAT, RECENT_XHATS) → the read is rejected and retried;
- relaxed (NONANTS_VALS, XFEAS, ...) → the read is accepted but blended across iterations.
A bounds cylinder computes its bound from a field it reads (the Lagrangian
spoke reads DUALS; FWPH reads BEST_XHAT/RECENT_XHATS). If those reads keep
getting rejected, the cylinder rarely gets fresh input and so reports a new
bound infrequently. Today that symptom is indistinguishable from "the upstream
sender is just slow." This diagnostic separates the two.
It also empirically answers the open design question behind the strict-coherence
choice: how often does a multi-source read actually straddle a publish? If a
strict field's rejection rate is negligible, strict coherence is effectively
free; if it climbs (e.g. under an asynchronous APH sender), we learn it from
these counters rather than from a production run.
What to measure
Per field, per reader cylinder (counters live on each SPCommunicator,
so attribution to a specific spoke+field is automatic), accumulate over
multi-source reads (reads with >= 2 sources; single-source reads can't miss):
- total: multi-source reads
- not_new: coherent, but write_id did not advance (sender hasn't published)
- new_accepted: advanced + coherent → used
- rejected_incoherent: sources disagreed → strict reject (strict fields only)
- accepted_mixed: sources disagreed but accepted (relaxed fields only)
Derived: coherence miss rate = (rejected_incoherent + accepted_mixed) / total.
Diagnosis: rejected_incoherent dominating ⇒ coherence; not_new dominating ⇒ slow sender.
(Optionally also distinguish, for strict reads, a rejection caused by this
rank's sources disagreeing vs. the cross-reader _write_ids_agree collective
rejecting — the former is the fundamental coherence miss.)
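
A minimal sketch of the per-field tally and the derived quantities, to make the bookkeeping concrete. The class name ReadOutcomeCounters and its methods are hypothetical, not existing mpi-sppy code, and "dominating" is simplified here to a direct comparison of rejected_incoherent against not_new.

```python
from dataclasses import dataclass


@dataclass
class ReadOutcomeCounters:
    """Hypothetical per-field tally of multi-source read outcomes."""
    total: int = 0
    not_new: int = 0
    new_accepted: int = 0
    rejected_incoherent: int = 0
    accepted_mixed: int = 0

    def miss_rate(self) -> float:
        """(rejected_incoherent + accepted_mixed) / total, or 0.0 with no reads."""
        if self.total == 0:
            return 0.0
        return (self.rejected_incoherent + self.accepted_mixed) / self.total

    def diagnosis(self) -> str:
        """Assumed reading of 'dominating': whichever of the two counts is larger."""
        if self.total == 0:
            return "no multi-source reads"
        if self.rejected_incoherent > self.not_new:
            return "coherence"
        if self.not_new > self.rejected_incoherent:
            return "slow sender"
        return "inconclusive"


# Illustrative numbers only, not measurements.
duals = ReadOutcomeCounters(total=100, not_new=10, new_accepted=20,
                            rejected_incoherent=70)
print(duals.miss_rate(), duals.diagnosis())
```

The four outcome counters partition total, so a finalize-time sanity check could assert that they sum to it.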
Where to hook
The single choke point is reduce_source_write_ids(source_ids, strict) and its
call site in SPCommunicator._flex_get_multi_source — it already has
source_ids and computes agreement. Keep reduce_source_write_ids pure; do the
counting at the call site (where self holds the counters). Cost is ~two integer
increments per multi-source read, so counting is always-on.
This instruments the consumer's input reads, not the bound scalars: the
bound fields (OBJECTIVE_INNER_BOUND / OBJECTIVE_OUTER_BOUND) are single-source
and never hit the coherence path.
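
The call-site counting described above might look roughly like the following. Everything here is a stand-in: reduce_ids_sketch only imitates the role of reduce_source_write_ids (its real return value is not specified in this issue), ReaderSketch is not SPCommunicator, and taking the maximum write_id for a mixed relaxed read is an assumption made purely for illustration.

```python
from collections import defaultdict


def reduce_ids_sketch(source_ids, strict):
    """Pure stand-in for the reduction: returns (write_id or None, coherent).

    Assumed behavior: agreeing sources yield their common id; disagreeing
    sources yield None under strict, or the largest id under relaxed.
    """
    coherent = len(set(source_ids)) == 1
    if coherent:
        return source_ids[0], True
    if strict:
        return None, False
    return max(source_ids), False


class ReaderSketch:
    """Hypothetical reader that owns the counters, like self at the call site."""

    def __init__(self):
        # field name -> outcome name -> count
        self.coherence_counts = defaultdict(lambda: defaultdict(int))
        self._last_id = {}

    def read(self, field, source_ids, strict):
        write_id, coherent = reduce_ids_sketch(source_ids, strict)
        if len(source_ids) < 2:
            return write_id  # single-source reads cannot miss; not counted
        counts = self.coherence_counts[field]
        counts["total"] += 1  # first increment
        last = self._last_id.get(field)
        if not coherent:
            outcome = "rejected_incoherent" if strict else "accepted_mixed"
        elif last is not None and write_id <= last:
            outcome = "not_new"
        else:
            outcome = "new_accepted"
        counts[outcome] += 1  # second increment
        if write_id is not None:
            self._last_id[field] = write_id
        return write_id


reader = ReaderSketch()
reader.read("DUALS", [3, 3], strict=True)   # coherent and new
reader.read("DUALS", [3, 3], strict=True)   # coherent, not advanced
reader.read("DUALS", [3, 4], strict=True)   # straddles a publish: rejected
reader.read("XFEAS", [1, 2], strict=False)  # straddles a publish: accepted
```

Keeping the reduction free of side effects means it can be unit-tested without a communicator, while the two increments stay next to the state they update.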
Reporting
- Accumulate always (negligible overhead), so a misbehaving run can be
inspected without having pre-armed it.
- Finalize summary: concise per-field breakdown, rank-0-gated, printed only
for fields that did multi-source reads. Aggregate across the reader cylinder's
ranks with one MPI reduction at report time (not per read).
- Opt-in periodic line (flag, e.g. every N iterations) for live debugging.
- Expose the counters as an attribute for programmatic/test access.
- Inert at equal ranks (no multi-source reads) — a pure flex-path diagnostic.
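
A sketch of the finalize-time aggregation, under stated assumptions: counters are packed into one flat vector with a fixed layout so that a single sum reduction covers every field at once. The helper names (pack, unpack, summary_lines) and the output format are hypothetical. The cross-rank sum is imitated with plain Python over two made-up rank-local dictionaries so the example runs without MPI; in a real run that one line would be the single MPI sum reduction over the reader cylinder's communicator.

```python
COUNTER_NAMES = ("total", "not_new", "new_accepted",
                 "rejected_incoherent", "accepted_mixed")


def pack(counts_by_field, fields):
    """Flatten per-field counters into one list with a fixed layout."""
    return [counts_by_field.get(f, {}).get(n, 0)
            for f in fields for n in COUNTER_NAMES]


def unpack(flat, fields):
    """Inverse of pack: rebuild {field: {counter: value}}."""
    k = len(COUNTER_NAMES)
    return {f: dict(zip(COUNTER_NAMES, flat[i * k:(i + 1) * k]))
            for i, f in enumerate(fields)}


def summary_lines(summed):
    """One line per field, skipping fields with no multi-source reads."""
    lines = []
    for field, c in summed.items():
        if c["total"] == 0:
            continue
        miss = (c["rejected_incoherent"] + c["accepted_mixed"]) / c["total"]
        lines.append(
            f"{field}: total={c['total']} not_new={c['not_new']} "
            f"new_accepted={c['new_accepted']} "
            f"rejected_incoherent={c['rejected_incoherent']} "
            f"accepted_mixed={c['accepted_mixed']} miss_rate={miss:.1%}"
        )
    return lines


# Made-up rank-local tallies for two ranks of one reader cylinder.
fields = ("DUALS", "XFEAS")
rank0 = {"DUALS": {"total": 4, "not_new": 1, "new_accepted": 1,
                   "rejected_incoherent": 2}}
rank1 = {"DUALS": {"total": 6, "not_new": 2, "new_accepted": 3,
                   "rejected_incoherent": 1}}
per_rank = [pack(rank0, fields), pack(rank1, fields)]

# Stand-in for the one sum reduction at report time.
summed_flat = [sum(column) for column in zip(*per_rank)]
summed = unpack(summed_flat, fields)
lines = summary_lines(summed)
for line in lines:
    print(line)  # in a real run, only rank 0 would print
```

Every rank must pack the same field order, including fields it never read, so that the reduced vector lines up element by element.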
Suggested placement
A small follow-up branch on top of the flex stack, paired with the APH
verification phase (currently DLWoodruff#17 / phase 5): the async
sender is the one place misses actually occur, so its integration test is the
natural test bed, and the counter doubles as that phase's verification.
Related
- doc/designs/flexible_rank_assignments.md §Coherence
- --ph-xfeas-spoke-rank-ratio: #730, "Flexible ranks: add CG-hub XFEAS integration test + --ph-xfeas-spoke-rank-ratio (blocked on #729)" (closed by #741, "flex-ranks phase 4a: two-stage xhat fields (BEST_XHAT/RECENT_XHATS/XFEAS)")