Flexible ranks: per-field coherence read-outcome diagnostic (miss rate / why a bounds cylinder reports infrequently) #742

Description

@DLWoodruff

Summary

Add an always-on, per-field coherence read-outcome diagnostic to the
unequal-rank multi-source reader, so that when a bounds cylinder appears to
report infrequently we can tell why — a coherence problem (reads being
rejected) vs. a slow upstream sender (no new data) — and quantify how often
multi-source reads actually straddle a publish.

Follow-on to the flexible (unequal) rank assignments work and its per-field
coherence policy (see doc/designs/flexible_rank_assignments.md §Coherence;
strict-coherence decision for BEST_XHAT/RECENT_XHATS in #741, DUALS in
the merged phase-3a work).

Motivation

Under unequal ranks, a cylinder assembles each per-scenario field from several
peer ranks via the overlap map. The per-field coherence policy decides what
happens when those sources disagree on write_id:

  • strict (DUALS, BEST_XHAT, RECENT_XHATS) → the read is rejected and
    retried;
  • relaxed (NONANTS_VALS, XFEAS, ...) → the read is accepted but
    blended across iterations.

A bounds cylinder computes its bound from a field it reads (the Lagrangian
spoke reads DUALS; FWPH reads BEST_XHAT/RECENT_XHATS). If those reads keep
getting rejected, the cylinder rarely gets fresh input and so reports a new
bound infrequently. Today that symptom is indistinguishable from "the upstream
sender is just slow." This diagnostic separates the two.

It also empirically answers the open design question behind the strict-coherence
choice: how often does a multi-source read actually straddle a publish? If a
strict field's rejection rate is negligible, strict coherence is effectively
free; if it climbs (e.g. under an asynchronous APH sender), we learn it here
rather than in the field.

What to measure

Per field, per reader cylinder (counters live on each SPCommunicator,
so attribution to a specific spoke+field is automatic), accumulate over
multi-source reads (reads with >= 2 sources; single-source reads can't miss):

  • total — multi-source reads
  • not_new — coherent, but write_id did not advance (sender hasn't published)
  • new_accepted — advanced + coherent → used
  • rejected_incoherent — sources disagreed → strict reject (strict fields only)
  • accepted_mixed — sources disagreed but accepted (relaxed fields only)

Derived: coherence miss rate = (rejected_incoherent + accepted_mixed) / total.

Diagnosis: rejected_incoherent dominating ⇒ coherence; not_new
dominating ⇒ slow sender.

(Optionally also distinguish, for strict reads, a rejection caused by this
rank's sources disagreeing vs. the cross-reader _write_ids_agree collective
rejecting; the former is the fundamental coherence miss.)
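As a rough illustration of the counters and the derived quantities above, here is a minimal Python sketch. The class name `ReadOutcomeCounts` and its methods are hypothetical, not existing mpi-sppy names, and reading "dominating" as "the larger of rejected_incoherent and not_new" is an assumption made only for this example.

```python
from dataclasses import dataclass


@dataclass
class ReadOutcomeCounts:
    """Hypothetical per-field tally of multi-source read outcomes."""

    total: int = 0
    not_new: int = 0
    new_accepted: int = 0
    rejected_incoherent: int = 0
    accepted_mixed: int = 0

    def miss_rate(self) -> float:
        """(rejected_incoherent + accepted_mixed) / total; 0.0 if no reads."""
        if self.total == 0:
            return 0.0
        return (self.rejected_incoherent + self.accepted_mixed) / self.total

    def diagnosis(self) -> str:
        """Name the likelier cause of stale input (assumed rule: larger wins)."""
        if self.total == 0:
            return "no multi-source reads"
        if self.rejected_incoherent + self.not_new == 0:
            return "healthy"
        if self.rejected_incoherent > self.not_new:
            return "coherence"
        if self.not_new > self.rejected_incoherent:
            return "slow sender"
        return "inconclusive"


# Example: a strict field whose reads are mostly rejected.
duals = ReadOutcomeCounts(total=10, not_new=1, new_accepted=2,
                          rejected_incoherent=7)
print(duals.miss_rate(), duals.diagnosis())
```

A real implementation might prefer a threshold on the miss rate over a simple comparison; the issue leaves that choice open.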

Where to hook

The single choke point is reduce_source_write_ids(source_ids, strict) and its
call site in SPCommunicator._flex_get_multi_source — it already has
source_ids and computes agreement. Keep reduce_source_write_ids pure; do the
counting at the call site (where self holds the counters). Cost is ~two integer
increments per multi-source read, so counting is always-on.
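The split between a pure agreement check and counting at the call site could look roughly like the sketch below. Everything here is a stand-in: `classify_read` is not the real reduce_source_write_ids (whose return value is not specified in this issue), `ReaderSketch` is not SPCommunicator, and the `last_seen_id` argument is an assumed way of detecting "did not advance".

```python
from collections import Counter


def classify_read(source_ids, last_seen_id, strict):
    """Pure helper (hypothetical): name the outcome of one multi-source read."""
    if len(set(source_ids)) > 1:
        # Sources disagree on write_id: reject if strict, accept mixed if relaxed.
        return "rejected_incoherent" if strict else "accepted_mixed"
    if source_ids[0] <= last_seen_id:
        return "not_new"  # coherent, but the sender has not published again
    return "new_accepted"


class ReaderSketch:
    """Hypothetical stand-in for the object that owns the counters."""

    def __init__(self):
        self.read_outcomes = {}  # field name -> Counter of outcome names

    def record_read(self, field, source_ids, last_seen_id, strict):
        if len(source_ids) < 2:
            return None  # single-source reads can't miss, so they are not counted
        outcome = classify_read(source_ids, last_seen_id, strict)
        counts = self.read_outcomes.setdefault(field, Counter())
        counts["total"] += 1   # first increment
        counts[outcome] += 1   # second increment
        return outcome


reader = ReaderSketch()
reader.record_read("DUALS", [3, 3], last_seen_id=2, strict=True)
reader.record_read("DUALS", [3, 4], last_seen_id=3, strict=True)
print(dict(reader.read_outcomes["DUALS"]))
```

Keeping the classifier free of side effects means it can be unit-tested without a communicator, while the two increments stay next to the state they update.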

This instruments the consumer's input reads, not the bound scalars: the
bound fields (OBJECTIVE_INNER_BOUND / OBJECTIVE_OUTER_BOUND) are single-source
and never hit the coherence path.

Reporting

  • Accumulate always (negligible overhead), so a misbehaving run can be
    inspected without having pre-armed it.
  • Finalize summary: concise per-field breakdown, rank-0-gated, printed only
    for fields that did multi-source reads. Aggregate across the reader cylinder's
    ranks with one MPI reduction at report time (not per read).
  • Opt-in periodic line (flag, e.g. every N iterations) for live debugging.
  • Expose the counters as an attribute for programmatic/test access.
  • Inert at equal ranks (no multi-source reads) — a pure flex-path diagnostic.
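The finalize summary could be sketched as follows. To stay self-contained, the example emulates the report-time reduction by summing per-rank tallies in plain Python; in the actual code this would presumably be a single MPI SUM reduction over the reader cylinder's communicator, with printing gated on rank 0. The function names and the line format are assumptions for illustration only.

```python
OUTCOMES = ("total", "not_new", "new_accepted",
            "rejected_incoherent", "accepted_mixed")


def sum_over_ranks(per_rank):
    """Elementwise sum of per-rank tallies, as a SUM reduction would give."""
    summed = {}
    for rank_counts in per_rank:
        for field, counts in rank_counts.items():
            acc = summed.setdefault(field, dict.fromkeys(OUTCOMES, 0))
            for name in OUTCOMES:
                acc[name] += counts.get(name, 0)
    return summed


def summary_lines(summed):
    """One line per field, skipping fields that did no multi-source reads."""
    lines = []
    for field in sorted(summed):
        c = summed[field]
        if c["total"] == 0:
            continue
        miss = (c["rejected_incoherent"] + c["accepted_mixed"]) / c["total"]
        lines.append(
            f"{field}: total={c['total']} not_new={c['not_new']} "
            f"new_accepted={c['new_accepted']} "
            f"rejected_incoherent={c['rejected_incoherent']} "
            f"accepted_mixed={c['accepted_mixed']} miss_rate={miss:.1%}"
        )
    return lines


# Two ranks of one reader cylinder; XFEAS did no multi-source reads.
per_rank = [
    {"DUALS": {"total": 4, "new_accepted": 3, "rejected_incoherent": 1},
     "XFEAS": {"total": 0}},
    {"DUALS": {"total": 6, "not_new": 4, "rejected_incoherent": 2}},
]
for line in summary_lines(sum_over_ranks(per_rank)):
    print(line)
```

At equal ranks every tally stays at zero, so the summary prints nothing, which matches the "inert" requirement.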

Suggested placement

A small follow-up branch on top of the flex stack, paired with the APH
verification phase (currently DLWoodruff#17 / phase 5): the async
sender is the one place misses actually occur, so its integration test is the
natural test bed, and the counter doubles as that phase's verification.
