Fix five hybrid-sim per-interval sample-grid bugs#36
Open
PhotonicVelocity wants to merge 6 commits into
Open
Conversation
_anim_frame reschedules via `t + _dt`, so the scheduled time accumulates FP error and drifts below the k*dt grid by more than the interval-overlap dedup tol (1e-15) around tick 93 at fps=30. The dedup in _interval_hybrid then stops recognizing result.t[0] as a duplicate and every subsequent interval double-logs its start sample. Snap the next event to k*_dt by recomputing the tick index from current t: `(round(t / _dt) + 1) * _dt`. No accumulation, no drift.
The interval-overlap dedup at the top of _interval_hybrid unconditionally skips result.t[0] when it matches t0, which on the first call drops the t=0 IC sample for every continuous-state diagram — tlist[0] ends up at 1/fps, watch logs miss the IC, and the live window / movie writer start from the first integration step. Mirror the t=0 pre-pass the discrete-only path already does: evaluate the diagram at (t=0, y=x0) and call _record_sample_and_service_hooks so the IC lands in tlist, xlist, watch logs, and sinks. Also call display_manager.refresh() once so flush_events() and any movie writer's grab_frame() pick up the IC frame (the refresh hook only normally fires from _anim_frame, scheduled at interactive_dt, not 0).
Two coupled bugs fire on every interval whose t_eval grid doesn't end at t1 (every non-dt-aligned boundary): 1. `_interval_hybrid` returns `treached = result.t[-1]` — the last OUTPUT sample, not the integrator's reach. On natural completion the solver always reached t1, but result.t[-1] is the last t_eval point before t1. The main loop sees treached < t1, bumps t0 by event_tol, and re-enters integration on a tiny residual. The residual call builds an empty t_eval grid, falls through to a t_eval-less solve_ivp call, and dumps every natural integrator step into result.t — ~33 samples per boundary at max_step=1e-3, 100+ on non-trivial graphs with adaptive RK45. Use t1 for treached when result.status == 0; fall back to result.t[-1] only for event-terminated runs (status == 1). 2. With treached == t1 the spurious residual call is gone, but the sample loop only logs t_eval points — non-dt-aligned boundaries no longer land in tlist at all. Artists don't update at the boundary moment and watch logs miss the boundary times. Extend t_eval to include t1 when it isn't already on the dt grid.
GraphicsBlock.step called writer.grab_frame() on every sink tick, so the MP4 ended up with one frame per integration time point — at fps=30 with a non-trivial integrator that's ~200 frames per 5 s of sim, a 6.9 s video at 30 fps writer rate. Move the grab to DisplayManager.refresh(), which already runs at interactive_dt cadence from _anim_frame (the same hook that calls flush_events() for live windows). _start_movie attaches the writer to fig._bdsim_movie_writer; a new _grab_movie_frame() helper in display.py picks it up per figure during refresh. Also widen the _anim_frame reschedule guard from `next_t < tf - tol` to `next_t <= tf + tol` so a frame landing exactly on tf still fires.
…ie cadence - HybridSimSampleGridTest (test_run_sim.py) pins the four per-interval sample-grid behaviors: t=0 IC, no FP drift duplicates, anim_frame boundaries in tlist, no residual sample explosion. - New unit tests for `_grab_movie_frame` (test_display_manager.py): no-op when no writer, grab when attached, swallow AttributeError. - Drop `test_step_movie_no_writer_attribute_error` (pins removed per-tick `grab_frame()` path) and the grab_frame assertion in `test_step_timestamp_overlay_updates`.
Returning None from _build_t_eval_grid (when no k*dt multiple lands in [t0, t1]) left ivp_args.t_eval unset, so solve_ivp fell back to natural per-step output — the same residual-sample explosion the previous fix closed for natural completion, but reachable through event-terminated re-entries, cross-domain timing (clock period not a multiple of dt), tf near a scheduled boundary, and near-simultaneous events. Change the contract: always return a non-empty grid containing the endpoints t0 and t1 plus any k*dt multiples in between. The "extend t_eval to include t1" block at the call site collapses into the builder. Caller already guarantees dt > 0 (validated at run start) and t0 < t1 (while-loop guard + event_tol skip).
Not up to standards ⛔
|
Author
|
@petercorke If you'd prefer these be raised in issues rather than directly in PR I can document them there. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Branch:
fix/hybrid-sim-loopThis PR fixes five bugs in the per-interval simulation loop that compromise
out.t,out.x, watch logs, and any graphics output (live window or MP4 writer) for diagrams with continuous state. Each fix is a self-contained commit with its own context below.Four of the bugs were uncovered while debugging a downstream MP4 workflow where a 5 s simulation rendered as a 6.9 s video at 30 fps. The fifth was uncovered during this PR's review of the residual-interval path — same observable symptom as fix 3, reached via a different upstream trigger.
All five fixes ship with regression tests that pin the externally observable behavior on
out.t(see Verification).Summary of fixes
9548365out.tcontains duplicate adjacent timestamps andlen(out.t)exceedsT*fps + 1for any hybrid run where1/fpsis an integer multiple ofdt. The inflation begins ~3 s into the run atfps=30; effective sink-tick rate doubles past that point (e.g. 207 samples over 150 intervals forT=5)._anim_framerescheduled itself viat + _dt; floating-point error intaccumulated untilresult.t[0]drifted out ofisclose(atol=1e-15)witht0in the per-interval dedup, breaking the dedup and double-logging samples.1caebd1out.t[0]is1/fps, not0— the t=0 initial-condition sample is missing fromout.t,out.x, watch logs, the live animation window, and any movie writer output. Affects every diagram with continuous states (bd.nstates > 0)._interval_hybridwas correct for intervals 2+ but unconditionally skippedresult.t[0]even on the very first interval, when there was no previous interval to have logged the sample. The discrete-only path at the top ofBDSim.runalready had a pre-pass for this; the hybrid path didn't.24f8e1cdt-aligned scheduled boundaries (e.g.dt=0.1, fps=3), each such boundary triggered a residualsolve_ivpcall that emitted every natural integrator step intoresult.t— 100s of sink-ticks between two clean dt-grid samples. Visible as a dense per-max_stepcluster of artist updates before each boundary._interval_hybridreturnedtreached = result.t[-1](the last OUTPUT sample), not the integrator's actual reach — so natural completion looked like "early termination" whenevert_evaldidn't includet1; (b) the residual call built an emptyt_evalgrid, fell through to at_eval-lesssolve_ivpcall, and emitted every internal step.3b0ada1GraphicsBlock.stepcalledwriter.grab_frame()unconditionally on every sink tick. The live window worked becauseflush_events()(which actually presents the canvas to screen) only fires at the fps cadence from_anim_frame → display_manager.refresh(); the movie writer had no equivalent presenting hook.9b03bcbout.tcontains a densemax_step-spaced cluster inside any interval too small to contain ak*dtmultiple — event-terminated residuals, mismatched(dt, fps, clock_period)combinations,tfnear a scheduled boundary, or near-simultaneous events. Samet_eval=None → solve_ivp natural-step dumppath fix #3 closed for one trigger; reachable through four others (detailed below)._build_t_eval_gridreturnsNonefor empty intervals; the call site treatsNoneas "don't sett_eval", andsolve_ivpdumps every internal integrator step intoresult.t.Per-commit detail
9548365— fix(run_sim): snap_anim_frametok*dtgrid to prevent driftIssue. Two grids of "tick times" coexist during a hybrid sim and are supposed to coincide whenever
1/fpsis an integer multiple ofdt(so every anim_frame tick lands on ak*dtgrid point —dt = 1/fpsis just the N=1 case):t_evalgrid, built per interval by_build_t_eval_gridasdt * np.arange(k0, k1 + 1)— locked to world time (k * dt), rebuilt fresh each interval, one IEEE multiplication per point._anim_framegrid, built incrementally — the eventq callable reschedules itself viat + _dteach time it fires, so tickk's scheduled time is the result ofk-1sequential additions starting from the previous tick's fire time.For the first few ticks the two agree to the bit:
Sequential addition accumulates rounding error. The interval-overlap dedup at the top of
_interval_hybridcomparesresult.t[0](sourced from thek*dtgrid) againstt0(the animframe tick time) withnp.isclose(rtol=0.0, atol=1e-15). Once the anim_frame value drifts _belowk*dtby more than 1e-15 — which it does aroundt ≈ 3.1forfps=30—np.clipno longer pullsresult.t[0]down ontot0, the dedup check fails, and the interval-start sample gets double-logged.Every subsequent interval drifts slightly further from the
k*dtgrid, causing integration points to be doubled.Demo. Trivial integrator at
fps=30,dt=1/30, run out toT=3.5— past the ~t=3.1drift threshold. Without the fix,len(out.t)overshoots the expected 105 andout.tshows a doubled boundary entry for each interval after the threshold at 3.1s.Fix. Snap the next event time to the canonical
k * _dtgrid by recomputing the tick index from the currentt(round(t / _dt) + 1). No accumulation, no drift, and the dedup keeps working as designed. Demo script shows exactly 105 samples, cleanly spaced onk*dt.Test.
tests/test_run_sim.py::HybridSimSampleGridTest::test_no_drift_doubled_samples.1caebd1— fix(run_sim): capture t=0 IC sample in hybrid-diagram runsIssue. For diagrams with continuous states, the very first sample at t=0 is never logged. The interval-overlap dedup at the top of
_interval_hybridunconditionally skipsresult.t[0]when it matchest0, andt0 = 0on the first call. The discrete-only path at the top ofBDSim.runhas a pre-pass that handles this; the hybrid path doesn't. Net effect:simstate.tlist[0]ist = 1/fps, not 0t = 1/fpsstatet = 1/fpsDemo. Same trivial integrator rig — first sample should be at
t=0but isn't.Fix. Mirror the pre-existing t=0 evaluation that the discrete-only path already does so hybrid diagrams get the same initial-condition coverage: evaluate at
t = 0, y = x0(no integration done — we're reading the diagram's outputs from the IC), then call_record_sample_and_service_hooks(0, x0)so the IC lands intlist,xlist, watch logs, and sinks.After that, also call
display_manager.refresh()once so the IC state gets out of the canvas buffer:flush_events()presents it to the live window, andgrab_frame()writes it into any movie writer. (Without this, the canvas buffer is current at t=0 from the pre-pass'scanvas.draw(), but the refresh hook only normally fires from_anim_frame, scheduled atinteractive_dtnot 0.)With this fix, the interval-dedup invariant — "every interval's
result.t[0]was already logged by the previous step" — becomes genuinely true for all intervals, including the first.Test.
tests/test_run_sim.py::HybridSimSampleGridTest::test_out_t_starts_at_zero.24f8e1c— fix(run_sim): treat solve_ivp natural completion as reaching t1Demo.
dt=0.1, fps=3makes the anim_frame boundaries (1/3, 2/3) miss the dt grid.max_step=1e-3forces the residualsolve_ivpcall to emit ~33 internal steps per boundary. Expected: 13 samples (11 on the dt grid + 2 boundaries at 1/3, 2/3). Today: 112, with a dense per-max_stepcluster right before each anim_frame tick.Issue. Two coupled problems fire on every interval whose
t_evalgrid doesn't end att1— i.e. every anim_frame boundary that isn't on thek*dtgrid.Bug 1: undershoot of
treached._interval_hybridreturnstreached = result.t[-1]— the last OUTPUT sample time, not the integrator's actual reach. On natural completionsolve_ivpintegrates all the way tot1, butresult.t[-1]is the lastt_evalpoint beforet1. The main loop seestreached < t1, takes the "integration ended early" branch, bumpst0byevent_tol, and re-enters integration on a tiny residual interval[t0_bumped, t1]. The residual call builds an emptyt_evalgrid (nodtmultiples remain), falls through to at_eval-lesssolve_ivpcall, and emits every natural integrator step intoresult.t. Withmax_step=1e-3that is ~33 samples per boundary; with adaptive RK45 and a non-trivial graph it can be 100-200+ samples per boundary.Bug 2:
anim_frameboundaries missing fromtlist. Once Bug 1 is fixed (sotreached == t1), the spurious residual call is gone, but the sample loop only logst_evalpoints — non-dt-aligned anim_frame boundaries no longer land intlistat all. Downstream, artists don't update at the boundary moment (the MP4 grab at the refresh hook captures stale state from the most recent dt-grid sample), and watch logs miss the boundary times.Fix.
t1fortreachedwhenresult.status == 0(natural completion); fall back toresult.t[-1]only for event-terminated runs (status == 1). This kills the spurious residual call.t_evalto includet1when it isn't already on thedtgrid. The integrator's output naturally lands on every scheduled boundary, sinks fire there, artists update, and the refresh-hook grab captures the right state.Tests.
tests/test_run_sim.py::HybridSimSampleGridTest::test_no_residual_sample_explosion.tests/test_run_sim.py::HybridSimSampleGridTest::test_anim_frame_boundary_in_tlist.3b0ada1— fix(graphics): grab MP4 frames at fps cadence, not per sink tickDemo. Tiny ANIMATION block writing an MP4 at
fps=3withdt=0.1The MP4 should hold 4 frames (one per fps refresh + the IC at t=0). Today it holds 13 — one per sink tick. Requiresffmpeg(which bundlesffprobe).Issue.
GraphicsBlock.stepcallswriter.grab_frame()on every sink tick, so the MP4 ends up with one frame per integration time point instead of one per fps-paced display refresh. The live window works correctly because matplotlib has a buffer/present split:canvas.draw()rasterises into the figure's buffer (cheap, fires per sink-tick),flush_events()presents the buffer to screen (the visible part — fires only fromMatplotlibDisplayManager.refresh(), which runs at fps cadence from_anim_frame). The MP4 writer had no equivalent gating — everygrab_frame()writes a frame.Fix. The live-window refresh path is already fps-paced:
DisplayManager.refresh()runs from_anim_frameatinteractive_dtcadence, iterates figures, and callsflush_events()to present each canvas buffer. Moving the movie grab onto the same hook gives it the same cadence for free. Concretely:_start_movie, attach the writer tofig._bdsim_movie_writerso the display manager can find it without walkingbd.blocklist._grab_movie_frame(fig)helper indisplay.py, called fromrefresh()andrefresh_figure()of bothMatplotlibDisplayManagerandNotebookDisplayManager(right alongsideflush_events()).writer.grab_frame()call inGraphicsBlock.step. The per-tickcanvas.draw()(rasterise into the buffer) stays._anim_frame's reschedule guard widens fromnext_t < tf - toltonext_t <= tf + tolso a frame landing exactly ontfstill fires. ForT=1, fps=3, frames now land att = 0, 1/3, 2/3, 1(4 frames) instead of stopping att = 0, 1/3, 2/3(3 frames).Tests.
tests/test_display_manager.py::test_grab_movie_frame_no_writer_is_noop,::test_grab_movie_frame_with_writer_grabs_once,::test_grab_movie_frame_swallows_attribute_error. The MP4-frame-count behavior (the bug's externally observable signature) isn't directly tested because it requires ffmpeg as a runtime dependency; the Demo above stays as the manual verification.9b03bcb— fix(run_sim): always include endpoints in_build_t_eval_gridIssue.
_build_t_eval_gridreturnsNonewhenever nok*dtmultiple lands in[t0, t1]. The integrator call site treatsNoneas "don't sett_eval", andsolve_ivpfalls back to natural per-step output — the same residual-sample explosion fix 3 closed for the natural-completion path. Fix 3 only stopped one producer of the empty-grid case; the case itself is still reachable through:solve_ivpreturnsstatus == 1partway through an interval; the main loop bumpst0byevent_toland re-enters on[t0_bumped, t1]. If the event landed withindtoft1, nok*dtlies in the residual.(dt, fps, clock periods)that aren't integer multiples of one another produces tiny intervals between consecutive boundaries. e.g.dt=0.1, fps=3, clock_period=0.07: the gap between anim_frame att=1/3and the next clock tick att=0.35is0.017 < dt.tfclose to the last scheduled boundary.T=0.34, dt=0.1, fps=3leaves a final interval[1/3, 0.34]of width0.007 < dt(the Demo below).tandt + ε(forε > event_tol) leave a tiny gap between them.In any of those cases the same
out.texplosion fires.Demo. Same trivial integrator. With
dt=0.1, fps=3, anim_frame fires att=1/3and the next anim_frame at2/3would be pasttf, so the final interval is[1/3, tf]. The nextk*dtmultiple above1/3is0.4, so anytf ∈ (1/3, 0.4)produces an empty-grid final interval. Pickingtf=0.34andmax_step=1e-3gives an interval of width0.007that the integrator must subdivide into ~7 steps, all of which dump intoout.t.Expected: 6 samples (
0, 0.1, 0.2, 0.3, 1/3, tf). Today: 12, with a densemax_step-spaced cluster across the residual.Fix. Change the builder's contract: always return a non-empty grid containing at least the interval endpoints, plus any
k*dtmultiples in between. That subsumes fix 3 Bug 2 (the explicit "extend t_eval to include t1" block at the call site) and lets us delete theNonefallback entirely.Call site collapses to:
The fix 3 Bug 2 "extend t_eval to include t1" block at the call site is removed (now redundant).
Upstream validation that makes this safe:
dt > 0is enforced at run start (if dt is not None and float(dt) <= 0: raise ValueError).t0 < t1is enforced by the main loop'swhile t0 < tf - event_tol:plus theif t1 <= t0 + event_tol: continueguard, so we never call the builder with a degenerate interval.Behavior shifts to be aware of:
result.t[0]is now always exactlyt0. The interval-overlap dedup at the top of_interval_hybridalready handles this —np.isclose(result.t[0], t0, atol=1e-15)trips and skips the duplicate. The IC pre-pass from fix 2 covers the first-interval case.result.t[-1]is now always exactlyt1on natural completion. Thet_final = t1 if status == 0 else result.t[-1]branch from fix 3 becomes equivalent tot_final = result.t[-1]on that path — but the status check stays for the event-termination case (whereresult.t[-1]is the event time, nott1).Tests. Unit tests for
_build_t_eval_griddirectly (currently has zero coverage):t1 - t0 < dtand nok*dtin[t0, t1]→ grid is[t0, t1].t0 == k*dtandt1 == (k+n)*dt→ clean dt-aligned sequence with no duplicates.t0 = 0.123, t1 = 0.456, dt = 0.1→ grid is[0.123, 0.2, 0.3, 0.4, 0.456].Plus a behavioral test mirroring the Demo:
tests/test_run_sim.py::HybridSimSampleGridTest::test_no_empty_grid_sample_explosion.Verification
12b26c2and unrelated to this PR (all intests/test_display_manager.py::test_notebook_*, walked-back per commit and confirmed).tests/test_run_sim.py::HybridSimSampleGridTest— 5 behavioral tests, one per fix's externally observable signature onout.t. -tests/test_run_sim.py::BuildTEvalGridTest— 3 unit tests for the new_build_t_eval_gridcontract. -tests/test_display_manager.py::test_grab_movie_frame_*— 3 unit tests for the new helper.