fix(nle): resume from a deep-floor checkpoint no longer SIGSEGVs#4
Open
liujonathan24 wants to merge 15 commits into
Open
fix(nle): resume from a deep-floor checkpoint no longer SIGSEGVs#4liujonathan24 wants to merge 15 commits into
liujonathan24 wants to merge 15 commits into
Conversation
…tion Previously the reveal_map / fog_of_war knobs called nle_reveal_level() at the end of vision_recalc(), which wrote the real terrain of every cell into the hero's remembered map (gbuf) via magic_map_background()+newsym(). That mutation was permanent: turning the knob back off could not un-reveal already-remembered terrain. Reimplement reveal as a render-time overlay on the emitted observation only. NetHackRL::fill_obs now, when reveal_map>0 or fog_of_war==0, fills any cell that is still unknown (nul_glyph) in the obs from back_to_glyph(x,y), mapped to char/color/special exactly as rl_print_glyph does (mapglyph + CLR_BLACK fixup + shuffled_glyph). It writes only into the emitted obs arrays -- never gbuf, levl[][].seenv, or newsym -- so the effect is fully reversible and leaves game state untouched. The default path (reveal_map==0, fog_of_war==1) skips the loop entirely, so vanilla behavior and golden parity are byte-identical. Remove the now-unused nle_reveal_level() from vision_recalc(), detect.c, and its declaration in nle.h.
… blob save/load Guards: reject ftell<=0/empty file in save; NULL/empty blob (rc=4) in load.
…or_connectivity/room_size
…downstair/first accessible)
fill_obs's reveal_map overlay filled unseen walls as blank stone because back_to_glyph() returns S_stone for any wall with seenv==0. Temporarily poke seenv=SVALL for wall/SDOOR cells before back_to_glyph(), then restore it (fully reversible; levl[][] is unchanged after the call), so walls now render their proper face via wall_angle(). Also overlay every live monster on the level (fmon list, skipping DEADMONSTER) using the normal mon_to_glyph()/display-RNG path, so reveal_map shows monsters anywhere on the floor, refreshed each step.
reveal_map fully subsumes the obsolete fog_of_war knob (the fill_obs overlay is now guarded solely by reveal_map>0). Drop the X(fog_of_war) entry from the tune catalog and update the stale vision.c comment. The catalog count drops by one.
Adds nle_seat_on_stair(nle, down): with down!=0 moves the hero onto the down staircase (xdnstair/ydnstair), else onto the up staircase, via u_on_newpos(). Returns nonzero if the requested stair is absent. Two-phase like goto_depth (caller steps once to re-render); sets context.botl. Lets the Map Viewer goto_depth(n) then seat the hero on the '>'.
Adds nle_level_up(nle, n): raises the hero n experience levels (capped at 30) through pluslvl(FALSE), so each level grants the normal HP / spell-power / intrinsic gains, then bumps u.uexp to the new level threshold so the next newexplevel() won't undo it (mirrors the xp_level setter). pluslvl emits messages; clearing iflags.window_inited around the calls routes them through raw_print to avoid a coroutine yield from this bare entry point. Caller steps once to refresh blstats.
…ond radius, even in lit rooms)
…isc (flashlight; no see-through/memory beyond)
…rsist as memory (keep only the vision_recalc sight-limit)
…ization (pairs with save_level)
…e SIGSEGV)
Resuming a checkpoint taken on a deep dungeon level crashed the process with a
SIGSEGV in jump_fcontext. nle_load_level stamps the saved level blob over the
CURRENT ledger slot, which after a resume()'s reset is level 1; getlev()'s
sanity check (lev && dlvl != lev) then saw e.g. 'level 5, not 1' and called
trickery() -> pline('Strange, this map is not as I remember it.') -> done(TRICKED).
pline routes through the rl window port, which yields the game coroutine
(jump_fcontext) from nle_load_level's MAIN context -> jump to a dead fcontext ->
SIGSEGV (the exact hazard the surrounding comment warns about).
A standalone load DELIBERATELY installs an arbitrary level into the current slot,
so the level-number check is wrong here; pass pid=0/lev=0 to disable both the
level and PID checks (pid/lev are used nowhere else in getlev). pid=0 also makes
cross-process resume safe (a checkpoint from another server run has a different
hackpid).
4 tasks
Restoring an in-memory snapshot taken at a nested window prompt (e.g. the live Monte-Carlo checkpoint demo, or a continual-harness rollout) could SIGSEGV in rl_clear_nhwindow or abort with a double-free in rl_yn_function. Root cause: the rl window port tracks its win-proc call stack in a libc-backed deque (nle_ctx_t->s_win_proc_calls), pushed/popped by a ScopedStack per window call. A nested prompt (yn_function -> nhgetch) yields with the deque two deep. nle_fr_restore swaps the coroutine stack back to the snapshot, but the deque lives outside the snapshot and keeps its shallower live depth. The restored stack's still-pending ScopedStack destructors then pop_back() past empty — UB that walks std::deque's _M_finish into garbage, so the next push_back writes to a near-null slot (_M_cur=0x1f0 in the captured core) and crashes. Fix: - ScopedStack::~ScopedStack guards pop-on-empty (the deque is a pure diagnostic call stack, never read, so skipping an unmatched pop is harmless). - nle_fr_restore clears the deque (new nle_rl_winproc_reset shim) so it cannot drift/grow across repeated restores. Verified: a nested-prompt snapshot + repeated restore + ctrl-R repro aborts on the old engine within ~80 cycles and survives 100+ cycles after the fix; the harness regression test test_snapshot_restore_nested_prompt_winproc covers it.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Resuming a saved game from a deep-floor checkpoint crashed
libnethack.sowith a SIGSEGV.Root cause.
nle_load_levelcalledgetlevwith the originalhackpid/level, sogetlev's trickery sanity check((pid && pid != hpid) || (lev && dlvl != lev))fired and emitted apline("trickery..."). That window call runs from the main context and yields into a now-dead fcontext (jump_fcontext→ garbage), segfaulting.Fix. Call
getlev(fd, 0, 0, FALSE)so the trickery check is skipped on programmatic level loads (pid0and lev0both disable the respective comparison). The level blob being loaded is already the one we wrote, so the sanity check guards against nothing here.Verified by resuming a
goto_depth=5checkpoint repeatedly — no crash. A regression test lives in the harness (environments/nethack/tests/test_checkpoint.py::test_checkpoint_resume_deep_floor).