Skip to content

fix(nle): resume from a deep-floor checkpoint no longer SIGSEGVs#4

Open
liujonathan24 wants to merge 15 commits into
mainfrom
fix-resume-trickery-sigsegv
Open

fix(nle): resume from a deep-floor checkpoint no longer SIGSEGVs#4
liujonathan24 wants to merge 15 commits into
mainfrom
fix-resume-trickery-sigsegv

Conversation

@liujonathan24

Copy link
Copy Markdown
Owner

Resuming a saved game from a deep-floor checkpoint crashed libnethack.so with a SIGSEGV.

Root cause. nle_load_level called getlev with the original hackpid/level, so getlev's trickery sanity check ((pid && pid != hpid) || (lev && dlvl != lev)) fired and emitted a pline ("trickery..."). That window call runs from the main context and yields into a now-dead fcontext (jump_fcontext → garbage), segfaulting.

Fix. Call getlev(fd, 0, 0, FALSE) so the trickery check is skipped on programmatic level loads (pid 0 and lev 0 both disable the respective comparison). The level blob being loaded is already the one we wrote, so the sanity check guards against nothing here.

Verified by resuming a goto_depth=5 checkpoint repeatedly — no crash. A regression test lives in the harness (environments/nethack/tests/test_checkpoint.py::test_checkpoint_resume_deep_floor).

liujonathan24 and others added 14 commits June 11, 2026 03:29
…tion

Previously the reveal_map / fog_of_war knobs called nle_reveal_level() at the
end of vision_recalc(), which wrote the real terrain of every cell into the
hero's remembered map (gbuf) via magic_map_background()+newsym(). That mutation
was permanent: turning the knob back off could not un-reveal already-remembered
terrain.

Reimplement reveal as a render-time overlay on the emitted observation only.
NetHackRL::fill_obs now, when reveal_map>0 or fog_of_war==0, fills any cell that
is still unknown (nul_glyph) in the obs from back_to_glyph(x,y), mapped to
char/color/special exactly as rl_print_glyph does (mapglyph + CLR_BLACK fixup +
shuffled_glyph). It writes only into the emitted obs arrays -- never gbuf,
levl[][].seenv, or newsym -- so the effect is fully reversible and leaves game
state untouched. The default path (reveal_map==0, fog_of_war==1) skips the loop
entirely, so vanilla behavior and golden parity are byte-identical.

Remove the now-unused nle_reveal_level() from vision_recalc(), detect.c, and its
declaration in nle.h.
… blob save/load

Guards: reject ftell<=0/empty file in save; NULL/empty blob (rc=4) in load.
fill_obs's reveal_map overlay filled unseen walls as blank stone because
back_to_glyph() returns S_stone for any wall with seenv==0. Temporarily
poke seenv=SVALL for wall/SDOOR cells before back_to_glyph(), then restore
it (fully reversible; levl[][] is unchanged after the call), so walls now
render their proper face via wall_angle().

Also overlay every live monster on the level (fmon list, skipping
DEADMONSTER) using the normal mon_to_glyph()/display-RNG path, so
reveal_map shows monsters anywhere on the floor, refreshed each step.
reveal_map fully subsumes the obsolete fog_of_war knob (the fill_obs
overlay is now guarded solely by reveal_map>0). Drop the X(fog_of_war)
entry from the tune catalog and update the stale vision.c comment. The
catalog count drops by one.
Adds nle_seat_on_stair(nle, down): with down!=0 moves the hero onto the
down staircase (xdnstair/ydnstair), else onto the up staircase, via
u_on_newpos(). Returns nonzero if the requested stair is absent. Two-phase
like goto_depth (caller steps once to re-render); sets context.botl. Lets
the Map Viewer goto_depth(n) then seat the hero on the '>'.
Adds nle_level_up(nle, n): raises the hero n experience levels (capped at
30) through pluslvl(FALSE), so each level grants the normal HP / spell-power
/ intrinsic gains, then bumps u.uexp to the new level threshold so the next
newexplevel() won't undo it (mirrors the xp_level setter). pluslvl emits
messages; clearing iflags.window_inited around the calls routes them through
raw_print to avoid a coroutine yield from this bare entry point. Caller steps
once to refresh blstats.
…isc (flashlight; no see-through/memory beyond)
…rsist as memory (keep only the vision_recalc sight-limit)
…e SIGSEGV)

Resuming a checkpoint taken on a deep dungeon level crashed the process with a
SIGSEGV in jump_fcontext. nle_load_level stamps the saved level blob over the
CURRENT ledger slot, which after a resume()'s reset is level 1; getlev()'s
sanity check (lev && dlvl != lev) then saw e.g. 'level 5, not 1' and called
trickery() -> pline('Strange, this map is not as I remember it.') -> done(TRICKED).
pline routes through the rl window port, which yields the game coroutine
(jump_fcontext) from nle_load_level's MAIN context -> jump to a dead fcontext ->
SIGSEGV (the exact hazard the surrounding comment warns about).

A standalone load DELIBERATELY installs an arbitrary level into the current slot,
so the level-number check is wrong here; pass pid=0/lev=0 to disable both the
level and PID checks (pid/lev are used nowhere else in getlev). pid=0 also makes
cross-process resume safe (a checkpoint from another server run has a different
hackpid).
Restoring an in-memory snapshot taken at a nested window prompt (e.g. the live
Monte-Carlo checkpoint demo, or a continual-harness rollout) could SIGSEGV in
rl_clear_nhwindow or abort with a double-free in rl_yn_function.

Root cause: the rl window port tracks its win-proc call stack in a libc-backed
deque (nle_ctx_t->s_win_proc_calls), pushed/popped by a ScopedStack per window
call. A nested prompt (yn_function -> nhgetch) yields with the deque two deep.
nle_fr_restore swaps the coroutine stack back to the snapshot, but the deque
lives outside the snapshot and keeps its shallower live depth. The restored
stack's still-pending ScopedStack destructors then pop_back() past empty —
UB that walks std::deque's _M_finish into garbage, so the next push_back writes
to a near-null slot (_M_cur=0x1f0 in the captured core) and crashes.

Fix:
- ScopedStack::~ScopedStack guards pop-on-empty (the deque is a pure diagnostic
  call stack, never read, so skipping an unmatched pop is harmless).
- nle_fr_restore clears the deque (new nle_rl_winproc_reset shim) so it cannot
  drift/grow across repeated restores.

Verified: a nested-prompt snapshot + repeated restore + ctrl-R repro aborts on
the old engine within ~80 cycles and survives 100+ cycles after the fix; the
harness regression test test_snapshot_restore_nested_prompt_winproc covers it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant