fix(nle): resume from a deep-floor checkpoint no longer SIGSEGVs by liujonathan24 · Pull Request #4 · liujonathan24/NetHack

liujonathan24 · 2026-06-13T17:09:25Z

Resuming a saved game from a deep-floor checkpoint crashed libnethack.so with a SIGSEGV.

Root cause. nle_load_level called getlev with the original hackpid/level, so getlev's trickery sanity check ((pid && pid != hpid) || (lev && dlvl != lev)) fired and emitted a pline ("trickery..."). That window call runs from the main context and yields into a now-dead fcontext (jump_fcontext → garbage), segfaulting.

Fix. Call getlev(fd, 0, 0, FALSE) so the trickery check is skipped on programmatic level loads (pid 0 and lev 0 both disable the respective comparison). The level blob being loaded is already the one we wrote, so the sanity check guards against nothing here.

Verified by resuming a goto_depth=5 checkpoint repeatedly — no crash. A regression test lives in the harness (environments/nethack/tests/test_checkpoint.py::test_checkpoint_resume_deep_floor).

…tion Previously the reveal_map / fog_of_war knobs called nle_reveal_level() at the end of vision_recalc(), which wrote the real terrain of every cell into the hero's remembered map (gbuf) via magic_map_background()+newsym(). That mutation was permanent: turning the knob back off could not un-reveal already-remembered terrain. Reimplement reveal as a render-time overlay on the emitted observation only. NetHackRL::fill_obs now, when reveal_map>0 or fog_of_war==0, fills any cell that is still unknown (nul_glyph) in the obs from back_to_glyph(x,y), mapped to char/color/special exactly as rl_print_glyph does (mapglyph + CLR_BLACK fixup + shuffled_glyph). It writes only into the emitted obs arrays -- never gbuf, levl[][].seenv, or newsym -- so the effect is fully reversible and leaves game state untouched. The default path (reveal_map==0, fog_of_war==1) skips the loop entirely, so vanilla behavior and golden parity are byte-identical. Remove the now-unused nle_reveal_level() from vision_recalc(), detect.c, and its declaration in nle.h.

… blob save/load Guards: reject ftell<=0/empty file in save; NULL/empty blob (rc=4) in load.

…or_connectivity/room_size

…downstair/first accessible)

…to_depth

fill_obs's reveal_map overlay filled unseen walls as blank stone because back_to_glyph() returns S_stone for any wall with seenv==0. Temporarily poke seenv=SVALL for wall/SDOOR cells before back_to_glyph(), then restore it (fully reversible; levl[][] is unchanged after the call), so walls now render their proper face via wall_angle(). Also overlay every live monster on the level (fmon list, skipping DEADMONSTER) using the normal mon_to_glyph()/display-RNG path, so reveal_map shows monsters anywhere on the floor, refreshed each step.

reveal_map fully subsumes the obsolete fog_of_war knob (the fill_obs overlay is now guarded solely by reveal_map>0). Drop the X(fog_of_war) entry from the tune catalog and update the stale vision.c comment. The catalog count drops by one.

Adds nle_seat_on_stair(nle, down): with down!=0 moves the hero onto the down staircase (xdnstair/ydnstair), else onto the up staircase, via u_on_newpos(). Returns nonzero if the requested stair is absent. Two-phase like goto_depth (caller steps once to re-render); sets context.botl. Lets the Map Viewer goto_depth(n) then seat the hero on the '>'.

Adds nle_level_up(nle, n): raises the hero n experience levels (capped at 30) through pluslvl(FALSE), so each level grants the normal HP / spell-power / intrinsic gains, then bumps u.uexp to the new level threshold so the next newexplevel() won't undo it (mirrors the xp_level setter). pluslvl emits messages; clearing iflags.window_inited around the calls routes them through raw_print to avoid a coroutine yield from this bare entry point. Caller steps once to refresh blstats.

…ond radius, even in lit rooms)

…isc (flashlight; no see-through/memory beyond)

…rsist as memory (keep only the vision_recalc sight-limit)

…ization (pairs with save_level)

…e SIGSEGV) Resuming a checkpoint taken on a deep dungeon level crashed the process with a SIGSEGV in jump_fcontext. nle_load_level stamps the saved level blob over the CURRENT ledger slot, which after a resume()'s reset is level 1; getlev()'s sanity check (lev && dlvl != lev) then saw e.g. 'level 5, not 1' and called trickery() -> pline('Strange, this map is not as I remember it.') -> done(TRICKED). pline routes through the rl window port, which yields the game coroutine (jump_fcontext) from nle_load_level's MAIN context -> jump to a dead fcontext -> SIGSEGV (the exact hazard the surrounding comment warns about). A standalone load DELIBERATELY installs an arbitrary level into the current slot, so the level-number check is wrong here; pass pid=0/lev=0 to disable both the level and PID checks (pid/lev are used nowhere else in getlev). pid=0 also makes cross-process resume safe (a checkpoint from another server run has a different hackpid).

Restoring an in-memory snapshot taken at a nested window prompt (e.g. the live Monte-Carlo checkpoint demo, or a continual-harness rollout) could SIGSEGV in rl_clear_nhwindow or abort with a double-free in rl_yn_function. Root cause: the rl window port tracks its win-proc call stack in a libc-backed deque (nle_ctx_t->s_win_proc_calls), pushed/popped by a ScopedStack per window call. A nested prompt (yn_function -> nhgetch) yields with the deque two deep. nle_fr_restore swaps the coroutine stack back to the snapshot, but the deque lives outside the snapshot and keeps its shallower live depth. The restored stack's still-pending ScopedStack destructors then pop_back() past empty — UB that walks std::deque's _M_finish into garbage, so the next push_back writes to a near-null slot (_M_cur=0x1f0 in the captured core) and crashes. Fix: - ScopedStack::~ScopedStack guards pop-on-empty (the deque is a pure diagnostic call stack, never read, so skipping an unmatched pop is harmless). - nle_fr_restore clears the deque (new nle_rl_winproc_reset shim) so it cannot drift/grow across repeated restores. Verified: a nested-prompt snapshot + repeated restore + ctrl-R repro aborts on the old engine within ~80 cycles and survives 100+ cycles after the fix; the harness regression test test_snapshot_restore_nested_prompt_winproc covers it.

liujonathan24 and others added 14 commits June 11, 2026 03:29

feat(nle): nle_save_level/nle_load_level/nle_free_blob — single-level…

b7d0423

… blob save/load Guards: reject ftell<=0/empty file in save; NULL/empty blob (rc=4) in load.

feat(nle): generation knobs mob_spawn/trap_density/locked_door/corrid…

70a7175

…or_connectivity/room_size

fix(nle): nle_load_level seats the hero on the loaded level (upstair/…

22cc153

…downstair/first accessible)

feat(nle): secure state setters (hp/max_hp/gold/xp_level/hunger) + go…

e991505

…to_depth

refactor(nle): remove the fog_of_war tune knob

22b5ed0

reveal_map fully subsumes the obsolete fog_of_war knob (the fill_obs overlay is now guarded solely by reveal_map>0). Drop the X(fog_of_war) entry from the tune catalog and update the stale vision.c comment. The catalog count drops by one.

feat(vision): vision_radius is a hard sight limit (clamp IN_SIGHT bey…

c7507ec

…ond radius, even in lit rooms)

feat(vision): vision_radius constricts the emitted view to a radius d…

a4ce2be

…isc (flashlight; no see-through/memory beyond)

revert(vision): drop the flashlight obs-mask — explored cells must pe…

b8160b3

…rsist as memory (keep only the vision_recalc sight-limit)

feat(nle): nle_save_player/nle_load_player — pointer-safe hero serial…

2c73561

…ization (pairs with save_level)

liujonathan24 mentioned this pull request Jun 13, 2026

Clean main: web console (checkpoint/restore, undo), demo GIFs, aggregated docs; relocate process artifacts to experimental liujonathan24/NetHackHarness#15

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(nle): resume from a deep-floor checkpoint no longer SIGSEGVs#4

fix(nle): resume from a deep-floor checkpoint no longer SIGSEGVs#4
liujonathan24 wants to merge 15 commits into
mainfrom
fix-resume-trickery-sigsegv

liujonathan24 commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

liujonathan24 commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant