Skip to content

fix(scanner): repair and re-align the JS fallback scanner#603

Open
IrosTheBeggar wants to merge 3 commits into
masterfrom
fix/js-scanner-data-loss-and-art-parity
Open

fix(scanner): repair and re-align the JS fallback scanner#603
IrosTheBeggar wants to merge 3 commits into
masterfrom
fix/js-scanner-data-loss-and-art-parity

Conversation

@IrosTheBeggar

@IrosTheBeggar IrosTheBeggar commented May 29, 2026

Copy link
Copy Markdown
Owner

Summary

Auditing the scanner surfaced a cluster of bugs in the JS fallback scanner (src/db/scanner.mjs). It had quietly diverged from the Rust scanner because the parity suite only ever exercises the Rust binary. Fixes, highest-severity first:

  1. Config rejection — the fallback was dead on arrival. task-queue.js sends waveformCacheDir in every jsonLoad and no longer sends generateWaveforms, but the JS Joi schema listed only the latter and rejected unknown keys — so the fallback exited with Invalid JSON Input before doing any work. Now accepts waveformCacheDir and tolerates unknown keys (.unknown(true)) so future Rust-only config fields can't silently break it again.

  2. Data loss on a vanished mount. With no accessibility guard, a transient CIFS/NFS outage made the walk find nothing and then DELETE FROM tracks WHERE scan_id != ? wipe every track for the library (cascading to albums / artists / user_album_stars). Added the upfront + post-walk guards the Rust scanner already has in run_scan.

  3. Album-art hash corruption + JS/Rust parity. Embedded and directory art were hashed via buffer.toString('utf-8') — a lossy UTF-8 round-trip, so the digest was neither the real MD5 of the image nor what Rust produces (it hashes raw bytes). Distinct covers could collide onto one filename. Now hashes the raw bytes and maps the embedded MIME type to an extension the way Rust does (image/jpeg.jpeg, not mime-types' .jpg).

  4. Sidecar stat storm. The fast-path issued ~22 statSync calls per file on every scan (even unchanged files and sidecar-free libraries). Added a per-directory listing cache (sidecarMtimeCached) mirroring the Rust scanner's DirListing — one readdir per directory.

  5. Double tree walk + symlink-cycle hang. Replaced the separate count-then-scan passes with a single collectFiles walk, and added realpath-based cycle detection when followSymlinks is enabled (walkdir gives the Rust scanner this for free).

Also fixes a stale comment line-reference in rust-parser/src/main.rs (comment-only).

Rebased onto master + ported the #612 Rust scanner optimizations

This branch is rebased onto current master, which now includes #612's scanner write-path overhaul. Most of #612 reaches the JS fallback automatically through the rebase — synchronous=NORMAL / cache_size / temp_store PRAGMAs and the NOT INNOT EXISTS orphan cleanup. Two wins needed a JS-side port, added in a separate commit (perf(scanner): port Rust write-path optimizations to the JS fallback):

  1. Keep the write lock free during extraction. The scan loop wrapped the entire walk — including parseMyFile (read + tag parse + hash + album-art I/O) — in one batched transaction, so a concurrent API write (playlist save, scrobble) could block for the length of a decode and blow past busy_timeout. Now the cheap unchanged-file scan_id bumps batch under one transaction, but processFile flushes that batch — releasing the write lock — before parsing each changed file, parses with no transaction open, then writes that one track in its own tight BEGIN…COMMIT. A mid-write failure rolls back just that track (the JS analogue of the Rust writer's per-song savepoint) instead of leaving a partial row for the next batch flush. Mirrors the Rust serial-path / greedy-drain restructure.

  2. Guard the per-album updateAlbumTags UPDATE so it's a no-op (0 rows, no row rewrite) when album_artist/compilation are unchanged — otherwise every track of a shared album rewrites the album row identically. Same guard ported from the Rust find_or_create_album.

Tests

New test/scanner-js-guard.test.mjs — the first test to drive scanner.mjs directly:

  • happy path: indexes a nested library, then rescans as a fast-path no-op (exercises the refactored walk + cached sidecar probe + the new per-file write transaction)
  • accessible-but-empty library: stale tracks are correctly swept
  • inaccessible directory: scan aborts (exit 1) and leaves tracks intact
  • every case uses a real task-queue-shaped config (incl. waveformCacheDir), so they regression-guard fix Added Dockerfile and modified the README #1

After the rebase + port:

  • Full project suite ✅ 1145 tests, 0 fail, 0 skipped (incl. scanner-js-guard, scanner-parity, audio-hash-parity, lyrics-parity, waveform, task-queue).
  • Direct JS-scanner smoke run (initial scan / no-op rescan / force-rescan): row counts identical to the Rust scanner, album rows byte-identical across a no-op rescan (guard no-ops correctly), no scanner warnings, scan_progress cleaned up.
  • No new lint vs. baseline.

Notes / out of scope

  • The Rust scanner's parallel album-art redundancy was addressed in perf(scanner): write-path throughput + lock-contention fixes #612 (per-directory OnceLock dedup), which this branch is now rebased on top of.
  • These fixes only touch the JS fallback path; installs running the Rust binary are unaffected. On a JS-fallback host, the first whole-library scan after upgrading will rewrite embedded-art filenames once (old files orphaned, not broken) as a consequence of fix UI obstructed by playback bar #3.

🤖 Generated with Claude Code

@IrosTheBeggar IrosTheBeggar force-pushed the fix/js-scanner-data-loss-and-art-parity branch 2 times, most recently from 92e426c to 6fb7487 Compare June 2, 2026 17:58
IrosTheBeggar and others added 2 commits June 2, 2026 17:01
The JS fallback scanner (src/db/scanner.mjs) had drifted from the Rust
scanner in ways the parity suite never catches — it only ever runs the
Rust binary. Fixes, highest-severity first:

1. Config rejection (fallback was dead on arrival). task-queue.js sends
   `waveformCacheDir` in every jsonLoad and no longer sends
   `generateWaveforms`; the Joi schema listed only the latter and
   rejected unknown keys, so the fallback exited with "Invalid JSON
   Input" before doing any work. Accept `waveformCacheDir` and tolerate
   unknown keys (.unknown(true)) so future Rust-only fields can't break
   the fallback again.

2. Data loss on a vanished mount. With no accessibility guard, a
   transient CIFS/NFS outage made the walk find nothing and the
   stale-cleanup DELETE wipe every track for the library (cascading to
   albums/artists/user_album_stars). Add the upfront + post-walk guards
   the Rust scanner already has.

3. Album-art hash corruption + parity. Embedded and directory art were
   hashed via buffer.toString('utf-8') — a lossy UTF-8 round-trip, so
   the digest was neither the real MD5 nor what Rust produces; distinct
   covers could collide onto one filename. Hash the raw bytes, and map
   the embedded MIME type to an extension the way Rust does
   (image/jpeg -> .jpeg, not mime-types' .jpg).

4. Sidecar stat storm. The fast-path issued ~22 statSync calls per file
   on every scan. Add a per-directory listing cache (sidecarMtimeCached)
   mirroring the Rust scanner's DirListing — one readdir per directory.

5. Double tree walk + symlink-cycle hang. Replace the separate
   count-then-scan passes with a single collectFiles walk, and add
   realpath-based cycle detection when followSymlinks is on (walkdir
   gives the Rust scanner this for free).

Adds test/scanner-js-guard.test.mjs — the first test to drive
scanner.mjs directly (happy-path walk, fast-path rescan, both guards,
real task-queue-shaped config). Stale comment line-refs fixed in
scanner.mjs and rust-parser/src/main.rs (comment-only; no binary change).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bring the JS fallback scanner in line with the Rust scanner's #612
lock-contention + write-reduction work. (The synchronous=NORMAL / cache
PRAGMAs and the NOT EXISTS orphan cleanup already arrived via the rebase
onto master; these are the two that needed a JS-side port.)

- Keep the write lock free during extraction. The scan loop wrapped the
  whole walk — including parseMyFile (read + tag parse + hash + album-art
  I/O) — in one batched transaction, so a concurrent API write could block
  for the length of a decode and blow past busy_timeout. Now the cheap
  unchanged-file scan_id bumps batch under one transaction, but processFile
  flushes that batch (releasing the lock) BEFORE parsing each changed file,
  parses with no transaction open, then writes that one track in its own
  tight BEGIN...COMMIT. A mid-write failure rolls back just that track (the
  JS analogue of the Rust writer's per-song savepoint) instead of leaving a
  partial row to be committed on the next batch flush. Mirrors the Rust
  serial-path / greedy-drain restructure.
- Guard the per-album updateAlbumTags UPDATE so it's a no-op (0 rows, no
  row rewrite) when album_artist/compilation are unchanged — otherwise
  every track of a shared album rewrites the album row identically. Same
  guard ported from the Rust find_or_create_album.

DB output unchanged: full suite green (1145 tests, 0 fail, 0 skipped) incl.
scanner-js-guard end-to-end. A direct JS-scanner smoke run (initial /
no-op rescan / force-rescan) produces row counts identical to the Rust
scanner and leaves album rows byte-identical across a no-op rescan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@IrosTheBeggar IrosTheBeggar force-pushed the fix/js-scanner-data-loss-and-art-parity branch from 6fb7487 to 0bcc566 Compare June 2, 2026 21:17
pull Bot pushed a commit to Spencerx/mStream that referenced this pull request Jun 10, 2026
task-queue.js sends one jsonLoad to both scanners, and it includes
waveformCacheDir (it stopped sending the older generateWaveforms
boolean). The JS fallback's Joi schema never listed the field, and Joi
rejects unknown keys by default — so every real launch of the fallback
died instantly with "Invalid JSON Input" / exit 1. Any install without
a working Rust binary has had a scanner that cannot run at all.

CI never caught it because the parity tests build their own jsonLoad
by hand (without waveformCacheDir) instead of going through
task-queue.js.

Accept the field, and add .unknown(true) so the next Rust-only field
added to task-queue.js's jsonLoad can't re-break the fallback the same
way — the fields the JS scanner actually reads are still validated.

Matches the identical hunk in PR IrosTheBeggar#603 so the eventual rebase
auto-merges.

Verified: forking scanner.mjs with a task-queue-shaped payload
(waveformCacheDir + an unknown extra field) against a fresh migrated DB
exits 1 with '"waveformCacheDir" is not allowed' before; exit 0 with a
clean scanComplete event after.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

UI obstructed by playback bar

1 participant