
Beatport/Beatsource match error and API decoding, Fix ID3 tag duplication, date mapping, Updated dependencies. #526

Open
rosgr100 wants to merge 30 commits into Marekkon5:master from rosgr100:master


@rosgr100 rosgr100 commented May 21, 2026


This PR introduces comprehensive stability improvements, dependency modernizations, and metadata formatting fixes across the core tagging engine, platform scrapers, and the CI/CD pipeline.

1. Dependency Modernization & Compatibility

  • Full Stack Update: Updated both backend (Rust/Cargo) and frontend (Node/PNPM) dependencies to their latest versions.
  • Compatibility Fixes: Resolved all breaking changes and code compatibility issues introduced by the dependency bumps, ensuring the application compiles cleanly and runs securely on the newest packages.

2. Beatport and Beatsource API Fix (beatport.rs) (beatsource.rs)

The Issue: The Beatsource search broke after a frontend update removed the <script id="__NEXT_DATA__"> tag, which the old web scraper relied on to extract search results.
The Fix: The HTML scraping logic and its scraper dependencies have been removed entirely. The code now fetches a client-credentials OAuth token from account.beatport.com and queries the official v4 catalog API (api.beatsource.com/v4/catalog/search) directly. This returns the same data without relying on fragile DOM parsing.

Fixes #518 #520

3. Core ID3 & Tagging Engine Fixes (id3.rs, lib.rs [tag], lib.rs [autotagger])

  • Duplicate Tag Fix: Resolved a bug where User-Defined Text (TXXX) frames (e.g., UNIQUEFILEID, WWWAUDIOFILE) were appended again on every overwrite, so duplicates accumulated without limit. The ID3 writer now explicitly clears existing extended text frames before appending new ones.
  • Date Format Standardization: Separated year and full date tags to align with standard mapping conventions. The standard YEAR frame is now strictly restricted to 4 digits (YYYY), while the full YYYY-MM-DD strings are injected into custom RELEASETIME and PUBLISHTIME tags.
  • Cover Art Cleanup: Updated cleanup logic in the tagging engine to resolve album art persistence bugs when modifying or overwriting covers.
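The duplicate-tag fix above follows a clear-then-append pattern. A minimal sketch of that pattern is shown here, assuming a simplified in-memory frame list; `ExtendedText` and `set_extended_text` are hypothetical names for illustration and are not the types used in id3.rs.

```rust
/// Simplified stand-in for an ID3 user-defined text (TXXX) frame.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
struct ExtendedText {
    description: String,
    value: String,
}

/// Replaces all frames that share `description` with one frame per new value.
fn set_extended_text(frames: &mut Vec<ExtendedText>, description: &str, values: &[&str]) {
    // Clear first: without this step every overwrite appends another copy.
    frames.retain(|f| f.description != description);
    // Then append the new values.
    frames.extend(values.iter().map(|v| ExtendedText {
        description: description.to_string(),
        value: v.to_string(),
    }));
}
```

Calling `set_extended_text` repeatedly with the same description leaves exactly one set of frames for that description, and frames with other descriptions are untouched.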

4. API Decoding Stability (beatport.rs, beatsource.rs)

  • Implemented a clear_search_query helper to strip parentheses from search strings before hitting the v4/catalog/search endpoints. This prevents 400/403 errors when scraping tracks with bracketed metadata (like mixes or features) in the title.
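As a rough illustration of the helper described above, the sketch below assumes that "strip parentheses" means replacing the parenthesis characters with spaces while keeping the enclosed words, then collapsing whitespace. The actual helper in beatport.rs and beatsource.rs may behave differently.

```rust
/// Hypothetical sketch of a query sanitizer: replaces '(' and ')' with
/// spaces and collapses runs of whitespace into single spaces.
fn clear_search_query(query: &str) -> String {
    let replaced: String = query
        .chars()
        .map(|c| if matches!(c, '(' | ')') { ' ' } else { c })
        .collect();
    // split_whitespace drops leading, trailing, and repeated whitespace.
    replaced.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

With this sketch, `Title (Extended Mix)` becomes `Title Extended Mix` before it is sent to the search endpoint.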

5. Workflow & CI/CD Updates

  • Manual Triggers: Added workflow_dispatch to the on: block in .github/workflows/build.yml to allow the build process to be triggered manually from the GitHub Actions tab.
  • Pipeline Modernization: Upgraded GitHub Actions to their latest versions and explicitly set the Node.js environment variables.

Simon-Zwa and others added 15 commits May 18, 2026 21:25
Beatport no longer exposes __NEXT_DATA__ on search pages, causing search failures.

Updated Beatport search to use the v4 catalog API endpoint instead of scraping the website.

Adjusted deserialization to support the new API response format (tracks array).
Added a note about a temporary Beatport fix for May 2026.
Updated user agent string and fixed a typo in the search query parameter. Changed the token fetching method to use OAuth API and updated the token expiration logic.
Updated GitHub Actions to use the latest versions of actions for checkout, cache, setup-node, and pnpm.
… and update platform statics

Summary of Dependency Modernization (Non-API Changes)

    Audio Engine Upgrades (rodio 0.22): Migrated the entire playback module (onetagger-player) from i16 integers to f32 floats. Updated all local decoders (aiff, alac, flac, mp3, mp4, ogg, wav) to yield floating-point data streams, and wrapped sample rates/channels in strict NonZeroU32/NonZeroU16 structural safety types.

    Shazam Fingerprinting Fixes (onetagger-autotag): Adapted shazam.rs to map floating-point streams from UniformSourceIterator. Implemented a down-sampling translation block to scale and convert the raw float sample buffer back to standard i16 PCM vectors to satisfy the underlying SongRec signature generator requirements.

    Window & UI Layer Adjustments (wry 0.55): Refactored the window lifecycle application loop in main.rs. Split the legacy webview navigation handler into separate modern asynchronous functions, updated the closures to support dual-argument signatures (handling url and NewWindowFeatures), and migrated target window states to use the modern NewWindowResponse::Allow and Deny enums.

    Crate Configuration Updates (Cargo.toml): Swapped deprecated reqwest flags from "rustls-tls" to "rustls". Restored explicit "query" and "form" dependency compilation flags inside the platforms module to handle isolated network request actions. Replaced the legacy audio "cpal" identifier with the modernized "playback" flag to maintain access to local host speakers.
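For the Shazam fingerprinting change summarized above, the float-to-i16 conversion step could look roughly like this. It is a sketch that assumes samples are normalized to [-1.0, 1.0]; the function name `f32_to_i16_pcm` is hypothetical and the actual code in shazam.rs may scale differently.

```rust
/// Converts normalized f32 samples (nominally in [-1.0, 1.0]) to 16-bit PCM.
/// Hypothetical helper for illustration.
fn f32_to_i16_pcm(samples: &[f32]) -> Vec<i16> {
    samples
        .iter()
        .map(|&s| {
            // Clamp first so out-of-range input saturates at full scale.
            let clamped = s.clamp(-1.0, 1.0);
            // Scale to the i16 range; the cast truncates toward zero.
            (clamped * i16::MAX as f32) as i16
        })
        .collect()
}
```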

Wrapped the cover art file-writing loop in an explicit `config.album_art_file` check to prevent loose 'cover.jpg' files from being saved when the feature is disabled in settings.
id3.rs: Resolved the duplicate TXXX frame bug (e.g., UNIQUEFILEID, WWWAUDIOFILE) by explicitly clearing existing extended text frames before writing new ones.

lib.rs (tag) & lib.rs (autotagger) : Fixed ID3 date mapping to ensure the standard YEAR frame strictly outputs a 4-digit year (YYYY).

lib.rs (tag) & lib.rs (autotagger): Added logic to automatically inject the full YYYY-MM-DD date strings into custom RELEASETIME and PUBLISHTIME tags.
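The date mapping in the two commits above can be sketched as a small split function: the 4-digit year is intended for the YEAR frame and the full string for RELEASETIME and PUBLISHTIME. `split_release_date` is a hypothetical name, and the sketch assumes the input is already formatted as YYYY-MM-DD.

```rust
/// Splits a `YYYY-MM-DD` string into (year, full date).
/// Returns `None` when the input does not start with four ASCII digits.
/// Hypothetical helper for illustration.
fn split_release_date(date: &str) -> Option<(String, String)> {
    // `get` returns None for short input or a non-character boundary.
    let year = date.get(..4)?;
    if !year.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    Some((year.to_string(), date.to_string()))
}
```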
beatsource.rs: Added clear_search_query helper to strip parentheses from search strings before hitting the v4/catalog/search endpoint.

This mirrors the recent Beatport fix and prevents 400/403 API errors when searching for tracks with bracketed metadata (e.g., mix names or featured artists) in the title.
rosgr100 added 2 commits May 24, 2026 19:28
Removed installation of nodejs and pnpm from dependencies. Added separate steps for installing NodeJS and pnpm.
@deviationist


Just tested this PR's CLI build against a ~3,800-track DJ library on Linux (Docker, cargo build --release -p onetagger-cli from the pr526 head over 1.7.0). Confirms the rewrite works end-to-end for both platforms — __NEXT_DATA__ errors gone, api.beatport.com / api.beatsource.com OAuth client_credentials handshake succeeds, matches come through cleanly.

In context: ran 5 sequential passes over the library with overwrite:false, skipTagged:true. The first four (Discogs + Junodownload, MusicBrainz, then Deezer + iTunes) brought us to 91% coverage. Rebuilding with this PR and running just ["beatport", "beatsource"] over the remaining ~352 residue rescued +31 more files (29 Beatport, 2 Beatsource — Beatsource's catalog looks largely subsumed by Beatport's for electronic). API responses are noticeably faster than the old HTML scrape too — per-batch wall time dropped meaningfully.

Just chiming in with real-world results in case it helps nudge the review. Thanks @rosgr100 for the fix!

@francisuk1989


+1 from me, Thank you @rosgr100

rosgr100 added 3 commits June 7, 2026 17:32
**The Problem:**
Perfectly matched tracks were receiving ~56% accuracy scores because OneTagger's standard fuzzy matching compared long local titles like `Title (Extended Mix)` against Beatport's shorter base `Title` field.

**The Solution:**
* **Smart Fallback:** Added a secondary matching pass that only triggers if the initial score is < 80%.
* **Regex Extraction:** Safely splits the local title and mix name for independent grading.
* **Weighted Scoring:** Calculates a new accuracy score heavily weighted toward the base title (70%) but rewarding accurate mix names (30%).
* **False Positive Prevention:** Added a strict boolean check to immediately reject API tracks if their mix name directly contradicts the local file's mix name.
* **Deps:** Added `strsim` to `onetagger-platforms`.
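The gated, weighted rescoring described above can be sketched as follows. This is an illustration under several assumptions: a hand-written normalized Levenshtein stands in for the `strsim` crate, plain string methods stand in for the regex split, the mix-name contradiction check is omitted, and all function names are hypothetical.

```rust
/// Levenshtein edit distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Case-insensitive similarity in [0.0, 1.0]; 1.0 means identical.
fn similarity(a: &str, b: &str) -> f64 {
    let (a, b) = (a.to_lowercase(), b.to_lowercase());
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0;
    }
    1.0 - levenshtein(&a, &b) as f64 / max_len as f64
}

/// Splits "Title (Mix Name)" into ("Title", Some("Mix Name")).
fn split_mix(title: &str) -> (&str, Option<&str>) {
    let trimmed = title.trim();
    if let Some(stripped) = trimmed.strip_suffix(')') {
        if let Some(open) = stripped.rfind('(') {
            return (stripped[..open].trim(), Some(stripped[open + 1..].trim()));
        }
    }
    (trimmed, None)
}

/// Returns the primary score unchanged when it is at least 0.80; otherwise
/// rescoring weights the base title at 70% and the mix name at 30%.
fn rescored_accuracy(primary: f64, local_title: &str, api_title: &str, api_mix: &str) -> f64 {
    if primary >= 0.80 {
        return primary; // fallback gate: confident matches are left alone
    }
    let (local_base, local_mix) = split_mix(local_title);
    // Assumption: a local title without a mix name is not penalized.
    let mix_score = local_mix.map_or(1.0, |mix| similarity(mix, api_mix));
    0.7 * similarity(local_base, api_title) + 0.3 * mix_score
}
```

For a local title `Title (Extended Mix)` compared with an API title `Title` and mix name `Extended Mix`, both components score 1.0, so the rescored accuracy is approximately 1.0 instead of the low primary score.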
…y and remix matching

This PR introduces a secondary fallback matching engine contained entirely within beatport.rs. It is designed to resolve false negatives caused by Beatport's inconsistent metadata formatting (e.g., artist reordering, alias variations, and arbitrary (Extended Mix) suffixes) without altering the primary MatchingUtils core logic. The fallback only triggers if the primary matcher returns a confidence score below 0.80, so it acts only on tracks the primary matcher could not match confidently.

Key Features & Improvements

Categorical Version Taxonomy (MixType): Replaces brittle string equality with a strict enum matrix. This guarantees that functionally different DJ mixes (e.g., Club Mix vs Extended Mix) are strictly isolated and cannot falsely match, while safely bridging tracks missing explicit tags (Unknown ↔ Original).

Jaccard (Token-Based) Similarity: Transitions from Levenshtein distance to Jaccard set intersection for Artist arrays and Remix titles. This removes the score penalties previously caused by word reordering (e.g., "Guetta Remix" vs "Remix Guetta") and punctuation differences ("&" vs "and").

Remix Stopword Filtering: Strips noise words (remix, rmx, mix, edit, vip, dub, etc.) prior to Jaccard comparison. This prevents false positives where two completely different remixers achieve a high similarity score simply because both strings contain the word "remix".

Deterministic Confidence Ceiling: Enforces a strict scoring hierarchy where fuzzy Jaccard artist matches are capped at 0.9 to ensure they can never outrank a verified exact match (1.0).
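The Jaccard comparison, stopword filtering, and 0.9 ceiling described above can be sketched as below. This is an illustration, not the code in beatport.rs: the tokenizer, the short stopword list (only the words named above), the empty-set rule, and all function names are assumptions.

```rust
use std::collections::HashSet;

/// Noise words named in the description above; the real list is longer.
const REMIX_STOPWORDS: [&str; 6] = ["remix", "rmx", "mix", "edit", "vip", "dub"];

/// Lowercases, maps "&" to "and", splits on non-alphanumeric characters,
/// and optionally drops remix stopwords.
fn tokens(text: &str, drop_stopwords: bool) -> HashSet<String> {
    text.to_lowercase()
        .replace('&', " and ")
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .filter(|t| !(drop_stopwords && REMIX_STOPWORDS.contains(t)))
        .map(str::to_string)
        .collect()
}

/// Jaccard similarity: size of the intersection over size of the union.
fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0; // assumption: nothing left to compare means no evidence
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Remix names are compared after stopwords are removed.
fn remix_score(local: &str, remote: &str) -> f64 {
    jaccard(&tokens(local, true), &tokens(remote, true))
}

/// Exact artist matches score 1.0; fuzzy token matches are capped at 0.9.
fn artist_score(local: &str, remote: &str) -> f64 {
    if local.eq_ignore_ascii_case(remote) {
        return 1.0;
    }
    jaccard(&tokens(local, false), &tokens(remote, false)).min(0.9)
}
```

Because sets ignore order, "Guetta Remix" and "Remix Guetta" compare as equal, while two different remixers share no tokens once "remix" is filtered out.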

Performance Optimizations

O(N²) Prevention: Eliminates nested vector iteration by utilizing an internal HashMap lookup during fallback score updates.

Regex Caching: Migrates the primary Mix Regex to a static OnceLock so it compiles exactly once per application lifetime.

Closure Hoisting: Moves allocation-heavy normalizer closures (normalize_punctuation, normalize_artists) outside the main iteration loop.
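Two of the optimizations above can be illustrated with the standard library alone. In this sketch a keyword list stands in for the compiled regex so the example stays dependency-free, and `Track`, `mix_keywords`, and `apply_fallback_scores` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the expensive initializer runs (for demonstration).
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for an expensive-to-build value such as a compiled regex.
/// The closure passed to `get_or_init` runs once per process.
fn mix_keywords() -> &'static Vec<&'static str> {
    static KEYWORDS: OnceLock<Vec<&'static str>> = OnceLock::new();
    KEYWORDS.get_or_init(|| {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        vec!["extended", "club", "original"]
    })
}

struct Track {
    id: u64,
    accuracy: f64,
}

/// Applies rescored accuracies with one HashMap lookup per track instead
/// of scanning the `rescored` slice for every track.
fn apply_fallback_scores(tracks: &mut [Track], rescored: &[(u64, f64)]) {
    let by_id: HashMap<u64, f64> = rescored.iter().copied().collect();
    for track in tracks.iter_mut() {
        if let Some(&score) = by_id.get(&track.id) {
            track.accuracy = score;
        }
    }
}
```

The HashMap is built once in O(M) and each track update is an expected O(1) lookup, so the pass is roughly O(N + M) rather than O(N × M).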

Architectural Safety

Zero Core Impact: These changes are 100% fenced behind the < 0.80 fallback gate and contained entirely inside beatport.rs. The primary engine remains completely untouched, ensuring stability across other platform matchers.
rosgr100 and others added 3 commits June 10, 2026 01:22
…ag pagination

The Beatport module struggled to match tracks containing featured artists in the local title (e.g., "Forever Ft. Sabrina Johnston").

Beatport's search API frequently chokes and returns 0 results when featured artists are included in the search string.

Even when found, OneTagger's strict accuracy thresholds would fail the match because Beatport's base title didn't contain the featured artist.

Furthermore, Beatport's Auto Tag fallback search was failing on valid tracks because the engine exited pagination prematurely (only checking Page 1) and didn't sort the fallback array by accuracy.

The Solution
This PR overhauls the Beatport matching logic and search sanitization so that titles with featured artists match without any extra user configuration.

Key Changes:

Hardcoded Search Sanitization: Baked a feature-stripping regex, `(?i)\s+(?:ft|feat|featuring)\.?\s+[^()]+`, directly into clear_search_query. The Beatport API now always receives a clean base title, which prevents the 0-result responses described above.

Universal Math Fallback: Added a smart fallback engine that dynamically strips features from both the local title and the Beatport API title during the Levenshtein calculation. This guarantees a 1.0 accuracy score for valid tracks even if the user has the OneTagger title cleanup regex empty.

Autotag Pagination Fix: Modified the search loop to respect the user's max_pages config during Auto Tag runs, rather than prematurely returning after Page 1 if a match wasn't instantly found.

Array Sorting: Forced matched_tracks to sort by accuracy descending so Auto Tag's automated selection reliably grabs the 100% match instead of a scrambled lower-tier match.
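The pagination and sorting changes above could be sketched as follows. The sketch assumes the loop keeps collecting across pages rather than stopping at the first match; `Candidate`, `collect_matches`, and the `fetch_page` closure are hypothetical stand-ins for the real search call and types.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    title: String,
    accuracy: f64,
}

/// Searches up to `max_pages` pages and returns the candidates at or above
/// `threshold`, best first. `fetch_page` is a stand-in for the API request
/// and is expected to return an empty Vec past the last page.
fn collect_matches<F>(max_pages: u32, threshold: f64, mut fetch_page: F) -> Vec<Candidate>
where
    F: FnMut(u32) -> Vec<Candidate>,
{
    let mut matched = Vec::new();
    for page in 1..=max_pages {
        let results = fetch_page(page);
        if results.is_empty() {
            break; // no further pages to check
        }
        matched.extend(results.into_iter().filter(|c| c.accuracy >= threshold));
    }
    // Sort by accuracy descending so automated selection can take index 0.
    matched.sort_by(|a, b| b.accuracy.total_cmp(&a.accuracy));
    matched
}
```

`f64::total_cmp` gives a total order over floats, which avoids the `partial_cmp(...).unwrap()` pattern when sorting accuracy values.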
…ashes

Two edge-case bugs were causing the Beatport module to freeze or aggressively fail out of track matching:

Token Deadlock: If the Beatport API token expired mid-session, the update_token function attempted a recursive call while still holding the Mutex guard, causing the application thread to permanently freeze.

ISRC Hard Crash: The ISRC matching path used the ? operator on the track detail API fetch. If the API rate-limited or dropped the connection on that specific call, it hard-failed the entire matching process instead of safely falling back to the text-search engine.

The Solution

Mutex Deadlock Fix: Explicitly added drop(token); to release the Mutex guard before the recursive update_token call, ensuring thread safety upon token expiry.

Graceful ISRC Fallback: Replaced the ? operator in the ISRC search block with a match statement. If a track detail fetch fails, it now logs a warning and gracefully falls through to the standard text search, rather than crashing the tagger.
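The ISRC fallback described above replaces early return via `?` with an explicit `match`. A minimal sketch of that control flow, with closures standing in for the two API calls and `eprintln!` standing in for the project's logging (all names here are hypothetical):

```rust
#[derive(Debug, PartialEq)]
enum MatchSource {
    Isrc,
    TextSearch,
}

/// Tries the ISRC detail lookup first and falls back to text search when
/// the lookup fails, instead of propagating the error.
fn match_track<D, S>(
    isrc: Option<&str>,
    fetch_detail: D,
    text_search: S,
) -> Result<(MatchSource, String), String>
where
    D: Fn(&str) -> Result<String, String>,
    S: Fn() -> Result<String, String>,
{
    if let Some(code) = isrc {
        match fetch_detail(code) {
            Ok(track) => return Ok((MatchSource::Isrc, track)),
            // Using `?` here would abort the whole match on a transient error.
            Err(e) => eprintln!("ISRC lookup for {code} failed, using text search: {e}"),
        }
    }
    text_search().map(|track| (MatchSource::TextSearch, track))
}
```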
Implemented a bounded retry loop in the update_token method to handle expired tokens more effectively by preventing infinite recursion. Added error handling for cases where the Beatport API continuously provides expired tokens.
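The token fixes above combine two ideas: release the Mutex guard before doing further work, and bound the number of refresh attempts. The sketch below illustrates both under assumptions: `Client`, `Token`, `MAX_ATTEMPTS`, and the `fetch` closure are hypothetical, and the real update_token in beatport.rs is structured differently.

```rust
use std::sync::Mutex;
use std::time::Instant;

struct Token {
    value: String,
    expires_at: Instant,
}

struct Client {
    token: Mutex<Option<Token>>,
}

impl Client {
    /// Returns a valid token, requesting a fresh one at most MAX_ATTEMPTS
    /// times. `fetch` stands in for the OAuth request.
    fn update_token<F>(&self, mut fetch: F) -> Result<String, String>
    where
        F: FnMut() -> Result<Token, String>,
    {
        const MAX_ATTEMPTS: usize = 3;

        // Fast path: reuse the cached token while it is still valid.
        let guard = self.token.lock().map_err(|e| e.to_string())?;
        if let Some(t) = guard.as_ref() {
            if t.expires_at > Instant::now() {
                return Ok(t.value.clone());
            }
        }
        // Release the lock before refreshing; std's Mutex is not reentrant,
        // so locking again while holding the guard would deadlock or panic.
        drop(guard);

        // A bounded loop replaces unbounded recursion.
        for _ in 0..MAX_ATTEMPTS {
            let fresh = fetch()?; // no lock is held during the request
            if fresh.expires_at > Instant::now() {
                let value = fresh.value.clone();
                *self.token.lock().map_err(|e| e.to_string())? = Some(fresh);
                return Ok(value);
            }
        }
        Err("token endpoint kept returning expired tokens".to_string())
    }
}
```

If every attempt yields an already expired token, the caller gets an error after three requests instead of an endless loop.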
Development

Successfully merging this pull request may close these issues.

Beatport API disconnection ?

4 participants