fix: apply Unicode full case fold in link reference label normalization #3997
Open
JSap0914 wants to merge 1 commit into
JavaScript's String.prototype.toLowerCase() performs Unicode lowercase mapping, not full case folding. As a result, ẞ (U+1E9E, Latin Capital Letter Sharp S) is lowercased to ß (U+00DF) rather than folded to 'ss'. The same issue affects ß itself (U+00DF → 'ss' in full case fold) and the Unicode ligatures (ﬁ → 'fi', ﬀ → 'ff', etc.).

The CommonMark spec (§5.6) requires Unicode case folding when matching link reference labels, so [ẞ] must match [SS]: /url (example 540). Previously it did not.

Fix: Add normalizeLabel() in helpers.ts that applies toLowerCase() and then replaces the characters whose Unicode full (F-status) case folding expands to multiple characters. Use it in Tokenizer.def and Tokenizer.reflink in place of plain toLowerCase().

Fixes CommonMark spec example 540.
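The mismatch described above can be reproduced without any Markdown parser. This is a minimal sketch using only built-in string methods; the variable names are illustrative and not taken from the PR.

```typescript
// Code points are written as escapes so the example survives copy/paste.
const capitalSharpS = '\u1E9E'; // ẞ LATIN CAPITAL LETTER SHARP S
const smallSharpS = '\u00DF';   // ß LATIN SMALL LETTER SHARP S

// toLowerCase() maps ẞ to ß; it does not produce "ss".
const lowered = capitalSharpS.toLowerCase();
console.log(lowered === smallSharpS); // true

// The definition label "SS" lowercases to "ss", so the two keys differ
// and the reference [ẞ] cannot find the definition [SS].
const definitionKey = 'SS'.toLowerCase();
console.log(lowered === definitionKey); // false
```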
@JSap0914 is attempting to deploy a commit to the MarkedJS Team on Vercel. A member of the Team first needs to authorize it.
styfle reviewed on Jun 22, 2026
    .replace(/ﬂ/g, 'fl') // U+FB02
    .replace(/ﬃ/g, 'ffi') // U+FB03
    .replace(/ﬄ/g, 'ffl') // U+FB04
    .replace(/ﬅ|ﬆ/g, 'st'); // U+FB05, U+FB06
Member
Is this list exhaustive?
I'm wondering if we can use a built-in API like Intl or perhaps label.normalize('NFKC') instead?
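To make the NFKC question concrete, here is a small sketch of what String.prototype.normalize('NFKC') does to the characters discussed in this PR. It suggests that NFKC would cover the Latin ligatures but not the sharp s, and that it also rewrites characters unrelated to case, so it does not look like a drop-in replacement for case folding. The variable names are illustrative.

```typescript
// Latin ligatures U+FB00..U+FB06, written as escapes for clarity.
const ligatures = ['\uFB00', '\uFB01', '\uFB02', '\uFB03', '\uFB04', '\uFB05', '\uFB06'];

// NFKC applies compatibility decomposition, which expands each ligature.
const nfkc = ligatures.map((ch) => ch.normalize('NFKC'));
console.log(nfkc.join(',')); // ff,fi,fl,ffi,ffl,st,st

// Neither sharp s has a compatibility decomposition, so NFKC leaves both alone.
const smallSharpSUnchanged = '\u00DF'.normalize('NFKC') === '\u00DF';
const capitalSharpSUnchanged = '\u1E9E'.normalize('NFKC') === '\u1E9E';
console.log(smallSharpSUnchanged, capitalSharpSUnchanged); // true true

// NFKC also changes characters that case folding would not touch,
// for example FULLWIDTH LATIN CAPITAL LETTER A (U+FF21) becomes ASCII "A".
const fullwidthA = '\uFF21'.normalize('NFKC');
console.log(fullwidthA); // A
```

On exhaustiveness: Unicode's CaseFolding.txt lists F-status entries beyond ß and the Latin ligatures (for example U+0149 and the Armenian ligature U+0587), so a list limited to those characters would not be complete.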
styfle reviewed on Jun 22, 2026
    // Apply the Unicode full case fold expansions after toLowerCase() so that
    // ẞ (U+1E9E) → ss and ß (U+00DF) → ss (toLowerCase() maps ẞ to ß first).
    .toLowerCase()
    .replace(/ß/g, 'ss') // U+00DF + folded U+1E9E
Member
I see the test for this first one, but do we have tests for the others?
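One way to cover the remaining pairs is a table-driven check. The sketch below is a hypothetical reconstruction, not the PR's code: normalizeLabelSketch, FOLD_EXPANSIONS, and FOLD_PATTERN are names invented for illustration, and the mapping only includes the characters mentioned in this PR.

```typescript
// Hypothetical stand-in for the helper under review; the real
// normalizeLabel() in the PR may be written differently.
const FOLD_EXPANSIONS: Record<string, string> = {
  '\u00DF': 'ss',  // ß (also reached from ẞ via toLowerCase())
  '\uFB00': 'ff',
  '\uFB01': 'fi',
  '\uFB02': 'fl',
  '\uFB03': 'ffi',
  '\uFB04': 'ffl',
  '\uFB05': 'st',
  '\uFB06': 'st',
};
const FOLD_PATTERN = /[\u00DF\uFB00-\uFB06]/g;

function normalizeLabelSketch(label: string): string {
  // Lowercase first, then expand the multi-character folds.
  return label.toLowerCase().replace(FOLD_PATTERN, (ch) => FOLD_EXPANSIONS[ch] ?? ch);
}

// Each pair lists two labels that should normalize to the same key.
const cases: Array<[string, string]> = [
  ['\u1E9E', 'SS'],
  ['\u00DF', 'SS'],
  ['\uFB00', 'FF'],
  ['\uFB01', 'FI'],
  ['\uFB02', 'FL'],
  ['\uFB03', 'FFI'],
  ['\uFB04', 'FFL'],
  ['\uFB05', 'ST'],
  ['\uFB06', 'ST'],
];

const failures = cases.filter(
  ([a, b]) => normalizeLabelSketch(a) !== normalizeLabelSketch(b),
);
console.log(failures.length); // 0
```

Keeping the pairs in a single table means a new fold entry needs one new row rather than a new test body.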
Bug

JavaScript's String.prototype.toLowerCase() performs Unicode lowercase mapping, not full case folding. The CommonMark spec §5.6 requires Unicode case folding when matching link reference labels.

The key difference: full case folding allows a single character to expand to multiple characters. For example, toLowerCase() maps ẞ (U+1E9E) to ß (U+00DF), whereas full case folding maps both characters to ss.

This means CommonMark spec example 540, in which [ẞ] must match the definition [SS]: /url, fails:

Expected:
<p><a href="/url">ẞ</a></p>

Got:
<p>[ẞ]</p> (unresolved reference)

Fix

Add normalizeLabel() to src/helpers.ts that chains toLowerCase() with replacements for the Unicode F-status (full) case fold pairs that expand to multiple characters. Update Tokenizer.def and Tokenizer.reflink to use it instead of plain toLowerCase().

Verification
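As an illustration of the expected behavior for example 540, the sketch below stores a definition and resolves a reference through the same folded key. It is not the tokenizer's actual code: foldLabel, define, resolve, and the Map-based store are hypothetical, and the fold only handles the sharp s needed for this example.

```typescript
// Assumption: lowercasing plus the sharp-s expansion is enough for example 540.
function foldLabel(label: string): string {
  return label
    .trim()
    .replace(/\s+/g, ' ')
    .toLowerCase()
    .replace(/\u00DF/g, 'ss');
}

// Hypothetical link store keyed by the folded label.
const links = new Map<string, string>();

function define(label: string, href: string): void {
  links.set(foldLabel(label), href);
}

function resolve(label: string): string | undefined {
  return links.get(foldLabel(label));
}

// Example 540: the definition is [SS]: /url and the reference is [ẞ].
define('SS', '/url');
const reference = '\u1E9E';
const href = resolve(reference);
const html = href === undefined
  ? `<p>[${reference}]</p>`
  : `<p><a href="${href}">${reference}</a></p>`;
console.log(html); // <p><a href="/url">ẞ</a></p>
```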