fix: apply Unicode full case fold in link reference label normalization #3997

Open
JSap0914 wants to merge 1 commit into markedjs:master from JSap0914:fix/autolink-email-scheme-normalization

@JSap0914

Bug

JavaScript's String.prototype.toLowerCase() implements Unicode simple case mapping, not full case folding. The CommonMark spec §5.6 requires Unicode case folding when matching link reference labels:

One label matches another just if their normalized forms are equal. To normalize a label, perform the Unicode case fold on the label, and collapse consecutive spaces, tabs, and line endings to a single space.

The key difference: full case folding allows a single character to expand to multiple characters. For example:

Character       toLowerCase()     Full case fold
ẞ (U+1E9E)      ß                 ss
ß (U+00DF)      ß                 ss
fi (U+FB01)      fi (unchanged)     fi
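
The gap can be checked directly in any JavaScript runtime. The following is a minimal sketch, not part of the PR: the expected fold strings are copied from the table above rather than computed, and the names `foldSamples` and `lowerCaseMisses` are made up for illustration.

```typescript
// Each sample pairs a character with the full case fold listed in the PR description.
const foldSamples: Array<{ char: string; fullFold: string }> = [
  { char: '\u1E9E', fullFold: 'ss' }, // ẞ, Latin capital letter sharp s
  { char: '\u00DF', fullFold: 'ss' }, // ß, Latin small letter sharp s
  { char: '\uFB01', fullFold: 'fi' }, // fi ligature
];

// Keep the samples where toLowerCase() alone does not produce the full fold.
const lowerCaseMisses = foldSamples.filter(
  ({ char, fullFold }) => char.toLowerCase() !== fullFold,
);

for (const { char, fullFold } of lowerCaseMisses) {
  console.log(`${char} lowercases to ${char.toLowerCase()} but folds to ${fullFold}`);
}
```

All three samples end up in `lowerCaseMisses`, which is why a label comparison based only on `toLowerCase()` cannot match `[ẞ]` against `[SS]`.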

This means CommonMark spec example 540 fails:

[ẞ]

[SS]: /url

Expected: <p><a href="/url">ẞ</a></p>
Got: <p>[ẞ]</p> (unresolved reference)

Fix

Add normalizeLabel() to src/helpers.ts that chains toLowerCase() with replacements for the Unicode F-status (full) case fold pairs that expand to multiple characters. Update Tokenizer.def and Tokenizer.reflink to use it instead of plain toLowerCase().
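
For readers who want a feel for the approach, here is a rough sketch of what such a helper could look like. It is an assumption-laden stand-in, not the code from the PR: the name `normalizeLabelSketch`, the `MULTI_CHAR_FOLDS` table, and the whitespace-collapsing step are illustrative, and the real list in src/helpers.ts may be longer or structured differently.

```typescript
// Hypothetical replacement table: characters whose full case fold expands to
// several characters. Only ß and the Latin ligatures U+FB00..U+FB06 are covered.
const MULTI_CHAR_FOLDS: ReadonlyArray<[RegExp, string]> = [
  [/\u00DF/g, 'ss'], // ß; also covers ẞ (U+1E9E) once toLowerCase() has run
  [/\uFB00/g, 'ff'],
  [/\uFB01/g, 'fi'],
  [/\uFB02/g, 'fl'],
  [/\uFB03/g, 'ffi'],
  [/\uFB04/g, 'ffl'],
  [/[\uFB05\uFB06]/g, 'st'],
];

function normalizeLabelSketch(label: string): string {
  // Lowercase first, then expand the characters toLowerCase() leaves alone.
  let folded = label.toLowerCase();
  for (const [pattern, replacement] of MULTI_CHAR_FOLDS) {
    folded = folded.replace(pattern, replacement);
  }
  // Collapse runs of spaces, tabs, and line endings, as the quoted spec text asks.
  return folded.replace(/\s+/g, ' ');
}

console.log(normalizeLabelSketch('\u1E9E') === normalizeLabelSketch('SS'));
```

With a helper along these lines, the label `[ẞ]` and the definition `[SS]: /url` normalize to the same key, which is the behavior example 540 expects.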

Verification

npm test
  • CommonMark Links: 77/90 → 78/90 (example 540 now passes)
  • GFM Links: same improvement
  • All 1743 spec tests pass; all 188 unit tests pass; lint clean

AI assistance disclosure: This fix was developed with AI-assisted tooling.

JavaScript's String.prototype.toLowerCase() implements Unicode simple
case mapping, not full case folding. As a result, ẞ (U+1E9E, Latin
Capital Letter Sharp S) is lowercased to ß (U+00DF) rather than 'ss'.
The same issue affects ß itself (U+00DF → 'ss' in full case fold) and
Unicode ligatures (fi → 'fi', ff → 'ff', etc.).

The CommonMark spec (§5.6) requires Unicode case folding when matching
link reference labels, so [ẞ] must match [SS]: /url (example 540).
Previously it did not.

Fix: Add normalizeLabel() in helpers.ts that applies toLowerCase() then
replaces the Unicode F-status multi-character case fold pairs. Use it
in Tokenizer.def and Tokenizer.reflink in place of plain toLowerCase().

Fixes CommonMark spec example 540.
Copilot AI review requested due to automatic review settings June 17, 2026 11:12
@vercel

vercel Bot commented Jun 17, 2026

@JSap0914 is attempting to deploy a commit to the MarkedJS Team on Vercel.

A member of the Team first needs to authorize it.

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@vercel

vercel Bot commented Jun 18, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project          Deployment   Actions            Updated (UTC)
marked-website   Ready        Preview, Comment   Jun 18, 2026 4:09am

Request Review

@UziTech UziTech left a comment

Member

Thanks! 💯

@UziTech UziTech requested review from calculuschild and styfle June 18, 2026 04:11
Comment thread src/helpers.ts
.replace(/fl/g, 'fl') // U+FB02
.replace(/ffi/g, 'ffi') // U+FB03
.replace(/ffl/g, 'ffl') // U+FB04
.replace(/ſt|st/g, 'st'); // U+FB05, U+FB06

Member

Is this list exhaustive?

I'm wondering if we can use a built-in API like Intl or perhaps label.normalize('NFKC') instead?
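
To make the trade-off behind this question concrete, here is a small sketch of how the built-ins behave on the characters discussed in this PR. It is only an illustration: `caseInsensitiveKey` is a hypothetical name, and the lowercase-then-uppercase trick is offered as one possible built-in-only alternative, not as something the PR or marked uses. It is also not guaranteed to equal the Unicode case fold for every code point.

```typescript
// normalize('NFKC') decomposes compatibility characters such as ligatures...
const nfkcLigature = '\uFB01'.normalize('NFKC'); // 'fi'
// ...but leaves ß alone, because ß has no decomposition mapping.
const nfkcSharpS = '\u00DF'.normalize('NFKC'); // still 'ß'
// It also rewrites characters unrelated to case, such as superscripts.
const nfkcSuperscript = 'x\u00B2'.normalize('NFKC'); // 'x2'

// Built-in-only alternative: toUpperCase() applies multi-character uppercase
// mappings (ß → SS, fi → FI), so lowercasing and then uppercasing yields a
// comparison key without a hand-written replacement list.
function caseInsensitiveKey(label: string): string {
  return label.toLowerCase().toUpperCase();
}

console.log(nfkcLigature, nfkcSharpS, nfkcSuperscript);
console.log(caseInsensitiveKey('\u1E9E') === caseInsensitiveKey('SS'));
```

So NFKC on its own would handle the ligatures but not ẞ or ß, and it would additionally merge labels that differ only by compatibility formatting, which goes beyond what case folding does.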

Comment thread src/helpers.ts
// Apply Unicode full case fold before toLowerCase so that
// ẞ (U+1E9E) → ss and ß (U+00DF) → ss (simple fold maps ẞ to ß first).
.toLowerCase()
.replace(/ß/g, 'ss') // U+00DF + folded U+1E9E

Member

I see the test for this first one, but do we have tests for the others?
