Skip to content

HTML API: Decode semicolonless legacy references before non-ASCII attribute followers#65

Closed
sirreal wants to merge 19 commits into
trunkfrom
fix/html-decoder-legacy-follower-ascii
Closed

HTML API: Decode semicolonless legacy references before non-ASCII attribute followers#65
sirreal wants to merge 19 commits into
trunkfrom
fix/html-decoder-legacy-follower-ascii

Conversation

@sirreal

@sirreal sirreal commented Jun 12, 2026

Copy link
Copy Markdown
Owner

Summary

  • Decode semicolonless legacy named references before non-ASCII attribute followers.
  • Replace locale-sensitive ctype_alnum() checks with explicit ASCII byte checks.
  • Preserve the HTML ambiguity rule for ASCII alphanumeric and = followers.

Testing

  • Focused decoder PHPUnit coverage for non-ASCII and ASCII follower cases passes.
  • PHPCS alignment fixes are committed.
  • codex review --base trunk.

Trac ticket: https://core.trac.wordpress.org/ticket/65372

Use of AI Tools

AI assistance: Yes
Tool(s): Codex
Model(s): GPT-5.5
Used for: Splitting the fuzzer-discovered fix, PR description cleanup, and code review.


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

@sirreal sirreal force-pushed the fix/html-decoder-legacy-follower-ascii branch from 7169693 to f92da80 Compare June 12, 2026 22:01
@sirreal

sirreal commented Jun 12, 2026

Copy link
Copy Markdown
Owner Author

I have reproduced this, the reproduction prints:

string(18) "26416163757465c280"

@sirreal sirreal marked this pull request as ready for review June 12, 2026 22:11
@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell, dmsnell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@dmsnell dmsnell left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs Trac ticket references and WPCS will whine if whitespace isn’t adjusted. Maybe @copilot can fix these.

Giving it a preemptive approval. Thanks

@sirreal

sirreal commented Jun 23, 2026

Copy link
Copy Markdown
Owner Author

This seems to be platform independent. I am able to reproduce in a macOS machine with current trunk @ 638a82c. The host machine reproduces the issue with some strings, but when running the same code in a Linux container the issue disappears.

Reproduction (system-dependent)
$affected_locale = setlocale( LC_CTYPE, "C.UTF-8" );
if ( false === $affected_locale || ! ctype_alnum( "\xC2" ) ) {
	die( "This platform does not have an affected LC_CTYPE locale.\n" );
} else {
	echo "Affected LC_CTYPE locale: $affected_locale\n";
}

foreach ( [
	[ "&copyÉ", 0 ],
	[ "Total&nbsp£20", 5 ],
	[ "Shipped&nbsp🎉", 7 ],
	[ "ACME&reg™ widget", 4 ],
] as $item ) {
	[ $raw, $at ] = $item;
	echo "===\n";
	var_dump( $raw, bin2hex( $raw ) );

	$decoded = WP_HTML_Decoder::decode_attribute( $raw );
	var_dump( $decoded, @bin2hex( $decoded ) );
	$match_byte_length = null;
	$reference = WP_HTML_Decoder::read_character_reference( "attribute", $raw, $at, $match_byte_length );
	var_dump( $reference, @bin2hex( $reference ) );
	var_dump( $match_byte_length );
}
AI debugging summary

The ctype_alnum follower check is locale-dependent and the spec wants specifically ASCII alphanumeric, so fixing it is correct. But the test vector &Aacute\xC2\x80 (follower = U+0080, a C1 control) is the least realistic non-ASCII follower and badly undersells the reach. Any semicolonless legacy reference glued to ordinary non-ASCII content misdecodes in an attribute. Verified outputs (no fix, affected locale):

&copyÉ. Dupont   → &copyÉ. Dupont      (want ©É. Dupont)
Total&nbsp£20    → Total&nbsp£20       (want Total £20)
Shipped&nbsp🎉   → Shipped&nbsp🎉      (want Shipped 🎉)
ACME&reg™ widget → ACME&reg™ widget    (want ACME®™ widget)

Platform dependence (the important part). This isn't "non-C locale" — it's libc-dependent:

  • glibc (Ubuntu 24.04, 2.39): UTF-8 locales are not affected; ctype_alnum("\xC2") is false. Only single-byte locales (ISO-8859-x) trigger it.
  • macOS / Darwin (BSD libc): UTF-8 locales are affected, including C.UTF-8 and en_US.UTF-8. C.UTF-8 is a mainstream default, so this is far from exotic.
  • C / POSIX: never affected (ASCII-only), both platforms.

Darwin's UTF-8 ctype classifies the follower byte by its Latin-1 identity: isalnum is true for the Latin-1 letters (ª µ º, À–Ö, Ø–ö, ø–ÿ), with the only printable-range gaps at × (0xD7) and ÷ (0xF7). Confirmed at the C level (isalnum), so it's libc, not PHP.

Consequence — a clean asymmetry: in valid UTF-8 the only non-ASCII follower that escapes the bug on Darwin is a 0xD7 lead byte, i.e. Hebrew (U+05C0–U+05FF), because 0xD7 is × in Latin-1. Latin-accented, Greek, Cyrillic, Arabic, CJK, emoji, dashes, quotes, currency all trigger it. End to end:

read_character_reference("attribute", "&copy\xC3\x89", …)  → not decoded   (É, lead 0xC3)
read_character_reference("attribute", "&copy\xD7\x90", …)  → © decoded     (א, lead 0xD7)

Open / unverified:

  • FreeBSD: Darwin's ctype is FreeBSD-derived, so almost certainly identical — and unlike macOS, FreeBSD actually serves PHP in production. Not tested.
  • musl (Alpine): believed ASCII-only / unaffected; not tested.
  • Reachability: depends on the serving SAPI's active LC_CTYPE. CLI here reports C.UTF-8 (so wp-cli on macOS is already live); php-fpm/mod_php often start in plain C, leaving it latent. Not measured per-SAPI. That said, it's a latent bug on every platform — ctype_alnum doesn't implement "ASCII alphanumeric" anywhere — so platform reach is about urgency, not correctness.

Testing: assert the new predicate on high bytes directly (0xC2 ⇒ not ambiguous) — deterministic everywhere. Avoid a locale-forced integration test: it's flaky, skips across the matrix, and one that doesn't force an affected locale passes on glibc CI even without the fix, so it guards nothing.

@sirreal

sirreal commented Jun 23, 2026

Copy link
Copy Markdown
Owner Author

Playground example

Input:

<p title="&copy¯\_(ツ)_/¯">&copy¯\_(ツ)_/¯
<p title="&notಠ_ಠ">&notಠ_ಠ
<p title="Total&nbsp£20">Total&nbsp£20
<p title="Shipped&nbsp🎉">Shipped&nbsp🎉
<p title="ACME&reg™ widget">ACME&reg™ widget

Expected

├─P title="©¯\_(ツ)_/¯"
│ └─#text ©¯\_(ツ)_/¯ 
├─P title="¬ಠ_ಠ"
│ └─#text ¬ಠ_ಠ 
├─P title="Total £20"
│ └─#text Total £20 
├─P title="Shipped 🎉"
│ └─#text Shipped 🎉 
└─P title="ACME®™ widget"
  └─#text ACME®™ widget

HTML API processed (Linux / trunk)

(exact match)

HTML API processed (macOS / trunk)

├─P title="&copy¯\_(ツ)_/¯"
│ └─#text ©¯\_(ツ)_/¯ 
├─P title="&notಠ_ಠ"
│ └─#text ¬ಠ_ಠ 
├─P title="Total&nbsp£20"
│ └─#text Total £20 
├─P title="Shipped&nbsp🎉"
│ └─#text Shipped 🎉 
└─P title="ACME&reg™ widget"
  └─#text ACME®™ widget

HTML API processed (macOS / trunk)

(Fixed / expected behavior)

├─P title="©¯\_(ツ)_/¯"
│ └─#text ©¯\_(ツ)_/¯ 
├─P title="¬ಠ_ಠ"
│ └─#text ¬ಠ_ಠ 
├─P title="Total £20"
│ └─#text Total £20 
├─P title="Shipped 🎉"
│ └─#text Shipped 🎉 
└─P title="ACME®™ widget"
  └─#text ACME®™ widget

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the HTML API’s named character reference parsing to ensure semicolonless legacy references are decoded in attributes when followed by non-ASCII bytes, while preserving the HTML ambiguity rule for = and ASCII alphanumerics.

Changes:

  • Adjusts WP_HTML_Decoder::read_character_reference() to replace locale-sensitive ctype_alnum() ambiguity detection with explicit ASCII byte checks.
  • Ensures semicolonless legacy named references decode before non-ASCII (e.g., UTF-8) attribute followers.
  • Adds PHPUnit coverage for both the non-ASCII follower decode behavior and the ASCII/= ambiguity no-decode behavior (including locale probing to reproduce problematic ctype_alnum() behavior where available).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
tests/phpunit/tests/html-api/wpHtmlDecoder.php Adds targeted tests for semicolonless legacy reference decoding/ambiguity behavior in attributes, including locale handling.
src/wp-includes/html-api/class-wp-html-decoder.php Implements ASCII-only follower checks for ambiguity handling in attribute context, avoiding locale-dependent ctype_alnum().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/phpunit/tests/html-api/wpHtmlDecoder.php
Comment thread tests/phpunit/tests/html-api/wpHtmlDecoder.php Outdated
Comment thread tests/phpunit/tests/html-api/wpHtmlDecoder.php Outdated
sirreal and others added 3 commits June 23, 2026 16:18
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread tests/phpunit/tests/html-api/wpHtmlDecoder.php
Comment thread tests/phpunit/tests/html-api/wpHtmlDecoder.php Outdated
@sirreal

sirreal commented Jun 23, 2026

Copy link
Copy Markdown
Owner Author

WordPress#12286

@sirreal sirreal closed this Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants