HTML API: Decode semicolonless legacy references before non-ASCII attribute followers #65
sirreal wants to merge 19 commits into
I have reproduced this. The reproduction prints:
This seems to be platform independent. I am able to reproduce on a macOS machine with current

Reproduction (system-dependent)

    // Probe for a locale in which ctype_alnum() treats a high byte as alphanumeric.
    $affected_locale = setlocale( LC_CTYPE, "C.UTF-8" );
    if ( false === $affected_locale || ! ctype_alnum( "\xC2" ) ) {
        die( "This platform does not have an affected LC_CTYPE locale.\n" );
    } else {
        echo "Affected LC_CTYPE locale: $affected_locale\n";
    }
    // Each item is [ raw attribute value, byte offset of the "&" ].
    foreach ( [
        [ "&copyÉ", 0 ],
        [ "Total&nbsp£20", 5 ],
        [ "Shipped&nbsp🎉", 7 ],
        [ "ACME&reg™ widget", 4 ],
    ] as $item ) {
        [ $raw, $at ] = $item;
        echo "===\n";
        var_dump( $raw, bin2hex( $raw ) );
        $decoded = WP_HTML_Decoder::decode_attribute( $raw );
        var_dump( $decoded, @bin2hex( $decoded ) );
        // read_character_reference() may return null, hence the silenced bin2hex().
        $match_byte_length = null;
        $reference = WP_HTML_Decoder::read_character_reference( "attribute", $raw, $at, $match_byte_length );
        var_dump( $reference, @bin2hex( $reference ) );
        var_dump( $match_byte_length );
    }

AI debugging summary

The

Platform dependence (the important part). This isn't "non-
Darwin's UTF-8 ctype classifies the follower byte by its Latin-1 identity:

Consequence (a clean asymmetry): in valid UTF-8 the only non-ASCII follower that escapes the bug on Darwin is a

Open / unverified:

Testing: assert the new predicate on high bytes directly (
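To see how a given platform classifies high bytes, a small probe along the following lines may help. It is only a sketch: the function name `high_bytes_classified_alnum()` is hypothetical and not part of the patch, it assumes the `ctype` extension is loaded, and its result depends on the platform and on which locales are installed.

```php
<?php
/**
 * Hypothetical diagnostic helper (not part of WordPress or this PR).
 *
 * Returns the byte values in 0x80-0xFF that ctype_alnum() reports as
 * alphanumeric under the requested LC_CTYPE locale, or null if that
 * locale cannot be set on this system.
 */
function high_bytes_classified_alnum( string $locale ): ?array {
    // Passing "0" queries the current locale without changing it.
    $previous = setlocale( LC_CTYPE, '0' );

    if ( false === setlocale( LC_CTYPE, $locale ) ) {
        return null;
    }

    $hits = array();
    for ( $byte = 0x80; $byte <= 0xFF; $byte++ ) {
        if ( ctype_alnum( chr( $byte ) ) ) {
            $hits[] = $byte;
        }
    }

    // Restore whatever locale was active before the probe.
    if ( false !== $previous ) {
        setlocale( LC_CTYPE, $previous );
    }

    return $hits;
}

$result = high_bytes_classified_alnum( 'C.UTF-8' );
if ( null === $result ) {
    echo "C.UTF-8 is not available here.\n";
} else {
    echo count( $result ), " high bytes are classified as alphanumeric.\n";
}
```

A non-empty result for a locale means a locale-sensitive check could mistake the lead byte of a UTF-8 sequence for an ambiguous follower on that system.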
Input:

    <p title="&copy¯\_(ツ)_/¯">&copy¯\_(ツ)_/¯
    <p title="&notಠ_ಠ">&notಠ_ಠ
    <p title="Total&nbsp£20">Total&nbsp£20
    <p title="Shipped&nbsp🎉">Shipped&nbsp🎉
    <p title="ACME&reg™ widget">ACME&reg™ widget

Expected

HTML API processed (Linux /
Pull request overview
This PR updates the HTML API’s named character reference parsing to ensure semicolonless legacy references are decoded in attributes when followed by non-ASCII bytes, while preserving the HTML ambiguity rule for = and ASCII alphanumerics.
Changes:
- Adjusts `WP_HTML_Decoder::read_character_reference()` to replace locale-sensitive `ctype_alnum()` ambiguity detection with explicit ASCII byte checks.
- Ensures semicolonless legacy named references decode before non-ASCII (e.g., UTF-8) attribute followers.
- Adds PHPUnit coverage for both the non-ASCII follower decode behavior and the ASCII/`=` ambiguity no-decode behavior (including locale probing to reproduce problematic `ctype_alnum()` behavior where available).
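As a rough illustration of what an explicit ASCII byte check can look like, here is a minimal sketch. The helper name `is_ambiguous_attribute_follower()` is hypothetical and is not taken from the patch; the actual change in `class-wp-html-decoder.php` may be structured differently. The rule it encodes is the HTML one summarized above: inside an attribute, a semicolonless reference is left undecoded only when the next character is `=` or an ASCII alphanumeric.

```php
<?php
/**
 * Hypothetical helper (not the code from this PR).
 *
 * Reports whether the byte at offset $at would make a semicolonless
 * named character reference ambiguous inside an attribute value.
 * Only "=" and ASCII letters/digits count; the result does not
 * depend on the LC_CTYPE locale.
 */
function is_ambiguous_attribute_follower( string $text, int $at ): bool {
    if ( $at < 0 || $at >= strlen( $text ) ) {
        return false; // Nothing follows the reference.
    }

    $byte = ord( $text[ $at ] );

    return (
        0x3D === $byte ||                     // "="
        ( $byte >= 0x30 && $byte <= 0x39 ) || // 0-9
        ( $byte >= 0x41 && $byte <= 0x5A ) || // A-Z
        ( $byte >= 0x61 && $byte <= 0x7A )    // a-z
    );
}

// "&copy" is 5 bytes long, so offset 5 is the follower byte.
var_dump( is_ambiguous_attribute_follower( "&copy\xC3\x89", 5 ) ); // bool(false): UTF-8 lead byte of "É"
var_dump( is_ambiguous_attribute_follower( "&copy=1", 5 ) );       // bool(false) would be wrong here; prints bool(true)
var_dump( is_ambiguous_attribute_follower( "&copyright", 5 ) );    // bool(true): ASCII letter
```

Because the comparison is on raw byte values, every byte in the range 0x80 to 0xFF yields `false` regardless of locale, which is the property the locale-sensitive check could not guarantee.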
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/phpunit/tests/html-api/wpHtmlDecoder.php | Adds targeted tests for semicolonless legacy reference decoding/ambiguity behavior in attributes, including locale handling. |
| src/wp-includes/html-api/class-wp-html-decoder.php | Implements ASCII-only follower checks for ambiguity handling in attribute context, avoiding locale-dependent ctype_alnum(). |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Summary
- `ctype_alnum()` checks with explicit ASCII byte checks.
- `=` followers.

Testing
- `codex review --base trunk`.

Trac ticket: https://core.trac.wordpress.org/ticket/65372
Use of AI Tools
AI assistance: Yes
Tool(s): Codex
Model(s): GPT-5.5
Used for: Splitting the fuzzer-discovered fix, PR description cleanup, and code review.
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.