HTML Decoder: Replace system-dependent ctype check with ASCII byte comparison by sirreal · Pull Request #12286 · WordPress/wordpress-develop

sirreal · 2026-06-23T14:43:17Z

Part of decoding HTML named character references in attribute values may involve checking the codepoint immediately following the named character reference:

13.2.5.73 Named character reference state
…

If there is a [named character reference] match

If the character reference was consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next input character is either a U+003D EQUALS SIGN character (=) or an ASCII alphanumeric, then, for historical reasons, flush code points consumed as a character reference and switch to the return state.

The ASCII alphanumeric check was implemented using ctype_alnum(). Its behavior depends on the host system and the locale. On my system (macOS) it returns true for bytes outside of the desired ASCII alphanumeric range.

php -r 'echo ctype_alnum( "\xC2" ) ? "Affected" : "Unaffected";'
# Affected

This change compares the following byte with the well-defined ASCII alphanumeric ranges from the HTML specification.
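A minimal sketch of such a locale-independent check. The helper name `is_ascii_alphanumeric_byte()` is hypothetical and is not taken from the patch; it only illustrates comparing a single byte against the three ranges the HTML specification calls ASCII alphanumeric (0-9, A-Z, a-z).

```php
<?php
/**
 * Hypothetical helper for illustration; not the function from the patch.
 * Returns true only for the bytes 0-9, A-Z, and a-z, regardless of locale.
 */
function is_ascii_alphanumeric_byte( string $byte ): bool {
	if ( 1 !== strlen( $byte ) ) {
		return false;
	}

	// Compare the numeric byte value so no locale or string-juggling rules apply.
	$code = ord( $byte );

	return ( $code >= 0x30 && $code <= 0x39 )  // 0-9
		|| ( $code >= 0x41 && $code <= 0x5A )  // A-Z
		|| ( $code >= 0x61 && $code <= 0x7A ); // a-z
}

var_dump( is_ascii_alphanumeric_byte( 'a' ) );    // bool(true)
var_dump( is_ascii_alphanumeric_byte( "\xC2" ) ); // bool(false)
```

Unlike ctype_alnum(), the result here cannot vary between systems because only integer comparisons are involved.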

This change also restructures the method slightly so that it aligns clearly with the specification and returns early, which avoids the byte comparison in the majority of cases.
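The early-return structure might look roughly like the sketch below. The function name, signature, and parameter names are assumptions made for illustration and do not reproduce the method in the patch.

```php
<?php
/**
 * Hypothetical sketch: should a matched named character reference be left
 * as plain text? Name and signature are assumptions, not the patch's code.
 *
 * @param bool   $in_attribute      Whether decoding happens inside an attribute value.
 * @param bool   $ends_in_semicolon Whether the matched reference ended with ";".
 * @param string $text              The full input text.
 * @param int    $next_at           Byte offset just past the matched reference.
 */
function must_leave_undecoded( bool $in_attribute, bool $ends_in_semicolon, string $text, int $next_at ): bool {
	// Early return: outside attributes, with a terminating semicolon, or at the
	// end of the input, the special case cannot apply and no byte is inspected.
	if ( ! $in_attribute || $ends_in_semicolon || $next_at >= strlen( $text ) ) {
		return false;
	}

	$next = ord( $text[ $next_at ] );

	return 0x3D === $next                      // "="
		|| ( $next >= 0x30 && $next <= 0x39 )  // 0-9
		|| ( $next >= 0x41 && $next <= 0x5A )  // A-Z
		|| ( $next >= 0x61 && $next <= 0x7A ); // a-z
}

var_dump( must_leave_undecoded( true, false, '&notme', 4 ) ); // bool(true)
var_dump( must_leave_undecoded( true, false, '&not', 4 ) );   // bool(false)
```

Most references end in a semicolon or appear outside attributes, so in this sketch the first condition handles them without reading the following byte.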

Trac ticket: https://core.trac.wordpress.org/ticket/65372

Use of AI Tools

AI assistance: Yes
Tool(s): Claude Opus 5.8, Codex GPT 5.5
Used for: Fuzz testing and discovery, fix draft, refinement, review.

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

github-actions · 2026-06-23T14:43:28Z

Hi there! 👋

Thank you for your contribution to WordPress! 💖

It looks like this is your first pull request to wordpress-develop. Here are a few things to be aware of that may help you out!

No one monitors this repository for new pull requests. Pull requests must be attached to a Trac ticket to be considered for inclusion in WordPress Core. To attach a pull request to a Trac ticket, please include the ticket's full URL in your pull request description.

Pull requests are never merged on GitHub. The WordPress codebase continues to be managed through the SVN repository that this GitHub repository mirrors. Please feel free to open pull requests to work on any contribution you are making.

More information about how GitHub pull requests can be used to contribute to WordPress can be found in the Core Handbook.

Please include automated tests. Including tests in your pull request is one way to help your patch be considered faster. To learn about WordPress' test suites, visit the Automated Testing page in the handbook.

If you have not had a chance, please review the Contribute with Code page in the WordPress Core Handbook.

The Developer Hub also documents the various coding standards that are followed.

Thank you,
The WordPress Project

sirreal · 2026-06-23T14:46:23Z

+		$locale_candidates = array(
+			'C.UTF-8',
+			'C.utf8',
+			'en_US.UTF-8',
+			'en_US.utf8',
+			'en_GB.UTF-8',
+			'en_GB.utf8',
+		);


I don't know whether it's worth checking multiple locales, or whether all of these locales are likely to behave the same on a given system. For example, my system has the issue with "C.UTF-8", the other .UTF-8 locales listed here, and more.
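One way to answer that question on a given machine is to probe each candidate directly. The sketch below is an assumption about how such a probe could be written, not code from the test suite; the function name `probe_ctype_alnum_locales()` is hypothetical, and the results depend entirely on the host system and PHP build.

```php
<?php
/**
 * Hypothetical probe: for each candidate locale, report whether it is
 * available and whether ctype_alnum() accepts the non-ASCII byte 0xC2.
 *
 * @param string[] $candidates Locale names to try.
 * @return array<string, string> Map of locale name to "unavailable", "affected", or "unaffected".
 */
function probe_ctype_alnum_locales( array $candidates ): array {
	// Passing "0" returns the current locale without changing it.
	$original = setlocale( LC_CTYPE, '0' );
	$results  = array();

	foreach ( $candidates as $locale ) {
		// setlocale() returns false when the locale is not installed.
		if ( false === setlocale( LC_CTYPE, $locale ) ) {
			$results[ $locale ] = 'unavailable';
			continue;
		}

		$results[ $locale ] = ctype_alnum( "\xC2" ) ? 'affected' : 'unaffected';
	}

	// Restore the locale that was active before probing.
	if ( false !== $original ) {
		setlocale( LC_CTYPE, $original );
	}

	return $results;
}

print_r( probe_ctype_alnum_locales( array( 'C', 'C.UTF-8', 'en_US.UTF-8', 'en_GB.UTF-8' ) ) );
```

If every available locale reports the same status on a system, checking a single one in the test would likely be enough there.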

github-actions · 2026-06-23T14:47:18Z

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

sirreal · 2026-06-23T14:49:37Z

+	 *
+	 * @ticket 65372
+	 */
+	public function test_semicolonless_legacy_reference_before_multibyte_attribute_follower( string $encoded_attribute_value, string $expected, string $expected_decode, int $expected_byte_length ): void {


This is the test that fails on trunk depending on the system.


sirreal · 2026-06-23T14:52:26Z

I'm trying a revert of the ctype_alnum() change to see if there are any failures on CI.

On my system, I get these failures from one of the new tests:

1) Tests_HtmlApi_WpHtmlDecoder::test_semicolonless_legacy_reference_before_multibyte_attribute_follower with data set #0 ('&copy¯\_(ツ)_/¯', '©¯\_(ツ)_/¯', '©', 5)
Failed to decode the full attribute value as expected.
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'©¯\_(ツ)_/¯'
+'&copy¯\_(ツ)_/¯'

tests/phpunit/tests/html-api/wpHtmlDecoder.php:121

2) Tests_HtmlApi_WpHtmlDecoder::test_semicolonless_legacy_reference_before_multibyte_attribute_follower with data set #1 ('&notಠ_ಠ', '¬ಠ_ಠ', '¬', 4)
Failed to decode the full attribute value as expected.
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'¬ಠ_ಠ'
+'&notಠ_ಠ'

tests/phpunit/tests/html-api/wpHtmlDecoder.php:121

3) Tests_HtmlApi_WpHtmlDecoder::test_semicolonless_legacy_reference_before_multibyte_attribute_follower with data set #2 ('&nbsp£20', ' £20', ' ', 5)
Failed to decode the full attribute value as expected.
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-' £20'
+'&nbsp£20'

tests/phpunit/tests/html-api/wpHtmlDecoder.php:121

4) Tests_HtmlApi_WpHtmlDecoder::test_semicolonless_legacy_reference_before_multibyte_attribute_follower with data set #3 ('&nbsp🎉', ' 🎉', ' ', 5)
Failed to decode the full attribute value as expected.
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-' 🎉'
+'&nbsp🎉'

tests/phpunit/tests/html-api/wpHtmlDecoder.php:121

5) Tests_HtmlApi_WpHtmlDecoder::test_semicolonless_legacy_reference_before_multibyte_attribute_follower with data set #4 ('&reg™', '®™', '®', 4)
Failed to decode the full attribute value as expected.
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'®™'
+'&reg™'

tests/phpunit/tests/html-api/wpHtmlDecoder.php:121

FAILURES!
Tests: 115, Assertions: 331, Failures: 5.
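To see why these inputs trigger the problem, it can help to look at the byte that immediately follows each reference. The sketch below uses the five data sets from the output above, written with `\u{...}` escapes so it does not depend on the file's encoding. The helper name `describe_follower()` is hypothetical, and reading the integer in each data set as the byte length of the matched reference is an assumption based on the parameter name `$expected_byte_length`.

```php
<?php
/**
 * Hypothetical helper: report the byte length of a semicolonless reference
 * and the first byte that follows it in the attribute value.
 */
function describe_follower( string $reference, string $follower ): string {
	$value  = $reference . $follower;
	$offset = strlen( $reference );

	return sprintf( '%s is %d bytes; next byte is 0x%02X', $reference, $offset, ord( $value[ $offset ] ) );
}

$cases = array(
	array( '&copy', "\u{00AF}" ),  // MACRON
	array( '&not', "\u{0CA0}" ),   // KANNADA LETTER TTHA
	array( '&nbsp', "\u{00A3}" ),  // POUND SIGN
	array( '&nbsp', "\u{1F389}" ), // PARTY POPPER
	array( '&reg', "\u{2122}" ),   // TRADE MARK SIGN
);

foreach ( $cases as list( $reference, $follower ) ) {
	echo describe_follower( $reference, $follower ), "\n";
}
// &copy is 5 bytes; next byte is 0xC2
// &not is 4 bytes; next byte is 0xE0
// &nbsp is 5 bytes; next byte is 0xC2
// &nbsp is 5 bytes; next byte is 0xF0
// &reg is 4 bytes; next byte is 0xE2
```

Every follower starts with a byte at or above 0x80, which is outside ASCII. A locale-dependent check that accepts such bytes treats them as alphanumeric and leaves the reference undecoded, matching the failures shown.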

github-actions · 2026-06-23T15:07:52Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell, dmsnell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

dmsnell · 2026-06-23T18:58:32Z

+		 * - In _attribute context_, "&not" decodes to "¬". Condition 3 is not satisfied
+		 *   because there is no following code point to consider.
+		 * - In _attribute context_, "&notme" decodes to "&notme" unchanged because it
+		 *   satisfies all three conditions above.


your expanded discussion is really helpful, but I find the inversion of logic really hard to follow. before we had “allowed under these circumstances” and now we have “not allowed when these circumstances are not met”

The “ambiguous” language might be helpful both to match the spec and to explain the intent behind the rule. the intent is that we are determining if it was likely that the missing semicolon was a typo vs. something never intended to be a character reference: it’s ambiguous.

For example:

Condition 3 is not satisfied

The reference is not not-rendered because a condition is not satisfied.

Perhaps phrasing could be more affirmative in describing what does happen.

In attribute context, "&not己" decodes to "¬己" because the character in the place of the missing semicolon is distinctly separate from the name; it is neither an ASCII alphanumeric nor an equals sign.

If we are going to expand this so much, we might also consider explaining the other conditions, the ambiguous ones, to highlight why the rule is here. Specifically I see no mention of URL query arguments, which explains the equals sign.

Please notify all future &current students.

https://website.domain/search?q=html&not=regex

So these two cases I think capture the “error-handling” aspect and might clarify the complicated rules. I think the essence is that everything here is complicated to try and avoid these two cases.
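The two cases can be tried with a toy decoder. This is a sketch under stated assumptions: `toy_decode()` is a hypothetical function that knows only three hand-picked semicolon-optional names (`curren`, `copy`, `not`), whereas a real decoder uses the full named character reference table and longest-match lookup.

```php
<?php
/**
 * Toy decoder for illustration only; not the WordPress implementation.
 * Decodes three semicolon-optional names and applies the attribute rule.
 */
function toy_decode( string $text, bool $in_attribute ): string {
	$table = array(
		'curren' => "\u{00A4}",
		'copy'   => "\u{00A9}",
		'not'    => "\u{00AC}",
	);

	// Group 3 sits inside a lookahead: it captures the next byte (or an
	// empty string at the end of the input) without consuming it.
	return preg_replace_callback(
		'/&(curren|copy|not)(;?)(?=(.?))/s',
		static function ( array $m ) use ( $table, $in_attribute ): string {
			$missing_semicolon  = '' === $m[2];
			$ambiguous_follower = 1 === preg_match( '/^[=0-9A-Za-z]$/D', $m[3] );

			// Attribute special case: leave the text exactly as written.
			if ( $in_attribute && $missing_semicolon && $ambiguous_follower ) {
				return $m[0];
			}

			return $table[ $m[1] ];
		},
		$text
	);
}

$sentence = 'Please notify all future &current students.';
$url      = 'https://website.domain/search?q=html&not=regex';

echo toy_decode( $sentence, true ), "\n";  // unchanged
echo toy_decode( $sentence, false ), "\n"; // "&curren" becomes U+00A4, leaving "¤t"
echo toy_decode( $url, true ), "\n";       // unchanged, the query string survives
echo toy_decode( $url, false ), "\n";      // "&not" becomes U+00AC before "=regex"
```

In this sketch the attribute context leaves both inputs untouched, while the non-attribute context corrupts them, which is the breakage the rule exists to avoid.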

I struggled a lot with the language here. Clarifying what's happening and aligning with the spec was one of my goals.

I intentionally removed "ambiguous" because it adds to confusion here. These cases have nothing to do with the ambiguous ampersand state. That state is entered whenever & followed by an ASCII alphanumeric does not lead to a named character reference match. At this point, a match has already been made but some special rules prevent it from being applied.

ambiguous ampersand state

The ambiguous ampersand state is entered when a named character reference is expected but there's no match, for example &absurd;. The flow is like this:

flowchart TD
    Data["Data state"] -->|"U+0026 AMPERSAND return state = Data"| CR
    RCDATA["RCDATA state"] -->|"U+0026 AMPERSAND return state = RCDATA"| CR
    AttrDQ["Attribute value (double-quoted) state"] -->|"U+0026 AMPERSAND return state = double-quoted attr value"| CR
    AttrSQ["Attribute value (single-quoted) state"] -->|"U+0026 AMPERSAND return state = single-quoted attr value"| CR
    AttrUQ["Attribute value (unquoted) state"] -->|"U+0026 AMPERSAND return state = unquoted attr value"| CR
    CR["Character reference state temporary buffer = U+0026"] -->|"ASCII alphanumeric reconsume"| NCR["Named character reference state"]
    NCR -->|"No named character reference match flush consumed code points"| AA["Ambiguous ampersand state"]


So in our example, & enters the character reference state, then &a enters the named character reference state. At that point it fails to match a named character reference, so it flushes and enters the ambiguous ampersand state.

This is the relevant part of the spec (bold mine for the relevant section):

If the character reference was consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next input character is either a U+003D EQUALS SIGN character (=) or an ASCII alphanumeric, then, for historical reasons, flush code points consumed as a character reference and switch to the return state.

Otherwise:

If the last character matched is not a U+003B SEMICOLON character (;), then this is a missing-semicolon-after-character-reference parse error.

Set the temporary buffer to the empty string. Append one or two characters corresponding to the character reference name (as given by the second column of the named character references table) to the temporary buffer.

Flush code points consumed as a character reference. Switch to the return state.

That bold section is what I've tried to capture in these notes.
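The distinction between the two paths can be sketched as a small classifier. Everything here is an illustrative assumption: `classify_reference()` is a hypothetical function, its two-name table stands in for the full named character reference table, and the input is assumed to start with & followed by an ASCII alphanumeric.

```php
<?php
/**
 * Hypothetical classifier: which path does the start of $text take?
 *
 * - "ambiguous-ampersand": no named character reference matched.
 * - "left-undecoded":      a match was made, but the attribute rule suppresses it.
 * - "decoded":             a match was made and is applied.
 */
function classify_reference( string $text, bool $in_attribute ): string {
	// Two-name stand-in for the real table; both names may omit the semicolon.
	if ( 1 !== preg_match( '/^&(not|copy)(;?)(.?)/s', $text, $m ) ) {
		return 'ambiguous-ampersand';
	}

	$has_semicolon = ';' === ( $m[2] ?? '' );
	$next          = $m[3] ?? '';

	if ( $in_attribute && ! $has_semicolon && 1 === preg_match( '/^[=0-9A-Za-z]$/D', $next ) ) {
		return 'left-undecoded';
	}

	return 'decoded';
}

echo classify_reference( '&absurd;', true ), "\n"; // ambiguous-ampersand
echo classify_reference( '&notme', true ), "\n";   // left-undecoded
echo classify_reference( '&notme', false ), "\n";  // decoded
echo classify_reference( '&not', true ), "\n";     // decoded
```

The first two outcomes leave the text looking the same, but only the first involves the ambiguous ampersand state; in the second a match exists and is deliberately not applied.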

I've reworked the comment and added some details about URLs and why the special cases are helpful.

sirreal and others added 18 commits June 13, 2026 00:01

Fix attribute legacy reference follower checks

f92da80

Merge remote-tracking branch 'upstream/trunk' into HEAD

2876d8a

Fix coding standards for decoder legacy follower checks

8bb66dc

Merge branch 'trunk' into fix/html-decoder-legacy-follower-ascii

cc0d43a

Add ticket number

2187e33

Tests: Broaden UTF-8 locale candidates for HTML decoder

e6e98c7

Improve tests

f9c7d74

Improve test logic

ac0d842

Test fixups

bd5566e

Fix test language

65e3937

Fix test description

cb3b277

Rework decoding bail to match spec, improve perf and clarity

deccaae

clean up language

63a2fc1

Improve comment

e0d7a45

Tighten up spec

a2a9357

Merge branch 'trunk' into fix/html-decoder-legacy-follower-ascii

46c8153

Fix data provider test name

998be41

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Fix tests phpdoc

26a8d10

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

sirreal changed the title ~~HTML Decoder: Replace system-dependent ctype check ASCII byte comparison~~ HTML Decoder: Replace system-dependent ctype check with ASCII byte comparison Jun 23, 2026

sirreal commented Jun 23, 2026


sirreal added 2 commits June 23, 2026 16:50

Revert ctype_alnum() fix

e2ed016

Revert "Revert ctype_alnum() fix"

118ef06

This reverts commit e2ed016.

sirreal mentioned this pull request Jun 23, 2026

HTML API: Decode semicolonless legacy references before non-ASCII attribute followers sirreal/wordpress-develop#65

Closed

sirreal marked this pull request as ready for review June 23, 2026 15:07

sirreal requested a review from dmsnell June 23, 2026 15:07

sirreal added 2 commits June 23, 2026 19:37

Improve language and add examples

3040e14

Improve documentation

b55f333

dmsnell reviewed Jun 23, 2026


sirreal added 3 commits June 25, 2026 12:55

Merge branch 'trunk' into fix/html-decoder-legacy-follower-ascii

85028e0

Improve comment clarity about attribute special case

ebeb091

Add clause about URL query strings

d3d88f0

sirreal requested a review from dmsnell June 25, 2026 12:02

HTML Decoder: Replace system-dependent ctype check with ASCII byte comparison #12286
sirreal wants to merge 25 commits into WordPress:trunk from sirreal:fix/html-decoder-legacy-follower-ascii
