HTML Decoder: Replace system-dependent ctype check with ASCII byte comparison #12286
sirreal wants to merge 25 commits into
| $locale_candidates = array( | ||
| 'C.UTF-8', | ||
| 'C.utf8', | ||
| 'en_US.UTF-8', | ||
| 'en_US.utf8', | ||
| 'en_GB.UTF-8', | ||
| 'en_GB.utf8', | ||
| ); |
I don't know whether it's worth checking multiple locales, or whether all of these locales are likely to behave the same way on a given system. For example, my system has the issue with "C.UTF-8", with the other .UTF-8 locales listed here, and with more besides.
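One way to explore that question is a small probe that walks the candidate list and stops at the first locale showing the behavior. This is only a sketch under assumptions: the function name `find_locale_with_non_ascii_alnum()` is hypothetical and not part of the patch, and the probe byte 0xE5 (the UTF-8 lead byte of "己") is just one convenient non-ASCII byte.

```php
<?php
/**
 * Illustrative probe, not part of the patch: returns the first candidate
 * locale under which ctype_alnum() accepts a non-ASCII byte, or null when
 * no candidate is installed or none shows the behavior.
 */
function find_locale_with_non_ascii_alnum( array $candidates ): ?string {
	// Passing "0" queries the current LC_CTYPE locale without changing it.
	$original = setlocale( LC_CTYPE, '0' );
	$affected = null;

	foreach ( $candidates as $candidate ) {
		// setlocale() returns false when the locale is unavailable.
		if ( false === setlocale( LC_CTYPE, $candidate ) ) {
			continue;
		}

		// 0xE5 is the lead byte of "己" in UTF-8; it is not ASCII.
		if ( ctype_alnum( "\xE5" ) ) {
			$affected = $candidate;
			break;
		}
	}

	// Restore the locale so the probe has no lasting side effect.
	if ( false !== $original ) {
		setlocale( LC_CTYPE, $original );
	}

	return $affected;
}

$affected = find_locale_with_non_ascii_alnum( array( 'C.UTF-8', 'en_US.UTF-8' ) );
```

If every installed UTF-8 locale behaves the same on a given system, the probe returns on the first available candidate, which would suggest a shorter list is enough. The result depends on the host, so no particular output is claimed here.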
| * | ||
| * @ticket 65372 | ||
| */ | ||
| public function test_semicolonless_legacy_reference_before_multibyte_attribute_follower( string $encoded_attribute_value, string $expected, string $expected_decode, int $expected_byte_length ): void { |
This is the test that fails on trunk depending on the system.
This reverts commit e2ed016.
|
I'm trying a revert of the On my system, I get these failures from one of the new tests: |
| * - In _attribute context_, "&not" decodes to "¬". Condition 3 is not satisfied | ||
| * because there is no following code point to consider. | ||
| * - In _attribute context_, "&notme" decodes to "&notme" unchanged because it | ||
| * satisfies all three conditions above. |
your expanded discussion is really helpful, but I find the inversion of logic really hard to follow. before we had “allowed under these circumstances” and now we have “not allowed when these circumstances are not met”
The “ambiguous” language might be helpful both to match the spec and to explain the intent behind the rule. the intent is that we are determining if it was likely that the missing semicolon was a typo vs. something never intended to be a character reference: it’s ambiguous.
For example:
Condition 3 is not satisfied
The reference is not not-rendered because a condition is not satisfied.
Perhaps phrasing could be more affirmative in describing what does happen.
In attribute context, "&not己" decodes to "¬己" because the character in the place of the missing semicolon is distinctly separate from the name; it is neither an ASCII alphanumeric nor an equals sign.
If we are going to expand this so much, we might also consider explaining the other conditions, the ambiguous ones, to highlight why the rule is here. Specifically I see no mention of URL query arguments, which explains the equals sign.
Please notify all future &current students.
https://website.domain/search?q=html&not=regex
So these two cases I think capture the “error-handling” aspect and might clarify the complicated rules. I think the essence is that everything here is complicated to try and avoid these two cases.
I struggled a lot with the language here. Clarifying what's happening and aligning with the spec was one of my goals.
I intentionally removed "ambiguous" because it adds to confusion here. These cases have nothing to do with the ambiguous ampersand state. That state is whenever & + an ASCII alphanumeric does not lead to a named character reference match. At this point, a match has already been made but some special rules prevent it from being applied.
ambiguous ampersand state
The ambiguous ampersand state is entered when a named character reference is expected but there's no match, for example &absurd;. The flow is like this:
flowchart TD
Data["Data state"] -->|"U+0026 AMPERSAND<br/>return state = Data"| CR
RCDATA["RCDATA state"] -->|"U+0026 AMPERSAND<br/>return state = RCDATA"| CR
AttrDQ["Attribute value<br/>(double-quoted) state"] -->|"U+0026 AMPERSAND<br/>return state = double-quoted attr value"| CR
AttrSQ["Attribute value<br/>(single-quoted) state"] -->|"U+0026 AMPERSAND<br/>return state = single-quoted attr value"| CR
AttrUQ["Attribute value<br/>(unquoted) state"] -->|"U+0026 AMPERSAND<br/>return state = unquoted attr value"| CR
CR["Character reference state<br/>temporary buffer = U+0026"] -->|"ASCII alphanumeric<br/>reconsume"| NCR["Named character reference state"]
NCR -->|"No named character reference match<br/>flush consumed code points"| AA["Ambiguous ampersand state"]
So in our example, & enters the character reference state, then &a enters the named character reference state. When the input then fails to match any named character reference, the tokenizer flushes the consumed code points and enters the ambiguous ampersand state.
This is the relevant part of the spec (bold mine for the relevant section):
If the character reference was consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next input character is either a U+003D EQUALS SIGN character (=) or an ASCII alphanumeric, then, for historical reasons, flush code points consumed as a character reference and switch to the return state.
Otherwise:
If the last character matched is not a U+003B SEMICOLON character (;), then this is a missing-semicolon-after-character-reference parse error.
Set the temporary buffer to the empty string. Append one or two characters corresponding to the character reference name (as given by the second column of the named character references table) to the temporary buffer.
Flush code points consumed as a character reference. Switch to the return state.
That bold section is what I've tried to capture in these notes.
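The quoted step can be sketched as a single decision function. This is an illustration under assumptions, not the patch's code: the name `should_decode_matched_reference()` and its parameters are hypothetical, and it presumes a named character reference has already been matched ending at byte offset `$reference_end`.

```php
<?php
/**
 * Illustrative sketch, not the patch's code: decides whether an already
 * matched named character reference should be decoded, following the
 * quoted step of the HTML specification.
 *
 * @param bool   $in_attribute  Whether the reference is in an attribute value.
 * @param string $text          Input text.
 * @param int    $reference_end Byte offset just past the matched reference.
 */
function should_decode_matched_reference( bool $in_attribute, string $text, int $reference_end ): bool {
	$length = strlen( $text );

	// A reference that ends in ";" is always decoded.
	if ( $reference_end > 0 && $reference_end <= $length && ';' === $text[ $reference_end - 1 ] ) {
		return true;
	}

	// Outside attributes, a missing semicolon is only a parse error.
	if ( ! $in_attribute ) {
		return true;
	}

	// Nothing follows the reference, so the historical exception cannot apply.
	if ( $reference_end >= $length ) {
		return true;
	}

	$next                  = $text[ $reference_end ];
	$ascii_alphanumerics   = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
	$is_ascii_alphanumeric = 1 === strspn( $next, $ascii_alphanumerics );

	// "=" or an ASCII alphanumeric: leave the text exactly as written.
	return ! ( '=' === $next || $is_ascii_alphanumeric );
}
```

With this sketch, `&not` at the end of an attribute value and `&not` followed by the first byte of "己" are both decoded, while `&notme` and `&not=regex` in an attribute value are left unchanged.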
I've reworked the comment and added some details about URLs and why the special cases are helpful.
Part of decoding HTML named character references in attribute values may involve checking the codepoint immediately following the named character reference:
The ASCII alphanumeric check was implemented using ctype_alnum(). The behavior of this function depends on the host system and the locale. On my system (macOS) it returns true for characters outside of the desired ASCII alphanumeric range.
This change compares the following byte with the well-defined ASCII alphanumeric ranges from the HTML specification.
This change also does some minor restructuring of the method to make it align clearly with the specification and to include an early return and avoid the byte comparison in the majority of cases.
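A minimal sketch of such a byte comparison, assuming a hypothetical helper name `is_ascii_alphanumeric_byte()` that is not taken from the patch. It checks one byte against the three ASCII ranges the HTML specification uses for ASCII alphanumerics, so the answer does not vary with the locale:

```php
<?php
/**
 * Hypothetical helper, for illustration only: reports whether the byte at
 * offset $at in $text is an ASCII alphanumeric (0-9, A-Z, a-z) as defined
 * by the HTML specification. Unlike ctype_alnum(), it ignores the locale.
 */
function is_ascii_alphanumeric_byte( string $text, int $at ): bool {
	// Early return: there is no byte at this offset to compare.
	if ( $at < 0 || $at >= strlen( $text ) ) {
		return false;
	}

	$byte = ord( $text[ $at ] );

	return (
		( $byte >= 0x30 && $byte <= 0x39 ) || // 0-9
		( $byte >= 0x41 && $byte <= 0x5A ) || // A-Z
		( $byte >= 0x61 && $byte <= 0x7A )    // a-z
	);
}
```

For example, `is_ascii_alphanumeric_byte( "\xE5\xB7\xB1", 0 )` inspects 0xE5, the UTF-8 lead byte of "己", and returns false regardless of the host system.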
Trac ticket: https://core.trac.wordpress.org/ticket/65372
Use of AI Tools
AI assistance: Yes
Tool(s): Claude Opus 5.8, Codex GPT 5.5
Used for: Fuzz testing and discovery, fix draft, refinement, review.