
fix(xml): support unquoted HTML named entities #141

Merged
Moskize91 merged 2 commits into main from fix/unquoted-html-entities
Jun 19, 2026


@Moskize91

Contributor

Summary

  • normalize known HTML named entities outside quoted sections before XML parsing
  • preserve XML predefined entities, numeric references, unknown entities, and quoted content
  • bump package version to 0.1.11
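
To illustrate why this normalization matters, here is a minimal sketch of how a strict XML parser treats the different kinds of references. It uses the standard library's `xml.etree.ElementTree` purely as a stand-in; this PR does not state which parser runs after normalization, and `parses` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parses(source: str) -> bool:
    """Return True if a strict XML parser accepts the source string."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


# HTML-only named entity: undefined in plain XML, so parsing fails.
print(parses("<p>&copy; 2026</p>"))   # False
# Numeric character reference: always valid XML.
print(parses("<p>&#169; 2026</p>"))   # True
# XML predefined entities: valid as-is, so they need no rewriting.
print(parses("<p>&amp; &lt;</p>"))    # True
```

Rewriting `&copy;` to `&#169;` before parsing therefore keeps the character while staying inside what XML allows.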

Tests

  • poetry run ruff check epub_translator/xml/xml_like.py tests/test_xml_like.py pyproject.toml
  • poetry run pytest

@coderabbitai

coderabbitai Bot commented Jun 18, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cbbb3874-0ace-43ae-848b-1262e70c96ee

📥 Commits

Reviewing files that changed from the base of the PR and between dec1572 and 5516692.

📒 Files selected for processing (2)
  • epub_translator/xml/xml_like.py
  • tests/test_xml_like.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • epub_translator/xml/xml_like.py

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of HTML named entities in XML/EPUB content by converting unquoted entities to their correct characters.
    • Ensures entities inside quoted attribute values remain unchanged to prevent invalid markup transformations.
  • Tests
    • Added coverage for unquoted entity normalization, apostrophe edge cases, and quoted-attribute behavior (including error expectations).
  • Chores
    • Updated package version to 0.1.11.

Walkthrough

xml_like.py gains HTML named entity normalization applied during XMLLikeNode initialization. Two module-level constants are added: a regex matching &name; patterns and a set of XML predefined entity names (amp, lt, gt, apos, quot) exempt from conversion. A new call to _normalize_unquoted_html_entities is inserted after void-element self-closing normalization. Two helper functions implement a left-to-right, quote-aware scan that rewrites matched named entities to numeric character references (&#code;) via the stdlib html.entities.html5 mapping, leaving unknown or XML-predefined entities unchanged. Three unit tests cover unquoted entity normalization, apostrophe regression, and the unchanged-inside-quotes behavior. The package version is bumped to 0.1.11.
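
The entity rewrite described above could look roughly like the sketch below. It is not the code from `xml_like.py`: the names `_NAMED_ENTITY`, `_XML_PREDEFINED`, and `to_numeric_references` are assumptions chosen for illustration, and this version deliberately omits the quote-aware scan, so it rewrites every match in the string.

```python
import re
from html.entities import html5

# Assumed pattern for "&name;" references (hypothetical constant name).
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
# XML predefined entities are already valid XML and are left alone.
_XML_PREDEFINED = frozenset({"amp", "lt", "gt", "apos", "quot"})


def _replace(match: re.Match) -> str:
    name = match.group(1)
    if name in _XML_PREDEFINED:
        return match.group(0)
    # html5 keys include the trailing semicolon, e.g. "copy;".
    value = html5.get(name + ";")
    if value is None:
        return match.group(0)  # unknown entity: keep unchanged
    # Some entities expand to more than one code point, so emit one
    # numeric reference per character.
    return "".join(f"&#{ord(ch)};" for ch in value)


def to_numeric_references(text: str) -> str:
    """Rewrite known HTML named entities as numeric character references."""
    return _NAMED_ENTITY.sub(_replace, text)


print(to_numeric_references("&copy; &nbsp; &amp; &#169; &bogus;"))
# &#169; &#160; &amp; &#169; &bogus;
```

Existing numeric references such as `&#169;` never match the pattern because `#` is not a letter, which is one way to leave them untouched.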

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The pull request title follows the required format <type>(<scope>): <subject> with type 'fix', scope 'xml', and clearly describes the main change of supporting unquoted HTML named entities.
Description check | ✅ Passed | The pull request description is directly related to the changeset, detailing the entity normalization implementation, version bump, and testing approach.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/test_xml_like.py (1)

294-315: ⚡ Quick win

Add a regression test for apostrophes in text before a named entity.

Please add a case like <p>It's &copy;</p> to ensure text apostrophes do not disable later entity normalization. This directly protects the new scanner logic from quote-state regressions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_xml_like.py` around lines 294 - 315, Add a new regression test
method after the existing test methods in the test class that verifies
apostrophes in text do not disable entity normalization. Create a test case with
XML content containing an apostrophe in the text (like "It's") followed by a
named entity (like "&copy;"), parse it using XMLLikeNode, and assert that the
named entity is properly normalized to its Unicode equivalent. This ensures the
scanner logic correctly distinguishes between text apostrophes and quote
characters that delimit attribute values.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@epub_translator/xml/xml_like.py`:
- Around line 240-249: The quote-state tracking logic at lines 242-243 is being
applied throughout all content including plain text nodes, causing regular text
apostrophes like in "It's" to incorrectly set quote mode, which then breaks
entity handling for subsequent entities like "&copy;". Quote state should only
be tracked when inside tag contexts (between '<' and '>'). Introduce a boolean
flag to track whether you are currently inside a tag, set it to true when
encountering '<' and false when encountering '>', and only execute the
quote-tracking logic (the line that updates the quote variable with the
conditional expression) when this flag is true. This ensures apostrophes in
plain text do not interfere with the quote state machine used for parsing
attributes.

---

Nitpick comments:
In `@tests/test_xml_like.py`:
- Around line 294-315: Add a new regression test method after the existing test
methods in the test class that verifies apostrophes in text do not disable
entity normalization. Create a test case with XML content containing an
apostrophe in the text (like "It's") followed by a named entity (like "&copy;"),
parse it using XMLLikeNode, and assert that the named entity is properly
normalized to its Unicode equivalent. This ensures the scanner logic correctly
distinguishes between text apostrophes and quote characters that delimit
attribute values.
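
The in-tag flag suggested in the inline comment can be sketched in isolation as follows. This is an illustrative state machine under stated assumptions, not the merged implementation: `quoted_flags` is a hypothetical helper that only reports which characters sit inside a quoted attribute value, so the quote handling can be checked without the entity rewrite.

```python
def quoted_flags(source: str) -> list[bool]:
    """Mark each character that lies inside a quoted attribute value.

    Quote characters only open a quoted section while inside a tag
    (between '<' and '>'); apostrophes in plain text are ignored.
    """
    flags: list[bool] = []
    in_tag = False
    quote = None  # the active quote character, or None
    for ch in source:
        if quote is not None:
            if ch == quote:
                quote = None  # closing quote ends the quoted section
            flags.append(True)
            continue
        if in_tag:
            if ch == ">":
                in_tag = False
            elif ch in "\"'":
                quote = ch  # opening quote of an attribute value
                flags.append(True)
                continue
        elif ch == "<":
            in_tag = True
        flags.append(False)
    return flags


text = "<p>It's &copy;</p>"
attr = '<a title="&copy;">x</a>'
print(any(quoted_flags(text)))                # False: text apostrophe ignored
print(quoted_flags(attr)[attr.index("&")])    # True: entity is inside quotes
```

A scanner built on this state would rewrite an entity only where the flag is false, which covers both the `It's &copy;` regression case and the unchanged-inside-quotes behavior.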

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fa84fa23-10e1-421a-bcc8-0921603fff51

📥 Commits

Reviewing files that changed from the base of the PR and between 46f42ab and dec1572.

📒 Files selected for processing (3)
  • epub_translator/xml/xml_like.py
  • pyproject.toml
  • tests/test_xml_like.py

Comment thread epub_translator/xml/xml_like.py
@Moskize91 Moskize91 changed the title from "Support unquoted HTML named entities" to "fix(xml): support unquoted HTML named entities" Jun 18, 2026
@Moskize91 Moskize91 merged commit 1652567 into main Jun 19, 2026
2 checks passed
@Moskize91 Moskize91 deleted the fix/unquoted-html-entities branch June 19, 2026 02:28