fix(xml): support unquoted HTML named entities by Moskize91 · Pull Request #141 · oomol-lab/epub-translator

Moskize91 · 2026-06-18T09:05:01Z

Summary

- Normalize known HTML named entities outside quoted sections before XML parsing.
- Preserve XML predefined entities, numeric references, unknown entities, and quoted content.
- Bump package version to 0.1.11.
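
As a rough illustration of why normalization has to happen before XML parsing, the sketch below uses the standard library's `xml.etree.ElementTree` (an assumption: the parser the project actually uses is not shown on this page). The helper name `parses` is hypothetical. A strict XML parser rejects an HTML named entity such as `&copy;` when no DTD declares it, but accepts the equivalent numeric character reference.

```python
import xml.etree.ElementTree as ET


def parses(markup: str) -> bool:
    """Return True if the stdlib XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


named = "<p>&copy; 2026</p>"     # HTML named entity, undefined in plain XML
numeric = "<p>&#169; 2026</p>"   # numeric character reference, always valid

print(parses(named))    # False: undefined entity
print(parses(numeric))  # True
print(ET.fromstring(numeric).text)
```

The XML predefined entities (`&amp;`, `&lt;`, `&gt;`, `&apos;`, `&quot;`) already parse without help, which is consistent with the PR leaving them untouched.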

Tests

poetry run ruff check epub_translator/xml/xml_like.py tests/test_xml_like.py pyproject.toml
poetry run pytest

coderabbitai · 2026-06-18T09:05:14Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cbbb3874-0ace-43ae-848b-1262e70c96ee

📥 Commits

Reviewing files that changed from the base of the PR and between dec1572 and 5516692.

📒 Files selected for processing (2)

epub_translator/xml/xml_like.py
tests/test_xml_like.py

🚧 Files skipped from review as they are similar to previous changes (1)

epub_translator/xml/xml_like.py

Summary by CodeRabbit

Bug Fixes
- Improved handling of HTML named entities in XML/EPUB content by converting unquoted entities to their correct characters.
- Entities inside quoted attribute values now remain unchanged, preventing invalid markup transformations.
Tests
- Added coverage for unquoted entity normalization, apostrophe edge cases, and quoted-attribute behavior (including error expectations).
Chores
- Updated package version to 0.1.11.

Walkthrough

xml_like.py gains HTML named entity normalization applied during XMLLikeNode initialization. Two module-level constants are added: a regex matching &name; patterns and a set of XML predefined entity names (amp, lt, gt, apos, quot) exempt from conversion. A new call to _normalize_unquoted_html_entities is inserted after void-element self-closing normalization. Two helper functions implement a left-to-right, quote-aware scan that rewrites matched named entities to numeric character references (&#code;) via the stdlib html.entities.html5 mapping, leaving unknown or XML-predefined entities unchanged. Three unit tests cover unquoted entity normalization, apostrophe regression, and the unchanged-inside-quotes behavior. The package version is bumped to 0.1.11.
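
The entity rewrite described in the walkthrough might look roughly like the sketch below. It is not the code from the PR: the names `NAMED_ENTITY`, `XML_PREDEFINED`, and `to_numeric_refs` are hypothetical, and the quote-aware scan is deliberately left out so that only the mapping step is shown. The one library piece it relies on is the stdlib `html.entities.html5` dictionary, whose keys include the trailing semicolon (for example `"copy;"`).

```python
import re
from html.entities import html5

# Hypothetical names: the PR's actual constants are not shown on this page.
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
XML_PREDEFINED = frozenset({"amp", "lt", "gt", "apos", "quot"})


def to_numeric_refs(text: str) -> str:
    """Rewrite known HTML named entities as numeric character references."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)  # already valid XML, leave untouched
        chars = html5.get(name + ";")
        if chars is None:
            return match.group(0)  # unknown entity, leave untouched
        # Some HTML5 entities expand to more than one code point.
        return "".join(f"&#{ord(ch)};" for ch in chars)

    return NAMED_ENTITY.sub(replace, text)


print(to_numeric_refs("&copy; &amp; &bogus; &nbsp;"))
# &#169; &amp; &bogus; &#160;
```

Emitting numeric references rather than literal characters keeps the output pure ASCII at this stage and leaves character decoding to the XML parser.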

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The pull request title follows the required format `<type>(<scope>): <subject>` with type 'fix', scope 'xml', and clearly describes the main change of supporting unquoted HTML named entities.
Description check	✅ Passed	The pull request description is directly related to the changeset, detailing the entity normalization implementation, version bump, and testing approach.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tests/test_xml_like.py (1)
294-315: ⚡ Quick win

Add a regression test for apostrophes in text before a named entity.

Please add a case like <p>It's &copy;</p> to ensure text apostrophes do not disable later entity normalization. This directly protects the new scanner logic from quote-state regressions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_xml_like.py` around lines 294 - 315, Add a new regression test
method after the existing test methods in the test class that verifies
apostrophes in text do not disable entity normalization. Create a test case with
XML content containing an apostrophe in the text (like "It's") followed by a
named entity (like "&copy;"), parse it using XMLLikeNode, and assert that the
named entity is properly normalized to its Unicode equivalent. This ensures the
scanner logic correctly distinguishes between text apostrophes and quote
characters that delimit attribute values.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@epub_translator/xml/xml_like.py`:
- Around lines 240-249: The quote-state tracking logic at lines 242-243 is being
applied throughout all content including plain text nodes, causing regular text
apostrophes like in "It's" to incorrectly set quote mode, which then breaks
entity handling for subsequent entities like "&copy;". Quote state should only
be tracked when inside tag contexts (between '<' and '>'). Introduce a boolean
flag to track whether you are currently inside a tag, set it to true when
encountering '<' and false when encountering '>', and only execute the
quote-tracking logic (the line that updates the quote variable with the
conditional expression) when this flag is true. This ensures apostrophes in
plain text do not interfere with the quote state machine used for parsing
attributes.
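
The suggestion above can be sketched as a small left-to-right scanner. This is an illustrative version under stated assumptions, not the code merged in the PR: the names `ENTITY`, `XML_PREDEFINED`, `_convert`, and `normalize_unquoted` are hypothetical, and comments, CDATA sections, and processing instructions are not handled. Quote state only changes while the scanner is between `<` and `>`, so an apostrophe in text such as "It's" is copied through without affecting later entities.

```python
import re
from html.entities import html5

# Hypothetical names, chosen for this sketch only.
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
XML_PREDEFINED = frozenset({"amp", "lt", "gt", "apos", "quot"})


def _convert(match: re.Match) -> str:
    """Return a numeric reference for known entities, else the original text."""
    name = match.group(1)
    chars = html5.get(name + ";")
    if name in XML_PREDEFINED or chars is None:
        return match.group(0)
    return "".join(f"&#{ord(ch)};" for ch in chars)


def normalize_unquoted(source: str) -> str:
    out: list[str] = []
    in_tag = False            # True between '<' and '>'
    quote: str | None = None  # active attribute quote character, if any
    i = 0
    while i < len(source):
        ch = source[i]
        if quote is not None:
            # Inside a quoted attribute value: copy verbatim until it closes.
            if ch == quote:
                quote = None
        elif in_tag and ch in "\"'":
            quote = ch        # quotes only count inside a tag
        elif ch == "<":
            in_tag = True
        elif ch == ">":
            in_tag = False
        elif ch == "&":
            match = ENTITY.match(source, i)
            if match:
                out.append(_convert(match))
                i = match.end()
                continue
        out.append(ch)
        i += 1
    return "".join(out)


print(normalize_unquoted("<p>It's &copy;</p>"))
# <p>It's &#169;</p>
print(normalize_unquoted('<a title="&copy;">&copy;</a>'))
# <a title="&copy;">&#169;</a>
```

Checking the quote state first means a `>` inside a quoted attribute value does not end the tag prematurely.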



ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fa84fa23-10e1-421a-bcc8-0921603fff51

📥 Commits

Reviewing files that changed from the base of the PR and between 46f42ab and dec1572.

📒 Files selected for processing (3)

epub_translator/xml/xml_like.py
pyproject.toml
tests/test_xml_like.py

Handle unquoted HTML entities in XML-like content

dec1572

coderabbitai Bot reviewed Jun 18, 2026


Comment thread epub_translator/xml/xml_like.py

Fix HTML entity quote tracking

5516692

Moskize91 changed the title ~~Support unquoted HTML named entities~~ fix(xml): support unquoted HTML named entities Jun 18, 2026

Moskize91 merged commit 1652567 into main Jun 19, 2026
2 checks passed

Moskize91 deleted the fix/unquoted-html-entities branch June 19, 2026 02:28

Moskize91 merged 2 commits into main from fix/unquoted-html-entities
