WIP: v0.4 — DO NOT MERGE (tracking) by mathieu17g · Pull Request #76 · JuliaData/XML.jl

mathieu17g · 2026-06-22T17:44:29Z

WIP — do not merge. Tracking + collaboration PR for the v0.4 line (supersedes #54).
v0.4-dev = #54's rewrite + the cursor stack + ongoing work.

Highlights

A ground-up rewrite of XML.jl's internals initiated by @joshday — a token-based streaming parser, a pull/cursor (StAX-style) API, XPath support, and substantial parse/read speedups.

Done so far

Branch assembled — #54's rewrite ("WIP XML.jl v0.4: Rewrite of internals, streaming tokenizer, XPath support, and bug fixes") + the cursor stack + main's 0.3.x fixes & infrastructure, combined by merging main in (merge-only, so the cursor branch stays usable for downstreams that build on it).
Non-ASCII names — element, attribute, PI, and DTD names with multi-byte UTF-8 (e.g. <café>, <日本語/>) now parse correctly.
Well-formedness enforcement — new wellformed = :lenient | :structural | :strict option (default :structural): rejects multiple roots, non-whitespace text outside the root, and empty/invalid names; :strict additionally rejects -- inside comments, empty/invalid PI targets, and character references outside the legal character range (XML 1.0 §2.2). The level is a compile-time type parameter, so a mode never pays for checks above its level — :lenient runs none at all.
BOM handling — UTF-16 LE/BE and UTF-8 BOM decode on read; a leading BOM is stripped on parse; UTF-16 without a BOM — which XML 1.0 §4.3.3 ("Character Encoding in Entities") forbids — now raises a clear error instead of failing obscurely downstream.
Test harness — the borrowed libxml2 suite (240+ cases) is now actually run (it was present but never included); the W3C conformance suite now asserts (every well-formed doc must parse; a no-regression floor on rejected ill-formed docs) instead of only warning, and runs at :strict.
Regression guards — escape on SubString (#60, "escape should work with AbstractString"), BOM decoding, and the UTF-16-without-BOM error are pinned by tests.
Julia floor raised to 1.10 (LTS).
Infrastructure carried from main — CI bumps, codecov, the W3C-suite cache, and the 0.3.9 CHANGELOG fold-in.

Still ahead

Comprehensive pre-registration audit — sweep the rewrite for any remaining correctness/completeness gaps beyond the set already fixed.
Performance — (re-)measure the parse/read speedup on v0.4-dev vs the current release before quoting figures.
Downstream compatibility (XLSX first) — XLSX is the most prominent dependent needing code changes; open an adaptation PR on it (building on @joshday's XLSX.jl#361, "WIP: Supporting XML v0.4") and wire its branch into this PR's downstream CI, so XLSX's own test suite runs against this XML 0.4 on every push here. Other broadly-used dependents may get the same treatment as we go. (The XLSX adaptation already passes locally; opening the PR and wiring the check is what remains.)
CHANGELOG — document the breaking changes and write the migration guide.
Register v0.4; close #54 ("WIP XML.jl v0.4: Rewrite of internals, streaming tokenizer, XPath support, and bug fixes") and #58 ("perf: avoid per-call ctx allocation in next_no_xml_space") as subsumed.

Issues addressed

XML character references are not unescaped/escaped #17 — XML character references are now unescaped/escaped
XPath support #30 — XPath support
Inconsistent type for attributes where nodes have no attributes #33 — attribute type inconsistency for nodes without attributes addressed
Simple XML.write followed by XML.parse fails #35 — Simple XML.write followed by XML.parse no longer fails
get not defined to match getindex #50 — get defined to match getindex
Question: Why the choice not to escape & to &amp; ? #52 — escape now unconditionally escapes '&'
Incorrect unescape result. #53 — incorrect unescape result (double-unescaping) fixed

Breaking changes & impact on dependent packages

The low-level streaming API (Raw, next/prev, single-argument parent/depth, nodes_equal, escape!/unescape!) is removed in favour of the token parser and cursor API. The high-level DOM API is largely preserved, but note some structural and behavioural changes:

Node is now parametric (Node{S}); attributes is a Vector{Pair} and children may be nothing.
parse/read decode entities into values, so value() returns &, not &amp;.
write auto-escapes text and attributes (double-escape risk if you pre-escape).
Duplicate attributes now error.
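The double-escape risk can be illustrated with a small sketch. `escape_text` below is a hypothetical stand-in written for this example, not XML.jl's own `escape`; it only shows what happens when text that was already escaped passes through an escaping step a second time.

```julia
# Hypothetical stand-in for a text escaper; this is NOT XML.jl's `escape`.
function escape_text(s::AbstractString)
    # '&' goes first so the entities introduced below are not escaped again.
    s = replace(s, "&" => "&amp;")
    s = replace(s, "<" => "&lt;")
    s = replace(s, ">" => "&gt;")
    return s
end

plain = "R&D <lab>"
escaped_once = escape_text(plain)          # what an auto-escaping writer emits
escaped_twice = escape_text(escaped_once)  # the same text if the caller pre-escaped it

println(escaped_once)    # R&amp;D &lt;lab&gt;
println(escaped_twice)   # R&amp;amp;D &amp;lt;lab&amp;gt;
```

Callers that escaped values by hand under 0.3.x would likely want to drop that step when moving to a writer that escapes for them.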

Dependents that only use the high-level DOM mostly need a [compat] bump to 0.4 plus a spot-check of those behavioural changes; those using the removed low-level API (notably XLSX.jl) need code changes. A migration guide is planned (see "Still ahead").

Performance

v0.4 aims at a substantial parse/read speedup over 0.3.x; figures to be (re-)measured on v0.4-dev vs the current release before quoting.

Drops the underscore prefixes from internal names (module is unexported, the clutter was only needed back when these names leaked into XML.jl). Replaces the name-byte predicate with a 256-entry const lookup table. Also fixes a 1-based indexing off-by-one in read_doctype_body: the '<!--' detection guarded with `pos >= 2` while reading `codeunit(data, pos - 2)`, which is codeunit 0 when pos == 2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tag, value, keys, and attributes on LazyNode now return SubString{String} views into the source rather than allocating fresh Strings, so traversing a large document lazily does not duplicate its text data. Introduces a small _as_substring helper to promote the String that `unescape` can return into a SubString so Attributes stays homogeneous. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

_write_xml now inspects children before reformatting: if any Text child has non-whitespace content (or any CData child exists), the element is treated as mixed content and its whitespace is preserved verbatim. Otherwise the writer drops the whitespace-only Text nodes the parser emits for round-tripping source formatting and generates fresh indentation. Same filter is applied at the Document level. Also adds an unescape(::SubString{String}) specialization that returns the input unchanged when it contains no '&', avoiding an allocation on the lazy scanning path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The medium-file workloads show a ~10–25% regression vs the numbers captured at 4a728ee ("Revamp benchmarks"). v0.4-vs-v0.3.8 remains a 70–80% improvement, so this is a post-release follow-up, not a release blocker. Suspected culprit is the eager Pair{S,S}[] alloc per TOKEN_OPEN_TAG introduced in 2f71f9a — see follow-up issue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds `Cursor`: a single mutable wrapper advanced in place over the token stream (the cursor-based StAX direction from #61). Closes the per-child `LazyNode` allocation gap of the lazy DOM walk by mutating one object instead of materializing a node per child.

Orthogonal/additive design:
- New file src/cursor.jl; seams are one include + 4 exports in XML.jl.
- `Cursor` and `LazyNode` are siblings on the shared XMLTokenizer foundation. The cursor's accessors rest on the token-layer primitives (tag_name, attr_value, pi_target, unescape) — they never call LazyNode or its accessors, so DOM-layer changes don't affect the cursor. The token→value logic is intentionally duplicated rather than shared, to keep this purely additive (a later refactor can factor it out).

API: next!, for_each_child, nodetype/tag/value/attributes/depth/eof, get, the Base.iterate pull-mode surface, and LazyNode(c) as a one-way snapshot bridge for the aliasing contract (the cursor is reused in place; reads are synchronous-safe, retention requires a snapshot).

Tests: test/test_cursor.jl (46 cases) — depth model on hand-counted docs, for_each_child, attributes/get, CData/Comment/PI/DTD/entity values, accessor agreement with LazyNode node-for-node, snapshot survival, iterator protocol. Full suite passes.

Perf (N=100k synth, vs the lazy-walk techniques in #61): Cursor next!() DFS = 103 ms / 305 MiB / 4.0M allocs, vs v0.4 eachchildnode/recursive ~310-390 ms / ~1 GiB / 12-15M (×3 faster, ×3.4 less memory). It does not yet reach the v0.3.8+#59 next!()-DFS class (57 ms / 123 MiB): the residual ~1 alloc/token is the non-isbits Token tuple at the iterate boundary, which a follow-up bitstype-Token change removes.

Ref: #61
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ocation

Replaces `Token{S}`'s `raw::SubString{S}` field with a plain byte range — `(kind, has_entities, offset::Int, ncodeunits::Int)` — making `Token` non-parametric and isbits (24 bytes). The `(Token, TokenizerState)` tuple returned by `iterate` is now isbits, so it returns in registers/sret with no heap allocation even though the tokenizer body is too large to inline. This removes the per-token allocation that was the cursor's residual cost (see #61).

Token API:
- `raw(token, data) -> SubString` reconstructs the text view from the source. Multibyte-safe: it lands the end index on the START of the last char via `prevind` (a naive `SubString(data, off+1, off+ncu)` passes a UTF-8 continuation byte as the end index and throws — verified on "aé"/"日本"). `_token_root` resolves `data::SubString` to its parent (offsets are root-relative). This matters for the UTF-16 path of #62, whose fix transcodes to a UTF-8 String upstream of the tokenizer → dense multibyte.
- Emit-site constructors `Token(kind, view)` / `Token(kind, has_amp, view)` keep only the view's range, so all 22 tokenizer emit sites are unchanged.
- `tag_name` / `attr_value` / `pi_target` now take `(token, data)`.
- `TokenizerState` and `StatefulTokenizer.state` drop the `{S}` parameter (the buffered `pending` Token is non-parametric); `has_pending` tests `pending.ncodeunits != 0`; `show(::Token)` prints `KIND @offset+len`.

Consumers thread `data` (`tok.raw` → `raw(tok, data)`): src/XML.jl (eager _parse), src/lazynode.jl (LazyNode + iterators; `_lazy_pos`/`_token_end` simplify to direct field access; `LazyAttrIterator` reaches the source via a small `_src(iter)` helper since it carries only the tokenizer), src/cursor.jl. xpath.jl needs no change (it uses a distinct `XPathToken` type).

Tests:
- Revives test/test_tokenizer.jl (was orphaned — not in runtests, and its `using XML.XMLTokenizer` did not import the names so it could not run). Fixed imports, migrated all `.raw`/accessor sites to thread the source, updated the `show` test (no longer prints text), and wired it into runtests.jl. Its multibyte cases (café/über/héllo/日本語) now guard the `raw()` round-trip in CI.
- Full suite green, byte-identical to baseline: LazyNode 175/175, XMLTokenizer 122/122, Cursor 46/46, XPath 66/66, W3C 559/577 wf + 195/940 not-wf (unchanged counts — Token is representational, the accept/reject scan logic is untouched).

Measured (N=100k synth placemarks, @benchmark seconds=3, Julia 1.12.6):
- Cursor advance-only: 305 MiB/4.0M allocs → 0.00 MiB / 1 alloc.
- Cursor full value-extraction: 103 ms/305 MiB → 83 ms / 30.5 MiB / 1.0M; memory is now below the tech-4 target (57 ms/123 MiB), achieving #61's memory goal. The residual 30 MiB is the `value()::Union{SubString,String}` boxing (one per text node) — orthogonal, a separate monomorphization micro-opt.

This modifies the core `Token` type, so it is NOT orthogonal/additive: it needs coordination with the maintainer and rebasing onto #54 before any upstream merge. Develop in parallel on this stacked branch.

Ref: #61, #62
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
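The byte-range idea and the `prevind` detail can be sketched with Base Julia alone. `RangeToken` and `text_view` are hypothetical names chosen for this illustration and carry fewer fields than the `Token` described above; only the end-index handling is the point.

```julia
# Hypothetical byte-range token: a 0-based byte offset plus a length in code units.
struct RangeToken
    offset::Int
    ncodeunits::Int
end

# Rebuild the text view from the source. SubString expects the index where the
# LAST character starts, so step back from one past the final byte with `prevind`.
function text_view(tok::RangeToken, data::String)
    lo = tok.offset + 1
    hi = prevind(data, tok.offset + tok.ncodeunits + 1)
    return SubString(data, lo, hi)
end

data = "<é>aé</é>"
tok = RangeToken(4, 3)            # the text node "aé": bytes 5 through 7

println(text_view(tok, data))     # aé
println(isbitstype(RangeToken))   # true

# The naive end index (offset + ncodeunits = 7) is a UTF-8 continuation byte.
naive_throws = try
    SubString(data, 5, 7)
    false
catch err
    err isa StringIndexError
end
println(naive_throws)             # true
```

Because the struct holds only integers it is an isbits type, which is the property the commit relies on to avoid a heap allocation per token.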

Nested for_each_child silently skipped a parent's second (and later) subtrees when the source had no inter-element whitespace (minified XML): the inner sweep broke on the boundary node by consuming it (next!() at the top of the loop), then the enclosing sweep's next!() advanced past that same node. Whitespace text nodes between elements accidentally masked the bug by serving as a throwaway boundary; minified machine-generated XML (common for KML) has none. Fix: make the cursor peekable via a `held` flag. On reaching the end of its subtree a sweep sets `c.held` instead of consuming the boundary node; the next `next!` re-yields the held node without advancing, so the enclosing sweep sees it. Composition is then correct for full DFS at any depth, independent of whitespace. Verified by 3 new test_cursor cases (minified + whitespaced + 3-level DFS); full suite green (Cursor 49, LazyNode 175, XMLTokenizer 122, W3C 754). This is a correctness fix for the Phase-1 cursor; it is committed here on the stacked bitstype-Token branch but logically belongs on feature-cursor — move or reorder when restructuring for the upstream PR stack. Ref: #61 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
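A toy model of the `held` flag, assuming a document flattened to `(depth, name)` pairs. `MiniCursor` and its functions are hypothetical and far simpler than the real cursor; the sketch only shows why holding the boundary node makes nested sweeps compose on minified input.

```julia
# Hypothetical flattened document: each node is (depth, name), in document order.
mutable struct MiniCursor
    nodes::Vector{Tuple{Int,String}}
    pos::Int      # index of the current node, 0 before the first
    held::Bool    # true when the current node must be yielded again
end
MiniCursor(nodes) = MiniCursor(nodes, 0, false)

# Advance to the next node, or re-yield a held one without advancing.
function next!(c::MiniCursor)
    if c.held
        c.held = false
        return true
    end
    c.pos += 1
    return c.pos <= length(c.nodes)
end

depth(c::MiniCursor) = c.nodes[c.pos][1]
name(c::MiniCursor) = c.nodes[c.pos][2]

# Call f on each direct child of the current node. The node that ends the
# subtree is held rather than consumed, so the enclosing sweep still sees it.
function for_each_child(f, c::MiniCursor)
    parent_depth = depth(c)
    while next!(c)
        d = depth(c)
        if d <= parent_depth
            c.held = true
            return
        elseif d == parent_depth + 1
            f(c)
        end
    end
end

# Minified shape: <root><a><x/></a><b><y/></b></root>, no whitespace nodes.
c = MiniCursor([(1, "root"), (2, "a"), (3, "x"), (2, "b"), (3, "y")])
next!(c)                                  # position on root
seen = String[]
for_each_child(c) do child
    push!(seen, name(child))
    for_each_child(child) do grandchild
        push!(seen, name(grandchild))
    end
end
println(seen)                             # ["a", "x", "b", "y"]
```

If the inner sweep consumed the boundary node instead of holding it, the outer sweep's next advance would step past `b`, which is the skipped-subtree symptom described above.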

…ch_child

Support for driving the cursor from a known subtree position (Phase 3 wiring):
- Cursor(data, startpos::Integer): primitive cursor whose token stream starts at a byte offset instead of the document start — for walking a subtree whose start is known. LazyNode-agnostic. Cursor(node::LazyNode) becomes a thin, removable convenience over it (the only place Cursor mentions LazyNode), the inverse of the LazyNode(c) snapshot. for_each_child auto-stops at the subtree boundary.
- @for_each_child c child body: macro form of for_each_child that INLINES the body (not a closure), so a body accumulating into enclosing locals avoids the capture-boxing a do-block incurs. Measured on a 5k-placemark accumulating walk: 80 B (macro) vs 237 KB (for_each_child do-block) — the latter is one Core.Box per mutated captured local. Mirrors why node-based code uses @for_each_immediate_child.

7 new test_cursor cases (subtree bridge via offset + LazyNode; inlined nested accumulation, minified); full suite green (Cursor 56, LazyNode 175, XMLTokenizer 122, W3C 754).

Ref: #61
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ral walks next!/for_each_child advance token-by-token, so a structural walk that classifies a node but doesn't need its contents still tokenizes every skipped subtree. skip_element! advances past an element's whole subtree in one byte scan (XMLTokenizer._skip_element_raw, + _scan_tag_end): counts element-nesting depth and respects CDATA / comment / PI / quoted-`>` boundaries, emitting no internal tokens. O(subtree-bytes) but a far tighter loop than full tokenization (no token emission, no SubString construction). Measured (WRS-2 Document, 28k flat Placemarks): classify WITH skip 21 ms vs 70 ms tokenizing the subtrees — ×3.4, and faster than the v0.3.8 next!() walk (~32 ms) too. Robust: 16 new test_cursor cases (literal </tag> in CDATA/comments, > inside an attr value, nested same-name, self-close, PI, minified) confirm skip lands exactly where for_each_child's full walk does. Full suite green (Cursor 72, LazyNode 175, XMLTokenizer 122, W3C 754). For structural walks like FastKML's layer discovery (the WRS-2 deficit). Ref: #61 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
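The kind of byte scan described here can be sketched as follows. This is a simplified illustration that assumes well-formed input; `skip_subtree`, `scan_tag_end`, `at` and `after` are hypothetical names, not the `_skip_element_raw` / `_scan_tag_end` internals.

```julia
# True when the bytes of `pat` occur in `data` starting at byte index `pos`.
at(data::String, pos::Int, pat::String) =
    pos + ncodeunits(pat) - 1 <= ncodeunits(data) &&
    all(i -> codeunit(data, pos + i - 1) == codeunit(pat, i), 1:ncodeunits(pat))

# Byte index just past the next occurrence of `pat` at or after `from`.
function after(data::String, pat::String, from::Int)
    r = findnext(pat, data, from)
    r === nothing && error("unterminated construct")
    return last(r) + 1
end

# Index of the '>' closing the tag that starts at `pos`; '>' inside quotes is ignored.
function scan_tag_end(data::String, pos::Int)
    quote_byte = 0x00
    while pos <= ncodeunits(data)
        b = codeunit(data, pos)
        if quote_byte != 0x00
            b == quote_byte && (quote_byte = 0x00)
        elseif b == UInt8('"') || b == UInt8('\'')
            quote_byte = b
        elseif b == UInt8('>')
            return pos
        end
        pos += 1
    end
    error("unterminated tag")
end

# `pos` is the first byte after an element's start tag. Returns the byte index
# just past the matching end tag, tracking only the nesting depth.
function skip_subtree(data::String, pos::Int)
    depth = 1
    while pos <= ncodeunits(data)
        if codeunit(data, pos) != UInt8('<')
            pos += 1                                  # text content
        elseif at(data, pos, "<!--")
            pos = after(data, "-->", pos + 4)         # comment
        elseif at(data, pos, "<![CDATA[")
            pos = after(data, "]]>", pos + 9)         # CDATA section
        elseif at(data, pos, "<?")
            pos = after(data, "?>", pos + 2)          # processing instruction
        else
            closing = at(data, pos, "</")
            stop = scan_tag_end(data, pos)
            if closing
                depth -= 1
                depth == 0 && return stop + 1
            elseif codeunit(data, stop - 1) != UInt8('/')
                depth += 1                            # start tag, not self-closing
            end
            pos = stop + 1
        end
    end
    error("unterminated element")
end

doc = "<a><b x=\"1>2\"><!-- </b> --><![CDATA[</b>]]><b/></b></a><c/>"
stop = skip_subtree(doc, 4)       # byte 4 is the first byte after "<a>"
println(doc[stop:end])            # <c/>
```

The loop allocates nothing per construct beyond the ranges returned by `findnext`, which is the sense in which such a scan is tighter than emitting a token for every node.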

Cursor mirror of `is_simple_value(::LazyNode)`: returns the lone Text/CData value of the current element (or `nothing` if it has attributes / isn't a single-text element). Non-destructive — reads via `_rescan`, so the cursor position is unchanged and callers still advance with `for_each_child` / `skip_element!`. Lets hot streaming paths read a single-text element's value (e.g. an XLSX cell's `<v>`) with no per-element `LazyNode` snapshot. Measured downstream on XLSX.jl's read path (building `Cell` from the cursor instead of a per-cell `LazyNode`): readtable/eachrow on numeric_only & dates_heavy drop ~40% allocations / ~35% memory, taking the v0.4 read regression vs EzXML v0.10.4 from +15–18% back to ~parity (and below v0.10.4 in memory). Output byte-identical (checksum-verified). test/test_cursor.jl: +1 testset (matches LazyNode on text/entity/CDATA; `nothing` for attrs/element-child/empty/mixed/non-element; non-destructive). Cursor suite 72 → 87. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

#56 corrected example.kml to valid <![CDATA[ on main; v0.4's tests still asserted the old invalid <![CData[ behavior — a semantic merge conflict.
- example.kml testset: assert it reads as a valid Document; keep the invalid-spelling rejection via an inline parse() check.
- roundtrip suite: un-skip example.kml (verified write-stable, CDATA survives).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-06-23T19:59:02Z

Codecov Report

❌ Patch coverage is 94.38073% with 98 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.35%. Comparing base (2c869e3) to head (26a47d7).

Files with missing lines	Patch %	Lines
src/XML.jl	94.43%	38 Missing ⚠️
src/lazynode.jl	92.00%	30 Missing ⚠️
src/cursor.jl	90.79%	15 Missing ⚠️
src/XMLTokenizer.jl	96.62%	11 Missing ⚠️
src/xpath.jl	98.10%	3 Missing ⚠️
ext/XMLAbstractTreesExt.jl	97.43%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main      #76       +/-   ##
===========================================
+ Coverage   74.28%   94.35%   +20.06%     
===========================================
  Files           3        6        +3     
  Lines         669     1753     +1084     
===========================================
+ Hits          497     1654     +1157     
+ Misses        172       99       -73

Files with missing lines	Coverage Δ
ext/XMLAbstractTreesExt.jl	`97.43% <97.43%> (ø)`
src/xpath.jl	`98.10% <98.10%> (ø)`
src/XMLTokenizer.jl	`96.62% <96.62%> (ø)`
src/cursor.jl	`90.79% <90.79%> (ø)`
src/lazynode.jl	`92.00% <92.00%> (ø)`
src/XML.jl	`94.36% <94.43%> (+21.63%)`	⬆️

The 1.9 floor came from package extensions (which need >=1.9), but it sat below the LTS and was never exercised — CI runs lts(=1.10) + 1, not 1.9. Flooring at the LTS makes the declared minimum match what CI actually tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mathieu17g · 2026-06-23T21:08:51Z

@TimG1964 — v0.4 now has an official integration branch on JuliaData: v0.4-dev. It's your feature-cursor-bitstype-token plus main's 0.3.x fixes and infrastructure, so the cursor API and foreach_attr are unchanged.

I dev'd XLSX's XML dependency to v0.4-dev and ran your cursor-xml-optimisation suite — all green, no code changes needed. When you have a moment, could you point that XML dependency at v0.4-dev instead of feature-cursor-bitstype-token? I'll keep the latter frozen until you've moved.

TimG1964 · 2026-06-23T21:26:50Z

Away this week, so will be a few days. Will do as soon as I can.

Names (element / attribute / PI / DTD) with non-ASCII characters — café, 日本語, données — were rejected by the tokenizer, then hit StringIndexError once accepted. Fixed at the three layers where a byte-level tokenizer hides the 1-byte = 1-char assumption:
- acceptance: NAME_BYTE_TABLE + _dtd_is_name_char admit bytes/chars >= 0x80
- slicing: tag/PI/attr-name slices use prevind (not pos-1); _dtd_read_name advances with nextind
- accessors: tag_name / pi_target slice to lastindex (not ncodeunits)

Test-first: 6 new assertions (Unicode Support + DTD Parsing); promoted the two @test_broken this resolves (pugixml CJK, libexpat UTF-8 names).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
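A minimal sketch of the acceptance and slicing layers, using Base Julia only. `NAME_BYTES`, `is_name_byte` and `read_name` are hypothetical names, and the ASCII rule is deliberately simplified (it does not separate name-start characters from name characters).

```julia
# 256-entry lookup table indexed by byte value + 1. ASCII entries follow a
# simplified name-character rule; every byte >= 0x80 is admitted so that
# multi-byte UTF-8 sequences pass through the byte-level scan.
const NAME_BYTES = ntuple(256) do i
    b = UInt8(i - 1)
    b >= 0x80 ||
        UInt8('a') <= b <= UInt8('z') ||
        UInt8('A') <= b <= UInt8('Z') ||
        UInt8('0') <= b <= UInt8('9') ||
        b in (UInt8('_'), UInt8(':'), UInt8('-'), UInt8('.'))
end

is_name_byte(b::UInt8) = NAME_BYTES[Int(b) + 1]

# Scan a name starting at byte index `start` and return it as a view.
function read_name(data::String, start::Int)
    pos = start
    n = ncodeunits(data)
    while pos <= n && is_name_byte(codeunit(data, pos))
        pos += 1
    end
    # `pos` is one past the name; `prevind` lands on the start of its last
    # character, whereas `pos - 1` may be a continuation byte.
    return SubString(data, start, prevind(data, pos))
end

println(read_name("<café attr='1'>", 2))   # café
println(read_name("<日本語/>", 2))          # 日本語
```

A table that admits every high byte leaves full validation of non-ASCII name characters to a later stage, if any; that trade-off is an assumption of this sketch.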

parse/read now reject ill-formed documents by default — multiple root elements, non-whitespace text outside the root, and empty/invalid-start element names — via a `wellformed = :lenient | :structural | :strict` keyword (default :structural). The level is a `Val` type parameter, so :lenient's checks dead-code-eliminate and the default path's per-token cost is unchanged. Also: parse(::AbstractString) now strips a leading U+FEFF (BOM) character. The byte-level read path already did this (_normalize_bom); the in-memory path left it as a stray top-level Text node, surfaced once :structural rejected it. :strict (content-level: -- in comments, empty PI target, out-of-range char refs) is carried by the API but not yet implemented — follow-on. Test-first: well-formedness testset in 'Spec 2.1' (rejections + legal-prolog guards + the :lenient opt-out); the W3C catalog scrape (a multi-root fixture) opts to :lenient. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
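How a `Val` type parameter lets a level skip checks it does not need can be sketched as below. `check_document` is a hypothetical, much-reduced function (it takes a pre-counted number of roots and one comment body) and does not reflect the real parser's signatures.

```julia
# The level W is part of the argument's type, so each `if` below compares
# compile-time constants and the compiler can drop the branches that do not apply.
function check_document(::Val{W}, nroots::Int, comment::AbstractString) where {W}
    if W !== :lenient
        nroots == 1 || throw(ArgumentError("expected one root element, found $nroots"))
    end
    if W === :strict
        occursin("--", comment) &&
            throw(ArgumentError("\"--\" is not allowed inside a comment"))
    end
    return true
end

# Keyword front end: the Symbol is lifted into the type domain once, at the boundary.
function check_document(nroots::Int, comment::AbstractString; wellformed::Symbol = :structural)
    wellformed in (:lenient, :structural, :strict) ||
        throw(ArgumentError("unknown wellformed level: $wellformed"))
    return check_document(Val(wellformed), nroots, comment)
end

println(check_document(2, "a--b"; wellformed = :lenient))   # true: no checks run
println(check_document(1, "a--b"))                          # true: :structural ignores "--"
```

The `Val(wellformed)` call is a dynamic dispatch when the symbol is only known at run time, but it happens once per document rather than once per token.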

test/test_libxml2_testcases.jl — 1578 lines, 156 testsets borrowed from libxml2 — existed but no include() referenced it, so it executed zero assertions. Wire it into runtests.jl beside the other reference-parser suites (pugixml, libexpat). Three error-case tests asserted the pre-:structural lenient behavior (accept trailing text / bare text / a stray DOCTYPE bracket as a Document) — cases where XML.jl historically diverged from libxml2 by accepting ill-formed input. They now assert the current contract: the default :structural rejects them (matching libxml2), and :lenient still accepts them. 246 assertions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test_w3c.jl counted pass/fail but only @warn'd the outcome — the only real assertions were tautological (nodetype==Document after a successful read) or a no-op (@test true), so the suite passed regardless of how many W3C cases were mishandled. Now it asserts, asymmetrically: every well-formed doc must parse (@test n_fail == 0 — 577/577), and the not-well-formed rejection count carries a no-regression floor (@test n_pass >= 156). XML.jl is non-validating, so it cannot reject the ~784 not-wf cases needing DTD/entity validation; the floor ratchets up as :structural/:strict grow, and the live counts stay in @info. Categorising the remaining gap is a follow-on audit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

:structural rejects document-shape errors; :strict now adds the content-level constraints, all gated on `W === :strict` so :lenient/:structural dead-code-eliminate them:
- "--" within a comment (XML §2.5)
- an empty or non-Name processing-instruction target (XML §2.6) — reuses _is_name_start, so "xml-stylesheet" and other valid targets still parse
- a numeric character reference outside the XML §2.2 Char range — #x0, surrogates, > #x10FFFF. The range is checked explicitly, not via isvalid(Char,·), which accepts #x0 and other C0 controls that XML forbids. The scan runs only when a token actually carries entities.

Completes the wellformed = :lenient | :structural | :strict ladder (the keyword was already wired through parse/read).

Tests: per-construct :strict cases in the §2.4 / §2.5 / §2.6 spec testsets, each asserting the :strict rejection and that :structural/:lenient still accept.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
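The explicit range check can be written directly from the XML 1.0 §2.2 `Char` production. `is_xml_char` and `charref_codepoint` are hypothetical helper names for this sketch, and the reference decoder assumes a syntactically valid `&#...;` form.

```julia
# XML 1.0 §2.2 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
is_xml_char(cp::Integer) =
    cp == 0x9 || cp == 0xA || cp == 0xD ||
    0x20 <= cp <= 0xD7FF ||
    0xE000 <= cp <= 0xFFFD ||
    0x10000 <= cp <= 0x10FFFF

# Decode the digits of a numeric reference such as "&#x41;" or "&#65;".
function charref_codepoint(ref::AbstractString)
    body = ref[3:end-1]                      # strip "&#" and ";"
    if startswith(body, "x")
        return parse(UInt32, body[2:end]; base = 16)
    end
    return parse(UInt32, body; base = 10)
end

println(is_xml_char(charref_codepoint("&#x41;")))   # true
println(is_xml_char(charref_codepoint("&#x0;")))    # false
println(isvalid(Char, 0x0))                          # true: why isvalid is not enough
```

The last line shows the gap the commit message describes: a code point can be a valid Julia `Char` and still fall outside what XML permits.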

Both W3C reads now pass wellformed=:strict (was the default :structural). Measured on the pinned xmlts20130923 corpus:
- Well-formed (valid/invalid): 577/577 still parse — :strict has zero false-positives on real-world XML, the key safety check for the content-level rules.
- Not-well-formed: rejections rise 156 -> 169 (the syntactic ill-formedness :structural missed: -- in comments, bad PI targets, out-of-range char refs). Floor bumped to 169. The remaining 771 not-rejected are validity errors (DTD/entity) outside a non-validating parser's scope; categorising them stays a Phase 6.5 audit item.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Regression tests for already-shipped fixes, plus one clear-error addition:
- BOM decode (read path): UTF-16 LE/BE + UTF-8 BOM each decode to <a/> (guards _normalize_bom).
- escape(SubString): #60 — escape was String-specialized; the AbstractString fix is now pinned.
- UTF-16 without a BOM: _normalize_bom now raises "UTF-16 without a BOM is not well-formed (XML 1.0 §4.3.3)" when no BOM matched but a NUL byte sits in the first two positions. Previously :structural still rejected it, but with a cryptic "invalid element name" (interleaved NULs derail tokenization); this names the real cause. Two comparisons, not an O(n) isvalid(String) scan.

The UTF-16-no-BOM tests assert the clear §4.3.3 message specifically — a bare @test_throws would false-pass since :structural already throws.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
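The BOM decision described above can be sketched as a small classifier. `sniff_encoding` is a hypothetical function that only names the encoding or raises the error; the transcoding that `_normalize_bom` performs is left out, and UTF-32 is not considered.

```julia
# Classify the leading bytes of a document. A sketch of the decision only.
function sniff_encoding(bytes::AbstractVector{UInt8})
    n = length(bytes)
    if n >= 3 && bytes[1] == 0xEF && bytes[2] == 0xBB && bytes[3] == 0xBF
        return :utf8_bom
    elseif n >= 2 && bytes[1] == 0xFF && bytes[2] == 0xFE
        return :utf16le
    elseif n >= 2 && bytes[1] == 0xFE && bytes[2] == 0xFF
        return :utf16be
    elseif n >= 2 && (bytes[1] == 0x00 || bytes[2] == 0x00)
        # No BOM matched, but a NUL sits in the first two positions.
        throw(ArgumentError("UTF-16 without a BOM is not well-formed (XML 1.0 §4.3.3)"))
    end
    return :utf8
end

println(sniff_encoding(codeunits("<a/>")))                      # utf8
println(sniff_encoding(UInt8[0xFF, 0xFE, 0x3C, 0x00]))          # utf16le

# "<a/>" encoded as UTF-16LE with the BOM left off.
no_bom = UInt8[0x3C, 0x00, 0x61, 0x00, 0x2F, 0x00, 0x3E, 0x00]
message = try
    sniff_encoding(no_bom)
    ""
catch err
    err.msg
end
println(message)
```

Checking the message text, as the tests in the commit do, distinguishes this error from any other rejection the same input might trigger later.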

mathieu17g · 2026-06-24T20:58:39Z

What landed in `0001d6e..26a47d7` — parser correctness + test harness

Well-formedness enforcement — new wellformed = :lenient | :structural | :strict option (024ce24, 1aa1d9f)
The parser was accepting many ill-formed documents — multiple roots, non-whitespace text outside the root, empty/invalid names, and at the content level -- inside comments, empty/invalid PI targets, and out-of-range character references. It now rejects them, with the level selectable and :structural the default; :strict adds the content-level checks. The level is a Val type parameter, so a mode never pays for checks above its level — :lenient runs none at all.

Non-ASCII names (0001d6e)
<café>, <日本語/>, and non-ASCII attribute/PI/DTD names threw before — the name-byte table was ASCII-only, and a few token slices/accessors used byte arithmetic that broke mid-multibyte-character. They parse now. (Two @test_broken cases in the borrowed suites started passing once this landed and were promoted to real assertions.)

Test-harness hardening (ac86c55, f20bc4f, bfddb19)

The borrowed libxml2 suite (~240 cases) was in the repo but not included — it ran zero assertions. Now wired in.
The W3C conformance suite only @warned on mismatches, so it stayed green no matter what. It now asserts: every well-formed doc must parse, plus a no-regression floor on rejected ill-formed docs.
The W3C suite now runs at :strict — 577/577 valid docs still parse (zero false-positives from the new checks) and not-well-formed rejections rose 156 → 169.

BOM handling + regression guards (26a47d7)
UTF-16 without a BOM now raises a clear not well-formed (XML 1.0 §4.3.3) error instead of a cryptic downstream failure; regression tests pin BOM decoding, that error, and escape on SubString (#60).

joshday and others added 30 commits March 5, 2026 09:34

Rewrite XML parser with tokenizer and XPath

6dacef3

remove dead code

97384c3

more test files

1844b16

Add validation tests and remove legacy DTD/raw code

b6f4d47

Update CI actions and add validation tests

21f647d

update ci

c673427

Add XMark benchmark generator and expand benchmarks

46c5a31

Add LazyNode type and StringViews extension

33bcf35

Refactor simple_value checks and use direct attrs iteration

d011424

Refactor tokenizer into XMLTokenizer and add LazyNode

754f8fa

Add benchmarks, StringViews tests, simplify XML module

8483fed

Add GC.gc before tmpfile cleanup for Windows

eb5caeb

Bump version to v0.4.0

b914bfe

Use mktempdir for temp file cleanup in StringViews tests

d76c484

Remove StringViews extension and simplify tokenizer

41836ae

Replace printstyled with print in show methods

b670267

Revamp benchmarks and expand test suite

4a728ee

Add Attributes type and performance optimizations

2f71f9a

Add sourcetext, write, eachchildnode for LazyNode

6c4e8f3

Namespace token kinds and document API

60725db

Add LazyNode perf APIs and XLSX-pattern benchmarks

9d129b8

Refresh XLSX-pattern benchmark snapshot

fb583c4

Add AbstractTrees package extension

cfc1f81

Use byte-level Base.write in XML serializer

895e994

Skip unescape scan when tokenizer saw no entities

ff84960

Use findnext for tokenizer text/attr scans

18d88b1

joshday and others added 13 commits May 15, 2026 15:40

Refresh benchmark snapshot and README bars

b790e85

Wire Token.has_entities into LazyNode read path

a93b9a0

Add end-to-end XLSX hot-loop benchmarks

e532a28

perf: replace Ref{Bool} with Bool in LazyAttrIterator, reducing eachattribute allocations by 36%

4d56ed3

perf: add foreach_attr zero-allocation callback API for attribute scanning

3d5a806

Merge main into v0.4-dev: carry 0.3.x fixes + infra; port _normalize_bom (#65) into v0.4 read path

b7fd1ba

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mathieu17g and others added 7 commits June 24, 2026 10:03

mathieu17g wants to merge 51 commits into main from v0.4-dev

4 participants
