WIP: v0.4 — DO NOT MERGE (tracking)#76

Draft
mathieu17g wants to merge 51 commits into main from v0.4-dev

mathieu17g commented Jun 22, 2026
Collaborator

WIP — do not merge. Tracking + collaboration PR for the v0.4 line (supersedes #54).
v0.4-dev = #54's rewrite + the cursor stack + ongoing work.

Highlights

A ground-up rewrite of XML.jl's internals initiated by @joshday — a token-based streaming parser, a pull/cursor (StAX-style) API, XPath support, and substantial parse/read speedups.

Done so far

  • Branch assembled: #54's rewrite ("WIP XML.jl v0.4: Rewrite of internals, streaming tokenizer, XPath support, and bug fixes") + the cursor stack + main's 0.3.x fixes & infrastructure, combined by merging main in (merge-only, so the cursor branch stays usable for downstreams that build on it).
  • Non-ASCII names — element, attribute, PI, and DTD names with multi-byte UTF-8 (e.g. <café>, <日本語/>) now parse correctly.
  • Well-formedness enforcement — new wellformed = :lenient | :structural | :strict option (default :structural): rejects multiple roots, non-whitespace text outside the root, and empty/invalid names; :strict additionally rejects -- inside comments, empty/invalid PI targets, and character references outside the legal character range (XML 1.0 §2.2). The level is a compile-time type parameter, so a mode never pays for checks above its level — :lenient runs none at all.
  • BOM handling — UTF-16 LE/BE and UTF-8 BOM decode on read; a leading BOM is stripped on parse; UTF-16 without a BOM — which XML 1.0 §4.3.3 ("Character Encoding in Entities") forbids — now raises a clear error instead of failing obscurely downstream.
  • Test harness — the borrowed libxml2 suite (240+ cases) is now actually run (it was present but never included); the W3C conformance suite now asserts (every well-formed doc must parse; a no-regression floor on rejected ill-formed docs) instead of only warning, and runs at :strict.
  • Regression guards: escape on SubString (#60, "escape should work with AbstractString"), BOM decoding, and the UTF-16-without-BOM error are pinned by tests.
  • Julia floor raised to 1.10 (LTS).
  • Infrastructure carried from main — CI bumps, codecov, the W3C-suite cache, and the 0.3.9 CHANGELOG fold-in.

Still ahead

Issues addressed

Breaking changes & impact on dependent packages

The low-level streaming API (Raw, next/prev, single-argument parent/depth, nodes_equal, escape!/unescape!) is removed in favour of the token parser and cursor API. The high-level DOM API is largely preserved, but note some structural and behavioural changes:

  • Node is now parametric (Node{S}); attributes is a Vector{Pair} and children may be nothing.
  • parse/read decode entities into values, so value() returns &, not &amp;.
  • write auto-escapes text and attributes (double-escape risk if you pre-escape).
  • Duplicate attributes now error.

Dependents that only use the high-level DOM mostly need a [compat] bump to 0.4 plus a spot-check of those behavioural changes; those using the removed low-level API (notably XLSX.jl) need code changes. A migration guide is planned (see "Still ahead").
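
The double-escape risk listed above can be illustrated with a stand-alone sketch. `escape_sketch` is a hypothetical helper written for this example, not XML.jl's own `escape`; it only shows why text that is already escaped changes when an auto-escaping writer escapes it again.

```julia
# Hypothetical stand-in for an XML text escaper (not XML.jl's `escape`).
# The multi-pair `replace` makes a single pass, so inserted text is not rescanned.
escape_sketch(s::AbstractString) =
    replace(s, "&" => "&amp;", "<" => "&lt;", ">" => "&gt;")

raw_text = "a & b"
once  = escape_sketch(raw_text)   # what an auto-escaping writer would emit
twice = escape_sketch(once)       # pre-escaped input that gets escaped again

println(once)    # a &amp; b
println(twice)   # a &amp;amp; b
```

Under these assumptions, a dependent that used to pre-escape values before writing would most likely need to drop that step after moving to 0.4.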

Performance

v0.4 aims at a substantial parse/read speedup over 0.3.x; figures to be (re-)measured on v0.4-dev vs the current release before quoting.

joshday and others added 30 commits March 5, 2026 09:34
Drops the underscore prefixes from internal names (module is unexported,
the clutter was only needed back when these names leaked into XML.jl).
Replaces the name-byte predicate with a 256-entry const lookup table.

Also fixes a 1-based indexing off-by-one in read_doctype_body: the
'<!--' detection guarded with `pos >= 2` while reading
`codeunit(data, pos - 2)`, which is codeunit 0 when pos == 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
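
A minimal sketch of the indexing hazard described in this commit message. `byte_two_back` is a hypothetical helper, and the `pos >= 3` guard is one possible fix, not necessarily the one committed.

```julia
# Hypothetical helper: the byte two positions before `pos`, or `nothing`
# when that position would fall before the start of the string.
function byte_two_back(data::String, pos::Int)
    # Julia strings are 1-based, so `pos - 2` is a valid index only for pos >= 3.
    # A guard of `pos >= 2` would allow codeunit(data, 0), which throws.
    return pos >= 3 ? codeunit(data, pos - 2) : nothing
end

byte_two_back("<!--", 4)   # index 2 holds '!'
byte_two_back("<!--", 2)   # nothing: there is no byte at index 0
```
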
tag, value, keys, and attributes on LazyNode now return
SubString{String} views into the source rather than allocating
fresh Strings, so traversing a large document lazily does not
duplicate its text data.

Introduces a small _as_substring helper to promote the String that
`unescape` can return into a SubString so Attributes stays homogeneous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
_write_xml now inspects children before reformatting: if any Text
child has non-whitespace content (or any CData child exists), the
element is treated as mixed content and its whitespace is preserved
verbatim. Otherwise the writer drops the whitespace-only Text nodes
the parser emits for round-tripping source formatting and generates
fresh indentation. Same filter is applied at the Document level.

Also adds an unescape(::SubString{String}) specialization that
returns the input unchanged when it contains no '&', avoiding an
allocation on the lazy scanning path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
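
The no-`&` fast path mentioned above can be sketched as follows. `unescape_sketch` is hypothetical and handles only the five predefined entities (no numeric character references); it is not XML.jl's `unescape`.

```julia
# Hypothetical sketch of the fast path: return the same view when there is
# nothing to decode, and build a new String only when an '&' is present.
function unescape_sketch(s::SubString{String})
    occursin('&', s) || return s
    return replace(s, "&lt;" => "<", "&gt;" => ">", "&quot;" => "\"",
                      "&apos;" => "'", "&amp;" => "&")
end

src = "<a>plain</a><b>x &amp; y</b>"
plain = SubString(src, 4, 8)     # "plain"
coded = SubString(src, 16, 24)   # "x &amp; y"

unescape_sketch(plain) === plain   # same view, no new String built
unescape_sketch(coded)             # "x & y", a freshly allocated String
```

The return type is a `Union` of `SubString{String}` and `String`, which is the trade-off this kind of fast path accepts.
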
The medium-file workloads show a ~10–25% regression vs the numbers
captured at 4a728ee ("Revamp benchmarks"). v0.4-vs-v0.3.8 remains
a 70–80% improvement, so this is a post-release follow-up, not a
release blocker. Suspected culprit is the eager Pair{S,S}[] alloc
per TOKEN_OPEN_TAG introduced in 2f71f9a — see follow-up issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
joshday and others added 13 commits May 15, 2026 15:40
Adds `Cursor`: a single mutable wrapper advanced in place over the token
stream (the cursor-based StAX direction from #61).
Closes the per-child `LazyNode` allocation gap of the lazy DOM walk by
mutating one object instead of materializing a node per child.

Orthogonal/additive design:
- New file src/cursor.jl; seams are one include + 4 exports in XML.jl.
- `Cursor` and `LazyNode` are siblings on the shared XMLTokenizer
  foundation. The cursor's accessors rest on the token-layer primitives
  (tag_name, attr_value, pi_target, unescape) — they never call LazyNode
  or its accessors, so DOM-layer changes don't affect the cursor. The
  token→value logic is intentionally duplicated rather than shared, to
  keep this purely additive (a later refactor can factor it out).

API: next!, for_each_child, nodetype/tag/value/attributes/depth/eof,
get, the Base.iterate pull-mode surface, and LazyNode(c) as a one-way
snapshot bridge for the aliasing contract (the cursor is reused in
place; reads are synchronous-safe, retention requires a snapshot).

Tests: test/test_cursor.jl (46 cases) — depth model on hand-counted
docs, for_each_child, attributes/get, CData/Comment/PI/DTD/entity
values, accessor agreement with LazyNode node-for-node, snapshot
survival, iterator protocol. Full suite passes.

Perf (N=100k synth, vs the lazy-walk techniques in #61): Cursor next!()
DFS = 103 ms / 305 MiB / 4.0M allocs, vs v0.4 eachchildnode/recursive
~310-390 ms / ~1 GiB / 12-15M (×3 faster, ×3.4 less memory). It does
not yet reach the v0.3.8+#59 next!()-DFS class (57 ms / 123 MiB): the
residual ~1 alloc/token is the non-isbits Token tuple at the iterate
boundary, which a follow-up bitstype-Token change removes.

Ref: #61

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ocation

Replaces `Token{S}`'s `raw::SubString{S}` field with a plain byte range —
`(kind, has_entities, offset::Int, ncodeunits::Int)` — making `Token`
non-parametric and isbits (24 bytes). The `(Token, TokenizerState)` tuple
returned by `iterate` is now isbits, so it returns in registers/sret with no
heap allocation even though the tokenizer body is too large to inline. This
removes the per-token allocation that was the cursor's residual cost (see #61).

Token API:
- `raw(token, data) -> SubString` reconstructs the text view from the source.
  Multibyte-safe: it lands the end index on the START of the last char via
  `prevind` (a naive `SubString(data, off+1, off+ncu)` passes a UTF-8
  continuation byte as the end index and throws — verified on "aé"/"日本").
  `_token_root` resolves `data::SubString` to its parent (offsets are
  root-relative). This matters for the UTF-16 path of #62, whose fix
  transcodes to a UTF-8 String upstream of the tokenizer → dense multibyte.
- Emit-site constructors `Token(kind, view)` / `Token(kind, has_amp, view)`
  keep only the view's range, so all 22 tokenizer emit sites are unchanged.
- `tag_name` / `attr_value` / `pi_target` now take `(token, data)`.
- `TokenizerState` and `StatefulTokenizer.state` drop the `{S}` parameter
  (the buffered `pending` Token is non-parametric); `has_pending` tests
  `pending.ncodeunits != 0`; `show(::Token)` prints `KIND @offset+len`.

Consumers thread `data` (`tok.raw` → `raw(tok, data)`): src/XML.jl (eager
_parse), src/lazynode.jl (LazyNode + iterators; `_lazy_pos`/`_token_end`
simplify to direct field access; `LazyAttrIterator` reaches the source via a
small `_src(iter)` helper since it carries only the tokenizer), src/cursor.jl.
xpath.jl needs no change (it uses a distinct `XPathToken` type).

Tests:
- Revives test/test_tokenizer.jl (was orphaned — not in runtests, and its
  `using XML.XMLTokenizer` did not import the names so it could not run).
  Fixed imports, migrated all `.raw`/accessor sites to thread the source,
  updated the `show` test (no longer prints text), and wired it into
  runtests.jl. Its multibyte cases (café/über/héllo/日本語) now guard the
  `raw()` round-trip in CI.
- Full suite green, byte-identical to baseline: LazyNode 175/175,
  XMLTokenizer 122/122, Cursor 46/46, XPath 66/66, W3C 559/577 wf +
  195/940 not-wf (unchanged counts — Token is representational, the
  accept/reject scan logic is untouched).

Measured (N=100k synth placemarks, @benchmark seconds=3, Julia 1.12.6):
- Cursor advance-only: 305 MiB/4.0M allocs → 0.00 MiB / 1 alloc.
- Cursor full value-extraction: 103 ms/305 MiB → 83 ms / 30.5 MiB / 1.0M allocs.
  Memory is now below the tech-4 target (30.5 vs 123 MiB), achieving #61's
  memory goal; time (83 ms) is still above that target's 57 ms. The
  residual 30 MiB is the `value()::Union{SubString,String}` boxing (one per
  text node) — orthogonal, a separate monomorphization micro-opt.

This modifies the core `Token` type, so it is NOT orthogonal/additive: it
needs coordination with the maintainer and rebasing onto #54 before any
upstream merge. Develop in parallel on this stacked branch.

Ref: #61, #62

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
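
A minimal sketch of the byte-range token and the `prevind`-based reconstruction described above. `TokenSketch` and `raw_sketch` are hypothetical names, and the field types are assumptions chosen to match the stated layout; the real `Token` may differ.

```julia
# Hypothetical mirror of the described layout: no string field, so it is isbits.
struct TokenSketch
    kind::UInt8
    has_entities::Bool
    offset::Int        # 0-based byte offset of the token in the source
    ncodeunits::Int    # token length in bytes
end

# Rebuild the text view. SubString needs the index of the START of the last
# character, so step back from one-past-the-end with `prevind`.
function raw_sketch(t::TokenSketch, data::String)
    first_index = t.offset + 1
    last_index = prevind(data, t.offset + t.ncodeunits + 1)
    return SubString(data, first_index, last_index)
end

data = "<aé/>"
tok = TokenSketch(0x01, false, 1, 3)   # bytes 2:4 cover "aé" ('é' is 2 bytes)

isbitstype(TokenSketch)   # true
raw_sketch(tok, data)     # "aé"
# The naive SubString(data, 2, 4) throws: byte 4 is a UTF-8 continuation byte.
```

On a 64-bit build this struct occupies 24 bytes (two 1-byte fields padded to the alignment of `Int`, plus two 8-byte `Int`s), which matches the size quoted in the commit message.
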
Nested for_each_child silently skipped a parent's second (and later) subtrees
when the source had no inter-element whitespace (minified XML): the inner sweep
broke on the boundary node by consuming it (next!() at the top of the loop),
then the enclosing sweep's next!() advanced past that same node. Whitespace
text nodes between elements accidentally masked the bug by serving as a
throwaway boundary; minified machine-generated XML (common for KML) has none.

Fix: make the cursor peekable via a `held` flag. On reaching the end of its
subtree a sweep sets `c.held` instead of consuming the boundary node; the next
`next!` re-yields the held node without advancing, so the enclosing sweep sees
it. Composition is then correct for full DFS at any depth, independent of
whitespace.

Verified by 3 new test_cursor cases (minified + whitespaced + 3-level DFS);
full suite green (Cursor 49, LazyNode 175, XMLTokenizer 122, W3C 754).

This is a correctness fix for the Phase-1 cursor; it is committed here on the
stacked bitstype-Token branch but logically belongs on feature-cursor — move
or reorder when restructuring for the upstream PR stack.

Ref: #61

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
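
The `held` mechanism can be sketched on a toy token stream. Everything below is hypothetical (`CursorSketch`, `next_sketch!`, `for_each_child_sketch`) and models nodes as `(depth, tag)` pairs rather than real tokenizer output; it only illustrates how leaving the boundary node unconsumed lets nested sweeps compose.

```julia
mutable struct CursorSketch
    tokens::Vector{Tuple{Int,String}}   # (depth, tag) in document order
    pos::Int
    held::Bool
end
CursorSketch(tokens) = CursorSketch(tokens, 0, false)

function next_sketch!(c::CursorSketch)
    if c.held
        c.held = false      # re-yield the held node without advancing
    else
        c.pos += 1
    end
    return c.pos <= length(c.tokens) ? c.tokens[c.pos] : nothing
end

function for_each_child_sketch(f, c::CursorSketch)
    parent_depth = c.tokens[c.pos][1]
    while true
        node = next_sketch!(c)
        node === nothing && break
        if node[1] <= parent_depth
            c.held = true   # boundary: leave it for the enclosing sweep
            break
        end
        node[1] == parent_depth + 1 && f(c)
    end
    return nothing
end

function walk_tags(tokens)
    c = CursorSketch(tokens)
    next_sketch!(c)         # position on the root
    seen = String[]
    for_each_child_sketch(c) do child
        push!(seen, child.tokens[child.pos][2])
        for_each_child_sketch(child) do grandchild
            push!(seen, grandchild.tokens[grandchild.pos][2])
        end
    end
    return seen
end

# Minified <r><a><x/><y/></a><b><z/></b></r>: no whitespace nodes between elements.
minified = [(0, "r"), (1, "a"), (2, "x"), (2, "y"), (1, "b"), (2, "z")]
walk_tags(minified)   # ["a", "x", "y", "b", "z"]
```

Without the `held` flag, the inner sweep over `a` would consume `b` as its boundary, and the outer sweep's next advance would land on `z`, losing `b` and its subtree.
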
…ch_child

Support for driving the cursor from a known subtree position (Phase 3 wiring):

- Cursor(data, startpos::Integer): primitive cursor whose token stream starts at a
  byte offset instead of the document start — for walking a subtree whose start is
  known. LazyNode-agnostic. Cursor(node::LazyNode) becomes a thin, removable
  convenience over it (the only place Cursor mentions LazyNode), the inverse of the
  LazyNode(c) snapshot. for_each_child auto-stops at the subtree boundary.

- @for_each_child c child body: macro form of for_each_child that INLINES the body
  (not a closure), so a body accumulating into enclosing locals avoids the
  capture-boxing a do-block incurs. Measured on a 5k-placemark accumulating walk:
  80 B (macro) vs 237 KB (for_each_child do-block) — the latter is one Core.Box per
  mutated captured local. Mirrors why node-based code uses @for_each_immediate_child.

7 new test_cursor cases (subtree bridge via offset + LazyNode; inlined nested
accumulation, minified); full suite green (Cursor 56, LazyNode 175, XMLTokenizer
122, W3C 754).

Ref: #61

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
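
The difference between the do-block and macro forms can be sketched without the cursor. `@each_item_sketch` is a hypothetical macro, not the package's `@for_each_child`; it expands to a plain loop so the body runs in the caller's scope instead of inside a closure.

```julia
# Hypothetical macro: expands to a `for` loop with the body spliced in.
macro each_item_sketch(var, iter, body)
    return quote
        for $(esc(var)) in $(esc(iter))
            $(esc(body))
        end
    end
end

function sum_with_closure(xs)
    total = 0
    foreach(xs) do x
        total += x      # `total` is captured and reassigned, so it is boxed
    end
    return total
end

function sum_with_macro(xs)
    total = 0
    @each_item_sketch x xs begin
        total += x      # ordinary local variable in an ordinary loop
    end
    return total
end

sum_with_closure([1, 2, 3])   # 6
sum_with_macro([1, 2, 3])     # 6
```

Both functions return the same result; the expected difference is only in allocation behaviour, which can be checked with `@allocated` on a given Julia version.
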
…ral walks

next!/for_each_child advance token-by-token, so a structural walk that classifies a
node but doesn't need its contents still tokenizes every skipped subtree. skip_element!
advances past an element's whole subtree in one byte scan (XMLTokenizer._skip_element_raw,
+ _scan_tag_end): counts element-nesting depth and respects CDATA / comment / PI /
quoted-`>` boundaries, emitting no internal tokens. O(subtree-bytes) but a far tighter
loop than full tokenization (no token emission, no SubString construction).

Measured (WRS-2 Document, 28k flat Placemarks): classify WITH skip 21 ms vs 70 ms
tokenizing the subtrees — ×3.4, and faster than the v0.3.8 next!() walk (~32 ms) too.
Robust: 16 new test_cursor cases (literal </tag> in CDATA/comments, > inside an attr
value, nested same-name, self-close, PI, minified) confirm skip lands exactly where
for_each_child's full walk does. Full suite green (Cursor 72, LazyNode 175,
XMLTokenizer 122, W3C 754).

For structural walks like FastKML's layer discovery (the WRS-2 deficit).
Ref: #61

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
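
A rough sketch of a depth-counting byte scan in the spirit described above. All names are hypothetical and this is not the package's `_skip_element_raw`; it assumes well-formed input and that `pos` points at the `<` of an opening tag.

```julia
# True when the bytes at index i spell out `lit`.
function startswith_at(bytes, i::Int, lit::String)
    lb = codeunits(lit)
    i + length(lb) - 1 <= length(bytes) || return false
    for k in eachindex(lb)
        bytes[i + k - 1] == lb[k] || return false
    end
    return true
end

# Index just past the next occurrence of `lit` at or after i.
function find_after(bytes, i::Int, lit::String)
    while i <= length(bytes)
        startswith_at(bytes, i, lit) && return i + ncodeunits(lit)
        i += 1
    end
    error("unterminated construct")
end

# Scan to the '>' that ends a tag, ignoring '>' inside quoted attribute values.
# Returns (index past '>', whether the tag is self-closing).
function scan_tag_end(bytes, i::Int)
    quote_byte = 0x00
    while i <= length(bytes)
        b = bytes[i]
        if quote_byte != 0x00
            b == quote_byte && (quote_byte = 0x00)
        elseif b == UInt8('"') || b == UInt8('\'')
            quote_byte = b
        elseif b == UInt8('>')
            return i + 1, bytes[i - 1] == UInt8('/')
        end
        i += 1
    end
    error("unterminated tag")
end

# Index just past the element whose opening '<' is at `pos`.
function skip_element_sketch(data::String, pos::Int)
    bytes = codeunits(data)
    depth = 0
    i = pos
    while i <= length(bytes)
        if bytes[i] != UInt8('<')
            i += 1                                   # character data
        elseif startswith_at(bytes, i, "<!--")
            i = find_after(bytes, i + 4, "-->")
        elseif startswith_at(bytes, i, "<![CDATA[")
            i = find_after(bytes, i + 9, "]]>")
        elseif startswith_at(bytes, i, "<?")
            i = find_after(bytes, i + 2, "?>")
        elseif startswith_at(bytes, i, "</")
            i = find_after(bytes, i + 2, ">")
            depth -= 1
            depth == 0 && return i
        else
            i, selfclosing = scan_tag_end(bytes, i + 1)
            selfclosing || (depth += 1)
            depth == 0 && return i
        end
    end
    error("unterminated element")
end

doc = """<r><a k="x>y"><!-- </a> --><![CDATA[</a>]]><a/></a><b/></r>"""
skip_element_sketch(doc, 4)   # lands on the '<' of <b/>
```

The scan builds no tokens and no `SubString`s; it only moves an index, which is the property the commit message attributes to the real implementation.
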
Cursor mirror of `is_simple_value(::LazyNode)`: returns the lone Text/CData value of the
current element (or `nothing` if it has attributes / isn't a single-text element).
Non-destructive — reads via `_rescan`, so the cursor position is unchanged and callers
still advance with `for_each_child` / `skip_element!`.

Lets hot streaming paths read a single-text element's value (e.g. an XLSX cell's `<v>`)
with no per-element `LazyNode` snapshot. Measured downstream on XLSX.jl's read path
(building `Cell` from the cursor instead of a per-cell `LazyNode`): readtable/eachrow on
numeric_only & dates_heavy drop ~40% allocations / ~35% memory, taking the v0.4 read
regression vs EzXML v0.10.4 from +15–18% back to ~parity (and below v0.10.4 in memory).
Output byte-identical (checksum-verified).

test/test_cursor.jl: +1 testset (matches LazyNode on text/entity/CDATA; `nothing` for
attrs/element-child/empty/mixed/non-element; non-destructive). Cursor suite 72 → 87.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…bom (#65) into v0.4 read path

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#56 corrected example.kml to valid <![CDATA[ on main; v0.4's tests still
asserted the old invalid <![CData[ behavior — a semantic merge conflict.
- example.kml testset: assert it reads as a valid Document; keep the
  invalid-spelling rejection via an inline parse() check.
- roundtrip suite: un-skip example.kml (verified write-stable, CDATA survives).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov-commenter commented Jun 23, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 94.38073% with 98 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.35%. Comparing base (2c869e3) to head (26a47d7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/XML.jl | 94.43% | 38 Missing ⚠️ |
| src/lazynode.jl | 92.00% | 30 Missing ⚠️ |
| src/cursor.jl | 90.79% | 15 Missing ⚠️ |
| src/XMLTokenizer.jl | 96.62% | 11 Missing ⚠️ |
| src/xpath.jl | 98.10% | 3 Missing ⚠️ |
| ext/XMLAbstractTreesExt.jl | 97.43% | 1 Missing ⚠️ |
Additional details and impacted files

@@             Coverage Diff             @@
##             main      #76       +/-   ##
===========================================
+ Coverage   74.28%   94.35%   +20.06%     
===========================================
  Files           3        6        +3     
  Lines         669     1753     +1084     
===========================================
+ Hits          497     1654     +1157     
+ Misses        172       99       -73     
| Files with missing lines | Coverage Δ |
|---|---|
| ext/XMLAbstractTreesExt.jl | 97.43% <97.43%> (ø) |
| src/xpath.jl | 98.10% <98.10%> (ø) |
| src/XMLTokenizer.jl | 96.62% <96.62%> (ø) |
| src/cursor.jl | 90.79% <90.79%> (ø) |
| src/lazynode.jl | 92.00% <92.00%> (ø) |
| src/XML.jl | 94.36% <94.43%> (+21.63%) ⬆️ |

The 1.9 floor came from package extensions (which need >=1.9), but it sat below
the LTS and was never exercised — CI runs lts(=1.10) + 1, not 1.9. Flooring at the
LTS makes the declared minimum match what CI actually tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mathieu17g

Collaborator Author

@TimG1964 — v0.4 now has an official integration branch on JuliaData: v0.4-dev. It's your feature-cursor-bitstype-token plus main's 0.3.x fixes and infrastructure, so the cursor API and foreach_attr are unchanged.

I dev'd XLSX's XML dependency to v0.4-dev and ran your cursor-xml-optimisation suite — all green, no code changes needed. When you have a moment, could you point that XML dependency at v0.4-dev instead of feature-cursor-bitstype-token? I'll keep the latter frozen until you've moved.

@TimG1964

Collaborator

Away this week, so will be a few days. Will do as soon as I can.

mathieu17g and others added 7 commits June 24, 2026 10:03
Names (element / attribute / PI / DTD) with non-ASCII characters — café, 日本語,
données — were rejected by the tokenizer, then hit StringIndexError once accepted.
Fixed at the three layers where a byte-level tokenizer hides the 1-byte = 1-char
assumption:
- acceptance: NAME_BYTE_TABLE + _dtd_is_name_char admit bytes/chars >= 0x80
- slicing: tag/PI/attr-name slices use prevind (not pos-1); _dtd_read_name
  advances with nextind
- accessors: tag_name / pi_target slice to lastindex (not ncodeunits)

Test-first: 6 new assertions (Unicode Support + DTD Parsing); promoted the two
@test_broken this resolves (pugixml CJK, libexpat UTF-8 names).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
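
The acceptance layer can be sketched as a 256-entry table built once from a slower predicate. The names are hypothetical and the ASCII set is an assumption; admitting every byte `>= 0x80` is deliberately looser than XML's NameChar production and only keeps a byte-level scanner from splitting a multi-byte character.

```julia
# Hypothetical predicate: ASCII name bytes plus every byte >= 0x80, so the
# lead and continuation bytes of multi-byte UTF-8 names are admitted.
function is_name_byte_slow(b::UInt8)
    return UInt8('a') <= b <= UInt8('z') || UInt8('A') <= b <= UInt8('Z') ||
           UInt8('0') <= b <= UInt8('9') ||
           b in (UInt8('_'), UInt8(':'), UInt8('-'), UInt8('.')) ||
           b >= 0x80
end

# 256-entry table indexed by byte value + 1 (Julia arrays are 1-based).
const NAME_BYTE_TABLE_SKETCH = [is_name_byte_slow(UInt8(i)) for i in 0:255]

is_name_byte(b::UInt8) = NAME_BYTE_TABLE_SKETCH[Int(b) + 1]

all_name_bytes(s::AbstractString) = all(is_name_byte, codeunits(s))

all_name_bytes("café")     # true
all_name_bytes("日本語")    # true
all_name_bytes("a b")      # false: the space is not a name byte
```
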
parse/read now reject ill-formed documents by default — multiple root elements,
non-whitespace text outside the root, and empty/invalid-start element names — via a
`wellformed = :lenient | :structural | :strict` keyword (default :structural). The level
is a `Val` type parameter, so :lenient's checks dead-code-eliminate and the default path's
per-token cost is unchanged.

Also: parse(::AbstractString) now strips a leading U+FEFF (BOM) character. The byte-level
read path already did this (_normalize_bom); the in-memory path left it as a stray top-level
Text node, surfaced once :structural rejected it.

:strict (content-level: -- in comments, empty PI target, out-of-range char refs) is carried
by the API but not yet implemented — follow-on.

Test-first: well-formedness testset in 'Spec 2.1' (rejections + legal-prolog guards + the
:lenient opt-out); the W3C catalog scrape (a multi-root fixture) opts to :lenient.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
test/test_libxml2_testcases.jl — 1578 lines, 156 testsets borrowed from libxml2 — existed
but no include() referenced it, so it executed zero assertions. Wire it into runtests.jl
beside the other reference-parser suites (pugixml, libexpat).

Three error-case tests asserted the pre-:structural lenient behavior (accept trailing text /
bare text / a stray DOCTYPE bracket as a Document) — cases where XML.jl historically diverged
from libxml2 by accepting ill-formed input. They now assert the current contract: the default
:structural rejects them (matching libxml2), and :lenient still accepts them. 246 assertions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
test_w3c.jl counted pass/fail but only @warn'd the outcome — the only real assertions were
tautological (nodetype==Document after a successful read) or a no-op (@test true), so the
suite passed regardless of how many W3C cases were mishandled.

Now it asserts, asymmetrically: every well-formed doc must parse (@test n_fail == 0 — 577/577),
and the not-well-formed rejection count carries a no-regression floor (@test n_pass >= 156).
XML.jl is non-validating, so it cannot reject the ~784 not-wf cases needing DTD/entity
validation; the floor ratchets up as :structural/:strict grow, and the live counts stay in
@info. Categorising the remaining gap is a follow-on audit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
:structural rejects document-shape errors; :strict now adds the content-level constraints, all
gated on `W === :strict` so :lenient/:structural dead-code-eliminate them:

- "--" within a comment (XML §2.5)
- an empty or non-Name processing-instruction target (XML §2.6) — reuses _is_name_start, so
  "xml-stylesheet" and other valid targets still parse
- a numeric character reference outside the XML §2.2 Char range — #x0, surrogates, > #x10FFFF.
  The range is checked explicitly, not via isvalid(Char,·), which accepts #x0 and other C0
  controls that XML forbids. The scan runs only when a token actually carries entities.

Completes the wellformed = :lenient | :structural | :strict ladder (the keyword was already
wired through parse/read).

Tests: per-construct :strict cases in the §2.4 / §2.5 / §2.6 spec testsets, each asserting the
:strict rejection and that :structural/:lenient still accept.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
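
A minimal sketch of the two ideas in this commit message: an explicit check against the XML 1.0 §2.2 Char production, and gating that check on a `Val` type parameter. The function names are hypothetical; only the `wellformed` keyword and its three levels come from the PR.

```julia
# XML 1.0 §2.2 Char production, checked explicitly on the code point.
function is_xml_char(cp::Integer)
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           0x20 <= cp <= 0xD7FF || 0xE000 <= cp <= 0xFFFD ||
           0x10000 <= cp <= 0x10FFFF
end

# Hypothetical check: the level is a type parameter, so for any W other than
# :strict the condition is a compile-time constant and the branch can be dropped.
function check_char_ref(::Val{W}, cp::Integer) where {W}
    if W === :strict
        is_xml_char(cp) || throw(ArgumentError("character reference out of range"))
    end
    return cp
end

# Keyword front end: turns the runtime Symbol into a type once per call.
check_char_ref_kw(cp; wellformed::Symbol = :structural) =
    check_char_ref(Val(wellformed), cp)

isvalid(Char, 0x00)    # true, which is why isvalid alone is not enough here
is_xml_char(0x00)      # false
check_char_ref_kw(0x00)   # accepted at the default :structural level
```

Calling `check_char_ref_kw(0x00; wellformed = :strict)` throws under this sketch, while `:lenient` and `:structural` return the code point unchanged.
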
Both W3C reads now pass wellformed=:strict (was the default :structural). Measured on the pinned
xmlts20130923 corpus:

- Well-formed (valid/invalid): 577/577 still parse — :strict has zero false-positives on
  real-world XML, the key safety check for the content-level rules.
- Not-well-formed: rejections rise 156 -> 169 (the syntactic ill-formedness :structural missed:
  -- in comments, bad PI targets, out-of-range char refs). Floor bumped to 169.

The remaining 771 not-rejected are validity errors (DTD/entity) outside a non-validating parser's
scope; categorising them stays a Phase 6.5 audit item.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Regression tests for already-shipped fixes, plus one clear-error addition:

- BOM decode (read path): UTF-16 LE/BE + UTF-8 BOM each decode to <a/> (guards _normalize_bom).
- escape(SubString): #60 — escape was String-specialized; the AbstractString fix is now pinned.
- UTF-16 without a BOM: _normalize_bom now raises "UTF-16 without a BOM is not well-formed (XML 1.0
  §4.3.3)" when no BOM matched but a NUL byte sits in the first two positions. Previously :structural
  still rejected it, but with a cryptic "invalid element name" (interleaved NULs derail tokenization);
  this names the real cause. Two comparisons, not an O(n) isvalid(String) scan.

The UTF-16-no-BOM tests assert the clear §4.3.3 message specifically — a bare @test_throws would
false-pass since :structural already throws.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
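
The two-comparison detection can be sketched as a classifier over the leading bytes. `classify_bom_sketch` is a hypothetical function that returns a label instead of decoding, so it is not a drop-in for `_normalize_bom`; the error text is the one quoted in the commit message.

```julia
# Hypothetical classifier: names the encoding implied by the leading bytes and
# rejects BOM-less UTF-16, which shows up as a NUL in the first two bytes
# ('<' encodes as 0x3C 0x00 in little-endian, 0x00 0x3C in big-endian).
function classify_bom_sketch(bytes::AbstractVector{UInt8})
    n = length(bytes)
    if n >= 3 && bytes[1] == 0xEF && bytes[2] == 0xBB && bytes[3] == 0xBF
        return :utf8_bom
    elseif n >= 2 && bytes[1] == 0xFF && bytes[2] == 0xFE
        return :utf16le_bom
    elseif n >= 2 && bytes[1] == 0xFE && bytes[2] == 0xFF
        return :utf16be_bom
    elseif n >= 2 && (bytes[1] == 0x00 || bytes[2] == 0x00)
        throw(ArgumentError("UTF-16 without a BOM is not well-formed (XML 1.0 §4.3.3)"))
    end
    return :no_bom
end

classify_bom_sketch(codeunits("<a/>"))                 # :no_bom
classify_bom_sketch(UInt8[0xFF, 0xFE, 0x3C, 0x00])     # :utf16le_bom
```

A test for the BOM-less case would capture the exception and check its message for "§4.3.3", for the reason given above: a bare `@test_throws` would also pass on an unrelated error.
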

mathieu17g commented Jun 24, 2026

Collaborator Author

What landed in 0001d6e..26a47d7 — parser correctness + test harness

Well-formedness enforcement — new wellformed = :lenient | :structural | :strict option (024ce24, 1aa1d9f)
The parser was accepting many ill-formed documents — multiple roots, non-whitespace text outside the root, empty/invalid names, and at the content level -- inside comments, empty/invalid PI targets, and out-of-range character references. It now rejects them, with the level selectable and :structural the default; :strict adds the content-level checks. The level is a Val type parameter, so a mode never pays for checks above its level — :lenient runs none at all.

Non-ASCII names (0001d6e)
<café>, <日本語/>, and non-ASCII attribute/PI/DTD names threw before — the name-byte table was ASCII-only, and a few token slices/accessors used byte arithmetic that broke mid-multibyte-character. They parse now. (Two @test_broken cases in the borrowed suites started passing once this landed and were promoted to real assertions.)

Test-harness hardening (ac86c55, f20bc4f, bfddb19)

  • The borrowed libxml2 suite (~240 cases) was in the repo but not included — it ran zero assertions. Now wired in.
  • The W3C conformance suite only @warned on mismatches, so it stayed green no matter what. It now asserts: every well-formed doc must parse, plus a no-regression floor on rejected ill-formed docs.
  • The W3C suite now runs at :strict. 577/577 valid docs still parse (zero false-positives from the new checks) and not-well-formed rejections rose 156 → 169.

BOM handling + regression guards (26a47d7)
UTF-16 without a BOM now raises a clear not well-formed (XML 1.0 §4.3.3) error instead of a cryptic downstream failure; regression tests pin BOM decoding, that error, and escape on SubString (#60).


4 participants