WIP: v0.4 — DO NOT MERGE (tracking)#76
Drops the underscore prefixes from internal names (module is unexported, the clutter was only needed back when these names leaked into XML.jl). Replaces the name-byte predicate with a 256-entry const lookup table.
Also fixes a 1-based indexing off-by-one in read_doctype_body: the '<!--' detection guarded with `pos >= 2` while reading `codeunit(data, pos - 2)`, which is codeunit 0 when pos == 2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
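The lookup-table replacement mentioned in this commit can be sketched as below. This is a minimal illustration, not the PR's code: `NAME_BYTE`, `is_name_byte`, and `slow_is_name_byte` are hypothetical names, and the accepted byte set (ASCII letters, digits, and `_ : - .`) is an assumption made for the example.

```julia
# Hypothetical predicate: which bytes may appear in a name (assumed set).
slow_is_name_byte(b::UInt8) =
    (UInt8('a') <= b <= UInt8('z')) || (UInt8('A') <= b <= UInt8('Z')) ||
    (UInt8('0') <= b <= UInt8('9')) ||
    b in (UInt8('_'), UInt8(':'), UInt8('-'), UInt8('.'))

# 256-entry table built once at load time; the index is byte value + 1 (1-based).
const NAME_BYTE = Bool[slow_is_name_byte(UInt8(i)) for i in 0:255]

# One table lookup replaces the chain of comparisons on the hot path.
@inline is_name_byte(b::UInt8) = @inbounds NAME_BYTE[Int(b) + 1]
```

Because a `UInt8` can only take 256 values, the `@inbounds` lookup cannot go out of range as long as the table has exactly 256 entries.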
tag, value, keys, and attributes on LazyNode now return
SubString{String} views into the source rather than allocating
fresh Strings, so traversing a large document lazily does not
duplicate its text data.
Introduces a small _as_substring helper to promote the String that
`unescape` can return into a SubString so Attributes stays homogeneous.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
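A minimal sketch of the promotion idea described in this commit, assuming a helper shaped like the one named above. `as_substring` is a hypothetical stand-in; the PR's `_as_substring` may differ in detail.

```julia
# Hypothetical stand-in for the helper described in the commit message.
as_substring(s::SubString{String}) = s                    # already a view: pass through
as_substring(s::String) = SubString(s, 1, lastindex(s))   # wrap the whole String

src = "<a k=\"v\"/>"
view_val  = SubString(src, 7, 7)   # "v": a view into the source text
owned_val = "x & y"                # the kind of String an unescape step may allocate

# Both values end up with the same concrete type, so the container stays homogeneous.
vals = [as_substring(view_val), as_substring(owned_val)]
```

With both branches returning `SubString{String}`, `vals` is a `Vector{SubString{String}}` rather than a vector of a `Union` type.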
_write_xml now inspects children before reformatting: if any Text
child has non-whitespace content (or any CData child exists), the
element is treated as mixed content and its whitespace is preserved
verbatim. Otherwise the writer drops the whitespace-only Text nodes
the parser emits for round-tripping source formatting and generates
fresh indentation. Same filter is applied at the Document level.
Also adds an unescape(::SubString{String}) specialization that
returns the input unchanged when it contains no '&', avoiding an
allocation on the lazy scanning path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
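The no-`&` fast path described in this commit can be illustrated as follows. This is a sketch under stated assumptions: `unescape_sketch` is a hypothetical name, and it handles only the five predefined XML entities, which may not match what the package's `unescape` covers.

```julia
# Hypothetical sketch of an unescape with an allocation-free fast path.
function unescape_sketch(s::SubString{String})
    occursin('&', s) || return s   # fast path: hand back the same view
    # Slow path: a single left-to-right pass, so "&amp;lt;" becomes "&lt;", not "<".
    return replace(s, "&lt;" => "<", "&gt;" => ">", "&quot;" => "\"",
                   "&apos;" => "'", "&amp;" => "&")
end
```

The return type is a `Union` of `SubString{String}` and `String`, which is the trade-off the fast path accepts to skip the copy.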
The medium-file workloads show a ~10–25% regression vs the numbers captured at 4a728ee ("Revamp benchmarks"). v0.4-vs-v0.3.8 remains a 70–80% improvement, so this is a post-release follow-up, not a release blocker. Suspected culprit is the eager Pair{S,S}[] alloc per TOKEN_OPEN_TAG introduced in 2f71f9a; see follow-up issue.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `Cursor`: a single mutable wrapper advanced in place over the token stream (the cursor-based StAX direction from #61). Closes the per-child `LazyNode` allocation gap of the lazy DOM walk by mutating one object instead of materializing a node per child.
Orthogonal/additive design:
- New file src/cursor.jl; seams are one include + 4 exports in XML.jl.
- `Cursor` and `LazyNode` are siblings on the shared XMLTokenizer foundation. The cursor's accessors rest on the token-layer primitives (tag_name, attr_value, pi_target, unescape); they never call LazyNode or its accessors, so DOM-layer changes don't affect the cursor. The token→value logic is intentionally duplicated rather than shared, to keep this purely additive (a later refactor can factor it out).
API: next!, for_each_child, nodetype/tag/value/attributes/depth/eof, get, the Base.iterate pull-mode surface, and LazyNode(c) as a one-way snapshot bridge for the aliasing contract (the cursor is reused in place; reads are synchronous-safe, retention requires a snapshot).
Tests: test/test_cursor.jl (46 cases): depth model on hand-counted docs, for_each_child, attributes/get, CData/Comment/PI/DTD/entity values, accessor agreement with LazyNode node-for-node, snapshot survival, iterator protocol. Full suite passes.
Perf (N=100k synth, vs the lazy-walk techniques in #61): Cursor next!() DFS = 103 ms / 305 MiB / 4.0M allocs, vs v0.4 eachchildnode/recursive ~310-390 ms / ~1 GiB / 12-15M (×3 faster, ×3.4 less memory). It does not yet reach the v0.3.8+#59 next!()-DFS class (57 ms / 123 MiB): the residual ~1 alloc/token is the non-isbits Token tuple at the iterate boundary, which a follow-up bitstype-Token change removes.
Ref: #61
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ocation
Replaces `Token{S}`'s `raw::SubString{S}` field with a plain byte range —
`(kind, has_entities, offset::Int, ncodeunits::Int)` — making `Token`
non-parametric and isbits (24 bytes). The `(Token, TokenizerState)` tuple
returned by `iterate` is now isbits, so it returns in registers/sret with no
heap allocation even though the tokenizer body is too large to inline. This
removes the per-token allocation that was the cursor's residual cost (see #61).
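The isbits property this change relies on can be checked directly. The two structs below are hypothetical stand-ins for the old and new token shapes; the field types, `kind::UInt8` in particular, are assumptions made for the example.

```julia
# Old shape (hypothetical): the token carries a view, which holds a String reference.
struct ViewToken
    kind::UInt8
    raw::SubString{String}
end

# New shape (hypothetical): a plain byte range into the source, no references.
struct RangeToken
    kind::UInt8
    has_entities::Bool
    offset::Int
    ncodeunits::Int
end

old_is_bits   = isbitstype(ViewToken)                 # false: contains a heap reference
new_is_bits   = isbitstype(RangeToken)                # true: only plain bits fields
tuple_is_bits = isbitstype(Tuple{RangeToken, Int})    # true: a tuple of isbits is isbits
```

The `Int` in the tuple stands in for the tokenizer state, which the commit message describes as also being non-parametric after this change.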
Token API:
- `raw(token, data) -> SubString` reconstructs the text view from the source.
Multibyte-safe: it lands the end index on the START of the last char via
`prevind` (a naive `SubString(data, off+1, off+ncu)` passes a UTF-8
continuation byte as the end index and throws — verified on "aé"/"日本").
`_token_root` resolves `data::SubString` to its parent (offsets are
root-relative). This matters for the UTF-16 path of #62, whose fix
transcodes to a UTF-8 String upstream of the tokenizer → dense multibyte.
- Emit-site constructors `Token(kind, view)` / `Token(kind, has_amp, view)`
keep only the view's range, so all 22 tokenizer emit sites are unchanged.
- `tag_name` / `attr_value` / `pi_target` now take `(token, data)`.
- `TokenizerState` and `StatefulTokenizer.state` drop the `{S}` parameter
(the buffered `pending` Token is non-parametric); `has_pending` tests
`pending.ncodeunits != 0`; `show(::Token)` prints `KIND @offset+len`.
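The multibyte pitfall described for `raw` can be reproduced with a few lines. This is a sketch, not the package's implementation: `raw_view` is a hypothetical name, and it takes the offset and length directly rather than a `Token`.

```julia
# Hypothetical reconstruction of a text view from a 0-based byte offset and a byte count.
function raw_view(data::String, offset::Int, ncu::Int)
    ncu == 0 && return SubString(data, offset + 1, offset)   # empty view
    # The end index must be the START of the last character, not its last byte.
    stop = prevind(data, offset + ncu + 1)
    return SubString(data, offset + 1, stop)
end

# The naive form passes the last byte as the end index and throws on multibyte text.
naive_throws = try
    SubString("aé", 1, 3)
    false
catch err
    err isa StringIndexError
end
```

For `"aé"` the bytes are `a` (1 byte) and `é` (2 bytes), so byte 3 is a continuation byte and `prevind("aé", 4)` returns 2, the valid index of `é`.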
Consumers thread `data` (`tok.raw` → `raw(tok, data)`): src/XML.jl (eager
_parse), src/lazynode.jl (LazyNode + iterators; `_lazy_pos`/`_token_end`
simplify to direct field access; `LazyAttrIterator` reaches the source via a
small `_src(iter)` helper since it carries only the tokenizer), src/cursor.jl.
xpath.jl needs no change (it uses a distinct `XPathToken` type).
Tests:
- Revives test/test_tokenizer.jl (was orphaned — not in runtests, and its
`using XML.XMLTokenizer` did not import the names so it could not run).
Fixed imports, migrated all `.raw`/accessor sites to thread the source,
updated the `show` test (no longer prints text), and wired it into
runtests.jl. Its multibyte cases (café/über/héllo/日本語) now guard the
`raw()` round-trip in CI.
- Full suite green, byte-identical to baseline: LazyNode 175/175,
XMLTokenizer 122/122, Cursor 46/46, XPath 66/66, W3C 559/577 wf +
195/940 not-wf (unchanged counts — Token is representational, the
accept/reject scan logic is untouched).
Measured (N=100k synth placemarks, @benchmark seconds=3, Julia 1.12.6):
- Cursor advance-only: 305 MiB/4.0M allocs → 0.00 MiB / 1 alloc.
- Cursor full value-extraction: 103 ms/305 MiB → 83 ms / 30.5 MiB / 1.0M:
memory is below the tech-4 target (57 ms/123 MiB), achieving #61's memory goal, though time (83 ms) is not. The
residual 30 MiB is the `value()::Union{SubString,String}` boxing (one per
text node) — orthogonal, a separate monomorphization micro-opt.
This modifies the core `Token` type, so it is NOT orthogonal/additive: it
needs coordination with the maintainer and rebasing onto #54 before any
upstream merge. Develop in parallel on this stacked branch.
Ref: #61, #62
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Nested for_each_child silently skipped a parent's second (and later) subtrees when the source had no inter-element whitespace (minified XML): the inner sweep broke on the boundary node by consuming it (next!() at the top of the loop), then the enclosing sweep's next!() advanced past that same node. Whitespace text nodes between elements accidentally masked the bug by serving as a throwaway boundary; minified machine-generated XML (common for KML) has none.
Fix: make the cursor peekable via a `held` flag. On reaching the end of its subtree a sweep sets `c.held` instead of consuming the boundary node; the next `next!` re-yields the held node without advancing, so the enclosing sweep sees it. Composition is then correct for full DFS at any depth, independent of whitespace.
Verified by 3 new test_cursor cases (minified + whitespaced + 3-level DFS); full suite green (Cursor 49, LazyNode 175, XMLTokenizer 122, W3C 754).
This is a correctness fix for the Phase-1 cursor; it is committed here on the stacked bitstype-Token branch but logically belongs on feature-cursor. Move or reorder when restructuring for the upstream PR stack.
Ref: #61
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
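The `held` mechanism can be modelled on a toy cursor. Everything below is a hypothetical miniature (`MiniCursor`, a flat list of `(depth, name)` nodes), not the package's `Cursor`; it only illustrates why re-yielding the boundary node makes nested sweeps compose.

```julia
mutable struct MiniCursor
    nodes::Vector{Tuple{Int,String}}   # (depth, name) in document order
    i::Int                             # index of the current node (0 = before start)
    held::Bool                         # true: next! re-yields the current node
end
MiniCursor(nodes) = MiniCursor(nodes, 0, false)

depth(c::MiniCursor) = c.nodes[c.i][1]
name(c::MiniCursor)  = c.nodes[c.i][2]

function next!(c::MiniCursor)
    if c.held
        c.held = false            # re-yield without advancing
    else
        c.i += 1
    end
    return c.i <= length(c.nodes)
end

function for_each_child(f, c::MiniCursor)
    parent_depth = depth(c)
    while next!(c)
        d = depth(c)
        if d <= parent_depth
            c.held = true         # boundary node: leave it for the enclosing sweep
            return
        end
        d == parent_depth + 1 && f(c)   # deeper nodes are skipped unless f descended
    end
    return
end

# Two-level walk over a "minified" document: no whitespace nodes between elements.
function walk(nodes)
    c = MiniCursor(nodes)
    seen = String[]
    next!(c)                      # move onto the root
    for_each_child(c) do child
        push!(seen, name(child))
        for_each_child(child) do grandchild
            push!(seen, name(grandchild))
        end
    end
    return seen
end

visited = walk([(0, "root"), (1, "a"), (2, "a1"), (1, "b"), (2, "b1")])
```

If the inner sweep consumed the boundary node `b` instead of holding it, the outer sweep's next `next!` would land on `b1` and `b` would never be visited.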
…ch_child
Support for driving the cursor from a known subtree position (Phase 3 wiring):
- Cursor(data, startpos::Integer): primitive cursor whose token stream starts at a byte offset instead of the document start, for walking a subtree whose start is known. LazyNode-agnostic. Cursor(node::LazyNode) becomes a thin, removable convenience over it (the only place Cursor mentions LazyNode), the inverse of the LazyNode(c) snapshot. for_each_child auto-stops at the subtree boundary.
- @for_each_child c child body: macro form of for_each_child that INLINES the body (not a closure), so a body accumulating into enclosing locals avoids the capture-boxing a do-block incurs. Measured on a 5k-placemark accumulating walk: 80 B (macro) vs 237 KB (for_each_child do-block); the latter is one Core.Box per mutated captured local. Mirrors why node-based code uses @for_each_immediate_child.
7 new test_cursor cases (subtree bridge via offset + LazyNode; inlined nested accumulation, minified); full suite green (Cursor 56, LazyNode 175, XMLTokenizer 122, W3C 754).
Ref: #61
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ral walks
next!/for_each_child advance token-by-token, so a structural walk that classifies a node but doesn't need its contents still tokenizes every skipped subtree. skip_element! advances past an element's whole subtree in one byte scan (XMLTokenizer._skip_element_raw, + _scan_tag_end): counts element-nesting depth and respects CDATA / comment / PI / quoted-`>` boundaries, emitting no internal tokens. O(subtree-bytes) but a far tighter loop than full tokenization (no token emission, no SubString construction).
Measured (WRS-2 Document, 28k flat Placemarks): classify WITH skip 21 ms vs 70 ms tokenizing the subtrees (×3.4), and faster than the v0.3.8 next!() walk (~32 ms) too.
Robust: 16 new test_cursor cases (literal </tag> in CDATA/comments, > inside an attr value, nested same-name, self-close, PI, minified) confirm skip lands exactly where for_each_child's full walk does. Full suite green (Cursor 72, LazyNode 175, XMLTokenizer 122, W3C 754).
For structural walks like FastKML's layer discovery (the WRS-2 deficit).
Ref: #61
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cursor mirror of `is_simple_value(::LazyNode)`: returns the lone Text/CData value of the current element (or `nothing` if it has attributes / isn't a single-text element). Non-destructive: it reads via `_rescan`, so the cursor position is unchanged and callers still advance with `for_each_child` / `skip_element!`. Lets hot streaming paths read a single-text element's value (e.g. an XLSX cell's `<v>`) with no per-element `LazyNode` snapshot.
Measured downstream on XLSX.jl's read path (building `Cell` from the cursor instead of a per-cell `LazyNode`): readtable/eachrow on numeric_only & dates_heavy drop ~40% allocations / ~35% memory, taking the v0.4 read regression vs EzXML v0.10.4 from +15–18% back to ~parity (and below v0.10.4 in memory). Output byte-identical (checksum-verified).
test/test_cursor.jl: +1 testset (matches LazyNode on text/entity/CDATA; `nothing` for attrs/element-child/empty/mixed/non-element; non-destructive). Cursor suite 72 → 87.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ttribute allocations by 36%
…bom (#65) into v0.4 read path Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#56 corrected example.kml to valid <![CDATA[ on main; v0.4's tests still asserted the old invalid <![CData[ behavior, which is a semantic merge conflict.
- example.kml testset: assert it reads as a valid Document; keep the invalid-spelling rejection via an inline parse() check.
- roundtrip suite: un-skip example.kml (verified write-stable, CDATA survives).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #76       +/-   ##
===========================================
+ Coverage   74.28%   94.35%   +20.06%
===========================================
  Files           3        6        +3
  Lines         669     1753     +1084
===========================================
+ Hits          497     1654     +1157
+ Misses        172       99       -73
The 1.9 floor came from package extensions (which need >=1.9), but it sat below the LTS and was never exercised: CI runs lts(=1.10) + 1, not 1.9. Flooring at the LTS makes the declared minimum match what CI actually tests.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@TimG1964, v0.4 now has an official integration branch on JuliaData: I dev'd XLSX's
Away this week, so will be a few days. Will do as soon as I can.
Names (element / attribute / PI / DTD) with non-ASCII characters (café, 日本語, données) were rejected by the tokenizer, then hit StringIndexError once accepted. Fixed at the three layers where a byte-level tokenizer hides the 1-byte = 1-char assumption:
- acceptance: NAME_BYTE_TABLE + _dtd_is_name_char admit bytes/chars >= 0x80
- slicing: tag/PI/attr-name slices use prevind (not pos-1); _dtd_read_name advances with nextind
- accessors: tag_name / pi_target slice to lastindex (not ncodeunits)
Test-first: 6 new assertions (Unicode Support + DTD Parsing); promoted the two @test_broken this resolves (pugixml CJK, libexpat UTF-8 names).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parse/read now reject ill-formed documents by default (multiple root elements, non-whitespace text outside the root, and empty/invalid-start element names) via a `wellformed = :lenient | :structural | :strict` keyword (default :structural). The level is a `Val` type parameter, so :lenient's checks dead-code-eliminate and the default path's per-token cost is unchanged.
Also: parse(::AbstractString) now strips a leading U+FEFF (BOM) character. The byte-level read path already did this (_normalize_bom); the in-memory path left it as a stray top-level Text node, surfaced once :structural rejected it.
:strict (content-level: -- in comments, empty PI target, out-of-range char refs) is carried by the API but not yet implemented; that is a follow-on.
Test-first: well-formedness testset in 'Spec 2.1' (rejections + legal-prolog guards + the :lenient opt-out); the W3C catalog scrape (a multi-root fixture) opts to :lenient.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
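The `Val`-parameter gating described in this commit can be sketched on a single check. The names `check_roots` and `parse_sketch`, the `:root_open` token kind, and the `ArgumentError` type are all assumptions made for the example, not the package's internals.

```julia
# Hypothetical sketch: the level is a type parameter, so `W !== :lenient` is a
# compile-time constant inside each specialization and the branch can be removed.
function check_roots(kinds::Vector{Symbol}, ::Val{W}) where {W}
    roots = 0
    for k in kinds
        k === :root_open || continue
        roots += 1
        if W !== :lenient && roots > 1
            throw(ArgumentError("multiple root elements"))
        end
    end
    return roots
end

# The keyword is converted to a Val once at the entry point.
parse_sketch(kinds; wellformed::Symbol = :structural) =
    check_roots(kinds, Val(wellformed))

lenient_roots = parse_sketch([:root_open, :text, :root_open]; wellformed = :lenient)
```

The conversion `Val(wellformed)` costs one dynamic dispatch per parse call; the per-token loop then runs in a specialization where the level is known.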
test/test_libxml2_testcases.jl (1578 lines, 156 testsets borrowed from libxml2) existed but no include() referenced it, so it executed zero assertions. Wire it into runtests.jl beside the other reference-parser suites (pugixml, libexpat).
Three error-case tests asserted the pre-:structural lenient behavior (accept trailing text / bare text / a stray DOCTYPE bracket as a Document), cases where XML.jl historically diverged from libxml2 by accepting ill-formed input. They now assert the current contract: the default :structural rejects them (matching libxml2), and :lenient still accepts them. 246 assertions.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
test_w3c.jl counted pass/fail but only @warn'd the outcome; the only real assertions were tautological (nodetype==Document after a successful read) or a no-op (@test true), so the suite passed regardless of how many W3C cases were mishandled.
Now it asserts, asymmetrically: every well-formed doc must parse (@test n_fail == 0, i.e. 577/577), and the not-well-formed rejection count carries a no-regression floor (@test n_pass >= 156). XML.jl is non-validating, so it cannot reject the ~784 not-wf cases needing DTD/entity validation; the floor ratchets up as :structural/:strict grow, and the live counts stay in @info. Categorising the remaining gap is a follow-on audit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
:structural rejects document-shape errors; :strict now adds the content-level constraints, all gated on `W === :strict` so :lenient/:structural dead-code-eliminate them:
- "--" within a comment (XML §2.5)
- an empty or non-Name processing-instruction target (XML §2.6); reuses _is_name_start, so "xml-stylesheet" and other valid targets still parse
- a numeric character reference outside the XML §2.2 Char range: #x0, surrogates, > #x10FFFF. The range is checked explicitly, not via isvalid(Char,·), which accepts #x0 and other C0 controls that XML forbids. The scan runs only when a token actually carries entities.
Completes the wellformed = :lenient | :structural | :strict ladder (the keyword was already wired through parse/read).
Tests: per-construct :strict cases in the §2.4 / §2.5 / §2.6 spec testsets, each asserting the :strict rejection and that :structural/:lenient still accept.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
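The explicit range check mentioned for character references can be written straight from the XML 1.0 §2.2 `Char` production. `is_xml_char` is a hypothetical name; the ranges themselves come from the specification.

```julia
# XML 1.0 §2.2: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
is_xml_char(cp::Integer) =
    cp == 0x9 || cp == 0xA || cp == 0xD ||
    0x20 <= cp <= 0xD7FF ||
    0xE000 <= cp <= 0xFFFD ||
    0x10000 <= cp <= 0x10FFFF

# Julia's own validity check is about Unicode scalar values, so it admits U+0000.
julia_accepts_nul = isvalid(Char, UInt32(0))
```

The difference matters for references such as `&#0;` or `&#x1;`: they name valid Unicode scalar values, yet fall outside what XML 1.0 allows in a document.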
Both W3C reads now pass wellformed=:strict (was the default :structural). Measured on the pinned xmlts20130923 corpus:
- Well-formed (valid/invalid): 577/577 still parse. :strict has zero false-positives on real-world XML, the key safety check for the content-level rules.
- Not-well-formed: rejections rise 156 -> 169 (the syntactic ill-formedness :structural missed: -- in comments, bad PI targets, out-of-range char refs). Floor bumped to 169.
The remaining 771 not-rejected are validity errors (DTD/entity) outside a non-validating parser's scope; categorising them stays a Phase 6.5 audit item.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Regression tests for already-shipped fixes, plus one clear-error addition:
- BOM decode (read path): UTF-16 LE/BE + UTF-8 BOM each decode to <a/> (guards _normalize_bom).
- escape(SubString): #60. escape was String-specialized; the AbstractString fix is now pinned.
- UTF-16 without a BOM: _normalize_bom now raises "UTF-16 without a BOM is not well-formed (XML 1.0 §4.3.3)" when no BOM matched but a NUL byte sits in the first two positions. Previously :structural still rejected it, but with a cryptic "invalid element name" (interleaved NULs derail tokenization); this names the real cause. Two comparisons, not an O(n) isvalid(String) scan.
The UTF-16-no-BOM tests assert the clear §4.3.3 message specifically; a bare @test_throws would false-pass since :structural already throws.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
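The two-comparison detection described for BOM-less UTF-16 might look like the sketch below. `check_bomless_utf16` is a hypothetical name, the `ArgumentError` type is an assumption, and the function presumes the caller has already tried and failed to match a BOM.

```julia
# Hypothetical sketch: call only after BOM matching has failed.
# ASCII-range markup encoded as UTF-16 puts a NUL in byte 1 (BE) or byte 2 (LE).
function check_bomless_utf16(bytes::AbstractVector{UInt8})
    if length(bytes) >= 2 && (bytes[1] == 0x00 || bytes[2] == 0x00)
        throw(ArgumentError("UTF-16 without a BOM is not well-formed (XML 1.0 §4.3.3)"))
    end
    return bytes
end

utf8_ok = check_bomless_utf16(codeunits("<a/>"))   # plain UTF-8 passes through
```

Only the first two bytes are inspected, so the cost is constant regardless of document size.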
What landed in
WIP: do not merge. Tracking + collaboration PR for the v0.4 line (supersedes #54).
`v0.4-dev` = #54's rewrite + the cursor stack + ongoing work.
Highlights
A ground-up rewrite of XML.jl's internals initiated by @joshday: a token-based streaming parser, a pull/cursor (StAX-style) API, XPath support, and substantial parse/read speedups.
Done so far
- `main`'s 0.3.x fixes & infrastructure, combined by merging `main` in (merge-only, so the cursor branch stays usable for downstreams that build on it).
- Non-ASCII names (`<café>`, `<日本語/>`) now parse correctly.
- `wellformed = :lenient | :structural | :strict` option (default `:structural`): rejects multiple roots, non-whitespace text outside the root, and empty/invalid names; `:strict` additionally rejects `--` inside comments, empty/invalid PI targets, and character references outside the legal character range (XML 1.0 §2.2). The level is a compile-time type parameter, so a mode never pays for checks above its level; `:lenient` runs none at all.
- BOM-prefixed input is decoded on `read`; a leading BOM is stripped on `parse`; UTF-16 without a BOM, which XML 1.0 §4.3.3 ("Character Encoding in Entities") forbids, now raises a clear error instead of failing obscurely downstream.
- The W3C suite now runs under `:strict`.
- `escape` on `SubString` (#60, "escape should work with AbstractString"), BOM decoding, and the UTF-16-without-BOM error are pinned by tests.
- From `main`: CI bumps, codecov, the W3C-suite cache, and the 0.3.9 CHANGELOG fold-in.
Still ahead
- Benchmarks: measure `v0.4-dev` vs the current release before quoting figures.
- A migration guide for dependent packages.
Issues addressed
Breaking changes & impact on dependent packages
The low-level streaming API (`Raw`, `next`/`prev`, single-argument `parent`/`depth`, `nodes_equal`, `escape!`/`unescape!`) is removed in favour of the token parser and cursor API. The high-level DOM API is largely preserved, but note some structural and behavioural changes:
- `Node` is now parametric (`Node{S}`); `attributes` is a `Vector{Pair}` and `children` may be `nothing`.
- `parse`/`read` decode entities into values, so `value()` returns `&`, not `&amp;`.
- `write` auto-escapes text and attributes (double-escape risk if you pre-escape).
Dependents that only use the high-level DOM mostly need a `[compat]` bump to `0.4` plus a spot-check of those behavioural changes; those using the removed low-level API (notably XLSX.jl) need code changes. A migration guide is planned (see "Still ahead").
Performance
v0.4 aims at a substantial parse/read speedup over 0.3.x; figures to be (re-)measured on `v0.4-dev` vs the current release before quoting.
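The double-escape risk noted under the breaking changes can be seen with a stand-alone escaping function. `escape_sketch` is a hypothetical stand-in that covers only `&`, `<`, and `>`; it is not the package's `escape`.

```julia
# Hypothetical stand-in for an escaping step applied by a writer.
escape_sketch(s::AbstractString) =
    replace(s, "&" => "&amp;", "<" => "&lt;", ">" => "&gt;")

once  = escape_sketch("a & b")   # what an auto-escaping writer emits for raw text
twice = escape_sketch(once)      # what it emits if the caller pre-escaped the text
```

Here `once` is `a &amp; b` and `twice` is `a &amp;amp; b`, which reads back as the literal text `a &amp; b` rather than `a & b`. Code that escaped values by hand before writing under 0.3.x is the case to spot-check.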