[bugfix] Speak decoded UTF-8 resource names on the wire #54

Open
joewiz wants to merge 3 commits into eXist-db:develop from joewiz:fix/resource-name-encoding

joewiz (Member) commented Jun 10, 2026

[This PR was co-authored with Claude Code. -Joe]

Summary

The /api/db endpoints spoke eXist's stored, percent-encoded form on the wire. This PR makes the API speak decoded UTF-8 in both directions: a client sends and receives café-ünïcode.xml, never caf%C3%A9-%C3%BCn%C3%AFcode.xml.

eXide and TEI Publisher already work this way (decode for display, encode before storage). This brings existdb-openapi in line, so a name from a listing can be echoed straight back as an operation path and always resolve to the same resource.

What changed

modules/db.xqm — two boundary helpers, applied uniformly across every handler:

  • db:to-stored — an incoming wire path → the stored form, applied once at the top of each handler before any doc()/collection()/xmldb:*-available/xmldb:* call. It uses fn:iri-to-uri.
  • db:to-display — a stored path/name → the decoded wire form, applied to every name and path leaving the API.
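The boundary pattern described above can be sketched roughly as follows. This is a Python model, not the XQuery in modules/db.xqm: `to_stored`, `to_display`, the in-memory `STORE` dict, and the three handler functions are hypothetical stand-ins, and `urllib.parse` only approximates `fn:iri-to-uri` and `xmldb:decode-uri`.

```python
from urllib.parse import quote, unquote_plus

# Characters fn:iri-to-uri leaves unescaped, besides ASCII letters and digits.
# "%" is among them, so existing %XX escapes pass through unchanged.
IRI_SAFE = "#-_.!~*'();/?:@&=+$,[]%"

def to_stored(wire_path):
    """Model of db:to-stored: decoded wire path -> stored (escaped) form."""
    return quote(wire_path, safe=IRI_SAFE)

def to_display(stored):
    """Model of db:to-display: stored form -> decoded wire form."""
    # Shield a literal "+" so the form-style decoder cannot turn it into a space.
    return unquote_plus(stored.replace("+", "%2B"))

STORE = {}  # hypothetical stand-in for the database, keyed by stored name

def put_resource(wire_path, content):
    stored = to_stored(wire_path)  # encode once, at the top of the handler
    STORE[stored] = content

def get_resource(wire_path):
    return STORE[to_stored(wire_path)]

def list_resources():
    # Every name leaving the API is decoded.
    return sorted(to_display(name) for name in STORE)

put_resource("café-ünïcode.xml", "<a/>")
print(list(STORE))                        # ['caf%C3%A9-%C3%BCn%C3%AFcode.xml']
print(list_resources())                   # ['café-ünïcode.xml']
print(get_resource(list_resources()[0]))  # <a/>
```

A name taken from the listing can be passed straight back to `get_resource`, which is the round-trip property the PR is after.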

Why fn:iri-to-uri and not xmldb:encode

iri-to-uri is the inbound encoder because it reproduces exactly what eXist's storage layer writes. Verified against a live instance:

| name | xmldb:store stores | fn:iri-to-uri | xmldb:encode |
| --- | --- | --- | --- |
| café.xml | caf%C3%A9.xml | caf%C3%A9.xml | caf%C3%A9.xml |
| quote'name.xml | quote'name.xml | quote'name.xml | quote%27name.xml |
| a&b.xml | a&b.xml | a&b.xml | a%26b.xml |
| a b.xml | throws FORG0001 | a%20b.xml | a%20b.xml |

iri-to-uri percent-encodes spaces and non-ASCII but leaves sub-delims (' & + @) and existing %XX untouched. So it (a) matches the stored form xmldb:store produces — including the literal-sub-delim names xmldb:store leaves un-encoded, which xmldb:encode would instead store as %27/%26 and then fail to resolve; (b) additionally encodes the space that xmldb:store rejects outright; and (c) is idempotent on already-encoded input, so older clients still sending caf%C3%A9.xml keep working with no double-encoding.
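The comparison can be reproduced with a small Python model. Both functions are approximations introduced for illustration: `iri_to_uri` mimics the unescaped-character set of `fn:iri-to-uri`, and `full_encode` is an assumed stand-in for the `xmldb:encode` column of the table, not eXist's actual implementation.

```python
from urllib.parse import quote

# Left unescaped by fn:iri-to-uri, in addition to ASCII letters and digits.
IRI_SAFE = "#-_.!~*'();/?:@&=+$,[]%"

def iri_to_uri(name):
    """Approximation of fn:iri-to-uri: escape spaces and non-ASCII only."""
    return quote(name, safe=IRI_SAFE)

def full_encode(name):
    """Assumed stand-in for xmldb:encode: escape everything but unreserved characters."""
    return quote(name, safe="")

for name in ["café.xml", "quote'name.xml", "a&b.xml", "a b.xml"]:
    print(name, iri_to_uri(name), full_encode(name), sep=" | ")

# Idempotence: an already-encoded name passes through unchanged.
once = iri_to_uri("café.xml")
assert iri_to_uri(once) == once
```

Because `%` is in the unescaped set, re-encoding is a no-op, which is what lets older clients that still send encoded paths keep working.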

On the decode side, db:to-display protects a literal + (as %2B) before calling xmldb:decode-uri. xmldb:decode-uri otherwise form-decodes + to a space (the x-www-form-urlencoded convention; eXist-db/exist#1824), but a + in a stored name is always a literal + (spaces are stored as %20), so without this protection naïve+test.xml would read back as naïve test.xml. This mirrors URIUtils.decodeForURI (the core fix in eXist-db/exist#6451) at the API layer, so it is correct independent of the core build and forward-compatible once #6451 lands.
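A minimal Python sketch of the `+` protection, assuming `urllib.parse.unquote_plus` as a stand-in for a decoder that follows the form convention (as the PR says `xmldb:decode-uri` does). The function names are hypothetical.

```python
from urllib.parse import unquote_plus

def form_decode(stored):
    """Stand-in for a form-style decoder: "+" becomes a space, %XX is unescaped."""
    return unquote_plus(stored)

def to_display(stored):
    """Model of db:to-display: shield literal "+" as %2B, then decode."""
    return form_decode(stored.replace("+", "%2B"))

stored = "na%C3%AFve+test.xml"
print(form_decode(stored))  # naïve test.xml  (the "+" is lost)
print(to_display(stored))   # naïve+test.xml  (the "+" survives)

# Spaces are stored as %20, so they are unaffected by the protection step.
print(to_display("caf%C3%A9%20d%C3%A9j%C3%A0.xml"))  # café déjà.xml
```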

URLs in responses (e.g. runPath) stay encoded, since they are clickable links, not operation keys.

Test plan

  • New Cypress coverage (db.cy.js) — store / read / list / remove a café déjà.xml, an o'brien.xml, and a naïve+test.xml by their decoded names, asserting listing names are never percent-encoded. (The existing db tests use ASCII names and don't exercise this boundary.)
  • Real-client end-to-end (existdb-oxygen-plugin, against a beta3 instance carrying this fix): display, read, create, rename, and move/copy all round-trip on space, non-ASCII, CJK, Cyrillic, &, parentheses, and apostrophe — with zero plugin-side encode/decode logic.
  • On-disk verification (bypassing the API): every stored name is the exact iri-to-uri form of the user-typed name, all resolve, and the full decoded-path round-trip returns content (including Cyrillic).

Known boundary (out of scope, documented)

  • Literal % can't be disambiguated from a percent-escape in Phase 1 (needs a bijective encoding) — iri-to-uri leaves a literal % untouched. This is the one remaining edge of the broader resource-naming contract work; this PR is the existdb-openapi piece of it.

(The literal-+ case is handled here directly; see the decode note above.)
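The literal-% ambiguity can be illustrated with the same kind of Python model (again an approximation of `fn:iri-to-uri`, with a hypothetical `to_stored` helper): two different user-typed names map to the same stored form, so the encoding is not bijective.

```python
from urllib.parse import quote

# Left unescaped by fn:iri-to-uri, in addition to ASCII letters and digits.
IRI_SAFE = "#-_.!~*'();/?:@&=+$,[]%"

def to_stored(wire_path):
    """Model of db:to-stored; "%" is left untouched."""
    return quote(wire_path, safe=IRI_SAFE)

typed_with_space = "a b.xml"
typed_with_percent = "a%20b.xml"  # the user literally typed "%", "2", "0"

# Both names collapse to the same stored form, so they cannot be told apart.
print(to_stored(typed_with_space))    # a%20b.xml
print(to_stored(typed_with_percent))  # a%20b.xml
assert to_stored(typed_with_space) == to_stored(typed_with_percent)
```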

joewiz and others added 3 commits June 9, 2026 20:22
The /api/db endpoints spoke eXist's stored, percent-encoded form on the
wire: listings returned "caf%C3%A9.xml", so the existdb-oxygen-plugin and
other clients displayed encoded names and could not tell which form was
canonical. eXide and TEI Publisher already decode for display and encode
before storage; this brings existdb-openapi in line.

Every handler now encodes the incoming wire path once (db:to-stored)
before any doc()/collection()/xmldb:* call, and decodes every name and
path leaving the API (db:to-display).

fn:iri-to-uri is used inbound (not xmldb:encode) because it matches
eXist's own storage escaping: it percent-encodes spaces and non-ASCII but
leaves sub-delims (' & + @) and existing %XX untouched. Verified against a
live instance: xmldb:store("café.xml") -> caf%C3%A9.xml and
store("quote'name.xml") -> quote'name.xml (literal), while xmldb:store
throws outright on a raw space. iri-to-uri reproduces store's output where
store succeeds, additionally encodes the space store rejects, and is
idempotent on already-encoded paths -- so older encoded-path clients keep
working. xmldb:encode would fully percent-encode the sub-delims (RFC 3986
style) and so fail to resolve names that xmldb:store left literal.

Proven end-to-end with "späce & quøte'd.xml" (space + sub-delim +
non-ASCII): store -> list shows decoded -> get by decoded path returns
content -> delete.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The existing db.cy.js tests use ASCII names only, so they pass but never
exercise the encode-on-input / decode-on-output boundary in db.xqm. Add an
awkward-names block: store, read, list, and remove resources named
"café déjà.xml" (non-ASCII + space) and "o'brien.xml" (sub-delim apostrophe,
which xmldb:store leaves literal) — all addressed by their DECODED name. Assert
the read echoes the decoded path, content is intact, and the listing shows
decoded names (no %XX), confirming the round trip.

Verified on a live instance (existdb-openapi on an ft:fields bed): both names
store and round-trip with decoded path/content and decoded listing names.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
db:to-display decoded names with xmldb:decode-uri, which form-decodes "+"
to a space (the x-www-form-urlencoded convention; eXist-db/exist#1824).
But a "+" in a stored name is always a literal "+" -- spaces are stored as
%20 -- and db:to-stored (fn:iri-to-uri) leaves "+" untouched on the encode
side, so a name like "naïve+test.xml" stored correctly but read back as
"naïve test.xml".

Protect a literal "+" as %2B before xmldb:decode-uri so it decodes back to
"+", restoring symmetry with the encode side. This mirrors what
URIUtils.decodeForURI (the core fix in eXist-db/exist#6451) does, applied
at the API layer so it is correct independent of the core build, and
forward-compatible once #6451 lands. Spaces (%20) are unaffected.

Verified end-to-end against a live instance: "naïve+test.xml" stores as
na%C3%AFve+test.xml on disk, lists and reads back as naïve+test.xml. Adds
the "+" case to the Cypress awkward-name coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>