[bugfix] Speak decoded UTF-8 resource names on the wire #54
Open
joewiz wants to merge 3 commits into
The /api/db endpoints spoke eXist's stored, percent-encoded form on the
wire: listings returned "caf%C3%A9.xml", so the existdb-oxygen-plugin and
other clients displayed encoded names and could not tell which form was
canonical. eXide and TEI Publisher already decode for display and encode
before storage; this brings existdb-openapi in line.
Every handler now encodes the incoming wire path once (db:to-stored)
before any doc()/collection()/xmldb:* call, and decodes every name and
path leaving the API (db:to-display).
fn:iri-to-uri is used inbound (not xmldb:encode) because it matches
eXist's own storage escaping: it percent-encodes spaces and non-ASCII but
leaves sub-delims (' & + @) and existing %XX untouched. Verified against a
live instance: xmldb:store("café.xml") -> caf%C3%A9.xml and
store("quote'name.xml") -> quote'name.xml (literal), while xmldb:store
throws outright on a raw space. iri-to-uri reproduces store's output where
store succeeds, additionally encodes the space store rejects, and is
idempotent on already-encoded paths -- so older encoded-path clients keep
working. xmldb:encode would instead percent-encode the sub-delims too
(full RFC 3986 encoding) and so fail to resolve names xmldb:store left
literal.
Proven end-to-end with "späce & quøte'd.xml" (space + sub-delim +
non-ASCII): store -> list shows decoded -> get by decoded path returns
content -> delete.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
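
The escaping rule this commit relies on can be sketched in JavaScript, the language of the Cypress tests. This is only a hypothetical model for illustration: the PR itself calls fn:iri-to-uri in XQuery, and toStoredModel and ESCAPE are invented names. It assumes the rule is "percent-encode everything except ASCII letters and digits, the RFC 3986 reserved and unreserved punctuation, and %", which is how the XPath functions specification describes fn:iri-to-uri.

```javascript
// Hypothetical model of fn:iri-to-uri's escaping rule (illustration only).
// Matches every character that is NOT in the kept set: ASCII letters, digits,
// RFC 3986 reserved/unreserved punctuation, and "%".
const ESCAPE = /[^A-Za-z0-9_.!~*'();\/?:@&=+$,\[\]#%-]/gu;

const utf8 = new TextEncoder();

// Replace each matched character with the %XX form of its UTF-8 bytes.
function toStoredModel(path) {
  return path.replace(ESCAPE, (ch) =>
    Array.from(utf8.encode(ch), (b) =>
      "%" + b.toString(16).toUpperCase().padStart(2, "0")
    ).join("")
  );
}

console.log(toStoredModel("café.xml"));       // caf%C3%A9.xml
console.log(toStoredModel("quote'name.xml")); // quote'name.xml (sub-delim kept)
console.log(toStoredModel("a b.xml"));        // a%20b.xml
console.log(toStoredModel("caf%C3%A9.xml"));  // caf%C3%A9.xml (idempotent)
```

Because % is in the kept set, an already-encoded path passes through unchanged, which is the property that keeps older encoded-path clients working.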
The existing db.cy.js tests use ASCII names only, so they pass but never exercise the encode-on-input / decode-on-output boundary in db.xqm.

Add an awkward-names block: store, read, list, and remove resources named "café déjà.xml" (non-ASCII + space) and "o'brien.xml" (sub-delim apostrophe, which xmldb:store leaves literal), all addressed by their DECODED name. Assert the read echoes the decoded path, content is intact, and the listing shows decoded names (no %XX), confirming the round trip.

Verified on a live instance (existdb-openapi on an ft:fields bed): both names store and round-trip with decoded path/content and decoded listing names.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
db:to-display decoded names with xmldb:decode-uri, which form-decodes "+" to a space (the x-www-form-urlencoded convention; eXist-db/exist#1824). But a "+" in a stored name is always a literal "+" -- spaces are stored as %20 -- and db:to-stored (fn:iri-to-uri) leaves "+" untouched on the encode side, so a name like "naïve+test.xml" stored correctly but read back as "naïve test.xml".

Protect a literal "+" as %2B before xmldb:decode-uri so it decodes back to "+", restoring symmetry with the encode side. This mirrors what URIUtils.decodeForURI (the core fix in eXist-db/exist#6451) does, applied at the API layer so it is correct independent of the core build, and forward-compatible once #6451 lands. Spaces (%20) are unaffected.

Verified end-to-end against a live instance: "naïve+test.xml" stores as na%C3%AFve+test.xml on disk, lists and reads back as naïve+test.xml. Adds the "+" case to the Cypress awkward-name coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
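
The "+" protection described in this commit can be illustrated with a small JavaScript sketch. It is a model, not the XQuery in db.xqm: formDecode stands in for the form-decoding behaviour the commit attributes to xmldb:decode-uri, and both function names are hypothetical.

```javascript
// Hypothetical stand-in for form-style decoding: "+" becomes a space,
// then %XX escapes are decoded as UTF-8.
// Note: decodeURIComponent throws on a malformed escape such as a bare "%".
function formDecode(s) {
  return decodeURIComponent(s.replace(/\+/g, " "));
}

// Protect each literal "+" as %2B first, so form-decoding turns it back
// into "+" instead of a space.
function toDisplayModel(stored) {
  return formDecode(stored.replace(/\+/g, "%2B"));
}

const stored = "na%C3%AFve+test.xml";
console.log(formDecode(stored));          // naïve test.xml  (the bug)
console.log(toDisplayModel(stored));      // naïve+test.xml  (the fix)
console.log(toDisplayModel("a%20b.xml")); // a b.xml         (%20 unaffected)
```

The protection step is safe here only because, as the commit notes, a stored name never uses "+" to mean a space.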
[This PR was co-authored with Claude Code. -Joe]
Summary
The /api/db endpoints spoke eXist's stored, percent-encoded form on the wire. This PR makes the API speak decoded UTF-8 in both directions: a client sends and receives café-ünïcode.xml, never caf%C3%A9-%C3%BCn%C3%AFcode.xml.

eXide and TEI Publisher already work this way (decode for display, encode before storage). This brings existdb-openapi in line, so a name from a listing can be echoed straight back as an operation path and always resolve to the same resource.
What changed
modules/db.xqm: two boundary helpers, applied uniformly across every handler:

- db:to-stored: an incoming wire path → the stored form, applied once at the top of each handler before any doc() / collection() / xmldb:*-available / xmldb:* call. It uses fn:iri-to-uri.
- db:to-display: a stored path/name → the decoded wire form, applied to every name and path leaving the API.

Why fn:iri-to-uri and not xmldb:encode

iri-to-uri is the inbound encoder because it reproduces exactly what eXist's storage layer writes. Verified against a live instance:

| Name | xmldb:store stores | fn:iri-to-uri | xmldb:encode |
|---|---|---|---|
| café.xml | caf%C3%A9.xml | caf%C3%A9.xml ✅ | caf%C3%A9.xml |
| quote'name.xml | quote'name.xml | quote'name.xml ✅ | quote%27name.xml ❌ |
| a&b.xml | a&b.xml | a&b.xml ✅ | a%26b.xml ❌ |
| a b.xml | FORG0001 | a%20b.xml ✅ | a%20b.xml |

iri-to-uri percent-encodes spaces and non-ASCII but leaves sub-delims (' & + @) and existing %XX untouched. So it (a) matches the stored form xmldb:store produces, including the literal-sub-delim names xmldb:store leaves un-encoded, which xmldb:encode would instead encode as %27 / %26 and then fail to resolve; (b) additionally encodes the space that xmldb:store rejects outright; and (c) is idempotent on already-encoded input, so older clients still sending caf%C3%A9.xml keep working with no double-encoding.

On the decode side,
db:to-display protects a literal + (as %2B) before calling xmldb:decode-uri. xmldb:decode-uri otherwise form-decodes + to a space (the x-www-form-urlencoded convention; eXist-db/exist#1824), but a + in a stored name is always a literal + (spaces are stored as %20), so without this protection naïve+test.xml would read back as naïve test.xml. This mirrors URIUtils.decodeForURI (the core fix in eXist-db/exist#6451) at the API layer, so it is correct independent of the core build and forward-compatible once #6451 lands.

URLs in responses (e.g. runPath) stay encoded, since they are clickable links, not operation keys.

Test plan
- Cypress (db.cy.js): store / read / list / remove a café déjà.xml, an o'brien.xml, and a naïve+test.xml by their decoded names, asserting listing names are never percent-encoded. (The existing db tests use ASCII names and don't exercise this boundary.)
- [...] &, parentheses, and apostrophe, with zero plugin-side encode/decode logic.
- [...] iri-to-uri form of the user-typed name, all resolve, and the full decoded-path round-trip returns content (including Cyrillic).

Known boundary (out of scope, documented)
A literal % can't be disambiguated from a percent-escape in Phase 1 (that needs a bijective encoding): iri-to-uri leaves a literal % untouched. This is the one remaining edge of the broader resource-naming contract work; this PR is the existdb-openapi piece of it.

(The literal-+ case is handled here directly; see the decode note above.)
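
To make the % boundary concrete, here is a minimal JavaScript sketch showing two different wire names collapsing onto one stored name. It uses a hypothetical model of the fn:iri-to-uri rule (escape everything except ASCII letters and digits, RFC 3986 punctuation, and %), not the XQuery in db.xqm; toStoredModel is an invented name.

```javascript
// Hypothetical model of fn:iri-to-uri's escaping rule (illustration only):
// "%" is in the kept set, so any existing %XX sequence passes through as-is.
const utf8 = new TextEncoder();

function toStoredModel(path) {
  return path.replace(/[^A-Za-z0-9_.!~*'();\/?:@&=+$,\[\]#%-]/gu, (ch) =>
    Array.from(utf8.encode(ch), (b) =>
      "%" + b.toString(16).toUpperCase().padStart(2, "0")
    ).join("")
  );
}

const nameWithSpace = "a b.xml";     // the user means a space
const nameWithPercent = "a%20b.xml"; // the user means the literal text "%20"

// Both inputs map to the same stored form, so the mapping is not injective
// and the second name cannot be addressed on its own.
console.log(toStoredModel(nameWithSpace));   // a%20b.xml
console.log(toStoredModel(nameWithPercent)); // a%20b.xml
```

This collision is the flip side of the idempotence that keeps older clients working, which is why resolving it needs a bijective encoding rather than a tweak to this rule.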