|
| 1 | +// SPDX-License-Identifier: MPL-2.0 |
| 2 | +// SPDX-FileCopyrightText: 2025-2026 hyperpolymath |
| 3 | += Design: lowering string `++` (and `slice`) to wasm |
| 4 | +:toc: macro |
| 5 | + |
| 6 | +*Status:* PROPOSAL — awaiting owner sign-off before implementation. |
| 7 | +*Author:* string-wall slice-8 design (Phase F, `proposals/MIGRATION-PLAN.adoc`). |
| 8 | +*Scope:* the last remaining string-wall items — string concatenation (`++`) |
| 9 | +and polymorphic `slice` — which slices 1-7 deferred as "type-directed". |
| 10 | + |
| 11 | +toc::[] |
| 12 | + |
| 13 | +== TL;DR |
| 14 | + |
| 15 | +* String `++` *silently miscompiles* to wasm today (confirmed below). It |
| 16 | + typechecks, the interpreter is correct, but the wasm bytes are wrong. |
| 17 | +* The fix is *not* a new builtin: `++` is the canonical, pervasively-used |
| 18 | + string-concat surface, and a `string_concat` builtin was *deliberately |
| 19 | + removed* (#330). We must make `++` *lower correctly*. |
| 20 | +* The blocker is that codegen is *type-blind* — it cannot tell string `++` |
| 21 | + from list `++`. The typechecker already makes that distinction; the work is |
| 22 | + *threading it to codegen*. |
| 23 | +* *Recommendation:* a typecheck-time elaboration that rewrites string-`++` |
| 24 | + into a dedicated internal AST node which codegen lowers via the established |
| 25 | + byte-copy idiom. `slice`-on-string is mostly already covered by the shipped |
| 26 | + `string_sub`. |
| 27 | + |
| 28 | +== The bug (confirmed) |
| 29 | + |
| 30 | +`++` is polymorphic over `String` and `[T]` (`typecheck.ml:1056-1068`): it |
| 31 | +synthesises the lhs type and dispatches `TCon "String"` -> string concat, |
| 32 | +`TApp("Array",…)` -> array concat. So *string `++` typechecks*, and the |
| 33 | +interpreter handles it (`interp.ml` `binop_string`). |
| 34 | + |
| 35 | +But codegen's `ExprBinary(_, OpConcat, _)` handler lowers *only* the list |
| 36 | +case — `[len][elem i @ +4 + i*4]`, a 4-byte element stride. Applied to a |
| 37 | +string (`[len][utf8 bytes]`), it copies the source's length word and bytes as |
| 38 | +if they were i32 elements. Measured for `"ab" ++ "cd"` (should be "abcd"): |
| 39 | + |
| 40 | +[cols="1,1,1",options="header"] |
| 41 | +|=== |
| 42 | +| byte index | interp | wasm |
| 43 | +| 0 | 97 ('a') | 97 (coincidental — "ab" packs into one i32) |
| 44 | +| 1 | 98 ('b') | 98 (coincidental) |
| 45 | +| 2 | 99 ('c') | *2* ← the length word of "cd" leaking through |
| 46 | +| 3 | 100 ('d') | *0* |
| 47 | +|=== |
| 48 | + |
| 49 | +This is a *#555-class silent mis-lowering*: no error, wrong result. It is the |
| 50 | +load-bearing reason this work is a correctness fix, not a feature. |
| 51 | + |
| 52 | +== Why "add a `string_concat` builtin / ban string `++`" is rejected |
| 53 | + |
| 54 | +The obvious alternative — add a name-dispatched `string_concat(a, b)` builtin |
| 55 | +(like slices 1-7) and stop using `++` for strings — is *ruled out by the |
| 56 | +codebase*: |
| 57 | + |
| 58 | +* *`++` is the canonical surface.* `string_concat` was *deliberately removed* |
| 59 | + in STDLIB-04c / #330; `stdlib/effects.affine:42` records "the canonical |
| 60 | + surface" is `++`. Re-adding the builtin reverses a deliberate decision. |
| 61 | +* *The stdlib uses string `++` pervasively.* `encoding.affine` (`"0" ++ s`, |
| 62 | + `out ++ "=="`), `json.affine` (`"\\u00" ++ hex_digit(hi) ++ …`), |
| 63 | + `io.affine` (`"DEBUG: " ++ show(value)`), `testing.affine` |
| 64 | + (`"Assertion failed: " ++ message ++ …`), `AlibSchema.affine`, … Rejecting |
| 65 | + string `++` in the typechecker would break all of these. |
| 66 | +* *It is tested as `++`.* `test/test_stdlib_laws.ml` has `string_concat` |
| 67 | + associativity / left-unit / right-unit laws and |
| 68 | + `tests/conformance/string/concat.affine` — all exercising the `++` string |
| 69 | + semantics. |
| 70 | + |
| 71 | +Conclusion: the only acceptable fix makes the *existing* `++` lower correctly. |
| 72 | + |
| 73 | +== The real blocker: codegen is type-blind |
| 74 | + |
| 75 | +`ExprBinary of expr * binary_op * expr` (`ast.ml`) carries *no type*. Codegen |
| 76 | +has no per-expression type environment. And two facts rule out the easy |
| 77 | +channels: |
| 78 | + |
| 79 | +* The `expr` AST nodes carry *no span* and *no id*, so a side table keyed by |
| 80 | + source location is not available. |
| 81 | +* `Opt.fold_constants_program` rebuilds the AST before codegen, so a side |
| 82 | + table keyed by *physical node identity* (typecheck's nodes vs codegen's |
| 83 | + post-fold nodes) would not survive. |
| 84 | + |
| 85 | +The typechecker, by contrast, *already* computes the string-vs-array decision |
| 86 | +at every `++` node. The task is to carry that decision across to codegen. |
| 87 | + |
| 88 | +== Options for the typecheck -> codegen channel |
| 89 | + |
| 90 | +[cols="1,3,2,2",options="header"] |
| 91 | +|=== |
| 92 | +| # | Mechanism | Pros | Cons |
| 93 | +| A1 |
| 94 | +| *Elaboration to a dedicated internal node.* Typecheck (which knows the type) |
| 95 | + rewrites a string `++` from `ExprBinary(a, OpConcat, b)` into a new internal |
| 96 | + `ExprStringConcat(a, b)` AST node, threaded to codegen. Codegen lowers |
| 97 | + `ExprStringConcat` via the byte-copy idiom; `OpConcat` stays list-only. |
| 98 | +| `++` stays idiomatic and lowers correctly; no source/stdlib breakage; the |
| 99 | + new node is unambiguous so codegen needs no type logic; const-fold passes it |
| 100 | + through; reusable for other type-directed rewrites. |
| 101 | +| Adds one AST constructor (touch every `match` on `expr`: resolve / interp / |
| 102 | + opt / faces / codegen — mostly a pass-through arm); the compile pipeline must |
| 103 | + thread the *elaborated* program to codegen. |
| 104 | +| A2 |
| 105 | +| *Annotation field on `ExprBinary`.* Add `concat_kind : [`String|`Array|`Unknown]`; |
| 106 | + typecheck sets it; codegen reads it. |
| 107 | +| No new constructor; smaller conceptually. |
| 108 | +| Changes `ExprBinary`'s shape (every construction + match site updates); |
| 109 | + const-fold must preserve the field; "Unknown" still needs a fallback. |
| 110 | +| B |
| 111 | +| *`string_concat` builtin + reject string `++`.* |
| 112 | +| Smallest codegen change (name-dispatched, like slices 1-7). |
| 113 | +| *Rejected* — breaks the stdlib and reverses #330 (see above). |
| 114 | +|=== |
| 115 | + |
| 116 | +*Recommended: A1 (elaboration to `ExprStringConcat`).* It is the cleanest |
| 117 | +*sound* option: codegen gets an unambiguous node and needs no type access; the |
| 118 | +idiomatic `++` surface is unchanged; the stdlib keeps compiling. The cost is |
| 119 | +mechanical (a new constructor that most passes treat as a pass-through, plus |
| 120 | +threading the elaborated program). |
| 121 | + |
| 122 | +The byte-concat *lowering* itself is already proven: it is the list-`++` |
| 123 | +allocate-then-copy idiom (`codegen.ml`) with a 1-byte stride instead of 4 — |
| 124 | +allocate `4 + la + lb`, store the length, two copy loops |
| 125 | +(`I32Load8U`/`I32Store8`), exactly as slices 3/5 copy bytes. |
| 126 | + |
| 127 | +== `slice` |
| 128 | + |
| 129 | +`slice(coll, lo, hi)` is a *polymorphic scheme* (`typecheck.ml:1497`), so it |
| 130 | +faces the same type-blindness. But the common string case — non-negative |
| 131 | +substring extraction — is *already lowered* by `string_sub` (slice 3, |
| 132 | +merged). The only residual is `slice`-on-string with JS-style negative-index |
| 133 | +normalisation, which is niche. Proposal: ride the same elaboration channel |
| 134 | +(rewrite `slice` on a string into an internal string-slice node), or defer it |
| 135 | +as low-value once `++` lands. Recommend deferring until a concrete consumer |
| 136 | +needs negative-index string slicing. |
| 137 | + |
| 138 | +== Implementation plan (on sign-off) |
| 139 | + |
| 140 | +. Add `ExprStringConcat of expr * expr` to `ast.ml`; add pass-through arms in |
| 141 | + resolve, interp (delegate to the existing string `++` semantics), opt |
| 142 | + (fold-through), and the non-wasm codegens (or a clean "unsupported" where |
| 143 | + appropriate). |
| 144 | +. In typecheck, at the `TCon "String"` `++` branch, emit `ExprStringConcat` |
| 145 | + into the elaborated program (confirm the compile pipeline threads |
| 146 | + typecheck's output to codegen; if not, add a thin elaboration pass that |
| 147 | + consumes the type decision). |
| 148 | +. Lower `ExprStringConcat` in `codegen.ml` via the byte-copy idiom (alloc |
| 149 | + `4 + la + lb`, two `I32Load8U`/`I32Store8` loops) — mirroring the list-`++` |
| 150 | + handler. |
| 151 | +. Tests: `test/test_e2e.ml` interp group (already correct — pins the oracle) + |
| 152 | + a `tests/codegen/string_concat.{affine,…mjs}` executable parity check across |
| 153 | + empty / single / multi-word / chained `a ++ b ++ c`, byte-exact via the |
| 154 | + slice-1 reader. The very case that miscompiles today (`"ab" ++ "cd"` byte 2) |
| 155 | + becomes a regression test. |
| 156 | +. Evidence doc + ledger update (Phase F slice 8). |
| 157 | + |
| 158 | +== Risks |
| 159 | + |
| 160 | +* *AST-constructor blast radius.* A new `expr` constructor touches every |
| 161 | + exhaustive `match` on `expr`. Mitigation: most arms are a one-line |
| 162 | + pass-through (treat like `ExprApp`); the OCaml compiler's exhaustiveness |
| 163 | + warnings enumerate the sites. |
| 164 | +* *Pipeline threading.* Need to confirm codegen consumes the typecheck- |
| 165 | + elaborated AST (not the raw parse). If the pipeline doesn't already thread |
| 166 | + it, a small dedicated elaboration pass (post-typecheck, pre-codegen) is the |
| 167 | + fallback. |
| 168 | +* *Const-fold interaction.* `Opt.fold_constants` must pass `ExprStringConcat` |
| 169 | + through (and may even constant-fold `"a" ++ "b"` to `"ab"` — a nice-to-have). |
| 170 | + |
| 171 | +== Decision requested |
| 172 | + |
| 173 | +Sign off to proceed with *A1* (make string `++` lower correctly via the |
| 174 | +`ExprStringConcat` elaboration), defer the `slice` negative-index residual, and |
| 175 | +land it as Phase F slice 8 — or redirect. |
0 commit comments