Skip to content

Commit 9796fb6

Browse files
design: lowering string ++ (and slice) to wasm — proposal for sign-off (#574)
## Design proposal — lowering string `++` (and `slice`) to wasm **Design-only, awaiting your sign-off before implementation.** This is the write-up you asked for on the last remaining string-wall items (concat / `slice`), now grounded in investigation. ### The headline finding: string `++` silently miscompiles to wasm `++` is type-dispatched in typecheck (string vs array), so string `++` typechecks and the interpreter is correct — but **codegen lowers `++` only as list-concat** (4-byte element stride), copying a string's `[len][bytes]` as if they were i32 elements. Measured for `"ab" ++ "cd"` (should be "abcd"): | byte | interp | wasm | |---|---|---| | 0–1 | 97, 98 | 97, 98 (coincidental — "ab" packs into one i32) | | 2 | 99 (`'c'`) | **2** ← the length word of "cd" leaking through | | 3 | 100 (`'d'`) | **0** | A #555-class silent mis-lowering — so this is a **correctness fix**, not a feature. ### Why "add a `string_concat` builtin" is ruled out (by evidence) - `++` is the **canonical** string-concat surface; `string_concat` was **deliberately removed** in #330 (`stdlib/effects.affine:42`). - The stdlib uses string `++` **pervasively** (`encoding`/`json`/`io`/`testing`/…) — rejecting it breaks the stdlib. - It's **tested as `++`** (`test_stdlib_laws.ml` concat laws + a conformance suite). So the only acceptable fix makes the **existing `++` lower correctly**. ### The blocker + recommendation Codegen is type-blind (`ExprBinary` carries no type; AST has no spans/ids; `Opt.fold` rebuilds the AST). The typechecker already decides string-vs-array per `++` node — the work is threading that to codegen. **Recommended (A1):** a typecheck-time elaboration that rewrites string `++` into a dedicated internal `ExprStringConcat` node, which codegen lowers via the established byte-copy idiom (alloc `4 + la + lb`, two `I32Load8U`/`I32Store8` loops — the list-`++` pattern with a 1-byte stride). `++` stays idiomatic, the stdlib keeps compiling, codegen needs no type logic. `slice`-on-string is mostly already covered by the shipped `string_sub` (slice 3); defer its negative-index residual. Full analysis, alternatives (A2 annotation field; B rejected), implementation plan, and risks are in `proposals/DESIGN-string-concat.adoc`. ### Decision requested Sign off to proceed with **A1** as Phase F slice 8, or redirect. I'll implement only after your go-ahead. https://claude.ai/code/session_01WoKhFQePiRsAj7aqnxbG8s --- _Generated by [Claude Code](https://claude.ai/code/session_01WoKhFQePiRsAj7aqnxbG8s)_ Co-authored-by: Claude <noreply@anthropic.com>
1 parent b38654d commit 9796fb6

1 file changed

Lines changed: 175 additions & 0 deletions

File tree

Lines changed: 175 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,175 @@
1+
// SPDX-License-Identifier: MPL-2.0
2+
// SPDX-FileCopyrightText: 2025-2026 hyperpolymath
3+
= Design: lowering string `++` (and `slice`) to wasm
4+
:toc: macro
5+
6+
*Status:* PROPOSAL — awaiting owner sign-off before implementation.
7+
*Author:* string-wall slice-8 design (Phase F, `proposals/MIGRATION-PLAN.adoc`).
8+
*Scope:* the last remaining string-wall items — string concatenation (`++`)
9+
and polymorphic `slice` — which slices 1-7 deferred as "type-directed".
10+
11+
toc::[]
12+
13+
== TL;DR
14+
15+
* String `++` *silently miscompiles* to wasm today (confirmed below). It
16+
typechecks, the interpreter is correct, but the wasm bytes are wrong.
17+
* The fix is *not* a new builtin: `++` is the canonical, pervasively-used
18+
string-concat surface, and a `string_concat` builtin was *deliberately
19+
removed* (#330). We must make `++` *lower correctly*.
20+
* The blocker is that codegen is *type-blind* — it cannot tell string `++`
21+
from list `++`. The typechecker already makes that distinction; the work is
22+
*threading it to codegen*.
23+
* *Recommendation:* a typecheck-time elaboration that rewrites string-`++`
24+
into a dedicated internal AST node which codegen lowers via the established
25+
byte-copy idiom. `slice`-on-string is mostly already covered by the shipped
26+
`string_sub`.
27+
28+
== The bug (confirmed)
29+
30+
`++` is polymorphic over `String` and `[T]` (`typecheck.ml:1056-1068`): it
31+
synthesises the lhs type and dispatches `TCon "String"` -> string concat,
32+
`TApp("Array",…)` -> array concat. So *string `++` typechecks*, and the
33+
interpreter handles it (`interp.ml` `binop_string`).
34+
35+
But codegen's `ExprBinary(_, OpConcat, _)` handler lowers *only* the list
36+
case — `[len][elem i @ +4 + i*4]`, a 4-byte element stride. Applied to a
37+
string (`[len][utf8 bytes]`), it copies the source's length word and bytes as
38+
if they were i32 elements. Measured for `"ab" ++ "cd"` (should be "abcd"):
39+
40+
[cols="1,1,1",options="header"]
41+
|===
42+
| byte index | interp | wasm
43+
| 0 | 97 ('a') | 97 (coincidental — "ab" packs into one i32)
44+
| 1 | 98 ('b') | 98 (coincidental)
45+
| 2 | 99 ('c') | *2* ← the length word of "cd" leaking through
46+
| 3 | 100 ('d') | *0*
47+
|===
48+
49+
This is a *#555-class silent mis-lowering*: no error, wrong result. It is the
50+
load-bearing reason this work is a correctness fix, not a feature.
51+
52+
== Why "add a `string_concat` builtin / ban string `++`" is rejected
53+
54+
The obvious alternative — add a name-dispatched `string_concat(a, b)` builtin
55+
(like slices 1-7) and stop using `++` for strings — is *ruled out by the
56+
codebase*:
57+
58+
* *`++` is the canonical surface.* `string_concat` was *deliberately removed*
59+
in STDLIB-04c / #330; `stdlib/effects.affine:42` records "the canonical
60+
surface" is `++`. Re-adding the builtin reverses a deliberate decision.
61+
* *The stdlib uses string `++` pervasively.* `encoding.affine` (`"0" ++ s`,
62+
`out ++ "=="`), `json.affine` (`"\\u00" ++ hex_digit(hi) ++ …`),
63+
`io.affine` (`"DEBUG: " ++ show(value)`), `testing.affine`
64+
(`"Assertion failed: " ++ message ++ …`), `AlibSchema.affine`, … Rejecting
65+
string `++` in the typechecker would break all of these.
66+
* *It is tested as `++`.* `test/test_stdlib_laws.ml` has `string_concat`
67+
associativity / left-unit / right-unit laws and
68+
`tests/conformance/string/concat.affine` — all exercising the `++` string
69+
semantics.
70+
71+
Conclusion: the only acceptable fix makes the *existing* `++` lower correctly.
72+
73+
== The real blocker: codegen is type-blind
74+
75+
`ExprBinary of expr * binary_op * expr` (`ast.ml`) carries *no type*. Codegen
76+
has no per-expression type environment. And two facts rule out the easy
77+
channels:
78+
79+
* The `expr` AST nodes carry *no span* and *no id*, so a side table keyed by
80+
source location is not available.
81+
* `Opt.fold_constants_program` rebuilds the AST before codegen, so a side
82+
table keyed by *physical node identity* (typecheck's nodes vs codegen's
83+
post-fold nodes) would not survive.
84+
85+
The typechecker, by contrast, *already* computes the string-vs-array decision
86+
at every `++` node. The task is to carry that decision across to codegen.
87+
88+
== Options for the typecheck -> codegen channel
89+
90+
[cols="1,3,2,2",options="header"]
91+
|===
92+
| # | Mechanism | Pros | Cons
93+
| A1
94+
| *Elaboration to a dedicated internal node.* Typecheck (which knows the type)
95+
rewrites a string `++` from `ExprBinary(a, OpConcat, b)` into a new internal
96+
`ExprStringConcat(a, b)` AST node, threaded to codegen. Codegen lowers
97+
`ExprStringConcat` via the byte-copy idiom; `OpConcat` stays list-only.
98+
| `++` stays idiomatic and lowers correctly; no source/stdlib breakage; the
99+
new node is unambiguous so codegen needs no type logic; const-fold passes it
100+
through; reusable for other type-directed rewrites.
101+
| Adds one AST constructor (touch every `match` on `expr`: resolve / interp /
102+
opt / faces / codegen — mostly a pass-through arm); the compile pipeline must
103+
thread the *elaborated* program to codegen.
104+
| A2
105+
| *Annotation field on `ExprBinary`.* Add `concat_kind : [`String|`Array|`Unknown]`;
106+
typecheck sets it; codegen reads it.
107+
| No new constructor; smaller conceptually.
108+
| Changes `ExprBinary`'s shape (every construction + match site updates);
109+
const-fold must preserve the field; "Unknown" still needs a fallback.
110+
| B
111+
| *`string_concat` builtin + reject string `++`.*
112+
| Smallest codegen change (name-dispatched, like slices 1-7).
113+
| *Rejected* — breaks the stdlib and reverses #330 (see above).
114+
|===
115+
116+
*Recommended: A1 (elaboration to `ExprStringConcat`).* It is the cleanest
117+
*sound* option: codegen gets an unambiguous node and needs no type access; the
118+
idiomatic `++` surface is unchanged; the stdlib keeps compiling. The cost is
119+
mechanical (a new constructor that most passes treat as a pass-through, plus
120+
threading the elaborated program).
121+
122+
The byte-concat *lowering* itself is already proven: it is the list-`++`
123+
allocate-then-copy idiom (`codegen.ml`) with a 1-byte stride instead of 4 —
124+
allocate `4 + la + lb`, store the length, two copy loops
125+
(`I32Load8U`/`I32Store8`), exactly as slices 3/5 copy bytes.
126+
127+
== `slice`
128+
129+
`slice(coll, lo, hi)` is a *polymorphic scheme* (`typecheck.ml:1497`), so it
130+
faces the same type-blindness. But the common string case — non-negative
131+
substring extraction — is *already lowered* by `string_sub` (slice 3,
132+
merged). The only residual is `slice`-on-string with JS-style negative-index
133+
normalisation, which is niche. Proposal: ride the same elaboration channel
134+
(rewrite `slice` on a string into an internal string-slice node), or defer it
135+
as low-value once `++` lands. Recommend deferring until a concrete consumer
136+
needs negative-index string slicing.
137+
138+
== Implementation plan (on sign-off)
139+
140+
. Add `ExprStringConcat of expr * expr` to `ast.ml`; add pass-through arms in
141+
resolve, interp (delegate to the existing string `++` semantics), opt
142+
(fold-through), and the non-wasm codegens (or a clean "unsupported" where
143+
appropriate).
144+
. In typecheck, at the `TCon "String"` `++` branch, emit `ExprStringConcat`
145+
into the elaborated program (confirm the compile pipeline threads
146+
typecheck's output to codegen; if not, add a thin elaboration pass that
147+
consumes the type decision).
148+
. Lower `ExprStringConcat` in `codegen.ml` via the byte-copy idiom (alloc
149+
`4 + la + lb`, two `I32Load8U`/`I32Store8` loops) — mirroring the list-`++`
150+
handler.
151+
. Tests: `test/test_e2e.ml` interp group (already correct — pins the oracle) +
152+
a `tests/codegen/string_concat.{affine,…mjs}` executable parity check across
153+
empty / single / multi-word / chained `a ++ b ++ c`, byte-exact via the
154+
slice-1 reader. The very case that miscompiles today (`"ab" ++ "cd"` byte 2)
155+
becomes a regression test.
156+
. Evidence doc + ledger update (Phase F slice 8).
157+
158+
== Risks
159+
160+
* *AST-constructor blast radius.* A new `expr` constructor touches every
161+
exhaustive `match` on `expr`. Mitigation: most arms are a one-line
162+
pass-through (treat like `ExprApp`); the OCaml compiler's exhaustiveness
163+
warnings enumerate the sites.
164+
* *Pipeline threading.* Need to confirm codegen consumes the typecheck-
165+
elaborated AST (not the raw parse). If the pipeline doesn't already thread
166+
it, a small dedicated elaboration pass (post-typecheck, pre-codegen) is the
167+
fallback.
168+
* *Const-fold interaction.* `Opt.fold_constants` must pass `ExprStringConcat`
169+
through (and may even constant-fold `"a" ++ "b"` to `"ab"` — a nice-to-have).
170+
171+
== Decision requested
172+
173+
Sign off to proceed with *A1* (make string `++` lower correctly via the
174+
`ExprStringConcat` elaboration), defer the `slice` negative-index residual, and
175+
land it as Phase F slice 8 — or redirect.

0 commit comments

Comments
 (0)