feat(function-mapper): expand Spark dialect coverage; retag LDBC sweep xfails#347
Conversation
…p xfails Closes the most-common Category A leaks surfaced by the LDBC sweep (PR #346): `anyLast`, `countIf`, temporal extractors (toYear/Month/...), `has(arr, elem)`, `toString(...)`, and `tuple(...)` list-construction were all emitting CH-native names into Spark SQL. Changes - Registry: add `anyLast → any_value`, `countIf → count_if`. Mark temporal extractors (toYear, toMonth, toDayOfMonth, toHour, toMinute, toSecond, toDayOfWeek, toDayOfYear, toQuarter, toISOWeek) with `databricks_name` so they resolve to Spark `year/month/...`. - Make `wrap_epoch_millis_arg` dialect-aware: emit `fromUnixTimestamp64Milli` on CH, `timestamp_millis` on Spark. - Add `tuple_constructor()` to FunctionMapper trait — `tuple` on CH, `struct` on Spark — and route `LogicalExpr::List` through it. - Route `has(arr, elem)` through `mapper.array_contains()` — `has(...)` on CH, `array_contains(...)` on Spark. - Route `toString(...)` through `mapper.cast_string()` — `toString(...)` on CH, `cast(... as string)` on Spark — at the JSON-builder, VLP zero-hop, MapLiteral, and composite-ID emission sites. - Route countIf in BFS shortestPath CTE through `mapper.count_if()`. LDBC sweep result (15 passed / 21 xfailed / 5 skipped, unchanged) Coverage closure pushed the failing queries past the FunctionMapper layer onto the next gap (CTE alias resolution — `with_*_cte_N.<col>` vs `<scope>.<col>` — i.e. existing Category C). The xfail markers are retained but their reason strings now reflect the actual current failure mode. Residual A leaks: `tuple()` from `NodeId::sql_tuple` (bi-6), `toUnixTimestamp64Milli` in duration arithmetic (bi-17), `caseWithExpression` (bi-13), `formatRowNoNewline` (complex-12, short-2), arity mismatch in 2-arg `count_if` (complex-5), and `minIf` (complex-13, complex-14). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR expands Databricks/Spark SQL function mapping coverage in the ClickGraph SQL generation pipeline, especially for LDBC sweep failures where ClickHouse-native functions were leaking into Spark SQL.
Changes:
- Adds dialect-aware tuple/struct, string-cast, array-membership, temporal-extractor,
anyLast, andcountIfmappings. - Routes several hard-coded
toString,has, and tuple/list emission paths throughFunctionMapper. - Retags Spark LDBC sweep expected failures to reflect the new residual failure categories.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
tests/spark_smoke/test_ldbc_sweep.py |
Updates strict xfail reasons for the LDBC Spark sweep. |
src/sql_generator/function_mapper/mod.rs |
Extends mapper trait with tuple/struct constructor abstraction. |
src/sql_generator/function_mapper/databricks.rs |
Adds Spark struct tuple-constructor mapping and test assertion. |
src/sql_generator/function_mapper/clickhouse.rs |
Adds ClickHouse tuple constructor mapping. |
src/sql_generator/emitters/clickhouse/variable_length_cte.rs |
Routes BFS countIf through the active function mapper. |
src/sql_generator/emitters/clickhouse/to_sql.rs |
Routes list tuple construction, array membership, and map value casts through the mapper. |
src/sql_generator/emitters/clickhouse/to_sql_query.rs |
Routes additional render-expression membership and map-cast sites through the mapper. |
src/sql_generator/emitters/clickhouse/multi_type_vlp_joins.rs |
Replaces hard-coded toString calls in multi-type VLP paths with mapper casts. |
src/sql_generator/emitters/clickhouse/json_builder.rs |
Replaces multi-type union ID toString calls with mapper casts. |
src/sql_generator/emitters/clickhouse/function_registry.rs |
Adds Databricks names/transforms for temporal extractors, toUnixTimestampMillis, anyLast, and countIf. |
src/render_plan/cte_extraction.rs |
Routes extracted CTE array membership through the mapper. |
src/graph_catalog/config.rs |
Replaces composite-ID string casts with mapper-provided string casts. |
| // dayOfWeek(datetime) -> CH: toDayOfWeek, Spark: dayofweek | ||
| m.insert("dayofweek", FunctionMapping { | ||
| neo4j_name: "dayOfWeek", | ||
| clickhouse_name: "toDayOfWeek", | ||
| databricks_name: None, | ||
| databricks_name: Some("dayofweek"), | ||
| arg_transform: Some(wrap_epoch_millis_arg), |
There was a problem hiding this comment.
Good catch — reverted dayofweek's databricks_name to None in 13d36de. CH toDayOfWeek is ISO (1=Monday) and Spark dayofweek is 1=Sunday, so the symbol swap was silently wrong. Until a structural rewrite (e.g. weekday(x) + 1) lands, falling through to toDayOfWeek makes the gap surface as UNRESOLVED_ROUTINE.
| {target} AS end_id,\n \ | ||
| CASE WHEN countIf(node_id = {target}) > 0\n \ | ||
| CASE WHEN {count_if}(node_id = {target}) > 0\n \ | ||
| THEN minIf({cast_u16}(hop), node_id = {target})\n \ |
There was a problem hiding this comment.
Agreed — reverted the countIf rewrite at this site in 13d36de. The accompanying minIf has no Spark equivalent, so the partial mapping produced inconsistent dialect-mixed SQL. The pair needs to be rewritten together as min(CASE WHEN cond THEN val END) — tracked in the PR description's residual-A-leaks list (complex-13/14).
…OfWeek/minIf Address Copilot review on PR #347: 1. `dayOfWeek` databricks_name reverted to None. CH `toDayOfWeek` returns 1=Monday..7=Sunday (ISO); Spark `dayofweek` returns 1=Sunday..7=Saturday. Direct symbol swap silently shifted results by one day. Needs structural rewrite (`weekday(x) + 1`) — until then, fall through to `toDayOfWeek` so the gap surfaces as UNRESOLVED_ROUTINE. 2. Revert `count_if` rewrite in BFS shortestPath result branch. The same SELECT also emits `minIf(...)`, which Spark has no symbol for. Half- rewriting only `countIf` produced inconsistent dialect-mixed SQL. Pattern needs to be rewritten as a pair to `min(CASE WHEN cond THEN val END)` — deferred to a follow-up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
anyLast,countIf, temporal extractors,has(arr, elem),toString(...), andtuple(...)no longer emit CH-native names into Spark SQL.tuple_constructor()to theFunctionMappertrait (CHtuple/ Sparkstruct).wrap_epoch_millis_argdialect-aware sodatetime({epochMillis: ...})and friends emittimestamp_millis()on Spark,fromUnixTimestamp64Milli()on CH.toString(...)/has(...)/tuple(...)emission sites (JSON builder, VLP zero-hop, MapLiteral, composite-ID, BFS shortestPath countIf) through the mapper.LDBC sweep result
15 passed / 21 xfailed / 5 skipped— same green count as before. The coverage closure pushed most failing A-queries past the FunctionMapper layer onto the next latent gap (Category C — CTE alias resolution:with_*_cte_N.<col>vs<scope>.<col>). The xfail markers are retained but their reason strings now reflect the actual current failure mode, ready for the next PR to tackle.Residual A leaks (next FunctionMapper PR)
tuple()fromNodeId::sql_tuplecomposite-key pathcaseWithExpressiontoUnixTimestamp64Milliin duration-arithmeticcount_if(cond, val)— arity mismatch, needs structural rewriteformatRowNoNewlineminIf— needs structural rewrite tomin(CASE WHEN cond THEN val END)Test plan
cargo build --release -p clickgraph-tool --features databrickscleancargo fmt --all && cargo clippy --all-targets --features databrickscleancargo test --lib— 1370 passed, 0 failedCLICKGRAPH_SPARK_TESTS=1 pytest tests/spark_smoke/test_ldbc_sweep.py— 15/21/5 (unchanged from main)🤖 Generated with Claude Code