feat(function-mapper): expand Spark dialect coverage; retag LDBC sweep xfails by genezhang · Pull Request #347 · genezhang/clickgraph

genezhang · 2026-05-18T16:08:25Z

Summary

- Close the most-common Category A FunctionMapper leaks surfaced by the LDBC sweep in PR #346 (36 official queries on Delta): `anyLast`, `countIf`, temporal extractors, `has(arr, elem)`, `toString(...)`, and `tuple(...)` no longer emit CH-native names into Spark SQL.
- Add `tuple_constructor()` to the `FunctionMapper` trait (CH `tuple` / Spark `struct`).
- Make `wrap_epoch_millis_arg` dialect-aware so `datetime({epochMillis: ...})` and friends emit `timestamp_millis()` on Spark and `fromUnixTimestamp64Milli()` on CH.
- Route the remaining hard-coded `toString(...)` / `has(...)` / `tuple(...)` emission sites (JSON builder, VLP zero-hop, MapLiteral, composite-ID, BFS shortestPath `countIf`) through the mapper.
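The mapper routing described in the summary can be pictured with a minimal sketch. The trait below is a simplified, hypothetical stand-in for the project's `FunctionMapper`: the method names `tuple_constructor`, `cast_string`, and `array_contains` come from this PR, but their signatures, the two mapper structs, and the `render_tuple` helper are assumptions made for illustration only.

```rust
/// Hypothetical, simplified stand-in for the project's `FunctionMapper` trait.
/// Method names follow the PR description; the signatures are assumptions.
pub trait FunctionMapper {
    /// Name of the function that builds a tuple/struct value.
    fn tuple_constructor(&self) -> &'static str;
    /// SQL text that casts `expr` to a string.
    fn cast_string(&self, expr: &str) -> String;
    /// SQL text that tests whether `arr` contains `elem`.
    fn array_contains(&self, arr: &str, elem: &str) -> String;
}

/// Assumed mapper for the ClickHouse dialect.
pub struct ClickHouseMapper;
/// Assumed mapper for the Spark/Databricks dialect.
pub struct DatabricksMapper;

impl FunctionMapper for ClickHouseMapper {
    fn tuple_constructor(&self) -> &'static str {
        "tuple"
    }
    fn cast_string(&self, expr: &str) -> String {
        format!("toString({})", expr)
    }
    fn array_contains(&self, arr: &str, elem: &str) -> String {
        format!("has({}, {})", arr, elem)
    }
}

impl FunctionMapper for DatabricksMapper {
    fn tuple_constructor(&self) -> &'static str {
        "struct"
    }
    fn cast_string(&self, expr: &str) -> String {
        format!("cast({} as string)", expr)
    }
    fn array_contains(&self, arr: &str, elem: &str) -> String {
        format!("array_contains({}, {})", arr, elem)
    }
}

/// Hypothetical emission site: the caller never spells a dialect-specific
/// function name, it only asks the active mapper for one.
pub fn render_tuple(mapper: &dyn FunctionMapper, items: &[&str]) -> String {
    format!("{}({})", mapper.tuple_constructor(), items.join(", "))
}
```

With this shape, `render_tuple(&DatabricksMapper, &["a.id", "b.id"])` yields `struct(a.id, b.id)`, while the same call with `ClickHouseMapper` yields `tuple(a.id, b.id)`. The point of the design is that a hard-coded `format!("tuple(...)")` at an emission site is exactly the kind of leak this PR removes.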

LDBC sweep result

15 passed / 21 xfailed / 5 skipped — same green count as before. The coverage closure pushed most failing A-queries past the FunctionMapper layer onto the next latent gap (Category C — CTE alias resolution: with_*_cte_N.<col> vs <scope>.<col>). The xfail markers are retained but their reason strings now reflect the actual current failure mode, ready for the next PR to tackle.

Residual A leaks (next FunctionMapper PR)

| Query | Routine |
| --- | --- |
| bi-6 | `tuple()` from `NodeId::sql_tuple` composite-key path |
| bi-13 | `caseWithExpression` |
| bi-17 | `toUnixTimestamp64Milli` in duration arithmetic |
| complex-5 | `count_if(cond, val)`: arity mismatch, needs structural rewrite |
| complex-12, short-2 | `formatRowNoNewline` |
| complex-13, complex-14 | `minIf`: needs structural rewrite to `min(CASE WHEN cond THEN val END)` |

Test plan

- `cargo build --release -p clickgraph-tool --features databricks`: clean
- `cargo fmt --all && cargo clippy --all-targets --features databricks`: clean
- `cargo test --lib`: 1370 passed, 0 failed
- `CLICKGRAPH_SPARK_TESTS=1 pytest tests/spark_smoke/test_ldbc_sweep.py`: 15/21/5 (unchanged from main)

🤖 Generated with Claude Code

…p xfails

Closes the most-common Category A leaks surfaced by the LDBC sweep (PR #346): `anyLast`, `countIf`, temporal extractors (toYear/Month/...), `has(arr, elem)`, `toString(...)`, and `tuple(...)` list-construction were all emitting CH-native names into Spark SQL.

Changes
- Registry: add `anyLast → any_value`, `countIf → count_if`. Mark temporal extractors (toYear, toMonth, toDayOfMonth, toHour, toMinute, toSecond, toDayOfWeek, toDayOfYear, toQuarter, toISOWeek) with `databricks_name` so they resolve to Spark `year/month/...`.
- Make `wrap_epoch_millis_arg` dialect-aware: emit `fromUnixTimestamp64Milli` on CH, `timestamp_millis` on Spark.
- Add `tuple_constructor()` to FunctionMapper trait (`tuple` on CH, `struct` on Spark) and route `LogicalExpr::List` through it.
- Route `has(arr, elem)` through `mapper.array_contains()`: `has(...)` on CH, `array_contains(...)` on Spark.
- Route `toString(...)` through `mapper.cast_string()` (`toString(...)` on CH, `cast(... as string)` on Spark) at the JSON-builder, VLP zero-hop, MapLiteral, and composite-ID emission sites.
- Route countIf in BFS shortestPath CTE through `mapper.count_if()`.

LDBC sweep result (15 passed / 21 xfailed / 5 skipped, unchanged)

Coverage closure pushed the failing queries past the FunctionMapper layer onto the next gap (CTE alias resolution, `with_*_cte_N.<col>` vs `<scope>.<col>`, i.e. existing Category C). The xfail markers are retained but their reason strings now reflect the actual current failure mode.

Residual A leaks: `tuple()` from `NodeId::sql_tuple` (bi-6), `toUnixTimestamp64Milli` in duration arithmetic (bi-17), `caseWithExpression` (bi-13), `formatRowNoNewline` (complex-12, short-2), arity mismatch in 2-arg `count_if` (complex-5), and `minIf` (complex-13, complex-14).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
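The registry changes in this commit message can be sketched roughly as follows. This is not the project's code: the `Dialect` enum, the `resolve` method, and the two-argument signature of `wrap_epoch_millis_arg` are assumptions, and the struct is trimmed to the two name fields that matter here. The fallback to the ClickHouse name when `databricks_name` is `None` is inferred from the review discussion, where an unmapped function is said to surface as `UNRESOLVED_ROUTINE` on Spark.

```rust
/// Assumed dialect switch; the real project may select the dialect differently.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Dialect {
    ClickHouse,
    Databricks,
}

/// Trimmed-down version of a registry entry (the real struct has more fields).
pub struct FunctionMapping {
    pub clickhouse_name: &'static str,
    pub databricks_name: Option<&'static str>,
}

impl FunctionMapping {
    /// Hypothetical resolver: use the Spark name when one is registered,
    /// otherwise fall through to the ClickHouse name so the gap fails loudly
    /// on Spark instead of being papered over.
    pub fn resolve(&self, dialect: Dialect) -> &'static str {
        match dialect {
            Dialect::ClickHouse => self.clickhouse_name,
            Dialect::Databricks => self.databricks_name.unwrap_or(self.clickhouse_name),
        }
    }
}

/// Dialect-aware wrapper for an epoch-milliseconds argument.
/// The function names are the ones named in the PR; the signature is assumed.
pub fn wrap_epoch_millis_arg(dialect: Dialect, arg: &str) -> String {
    let func = match dialect {
        Dialect::ClickHouse => "fromUnixTimestamp64Milli",
        Dialect::Databricks => "timestamp_millis",
    };
    format!("{}({})", func, arg)
}
```

For example, an entry with `clickhouse_name: "anyLast"` and `databricks_name: Some("any_value")` resolves to `any_value` on Databricks, whereas an entry whose `databricks_name` is `None` keeps its ClickHouse name in both dialects.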

Copilot

Pull request overview

This PR expands Databricks/Spark SQL function mapping coverage in the ClickGraph SQL generation pipeline, especially for LDBC sweep failures where ClickHouse-native functions were leaking into Spark SQL.

Changes:

Adds dialect-aware tuple/struct, string-cast, array-membership, temporal-extractor, anyLast, and countIf mappings.
Routes several hard-coded toString, has, and tuple/list emission paths through FunctionMapper.
Retags Spark LDBC sweep expected failures to reflect the new residual failure categories.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `tests/spark_smoke/test_ldbc_sweep.py` | Updates strict xfail reasons for the LDBC Spark sweep. |
| `src/sql_generator/function_mapper/mod.rs` | Extends mapper trait with tuple/struct constructor abstraction. |
| `src/sql_generator/function_mapper/databricks.rs` | Adds Spark `struct` tuple-constructor mapping and test assertion. |
| `src/sql_generator/function_mapper/clickhouse.rs` | Adds ClickHouse `tuple` constructor mapping. |
| `src/sql_generator/emitters/clickhouse/variable_length_cte.rs` | Routes BFS `countIf` through the active function mapper. |
| `src/sql_generator/emitters/clickhouse/to_sql.rs` | Routes list tuple construction, array membership, and map value casts through the mapper. |
| `src/sql_generator/emitters/clickhouse/to_sql_query.rs` | Routes additional render-expression membership and map-cast sites through the mapper. |
| `src/sql_generator/emitters/clickhouse/multi_type_vlp_joins.rs` | Replaces hard-coded `toString` calls in multi-type VLP paths with mapper casts. |
| `src/sql_generator/emitters/clickhouse/json_builder.rs` | Replaces multi-type union ID `toString` calls with mapper casts. |
| `src/sql_generator/emitters/clickhouse/function_registry.rs` | Adds Databricks names/transforms for temporal extractors, `toUnixTimestampMillis`, `anyLast`, and `countIf`. |
| `src/render_plan/cte_extraction.rs` | Routes extracted CTE array membership through the mapper. |
| `src/graph_catalog/config.rs` | Replaces composite-ID string casts with mapper-provided string casts. |

genezhang · 2026-05-18T16:27:15Z

+        // dayOfWeek(datetime) -> CH: toDayOfWeek, Spark: dayofweek
        m.insert("dayofweek", FunctionMapping {
            neo4j_name: "dayOfWeek",
            clickhouse_name: "toDayOfWeek",
-            databricks_name: None,
+            databricks_name: Some("dayofweek"),
            arg_transform: Some(wrap_epoch_millis_arg),


Good catch — reverted dayofweek's databricks_name to None in 13d36de. CH toDayOfWeek is ISO (1=Monday) and Spark dayofweek is 1=Sunday, so the symbol swap was silently wrong. Until a structural rewrite (e.g. weekday(x) + 1) lands, falling through to toDayOfWeek makes the gap surface as UNRESOLVED_ROUTINE.
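To make the numbering mismatch concrete, here is a small sketch that models the three conventions as pure functions over a "days since Monday" index (0 = Monday .. 6 = Sunday). The function names and the index representation are hypothetical and exist only for this illustration; the numbering rules themselves are the ones stated above for ClickHouse `toDayOfWeek` and Spark `dayofweek`, plus Spark `weekday`, which counts 0 = Monday .. 6 = Sunday.

```rust
/// ClickHouse `toDayOfWeek` numbering (ISO): 1 = Monday .. 7 = Sunday.
/// `days_since_monday` is 0 for Monday .. 6 for Sunday.
pub fn ch_to_day_of_week(days_since_monday: u8) -> u8 {
    days_since_monday % 7 + 1
}

/// Spark `dayofweek` numbering: 1 = Sunday .. 7 = Saturday.
pub fn spark_dayofweek(days_since_monday: u8) -> u8 {
    (days_since_monday + 1) % 7 + 1
}

/// Spark `weekday` numbering: 0 = Monday .. 6 = Sunday.
pub fn spark_weekday(days_since_monday: u8) -> u8 {
    days_since_monday % 7
}

/// The structural rewrite suggested in the comment: `weekday(x) + 1`.
pub fn spark_iso_rewrite(days_since_monday: u8) -> u8 {
    spark_weekday(days_since_monday) + 1
}
```

Under these definitions a Monday is 1 in ClickHouse but 2 from `spark_dayofweek`, and a Sunday is 7 versus 1, so a plain name swap changes every result. `spark_iso_rewrite` agrees with `ch_to_day_of_week` for all seven days, which is why the fix has to be a rewrite of the expression rather than a rename.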

genezhang · 2026-05-18T16:27:17Z

                 {target} AS end_id,\n        \
-                 CASE WHEN countIf(node_id = {target}) > 0\n            \
+                 CASE WHEN {count_if}(node_id = {target}) > 0\n            \
                 THEN minIf({cast_u16}(hop), node_id = {target})\n            \


Agreed — reverted the countIf rewrite at this site in 13d36de. The accompanying minIf has no Spark equivalent, so the partial mapping produced inconsistent dialect-mixed SQL. The pair needs to be rewritten together as min(CASE WHEN cond THEN val END) — tracked in the PR description's residual-A-leaks list (complex-13/14).
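A rough sketch of what the paired rewrite could look like is given below. Both helper names are hypothetical, and the `count(CASE ...)` form for `countIf` is an assumption added here for symmetry; only the `min(CASE WHEN cond THEN val END)` shape is stated in this thread. The rewrite relies on standard SQL behaviour: a `CASE` with no `ELSE` yields NULL when the condition is false, and `min` / `count(expr)` ignore NULLs.

```rust
/// Portable replacement for ClickHouse `minIf(val, cond)`.
/// Argument order mirrors `minIf`: value first, condition second.
pub fn min_if_portable(val: &str, cond: &str) -> String {
    format!("min(CASE WHEN {} THEN {} END)", cond, val)
}

/// Portable replacement for ClickHouse `countIf(cond)` (assumed form).
/// Rows where `cond` is false produce NULL and are not counted.
pub fn count_if_portable(cond: &str) -> String {
    format!("count(CASE WHEN {} THEN 1 END)", cond)
}

/// Hypothetical rendering of the guarded result expression from the diff above,
/// with both aggregates emitted in the same dialect-neutral style.
pub fn render_hop_guard(hop_expr: &str, cond: &str) -> String {
    format!(
        "CASE WHEN {} > 0 THEN {} END",
        count_if_portable(cond),
        min_if_portable(hop_expr, cond)
    )
}
```

Emitting both aggregates from one place keeps the SELECT in a single dialect, which is the consistency problem described in the comment. One caveat worth checking before adopting this shape: when no row matches, `min(CASE ...)` returns NULL, which may differ from what `minIf` returns on ClickHouse for non-nullable types, so the surrounding `> 0` guard should stay in place.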

…OfWeek/minIf

Address Copilot review on PR #347:

1. `dayOfWeek` databricks_name reverted to None. CH `toDayOfWeek` returns 1=Monday..7=Sunday (ISO); Spark `dayofweek` returns 1=Sunday..7=Saturday. Direct symbol swap silently shifted results by one day. Needs structural rewrite (`weekday(x) + 1`); until then, fall through to `toDayOfWeek` so the gap surfaces as UNRESOLVED_ROUTINE.
2. Revert `count_if` rewrite in BFS shortestPath result branch. The same SELECT also emits `minIf(...)`, which Spark has no symbol for. Half-rewriting only `countIf` produced inconsistent dialect-mixed SQL. Pattern needs to be rewritten as a pair to `min(CASE WHEN cond THEN val END)`; deferred to a follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 18, 2026 16:08

Copilot started reviewing on behalf of genezhang May 18, 2026 16:09

Copilot AI reviewed May 18, 2026

genezhang merged commit af06679 into main May 18, 2026
4 checks passed

genezhang deleted the feat/functionmapper-spark-coverage branch May 18, 2026 16:35

genezhang mentioned this pull request May 18, 2026

fix(render-plan): rewrite raw CTE-name qualifiers to FROM/JOIN alias #348

Merged

4 tasks

genezhang merged 2 commits into main from feat/functionmapper-spark-coverage
