feat(function-mapper): expand Spark dialect coverage; retag LDBC sweep xfails #347

Merged
genezhang merged 2 commits into main from feat/functionmapper-spark-coverage
May 18, 2026
@genezhang

Summary

  • Close the most-common Category A FunctionMapper leaks surfaced by the LDBC sweep in PR #346 (36 official queries on Delta). anyLast, countIf, temporal extractors, has(arr, elem), toString(...), and tuple(...) no longer emit CH-native names into Spark SQL.
  • Add tuple_constructor() to the FunctionMapper trait (CH tuple / Spark struct).
  • Make wrap_epoch_millis_arg dialect-aware so datetime({epochMillis: ...}) and friends emit timestamp_millis() on Spark, fromUnixTimestamp64Milli() on CH.
  • Route the remaining hard-coded toString(...) / has(...) / tuple(...) emission sites (JSON builder, VLP zero-hop, MapLiteral, composite-ID, BFS shortestPath countIf) through the mapper.
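
To illustrate the routing described above, here is a minimal sketch of a mapper trait with two dialect implementations. The method names `tuple_constructor`, `array_contains`, and `cast_string` come from this PR; the signatures, the struct names `ClickHouseMapper` / `DatabricksMapper`, and the `render_list` helper are assumptions made for illustration and may not match the real code in `src/sql_generator/function_mapper/`.

```rust
/// Simplified stand-in for the FunctionMapper trait.
/// Method names follow the PR text; the signatures are assumptions.
pub trait FunctionMapper {
    /// Name of the tuple/struct constructor in this dialect.
    fn tuple_constructor(&self) -> &'static str;
    /// Array-membership test (`has` on CH, `array_contains` on Spark).
    fn array_contains(&self, arr: &str, elem: &str) -> String;
    /// String cast (`toString` on CH, `cast(... as string)` on Spark).
    fn cast_string(&self, expr: &str) -> String;
}

// Hypothetical implementor names, chosen for this sketch only.
pub struct ClickHouseMapper;
pub struct DatabricksMapper;

impl FunctionMapper for ClickHouseMapper {
    fn tuple_constructor(&self) -> &'static str {
        "tuple"
    }
    fn array_contains(&self, arr: &str, elem: &str) -> String {
        format!("has({arr}, {elem})")
    }
    fn cast_string(&self, expr: &str) -> String {
        format!("toString({expr})")
    }
}

impl FunctionMapper for DatabricksMapper {
    fn tuple_constructor(&self) -> &'static str {
        "struct"
    }
    fn array_contains(&self, arr: &str, elem: &str) -> String {
        format!("array_contains({arr}, {elem})")
    }
    fn cast_string(&self, expr: &str) -> String {
        format!("cast({expr} as string)")
    }
}

/// Hypothetical emission site: a list literal rendered through the mapper
/// instead of a hard-coded `tuple(...)`.
pub fn render_list(mapper: &dyn FunctionMapper, items: &[&str]) -> String {
    format!("{}({})", mapper.tuple_constructor(), items.join(", "))
}
```

With this shape, an emission site never names a dialect-specific function directly: `render_list(&DatabricksMapper, &["a", "b"])` yields `struct(a, b)`, while the ClickHouse mapper yields `tuple(a, b)`.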

LDBC sweep result

15 passed / 21 xfailed / 5 skipped — same green count as before. The coverage closure pushed most failing A-queries past the FunctionMapper layer onto the next latent gap (Category C — CTE alias resolution: with_*_cte_N.<col> vs <scope>.<col>). The xfail markers are retained but their reason strings now reflect the actual current failure mode, ready for the next PR to tackle.

Residual A leaks (next FunctionMapper PR)

| Query | Routine |
| --- | --- |
| bi-6 | tuple() from NodeId::sql_tuple composite-key path |
| bi-13 | caseWithExpression |
| bi-17 | toUnixTimestamp64Milli in duration-arithmetic |
| complex-5 | count_if(cond, val): arity mismatch, needs structural rewrite |
| complex-12, short-2 | formatRowNoNewline |
| complex-13, complex-14 | minIf: needs structural rewrite to min(CASE WHEN cond THEN val END) |

Test plan

  • cargo build --release -p clickgraph-tool --features databricks clean
  • cargo fmt --all && cargo clippy --all-targets --features databricks clean
  • cargo test --lib — 1370 passed, 0 failed
  • CLICKGRAPH_SPARK_TESTS=1 pytest tests/spark_smoke/test_ldbc_sweep.py — 15/21/5 (unchanged from main)

🤖 Generated with Claude Code

…p xfails

Closes the most-common Category A leaks surfaced by the LDBC sweep
(PR #346): `anyLast`, `countIf`, temporal extractors (toYear/Month/...),
`has(arr, elem)`, `toString(...)`, and `tuple(...)` list-construction
were all emitting CH-native names into Spark SQL.

Changes
- Registry: add `anyLast → any_value`, `countIf → count_if`. Mark
  temporal extractors (toYear, toMonth, toDayOfMonth, toHour, toMinute,
  toSecond, toDayOfWeek, toDayOfYear, toQuarter, toISOWeek) with
  `databricks_name` so they resolve to Spark `year/month/...`.
- Make `wrap_epoch_millis_arg` dialect-aware: emit
  `fromUnixTimestamp64Milli` on CH, `timestamp_millis` on Spark.
- Add `tuple_constructor()` to FunctionMapper trait — `tuple` on CH,
  `struct` on Spark — and route `LogicalExpr::List` through it.
- Route `has(arr, elem)` through `mapper.array_contains()` — `has(...)`
  on CH, `array_contains(...)` on Spark.
- Route `toString(...)` through `mapper.cast_string()` — `toString(...)`
  on CH, `cast(... as string)` on Spark — at the JSON-builder, VLP
  zero-hop, MapLiteral, and composite-ID emission sites.
- Route countIf in BFS shortestPath CTE through `mapper.count_if()`.
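
The registry and argument-transform changes above can be sketched roughly as follows. The field names `clickhouse_name` and `databricks_name` and the function name `wrap_epoch_millis_arg` appear in this PR; the `Dialect` enum, the `resolve_name` helper, and the transform's signature are assumptions for illustration only.

```rust
/// Hypothetical dialect selector; the real project may model this differently.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Dialect {
    ClickHouse,
    Databricks,
}

/// Trimmed-down registry entry: only the two name fields used in this sketch.
pub struct FunctionMapping {
    pub clickhouse_name: &'static str,
    pub databricks_name: Option<&'static str>,
}

/// Pick the function name for a dialect. When no Databricks name is
/// registered, fall through to the ClickHouse name so the gap surfaces
/// as an unresolved routine on Spark instead of a silently wrong result.
pub fn resolve_name(mapping: &FunctionMapping, dialect: Dialect) -> &'static str {
    match dialect {
        Dialect::ClickHouse => mapping.clickhouse_name,
        Dialect::Databricks => mapping.databricks_name.unwrap_or(mapping.clickhouse_name),
    }
}

/// Dialect-aware epoch-millis wrapper (assumed signature).
pub fn wrap_epoch_millis_arg(arg: &str, dialect: Dialect) -> String {
    match dialect {
        Dialect::ClickHouse => format!("fromUnixTimestamp64Milli({arg})"),
        Dialect::Databricks => format!("timestamp_millis({arg})"),
    }
}
```

Keeping `databricks_name` as an `Option` makes "not yet mapped" explicit per entry, which is what allows an individual mapping to be withdrawn without touching the emitters.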

LDBC sweep result (15 passed / 21 xfailed / 5 skipped, unchanged)

Coverage closure pushed the failing queries past the FunctionMapper
layer onto the next gap (CTE alias resolution — `with_*_cte_N.<col>`
vs `<scope>.<col>` — i.e. existing Category C). The xfail markers are
retained but their reason strings now reflect the actual current
failure mode. Residual A leaks: `tuple()` from `NodeId::sql_tuple` (bi-6),
`toUnixTimestamp64Milli` in duration arithmetic (bi-17), `caseWithExpression`
(bi-13), `formatRowNoNewline` (complex-12, short-2), arity mismatch
in 2-arg `count_if` (complex-5), and `minIf` (complex-13, complex-14).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 18, 2026 16:08

Copilot AI left a comment

Pull request overview

This PR expands Databricks/Spark SQL function mapping coverage in the ClickGraph SQL generation pipeline, especially for LDBC sweep failures where ClickHouse-native functions were leaking into Spark SQL.

Changes:

  • Adds dialect-aware tuple/struct, string-cast, array-membership, temporal-extractor, anyLast, and countIf mappings.
  • Routes several hard-coded toString, has, and tuple/list emission paths through FunctionMapper.
  • Retags Spark LDBC sweep expected failures to reflect the new residual failure categories.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/spark_smoke/test_ldbc_sweep.py | Updates strict xfail reasons for the LDBC Spark sweep. |
| src/sql_generator/function_mapper/mod.rs | Extends mapper trait with tuple/struct constructor abstraction. |
| src/sql_generator/function_mapper/databricks.rs | Adds Spark struct tuple-constructor mapping and test assertion. |
| src/sql_generator/function_mapper/clickhouse.rs | Adds ClickHouse tuple constructor mapping. |
| src/sql_generator/emitters/clickhouse/variable_length_cte.rs | Routes BFS countIf through the active function mapper. |
| src/sql_generator/emitters/clickhouse/to_sql.rs | Routes list tuple construction, array membership, and map value casts through the mapper. |
| src/sql_generator/emitters/clickhouse/to_sql_query.rs | Routes additional render-expression membership and map-cast sites through the mapper. |
| src/sql_generator/emitters/clickhouse/multi_type_vlp_joins.rs | Replaces hard-coded toString calls in multi-type VLP paths with mapper casts. |
| src/sql_generator/emitters/clickhouse/json_builder.rs | Replaces multi-type union ID toString calls with mapper casts. |
| src/sql_generator/emitters/clickhouse/function_registry.rs | Adds Databricks names/transforms for temporal extractors, toUnixTimestampMillis, anyLast, and countIf. |
| src/render_plan/cte_extraction.rs | Routes extracted CTE array membership through the mapper. |
| src/graph_catalog/config.rs | Replaces composite-ID string casts with mapper-provided string casts. |

Comment on lines 876 to 881

      // dayOfWeek(datetime) -> CH: toDayOfWeek, Spark: dayofweek
      m.insert("dayofweek", FunctionMapping {
          neo4j_name: "dayOfWeek",
          clickhouse_name: "toDayOfWeek",
    -     databricks_name: None,
    +     databricks_name: Some("dayofweek"),
          arg_transform: Some(wrap_epoch_millis_arg),

@genezhang genezhang May 18, 2026

Good catch — reverted dayofweek's databricks_name to None in 13d36de. CH toDayOfWeek is ISO (1=Monday) and Spark dayofweek is 1=Sunday, so the symbol swap was silently wrong. Until a structural rewrite (e.g. weekday(x) + 1) lands, falling through to toDayOfWeek makes the gap surface as UNRESOLVED_ROUTINE.
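
The numbering mismatch described in this reply can be made concrete with a small sketch. It assumes ClickHouse `toDayOfWeek` in its default mode (1 = Monday .. 7 = Sunday) and Spark `dayofweek` (1 = Sunday .. 7 = Saturday), as stated above, plus Spark `weekday` returning 0 = Monday .. 6 = Sunday. Both function names below are hypothetical and are not part of the PR.

```rust
/// Convert Spark `dayofweek` numbering (1 = Sunday .. 7 = Saturday)
/// to ISO numbering (1 = Monday .. 7 = Sunday), the numbering that
/// ClickHouse `toDayOfWeek` uses by default.
pub fn spark_dayofweek_to_iso(d: u8) -> u8 {
    assert!((1..=7).contains(&d), "dayofweek must be in 1..=7");
    // Rotate by one day: Sunday (1) becomes 7, Monday (2) becomes 1, and so on.
    (d + 5) % 7 + 1
}

/// Hypothetical structural rewrite for Spark: `weekday(x)` is 0-based
/// from Monday, so adding 1 gives ISO numbering.
pub fn iso_day_of_week_spark(arg: &str) -> String {
    format!("(weekday({arg}) + 1)")
}
```

For example, a Sunday is `1` under Spark `dayofweek` but `7` under ISO numbering, so a plain name swap would change query results without raising any error.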

      {target} AS end_id,\n \
    - CASE WHEN countIf(node_id = {target}) > 0\n \
    + CASE WHEN {count_if}(node_id = {target}) > 0\n \
      THEN minIf({cast_u16}(hop), node_id = {target})\n \

Agreed — reverted the countIf rewrite at this site in 13d36de. The accompanying minIf has no Spark equivalent, so the partial mapping produced inconsistent dialect-mixed SQL. The pair needs to be rewritten together as min(CASE WHEN cond THEN val END) — tracked in the PR description's residual-A-leaks list (complex-13/14).
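
A rough sketch of what rewriting the pair together could look like is shown below. It is not the follow-up implementation: the `Dialect` enum, the helper names, and the simplified fragment (no `{cast_u16}` wrapper) are assumptions. The Spark form relies on `min` ignoring NULLs, so `min(CASE WHEN cond THEN val END)` only considers rows where the condition holds.

```rust
/// Hypothetical dialect selector used only in this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Dialect {
    ClickHouse,
    Databricks,
}

/// Conditional count: `countIf` on ClickHouse, `count_if` on Spark.
pub fn count_if(dialect: Dialect, cond: &str) -> String {
    match dialect {
        Dialect::ClickHouse => format!("countIf({cond})"),
        Dialect::Databricks => format!("count_if({cond})"),
    }
}

/// Conditional minimum: `minIf` on ClickHouse, a structural rewrite on
/// Spark because there is no same-shaped function to swap in.
pub fn min_if(dialect: Dialect, value: &str, cond: &str) -> String {
    match dialect {
        Dialect::ClickHouse => format!("minIf({value}, {cond})"),
        Dialect::Databricks => format!("min(CASE WHEN {cond} THEN {value} END)"),
    }
}

/// Simplified version of the shortest-path result expression: both
/// aggregates are rendered for the same dialect, never a mix.
pub fn shortest_hop_fragment(dialect: Dialect, target: &str) -> String {
    let cond = format!("node_id = {target}");
    format!(
        "CASE WHEN {} > 0 THEN {} END",
        count_if(dialect, &cond),
        min_if(dialect, "hop", &cond)
    )
}
```

Rendering both aggregates from one dialect value is the point of the sketch: the fragment is either fully ClickHouse or fully Spark.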

…OfWeek/minIf

Address Copilot review on PR #347:

1. `dayOfWeek` databricks_name reverted to None. CH `toDayOfWeek` returns
   1=Monday..7=Sunday (ISO); Spark `dayofweek` returns 1=Sunday..7=Saturday.
   Direct symbol swap silently shifted results by one day. Needs structural
   rewrite (`weekday(x) + 1`) — until then, fall through to `toDayOfWeek` so
   the gap surfaces as UNRESOLVED_ROUTINE.

2. Revert `count_if` rewrite in BFS shortestPath result branch. The same
   SELECT also emits `minIf(...)`, which Spark has no symbol for. Half-
   rewriting only `countIf` produced inconsistent dialect-mixed SQL.
   Pattern needs to be rewritten as a pair to `min(CASE WHEN cond THEN val
   END)` — deferred to a follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@genezhang genezhang merged commit af06679 into main May 18, 2026
4 checks passed
@genezhang genezhang deleted the feat/functionmapper-spark-coverage branch May 18, 2026 16:35