
[Bug] json_to_map diverges from Hive jsoniter reference UDF (coverage-guided differential fuzzing — tracking issue) #666

Description

@zhangxffff

Component Selection

  • Core Engine (Expression eval, Memory, Vector)
  • Connectors / File Formats (Hive, Parquet, etc.)
  • API / Bindings (Python, etc.)
  • Build
  • Other

Describe the Bug

This is a tracking issue for a family of behavior divergences between the
Spark json_to_map function (bolt/functions/sparksql/JsonToMap.cpp) and the
online Hive reference UDF it is meant to replace, which is backed by
com.jsoniter (JsonIterator.deserialize(json).keys() + Any.toString() per value).

json_to_map has two backends selected by sonic.json_parse (default true →
sonic; false → simdjson). We compared both backends against the
jsoniter reference using coverage-guided differential fuzzing.

Fuzzing method

  1. Coverage-guided corpus generation. A libFuzzer target (built from the
    LLVM 19.1.7 compiler-rt/lib/fuzzer sources, since the local Clang ships no
    compiler-rt) runs both native backends on each input. The sonic and
    simdjson parsers are compiled with SanitizerCoverage
    (-fsanitize=fuzzer-no-link), so libFuzzer keeps every input that reaches a
    new edge in the real parsers. Input diversity therefore grows with the
    parsers' branch space instead of saturating; a prior blind-random pass
    plateaued, with 80k samples from a fixed distribution finding 0 new
    behavior categories.
  2. Differential oracle. The coverage-diverse corpus is then replayed through
    three implementations — sonic backend, simdjson backend, and the jsoniter
    reference — and each result is reduced to a canonical, byte-exact form
    (status | sorted(hex(key):hex(value))) and diffed. Inputs were restricted
    to valid UTF-8 so the Java String round-trip faithfully mirrors Hive.
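
The canonical form used by the oracle can be sketched as follows. This is a minimal illustration, not the harness's actual code: the `Entries` alias, the `canonicalize` and `toHex` names, the `OK`/`NULL` status tokens, and the comma separator are all assumptions made for the example. Only the overall shape (status | sorted(hex(key):hex(value))) comes from the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical result shape: std::nullopt models a NULL result,
// an empty vector models an empty map.
using Entries = std::vector<std::pair<std::string, std::string>>;

// Lowercase hex of every byte, so the comparison is byte-exact and
// unaffected by how a terminal or log renders non-printable bytes.
inline std::string toHex(std::string_view bytes) {
  static constexpr char kDigits[] = "0123456789abcdef";
  std::string out;
  out.reserve(bytes.size() * 2);
  for (unsigned char b : bytes) {
    out.push_back(kDigits[b >> 4]);
    out.push_back(kDigits[b & 0x0F]);
  }
  return out;
}

// Reduces one implementation's result to "status|sorted(hex(key):hex(value))".
// Sorting removes any dependence on map iteration order.
inline std::string canonicalize(const std::optional<Entries>& result) {
  if (!result) {
    return "NULL|";  // assumed token for a NULL result
  }
  std::vector<std::string> parts;
  parts.reserve(result->size());
  for (const auto& [key, value] : *result) {
    parts.push_back(toHex(key) + ":" + toHex(value));
  }
  std::sort(parts.begin(), parts.end());
  std::string out = "OK|";  // assumed token for a non-NULL map
  for (std::size_t i = 0; i < parts.size(); ++i) {
    if (i > 0) {
      out += ',';
    }
    out += parts[i];
  }
  return out;
}
```

Under these assumptions a NULL result and an empty map canonicalize to different strings (`NULL|` versus `OK|`), which is the distinction that matters for the non-object cases tracked in #671 and #673.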

Scale

| Metric | Value |
| --- | --- |
| Fuzz workers | 32 (parallel) |
| Wall time | ~30 min |
| Total executions | ~180,000,000 |
| Coverage reached | 2,133 edges / 11,898 features |
| Corpus produced | 317,430 inputs |
| Valid-UTF-8 inputs replayed | 142,400 |
| Divergent inputs | 57,079 |

Divergence taxonomy

Every divergence fell into one of the categories below (no "other" bucket). Each
has a dedicated sub-issue with a minimal repro:

| Category | Backend | Status | Summary |
| --- | --- | --- | --- |
| S1 (#667) | sonic | ✅ numbers fixed; residual | number precision/format fixed (raw-number parse flags); in-range float reformat + nested whitespace/escaping remain (DOM has no raw-source API) |
| S4 (#670) | sonic | ✅ fixed | out-of-range exponent numbers (1e400) no longer rejected |
| S6 (#672) | both | ✅ fixed (PR #674) | raw (unescaped) control chars in strings/keys returned NULL |
| S7 (#673) | simdjson | ✅ fixed | returned NULL for valid non-object scalar JSON |
| #671 (consolidates #668) | both | ⏳ deferred / intentional | invalid non-object handling. Intended rule: valid non-object → empty map, malformed → NULL. The simdjson-side validation to enforce malformed → NULL was reverted to keep PR #674 minimal, so on the simdjson backend malformed non-object input still yields an empty map (sonic already returns NULL). jsoniter's inconsistent empty-map-for-garbage (#668) intentionally not matched |
| S3 (#669) | sonic vs simd | ❌ not done | invalid/lone UTF-16 surrogate escapes handled differently (low value / high risk) |

Reproduction Steps

The smallest input per category (default backend unless noted), reproducible via
a JsonToMapTest unit test or SELECT json_to_map(c0):

#667  json_to_map('{"v":221.5222225222225222}')      was sonic {v=221.52222252222253}  -> now {v=221.5222225222225222}
#670  json_to_map('{"v":1e400}')                      was sonic NULL                    -> now {v=1e400}
#673  json_to_map('123') (simdjson backend)           was NULL                          -> now {}    (valid non-object)
#671  json_to_map('[')   (simdjson backend)           still {} (empty map)              ; jsoniter/sonic NULL  (deferred)
#672  json_to_map('{"a":"x<LF>y"}')                   was NULL                          -> now {a=x<LF>y}
#669  json_to_map('{"k":"\ud83d"}')                   sonic NULL ; jsoniter/simdjson keep U+FFFD  (not fixed)
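
For the #669 case, the input `\ud83d` is a high surrogate with no low surrogate after it. The sketch below shows one way to detect such escapes in raw JSON text when triaging corpus inputs. It is illustrative only and is not taken from either backend; the function names are hypothetical, and it assumes backslashes appear only inside JSON strings, which holds for valid JSON.

```cpp
#include <cstddef>
#include <string_view>

// Parses 4 hex digits at text[pos, pos + 4); returns -1 if truncated or malformed.
inline int parseHex4(std::string_view text, std::size_t pos) {
  if (pos + 4 > text.size()) {
    return -1;
  }
  int value = 0;
  for (std::size_t i = pos; i < pos + 4; ++i) {
    const char c = text[i];
    int digit = 0;
    if (c >= '0' && c <= '9') {
      digit = c - '0';
    } else if (c >= 'a' && c <= 'f') {
      digit = c - 'a' + 10;
    } else if (c >= 'A' && c <= 'F') {
      digit = c - 'A' + 10;
    } else {
      return -1;
    }
    value = value * 16 + digit;
  }
  return value;
}

// Returns the UTF-16 code unit of a \uXXXX escape starting at pos, or -1.
inline int unicodeEscapeAt(std::string_view text, std::size_t pos) {
  if (pos + 1 >= text.size() || text[pos] != '\\' || text[pos + 1] != 'u') {
    return -1;
  }
  return parseHex4(text, pos + 2);
}

// True if the text contains a \uXXXX surrogate escape that is not part of a
// well-formed high+low pair.
inline bool hasLoneSurrogateEscape(std::string_view json) {
  std::size_t i = 0;
  while (i < json.size()) {
    if (json[i] != '\\') {
      ++i;
      continue;
    }
    const int unit = unicodeEscapeAt(json, i);
    if (unit < 0) {
      i += 2;  // some other escape such as \\ or \n; skip both characters
      continue;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true;  // low surrogate with no preceding high surrogate
    }
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      const int next = unicodeEscapeAt(json, i + 6);
      if (next < 0xDC00 || next > 0xDFFF) {
        return true;  // high surrogate not followed by a low surrogate
      }
      i += 12;  // consume the complete pair
      continue;
    }
    i += 6;  // ordinary \uXXXX escape
  }
  return false;
}
```

A filter like this could be used to bucket surrogate-related divergences separately from the other categories; it makes no claim about what any of the three implementations should return for such input.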

Bolt Version / Commit ID

main @ e570b40

System Configuration

  • OS: Debian GNU/Linux 13 (trixie)
  • Compiler: GCC 12.4 (product build); Clang 19.1.7 (fuzz harness)
  • Build Type: Release
  • CPU Arch: x86_64 (SSE4.2/AVX2)
  • Framework: Spark

Expected Behavior

For all valid JSON, json_to_map should produce the same map as the Hive
jsoniter reference UDF, and both native backends (sonic, simdjson) should agree
with each other.
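
The agreement requirement can be expressed as a three-way comparison over canonicalized results. The following is a rough sketch under stated assumptions: the `Agreement` enum and `classify` function are hypothetical names, and each argument is assumed to be an already-canonicalized, byte-exact string for one implementation.

```cpp
#include <string>

// Hypothetical outcome of comparing the two native backends with the reference.
enum class Agreement {
  AllAgree,
  SonicDiffers,                   // simdjson matches the reference, sonic does not
  SimdjsonDiffers,                // sonic matches the reference, simdjson does not
  BackendsAgreeReferenceDiffers,  // backends match each other but not the reference
  AllDiffer,
};

// Each argument is the canonical result string of one implementation.
inline Agreement classify(
    const std::string& sonic,
    const std::string& simdjson,
    const std::string& reference) {
  const bool sonicMatches = sonic == reference;
  const bool simdjsonMatches = simdjson == reference;
  if (sonicMatches && simdjsonMatches) {
    return Agreement::AllAgree;
  }
  if (simdjsonMatches) {
    return Agreement::SonicDiffers;
  }
  if (sonicMatches) {
    return Agreement::SimdjsonDiffers;
  }
  return sonic == simdjson ? Agreement::BackendsAgreeReferenceDiffers
                           : Agreement::AllDiffer;
}
```

With this classification, the #669 repro (sonic NULL, jsoniter and simdjson agreeing) would land in `SonicDiffers`, and the #671 repro on `[` (simdjson empty map, jsoniter and sonic NULL) would land in `SimdjsonDiffers`.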

Status: #670, #672, #673, and the number part of #667 are fixed. Remaining:
the #667 residual (sonic re-serializes in-range floats and nested
whitespace/escaping; fixing this needs a raw-source API the DOM parser lacks);
#671 (simdjson leniency on malformed non-object input; the validating fix was
deferred to keep PR #674 minimal); and #669. The #668 divergence (folded into
#671) is intentional: jsoniter's NULL-vs-empty-map behavior on malformed input
is inconsistent and not worth replicating.

Additional context

Fixes are prepared on a branch (#672 in PR #674; the rest staged on top).
#668 has been consolidated into #671 (same root problem: invalid non-object
handling). All findings were produced by the coverage-guided differential
fuzzing described above.
