Component Selection
Describe the Bug
This is a tracking issue for a family of behavior divergences between the
Spark json_to_map function (bolt/functions/sparksql/JsonToMap.cpp) and the
online Hive reference UDF it is meant to replace, which is backed by
com.jsoniter (JsonIterator.deserialize(json).keys() + Any.toString() per value).
json_to_map has two backends selected by sonic.json_parse (default true →
sonic; false → simdjson). We compared both backends against the
jsoniter reference using coverage-guided differential fuzzing.
Fuzzing method
- Coverage-guided corpus generation. A libFuzzer target (built from the
LLVM 19.1.7 compiler-rt/lib/fuzzer sources, since the local Clang ships no
compiler-rt) runs both native backends on each input. The sonic and
simdjson parsers are compiled with SanitizerCoverage
(-fsanitize=fuzzer-no-link), so libFuzzer keeps every input that reaches a
new edge in the real parsers — this guarantees input diversity grows with
the parsers' branch space rather than saturating (a prior blind-random pass
plateaued: 80k samples from a fixed distribution found 0 new behavior
categories).
- Differential oracle. The coverage-diverse corpus is then replayed through
three implementations — sonic backend, simdjson backend, and the jsoniter
reference — and each result is reduced to a canonical, byte-exact form
(status | sorted(hex(key):hex(value))) and diffed. Inputs were restricted
to valid UTF-8 so the Java String round-trip faithfully mirrors Hive.
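The canonical form used by the differential oracle can be sketched as below. This is a minimal illustration, not the harness's actual code: the function names `canonicalize` and `diverges`, the status strings `NULL` and `OK`, and the `,` entry separator are all assumptions made for the example. Only the shape `status | sorted(hex(key):hex(value))` comes from the description above.

```python
def canonicalize(result):
    """Reduce one json_to_map result to a byte-exact, order-independent string.

    `result` is None for SQL NULL, otherwise a dict of str -> str.
    The status strings and the "," separator are illustrative assumptions.
    """
    if result is None:
        return "NULL|"
    # Hex-encode the UTF-8 bytes so invisible differences (whitespace,
    # escapes, control characters) show up in the diff.
    pairs = sorted(
        f"{key.encode('utf-8').hex()}:{value.encode('utf-8').hex()}"
        for key, value in result.items()
    )
    return "OK|" + ",".join(pairs)


def diverges(canonical_by_impl):
    """True when the implementations do not all agree on one canonical form."""
    return len(set(canonical_by_impl.values())) > 1


if __name__ == "__main__":
    forms = {
        "sonic": canonicalize(None),
        "simdjson": canonicalize({}),
        "jsoniter": canonicalize(None),
    }
    print(forms, diverges(forms))
```

Sorting the encoded pairs makes the comparison independent of map iteration order, so only differences in status, keys, or value bytes count as a divergence.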
Scale
| Metric | Value |
| --- | --- |
| Fuzz workers | 32 (parallel) |
| Wall time | ~30 min |
| Total executions | ~180,000,000 |
| Coverage reached | 2,133 edges / 11,898 features |
| Corpus produced | 317,430 inputs |
| Valid-UTF-8 inputs replayed | 142,400 |
| Divergent inputs | 57,079 |
Divergence taxonomy
Every divergence fell into one of the categories below (no "other" bucket). Each
has a dedicated sub-issue with a minimal repro:
| Category | Backend | Status | Summary |
| --- | --- | --- | --- |
| S1 (#667) | sonic | ✅ numbers fixed; residual | number precision/format fixed (raw-number parse flags); in-range float reformat + nested whitespace/escaping remain (DOM has no raw-source API) |
| S4 (#670) | sonic | ✅ fixed | out-of-range exponent numbers (1e400) no longer rejected |
| S6 (#672) | both | ✅ fixed (PR #674) | raw (unescaped) control chars in strings/keys returned NULL |
| S7 (#673) | simdjson | ✅ fixed | returned NULL for valid non-object scalar JSON |
| #671 (consolidates #668) | both | ⏳ deferred / intentional | invalid non-object handling. Intended rule: valid non-object → empty map, malformed → NULL. The simdjson-side validation to enforce malformed→NULL was reverted to keep PR #674 minimal, so on the simdjson backend malformed non-object input still yields an empty map (sonic already returns NULL). jsoniter's inconsistent empty-map-for-garbage (#668) intentionally not matched |
| S3 (#669) | sonic vs simd | ❌ not done | invalid/lone UTF-16 surrogate escapes handled differently (low value / high risk) |
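The intended rule for #671 (valid non-object → empty map, malformed → NULL) can be sketched with Python's standard `json` module standing in for the native parsers. This is an illustration under assumptions, not the Bolt implementation: the function name `classify` and the three labels are hypothetical, `strict=False` is used so raw control characters inside strings are accepted (as in the #672 fix), and value formatting is deliberately not modeled.

```python
import json


def _reject_constant(name):
    # Python's json accepts NaN/Infinity by default; they are not valid JSON.
    raise ValueError(f"non-standard JSON constant: {name}")


def classify(text):
    """Return which top-level outcome the intended rule assigns to `text`.

    "NULL"      -> input is malformed JSON
    "EMPTY_MAP" -> input is valid JSON but not an object
    "MAP"       -> input is a valid JSON object
    """
    try:
        doc = json.loads(text, strict=False, parse_constant=_reject_constant)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "NULL"
    return "MAP" if isinstance(doc, dict) else "EMPTY_MAP"


if __name__ == "__main__":
    for sample in ["123", "[", '{"a":1}', '{"a":"x\ny"}']:
        print(repr(sample), "->", classify(sample))
```

Under this rule the simdjson backend's current `{}` for `[` would count as a bug, while `{}` for `123` is the expected result.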
Reproduction Steps
The smallest input per category (default backend unless noted), reproducible via
a JsonToMapTest unit test or SELECT json_to_map(c0):
#667 json_to_map('{"v":221.5222225222225222}') was sonic {v=221.52222252222253} -> now {v=221.5222225222225222}
#670 json_to_map('{"v":1e400}') was sonic NULL -> now {v=1e400}
#673 json_to_map('123') (simdjson backend) was NULL -> now {} (valid non-object)
#671 json_to_map('[') (simdjson backend) still {} (empty map) ; jsoniter/sonic NULL (deferred)
#672 json_to_map('{"a":"x<LF>y"}') was NULL -> now {a=x<LF>y}
#669 json_to_map('{"k":"\ud83d"}') sonic NULL ; jsoniter/simdjson keep U+FFFD (not fixed)
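The #667 and #670 repros both come from routing a number through a binary64 double instead of keeping its source text. The sketch below shows the effect with Python's own float parsing and `repr`, which is an assumption standing in for the sonic formatter; the exact digits sonic printed are the ones listed above, and the helper names `via_double` and `via_raw` are hypothetical.

```python
RAW_CASES = ["221.5222225222225222", "1e400"]


def via_double(raw):
    """Parse the literal to a binary64 float and print it back."""
    return repr(float(raw))


def via_raw(raw):
    """Keep the number's source text untouched (the raw-number approach)."""
    return raw


if __name__ == "__main__":
    for raw in RAW_CASES:
        print(f"{raw}: double={via_double(raw)} raw={via_raw(raw)}")
```

A double carries at most 17 significant decimal digits, so the 19-digit literal cannot survive the round trip, and `1e400` overflows the binary64 range entirely. Preserving the raw token avoids both problems without any numeric interpretation.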
Bolt Version / Commit ID
main @ e570b40
System Configuration
- OS: Debian GNU/Linux 13 (trixie)
- Compiler: GCC 12.4 (product build); Clang 19.1.7 (fuzz harness)
- Build Type: Release
- CPU Arch: x86_64 (SSE4.2/AVX2)
- Framework: Spark
Expected Behavior
For all valid JSON, json_to_map should produce the same map as the Hive
jsoniter reference UDF, and both native backends (sonic, simdjson) should agree
with each other.
Status: #670, #672, and #673 are fixed, as is the number part of #667.
Remaining work:
- the #667 residual: sonic re-serializes in-range floats and nested
whitespace/escaping, and fixing that needs a raw-source API the DOM parser lacks;
- #671: simdjson is lenient on malformed non-object input, and the validating
fix was deferred to keep PR #674 minimal;
- #669.
The #668 divergence (folded into #671) is intentional: jsoniter's
NULL-vs-empty-map behavior on malformed input is inconsistent and not worth
replicating.
Additional context
Fixes are prepared on a branch (#672 in PR #674; the rest staged on top).
#668 has been consolidated into #671 because both share the same root problem:
invalid non-object handling. All findings were produced by the coverage-guided
differential fuzzing described above.