Add datasketches HLL sketch aggregate functions #63143

Open
nooneuse wants to merge 32 commits into apache:master from nooneuse:add_datasketches_union_aggregate_functions
@nooneuse nooneuse commented May 11, 2026

What problem does this PR solve?

Doris needs an aggregate function that can process user data containing DataSketches HLL sketches. In many aggregation pipelines, users pre-aggregate detailed data in Hive using the sketches provided by Apache DataSketches and then analyze the resulting sketches in several OLAP engines. Compared with the HLL union aggregate functions that these engines offer natively, DataSketches HLL sketches differ in two key ways: first, the use cases differ; second, the same sketches can be used across engines, for example in ES, Doris, and ClickHouse at the same time. Such requirements are common in production environments.

Issue Number:

Summary:
Implemented a built-in aggregate function that integrates the DataSketches HLL sketch. It cannot be implemented as a Java UDF: in the Java UDF environment, Strings are encoded as UTF-8, which corrupts the binary sketch data, so sketch serialization and deserialization must be implemented on the BE side. Additionally, because Apache DataSketches has been added to the contrib directory as a git submodule, adding other sketches such as the theta sketch should be easy in the future.
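
To illustrate the encoding concern, here is a minimal sketch (not code from this PR). It assumes a simplified structural validator, `looks_like_utf8`, which is a hypothetical helper; arbitrary binary bytes generally fail such a check, so a round trip through a UTF-8 decoded string cannot be assumed to be lossless.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Simplified structural check: verifies lead and continuation byte patterns only.
// It does not reject overlong encodings or surrogate code points.
inline bool looks_like_utf8(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto lead = static_cast<std::uint8_t>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            extra = 0;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (i + extra >= bytes.size()) {
            return false;  // sequence truncated at end of input
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            if ((static_cast<std::uint8_t>(bytes[i + k]) & 0xC0) != 0x80) {
                return false;  // expected a continuation byte (10xxxxxx)
            }
        }
        i += extra + 1;
    }
    return true;
}
```

Plain ASCII text passes this check, while a buffer containing a byte such as 0xFF does not, which is why the sketch bytes are better handled as raw binary on the BE side.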

See: #63142
Use case: see the regression tests and #63142

Release note

  1. Add Apache DataSketches as a third-party submodule.
  2. Implement an aggregate function that integrates the DataSketches HLL sketch.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
  • Behavior changed:

    • No.
  • Does this need documentation?

    • No. No separate documentation is needed; the usage is easy to understand, and it is clearly explained in the regression tests.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Comment thread fe/pom.xml
@BePPPower
Contributor

run buildall

@nooneuse
Author

run buildall

@nooneuse
Author

compile

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 4.00% (1/25) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.05% (1849/2369)
Line Coverage 64.73% (33222/51327)
Region Coverage 65.25% (16441/25198)
Branch Coverage 55.81% (8780/15732)

@nooneuse
Author

run buildall

@nooneuse
Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.05% (1849/2369)
Line Coverage 64.73% (33225/51327)
Region Coverage 65.24% (16439/25198)
Branch Coverage 55.80% (8779/15732)

@hello-stephen
Contributor

TPC-DS: Total hot run time: 171529 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9a7d371bfcfd2baa7cf28fac615a878c81686923, data reload: false

query5	4309	664	517	517
query6	340	222	202	202
query7	4235	549	316	316
query8	328	237	224	224
query9	8851	3986	3984	3984
query10	446	336	304	304
query11	5770	2443	2244	2244
query12	178	129	124	124
query13	1276	667	417	417
query14	5973	5403	5062	5062
query14_1	4392	4391	4369	4369
query15	215	204	189	189
query16	1024	461	458	458
query17	1158	750	623	623
query18	2594	512	386	386
query19	221	211	169	169
query20	136	133	132	132
query21	218	140	121	121
query22	13644	13575	13405	13405
query23	17293	16437	16120	16120
query23_1	16272	16309	16265	16265
query24	7551	1773	1305	1305
query24_1	1321	1301	1303	1301
query25	586	516	451	451
query26	1320	333	177	177
query27	2659	562	349	349
query28	4498	1980	1956	1956
query29	1017	651	524	524
query30	309	246	203	203
query31	1112	1056	945	945
query32	90	77	76	76
query33	551	373	320	320
query34	1154	1109	629	629
query35	798	774	668	668
query36	1354	1361	1193	1193
query37	158	103	92	92
query38	3194	3134	3099	3099
query39	928	926	899	899
query39_1	883	890	881	881
query40	228	150	128	128
query41	67	66	64	64
query42	116	114	110	110
query43	324	328	282	282
query44	
query45	210	198	197	197
query46	1080	1173	744	744
query47	2342	2341	2219	2219
query48	417	427	297	297
query49	633	490	390	390
query50	988	344	267	267
query51	4373	4308	4185	4185
query52	107	108	94	94
query53	267	284	205	205
query54	315	277	260	260
query55	96	92	88	88
query56	309	325	303	303
query57	1415	1407	1344	1344
query58	302	276	281	276
query59	1591	1645	1425	1425
query60	370	330	306	306
query61	159	156	153	153
query62	663	607	571	571
query63	245	204	211	204
query64	2395	812	648	648
query65	
query66	1708	489	350	350
query67	30002	30130	29904	29904
query68	
query69	461	346	310	310
query70	1052	969	1008	969
query71	314	289	273	273
query72	2981	2712	2444	2444
query73	873	797	442	442
query74	5093	5045	4736	4736
query75	2867	2622	2254	2254
query76	2282	1170	770	770
query77	414	419	338	338
query78	12373	12246	11681	11681
query79	1466	1011	743	743
query80	641	552	458	458
query81	453	285	246	246
query82	1347	158	125	125
query83	356	283	245	245
query84	259	140	113	113
query85	894	569	481	481
query86	414	367	365	365
query87	3421	3335	3202	3202
query88	3515	2681	2654	2654
query89	432	392	335	335
query90	1970	190	182	182
query91	181	171	138	138
query92	79	76	76	76
query93	1480	1440	885	885
query94	525	355	302	302
query95	664	386	438	386
query96	1005	799	314	314
query97	2727	2703	2577	2577
query98	258	234	246	234
query99	1103	1073	985	985
Total cold run time: 254075 ms
Total hot run time: 171529 ms

@nooneuse
Author

nooneuse commented May 20, 2026

plz add doc in: https://github.com/apache/doris-website

Hello @zclllyybb, I have added docs for the datasketches_hll_union_agg aggregate function (apache/doris-website#3711).
Could you please help review it when you have time? Thanks!

Also, could you please help trigger /review again? I've fixed the new comments from the bot.

@hello-stephen
Contributor

TPC-H: Total hot run time: 31515 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 33110b177aec07e9a7dda59202bd6e07876eb2f3, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17661	3935	3862	3862
q2	q3	10809	1425	781	781
q4	4680	479	341	341
q5	7534	2309	2137	2137
q6	246	178	137	137
q7	910	777	641	641
q8	9448	1765	1568	1568
q9	5067	4906	4888	4888
q10	6393	2060	1843	1843
q11	431	273	247	247
q12	627	417	302	302
q13	18145	3294	2774	2774
q14	262	259	249	249
q15	q16	823	782	706	706
q17	896	937	956	937
q18	7025	5714	5730	5714
q19	1372	1154	1167	1154
q20	510	423	406	406
q21	6196	2818	2519	2519
q22	468	387	309	309
Total cold run time: 99503 ms
Total hot run time: 31515 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4643	4789	4546	4546
q2	q3	4872	5219	4639	4639
q4	2109	2198	1413	1413
q5	4680	4817	4554	4554
q6	236	174	127	127
q7	1795	1788	1540	1540
q8	2357	2025	2025	2025
q9	7716	7471	7170	7170
q10	4455	4374	3938	3938
q11	516	379	355	355
q12	710	719	512	512
q13	3010	3347	2779	2779
q14	264	284	249	249
q15	q16	677	696	597	597
q17	1254	1237	1228	1228
q18	7092	6733	6625	6625
q19	1081	1064	1106	1064
q20	2211	2212	1923	1923
q21	5247	4596	4425	4425
q22	523	443	396	396
Total cold run time: 55448 ms
Total hot run time: 50105 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 168794 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 33110b177aec07e9a7dda59202bd6e07876eb2f3, data reload: false

query5	4322	662	530	530
query6	347	221	197	197
query7	4235	588	319	319
query8	338	230	213	213
query9	8767	3997	3952	3952
query10	469	340	300	300
query11	5786	2402	2228	2228
query12	173	134	123	123
query13	1275	627	441	441
query14	5963	5418	5046	5046
query14_1	4322	4314	4340	4314
query15	215	212	182	182
query16	1015	470	419	419
query17	1125	703	587	587
query18	2692	471	354	354
query19	216	207	163	163
query20	143	135	132	132
query21	215	142	118	118
query22	13629	13610	13413	13413
query23	17238	16335	15943	15943
query23_1	16180	16219	16120	16120
query24	7463	1737	1284	1284
query24_1	1305	1294	1302	1294
query25	537	461	457	457
query26	1311	349	176	176
query27	2665	558	337	337
query28	4450	1988	1970	1970
query29	998	658	521	521
query30	300	246	203	203
query31	1127	1068	956	956
query32	88	76	76	76
query33	556	362	309	309
query34	1165	1128	647	647
query35	775	792	670	670
query36	1329	1310	1174	1174
query37	161	108	96	96
query38	3191	3092	3040	3040
query39	927	927	890	890
query39_1	874	874	884	874
query40	241	155	134	134
query41	71	70	71	70
query42	111	112	113	112
query43	328	334	285	285
query44	
query45	219	206	200	200
query46	1073	1205	746	746
query47	2286	2361	2165	2165
query48	392	400	315	315
query49	664	513	412	412
query50	989	349	261	261
query51	4324	4332	4202	4202
query52	109	106	101	101
query53	266	300	209	209
query54	323	293	264	264
query55	96	94	86	86
query56	324	316	335	316
query57	1428	1379	1305	1305
query58	324	293	278	278
query59	1584	1673	1441	1441
query60	343	336	325	325
query61	183	173	205	173
query62	683	620	567	567
query63	242	194	204	194
query64	2376	806	624	624
query65	
query66	1656	476	348	348
query67	29880	30019	29710	29710
query68	
query69	473	343	308	308
query70	1030	998	1021	998
query71	299	274	273	273
query72	2999	2877	2413	2413
query73	846	754	416	416
query74	5048	4910	4743	4743
query75	2678	2604	2262	2262
query76	2303	1148	735	735
query77	398	395	328	328
query78	12372	12224	11654	11654
query79	1461	1011	757	757
query80	786	525	450	450
query81	473	284	248	248
query82	1349	153	121	121
query83	362	274	245	245
query84	270	142	112	112
query85	945	523	450	450
query86	432	344	323	323
query87	3420	3398	3231	3231
query88	3529	2687	2664	2664
query89	446	384	334	334
query90	1770	174	176	174
query91	180	173	138	138
query92	80	81	73	73
query93	1549	1447	856	856
query94	607	371	301	301
query95	679	385	444	385
query96	1059	804	368	368
query97	2713	2681	2550	2550
query98	249	230	228	228
query99	1087	1108	982	982
Total cold run time: 253259 ms
Total hot run time: 168794 ms

@nooneuse nooneuse requested a review from linrrzqqq May 20, 2026 10:12
@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 89.29% (100/112) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.54% (20750/38755)
Line Coverage 37.19% (196263/527769)
Region Coverage 33.51% (153773/458914)
Branch Coverage 34.53% (67011/194070)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 4.76% (1/21) 🎉
Increment coverage report
Complete coverage report

@nooneuse
Author

run p0

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 90.18% (101/112) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.73% (27988/37958)
Line Coverage 57.71% (303813/526417)
Region Coverage 55.04% (255015/463336)
Branch Coverage 56.48% (110028/194797)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 71.43% (15/21) 🎉
Increment coverage report
Complete coverage report

@nooneuse
Author

run nonconcurrent

@nooneuse
Author

run buildall

Comment thread build.sh

echo "install datasketches-cpp to thirdparty path before build be"
update_submodule "contrib/datasketches-cpp" "datasketches-cpp" "https://github.com/apache/datasketches-cpp/archive/refs/tags/5.2.0.tar.gz"
cd "${DORIS_HOME}/contrib/datasketches-cpp"
Collaborator

Why is this maintained through contrib instead of thirdparty?

AggregateFunctionPtr create_aggregate_function_datasketches_hll_union_agg(
        const std::string& name, const DataTypes& argument_types, const DataTypePtr& result_type,
        const bool result_is_nullable, const AggregateFunctionAttr& attr) {
    return creator_with_type_list<TYPE_STRING, TYPE_VARCHAR, TYPE_BINARY, TYPE_VARBINARY>::create<
Collaborator

In FE, you have only registered .args(StringType.INSTANCE). Do the other types also need to be registered in FE?

@hello-stephen
Contributor

TPC-H: Total hot run time: 31164 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f4223700ce1631d3f1125e0ea6b94f05583a329a, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17633	3839	3799	3799
q2	q3	10832	1364	811	811
q4	4683	464	344	344
q5	7681	2253	2144	2144
q6	264	172	138	138
q7	924	766	630	630
q8	9384	1766	1628	1628
q9	6518	4908	4912	4908
q10	6475	2101	1773	1773
q11	440	276	244	244
q12	693	422	295	295
q13	18168	3663	2759	2759
q14	264	257	233	233
q15	q16	823	781	708	708
q17	1011	979	946	946
q18	6869	5798	5406	5406
q19	1247	1332	1254	1254
q20	582	429	271	271
q21	6113	2655	2567	2567
q22	457	366	306	306
Total cold run time: 101061 ms
Total hot run time: 31164 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4617	4514	4583	4514
q2	q3	4850	5187	4688	4688
q4	2115	2157	1403	1403
q5	4738	4704	4658	4658
q6	249	183	135	135
q7	1797	1714	1541	1541
q8	2414	2034	1898	1898
q9	7163	7172	7151	7151
q10	4545	4448	4010	4010
q11	528	379	348	348
q12	709	710	512	512
q13	3055	3383	2852	2852
q14	274	277	255	255
q15	q16	680	692	619	619
q17	1252	1249	1221	1221
q18	7346	6785	6758	6758
q19	1097	1067	1073	1067
q20	2234	2223	1942	1942
q21	5250	4587	4523	4523
q22	518	477	416	416
Total cold run time: 55431 ms
Total hot run time: 50511 ms

AggregateFunctionCreator creator =
        create_aggregate_function_datasketches_hll_union_agg<AggregateFunctionHllSketchData>;
factory.register_function_both("datasketches_hll_union_agg", creator);
factory.register_alias("datasketches_hll_union_agg", "ds_hll_union_count");
Collaborator

What is the basis for the alias? For the same behavior, sr names it ds_hll_estimate and Snowflake names it DATASKETCHES_HLL_ESTIMATE. I think the function should be registered under one of these names, or with one of them as an alias.

* as this value is used as a performance baseline in the relevant documentation.
* (https://datasketches.apache.org/docs/HLL/HllPerformance.html)
*/
static const uint8_t DEFAULT_LOG_K = 12;
Collaborator

Suggested change
- static const uint8_t DEFAULT_LOG_K = 12;
+ static constexpr uint8_t DEFAULT_LOG_K = 12;

}
try {
    hll_union_data->update(sketch_data);
} catch (...) {
Collaborator

Can we refine the exception handling? Could the error messages include the doris::Exception details and std::exception::what(), to help identify the cause of a DataSketches failure?
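
One way to read this suggestion, as a minimal sketch rather than the PR's actual code: catch `std::exception` separately so that `what()` can be appended to the message, pass the engine's own exception type through unchanged, and keep `catch (...)` only as a last resort. `SketchError`, `fake_deserialize`, and `merge_serialized` are hypothetical stand-ins, not Doris or DataSketches APIs; in the PR the rethrown type would presumably be `doris::Exception` with `ErrorCode::CORRUPTION`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for doris::Exception(ErrorCode::CORRUPTION, ...).
struct SketchError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for Sketch::deserialize: rejects empty input.
inline void fake_deserialize(const char* /*data*/, std::size_t size) {
    if (size == 0) {
        throw std::invalid_argument("input buffer is empty");
    }
}

// Wraps the library call and preserves the underlying cause in the message.
inline void merge_serialized(const char* data, std::size_t size) {
    try {
        fake_deserialize(data, size);
    } catch (const SketchError&) {
        throw;  // already the engine's own error type: pass it through unchanged
    } catch (const std::exception& e) {
        throw SketchError(std::string("HLL sketch data corrupted when add: ") + e.what());
    } catch (...) {
        throw SketchError("HLL sketch data corrupted when add: unknown error");
    }
}
```

With this shape, a failure surfaces as "HLL sketch data corrupted when add: input buffer is empty" instead of a generic corruption message.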

@hello-stephen
Contributor

TPC-DS: Total hot run time: 170760 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f4223700ce1631d3f1125e0ea6b94f05583a329a, data reload: false

query5	4312	663	517	517
query6	335	224	199	199
query7	4310	533	288	288
query8	328	231	213	213
query9	8850	3996	3971	3971
query10	447	331	296	296
query11	5849	2419	2267	2267
query12	185	125	138	125
query13	1276	616	449	449
query14	6084	5365	5068	5068
query14_1	4385	4398	4360	4360
query15	215	209	189	189
query16	1045	469	478	469
query17	1192	701	580	580
query18	2677	477	358	358
query19	215	207	164	164
query20	140	135	130	130
query21	215	138	116	116
query22	13580	13577	13396	13396
query23	17167	16354	16023	16023
query23_1	16182	16160	16140	16140
query24	7407	1761	1309	1309
query24_1	1293	1301	1299	1299
query25	539	474	421	421
query26	1322	320	191	191
query27	2651	573	346	346
query28	4427	1939	1926	1926
query29	978	621	511	511
query30	311	230	190	190
query31	1122	1072	959	959
query32	86	73	76	73
query33	539	345	277	277
query34	1184	1156	685	685
query35	753	795	685	685
query36	1337	1355	1148	1148
query37	158	101	88	88
query38	3219	3093	3051	3051
query39	936	916	896	896
query39_1	880	876	867	867
query40	235	148	126	126
query41	67	63	61	61
query42	109	111	110	110
query43	327	323	285	285
query44	
query45	216	201	191	191
query46	1058	1256	752	752
query47	2254	2320	2153	2153
query48	396	419	285	285
query49	624	486	382	382
query50	1001	344	250	250
query51	4285	4258	4234	4234
query52	107	105	96	96
query53	259	286	205	205
query54	309	281	261	261
query55	90	87	88	87
query56	307	294	304	294
query57	1440	1382	1290	1290
query58	317	268	253	253
query59	1566	1608	1372	1372
query60	323	325	314	314
query61	159	150	149	149
query62	681	616	562	562
query63	244	204	206	204
query64	2386	806	627	627
query65	
query66	1695	495	373	373
query67	30126	29993	29946	29946
query68	
query69	465	351	318	318
query70	1000	992	1019	992
query71	326	287	277	277
query72	3284	2687	2411	2411
query73	842	734	424	424
query74	5106	4932	4735	4735
query75	2656	2610	2256	2256
query76	2284	1148	734	734
query77	400	402	320	320
query78	12185	12075	11641	11641
query79	1542	1072	738	738
query80	645	530	451	451
query81	457	276	239	239
query82	1367	158	123	123
query83	349	279	260	260
query84	263	142	111	111
query85	883	554	461	461
query86	409	326	349	326
query87	3425	3350	3214	3214
query88	3559	2684	2630	2630
query89	453	379	337	337
query90	1986	181	182	181
query91	175	166	160	160
query92	80	79	72	72
query93	1511	1406	872	872
query94	545	341	297	297
query95	674	470	350	350
query96	1021	778	328	328
query97	2680	2684	2552	2552
query98	238	226	232	226
query99	1078	1111	1010	1010
Total cold run time: 253511 ms
Total hot run time: 170760 ms

Comment on lines +129 to +153
namespace detail {
/** The structure for the delegation work to add one element to the `datasketches_hll_union_agg` aggregate functions.
  * Used for partial specialization to add strings.
  */
template <PrimitiveType T, typename Data>
struct OneAdder {
    static void ALWAYS_INLINE add(Data& data, const IColumn& column, size_t row_num) {
        if constexpr (is_string_type(T) || is_varbinary(T)) {
            StringRef value = column.get_data_at(row_num);
            if (value.empty()) {
                throw Exception(ErrorCode::CORRUPTION,
                                "HLL sketch data corrupted when add: empty input.");
            }
            try {
                using Sketch = typename Data::Sketch;
                using Alloc = typename Data::Alloc;
                Sketch sketch_data = Sketch::deserialize(value.begin(), value.size, Alloc());
                data.merge(sketch_data);
            } catch (...) {
                throw Exception(ErrorCode::CORRUPTION, "HLL sketch data corrupted when add.");
            }
        }
    }
};
} // namespace detail
Collaborator

detail::OneAdder is used only by this aggregate function, yet it lives in a separate namespace, so the benefit of the abstraction is not obvious.

Could it become a private static function of AggregateFunctionDataSketchesHllUnionAgg, to make the call relationship and scope more intuitive?
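
The shape being proposed can be sketched roughly as follows. This is a simplified illustration with hypothetical names (`UnionAggSketch`, `add_one`), where a byte counter stands in for the real deserialize-and-merge step; it is not the PR's class.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified aggregate: the per-row helper is a private static member
// instead of a free-standing detail::OneAdder struct.
class UnionAggSketch {
public:
    // Public entry point, analogous in role to the aggregate function's add().
    void add(const std::vector<std::string>& column, std::size_t row_num) {
        add_one(total_bytes_, column, row_num);
    }

    std::size_t total_bytes() const { return total_bytes_; }

private:
    // Only this class can call the helper, so its scope matches its single use.
    static void add_one(std::size_t& state, const std::vector<std::string>& column,
                        std::size_t row_num) {
        state += column[row_num].size();  // placeholder for deserialize + merge
    }

    std::size_t total_bytes_ = 0;
};
```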

class AggregateFunctionDataSketchesHllUnionAgg final
        : public IAggregateFunctionDataHelper<Data,
                                              AggregateFunctionDataSketchesHllUnionAgg<T, Data>>,
          VarargsExpression,
Collaborator

Suggested change
- VarargsExpression,
+ UnaryExpression,

public class DataSketchesHllUnionAgg extends NotNullableAggregateFunction
        implements UnaryExpression, ExplicitlyCastableSignature, FunctionTrait, RollUpTrait {
    public static final List<FunctionSignature> SIGNATURES = ImmutableList.of(
            FunctionSignature.ret(BigIntType.INSTANCE).args(StringType.INSTANCE)
Collaborator

What is the reasoning for returning BigInt? Both Snowflake and the third-party library's get_estimate API return a double.
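
For context, a minimal sketch of what a BIGINT return type implies: the floating-point estimate has to be converted to an integer somewhere. `estimate_to_bigint` is a hypothetical helper, and whether the PR rounds or truncates is not visible in this excerpt.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: converts a cardinality estimate (a double, as get_estimate
// is said to return in the comment above) to a 64-bit integer by rounding to nearest.
inline std::int64_t estimate_to_bigint(double estimate) {
    return static_cast<std::int64_t>(std::llround(estimate));
}
```

Returning the double directly would avoid this step and keep the fractional part of the estimate visible to the caller.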

struct OneAdder {
    static void ALWAYS_INLINE add(Data& data, const IColumn& column, size_t row_num) {
        if constexpr (is_string_type(T) || is_varbinary(T)) {
            StringRef value = column.get_data_at(row_num);
Collaborator

Use assert_cast<PrimitiveTypeTraits<T>::ColumnType>(column) before calling get_data_at to avoid the overhead of a virtual function call.
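
The idea can be sketched with simplified stand-ins. `IColumn` and `ColumnString` below are hypothetical miniature versions of the Doris classes, and `static_cast` stands in for `assert_cast`, which as far as I understand also verifies the type in debug builds. Because the concrete class is `final`, a call through it can be resolved at compile time.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, heavily simplified stand-ins for the column classes.
struct IColumn {
    virtual ~IColumn() = default;
    virtual std::string_view get_data_at(std::size_t row) const = 0;
};

struct ColumnString final : IColumn {
    std::vector<std::string> rows;
    std::string_view get_data_at(std::size_t row) const override { return rows[row]; }
};

// Downcast once to the concrete final type; the call through it no longer needs
// virtual dispatch. The caller must guarantee the column really is a ColumnString.
inline std::string_view read_row(const IColumn& column, std::size_t row) {
    const auto& concrete = static_cast<const ColumnString&>(column);
    return concrete.get_data_at(row);
}
```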

@linrrzqqq
Collaborator

/review

Contributor

@github-actions github-actions Bot left a comment

I found one correctness issue that should be fixed before merge.

Critical checkpoint conclusions:

  • Goal/test coverage: the PR adds a DataSketches HLL union aggregate with BE/FE registration and regression/unit tests, but coverage misses valid low-lg_k input sketches.
  • Scope/focus: the change is mostly focused on the new aggregate and build dependency wiring.
  • Concurrency/lifecycle: no new shared concurrency path was identified; aggregate state lifecycle is handled through create/destroy/reset, but the initial union construction has a validity bug.
  • Configuration/compatibility: no new config or persisted storage format is introduced; table aggregation-type exposure from earlier revisions appears removed.
  • Parallel paths: BE registers string/varchar/binary/varbinary while FE currently exposes string only; this was already raised in an existing thread, so I did not duplicate it.
  • Error handling/data correctness: invalid inputs are converted to CORRUPTION, but one valid DataSketches input range is also converted to failure because of the union lg_k initialization.
  • Memory/observability: allocator-aware DataSketches state is now used; no additional observability requirement found for this aggregate.
  • Regression standards: result queries are deterministic; negative cases use test/exception.
  • User focus: no additional user-provided review focus was specified.


void merge(const Sketch& sketch_data) {
    if (!hll_union_data.has_value()) {
        hll_union_data.emplace(sketch_data.get_lg_config_k(), Alloc());
Contributor

This initializes hll_union_alloc with the input sketch's lg_config_k, but DataSketches allows HLL sketches with lg_config_k down to 4 while hll_union_alloc requires lg_max_k to be at least 7. A valid serialized sketch produced with new HllSketch(4|5|6, HLL_*) will deserialize successfully, then this constructor throws and add()/read() report CORRUPTION for valid data. Please clamp the union lg_k to the union API minimum (or use the library constant if available) and add a unit/regression case for a valid low-lg_k sketch.
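
A minimal sketch of the suggested clamp, under stated assumptions: the lower bound of 7 comes from the comment above, the upper bound of 21 is an assumption, and the names `kUnionMinLgK`, `kUnionMaxLgK`, and `union_lg_k_for` are hypothetical. If the library exposes constants for these limits, they would be preferable to hard-coded values.

```cpp
#include <algorithm>
#include <cstdint>

// Assumed bounds, not read from library headers: per the review comment, sketches
// accept lg_config_k down to 4 while the union requires lg_max_k of at least 7.
constexpr std::uint8_t kUnionMinLgK = 7;   // hypothetical name
constexpr std::uint8_t kUnionMaxLgK = 21;  // assumed upper bound

// Returns an lg_max_k that stays within the range the union constructor accepts.
constexpr std::uint8_t union_lg_k_for(std::uint8_t sketch_lg_k) {
    return std::clamp(sketch_lg_k, kUnionMinLgK, kUnionMaxLgK);
}
```

With such a helper, the `emplace` call would take `union_lg_k_for(sketch_data.get_lg_config_k())` instead of the raw value, so a valid sketch built with lg_config_k of 4, 5, or 6 would no longer be reported as corrupted.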

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 89.29% (100/112) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.54% (20750/38755)
Line Coverage 37.18% (196207/527769)
Region Coverage 33.50% (153754/458914)
Branch Coverage 34.52% (66988/194070)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 90.18% (101/112) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.63% (27950/37958)
Line Coverage 57.57% (303060/526417)
Region Coverage 54.79% (253861/463336)
Branch Coverage 56.30% (109666/194797)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 71.43% (15/21) 🎉
Increment coverage report
Complete coverage report
