GH-49985: Duplicate function aliases with same parameters#49986
Open
lriggs wants to merge 10 commits into
Open
GH-49985: Duplicate function aliases with same parameters#49986lriggs wants to merge 10 commits into
lriggs wants to merge 10 commits into
Conversation
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale for this change
Silently allowing duplicate functions that only differ by return type provides a confusing function registry. Callers do not typically expect the return type to be part of determining which function to call leading to ambiguity.
It is possible to register different Gandiva functions with the same alias and parameters but different return types, resulting in confusing function overloads.
For example, DATE_EXTRACTION_TRUNCATION_FNS in [cpp/src/gandiva/function_registry_datetime.cc] was invoked twice with the same SQL alias lists — once for extract* (returns int64) and once for date_trunc_* (returns the input date/timestamp type):
As a result the registry contained four entries for day(...) where there should have been two:
The same problem existed for every calendar-unit alias: year, month, quarter, week, weekofyear, yearweek, dayofmonth, hour, minute, second. Resolution behavior depended on the caller's inferred return type, which is not the SQL semantics anyone expects from day(timestamp_col).
FunctionRegistry::Add was silently allowing these registrations: unordered_map::emplace keeps the first entry and discards subsequent ones with no warning.
What changes are included in this PR?
Diagnostics
Added two ARROW_LOG(ERROR) checks inside FunctionRegistry::Add:
Duplicate-signature check — when pc_registry_map_.emplace reports the entry already existed (same name + params + return type), log the conflict including both pc_names.
Alias-collision check — a new call_shape_map_ keyed on (lower(name), param_type_ids) (return type excluded) detects the case where the same name(args) could resolve to two functions with different return types. This is the check that catches the day family.
These run at registry-construction time, so future regressions surface in test output rather than being silently shadowed.
Macro split
DATE_EXTRACTION_TRUNCATION_FNS is split into two macros:
DATE_EXTRACTION_FNS — keeps the existing SQL alias lists ({"year"}, {"day", "dayofmonth"}, etc.).
DATE_TRUNCATION_FNS — every alias list is {}. The truncate functions are still reachable through their date_trunc_* base names.
Strengthened registry test
Added a third loop to TestNoDuplicates that mirrors the runtime alias-collision check: keys each registered signature on (lower(name), param_types) only and fails if two signatures share that key but disagree on return type. The two pre-existing loops still cover their original cases (same pc_name + params + ret, and same full signature appearing twice).
Cleanup
Removed the standalone NativeFunction("add", ..., date64+int64 → timestamp, "add_date64_int64"). DATE_ADD_FNS(add, {}) already provides the entry with the correct → date64 return type matching the precompiled symbol.
Dropped the {"convert_fromutf8"} and {"convert_replaceutf8"} aliases; the base names already match case-insensitively.
Are these changes tested?
Registry construction is now silent — no duplicate or alias-collision logs.
TestFunctionRegistry — 6/6 pass, including the strengthened TestNoDuplicates.
ctest -R gandiva — 4/4 binaries pass (gandiva-internals-test, gandiva-precompiled-test, gandiva-projector-test, gandiva-projector-test-static).
Are there any user-facing changes?
The Gandiva function list is different, with some aliases no longer appearing. Functionality is still available.