[CBRD-26900] Evaluate eligible after-join predicates in the hash join probe loop #7269

Open
youngjinj wants to merge 41 commits into CUBRID:develop from youngjinj:CBRD-26900

@youngjinj youngjinj commented Jun 9, 2026

Contributor

http://jira.cubrid.org/browse/CBRD-26900

Purpose

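A hash join evaluates its residual conditions (the non-equi conditions of an inner join, the after-join conditions of an outer join) while the parent scan re-reads the join result stored as a list file, so even tuples that will be filtered out are first written and then read once more. This change evaluates the residual conditions that can be evaluated in the probe phase as soon as the hash key matches, and filters out tuples that fail them before they are written to the list file.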

Implementation

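  1. The optimizer classifies join conditions by evaluation site and passes the residual conditions that can be evaluated in the probe phase to the hash join node (after_join_pred). These conditions are removed from the parent scan so that each one is evaluated in exactly one place; conditions that need the final result to be determined, or that require subquery execution, stay on the parent scan as before.
  2. The executor evaluates the residual conditions in the probe phase and filters out non-qualifying tuples before they are written to the list file. This applies to both the inner probe and the outer probe, and preserves the null-padding semantics of outer joins.
  3. The same evaluation is applied to the parallel hash join (px_hash_join) path.
  4. (regression) Fix the parallel gather scan (px_scan), where after_join_pred was not evaluated and tuples were therefore not filtered.
  5. (regression) Explicitly add a final sort step (SORT_ORDERBY) to hash join and sort-merge join plans that need a final sort, so that the plan output matches actual execution. The order-by skip of the sub-plan is kept, which preserves partial-range processing.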
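
As a rough illustration of step 2 for an inner join, the sketch below counts how many joined tuples would reach the list file with and without probe-time filtering. The names `row_t`, `residual_ok`, and `probe_count_written` are hypothetical, a linear scan stands in for the hash lookup, and none of this is CUBRID's actual executor code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-column row; not CUBRID's tuple format. */
typedef struct
{
  int key;                      /* equi-join column (the hash key) */
  int val;                      /* column used by the residual condition */
} row_t;

/* Residual (non-equi) condition, e.g. t1.c < t2.d. */
static int
residual_ok (const row_t * outer, const row_t * inner)
{
  return outer->val < inner->val;
}

/*
 * Count the joined tuples that would be written to the list file.
 * With filter_in_probe == 0 every key match is written (the residual is
 * left to a later scan); with filter_in_probe != 0 the residual is checked
 * at key-match time and failing pairs are never written.
 */
static size_t
probe_count_written (const row_t * outer, size_t n_outer,
                     const row_t * build, size_t n_build, int filter_in_probe)
{
  size_t written = 0;

  assert (outer != NULL && build != NULL);
  for (size_t i = 0; i < n_outer; i++)
    {
      for (size_t j = 0; j < n_build; j++)
        {
          if (outer[i].key != build[j].key)
            {
              continue;         /* linear scan stands in for a hash lookup */
            }
          if (filter_in_probe && !residual_ok (&outer[i], &build[j]))
            {
              continue;         /* filtered before it reaches the list file */
            }
          written++;
        }
    }
  return written;
}
```

For outer rows (1,5), (1,20), (2,7) and build rows (1,10), (2,3), every outer row finds a key match, but only (1,5) paired with (1,10) satisfies the residual, so the number of written tuples drops from 3 to 1.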

Remarks

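  • No new fields are added to the plan or to XASL. Residual conditions assigned to the probe phase reuse the existing after_join_pred field.
  • The need_final_sort-related optimizer change and the px_scan change fix existing regressions found during this work. Both problems occur with hash joins and are closely related, so they are resolved here together.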

youngjinj and others added 28 commits June 4, 2026 15:43
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-padding

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er_probe

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(inner)

For INNER hash joins, move two-input residual conditions (incl. non-equi /
range join conditions such as t1.c < t2.d) from the parent buildlist's
if_pred into the hash-join probe loop (proc->probe_pred).

qo_collect_hashjoin_probe_terms() selects from plan->sarged_terms the terms
that reference only the two join inputs and are not already realized as hash
keys / join edges, excluding inst_num()/rownum (TOTALLY_AFTER_JOIN) and
correlated-subquery terms. gen_hashjoin() prunes the selected terms from both
plan->sarged_terms and the local pred_set copy that feeds the parent list
scan's if_pred, so the residual is evaluated in exactly one place.

qo_init_projection_info() is extended so the probe terms' columns are added to
outer/inner pred_list (regu_list_pred coverage, fetched into val_descr at probe
time) and to the build/probe projection (name_list), without affecting the
hash-join's final output. Outer joins are untouched (their inter-table ON
condition lives in during/after_join_terms, not sarged_terms; guarded
explicitly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… pushdown

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
For LEFT/RIGHT OUTER hash joins, move the two-input WHERE residual conditions
that live in plan->plan_un.join.after_join_terms (e.g. "t1 LEFT JOIN t2 ON
t1.a=t2.b WHERE t2.d is null or t1.c < t2.d") into the hash-join probe loop
(proc->probe_pred), instead of evaluating them in the parent buildlist's
after_join_pred during the second scan.

after_join terms are applied AFTER null-padding in the existing second scan;
the outer probe applies probe_pred to the final (matched or null-padded) tuple
with the same semantics (build side cleared to NULL on null-fill), so the result
is unchanged. ON-clause conditions (during_join_terms) drive matching /
null-padding and are left untouched.
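
A minimal sketch of the semantics described above, assuming the example query `t1 LEFT JOIN t2 ON t1.a = t2.b WHERE t2.d IS NULL OR t1.c < t2.d`. The types and function names are hypothetical and a nested loop stands in for the hash probe. The point is that the WHERE term is applied to the final tuple, matched or null-padded, and that a matched row which fails it is dropped rather than null-padded.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int is_null; int v; } val_t;   /* hypothetical nullable int */
typedef struct { int a; int c; } outer_row_t;   /* t1 (a, c) */
typedef struct { int b; int d; } inner_row_t;   /* t2 (b, d) */

/* WHERE t2.d IS NULL OR t1.c < t2.d, evaluated on the final tuple. */
static int
after_join_ok (int c, val_t d)
{
  return d.is_null || c < d.v;
}

/* Returns the number of rows the LEFT OUTER probe would emit. */
static size_t
left_outer_probe (const outer_row_t * t1, size_t n1,
                  const inner_row_t * t2, size_t n2)
{
  size_t emitted = 0;

  assert (t1 != NULL && t2 != NULL);
  for (size_t i = 0; i < n1; i++)
    {
      int matched = 0;

      for (size_t j = 0; j < n2; j++)
        {
          if (t1[i].a != t2[j].b)
            {
              continue;
            }
          matched = 1;          /* the ON clause alone decides matching */
          val_t d = { 0, t2[j].d };
          if (after_join_ok (t1[i].c, d))
            {
              emitted++;
            }
        }
      if (!matched)
        {
          val_t d = { 1, 0 };   /* build side cleared to NULL on null-fill */
          if (after_join_ok (t1[i].c, d))
            {
              emitted++;
            }
        }
    }
  return emitted;
}
```

With t1 rows (1,5), (2,50), (3,7) and t2 rows (1,10), (2,10), the first row matches and passes, the second matches but fails the WHERE term and is dropped, and the third has no match, is null-padded, and passes through `t2.d IS NULL`.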

qo_hashjoin_probe_term_eligible() factors out the per-term eligibility test
shared by the INNER (sarged_terms) and the new OUTER (after_join_terms)
collectors: term references only the two join inputs, no correlated subquery,
not output-position dependent (inst_num/rownum), not already a hash key / join
edge, and only PT_NAME segments (fetchable through regu_list_pred at probe time).

qo_collect_hashjoin_probe_after_terms() collects eligible terms for JOIN_LEFT /
JOIN_RIGHT only; FULL OUTER (JOIN_OUTER) is excluded since the serial outer probe
never reaches it. gen_hashjoin() prunes the collected terms from both
plan->plan_un.join.after_join_terms and the local pred_set copy (which unions
after_join_terms in gen_outer) that feeds the parent after_join_pred, so each is
evaluated in exactly one place, and unions them into the probe_terms bitset that
already flows through qo_init_projection_info (column coverage) and
make_hashjoin_proc (combined probe predicate via is_always_true).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
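
The eligibility rules listed in the commit message above can be sketched as a single predicate over a per-term summary. `term_info_t` and its fields are hypothetical stand-ins chosen for illustration; the real `qo_hashjoin_probe_term_eligible()` operates on the optimizer's own term and segment structures, which are not shown in this conversation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-term summary; field names are illustrative only. */
typedef struct
{
  uint32_t nodes;               /* bit i set: the term references join node i */
  int has_correlated_subquery;
  int uses_instnum;             /* inst_num()/rownum: output-position dependent */
  int is_hash_key;              /* already realized as a hash key / join edge */
  int all_segments_plain;       /* every segment is a plain column (PT_NAME) */
} term_info_t;

/* Returns 1 when the term may be evaluated in the probe loop, else 0. */
static int
probe_term_eligible (const term_info_t * t, uint32_t join_inputs)
{
  assert (t != NULL);
  if ((t->nodes & ~join_inputs) != 0)
    {
      return 0;                 /* references a node outside the two inputs */
    }
  if (t->has_correlated_subquery || t->uses_instnum)
    {
      return 0;
    }
  if (t->is_hash_key)
    {
      return 0;                 /* already applied by the hash match itself */
    }
  if (!t->all_segments_plain)
    {
      return 0;                 /* not fetchable as a plain column at probe time */
    }
  return 1;
}
```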
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…probe passes

A LEFT OUTER hash join carrying both an ON-clause non-equi term
(during_join_pred) and a WHERE term referencing inner-table columns
(probe_pred) crashed the server with an or_advance assertion abort.

qo_init_projection_info builds outer/inner pred_list in two passes: the
during-join pass and the probe-term pass. A column referenced by both
(e.g. t1.c, t2.d in "ON t1.a=t2.b AND t1.c<t2.d WHERE t2.d IS NULL OR
t1.c<t2.d") was appended to pred_list twice, producing a regu_list_pred
with duplicate TYPE_POSITION entries (pos_no 2, 2).

fetch_peek_dbval_pos walks a single forward-only OR_BUF iterator over the
regu list, assuming non-decreasing AND effectively distinct positions:
after consuming pos 2 it advances the value index to 3, then the second
pos-2 regu forces qfile_locate_tuple_next_value to read a non-existent
4th value of a 3-value tuple, overrunning the buffer and tripping
or_advance's assert (object_representation.h:1478).

Fix: track segments already added to each side's pred_list with two
bitsets and skip duplicates in both passes, so pred_list holds one entry
per segment, matching the bitset-built name_list and yielding distinct,
monotonic regu positions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
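
The duplicate-skipping fix can be illustrated with a small sketch: two passes append segment ids to the same list, and a bitset of already-added ids prevents a second entry for the same segment. `append_unique` is a hypothetical helper limited to ids below 64. It guarantees distinct entries only; the monotonic ordering mentioned above comes from how the real lists are built, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Append the segment ids of one pass to pred_list, skipping ids that are
 * already recorded in *seen. Returns the new list length.
 */
static size_t
append_unique (int *pred_list, size_t n, size_t cap, uint64_t * seen,
               const int *segs, size_t n_segs)
{
  assert (pred_list != NULL && seen != NULL && segs != NULL);
  for (size_t i = 0; i < n_segs; i++)
    {
      assert (segs[i] >= 0 && segs[i] < 64);
      uint64_t bit = (uint64_t) 1 << segs[i];

      if (*seen & bit)
        {
          continue;             /* already added by this or an earlier pass */
        }
      assert (n < cap);
      *seen |= bit;
      pred_list[n++] = segs[i];
    }
  return n;
}
```

Calling it once with the during-join segments {0, 2} and once with the probe-term segments {2, 3} leaves the list as 0, 2, 3 rather than 0, 2, 2, 3.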
The correlated-subquery result cache (pt_make_sq_cache_key_struct) walks
every predicate attached to the cached XASL - spec preds, if_pred,
during_join_pred, after_join_pred - so the key is exhaustive over each
DB_VALUE the subquery can read. The HASHJOIN_PROC residual predicate
proc.hashjoin.probe_pred was the one pred field omitted from that walk.

Today this is not a correctness bug: qo_hashjoin_probe_term_eligible
excludes any term carrying a correlated subquery and requires the term to
reference only the two join inputs, so probe_pred can never hold a
correlated value that the key would otherwise miss (verified: a correlated
hash-join subquery whose two-input residual is pushed to probe_pred while
the correlated term stays in if_pred returns results identical to the
nested-loop ground truth, including on repeated correlated key values).

Add probe_pred to the key walk anyway, guarded by type == HASHJOIN_PROC,
so the key stays complete and a future relaxation of the push-eligibility
rules cannot silently cause a cache-key collision.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…and docs

Plan visibility: gen_hashjoin removes the pushed residual terms from
plan->sarged_terms and plan->plan_un.join.after_join_terms before the plan
dump runs (pt_to_buildlist_proc -> qo_to_xasl -> gen_hashjoin precedes
qo_plan_dump in pt_to_xasl), so a pushed predicate such as "t1.c < t2.d"
disappeared from every structural plan section - it survived only in the
rewritten "Query stmt:" text. A DBA could not see where the residual was
evaluated.

Record the pushed terms on the plan in a new join.probe_terms bitset and
print them as a "probe:" line in qo_plan_print_outer_join_terms, mirroring
the existing "during:" / "after:" lines. The detailed plan dump now shows
e.g. "edge: term[1]" / "probe: term[0]" for an inner hash join with a
non-equi residual. Also free hash_terms in qo_join_free (previously
init'd but never released).

Naming clarity (mechanical):
  - local after_probe_terms              -> probe_after_join_terms
  - qo_collect_hashjoin_probe_after_terms -> qo_collect_hashjoin_after_join_probe_terms

Wording: the eligibility guard rejects a term only when one of its
SEGMENTS is not a plain column (PT_NAME); expressions built over plain
columns (e.g. "t1.c between t2.d-10 and t2.d+10", "upper(t1.s)=t2.s2")
have only PT_NAME segments and ARE pushed. Correct the misleading comment
in qo_hashjoin_probe_term_eligible and the spec section 6 exclusions to
match this actual, verified behavior.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The parallel gather scan's qualification loop drain_slot_oids evaluated
only m_xasl->if_pred before writing rows, silently dropping
after_join_pred. LEFT OUTER join (merge/hash) plans whose materialized
result list is scanned in parallel with an after-join WHERE returned all
matched rows instead of the filtered subset.

Mirror the serial buildlist loop (query_executor.c: after_join_pred then
if_pred): evaluate after_join_pred before if_pred in drain_slot_oids,
applying to MERGEABLE_LIST, BUILDVALUE_OPT and XASL_SNAPSHOT alike (the
eval site precedes the result-type switch). Also content-check
after_join_pred in px_scan_checker at both if_pred sites so
parallel-unsafe predicate elements disqualify the scan by the same rules.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
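
A simplified picture of the corrected qualification order, assuming predicates are plain function pointers and a NULL pointer means "no predicate". `drain_rows`, `pred_fn`, and the two sample predicates are hypothetical and do not reflect the real `drain_slot_oids` signature.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*pred_fn) (int row);       /* hypothetical: nonzero = qualifies */

static int
is_positive (int row)
{
  return row > 0;
}

static int
is_even (int row)
{
  return row % 2 == 0;
}

/* Count rows that pass after_join_pred first and if_pred second. */
static size_t
drain_rows (const int *rows, size_t n, pred_fn after_join_pred,
            pred_fn if_pred)
{
  size_t kept = 0;

  assert (rows != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    {
      if (after_join_pred != NULL && !after_join_pred (rows[i]))
        {
          continue;             /* the check the parallel path was missing */
        }
      if (if_pred != NULL && !if_pred (rows[i]))
        {
          continue;
        }
      kept++;
    }
  return kept;
}
```

For rows -2, 1, 2, 4 with `is_even` as if_pred, omitting the after-join predicate keeps three rows, while passing `is_positive` as after_join_pred keeps only two, which mirrors the "all matched rows instead of the filtered subset" symptom.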
…redicate evaluation

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ined residual set

The intermediate after_join_residual_terms bitset existed only to know what
to prune from plan_un.join.after_join_terms. sarged_terms and
after_join_terms are disjoint by construction (query_planner.c subtracts
sarg_out_terms, which includes after_join_terms, when building
sarged_terms), so both collectors can fill the single residual_terms set
and the combined set can be subtracted from each source directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
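
The set argument in this commit can be checked with a small bitmask sketch: because the two sources are disjoint, one combined residual set can be subtracted from each without tracking which source a term came from. `termset_t` and both functions are hypothetical and are not the optimizer's BITSET API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t termset_t;     /* hypothetical: bit i stands for term[i] */

static termset_t
termset_difference (termset_t a, termset_t b)
{
  return a & ~b;
}

/* Remove the pushed residual terms from both sources using one combined set. */
static void
prune_residual (termset_t * sarged, termset_t * after_join,
                termset_t residual)
{
  assert (sarged != NULL && after_join != NULL);
  assert ((*sarged & *after_join) == 0);        /* disjoint by construction */
  *sarged = termset_difference (*sarged, residual);
  *after_join = termset_difference (*after_join, residual);
}
```

With sarged terms {0, 1}, after-join terms {2, 3}, and a combined residual set {1, 2}, the result is {0} and {3}: each residual term leaves exactly the source it belonged to.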
…ity, hoist null clear

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The residual predicate is evaluated before the tuple merge at every probe
site, so the member order now mirrors the evaluation order. Serialization
is unaffected (xts/stx encode field order explicitly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mber order

Pack, size, unpack, and dump residual_pred before merge_info, matching
HASHJOIN_PROC_NODE member order. Write/read symmetry is preserved (both
sides moved together); XASL streams are transient within one build, so no
compatibility concern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…MANAGER

residual_pred was listed under the "Pointer to a member of XASL_NODE"
block, but it points to a HASHJOIN_PROC_NODE member. Move it into its own
provenance group, matching the struct's existing comment style, and mirror
the same grouping at the hjoin_init_manager assignment site.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… after_join_pred

Drop the dedicated HASHJOIN_PROC_NODE.residual_pred field and store the
residual conditions pushed into the hash-join probe loop on the HASHJOIN
xasl node's own after_join_pred slot instead.

A reachability audit of every after_join_pred reader confirms this is safe
and strictly simpler: the HASHJOIN node sits on the parent's aptr_list, has
no spec_list, and never runs the generic scan loop, so the scan-loop
evaluators of after_join_pred are unreachable by construction on this node.
The only readers that see the field (px_scan_checker, px_query_checker) are
content/eligibility observers, not result-affecting evaluators.

make_hashjoin_proc now reuses add_after_join_predicate (mirroring the
during_join_pred site) to store the pushed residual, and hjoin_init_manager
sources HASHJOIN_MANAGER.residual_pred from xasl->after_join_pred. An
assert (xasl->spec_list == NULL) guards the invariant that a HASHJOIN node
never gains a scan-loop execution; otherwise the generic scan loop would
double-evaluate after_join_pred alongside the probe.

Because the slot is a generic xasl-node field, ALL the dedicated residual_pred
plumbing is deleted, not migrated -- the generic per-node paths already cover
the HASHJOIN node:
  - serialization/deserialization (xts_process/stx_build hashjoin proc)
  - the two clears in qexec_clear_xasl / qexec_clear_xasl_for_parallel_aptr
    (generic after_join_pred clear runs in the unconditional is_final block)
  - the HASHJOIN-gated sq-cache-key block (generic after_join_pred block is
    reached via the aptr_list SQ_TYPE_XASL recursion)
  - the dedicated qdump [residual_pred] print (generic after_join_pred print
    in qdump_print_xasl covers it)

The runtime "residual" terminology is kept everywhere (HASHJOIN_MANAGER /
HASHJOIN_CONTEXT.residual_pred copies, px spawn get_residual_pred /
m_residual_pred, optimizer residual_terms bitsets, the residual: plan label);
only the XASL storage source changed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The pushed hash-join residual predicate is stored in the HASHJOIN xasl
node's after_join_pred slot. Rename the runtime copies of that predicate
(HASHJOIN_CONTEXT/HASHJOIN_MANAGER fields, the spawn_manager member and
accessor, and the null-fill helper) from residual_pred to after_join_pred
so the copies mirror their XASL source field name, matching the existing
during_join_pred convention.

Optimizer-side "residual" classification terminology is unchanged
(residual_terms bitset, qo_collect_hashjoin_residual_terms and friends,
the plan "residual:" label). Stale comment references to the renamed
runtime field were updated to track the new name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lumbing sites

The generic xasl-node clear and dump paths need no local commentary; the
design is documented in the spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…al pushdown

Prune only the local pred_set copy when pushing hash-join residual terms into
the probe loop. The audit confirms this is sufficient for exactly-once
evaluation: gen_hashjoin's pred_set is the local predset built by gen_outer and
is the only source for this join node's parent if_pred/after_join_pred (built in
init_list_scan_proc); the plan's sarged_terms/after_join_terms are never read to
build that parent predicate after gen_hashjoin runs. inst_num()/rownum terms are
excluded from collection and make_outer_instnum is not on the hash-join path, so
instnum handling is unaffected.

Stop mutating plan->sarged_terms and plan->plan_un.join.after_join_terms, and
drop the plan_un.join.residual_terms record together with its "residual:" dump
label; the pushed terms now remain visible in their original sargs:/after:
sections, and the probe-time evaluation site is observable via PROBE row counts
in the server trace. This also removes the single-invocation fragility that the
plan-bitset mutation forced. The hash_terms bitset_delset leak fix in
qo_join_free is retained. Spec updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fold the two-input residual-term classification into
qo_init_projection_info and carry the pushed predicate on the HASHJOIN
node's after_join_pred slot (no new plan/XASL fields).

Executor:
- Unify serial and parallel predicate evaluation through a shared
  hjoin_eval_pred(), memoizing per-tuple fetches with is_ready.
- Inline the build-side null-fill; drop fill_qualified for direct
  ev_res branching.
- Fix the parallel outer null-fill branch: the after-join early
  continue skipped the need_skip_next reset, leaving the flag set so
  the next hjoin_fetch_key aborted (debug) or mis-processed a valid
  probe row (release). Reset need_skip_next at the top of the branch.

Optimizer: gate the ORDER BY skip on need_final_sort (qo_top_plan_new,
qo_plan_is_orderby_skip_candidate, qo_plan_cmp) so hash/merge-join
plans no longer drop the sort from the dump while execution sorts.

px_scan: order the after_join_pred/if_pred check clusters to match
drain_slot_oids and drop the now-redundant SYNC GUARD comments.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@youngjinj youngjinj self-assigned this Jun 9, 2026
@github-actions

github-actions Bot commented Jun 9, 2026


❌ TC Merge Gate — Merge Blocked

One or more TC PRs are still open. Please merge or close them before merging this PR.

TC Repositories & Branches:

  • cubrid-testcases: TC PR tc/pr-7269 is open (draft) — must be merged or closed first
  • cubrid-testcases-private-ex: TC PR tc/pr-7269 is open (draft) — must be merged or closed first

Steps to unblock:

  1. Merge or close all TC PRs listed above.
  2. Re-run this check: Actions tab → TC Merge Gate → Re-run failed jobs

@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Contributor

Reviews (1): Last reviewed commit: "Merge remote-tracking branch 'upstream/d..."

Comment thread src/query/query_hash_join.c
Comment thread src/query/parallel/px_hash_join/px_hash_join_task_manager.cpp
Comment thread src/optimizer/plan_generation.c
Comment thread src/optimizer/plan_generation.c
youngjinj added a commit to CUBRID/cubrid-testcases that referenced this pull request Jun 10, 2026
…D/cubrid#7269

Query plans now show an explicit SORT (order by) step where the
ORDER BY skip optimization no longer applies.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@youngjinj

Contributor Author

/run all

@youngjinj

Contributor Author

/run all

…parallel fallback

The single-thread fallback path is guarded by degree < 2, so degree can be
0 or 1. The assert (degree == 0) wrongly excluded the degree == 1 case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@youngjinj

Contributor Author

/run all

@youngjinj

Contributor Author

/run all

youngjinj added a commit to CUBRID/cubrid-testcases that referenced this pull request Jun 23, 2026
…CUBRID/cubrid#7269

Cases #4 (multiple tables), #5 (inline views) and #8 (json output) were
missed in the earlier partial update (3bcd97b). The ORDER BY skip
optimization no longer applies over the hash join, so the plan now shows
an explicit temp(order by) / SORT (order by) step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@youngjinj

Contributor Author

/run all

youngjinj and others added 2 commits June 24, 2026 12:56
The during/after-join predicates of a hash join reference columns from both
inputs. pt_to_pred_expr resolves a list-file column to its DB_VALUE through the
column's table_info, but each column's live value lives in its input
buildlist_proc's val_list at its name_list position - the slot
proc.regu_list_pred fetches into. The two can disagree:
  - a simple-spec input whose val_list / attribute_list orderings differ (an
    input that also projects an expression column), or
  - a nested-plan input whose buildlist val_list is no spec's value_list at all.
In both cases the predicate read an unfetched, stale DB_VALUE and produced
wrong results vs nested-loop (e.g. TPC-H Q7 returned 0 rows under use_hash).

Bind the predicate columns through a combined listfile context (outer ++ inner
name_list / value_list, the value_list sharing the inputs' buildlist DB_VALUEs)
installed on the symbol table while generating the during/after-join
predicates, so pt_to_pred_expr resolves each column to exactly the slot
regu_list_pred fetches. Restored after generation and on the error path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…shjoin_proc

Inline the single-use make_hashjoin_listfile_val_list helper, snapshot/restore
the symbol table through one SYMBOL_INFO (save_symbol) instead of four save_*
fields, and fold the during/after predicate generation into the listfile-context
block with a single restore point (dropping the listfile_ctx_set flag).
No behavior change; clarify comments.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@youngjinj

Contributor Author

/run all

qdump_print_xasl recursed into a buildlist's eptr_list but not into a
CTE_PROC's inner trees, so xasl_debug_dump never showed the plan inside
a CTE. Recurse into proc.cte.non_recursive_part / recursive_part so a
materialized CTE's inner plan (hash joins, etc.) is dumped in tree
position under the cte_proc node.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
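
The traversal change can be pictured with a toy tree walk. `node_t`, its fields, and `count_dumped` are hypothetical simplifications: a single `eptr` pointer stands in for eptr_list, and counting visited nodes stands in for printing them.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NODE_BUILDLIST, NODE_CTE, NODE_HASHJOIN } node_type_t;

typedef struct node
{
  node_type_t type;
  struct node *eptr;                    /* stand-in for eptr_list */
  struct node *non_recursive_part;      /* used only when type == NODE_CTE */
  struct node *recursive_part;          /* used only when type == NODE_CTE */
} node_t;

/* Returns how many nodes a dump starting at n would visit. */
static int
count_dumped (const node_t * n)
{
  int count;

  if (n == NULL)
    {
      return 0;
    }
  /* Sketch invariant: only CTE nodes carry CTE parts. */
  assert (n->type == NODE_CTE
          || (n->non_recursive_part == NULL && n->recursive_part == NULL));

  count = 1;
  count += count_dumped (n->eptr);
  if (n->type == NODE_CTE)
    {
      /* The added recursion: descend into the CTE's inner plans. */
      count += count_dumped (n->non_recursive_part);
      count += count_dumped (n->recursive_part);
    }
  return count;
}
```

Without the `NODE_CTE` branch, a CTE node whose non-recursive part is a hash join would report one visited node instead of two, which is the "never showed the plan inside a CTE" symptom.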
youngjinj and others added 2 commits June 30, 2026 15:55
# Conflicts:
#	src/query/query_hash_join.c
…ash join

Apply code_style.sh and unify the need_skip_next = false comment as
/* init */ in px_hash_join_task_manager.cpp.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@youngjinj

Contributor Author

/run all

@youngjinj

Contributor Author

/run all

@youngjinj youngjinj marked this pull request as ready for review July 1, 2026 00:48
@greptile-apps

greptile-apps Bot commented Jul 1, 2026

Contributor

Reviews (2): Last reviewed commit: "Merge remote-tracking branch 'upstream/d..."
