[fix] support hash table spill before hash probe #589

Open
guhaiyan0221 wants to merge 1 commit into bytedance:main from guhaiyan0221:fix_spill_before_hashprobe

@guhaiyan0221

Collaborator

What problem does this PR solve?

Issue Number: close #577

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

Background

After HashBuild finishes building and before HashProbe starts, the build-side hash table remains resident in HashJoinBridge waiting to be consumed by the probe side. Currently, this resident hash table memory cannot be reclaimed by the node-level memory reclaimer, which may cause task OOM under memory pressure.

This PR supports spilling the hash table that has already been published to HashJoinBridge but has not yet been consumed by HashProbe. This allows memory arbitration to reclaim this memory before HashProbe starts.

Changes
1. Support bridge-level hash table spill
  • Create a table spill callback in HashBuild::finishHashBuild() before publishing the hash table, and pass it to HashJoinBridge::setHashTable().
  • Add table spill state management in HashJoinBridge:
    • tableSpillFunc_
    • tableSpillInProgress_
    • probeStarted_
  • HashJoinBridge::reclaim() triggers resident hash table spill when all of the following conditions are met:
    • probe has not started;
    • the bridge currently has a build result;
    • there is no pending spill partition;
    • the table spill callback is still valid;
    • no table spill is currently in progress.
  • After spill succeeds, clear the resident table and publish the generated spill partitions for the probe side to restore and read.
2. Support fallback to bridge reclaim in HashJoinMemoryReclaimer
  • HashJoinMemoryReclaimer now holds the corresponding HashJoinBridge.
  • When reclaiming from the hash build operator pool is insufficient, it falls back to HashJoinBridge::reclaim() to try to reclaim the hash table that is still waiting in the bridge before probe starts.
3. Reuse spill stats accumulation logic
  • Add OperatorStats::addSpillStats() to extract and reuse the spill stats accumulation logic from addSpillDetails().
  • Bridge table spill stats are merged into operator stats through HashBuild::stats(clear).
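
For reviewers, the reclaim gating in change 1 and the fallback in change 2 can be illustrated with a minimal, single-threaded sketch. It is not the code in this PR: `BridgeSketch`, `HashTableSketch`, `SpillPartitions`, `markProbeStarted()`, `hasTable()`, and `reclaimWithFallback()` are hypothetical stand-ins, synchronization and error handling are omitted, and "insufficient" is assumed to mean "less than the requested target". Only the member names `tableSpillFunc_`, `tableSpillInProgress_`, and `probeStarted_` are taken from the description above.

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the build-side hash table.
struct HashTableSketch {
  uint64_t bytes{0};
};

// Hypothetical stand-in: ids of the partitions produced by a table spill.
using SpillPartitions = std::vector<int>;
using TableSpillFunc = std::function<SpillPartitions(const HashTableSketch&)>;

// Simplified model of the bridge state between build finish and probe start.
// Not thread-safe; a real bridge would guard this state with a lock.
class BridgeSketch {
 public:
  // Build side publishes the table together with its spill callback.
  void setHashTable(std::unique_ptr<HashTableSketch> table, TableSpillFunc fn) {
    table_ = std::move(table);
    tableSpillFunc_ = std::move(fn);
  }

  // Probe side takes over; the resident table is no longer spillable.
  void markProbeStarted() { probeStarted_ = true; }

  bool hasTable() const { return table_ != nullptr; }
  const SpillPartitions& spillPartitions() const { return spillPartitions_; }

  // Spills the resident table only if all five conditions hold.
  // Returns the number of bytes released, or 0 if nothing was spilled.
  uint64_t reclaim() {
    if (probeStarted_ ||               // probe has not started
        table_ == nullptr ||           // bridge has a build result
        !spillPartitions_.empty() ||   // no pending spill partition
        !tableSpillFunc_ ||            // spill callback still valid
        tableSpillInProgress_) {       // no table spill in progress
      return 0;
    }
    tableSpillInProgress_ = true;
    SpillPartitions partitions = tableSpillFunc_(*table_);
    tableSpillInProgress_ = false;

    // Spill succeeded: drop the resident table and publish the partitions
    // so the probe side can restore from them.
    const uint64_t freed = table_->bytes;
    table_.reset();
    tableSpillFunc_ = nullptr;
    spillPartitions_ = std::move(partitions);
    return freed;
  }

 private:
  std::unique_ptr<HashTableSketch> table_;
  SpillPartitions spillPartitions_;
  TableSpillFunc tableSpillFunc_;
  bool tableSpillInProgress_{false};
  bool probeStarted_{false};
};

// Hypothetical reclaimer flow: try the build operator pool first, then fall
// back to the bridge if the target has not been met.
uint64_t reclaimWithFallback(
    uint64_t targetBytes,
    const std::function<uint64_t(uint64_t)>& reclaimFromBuildPool,
    BridgeSketch& bridge) {
  uint64_t reclaimed = reclaimFromBuildPool(targetBytes);
  if (reclaimed < targetBytes) {
    reclaimed += bridge.reclaim();
  }
  return reclaimed;
}
```

In this model a second `reclaim()` call after a successful spill returns 0 because the table is gone and spill partitions are pending, and a call made after `markProbeStarted()` leaves the table untouched.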

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:


Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)


@guhaiyan0221 guhaiyan0221 requested a review from fzhedu May 25, 2026 13:43

Development

Successfully merging this pull request may close these issues.

[Bug] HashBuild OOM caused by
