Fix invalid L2 distances from IndexHNSWFlat with stale cached_l2norms#5326
Fix invalid L2 distances from IndexHNSWFlat with stale cached_l2norms#5326notandruu wants to merge 1 commit into
Conversation
IndexFlatL2::get_FlatCodesDistanceComputer() used the cached-norms distance computer whenever cached_l2norms was non-empty. That computer indexes cached_l2norms[i] for every vector up to ntotal, but add() after a sync_l2norms() does not extend (or invalidate) cached_l2norms, so its size can be smaller than ntotal. The computer then read stale or out-of-range norms and returned invalid squared L2 distances, including negative values, with inconsistent top-k results. Only take the cached path when cached_l2norms covers every vector (size == ntotal); otherwise fall back to the on-the-fly L2 computer. The fast path is unchanged for the normal case (add all vectors, then call sync_l2norms once). Fixes facebookresearch#5320
|
Hi @notandruu! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
|
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
Fixes #5320
Root cause
IndexFlatL2::get_FlatCodesDistanceComputer()selects the cached-norms distance computer (FlatL2WithNormsDis) whenevercached_l2normsis non-empty. That computer readscached_l2norms[i]for every vector up tontotal. Howeveradd()after async_l2norms()does not extend or invalidatecached_l2norms, so its size can be smaller thanntotal. The computer then reads stale / out-of-range norms and returns invalid squared L2 distances (including negative values), with inconsistent and wrong top-k results. This is reachable fromIndexHNSWFlat(METRIC_L2), whose storage is anIndexFlatL2.Fix
Take the cached path only when
cached_l2norms.size() == ntotal, i.e. when the cache covers every vector; otherwise fall back to the on-the-fly L2 computer. The fast path is unchanged for the normal usage (add all vectors, thensync_l2norms()once). One line, plus a comment.Reproduction and validation
Repro: add a partial batch to an
IndexHNSWFlat,sync_l2norms()on the downcastIndexFlatL2, add the rest, then search. Validated with a from-source build (BLAS=Accelerate, libomp) using a C++ negative control:Regression test
TestSyncL2Norms.test_indexflat_l2_sync_norms_stale_after_addadded next to the existingsync_l2normstest. It reproduces the scenario and asserts no negative distances and that each query's nearest neighbour matches an exactIndexFlatL2oracle. Run against the released wheel (which has the bug) the test fails (min distance -11114); with this fix the underlying cause is removed.Note: I could not build the Python bindings locally (no swig), so the C++ fix was validated via a from-source C++ build and the Python regression test was validated to fail on the unpatched build; CI will exercise it against the patched bindings.