fix: add minimum sample count check in PCATransformer::Train#2173
Hidden character warning
fix: add minimum sample count check in PCATransformer::Train#2173LHT129 wants to merge 1 commit into
Conversation
PCATransformer::Train divides by (count - 1) for unbiased covariance estimation. When count is 0 or 1, this causes division by zero (producing inf/NaN) or uint64_t underflow, silently corrupting the PCA matrix and all downstream search results. Signed-off-by: LHT129 <tianlan.lht@antgroup.com> Co-authored-by: opencode <opencode@anthropic.com>
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Require kind labelWonderful, this rule succeeded.
🟢 Require version labelWonderful, this rule succeeded.
🟢 Require linked issue for feature/bug PRsWonderful, this rule succeeded.
|
There was a problem hiding this comment.
Code Review
This pull request adds a minimum sample count validation (requiring at least 2 samples) to the PCATransformer::Train method, along with corresponding unit tests. The reviewer suggests expanding this validation to include defensive checks for null pointers, non-positive input dimensions, and potential integer overflows during buffer allocation, as well as adding unit tests to cover these additional edge cases.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if (count < 2) { | ||
| throw VsagException(ErrorType::INVALID_ARGUMENT, | ||
| fmt::format("PCA training requires at least 2 samples, got {}", count)); | ||
| } |
There was a problem hiding this comment.
To ensure robustness and prevent potential crashes or security vulnerabilities, we should add several defensive checks at the entry of Train():
- Null Pointer Check: If
dataisnullptrbutcount >= 2, dereferencingdatainComputeColumnMeanwill cause a segmentation fault. - Input Dimension Validation: If
input_dim_is non-positive, it can cause undefined behavior or division by zero. - Integer Overflow Check: If
count * input_dim_overflowsuint64_t,centralized_data.resizewill allocate a smaller buffer than required, leading to a heap buffer overflow during the centralization loop.
if (data == nullptr) {
throw VsagException(ErrorType::INVALID_ARGUMENT, "PCA training data pointer is null");
}
if (count < 2) {
throw VsagException(ErrorType::INVALID_ARGUMENT,
fmt::format("PCA training requires at least 2 samples, got {}", count));
}
if (input_dim_ <= 0) {
throw VsagException(ErrorType::INVALID_ARGUMENT,
fmt::format("PCA training requires positive input dimension, got {}", input_dim_));
}
if (count > UINT64_MAX / static_cast<uint64_t>(input_dim_)) {
throw VsagException(ErrorType::INVALID_ARGUMENT,
fmt::format("PCA training count {} is too large for input dimension {}", count, input_dim_));
}| std::vector<float> single_sample = {1.0f, 2.0f}; | ||
| REQUIRE_THROWS(pca.Train(single_sample.data(), 1)); | ||
| REQUIRE_THROWS(pca.Train(single_sample.data(), 0)); |
There was a problem hiding this comment.
Add unit tests to verify the new defensive checks for null pointer input and integer overflow.
std::vector<float> single_sample = {1.0f, 2.0f};
REQUIRE_THROWS(pca.Train(single_sample.data(), 1));
REQUIRE_THROWS(pca.Train(single_sample.data(), 0));
REQUIRE_THROWS(pca.Train(nullptr, 2));
REQUIRE_THROWS(pca.Train(single_sample.data(), UINT64_MAX));There was a problem hiding this comment.
Pull request overview
This PR fixes a correctness bug in PCATransformer::Train where Bessel-corrected covariance (count - 1) could divide by zero or underflow for count < 2, leading to silent inf/NaN corruption and degraded downstream results.
Changes:
- Add an argument guard in
PCATransformer::Trainto rejectcount < 2withVsagException(ErrorType::INVALID_ARGUMENT). - Add a unit test to ensure
Train()throws forcount == 0andcount == 1.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/impl/transform/pca_transformer.cpp |
Adds minimum sample-count validation to prevent invalid covariance scaling. |
src/impl/transform/pca_transformer_test.cpp |
Adds coverage for Train() rejecting insufficient sample counts. |
| void | ||
| PCATransformer::Train(const float* data, uint64_t count) { | ||
| if (count < 2) { | ||
| throw VsagException(ErrorType::INVALID_ARGUMENT, | ||
| fmt::format("PCA training requires at least 2 samples, got {}", count)); | ||
| } | ||
| vsag::Vector<float> centralized_data(allocator_); |
| std::vector<float> single_sample = {1.0f, 2.0f}; | ||
| REQUIRE_THROWS(pca.Train(single_sample.data(), 1)); | ||
| REQUIRE_THROWS(pca.Train(single_sample.data(), 0)); |
Fixes: #2172
Summary
PCATransformer::Traindivides bycount - 1for unbiased covariance estimation (Bessel correction). Whencount < 2, this causes:count == 1: division by zero producing inf/NaN, silently corrupting the PCA matrixcount == 0:uint64_tunderflow wrapping toUINT64_MAX, producing near-zero covarianceThis patch adds a guard at the entry of
Train()that throwsVsagException(INVALID_ARGUMENT)whencount < 2, matching the existing validation pattern inKMeansCluster.Changes
src/impl/transform/pca_transformer.cpp: Addcount < 2check at the start ofTrain()src/impl/transform/pca_transformer_test.cpp: AddTestTrainMinSampleCount()covering bothcount=0andcount=1Test plan
make fmtclean (clang-format-15)