
fix: add minimum sample count check in PCATransformer::Train#2173

Open
LHT129 wants to merge 1 commit into antgroup:main from LHT129:2026-06-08-修复-pcatransformer-单样本训练除零错误


Conversation


@LHT129 LHT129 commented Jun 8, 2026

Collaborator

Fixes: #2172

Summary

PCATransformer::Train divides by count - 1 for unbiased covariance estimation (Bessel correction). When count < 2, this causes:

  • count == 1: division by zero producing inf/NaN, silently corrupting the PCA matrix
  • count == 0: uint64_t underflow wrapping to UINT64_MAX, producing near-zero covariance

This patch adds a guard at the entry of Train() that throws VsagException(INVALID_ARGUMENT) when count < 2, matching the existing validation pattern in KMeansCluster.
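The two failure modes described above can be reproduced in isolation. The following is a minimal sketch, not the code from `pca_transformer.cpp`: the helpers `BesselDenominator`, `ScaleByBessel`, and `DemonstrateUnguardedScaling` are hypothetical names, and it assumes the scaling is done in `float` on a platform with IEEE 754 arithmetic.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: the Bessel denominator computed in unsigned arithmetic.
inline uint64_t
BesselDenominator(uint64_t count) {
    return count - 1;  // wraps to UINT64_MAX when count == 0
}

// Hypothetical helper: scale an accumulated sum by 1 / (count - 1) in float.
inline float
ScaleByBessel(float sum, uint64_t count) {
    return sum / static_cast<float>(BesselDenominator(count));
}

// Exercises both failure modes without any guard in place.
inline void
DemonstrateUnguardedScaling() {
    static_assert(std::numeric_limits<float>::is_iec559, "sketch assumes IEEE 754 floats");
    // count == 0: the unsigned subtraction wraps around.
    assert(BesselDenominator(0) == std::numeric_limits<uint64_t>::max());
    // count == 1: nonzero / 0 gives inf, 0 / 0 gives NaN.
    assert(std::isinf(ScaleByBessel(1.0f, 1)));
    assert(std::isnan(ScaleByBessel(0.0f, 1)));
    // count == 0: the huge denominator collapses the value toward zero.
    assert(ScaleByBessel(1.0f, 0) < 1e-18f);
}
```

None of these cases raises an error on its own, which is why rejecting `count < 2` up front is the simpler fix.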

Changes

  • src/impl/transform/pca_transformer.cpp: Add count < 2 check at the start of Train()
  • src/impl/transform/pca_transformer_test.cpp: Add TestTrainMinSampleCount() covering both count=0 and count=1

Test plan

  • PCA Basic Test passed (including new TestTrainMinSampleCount)
  • PCA Serialize / Deserialize Test passed
  • All 9,926,577 assertions passed
  • make fmt clean (clang-format-15)

PCATransformer::Train divides by (count - 1) for unbiased covariance
estimation. When count is 0 or 1, this causes division by zero (producing
inf/NaN) or uint64_t underflow, silently corrupting the PCA matrix and
all downstream search results.

Signed-off-by: LHT129 <tianlan.lht@antgroup.com>
Co-authored-by: opencode <opencode@anthropic.com>
Copilot AI review requested due to automatic review settings June 8, 2026 13:29
@LHT129 LHT129 self-assigned this Jun 8, 2026

mergify Bot commented Jun 8, 2026

Contributor

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Require kind label

Wonderful, this rule succeeded.
  • label~=^kind/

🟢 Require version label

Wonderful, this rule succeeded.
  • label~=^version/

🟢 Require linked issue for feature/bug PRs

Wonderful, this rule succeeded.
  • body~=(?im)(?:^|[\s\-\*])(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s*:?\s+(?:#\d+|[\w.\-]+/[\w.\-]+#\d+|https?://github\.com/[\w.\-]+/[\w.\-]+/issues/\d+)

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds a minimum sample count validation (requiring at least 2 samples) to the PCATransformer::Train method, along with corresponding unit tests. The reviewer suggests expanding this validation to include defensive checks for null pointers, non-positive input dimensions, and potential integer overflows during buffer allocation, as well as adding unit tests to cover these additional edge cases.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +37 to +40
    if (count < 2) {
        throw VsagException(ErrorType::INVALID_ARGUMENT,
                            fmt::format("PCA training requires at least 2 samples, got {}", count));
    }


security-high high

To ensure robustness and prevent potential crashes or security vulnerabilities, we should add several defensive checks at the entry of Train():

  1. Null Pointer Check: If data is nullptr but count >= 2, dereferencing data in ComputeColumnMean will cause a segmentation fault.
  2. Input Dimension Validation: If input_dim_ is non-positive, it can cause undefined behavior or division by zero.
  3. Integer Overflow Check: If count * input_dim_ overflows uint64_t, centralized_data.resize will allocate a smaller buffer than required, leading to a heap buffer overflow during the centralization loop.
    if (data == nullptr) {
        throw VsagException(ErrorType::INVALID_ARGUMENT, "PCA training data pointer is null");
    }
    if (count < 2) {
        throw VsagException(ErrorType::INVALID_ARGUMENT,
                            fmt::format("PCA training requires at least 2 samples, got {}", count));
    }
    if (input_dim_ <= 0) {
        throw VsagException(ErrorType::INVALID_ARGUMENT,
                            fmt::format("PCA training requires positive input dimension, got {}", input_dim_));
    }
    if (count > UINT64_MAX / static_cast<uint64_t>(input_dim_)) {
        throw VsagException(ErrorType::INVALID_ARGUMENT,
                            fmt::format("PCA training count {} is too large for input dimension {}", count, input_dim_));
    }
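
The suggested checks can be tried outside the VSAG tree with a standalone sketch. Everything here is an assumption made for illustration: `CheckedElementCount` and `Rejects` are hypothetical names, `std::invalid_argument` stands in for `VsagException`, and `input_dim` is taken to be a signed 64-bit integer because the suggestion compares it against zero and casts it to `uint64_t`.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical stand-in for the entry checks of Train(); returns the number
// of floats the centralized buffer would need if all checks pass.
inline uint64_t
CheckedElementCount(const float* data, uint64_t count, int64_t input_dim) {
    if (data == nullptr) {
        throw std::invalid_argument("PCA training data pointer is null");
    }
    if (count < 2) {
        throw std::invalid_argument("PCA training requires at least 2 samples");
    }
    if (input_dim <= 0) {
        throw std::invalid_argument("PCA training requires positive input dimension");
    }
    // Divide instead of multiplying so the check itself cannot wrap.
    const auto dim = static_cast<uint64_t>(input_dim);
    if (count > std::numeric_limits<uint64_t>::max() / dim) {
        throw std::invalid_argument("PCA training count is too large for input dimension");
    }
    return count * dim;
}

// Hypothetical test helper: true if the arguments are rejected.
inline bool
Rejects(const float* data, uint64_t count, int64_t input_dim) {
    try {
        CheckedElementCount(data, count, input_dim);
    } catch (const std::invalid_argument&) {
        return true;
    }
    return false;
}
```

The order of the checks matters: the dimension must be validated before it is used as a divisor in the overflow test.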

Comment on lines +178 to +180
    std::vector<float> single_sample = {1.0f, 2.0f};
    REQUIRE_THROWS(pca.Train(single_sample.data(), 1));
    REQUIRE_THROWS(pca.Train(single_sample.data(), 0));


medium

Add unit tests to verify the new defensive checks for null pointer input and integer overflow.

    std::vector<float> single_sample = {1.0f, 2.0f};
    REQUIRE_THROWS(pca.Train(single_sample.data(), 1));
    REQUIRE_THROWS(pca.Train(single_sample.data(), 0));
    REQUIRE_THROWS(pca.Train(nullptr, 2));
    REQUIRE_THROWS(pca.Train(single_sample.data(), UINT64_MAX));

Copilot AI left a comment


Pull request overview

This PR fixes a correctness bug in PCATransformer::Train where Bessel-corrected covariance (count - 1) could divide by zero or underflow for count < 2, leading to silent inf/NaN corruption and degraded downstream results.

Changes:

  • Add an argument guard in PCATransformer::Train to reject count < 2 with VsagException(ErrorType::INVALID_ARGUMENT).
  • Add a unit test to ensure Train() throws for count == 0 and count == 1.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
src/impl/transform/pca_transformer.cpp | Adds minimum sample-count validation to prevent invalid covariance scaling.
src/impl/transform/pca_transformer_test.cpp | Adds coverage for Train() rejecting insufficient sample counts.

Comment on lines 35 to 41
void
PCATransformer::Train(const float* data, uint64_t count) {
    if (count < 2) {
        throw VsagException(ErrorType::INVALID_ARGUMENT,
                            fmt::format("PCA training requires at least 2 samples, got {}", count));
    }
    vsag::Vector<float> centralized_data(allocator_);

Comment on lines +178 to +180
    std::vector<float> single_sample = {1.0f, 2.0f};
    REQUIRE_THROWS(pca.Train(single_sample.data(), 1));
    REQUIRE_THROWS(pca.Train(single_sample.data(), 0));

Development

Successfully merging this pull request may close these issues.

PCATransformer::Train division by zero when count < 2

2 participants