Normalize forest sample weights by ethanglaser · Pull Request #3674 · uxlfoundation/oneDAL

ethanglaser · 2026-06-25T21:55:53Z

Description

Fixes test deselected in uxlfoundation/scikit-learn-intelex#3231
Accompanies uxlfoundation/scikit-learn-intelex#3292
Combined CI: http://intel-ci.intel.com/f170e0af-9e7a-f1a7-82dc-d4f5ef20c6a0

Checklist:

Completeness and readability

I have commented my code, particularly in hard-to-understand areas.
I have updated the documentation to reflect the changes or created a separate PR with updates and provided its number in the description, if necessary.
Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
I have resolved any merge conflicts that might occur with the base branch.

Testing

I have run it locally and tested the changes extensively.
All CI jobs are green or I have provided justification why they aren't.
I have extended testing suite if new functionality was introduced in this PR.

Performance

I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with measured data, if performance change is expected.
I have provided justification why performance and/or quality metrics have changed or why changes are not expected.
I have extended the benchmarking suite and provided a corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.


ethanglaser · 2026-06-25T23:06:07Z

/azp run CI

azure-pipelines · 2026-06-25T23:06:17Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

This PR adjusts decision forest training to reduce sensitivity to the absolute scale of sample weights by introducing a weight-rescaling step (max weight scaled to 1) and applying it in both classification and regression forest training paths. It also aligns an internal regression split-selection accumulator type with the higher-precision intermediate type used in the surrounding impurity math.

Changes:

Added normalizeWeights() helper to rescale input sample weights before training.
Applied weight normalization in both classification and regression *TrainBatchKernel::compute() implementations.
Updated a regression split-selection variable (vBest) to use intermSummFPType for consistency with intermediate computations.
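As a rough illustration of why rescaling to a maximum weight of 1 should not change what the trees learn, the sketch below computes a weighted mean and a weighted variance before and after rescaling. The function names (`weightedMean`, `weightedVariance`, `scaleToUnitMax`) are hypothetical and are not oneDAL code; they only stand in for the kind of ratio-based quantities a split criterion compares. In exact arithmetic these ratios are unchanged by a common rescaling of the weights. The sketch does not reproduce whatever scale sensitivity the PR is fixing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Weighted mean of y. Illustrative stand-in, not the oneDAL implementation.
double weightedMean(const std::vector<double> & y, const std::vector<double> & w)
{
    assert(y.size() == w.size());
    double sumW = 0.0, sumWY = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i)
    {
        sumW += w[i];
        sumWY += w[i] * y[i];
    }
    return sumWY / sumW;
}

// Weighted variance of y around its weighted mean (an MSE-style impurity).
double weightedVariance(const std::vector<double> & y, const std::vector<double> & w)
{
    const double mean = weightedMean(y, w);
    double sumW = 0.0, sumSq = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i)
    {
        sumW += w[i];
        sumSq += w[i] * (y[i] - mean) * (y[i] - mean);
    }
    return sumSq / sumW;
}

// Rescale weights so the largest becomes 1.
// Assumes at least one strictly positive weight.
std::vector<double> scaleToUnitMax(const std::vector<double> & w)
{
    double maxW = 0.0;
    for (double v : w)
    {
        if (v > maxW) maxW = v;
    }
    assert(maxW > 0.0);
    std::vector<double> out(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) out[i] = w[i] / maxW;
    return out;
}
```

Calling `weightedVariance(y, w)` and `weightedVariance(y, scaleToUnitMax(w))` on the same data should agree up to rounding, whatever the original magnitude of `w`.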

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| cpp/daal/src/algorithms/dtrees/forest/df_train_dense_default_impl.i | Adds `normalizeWeights()` helper and required include to create a normalized weights table. |
| cpp/daal/src/algorithms/dtrees/forest/regression/df_regression_train_dense_default_impl.i | Uses normalized weights in training and adjusts split-selection accumulator precision. |
| cpp/daal/src/algorithms/dtrees/forest/classification/df_classification_train_dense_default_impl.i | Uses normalized weights in training. |

+    const size_t nRows = weights->getNumberOfRows();
+    if (!nRows) return empty;


+    algorithmFPType maxWeight = 0;
+    for (size_t i = 0; i < nRows; ++i)
+    {
+        if (src[i] > maxWeight) maxWeight = src[i];
+    }
+    if (!(maxWeight > 0)) return empty;
+
+    services::SharedPtr<HomogenNumericTableCPU<algorithmFPType, cpu> > normalized =
+        HomogenNumericTableCPU<algorithmFPType, cpu>::create(1, nRows, &s);
+    if (!s) return empty;
+
+    WriteOnlyRows<algorithmFPType, cpu> dstBlock(normalized.get(), 0, nRows);
+    s |= dstBlock.status();
+    if (!s) return empty;
+    algorithmFPType * dst = dstBlock.get();
+
+    for (size_t i = 0; i < nRows; ++i)
+    {
+        dst[i] = src[i] / maxWeight;
+    }
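The control flow of the excerpt above can be sketched outside oneDAL as follows. This is a simplified stand-in, not the PR's code: `normalizeWeightsSketch` is a hypothetical name, a `std::vector` replaces `HomogenNumericTableCPU`, and an empty vector replaces the `empty` return value. How the caller reacts to an empty result is not shown in the excerpt, so it is not modeled here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the helper in the diff: divides every weight by
// the largest one. Returns an empty vector when there is nothing to normalize
// (no rows, or no strictly positive weight).
template <typename FPType>
std::vector<FPType> normalizeWeightsSketch(const std::vector<FPType> & src)
{
    const std::size_t nRows = src.size();
    if (!nRows) return std::vector<FPType>();

    // Starting from 0 means negative weights can never become the maximum.
    FPType maxWeight = 0;
    for (std::size_t i = 0; i < nRows; ++i)
    {
        if (src[i] > maxWeight) maxWeight = src[i];
    }
    // Same guard as the diff: bail out unless the maximum is strictly positive.
    if (!(maxWeight > 0)) return std::vector<FPType>();

    std::vector<FPType> dst(nRows);
    for (std::size_t i = 0; i < nRows; ++i)
    {
        dst[i] = src[i] / maxWeight;
    }
    return dst;
}
```

With weights 2, 4 and 8, the sketch yields 0.25, 0.5 and 1. An all-zero or all-negative input yields an empty result, matching the early returns in the diff.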


david-cortes-intel · 2026-06-26T06:16:35Z

    ImpurityData right;
    IndexType iBest = -1;
-    algorithmFPType vBest;
+    intermSummFPType vBest;


If you're going to do this, then it makes more sense to make normalizedWeights of intermSummFPType. It also would need to be modified in a lot of other places to be consistent.

The normalized table can't be intermSummFPType: it's read back as algorithmFPType by _helper.init/ReadRows, so a float32 fit needs a float32 table.

Then please change the dtype in all other places that calculate similar quantities to be consistent.

I made one other change that I think is related to what you are alluding to. Otherwise, you're going to need to be more specific, as I am not very familiar with these files.
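To make the precision concern in this thread concrete, here is a small sketch of the same float data summed with two accumulator types. The aliases below are assumptions for illustration only: `algorithmFPType` is taken to be `float` (a float32 fit) and `intermSummFPType` to be `double`. The PR overview describes the latter only as a higher-precision intermediate type. `sumAs` is a hypothetical helper, not a oneDAL function.

```cpp
#include <cstddef>
#include <vector>

// Assumed for illustration; the real typedefs live in the oneDAL sources.
typedef float algorithmFPType;
typedef double intermSummFPType;

// Sum stored values into an accumulator of type AccType.
// The data stay in algorithmFPType; only the running total changes type.
template <typename AccType>
AccType sumAs(const std::vector<algorithmFPType> & values)
{
    AccType total = 0;
    for (std::size_t i = 0; i < values.size(); ++i)
    {
        total += values[i];
    }
    return total;
}
```

Summing ten million copies of `0.1f` with `sumAs<algorithmFPType>` drifts far from one million, while `sumAs<intermSummFPType>` stays close to it. That is the usual argument for keeping the stored table in the algorithm type but widening accumulators such as `vBest`.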

david-cortes-intel · 2026-06-26T06:20:31Z

+    }
+    if (!(maxWeight > 0)) return empty;
+
+    services::SharedPtr<HomogenNumericTableCPU<algorithmFPType, cpu> > normalized =


Does this need to be a oneDAL table? It's just a one-dimensional array which could be passed as a unique or shared pointer.

Downstream functionality (compute, computeForSpecificHelper, TrainBatchTask) all consumes NumericTables, so a table sets up well for those.

david-cortes-intel · 2026-06-26T06:21:08Z

+    if (!s) return empty;
+    algorithmFPType * dst = dstBlock.get();
+
+    for (size_t i = 0; i < nRows; ++i)


It could use ?rscl (preferably) or ?scal from MKL.

On a second look, if it's going to copy beforehand, maybe it'd make more sense to replace with a vectorized loop. Could also set a pragma for the alignment, given that the new array is allocated through oneDAL.

I believe your MKL suggestion would be preferable
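For reference, the "vectorized loop" alternative mentioned above might look like the sketch below. It is not the code from the PR: `scaleByInverse` is a hypothetical name, and the OpenMP SIMD pragma is only a hint that applies when the compiler has OpenMP enabled. The LAPACK auxiliary routine ?rscl (srscl/drscl) performs the same operation, multiplying a vector by 1/a, while guarding against overflow and underflow in the intermediate reciprocal. The MKL call itself is left out here so that the sketch stays self-contained.

```cpp
#include <cstddef>

// In-place x[i] = x[i] / a for i in [0, n).
// Plain division matches the loop in the original diff.
template <typename FPType>
void scaleByInverse(FPType * x, std::size_t n, FPType a)
{
#if defined(_OPENMP)
    // Vectorization hint; ignored entirely when OpenMP is not enabled.
    #pragma omp simd
#endif
    for (std::size_t i = 0; i < n; ++i)
    {
        x[i] /= a;
    }
}
```

Dividing by `a` keeps each element correctly rounded. Multiplying by a precomputed `1 / a` is typically faster but can differ in the last bit, which may matter if bitwise reproducibility against the unnormalized path is expected.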

ethanglaser · 2026-06-26T23:06:41Z

Updated combined CI: http://intel-ci.intel.com/f171b38c-632b-f18e-89b1-d4f5ef20c6a0

ethanglaser and others added 5 commits June 10, 2026 12:28

Forest fix attempt

4760278

Implement sample weight normalization on oneDAL side with precise rounding

64cada8

simplify and remove fp precision bs

49fe036

simplify comment

69c827b

Merge branch 'uxlfoundation:main' into dev/eglaser-forest-fix

d1c84ff

ethanglaser marked this pull request as ready for review June 25, 2026 23:04

Copilot AI review requested due to automatic review settings June 25, 2026 23:04

ethanglaser requested review from ahuber21, avolkov-intel, david-cortes-intel, icfaust and razdoburdin as code owners June 25, 2026 23:04

ethanglaser changed the title from "Dev/eglaser forest fix" to "Normalize forest sample weights" Jun 25, 2026

Copilot started reviewing on behalf of ethanglaser June 25, 2026 23:05

ethanglaser added the bug label Jun 25, 2026

Copilot AI reviewed Jun 25, 2026


ethanglaser mentioned this pull request Jun 25, 2026

restore forest deselection from normalize sample weights uxlfoundation/scikit-learn-intelex#3292

Open

10 tasks

david-cortes-intel reviewed Jun 26, 2026


ethanglaser and others added 3 commits June 26, 2026 00:08

add vectorization pragma and MKL rscl

062a246

more type correction

4a51318

Merge branch 'main' into dev/eglaser-forest-fix

c4c0add

3 participants
