dnn: skip Conv2Int8 fusion for grouped convs with Kg<8 (#28798) by 5usu · Pull Request #28920 · opencv/opencv

5usu · 2026-05-02T18:14:05Z

I agree to contribute to the project under Apache 2 License.
To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
The PR is proposed to the proper branch
There is a reference to the original bug report and related work
There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
The feature is well documented and sample code can be built with the project CMake

The YuNet 2023mar int8 model from opencv_zoo doesn't detect anything on 5.x. Even with --score_threshold 0.05 the detector returns no boxes, so this is not just a confidence drop: the network output is broken.

After dumping intermediate tensors and comparing against onnxruntime, the obj_* branches collapse to
~0 (saturated int8 -128), and the cls_* branches saturate the other way. Final score is cls * obj
so nothing ever crosses threshold.
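
To make the "nothing ever crosses threshold" step concrete, here is a minimal sketch of how a saturated obj value zeroes the final score. The scale and zero point are assumptions chosen for illustration (1/256 and -128), not values read from the model, and `dequantize` is a hypothetical helper.

```python
def dequantize(q, scale, zero_point):
    """Map an int8 value back to a real number: (q - zero_point) * scale."""
    return (q - zero_point) * scale


# Assumed quantization parameters for a [0, 1) output range (illustrative only).
SCALE, ZERO_POINT = 1.0 / 256, -128

obj = dequantize(-128, SCALE, ZERO_POINT)   # obj branch saturated at -128
cls_ = dequantize(127, SCALE, ZERO_POINT)   # cls branch saturated at +127
score = cls_ * obj                          # final score is cls * obj

print(obj, cls_, score)
```

With obj pinned at the zero point, the product is zero regardless of how confident cls is, which is why lowering --score_threshold to 0.05 does not help.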

The cause is in Conv2Int8's VNNI kernel. It processes K0 = 8 output channels per SIMD iteration and
writes them as one K0-wide block to out + (n*K1 + k1)*planesize with k1 = k_base / K0. When
ngroups > 1 and Kg = K/ngroups < K0, consecutive groups share the same k1 slot and overwrite each
other — only the last group's result survives. YuNet has six 1x1x3x3 depthwise conv heads (Kg = 1),
which is exactly this case.
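
The slot collision can be sketched with a toy model of the addressing described above. This is not the kernel code: `surviving_groups` is a hypothetical helper that only tracks which group wrote last to each K0-wide slot, assuming every group stores whole blocks starting at slot k_base / K0.

```python
K0 = 8  # output channels written per block, as described above


def surviving_groups(K, ngroups, K0=K0):
    """Return {slot k1: index of the group that wrote to it last}.

    Toy model: each group writes K0-wide blocks starting at k_base // K0,
    so groups with Kg < K0 can land on the same slot.
    """
    Kg = K // ngroups
    last_writer = {}
    for g in range(ngroups):
        k_base = g * Kg
        for k in range(k_base, k_base + Kg, K0):
            last_writer[k // K0] = g   # a later write replaces the earlier one
    return last_writer


# 16 depthwise groups (Kg = 1): only groups 7 and 15 survive.
print(surviving_groups(K=16, ngroups=16))
# 2 groups of 8 channels (Kg = K0): every group keeps its own slot.
print(surviving_groups(K=16, ngroups=2))
```

In the depthwise case, 16 groups compete for 2 slots, so 14 of the 16 results are lost in this model.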

The unfused DequantizeLinear → Conv2 → QuantizeLinear path is fine. The minimal fix here is to skip
the rewrite to Conv2Int8 when Kg < 8 so we fall back to the float path. A proper depthwise int8
kernel can be added later as a separate optimization.
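
The shape check amounts to a small predicate. The sketch below restates the condition from the PR title in Python; the real check lives in C++ in graph_fusion_qdq.cpp, and the function name here is hypothetical.

```python
K0 = 8  # block width of the Conv2Int8 kernel, as described above


def should_fuse_to_conv2int8(K, ngroups, K0=K0):
    """Return False for grouped convs whose per-group output count Kg is below K0."""
    Kg = K // ngroups
    return not (ngroups > 1 and Kg < K0)


print(should_fuse_to_conv2int8(K=16, ngroups=16))  # depthwise head, Kg = 1
print(should_fuse_to_conv2int8(K=64, ngroups=1))   # ordinary convolution
print(should_fuse_to_conv2int8(K=32, ngroups=2))   # grouped, Kg = 16
```

Only the grouped, narrow case is excluded, so ordinary and wide grouped convolutions keep the int8 path.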

Verified with samples/dnn/face_detect.py on Lena:

Before:
AssertionError: Cannot find a face in samples/data/lena.jpg

After:
Face 0, top-left coordinates: (204, 187), box width: 149, box height 212, score: 0.89

Cross-check vs onnxruntime: cls_8 mean 0.673 (was 0.93, ORT 0.673), obj_8 max 0.0039 (was
0, ORT 0.0039).

The regression test is a single 16-group depthwise QLinearConv with non-zero x_zp, added to
Quantized_Convolution. Test data is in opencv_extra on the matching branch (~9 KB total).
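
For readers unfamiliar with the operator, here is a pure-Python reference sketch of what a depthwise QLinearConv with non-zero x_zp computes. It is not the test itself and does not reproduce the fixture: stride 1, 1-pixel zero padding, no bias, and w_zp = 0 are all assumptions, and the function name is hypothetical.

```python
def qlinear_depthwise_conv3x3(x, w, x_scale, x_zp, w_scale, y_scale, y_zp):
    """Reference depthwise QLinearConv with one 3x3 filter per channel.

    x: [C][H][W] int8 values; w: [C][3][3] int8 values (w_zp assumed 0).
    Assumes stride 1, zero padding of 1 pixel, and no bias.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    multiplier = x_scale * w_scale / y_scale
    y = [[[0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):  # each channel is its own group (Kg = 1)
        for i in range(H):
            for j in range(W):
                acc = 0
                for di in range(3):
                    for dj in range(3):
                        ii, jj = i + di - 1, j + dj - 1
                        if 0 <= ii < H and 0 <= jj < W:
                            # Subtract x_zp before multiplying; padded taps
                            # represent real 0.0 and contribute nothing.
                            acc += (x[c][ii][jj] - x_zp) * w[c][di][dj]
                q = round(acc * multiplier) + y_zp
                y[c][i][j] = max(-128, min(127, q))  # saturate to int8
    return y


# Two channels with different filters: channel 0 is identity, channel 1 doubles.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
double = [[0, 0, 0], [0, 2, 0], [0, 0, 0]]
x = [[[10, 12], [14, 16]], [[10, 12], [14, 16]]]
y = qlinear_depthwise_conv3x3(x, [identity, double],
                              x_scale=0.5, x_zp=4, w_scale=0.5,
                              y_scale=0.25, y_zp=0)
print(y)
```

Each channel must produce its own plane. In the buggy path several channels would share one output slot, so a per-channel comparison against a reference like this would expose the overwrite.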

Conv2Int8's VNNI kernel processes K0=8 output channels per SIMD iteration. When ngroups>1 and Kg=K/ngroups < K0, multiple groups end up writing to the same K0-wide output slot and clobber one another, producing fully saturated int8 output for every depthwise conv past the first. The unfused DequantizeLinear -> Conv2 -> QuantizeLinear path is unaffected.

Detect the unsupported shape in graph_fusion_qdq.cpp and skip the rewrite to Conv2Int8 in that case, so the unfused float fallback is used. Adds an ONNX regression test (depthwise QLinearConv with x_zp != 0).

Fixes opencv#28798

asmorkalov · 2026-05-04T06:59:17Z

@abhishek-gola please join the pr review.

asmorkalov · 2026-05-04T06:59:45Z

@5usu thanks a lot for the investigation and contribution!

asmorkalov · 2026-05-05T06:17:33Z

@5usu I just got to my workstation and tested the face detection sample ./samples/dnn/face_detect.py with python3 ./face_detect.py --face_detection_model=face_detection_yunet_2023mar_int8.onnx. The sample detects nothing with a web camera. The behaviour is similar with the Lena image:

./face_detect.py --image1 ../data/lena.jpg --face_detection_model=face_detection_yunet_2023mar_int8.onnx
[ WARN:0@0.017] global net_impl_backend.cpp:334 setPreferableTarget Targets are not supported by the new graph engine for now
Traceback (most recent call last):
  File "/mnt/Projects/Projects/opencv-next/samples/dnn/./face_detect.py", line 73, in <module>
    assert faces1[1] is not None, 'Cannot find a face in {}'.format(args.image1)
           ^^^^^^^^^^^^^^^^^^^^^
AssertionError: Cannot find a face in ../data/lena.jpg

5usu · 2026-05-05T06:21:52Z

Let me look into it

abhishek-gola · 2026-05-05T07:36:49Z

The changes you made look correct, thanks for debugging that. However, we still need to re-export the INT8 model, since the current version isn’t consistently detecting faces. The newer model in this PR resolves the issue: opencv/opencv_zoo#309

5usu · 2026-05-05T08:40:42Z

Hi @asmorkalov, thanks for testing. The default --score_threshold in face_detect.py is 0.9, and the int8 model's score on Lena is 0.89 (vs the FP32 baseline's 0.91), so it falls just under the default and gets filtered out. Could you re-run with --score_threshold 0.6 (or any value ≤ 0.85)? On my machine:

$ python3 ./face_detect.py --image1 ../data/lena.jpg --face_detection_model=face_detection_yunet_2023mar_int8.onnx --score_threshold 0.6
Face 0, top-left coordinates: (204, 187), box width: 149, box height 212, score: 0.89

If it still fails for you with that, I'll dig into the AVX-VNNI vs AVX2-baseline kernel paths. Your CPU might be hitting the non-VNNI fallback in conv2_int8_kernels.simd.hpp, which has a slightly different code path that I haven't been able to test because my hardware has VNNI.

The webcam works fine here as well.
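
The threshold effect can be sketched in a few lines. The scores are the ones quoted in this comment; `keep_detections` is a hypothetical helper, and the use of `>=` is an assumption about how the detector compares scores.

```python
def keep_detections(scores, score_threshold):
    """Keep only detections whose score reaches the threshold."""
    return [s for s in scores if s >= score_threshold]


INT8_SCORE, FP32_SCORE = 0.89, 0.91  # Lena scores quoted above

print(keep_detections([INT8_SCORE], 0.9))  # default threshold: int8 face is dropped
print(keep_detections([FP32_SCORE], 0.9))  # FP32 face passes
print(keep_detections([INT8_SCORE], 0.6))  # lowered threshold: int8 face is kept
```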

asmorkalov · 2026-05-05T10:52:20Z

Hm, you are right about the threshold. The patch works with a lower threshold. I propose changing the default threshold in both the Python and C++ samples to about 0.85.
The only thing that bothers me is performance. The FP32 network runs at ~150 fps on my host, but the int8 version reaches only ~30 fps.

asmorkalov · 2026-05-05T10:53:31Z

@abhishek-gola Could you take a look at the performance issue? I use an AMD Ryzen 7 5700G, which has no VNNI.

5usu · 2026-05-05T11:03:29Z

Should I make some tweaks in conv2_int8_kernels.simd.hpp to try to address the performance issue, so you can review everything together? Only if you want that.

asmorkalov · 2026-05-05T11:38:58Z

Let's split the bug fix and the optimization into two or more PRs, so that the actual fix stays isolated in the git history.

5usu · 2026-05-05T18:17:53Z

Could you please assign the optimization follow-up to me?

asmorkalov · 2026-05-21T13:18:25Z

@abhishek-gola Could you take a look too?

abhishek-gola · 2026-05-25T06:42:31Z

LGTM

dnn: test data for depthwise QLinearConv (opencv #28798) #1360

Companion PR for [opencv/opencv#28920](opencv/opencv#28920). Adds three small fixtures (~9 KB total) for the depthwise QLinearConv regression test:

- `testdata/dnn/onnx/models/quantized_depthwise_conv_int8_weights.onnx` (629 B)
- `testdata/dnn/onnx/data/input_quantized_depthwise_conv_int8_weights.npy` (4.2 KB)
- `testdata/dnn/onnx/data/output_quantized_depthwise_conv_int8_weights.npy` (4.2 KB)

The model is a single 16-group, 1-channel-per-group, 3×3 depthwise QLinearConv with non-zero `x_zp`, the minimal shape that exercises the `Kg<K0` code path fixed in the OpenCV PR. Reference output produced by onnxruntime. Branch name matches the OpenCV PR per the contribution guide.

5usu mentioned this pull request May 2, 2026

face_detection_yunet_2023mar_int8.onnx model from zoo detects nothing with 5.x #28798

Closed

asmorkalov added this to the 5.0-release milestone May 4, 2026

asmorkalov added the bug, category: objdetect, category: dnn, and category: dnn (onnx) labels May 4, 2026

asmorkalov requested review from abhishek-gola and asmorkalov and removed request for asmorkalov May 4, 2026 06:58

asmorkalov requested changes May 6, 2026

Comment thread modules/dnn/test/test_onnx_importer.cpp

asmorkalov self-assigned this May 6, 2026

Adjust score threshold to support int8 quantized model.

741e952

5usu mentioned this pull request May 9, 2026

dnn: test data for depthwise QLinearConv (opencv #28798) opencv/opencv_extra#1360

Merged

asmorkalov approved these changes May 21, 2026

abhishek-gola approved these changes May 25, 2026

asmorkalov merged commit 62930a2 into opencv:5.x May 25, 2026
75 of 81 checks passed
