dnn: skip Conv2Int8 fusion for grouped convs with Kg<8 (#28798) #28920

Merged
asmorkalov merged 2 commits into opencv:5.x from 5usu:fix/28798-yunet-int8-objbranch
May 25, 2026
@5usu

5usu commented May 2, 2026

Contributor

OpenCV Extra: opencv/opencv_extra#1360

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another
    license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

Fixes #28798.

The YuNet 2023mar int8 model from opencv_zoo doesn't detect anything on 5.x. With --score_threshold 0.05 the detector still returns no boxes, so it's not just a confidence drop: the network output is broken.

After dumping intermediate tensors and comparing against onnxruntime, the obj_* branches collapse to
~0 (saturated int8 -128), and the cls_* branches saturate the other way. Final score is cls * obj
so nothing ever crosses threshold.
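
To make the failure mode concrete, here is a minimal sketch of why a collapsed obj branch zeroes every score. The scale and zero point below are hypothetical values chosen for a sigmoid-like output in [0, 1); they are not read from the YuNet model, and the helper names are made up for illustration.

```python
# Hypothetical quantization parameters (assumption, not taken from the model).
SCALE = 1.0 / 256.0
ZERO_POINT = -128


def dequantize(q, scale=SCALE, zero_point=ZERO_POINT):
    """Map an int8 value back to float: (q - zero_point) * scale."""
    return (q - zero_point) * scale


def final_score(cls_q, obj_q):
    """Score as described above: cls * obj, both dequantized first."""
    return dequantize(cls_q) * dequantize(obj_q)


# obj saturated at -128 dequantizes to 0.0 under these parameters,
# so the product is 0.0 no matter how confident cls is.
broken = final_score(cls_q=127, obj_q=-128)
print(broken)
print(broken >= 0.05)  # even a very low threshold is never reached
```

Under these assumed parameters, lowering the threshold cannot recover a detection, which matches the behaviour seen with --score_threshold 0.05.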

The cause is in Conv2Int8's VNNI kernel. It processes K0 = 8 output channels per SIMD iteration and
writes them as one K0-wide block to out + (n*K1 + k1)*planesize with k1 = k_base / K0. When
ngroups > 1 and Kg = K/ngroups < K0, consecutive groups share the same k1 slot and overwrite each
other — only the last group's result survives. YuNet has six 1x1x3x3 depthwise conv heads (Kg = 1),
which is exactly this case.
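
The slot collision can be reproduced with a toy model. This is only a sketch of the indexing described above, not the actual kernel: it assumes k_base = group * Kg, and the function names are hypothetical.

```python
K0 = 8  # output channels per SIMD block, as stated above


def block_slot(group, Kg, K0=K0):
    """Index k1 of the K0-wide output block a group writes to.

    Assumes k_base = group * Kg (first output channel of the group).
    """
    k_base = group * Kg
    return k_base // K0


def simulate_writes(ngroups, Kg, K0=K0):
    """Model 'one whole block per write': a later group replaces an earlier one."""
    slots = {}
    for g in range(ngroups):
        slots[block_slot(g, Kg, K0)] = g  # last writer wins
    return slots


def lost_groups(ngroups, Kg, K0=K0):
    """Groups whose results do not survive in any block."""
    survivors = set(simulate_writes(ngroups, Kg, K0).values())
    return [g for g in range(ngroups) if g not in survivors]


print(simulate_writes(16, 1))  # depthwise, Kg = 1: 16 groups share 2 slots
print(lost_groups(16, 1))
print(lost_groups(2, 8))       # Kg = K0: every group has its own slot
```

In this model a 16-group depthwise conv keeps only groups 7 and 15, while a grouped conv with Kg = 8 loses nothing.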

The unfused DequantizeLinear → Conv2 → QuantizeLinear path is fine. The minimal fix here is to skip
the rewrite to Conv2Int8 when Kg < 8 so we fall back to the float path. A proper depthwise int8
kernel can be added later as a separate optimization.
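
The guard amounts to a shape check before the rewrite. The sketch below is a Python illustration of that condition only; the real check lives in C++ in graph_fusion_qdq.cpp, and the function name and argument layout here are hypothetical. It follows the PR title in applying the restriction to grouped convs (ngroups > 1).

```python
K0 = 8  # block width of the int8 kernel, as stated above


def can_fuse_to_conv2int8(K, ngroups, K0=K0):
    """Return True if the fused int8 path is safe for this conv shape.

    K is the total number of output channels; Kg = K / ngroups is the
    number of output channels per group.
    """
    if ngroups <= 0 or K % ngroups != 0:
        raise ValueError("K must be a positive multiple of ngroups")
    Kg = K // ngroups
    # Grouped convs with fewer than K0 output channels per group would
    # share an output block, so they stay on the unfused float path.
    return not (ngroups > 1 and Kg < K0)


print(can_fuse_to_conv2int8(K=16, ngroups=16))  # depthwise, Kg = 1
print(can_fuse_to_conv2int8(K=64, ngroups=8))   # Kg = 8
print(can_fuse_to_conv2int8(K=64, ngroups=1))   # ordinary conv
```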

Verified with samples/dnn/face_detect.py on Lena:

Before:
AssertionError: Cannot find a face in samples/data/lena.jpg

After:
Face 0, top-left coordinates: (204, 187), box width: 149, box height 212, score: 0.89

Cross-check vs onnxruntime: cls_8 mean 0.673 (was 0.93, ORT 0.673), obj_8 max 0.0039 (was
0, ORT 0.0039).

The regression test is a single 16-group depthwise QLinearConv with non-zero x_zp, added to
Quantized_Convolution. Test data is in opencv_extra on the matching branch (~9 KB total).
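
For reference, what such a test checks can be sketched in pure Python. This is an illustrative reference for a depthwise quantized conv following the ONNX QLinearConv formula, not the code that generated the test data. It assumes int8 output, stride 1, no padding and no bias, and all values in the example are made up.

```python
def saturate_int8(v):
    """Clamp to the int8 range."""
    return max(-128, min(127, v))


def qlinear_depthwise_conv(x, w, x_scale, x_zp, w_scale, w_zp, y_scale, y_zp):
    """Depthwise quantized conv: x is [C][H][W], w is [C][kh][kw] (Kg = 1).

    Each channel is convolved only with its own filter.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(w[0]), len(w[0][0])
    out = []
    for c in range(C):
        plane = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                acc = 0
                for di in range(kh):
                    for dj in range(kw):
                        # Zero points are subtracted before multiplying.
                        acc += (x[c][i + di][j + dj] - x_zp) * (w[c][di][dj] - w_zp)
                # Python's round() rounds half to even, like ONNX quantization.
                y = round(acc * x_scale * w_scale / y_scale) + y_zp
                row.append(saturate_int8(y))
            plane.append(row)
        out.append(plane)
    return out


# Two channels, 3x3 input, 3x3 kernel, non-zero x_zp.
x = [[[10] * 3 for _ in range(3)], [[20] * 3 for _ in range(3)]]
w = [[[1] * 3 for _ in range(3)], [[2] * 3 for _ in range(3)]]
print(qlinear_depthwise_conv(x, w, 0.5, 5, 0.5, 0, 0.25, 0))
```

With x_zp = 5 the first channel accumulates 9 * (10 - 5) * 1 = 45; ignoring the zero point would give 90 instead, which is why a non-zero x_zp makes the test more sensitive. The second channel accumulates 270 and saturates to 127.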

Conv2Int8's VNNI kernel processes K0=8 output channels per SIMD iteration.
When ngroups>1 and Kg=K/ngroups < K0, multiple groups end up writing to the
same K0-wide output slot and clobber one another, producing fully saturated
int8 output for every depthwise conv past the first. The unfused
DequantizeLinear -> Conv2 -> QuantizeLinear path is unaffected.

Detect the unsupported shape in graph_fusion_qdq.cpp and skip the rewrite to
Conv2Int8 in that case, so the unfused float fallback is used. Adds an ONNX
regression test (depthwise QLinearConv with x_zp != 0).

Fixes opencv#28798
@asmorkalov

Contributor

@abhishek-gola please join the pr review.

@asmorkalov

Contributor

@5usu thanks a lot for the investigation and contribution!

@asmorkalov

Contributor

@5usu I just got to my workplace and tested the face detection sample ./samples/dnn/face_detect.py with python3 ./face_detect.py --face_detection_model=face_detection_yunet_2023mar_int8.onnx. The sample detects nothing with a web camera. I see similar behaviour with the Lena image:

./face_detect.py --image1 ../data/lena.jpg --face_detection_model=face_detection_yunet_2023mar_int8.onnx
[ WARN:0@0.017] global net_impl_backend.cpp:334 setPreferableTarget Targets are not supported by the new graph engine for now
Traceback (most recent call last):
  File "/mnt/Projects/Projects/opencv-next/samples/dnn/./face_detect.py", line 73, in <module>
    assert faces1[1] is not None, 'Cannot find a face in {}'.format(args.image1)
           ^^^^^^^^^^^^^^^^^^^^^
AssertionError: Cannot find a face in ../data/lena.jpg

@5usu

5usu commented May 5, 2026

Contributor Author

Let me look into it

@abhishek-gola

Contributor

The changes you made look correct, thanks for debugging that. However, we still need to re-export the INT8 model, since the current version isn’t consistently detecting faces. The newer model in this PR resolves the issue: opencv/opencv_zoo#309

@5usu

5usu commented May 5, 2026

Contributor Author

Hi @asmorkalov, thanks for testing. The default --score_threshold in face_detect.py is 0.9, and the int8 model's score on Lena is 0.89 (vs the FP32 baseline's 0.91), so it falls just under the default and gets filtered. Could you re-run with --score_threshold 0.6 (or any value ≤ 0.85)? On my machine:

$ python3 ./face_detect.py --image1 ../data/lena.jpg --face_detection_model=face_detection_yunet_2023mar_int8.onnx --score_threshold 0.6
Face 0, top-left coordinates: (204, 187), box width: 149, box height 212, score: 0.89

If it still fails for you with that, I'll dig into the AVX-VNNI vs AVX2-baseline kernel paths. Your CPU might be hitting the non-VNNI fallback in conv2_int8_kernels.simd.hpp, which has a slightly different code path that I haven't been able to test on my hardware (which has VNNI).

The webcam works fine here too.
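
The threshold effect described in this comment can be sketched as a plain filter. This is an illustration only, not the detector's implementation: the helper name and the dictionary layout are hypothetical, and whether the real comparison is inclusive is an assumption.

```python
def filter_by_score(detections, score_threshold):
    """Keep detections whose score reaches the threshold (assumed inclusive)."""
    return [d for d in detections if d["score"] >= score_threshold]


# The int8 result on Lena reported above: box (x, y, w, h) and score.
lena_int8 = [{"box": (204, 187, 149, 212), "score": 0.89}]

print(filter_by_score(lena_int8, 0.9))  # default threshold: nothing survives
print(filter_by_score(lena_int8, 0.6))  # lowered threshold: the face is kept
```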

@asmorkalov

Contributor

Hm, you are right about the threshold. The patch works with a lower threshold. I propose changing the default threshold in both the Python and C++ samples to 85% or so.
The only thing that bothers me is performance. The FP32 network runs at ~150 fps on my host, but the int8 version reaches only ~30 fps.

@asmorkalov

Contributor

@abhishek-gola Could you take a look at the performance issue? I use an AMD Ryzen 7 5700G without VNNI.

@5usu

5usu commented May 5, 2026

Contributor Author

Should I make some tweaks in conv2_int8_kernels.simd.hpp to try to address the performance issue, so you can review it all together? Only if you want that.

@asmorkalov

Contributor

Let's split the bug fix and the optimization into two or more PRs, just to keep the actual fix isolated in git history.

@5usu

5usu commented May 5, 2026

Contributor Author

Could you please assign that to me?

Comment thread modules/dnn/test/test_onnx_importer.cpp
@asmorkalov asmorkalov self-assigned this May 6, 2026
@asmorkalov

Contributor

@abhishek-gola Could you take a look too?

@abhishek-gola

Contributor

LGTM

asmorkalov pushed a commit to opencv/opencv_extra that referenced this pull request May 25, 2026
dnn: test data for depthwise QLinearConv (opencv #28798) #1360

Companion PR for opencv/opencv#28920.

Adds three small fixtures (~9 KB total) for the depthwise QLinearConv regression test:

- `testdata/dnn/onnx/models/quantized_depthwise_conv_int8_weights.onnx` (629 B)
- `testdata/dnn/onnx/data/input_quantized_depthwise_conv_int8_weights.npy` (4.2 KB)
- `testdata/dnn/onnx/data/output_quantized_depthwise_conv_int8_weights.npy` (4.2 KB)

The model is a single 16-group, 1-channel-per-group, 3×3 depthwise QLinearConv with non-zero `x_zp`, the minimal shape that exercises the `Kg<K0` code path fixed in the OpenCV PR. The reference output was produced by onnxruntime.

Branch name matches the OpenCV PR per the contribution guide.
@asmorkalov asmorkalov merged commit 62930a2 into opencv:5.x May 25, 2026
75 of 81 checks passed