dnn: skip Conv2Int8 fusion for grouped convs with Kg<8 (#28798) #28920
Conv2Int8's VNNI kernel processes K0=8 output channels per SIMD iteration. When ngroups>1 and Kg=K/ngroups < K0, multiple groups end up writing to the same K0-wide output slot and clobber one another, producing fully saturated int8 output for every depthwise conv past the first. The unfused DequantizeLinear -> Conv2 -> QuantizeLinear path is unaffected. Detect the unsupported shape in graph_fusion_qdq.cpp and skip the rewrite to Conv2Int8 in that case, so the unfused float fallback is used. Adds an ONNX regression test (depthwise QLinearConv with x_zp != 0). Fixes opencv#28798
|
@abhishek-gola please join the pr review. |
|
@5usu thanks a lot for the investigation and contribution! |
|
@5usu I just got to my workplace and tested the face detection sample ./samples/dnn/face_detect.py like |
|
Let me look into it |
|
The changes you made look correct, thanks for debugging that. However, we still need to re-export the INT8 model, since the current version isn’t consistently detecting faces. The newer model in this PR resolves the issue: opencv/opencv_zoo#309 |
Hi @asmorkalov — thanks for testing. The default --score_threshold in face_detect.py is
even the webcam is fine here |
|
Hm, you are right about the threshold. The patch works with a lower threshold. I propose changing the default threshold in both the Python and C++ samples to 85% or so. |
|
@abhishek-gola Could you take a look at the performance issue? I use an AMD Ryzen 7 5700G without VNNI. |
Should I make some tweaks in conv2_int8_kernels.simd.hpp to try to solve this? You could then review it all together, if needed.
|
Let's split the bug fix and the optimization into two or more PRs, so that the actual fix is isolated in the git history. |
Could you please assign that to me?
|
@abhishek-gola Could you take a look too? |
|
LGTM |
dnn: test data for depthwise QLinearConv (opencv #28798) #1360

Companion PR for [opencv/opencv#28920](opencv/opencv#28920). Adds three small fixtures (~9 KB total) for the depthwise QLinearConv regression test:

- `testdata/dnn/onnx/models/quantized_depthwise_conv_int8_weights.onnx` (629 B)
- `testdata/dnn/onnx/data/input_quantized_depthwise_conv_int8_weights.npy` (4.2 KB)
- `testdata/dnn/onnx/data/output_quantized_depthwise_conv_int8_weights.npy` (4.2 KB)

The model is a single 16-group, 1-channel-per-group, 3×3 depthwise QLinearConv with non-zero `x_zp`, the minimal shape that exercises the `Kg<K0` code path fixed in the OpenCV PR. Reference output produced by onnxruntime. Branch name matches the OpenCV PR per the contribution guide.


OpenCV Extra: opencv/opencv_extra#1360
Patch to opencv_extra has the same branch name.
Fixes #28798.
The YuNet 2023mar int8 model from opencv_zoo doesn't detect anything on 5.x. With `--score_threshold 0.05` the detector still returns no boxes, so it's not just a confidence drop: the network output is broken.

After dumping intermediate tensors and comparing against onnxruntime, the `obj_*` branches collapse to ~0 (saturated int8 `-128`), and the `cls_*` branches saturate the other way. The final score is `cls * obj`, so nothing ever crosses the threshold.
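
To make the score collapse concrete, here is a minimal sketch of affine int8 dequantization followed by the `cls * obj` product. The scale and zero point are illustrative assumptions (chosen so that int8 `-128` maps to 0), not values read from the YuNet model, and `dequantize` is a hypothetical helper rather than an OpenCV function.

```python
def dequantize(q, scale, zero_point):
    """Affine int8 dequantization: real = (q - zero_point) * scale."""
    return (q - zero_point) * scale


# Illustrative quantization parameters (assumed, not taken from the model).
SCALE = 1.0 / 256.0
ZERO_POINT = -128

# An obj_* output saturated at the int8 floor dequantizes to exactly 0.
obj_broken = dequantize(-128, SCALE, ZERO_POINT)
# A cls_* output saturated at the int8 ceiling dequantizes to almost 1.
cls_broken = dequantize(127, SCALE, ZERO_POINT)

# The product is 0 no matter how confident the cls branch looks.
score = cls_broken * obj_broken
print(score)  # 0.0
```

Under these assumptions, lowering the score threshold cannot help, which matches the observation that `--score_threshold 0.05` still returns no boxes.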
The cause is in `Conv2Int8`'s VNNI kernel. It processes `K0 = 8` output channels per SIMD iteration and writes them as one `K0`-wide block to `out + (n*K1 + k1)*planesize` with `k1 = k_base / K0`. When `ngroups > 1` and `Kg = K/ngroups < K0`, consecutive groups share the same `k1` slot and overwrite each other, so only the last group's result survives. YuNet has six `1x1x3x3` depthwise conv heads (`Kg = 1`), which is exactly this case.
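
The slot arithmetic can be sketched in a few lines of Python. This is a toy model of the indexing only, not the actual kernel: it tracks just the first channel of each group (`k_base = g * Kg`), and the function names are hypothetical.

```python
from collections import defaultdict

K0 = 8  # output channels per SIMD block, as stated in the description


def groups_per_slot(K, ngroups, k0=K0):
    """Map each output slot k1 to the groups whose first channel lands in it."""
    kg = K // ngroups
    slots = defaultdict(list)
    for g in range(ngroups):
        k_base = g * kg    # first output channel of group g
        k1 = k_base // k0  # K0-wide slot that the block is written to
        slots[k1].append(g)
    return dict(slots)


def surviving_group(K, ngroups, k0=K0):
    """Toy model: every group overwrites its whole slot, so the last writer wins."""
    return {k1: groups[-1] for k1, groups in groups_per_slot(K, ngroups, k0).items()}


# 16-group depthwise conv (Kg = 1): eight groups collide in each slot.
print(groups_per_slot(16, 16))
# Only groups 7 and 15 survive in this model.
print(surviving_group(16, 16))
# Kg = 8: every group gets its own slot, so nothing is overwritten.
print(groups_per_slot(16, 2))
```

With `K = 16` and `ngroups = 16`, groups 0 to 7 all map to slot 0 and groups 8 to 15 to slot 1, which is the collision described above.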
The unfused `DequantizeLinear → Conv2 → QuantizeLinear` path is fine. The minimal fix here is to skip the rewrite to `Conv2Int8` when `Kg < 8` so we fall back to the float path. A proper depthwise int8 kernel can be added later as a separate optimization.
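
The condition itself is small. The following is a Python paraphrase of the check as described, not the code in `graph_fusion_qdq.cpp`; the function name `can_fuse_to_conv2int8` is hypothetical.

```python
K0 = 8  # SIMD block width in output channels, per the description


def can_fuse_to_conv2int8(K, ngroups, k0=K0):
    """Return False for grouped convs whose per-group output count is below K0."""
    if ngroups <= 1:
        return True  # ungrouped convs are not affected
    kg = K // ngroups  # output channels per group
    return kg >= k0


print(can_fuse_to_conv2int8(64, 1))    # True: ordinary convolution
print(can_fuse_to_conv2int8(16, 16))   # False: depthwise, Kg = 1
print(can_fuse_to_conv2int8(16, 2))    # True: grouped, Kg = 8
```

When the check fails, the graph is left as it is, so the existing float path handles the layer.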
Verified with `samples/dnn/face_detect.py` on Lena:

Before:
AssertionError: Cannot find a face in samples/data/lena.jpg

After:
Face 0, top-left coordinates: (204, 187), box width: 149, box height 212, score: 0.89

Cross-check vs onnxruntime: `cls_8` mean `0.673` (was `0.93`, ORT `0.673`), `obj_8` max `0.0039` (was `0`, ORT `0.0039`).

The regression test is a single 16-group depthwise `QLinearConv` with non-zero `x_zp`, added to `Quantized_Convolution`. Test data is in opencv_extra on the matching branch (~9 KB total).