Summary
We use AIMET-ONNX LiteMP (flip_layers_to_higher_precision(..., override_precision=float16)) to quantize an activation-sensitive RT-DETRv2 detector. The AIMET simulation is essentially lossless (mean AP50 0.4996 vs FP32 0.4987). However, none of the available encoding-export strategies produce a correct mixed-precision context binary through qnn-onnx-converter on QAIRT 2.33.0.
Hard constraint: our production runtime is QAIRT 2.33.0.250327 and cannot be changed. So solutions that rely on Quantizer V2 / encoding 2.0.0 (which only exist in QAIRT >= 2.37) are not viable for us. We are looking for a way to deploy AIMET FP16 mixed precision on 2.33 specifically, or confirmation that it is impossible.
This is closely related to #4090, but that issue's accepted fix (encoding_version="2.0.0" + --use_quantize_v2) requires a newer QAIRT and therefore does not apply to us.
Environment
- AIMET-ONNX: 2.31.0 (+cu126)
- QAIRT / QNN SDK: 2.33.0.250327 (production-locked, cannot upgrade)
- HTP target: v75 (SoC 90) — project "z03"
- Python 3.10
- Model: RT-DETRv2 PResNet-50vd detector (ONNX opset 16, static shapes, contains GridSample / deformable cross-attention). Inputs: images [1,3,640,640] f32, orig_target_sizes [1,2] int64. Outputs: labels/boxes/scores.
Why FP16 is required (QuantAnalyzer)
check_model_sensitivity_to_quantization shows the model is activation-quantization sensitive, not weight sensitive:
| Config | eval |
| --- | --- |
| FP32 | 0.1232 |
| weight-only quant (W8, A=float) | 0.1228 (≈ lossless) |
| activation-only quant (W=float, A16) | 0.0038 (collapse) |
Per-layer activation encoding ranges identify a handful of coordinate activations in the decoder with astronomically large dynamic range, e.g.:
/model/decoder/Add_output_0 range ≈ 2.7e34 (anchors add)
/model/decoder/decoder/Div_output_0..Div_5 range ≈ 1e5 (bbox log-scale wh)
/postprocessor/Mul_2_output_0 range ≈ 4.6e3 (box coords × image size)
INT8/INT16 cannot represent these, so plain w8a16/w16a16 collapse to AP50 ≈ 0.02. LiteMP keeping these (and the SQNR-sensitive layers) in FP16 restores the AIMET-sim accuracy to ≈ FP32. The problem is exporting that mixed precision to a 2.33 context binary.
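To put rough numbers on this, the step size of a uniform quantizer can be computed from the ranges quoted above. This is a minimal sketch that assumes a plain uniform affine grid spanning the full observed range; `uniform_step` is a hypothetical helper for illustration, not an AIMET API.

```python
def uniform_step(value_range: float, bitwidth: int) -> float:
    """Step size of a uniform grid with 2**bitwidth levels covering value_range."""
    return value_range / (2 ** bitwidth - 1)


# Ranges as reported in the per-layer activation encodings above.
observed_ranges = {
    "/model/decoder/Add_output_0": 2.7e34,
    "/model/decoder/decoder/Div_output_0": 1e5,
    "/postprocessor/Mul_2_output_0": 4.6e3,
}

for name, value_range in observed_ranges.items():
    step8 = uniform_step(value_range, 8)
    step16 = uniform_step(value_range, 16)
    print(f"{name}: int8 step={step8:.3g}, int16 step={step16:.3g}")
```

For the 2.7e34 range, even a 16-bit grid has a step on the order of 4e29, so all ordinary-magnitude values in that tensor fall into the same bucket.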
Baseline (full eval set, 1154 images, 3-class AP50)
| Config | subject | crop | debris | mean | note |
| --- | --- | --- | --- | --- | --- |
| FP32 | 0.7038 | 0.5821 | 0.2103 | 0.4987 | reference |
| fp16 (AIMET sim) | 0.7047 | 0.5848 | 0.2112 | 0.5002 | ~lossless |
| w16a16 plain PTQ | 0.0151 | 0.0179 | 0.0000 | 0.0110 | collapse |
| w8a16 plain PTQ | 0.0327 | 0.0304 | 0.0000 | 0.0210 | collapse |
| LiteMP w8a16 + fp16 flip (AIMET sim) | 0.7043 | 0.5834 | 0.2112 | 0.4996 | target to deploy |
The three export strategies and how each fails on 2.33
LiteMP flips sensitive layers to FP16 by setting q.enabled = False. We then call sim.export(..., encoding_version=...) and run:
qnn-onnx-converter \
--input_network model.onnx \
--quantization_overrides model.encodings \
--input_list input_list.txt \
-o model.cpp
1. Omit-style (default 0.6.1 export — FP16 layers absent from encodings)
Converter succeeds, but the FP16-intended layers are silently quantized to 8-bit (QNN default for tensors without an override). Inspecting the generated cpp:
- images input tensor → QNN_DATATYPE_UFIXED_POINT_8 with real scale=0.00392 (1/255), offset=0 (actually 8-bit quantized, not FP16).
- 34 activations that LiteMP intended as FP16 (incl. /model/decoder/Add_output_0, Concat_3, Mul_output_0, enc_output_proj/Add) come out as UFIXED_POINT_8.
- Only 1 NATIVE FLOAT_32 tensor remains.
Converter log confirms the fallback semantics:
INFO - Skipping quantization, no input_list provided # when no list
INFO - Processed N quantization encodings # encodings read
# but tensors absent from encodings -> default 8-bit, not fp16
So the deployed model is effectively all-INT8 on the sensitive layers → on-target accuracy will be far below the simulated 0.4996. This matches #4071 ("when the encodings file is missing encodings for some operators, QNN falls back to computing/default quantization").
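One way to see which tensors are exposed to this fallback before running the converter is to diff the graph's tensor names against the exported encodings. The sketch below assumes the 0.6.1 file is a JSON object whose `activation_encodings` and `param_encodings` values are dicts keyed by tensor name; `tensors_without_override` is a hypothetical helper, and the sample data is illustrative rather than taken from our export.

```python
from typing import Iterable


def tensors_without_override(encodings: dict, tensor_names: Iterable[str]) -> list[str]:
    """Return the tensor names that have no entry in either encodings section."""
    covered = set(encodings.get("activation_encodings", {}))
    covered |= set(encodings.get("param_encodings", {}))
    return sorted(name for name in tensor_names if name not in covered)


# Illustrative stand-in for json.load(open("model.encodings")).
sample_encodings = {
    "activation_encodings": {
        "/model/backbone/Relu_output_0": [{"bitwidth": 16, "dtype": "int"}],
    },
    "param_encodings": {
        "backbone.conv1.weight": [{"bitwidth": 8, "dtype": "int"}],
    },
}
# Illustrative stand-in for the activation names collected from the ONNX graph.
graph_tensors = [
    "/model/backbone/Relu_output_0",
    "/model/decoder/Add_output_0",
]

print(tensors_without_override(sample_encodings, graph_tensors))
```

Every name this returns is a tensor the converter receives no override for, which in strategy 1 is exactly the set that ends up as UFIXED_POINT_8.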
2. Explicit dtype:float (force FP16 entries in 0.6.1)
To stop the silent downgrade to 8-bit activations, we set the flipped quantizers to float dtype so that they export as {"bitwidth":16,"dtype":"float"} (as MMP/choose_mixed_precision does), skipping initializers and onnx:: constants to avoid placing overrides on non-quantizable tensors.
Converter fails:
ERROR - Encountered Error: setQInfo: Can't set quantization data with data type:
QNN_DATATYPE_UFIXED_POINT_8 for: onnx::Add_4934 with quantizable flag 0
...
File ".../backend/qnn_quantizer.py", line 435, in quantize
quantizer.mixed_precision_processing()
ValueError: setQInfo: Can't set quantization data with data type:
QNN_DATATYPE_UFIXED_POINT_8 for: onnx::Add_4934 with quantizable flag 0
Key detail: onnx::Add_4934 is an initializer constant (the bias-like addend of conv1_1/norm/Add) and is absent from both the omit and the float encodings. The omit export converts fine; only the float export triggers this error. So the FP16 dtype:float entries on other layers change QNN's mixed_precision_processing propagation such that it tries to quantize this constant (flag 0) → crash. This looks like the same V1 float-fallback / mixed-precision-propagation bug noted in #4090 as fixed only in Quantizer V2.
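The claim that no float override sits on a constant can be checked mechanically. This is a sketch under the same assumption about the 0.6.1 layout (each section is a dict mapping tensor name to a list of encoding dicts); `float_override_names` and `constant_like` are hypothetical helpers, and matching constants by the `onnx::` prefix is a heuristic, not a rule of the format.

```python
def float_override_names(encodings: dict) -> list[str]:
    """Names in either section that carry at least one dtype == 'float' entry."""
    names = set()
    for section in ("activation_encodings", "param_encodings"):
        for name, entries in encodings.get(section, {}).items():
            if any(entry.get("dtype") == "float" for entry in entries):
                names.add(name)
    return sorted(names)


def constant_like(names: list[str], prefix: str = "onnx::") -> list[str]:
    """Heuristic filter: names that look like exporter-generated constants."""
    return [name for name in names if name.startswith(prefix)]


# Illustrative data only; a real check would load the float-style export.
sample = {
    "activation_encodings": {
        "/model/decoder/Add_output_0": [{"bitwidth": 16, "dtype": "float"}],
        "/model/backbone/Relu_output_0": [{"bitwidth": 16, "dtype": "int"}],
    },
    "param_encodings": {},
}

flipped = float_override_names(sample)
print(flipped, constant_like(flipped))
```

On our float export this kind of check comes back empty for constants, which is why the error on onnx::Add_4934 appears to originate in the converter's propagation rather than in the overrides file.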
3. Encoding 2.0.0 (to trigger Quantizer V2, as recommended in #4090)
sim.export(path, prefix, encoding_version="2.0.0",
export_int32_bias=True, force_activation_as=None)
The 2.0.0 file is a single encodings list (1173 entries, output_dtype/y_scale fields, no param_encodings/activation_encodings keys).
QAIRT 2.33 qnn-onnx-converter cannot parse it:
ERROR - Node /model/backbone/conv1/conv1_1/conv/Conv: 'param_encodings'
File ".../converters/common/converter_ir/op_adapter.py", line 1338, in update_param_quant_overrides
if graph.has_user_quantization_overrides() and \
self.name in graph.user_quantization_overrides['param_encodings']:
KeyError: 'param_encodings'
i.e. 2.33's converter hardcodes the old param_encodings schema key and has no Quantizer-V2 / 2.0.0 support at all. (We confirmed by grepping the SDKs: --use_quantize_v2 and IrQuantizerV2 in qnn-onnx-converter exist only in 2.37/2.41/2.43, not 2.33.)
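The schema difference described above can be detected before handing a file to the converter. This is a minimal sketch based only on the key layout observed in our two exports; `encodings_schema` and its return labels are hypothetical names, not part of AIMET or QAIRT.

```python
def encodings_schema(doc: dict) -> str:
    """Classify an encodings document by its top-level keys."""
    if "param_encodings" in doc and "activation_encodings" in doc:
        # Layout that the 2.33 converter indexes directly.
        return "legacy"
    if isinstance(doc.get("encodings"), list):
        # Single flat list, as seen in our encoding_version="2.0.0" export.
        return "single-list"
    return "unknown"


print(encodings_schema({"param_encodings": {}, "activation_encodings": {}}))
print(encodings_schema({"encodings": []}))
```

A guard like this in the export pipeline would turn the KeyError into an explicit, early failure when a single-list file is about to be passed to a 2.33 toolchain.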
Questions
- On QAIRT 2.33.0, is there any supported way to deploy AIMET FP16 mixed precision (LiteMP / manual fp16) through qnn-onnx-converter such that the flipped layers actually run in FP16 (not silently re-quantized to 8-bit, and without the setQInfo ... flag 0 crash)?
- For strategy 1 (omit-style), is there a converter flag (e.g. a float-fallback option) that makes 2.33 keep encoding-less tensors in float instead of defaulting them to 8-bit?
- For strategy 2, is the setQInfo: Can't set ... for onnx::Add_4934 with quantizable flag 0 error avoidable on 2.33 (e.g. by excluding specific tensor classes from the overrides), or is it inherently the V1 mixed_precision_processing bug?
- Is there a recommended 2.33-compatible encoding format/flag combination for mixed INT/FP16 at all, or is QAIRT >= 2.37 strictly required for this?
Any guidance for the 2.33-locked case would be greatly appreciated.