[BUG] LiteMP FP16 mixed precision cannot be deployed via qnn-onnx-converter on QAIRT 2.33.0 #4102

Description

@charmway

Summary

We use AIMET-ONNX LiteMP (flip_layers_to_higher_precision(..., override_precision=float16)) to quantize an activation-sensitive RT-DETRv2 detector. The AIMET simulation is essentially lossless (mean AP50 0.4996 vs FP32 0.4987). However, none of the available encoding-export strategies produce a correct mixed-precision context binary through qnn-onnx-converter on QAIRT 2.33.0.

Hard constraint: our production runtime is QAIRT 2.33.0.250327 and cannot be changed. So solutions that rely on Quantizer V2 / encoding 2.0.0 (which only exist in QAIRT >= 2.37) are not viable for us. We are looking for a way to deploy AIMET FP16 mixed precision on 2.33 specifically, or confirmation that it is impossible.

This is closely related to #4090, but that issue's accepted fix (encoding_version="2.0.0" + --use_quantize_v2) requires a newer QAIRT and therefore does not apply to us.

Environment

  • AIMET-ONNX: 2.31.0 (+cu126)
  • QAIRT / QNN SDK: 2.33.0.250327 (production-locked, cannot upgrade)
  • HTP target: v75 (SoC 90) — project "z03"
  • Python 3.10
  • Model: RT-DETRv2 PResNet-50vd detector (ONNX opset 16, static shapes, contains GridSample / deformable cross-attention). Inputs: images [1,3,640,640] f32, orig_target_sizes [1,2] int64. Outputs: labels/boxes/scores.

Why FP16 is required (QuantAnalyzer)

check_model_sensitivity_to_quantization shows the model is activation-quantization sensitive, not weight sensitive:

Config                                 eval
FP32                                   0.1232
weight-only quant (W8, A=float)        0.1228  (≈ lossless)
activation-only quant (W=float, A16)   0.0038  (collapse)

Per-layer activation encoding ranges identify a handful of coordinate activations in the decoder with astronomically large dynamic range, e.g.:

/model/decoder/Add_output_0                 range ≈ 2.7e34   (anchors add)
/model/decoder/decoder/Div_output_0..Div_5  range ≈ 1e5      (bbox log-scale wh)
/postprocessor/Mul_2_output_0               range ≈ 4.6e3    (box coords × image size)

INT8/INT16 cannot represent these ranges with usable resolution, so plain w16a16/w8a16 collapse to mean AP50 ≈ 0.01 to 0.02 (see the baseline table below). With LiteMP keeping these (and the SQNR-sensitive layers) in FP16, AIMET-sim accuracy returns to ≈ FP32. The problem is exporting that mixed precision to a 2.33 context binary.
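To put numbers on why integer activations collapse here, the following is a minimal sketch that computes the step size of a uniform quantizer for the approximate ranges listed above. It assumes a plain uniform grid that spreads the whole range over 2**bits levels; the function name uniform_step is illustrative and is not an AIMET or QNN API.

```python
def uniform_step(value_range: float, bits: int) -> float:
    """Step size of a uniform grid that spreads `value_range` over 2**bits levels."""
    return value_range / (2 ** bits - 1)


# Approximate per-tensor ranges quoted in the report.
ranges = {
    "/model/decoder/Add_output_0": 2.7e34,
    "/model/decoder/decoder/Div_output_0": 1e5,
    "/postprocessor/Mul_2_output_0": 4.6e3,
}

for name, value_range in ranges.items():
    step8 = uniform_step(value_range, 8)
    step16 = uniform_step(value_range, 16)
    # Any two values closer together than one step map to the same integer code.
    print(f"{name}: int8 step={step8:.3g}, int16 step={step16:.3g}")
```

Any activation values closer together than one step become indistinguishable after quantization, which is why a single outlier-dominated range is enough to wipe out the small coordinate values that share the tensor.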

Baseline (full eval set, 1154 images, 3-class AP50)

Config                                 subject  crop    debris  mean    note
FP32                                   0.7038   0.5821  0.2103  0.4987  reference
fp16 (AIMET sim)                       0.7047   0.5848  0.2112  0.5002  ~lossless
w16a16 plain PTQ                       0.0151   0.0179  0.0000  0.0110  collapse
w8a16 plain PTQ                        0.0327   0.0304  0.0000  0.0210  collapse
LiteMP w8a16 + fp16 flip (AIMET sim)   0.7043   0.5834  0.2112  0.4996  target to deploy

The three export strategies and how each fails on 2.33

LiteMP flips sensitive layers to FP16 by setting q.enabled = False on their quantizers. We then call sim.export(..., encoding_version=...) and run:

qnn-onnx-converter \
  --input_network model.onnx \
  --quantization_overrides model.encodings \
  --input_list input_list.txt \
  -o model.cpp

1. Omit-style (default 0.6.1 export — FP16 layers absent from encodings)

Converter succeeds, but the FP16-intended layers are silently quantized to 8-bit (QNN default for tensors without an override). Inspecting the generated cpp:

  • images input tensor → QNN_DATATYPE_UFIXED_POINT_8 with real scale=0.00392 (1/255), offset=0 — actually 8-bit quantized, not FP16.
  • 34 activations that LiteMP intended as FP16 (incl. /model/decoder/Add_output_0, Concat_3, Mul_output_0, enc_output_proj/Add) come out as UFIXED_POINT_8.
  • Only 1 NATIVE FLOAT_32 tensor remains.

Converter log confirms the fallback semantics:

INFO - Skipping quantization, no input_list provided      # when no list
INFO - Processed N quantization encodings                  # encodings read
# but tensors absent from encodings -> default 8-bit, not fp16

So the deployed model is effectively all-INT8 on the sensitive layers, and on-target accuracy will be far below the simulated 0.4996. This matches #4071 ("when the encodings file is missing encodings for some operators, QNN falls back to computing/default quantization").
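The inspection of the generated cpp described above can be partly automated. This is a minimal sketch that only counts QNN_DATATYPE_* tokens in the converter output text; it makes no assumption about how the generated file is laid out, so it cannot attribute a data type to a specific tensor. The function name count_qnn_datatypes and the sample string are illustrative and are not real converter output.

```python
import re
from collections import Counter

# Matches tokens such as QNN_DATATYPE_UFIXED_POINT_8 or QNN_DATATYPE_FLOAT_32.
DATATYPE_RE = re.compile(r"QNN_DATATYPE_[A-Z0-9_]+")


def count_qnn_datatypes(cpp_text: str) -> Counter:
    """Count every QNN_DATATYPE_* token that appears in the generated source text."""
    return Counter(DATATYPE_RE.findall(cpp_text))


# Synthetic stand-in for the generated file, for demonstration only.
sample_text = """
.dataType = QNN_DATATYPE_UFIXED_POINT_8,
.dataType = QNN_DATATYPE_UFIXED_POINT_8,
.dataType = QNN_DATATYPE_FLOAT_32,
"""

for datatype, count in count_qnn_datatypes(sample_text).most_common():
    print(f"{datatype}: {count}")
```

To run it on a real export, read the file first (for example with pathlib.Path("model.cpp").read_text()) and pass the text in. A histogram dominated by UFIXED_POINT_8, with no 16-bit float entries, is a quick signal that the FP16 intent did not survive conversion.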

2. Explicit dtype:float (force FP16 entries in 0.6.1)

To stop the silent downgrade to 8-bit activations, we set the flipped quantizers to float dtype so they export as {"bitwidth":16,"dtype":"float"} (as MMP/choose_mixed_precision does), while skipping initializers and onnx:: constants to avoid putting overrides on non-quantizable tensors.

Converter fails:

ERROR - Encountered Error: setQInfo: Can't set quantization data with data type:
  QNN_DATATYPE_UFIXED_POINT_8 for: onnx::Add_4934 with quantizable flag 0
...
File ".../backend/qnn_quantizer.py", line 435, in quantize
    quantizer.mixed_precision_processing()
ValueError: setQInfo: Can't set quantization data with data type:
  QNN_DATATYPE_UFIXED_POINT_8 for: onnx::Add_4934 with quantizable flag 0

Key detail: onnx::Add_4934 is an initializer constant (the bias-like addend of conv1_1/norm/Add) and is absent from both the omit and the float encodings. The omit export converts fine; only the float export triggers this error. So the FP16 dtype:float entries on other layers change QNN's mixed_precision_processing propagation such that it tries to quantize this constant (flag 0) → crash. This looks like the same V1 float-fallback / mixed-precision-propagation bug noted in #4090 as fixed only in Quantizer V2.
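The claim that no float override lands on a constant can be checked mechanically before running the converter. The sketch below assumes the 0.6.1 layout has top-level activation_encodings and param_encodings sections keyed by tensor name, with each value being either one dict or a list of dicts; the helper names and the sample tensor onnx::Hypothetical_0 are made up for illustration.

```python
def iter_entries(section: dict):
    """Yield (tensor_name, entry) pairs; accepts a dict or a list of dicts per tensor."""
    for name, value in section.items():
        entries = value if isinstance(value, list) else [value]
        for entry in entries:
            if isinstance(entry, dict):
                yield name, entry


def audit_float_overrides(encodings: dict, suspect_prefixes=("onnx::",)) -> dict:
    """List every dtype=float override and flag those on constant-looking names."""
    report = {"float_overrides": [], "suspect": []}
    for section_key in ("activation_encodings", "param_encodings"):
        for name, entry in iter_entries(encodings.get(section_key, {})):
            if entry.get("dtype") == "float":
                report["float_overrides"].append((section_key, name))
                if name.startswith(suspect_prefixes):
                    report["suspect"].append((section_key, name))
    return report


# Synthetic example; the second tensor name is hypothetical.
sample = {
    "activation_encodings": {
        "/model/decoder/Add_output_0": [{"bitwidth": 16, "dtype": "float"}],
        "onnx::Hypothetical_0": [{"bitwidth": 16, "dtype": "float"}],
    },
    "param_encodings": {},
}
report = audit_float_overrides(sample)
print(len(report["float_overrides"]), "float overrides;", "suspect:", report["suspect"])
```

On a real export, load the file with json.load and pass the resulting dict in. An empty suspect list together with the setQInfo failure would support the reading that the converter's own propagation, rather than an explicit override, is what touches the constant.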

3. Encoding 2.0.0 (to trigger Quantizer V2, as recommended in #4090)

sim.export(path, prefix, encoding_version="2.0.0",
           export_int32_bias=True, force_activation_as=None)

The 2.0.0 file is a single encodings list (1173 entries, output_dtype/y_scale fields, no param_encodings/activation_encodings keys).

QAIRT 2.33 qnn-onnx-converter cannot parse it:

ERROR - Node /model/backbone/conv1/conv1_1/conv/Conv: 'param_encodings'
File ".../converters/common/converter_ir/op_adapter.py", line 1338, in update_param_quant_overrides
    if graph.has_user_quantization_overrides() and \
       self.name in graph.user_quantization_overrides['param_encodings']:
KeyError: 'param_encodings'

i.e. 2.33's converter hardcodes the old param_encodings schema key and has no Quantizer-V2 / 2.0.0 support at all. (We confirmed by grepping the SDKs: --use_quantize_v2 and IrQuantizerV2 in qnn-onnx-converter exist only in 2.37/2.41/2.43, not 2.33.)
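A small pre-flight check can catch this schema mismatch before the converter raises the KeyError. This is a sketch under assumptions: the legacy layout is identified by the param_encodings and activation_encodings keys named above, and the 2.0.0 layout is assumed to hold its single list under a top-level key called "encodings" (inferred from the description of the file, not from a format specification). The function name detect_encoding_schema is illustrative.

```python
def detect_encoding_schema(encodings: dict) -> str:
    """Classify an encodings file that has already been parsed from JSON."""
    if "param_encodings" in encodings and "activation_encodings" in encodings:
        # Layout that the 2.33 converter indexes directly.
        return "legacy"
    if isinstance(encodings.get("encodings"), list):
        # Assumed shape of the 2.0.0 export: one flat list of entries.
        return "list-style"
    return "unknown"


legacy_sample = {"param_encodings": {}, "activation_encodings": {}}
list_sample = {"encodings": [{"name": "t0", "output_dtype": "int8"}]}  # synthetic entry

for label, sample in (("legacy", legacy_sample), ("list", list_sample)):
    print(label, "->", detect_encoding_schema(sample))
```

Refusing to invoke the 2.33 converter unless the result is "legacy" turns the late KeyError into an early, readable failure; it does not make the 2.0.0 file usable on 2.33.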

Questions

  1. On QAIRT 2.33.0, is there any supported way to deploy AIMET FP16 mixed precision (LiteMP / manual fp16) through qnn-onnx-converter such that the flipped layers actually run in FP16 (not silently re-quantized to 8-bit, and without the setQInfo ... flag 0 crash)?
  2. For strategy 1 (omit-style), is there a converter flag (e.g. a float-fallback option) that makes 2.33 keep encoding-less tensors in float instead of defaulting them to 8-bit?
  3. For strategy 2, is the setQInfo: Can't set ... for onnx::Add_4934 with quantizable flag 0 error avoidable on 2.33 (e.g. by excluding specific tensor classes from the overrides), or is it inherently the V1 mixed_precision_processing bug?
  4. Is there a recommended 2.33-compatible encoding format/flag combination for mixed INT/FP16 at all, or is QAIRT >= 2.37 strictly required for this?

Any guidance for the 2.33-locked case would be greatly appreciated.

Metadata

Labels

  • QNN: Issues related to QNN
  • community: Issues initiated by the community
