[BUG] LiteMP FP16 mixed precision cannot be deployed via qnn-onnx-converter on QAIRT 2.33.0 #4102

## Summary

We use AIMET-ONNX LiteMP (`flip_layers_to_higher_precision(..., override_precision=float16)`) to quantize an **activation-sensitive** RT-DETRv2 detector. The AIMET simulation is essentially lossless (mean AP50 0.4996 vs. 0.4987 for FP32). However, **none** of the three encoding-export strategies we tried produces a correct mixed-precision context binary through `qnn-onnx-converter` on **QAIRT 2.33.0**.

**Hard constraint: our production runtime is QAIRT 2.33.0.250327 and cannot be changed.** So solutions that rely on Quantizer V2 / encoding 2.0.0 (which only exist in QAIRT >= 2.37) are not viable for us. We are looking for a way to deploy AIMET FP16 mixed precision on **2.33** specifically, or confirmation that it is impossible.

This is closely related to #4090, but that issue's accepted fix (`encoding_version="2.0.0"` + `--use_quantize_v2`) requires a newer QAIRT and therefore does not apply to us.

## Environment

- AIMET-ONNX: 2.31.0 (+cu126)
- **QAIRT / QNN SDK: 2.33.0.250327 (production-locked, cannot upgrade)**
- HTP target: v75 (SoC 90) — project "z03"
- Python 3.10
- Model: RT-DETRv2 PResNet-50vd detector (ONNX opset 16, static shapes, contains GridSample / deformable cross-attention). Inputs: `images [1,3,640,640] f32`, `orig_target_sizes [1,2] int64`. Outputs: `labels/boxes/scores`.

## Why FP16 is required (QuantAnalyzer)

`check_model_sensitivity_to_quantization` shows the model is **activation-quantization sensitive, not weight sensitive**:

| Config | eval |
|---|---|
| FP32 | 0.1232 |
| weight-only quant (W8, A=float) | 0.1228 (≈ lossless) |
| activation-only quant (W=float, A16) | 0.0038 (collapse) |

Per-layer activation encoding ranges identify a handful of coordinate activations in the decoder with astronomically large dynamic range, e.g.:

```
/model/decoder/Add_output_0                 range ≈ 2.7e34   (anchors add)
/model/decoder/decoder/Div_output_0..Div_5  range ≈ 1e5      (bbox log-scale wh)
/postprocessor/Mul_2_output_0               range ≈ 4.6e3    (box coords × image size)
```

INT8/INT16 cannot represent these, so plain w8a16/w16a16 collapse to a mean AP50 of ≈ 0.01 to 0.02 (see the baseline table below). LiteMP keeping these (and the SQNR-sensitive layers) in FP16 restores the AIMET-sim accuracy to ≈ FP32. The problem is **exporting that mixed precision to a 2.33 context binary**.
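To make the range argument concrete, here is a rough sketch of the arithmetic. It assumes a zero-based uniform quantizer spanning the whole reported range, which is a simplification and not AIMET's actual encoding computation. The helper names `uniform_step` and `fake_quantize` are ours.

```python
def uniform_step(value_range: float, bitwidth: int) -> float:
    """Step size when 2**bitwidth levels are spread uniformly over value_range."""
    return value_range / (2 ** bitwidth - 1)


def fake_quantize(x: float, value_range: float, bitwidth: int) -> float:
    """Snap x to the nearest grid point of a zero-based uniform quantizer."""
    step = uniform_step(value_range, bitwidth)
    # Clamp the integer level to the representable grid [0, 2**bitwidth - 1].
    level = min(max(round(x / step), 0), 2 ** bitwidth - 1)
    return level * step


# Ranges quoted in the listing above (approximate values from QuantAnalyzer).
RANGES = {
    "/model/decoder/Add_output_0": 2.7e34,
    "/model/decoder/decoder/Div_output_0": 1e5,
    "/postprocessor/Mul_2_output_0": 4.6e3,
}

for name, value_range in RANGES.items():
    step = uniform_step(value_range, 16)
    snapped = fake_quantize(0.5, value_range, 16)
    print(f"{name}: INT16 step ~ {step:.3g}, 0.5 quantizes to {snapped:.3g}")
```

With a range of 2.7e34 the 16-bit step is on the order of 1e29, so every normalized coordinate lands on the same grid point. That is consistent with the collapse seen in the activation-only experiment.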

## Baseline (full eval set, 1154 images, 3-class AP50)

| Config | subject | crop | debris | mean | note |
|---|---|---|---|---|---|
| FP32 | 0.7038 | 0.5821 | 0.2103 | 0.4987 | reference |
| fp16 (AIMET sim) | 0.7047 | 0.5848 | 0.2112 | 0.5002 | ~lossless |
| w16a16 plain PTQ | 0.0151 | 0.0179 | 0.0000 | 0.0110 | collapse |
| w8a16 plain PTQ | 0.0327 | 0.0304 | 0.0000 | 0.0210 | collapse |
| **LiteMP w8a16 + fp16 flip (AIMET sim)** | 0.7043 | 0.5834 | 0.2112 | **0.4996** | target to deploy |

## The three export strategies and how each fails on 2.33

LiteMP flips sensitive layers to FP16 by setting `q.enabled = False`. We then call `sim.export(..., encoding_version=...)` and run:

```bash
qnn-onnx-converter \
  --input_network model.onnx \
  --quantization_overrides model.encodings \
  --input_list input_list.txt \
  -o model.cpp
```

### 1. Omit-style (default 0.6.1 export — FP16 layers absent from encodings)

Converter **succeeds**, but the FP16-intended layers are **silently quantized to 8-bit** (QNN default for tensors without an override). Inspecting the generated cpp:

- `images` input tensor → `QNN_DATATYPE_UFIXED_POINT_8` with real `scale=0.00392 (1/255), offset=0` — actually 8-bit quantized, **not** FP16.
- 34 activations that LiteMP intended as FP16 (incl. `/model/decoder/Add_output_0`, `Concat_3`, `Mul_output_0`, `enc_output_proj/Add`) come out as `UFIXED_POINT_8`.
- Only 1 NATIVE `FLOAT_32` tensor remains.

Converter log confirms the fallback semantics:

```
INFO - Skipping quantization, no input_list provided      # when no list
INFO - Processed N quantization encodings                  # encodings read
# but tensors absent from encodings -> default 8-bit, not fp16
```

So the deployed model is effectively all-INT8 on the sensitive layers → on-target accuracy will be far below the simulated 0.4996. This matches #4071 ("when the encodings file is missing encodings for some operators, QNN falls back to computing/default quantization").
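For anyone who wants to repeat the inspection of the generated cpp, a minimal sketch of the tally we did is below. It only counts `QNN_DATATYPE_*` tokens in the source text, so it counts occurrences rather than distinct tensors and makes no assumption about the file layout. `count_qnn_datatypes` is our own helper, not part of the SDK.

```python
import re
from collections import Counter

# Matches any QNN datatype enum token, e.g. QNN_DATATYPE_UFIXED_POINT_8.
_QNN_DTYPE = re.compile(r"QNN_DATATYPE_[A-Z0-9_]+")


def count_qnn_datatypes(cpp_source: str) -> Counter:
    """Count how often each QNN_DATATYPE_* token appears in the given text."""
    return Counter(_QNN_DTYPE.findall(cpp_source))


def summarize(cpp_source: str) -> str:
    """Render the tally as one 'datatype: count' line per datatype, most common first."""
    counts = count_qnn_datatypes(cpp_source)
    return "\n".join(f"{dtype}: {n}" for dtype, n in counts.most_common())
```

Typical use would be `print(summarize(open("model.cpp").read()))`. If no `FLOAT_16` tokens appear for the flipped layers, the mixed precision did not survive conversion.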

### 2. Explicit `dtype:float` (force FP16 entries in 0.6.1)

To stop the silent downgrade to 8-bit activations, we set the flipped quantizers to float dtype so they export as `{"bitwidth":16,"dtype":"float"}` (like MMP/`choose_mixed_precision` does), **skipping initializers / `onnx::` constants** to avoid putting overrides on non-quantizable tensors.

Converter **fails**:

```
ERROR - Encountered Error: setQInfo: Can't set quantization data with data type:
  QNN_DATATYPE_UFIXED_POINT_8 for: onnx::Add_4934 with quantizable flag 0
...
File ".../backend/qnn_quantizer.py", line 435, in quantize
    quantizer.mixed_precision_processing()
ValueError: setQInfo: Can't set quantization data with data type:
  QNN_DATATYPE_UFIXED_POINT_8 for: onnx::Add_4934 with quantizable flag 0
```

Key detail: `onnx::Add_4934` is an **initializer constant** (the bias-like addend of `conv1_1/norm/Add`) and is **absent from both the omit and the float encodings**. The omit export converts fine; only the float export triggers this error. Our reading is that the FP16 `dtype:float` entries on *other* layers change how QNN's `mixed_precision_processing` propagates data types, so that it tries to quantize this constant (quantizable flag 0) and crashes. This looks like the same V1 float-fallback / mixed-precision-propagation bug noted in #4090 as fixed only in Quantizer V2.
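The claim that no override touches an initializer can be checked with something like the sketch below. It assumes the 0.6.1 layout as we understand it: top-level `activation_encodings` and `param_encodings` dicts keyed by tensor name, each value being an entry or a list of entries such as `{"bitwidth":16,"dtype":"float"}`. Both function names are hypothetical helpers of ours.

```python
from typing import Iterable, List, Set

_SECTIONS = ("activation_encodings", "param_encodings")


def _entries(value) -> list:
    """Normalize a per-tensor value to a list of encoding dicts."""
    return value if isinstance(value, list) else [value]


def float_override_names(encodings: dict) -> Set[str]:
    """Names of tensors that carry at least one dtype == 'float' entry."""
    names = set()
    for section in _SECTIONS:
        for name, value in encodings.get(section, {}).items():
            if any(e.get("dtype") == "float" for e in _entries(value)):
                names.add(name)
    return names


def overrides_on_initializers(encodings: dict,
                              initializer_names: Iterable[str]) -> List[str]:
    """Override names (any dtype) that are also ONNX initializers."""
    overridden = set()
    for section in _SECTIONS:
        overridden.update(encodings.get(section, {}))
    return sorted(overridden & set(initializer_names))
```

The initializer names can be collected with `{i.name for i in onnx.load("model.onnx").graph.initializer}`. In our case the second function should return a list that does not contain `onnx::Add_4934`, matching the observation above.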

### 3. Encoding 2.0.0 (to trigger Quantizer V2, as recommended in #4090)

```python
sim.export(path, prefix, encoding_version="2.0.0",
           export_int32_bias=True, force_activation_as=None)
```

The 2.0.0 file is a single `encodings` list (1173 entries, `output_dtype`/`y_scale` fields, no `param_encodings`/`activation_encodings` keys).

QAIRT **2.33** `qnn-onnx-converter` **cannot parse it**:

```
ERROR - Node /model/backbone/conv1/conv1_1/conv/Conv: 'param_encodings'
File ".../converters/common/converter_ir/op_adapter.py", line 1338, in update_param_quant_overrides
    if graph.has_user_quantization_overrides() and \
       self.name in graph.user_quantization_overrides['param_encodings']:
KeyError: 'param_encodings'
```

i.e. 2.33's converter hardcodes the old `param_encodings` schema key and has no Quantizer-V2 / 2.0.0 support at all. (We confirmed by grepping the SDKs: `--use_quantize_v2` and `IrQuantizerV2` in `qnn-onnx-converter` exist only in 2.37/2.41/2.43, not 2.33.)
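As a stopgap we considered a preflight check that rejects a 2.0.0-style file before it reaches the 2.33 converter. The sketch below only looks at the top-level keys described in this issue (`param_encodings`/`activation_encodings` vs. a single `encodings` list). It does not validate the schema, and the returned labels are our own naming.

```python
def detect_encoding_schema(encodings: dict) -> str:
    """Classify an encodings dict by its top-level keys only."""
    if "param_encodings" in encodings and "activation_encodings" in encodings:
        # Layout that the 2.33 converter looks up by key.
        return "legacy"
    if isinstance(encodings.get("encodings"), list):
        # Single flat list, as produced by encoding_version="2.0.0".
        return "2.0.0-style"
    return "unknown"


def assert_parsable_by_legacy_converter(encodings: dict) -> None:
    """Raise early instead of hitting KeyError: 'param_encodings' in the converter."""
    schema = detect_encoding_schema(encodings)
    if schema != "legacy":
        raise ValueError(
            f"encodings schema is {schema!r}; a converter that expects "
            "'param_encodings' will not be able to read this file"
        )
```

Load the exported file with `json.load` and pass the resulting dict. This does not fix anything, but it turns the converter traceback into a clear early failure.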

## Questions

1. On **QAIRT 2.33.0**, is there any supported way to deploy AIMET FP16 mixed precision (LiteMP / manual fp16) through `qnn-onnx-converter` such that the flipped layers actually run in FP16 (not silently re-quantized to 8-bit, and without the `setQInfo ... flag 0` crash)?
2. For strategy 1 (omit-style), is there a converter flag (e.g. a float-fallback option) that makes 2.33 keep encoding-less tensors in float instead of defaulting them to 8-bit?
3. For strategy 2, is the `setQInfo: Can't set ... for onnx::Add_4934 with quantizable flag 0` error avoidable on 2.33 (e.g. by excluding specific tensor classes from the overrides), or is it inherently the V1 `mixed_precision_processing` bug?
4. Is there a recommended 2.33-compatible encoding format/flag combination for mixed INT/FP16 at all, or is QAIRT >= 2.37 strictly required for this?

Any guidance for the 2.33-locked case would be greatly appreciated.
