Yes, I can provide more details.
Environment:
- Hardware: MacBook Pro with Apple M5, 24GB RAM
- mano-cua: 1.1.1
- Local model: Mano-P w8a16, path: ~/.mano/models/Mano-P/w8a16
- Python env: ~/.mano/venv
- cider: 0.7.0
- cider.is_available(): True
- vlm_service: OK
- Current config was restored to w8a8=off after the test.
Observed behavior:
- With w8a8 unset/off:
  - Performance was much faster.
  - Previously observed speed was around:
    - prefill: ~1150 tok/s
    - decode: ~24 tok/s
    - peak memory: ~6.4GB
  - Each GUI step was still not instant, but much more usable.
- After setting:
  mano-cua config --set w8a8 auto
  The slowdown was reproducible across multiple steps, not only the first step.
Logs:
- At startup:
  [cider] Converted 252 layers to CiderLinear in 27.1s
- Step 1:
  - decode: 67 tokens, 2.5 tok/s, peak_mem=6.3GB
  - step time: 48.0s
  - prefill: ~130 tok/s
- Step 2:
  - decode: 61 tokens, 2.8 tok/s, peak_mem=6.3GB
  - step time: 54.8s
  - prefill: ~128-130 tok/s
- Step 3:
  - decode: 64 tokens, 2.8 tok/s
  - step time: 55.3s
So it is not just the first step being slow due to model conversion/prewarm. The first step includes an extra 27.1s Cider conversion, but later steps are still very slow: around 50+ seconds per step, decode only ~2.5-2.8 tok/s.
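As a rough cross-check of the numbers above, the decode time per step can be estimated from the logged token counts and decode rates. This is only a sketch: the rates in the log are rounded, and the `steps` list and `decode_seconds` helper are my own names, not anything from mano-cua or cider.

```python
# Rough estimate of how much of each step is spent decoding,
# using the token counts, decode rates, and step times from the logs above.
steps = [
    {"step": 1, "tokens": 67, "tok_per_s": 2.5, "step_time_s": 48.0},
    {"step": 2, "tokens": 61, "tok_per_s": 2.8, "step_time_s": 54.8},
    {"step": 3, "tokens": 64, "tok_per_s": 2.8, "step_time_s": 55.3},
]


def decode_seconds(tokens: int, tok_per_s: float) -> float:
    """Estimated decode time in seconds (hypothetical helper)."""
    return tokens / tok_per_s


for s in steps:
    d = decode_seconds(s["tokens"], s["tok_per_s"])
    share = d / s["step_time_s"]
    print(f"step {s['step']}: decode ~{d:.1f}s of {s['step_time_s']:.1f}s ({share:.0%})")
```

By this estimate decode accounts for roughly 22-27s of each step. The remainder is presumably prefill plus other per-step overhead, which I have not measured separately.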
- Switching back to:
  mano-cua config --set w8a8 off
  restored the previous faster behavior.
Another thing I noticed:
The config help says w8a8 default is auto, but in visual/agents/local.py the code appears to use:
w8a8_mode = get_config("w8a8") or "off"
So if the config is unset, it actually behaves as off, not auto. This may be a docs/config mismatch.
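To illustrate why an unset value ends up as off, here is a minimal standalone sketch of that fallback pattern. It does not use the real mano-cua code: a plain dict stands in for the config store, and `resolve_w8a8_mode` is a hypothetical name for illustration only.

```python
# Standalone sketch of the `get_config("w8a8") or "off"` pattern.
# Assumption: a plain dict stands in for the real config store.
def resolve_w8a8_mode(config: dict) -> str:
    # `or` falls back on any falsy value, so a missing key (None)
    # and an empty string both resolve to "off".
    return config.get("w8a8") or "off"


for cfg in ({}, {"w8a8": ""}, {"w8a8": "auto"}, {"w8a8": "off"}):
    print(cfg, "->", resolve_w8a8_mode(cfg))
```

With this pattern, auto is only used when it is set explicitly, which matches what I observed. If auto is really meant to be the default, the fallback string would presumably need to be "auto"; otherwise the help text would need to say off.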
Summary:
- The slowdown with w8a8=auto is reproducible.
- It affects every inference step, not only the first one.
- On this machine, w8a8 makes both prefill and decode much slower:
  - prefill drops from ~1150 tok/s to ~130 tok/s
  - decode drops from ~24 tok/s to ~2.5-2.8 tok/s
- The Cider conversion itself takes ~27.1s, but the main issue is that subsequent steps remain slow.
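For reference, the slowdown factors implied by the figures in this summary can be computed as follows. This is just arithmetic on the approximate numbers I reported; the `slowdown` helper is my own name.

```python
# Slowdown factors implied by the approximate throughput figures above.
def slowdown(before_tok_per_s: float, after_tok_per_s: float) -> float:
    """How many times slower `after` is compared to `before`."""
    return before_tok_per_s / after_tok_per_s


prefill = slowdown(1150, 130)      # w8a8 off vs. w8a8 auto
decode_best = slowdown(24, 2.8)    # using the fastest observed decode rate
decode_worst = slowdown(24, 2.5)   # using the slowest observed decode rate

print(f"prefill: ~{prefill:.1f}x slower")
print(f"decode:  ~{decode_best:.1f}x to ~{decode_worst:.1f}x slower")
```

So both prefill and decode are roughly 9x slower with w8a8=auto on this machine.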