
Additional details on the w8a8 issue on a Mac M5 (24GB) #31

Description

@Gaeroce

Yes, I can provide more details.

Environment:

  • Hardware: MacBook Pro with Apple M5, 24GB RAM
  • mano-cua: 1.1.1
  • Local model: Mano-P w8a16, path: ~/.mano/models/Mano-P/w8a16
  • Python env: ~/.mano/venv
  • cider: 0.7.0
  • cider.is_available(): True
  • vlm_service: OK
  • Current config was restored to w8a8=off after the test.

Observed behavior:

  1. With w8a8 unset/off:

    • Performance was much faster.
    • Previously observed speeds were around:
      • prefill: ~1150 tok/s
      • decode: ~24 tok/s
      • peak memory: ~6.4GB
    • Each GUI step was still not instant, but much more usable.
  2. After setting:
    mano-cua config --set w8a8 auto

    The slowdown was reproducible across multiple steps, not only the first step.

    Logs:

    • At startup:
      [cider] Converted 252 layers to CiderLinear in 27.1s

    Step 1:

    • decode: 67 tokens, 2.5 tok/s, peak_mem=6.3GB
    • step time: 48.0s
    • prefill: ~130 tok/s

    Step 2:

    • decode: 61 tokens, 2.8 tok/s, peak_mem=6.3GB
    • step time: 54.8s
    • prefill: ~128-130 tok/s

    Step 3:

    • decode: 64 tokens, 2.8 tok/s
    • step time: 55.3s

    So this is not just a slow first step caused by model conversion/prewarm. The first step includes an extra 27.1s Cider conversion, but later steps are still very slow: around 55 seconds per step (steps 2 and 3), with decode at only ~2.5-2.8 tok/s.

  3. Switching back to:
    mano-cua config --set w8a8 off

    restored the previous, faster behavior.
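To see how much of each w8a8=auto step is spent in decode, the per-step logs can be split with simple arithmetic. This is a minimal sketch that uses only the numbers reported above; the tok/s values in the logs are rounded, so the results are approximate, and the "remaining" figure lumps together everything that is not decode (prefill and any other per-step work), which the logs do not break down further.

```python
# Per-step numbers copied from the w8a8=auto logs above.
steps = [
    # (step, decode_tokens, decode_tok_per_s, step_time_s)
    (1, 67, 2.5, 48.0),
    (2, 61, 2.8, 54.8),
    (3, 64, 2.8, 55.3),
]


def breakdown(decode_tokens: int, tok_per_s: float, step_time_s: float) -> tuple[float, float]:
    """Return (decode_seconds, remaining_seconds) for one step."""
    decode_s = decode_tokens / tok_per_s
    return decode_s, step_time_s - decode_s


for step, n, rate, total in steps:
    decode_s, rest_s = breakdown(n, rate, total)
    print(f"step {step}: decode ~{decode_s:.1f}s, remaining ~{rest_s:.1f}s of {total:.1f}s")
```

With these inputs, decode accounts for roughly 22-27s of each 48-55s step, so neither decode nor the non-decode remainder alone explains the full step time.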

Another thing I noticed:
The config help says the default for w8a8 is auto, but visual/agents/local.py appears to use:

w8a8_mode = get_config("w8a8") or "off"

So if the config is unset, it actually behaves as off, not auto. This may be a docs/config mismatch.

Summary:

  • The slowdown with w8a8=auto is reproducible.
  • It affects every inference step, not only the first one.
  • On this machine, w8a8 makes both prefill and decode much slower:
    • prefill drops from ~1150 tok/s to ~130 tok/s
    • decode drops from ~24 tok/s to ~2.5-2.8 tok/s
  • The Cider conversion itself takes ~27.1s, but the main issue is that subsequent steps remain slow.
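The summary figures can be turned into slowdown factors with a couple of divisions. A minimal sketch using only the throughput numbers reported above (w8a8 off vs. w8a8=auto):

```python
# Throughput figures from the summary above: (w8a8 off, w8a8=auto), in tok/s.
measurements = {
    "prefill": (1150.0, 130.0),
    "decode (best w8a8 run)": (24.0, 2.8),
    "decode (worst w8a8 run)": (24.0, 2.5),
}


def slowdown(baseline: float, quantized: float) -> float:
    """How many times slower the w8a8 path is than the baseline."""
    return baseline / quantized


for name, (off, w8a8) in measurements.items():
    print(f"{name}: {slowdown(off, w8a8):.1f}x slower")
```

This gives about 8.8x for prefill and 8.6-9.6x for decode, i.e. roughly a 9x slowdown on both, while peak memory in the logs stays about the same (6.3GB with w8a8=auto vs. ~6.4GB with it off).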
