Skip to content

A2Fast: portable conv + memory optimizations (up to 1.6x on channels=8)#277

Merged
sdatkinson merged 7 commits into
sdatkinson:mainfrom
jfsantos:perf/a2-portable-optimizations
Jun 8, 2026
Merged

A2Fast: portable conv + memory optimizations (up to 1.6x on channels=8)#277
sdatkinson merged 7 commits into
sdatkinson:mainfrom
jfsantos:perf/a2-portable-optimizations

Conversation

@jfsantos

@jfsantos jfsantos commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Optimizes the A2-shape fast-path WaveNet (a2_fast.cpp) with portable, intrinsic-free changes. No change to model output; the generic WaveNet path is untouched.

Changes

  • T=4 frame-tiled conv (channels ≥ 4). Replaces the per-tap Eigen GEMM with a tap-major kernel that keeps a T×Channels accumulator local and reuses each weight column across 4 frames. The compiler emits SIMD broadcast-FMA (vmlaq_n_f32 on AArch64, AVX equivalents on x86) — the same weight amortization as a hand-written microkernel, without intrinsics.
  • Direct head-history writes. Layers write/accumulate activations straight into _head_history at the ring slot (IsFirst selects = vs +=), eliminating the separate _head_sum buffer, its per-block zeroing pass, and the head ring-write copy.
  • Dead terminal-residual elimination. The final layer's layer1x1 projection + residual store feed nothing downstream, so they're elided via an IsLast template parameter.
  • Single history arena. All 23 layers' ring buffers come from one contiguous allocation, spreading base addresses across cache sets to reduce L1 set-conflict misses on narrow caches (e.g. Cortex-A8).
  • Mirror-on-demand ring writes. Instead of a full mbs×Channels mirror refresh every block, predict each tap's next read range and mirror only the columns that actually overflow past pow2_size — frequently zero.
  • CMake: default to Release when no build type is set (an empty CMAKE_BUILD_TYPE compiles unoptimized, ~10x slower for Eigen-heavy code).

Benchmark (desktop, Apple Silicon, fast-path p50 µs/block, vs main)

Both builds forced to -DCMAKE_BUILD_TYPE=Release to isolate the code changes. The unchanged generic WaveNet ran as a control and stayed flat in both builds, confirming the deltas are real.

buffer a2lite (ch=3) a2full (ch=8)
32 1.00x 1.58x
48 1.04x 1.43x
64 1.04x 1.39x
128 1.07x 1.29x

Channels=8 sees 1.3–1.6x from the tiled conv. Channels=3 is roughly flat on desktop; the arena/mirror changes target narrow embedded caches (Cortex-A8/M7) and aren't captured by a desktop run.

Some of these optimizations were contributed by landon-chaos, as he found them out while optimizing Core to run on his pedals :)

João Felipe Santos and others added 6 commits June 4, 2026 13:57
Previously each of the 23 layers held its own std::vector<float> history.
Heap allocators can place these at addresses that share the same L1 cache
sets, causing conflict misses on the wide-dilation layers whose tap reads,
_layer_in writes, and head-history writes simultaneously compete for cache.

Replace with a single contiguous allocation (_history_arena) from which
each layer takes a raw pointer slice. Sequential placement naturally
distributes base addresses across cache sets because each slot size mod
the set stride is non-zero.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>
Previously each layer accumulated its post-activation output into a
separate _head_sum buffer, which was then zeroed before every block
(memset of Channels*num_frames floats) and copied into _head_history
by _head_ring_write (another memcpy of the same size) after all layers
finished.

Replace with direct writes: layer 0 assigns (IsFirst=true template
parameter), layers 1-22 accumulate (IsFirst=false). The IsFirst branch
resolves at compile time, so there is no runtime branch in the hot loop.

_head_ring_write is removed; process() now manages the head ring write
position directly, rewinding with a cheap memmove of kHeadKernelSize-1
columns when the end of the buffer is reached.

Also eliminates ztile.setZero() for the Channels=8 Eigen path: tap 0
now assigns into ztile (noalias() =) rather than accumulating on top of
a zero-initialised matrix, saving one Channels*num_frames write per layer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>
Previously _ring_write always refreshed the full tail mirror — a memcpy
of maxBufferSize * Channels * sizeof(float) bytes per layer per block
(4 KB for Ch=8 / 128 frames, 92 KB total across 23 layers per buffer).

Most of the time no tap read crosses the pow2_size boundary, so the
mirror copy was pure overhead. Replace with mirror-on-demand: after each
write, compute each tap's next-call read range and mirror only the
columns that actually overflow past pow2_size. Zero copy is emitted
whenever no tap straddles the wrap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>
Replaces the Eigen noalias() GEMM loop with an explicit T=4 frame-tiled,
tap-major C++ kernel. Structure of the inner body:

    a[f][o] += Wcol[o] * h[f]     (o = output channel, f = frame in tile)

Wcol = W[:,cp] is stride-1 in o, so the o loop vectorizes. h[f] is a
scalar loaded from the (column-major) history buffer, so the compiler
emits a broadcast-scalar FMA instruction — vmlaq_n_f32 on AArch64, the
AVX equivalent on x86 — without requiring architecture-specific intrinsics.

The T=4 tiling amortizes the two weight register loads (W[:,cp] = 8
floats = 2 Q-regs on AArch64) across four independent accumulator chains,
matching the weight-reuse strategy of the explicit NEON microkernel.

A scalar tail handles buffer sizes that are not multiples of 4; in
practice audio buffer sizes (64, 128, 256) are always multiples of 4.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>
Allman braces for the new if constexpr / dispatcher blocks, no
single-line loops, and collapse the manually-aligned declarations.
No behavioral change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@sdatkinson sdatkinson left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMakeLists.txt but otherwise LGTM

Comment thread NAM/wavenet/a2_fast.cpp
// broadcast) with weight vector Wcol reused across all T frames — the same
// weight amortisation as an explicit NEON microkernel, without intrinsics.
// Equivalent SIMD broadcast-FMA instructions are emitted on x86 (AVX) and
// other SIMD targets automatically.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat :)

Comment thread CMakeLists.txt Outdated

set(CMAKE_MODULE_PATH "${CMAKE_SOURCE_DIR}/cmake")

# Default to a Release build: without this, an empty CMAKE_BUILD_TYPE compiles

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer defaulting to Debug--slower performance is very noticeable; losing your assertions when running tests, less so :)

@sdatkinson sdatkinson left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@sdatkinson sdatkinson merged commit baf1bf8 into sdatkinson:main Jun 8, 2026
5 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants