A2Fast: portable conv + memory optimizations (up to 1.6x on channels=8) by jfsantos · Pull Request #277 · sdatkinson/NeuralAmpModelerCore

jfsantos · 2026-06-08T17:54:30Z

Optimizes the A2-shape fast-path WaveNet (a2_fast.cpp) with portable, intrinsic-free changes. No change to model output; the generic WaveNet path is untouched.

Changes

T=4 frame-tiled conv (channels ≥ 4). Replaces the per-tap Eigen GEMM with a tap-major kernel that keeps a T×Channels accumulator local and reuses each weight column across 4 frames. The compiler emits SIMD broadcast-FMA (vmlaq_n_f32 on AArch64, AVX equivalents on x86) — the same weight amortization as a hand-written microkernel, without intrinsics.
Direct head-history writes. Layers write/accumulate activations straight into _head_history at the ring slot (IsFirst selects = vs +=), eliminating the separate _head_sum buffer, its per-block zeroing pass, and the head ring-write copy.
Dead terminal-residual elimination. The final layer's layer1x1 projection + residual store feed nothing downstream, so they're elided via an IsLast template parameter.
Single history arena. All 23 layers' ring buffers come from one contiguous allocation, spreading base addresses across cache sets to reduce L1 set-conflict misses on narrow caches (e.g. Cortex-A8).
Mirror-on-demand ring writes. Instead of a full mbs×Channels mirror refresh every block, predict each tap's next read range and mirror only the columns that actually overflow past pow2_size — frequently zero.
CMake: default to Release when no build type is set (an empty CMAKE_BUILD_TYPE compiles unoptimized, ~10x slower for Eigen-heavy code).

Benchmark (desktop, Apple Silicon, fast-path p50 µs/block, vs `main`)

Both builds forced to -DCMAKE_BUILD_TYPE=Release to isolate the code changes. The unchanged generic WaveNet ran as a control and stayed flat in both builds, confirming the deltas are real.

buffer	a2lite (ch=3)	a2full (ch=8)
32	1.00x	1.58x
48	1.04x	1.43x
64	1.04x	1.39x
128	1.07x	1.29x

Channels=8 sees 1.3–1.6x from the tiled conv. Channels=3 is roughly flat on desktop; the arena/mirror changes target narrow embedded caches (Cortex-A8/M7) and aren't captured by a desktop run.

Some of these optimizations were contributed by landon-chaos, as he found them out while optimizing Core to run on his pedals :)

Previously each of the 23 layers held its own std::vector<float> history. Heap allocators can place these at addresses that share the same L1 cache sets, causing conflict misses on the wide-dilation layers whose tap reads, _layer_in writes, and head-history writes simultaneously compete for cache. Replace with a single contiguous allocation (_history_arena) from which each layer takes a raw pointer slice. Sequential placement naturally distributes base addresses across cache sets because each slot size mod the set stride is non-zero. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>

Previously each layer accumulated its post-activation output into a separate _head_sum buffer, which was then zeroed before every block (memset of Channels*num_frames floats) and copied into _head_history by _head_ring_write (another memcpy of the same size) after all layers finished. Replace with direct writes: layer 0 assigns (IsFirst=true template parameter), layers 1-22 accumulate (IsFirst=false). The IsFirst branch resolves at compile time, so there is no runtime branch in the hot loop. _head_ring_write is removed; process() now manages the head ring write position directly, rewinding with a cheap memmove of kHeadKernelSize-1 columns when the end of the buffer is reached. Also eliminates ztile.setZero() for the Channels=8 Eigen path: tap 0 now assigns into ztile (noalias() =) rather than accumulating on top of a zero-initialised matrix, saving one Channels*num_frames write per layer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>

Previously _ring_write always refreshed the full tail mirror — a memcpy of maxBufferSize * Channels * sizeof(float) bytes per layer per block (4 KB for Ch=8 / 128 frames, 92 KB total across 23 layers per buffer). Most of the time no tap read crosses the pow2_size boundary, so the mirror copy was pure overhead. Replace with mirror-on-demand: after each write, compute each tap's next-call read range and mirror only the columns that actually overflow past pow2_size. Zero copy is emitted whenever no tap straddles the wrap. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>

Replaces the Eigen noalias() GEMM loop with an explicit T=4 frame-tiled, tap-major C++ kernel. Structure of the inner body: a[f][o] += Wcol[o] * h[f] (o = output channel, f = frame in tile) Wcol = W[:,cp] is stride-1 in o, so the o loop vectorizes. h[f] is a scalar loaded from the (column-major) history buffer, so the compiler emits a broadcast-scalar FMA instruction — vmlaq_n_f32 on AArch64, the AVX equivalent on x86 — without requiring architecture-specific intrinsics. The T=4 tiling amortizes the two weight register loads (W[:,cp] = 8 floats = 2 Q-regs on AArch64) across four independent accumulator chains, matching the weight-reuse strategy of the explicit NEON microkernel. A scalar tail handles buffer sizes that are not multiples of 4; in practice audio buffer sizes (64, 128, 256) are always multiples of 4. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>

…s.txt to compile to Release by default.

Allman braces for the new if constexpr / dispatcher blocks, no single-line loops, and collapse the manually-aligned declarations. No behavioral change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

sdatkinson

CMakeLists.txt but otherwise LGTM

sdatkinson · 2026-06-08T22:09:51Z

+    // broadcast) with weight vector Wcol reused across all T frames — the same
+    // weight amortisation as an explicit NEON microkernel, without intrinsics.
+    // Equivalent SIMD broadcast-FMA instructions are emitted on x86 (AVX) and
+    // other SIMD targets automatically.


sdatkinson · 2026-06-08T22:12:05Z


 set(CMAKE_MODULE_PATH "${CMAKE_SOURCE_DIR}/cmake")

+# Default to a Release build: without this, an empty CMAKE_BUILD_TYPE compiles


I think I prefer defaulting to Debug--slower performance is very noticeable; losing your assertions when running tests, less so :)

sdatkinson

LGTM!

João Felipe Santos and others added 6 commits June 4, 2026 13:57

Dead terminal-residual elimination optimization + update to CMakeList…

78f92a2

…s.txt to compile to Release by default.

Reformat A2Fast optimizations to repo clang-format style

02d2277

Allman braces for the new if constexpr / dispatcher blocks, no single-line loops, and collapse the manually-aligned declarations. No behavioral change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

sdatkinson requested changes Jun 8, 2026

View reviewed changes

Default to Debug build when no build type is specified

14caf5c

sdatkinson approved these changes Jun 8, 2026

View reviewed changes

sdatkinson merged commit baf1bf8 into sdatkinson:main Jun 8, 2026
5 of 6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A2Fast: portable conv + memory optimizations (up to 1.6x on channels=8)#277

A2Fast: portable conv + memory optimizations (up to 1.6x on channels=8)#277
sdatkinson merged 7 commits into
sdatkinson:mainfrom
jfsantos:perf/a2-portable-optimizations

jfsantos commented Jun 8, 2026 •

edited

Loading

Uh oh!

sdatkinson left a comment

Uh oh!

sdatkinson Jun 8, 2026

Uh oh!

sdatkinson Jun 8, 2026

Uh oh!

sdatkinson left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		set(CMAKE_MODULE_PATH "${CMAKE_SOURCE_DIR}/cmake")

		# Default to a Release build: without this, an empty CMAKE_BUILD_TYPE compiles

Conversation

jfsantos commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Benchmark (desktop, Apple Silicon, fast-path p50 µs/block, vs main)

Uh oh!

sdatkinson left a comment

Choose a reason for hiding this comment

Uh oh!

sdatkinson Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

sdatkinson Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

sdatkinson left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jfsantos commented Jun 8, 2026 •

edited

Loading

Benchmark (desktop, Apple Silicon, fast-path p50 µs/block, vs `main`)