nn.LayerNorm two-pass variance produces ~27% more HLO reduce ops under vmap + grad

`eqx.nn.LayerNorm` uses two-pass variance (`jnp.mean` + `jnp.var`). Under `vmap` + `jax.grad`, this produces ~27% more HLO reduce ops (132 vs 104) and runs 1.45x slower than an otherwise identical model using `jax.nn.standardize` (one-pass variance: `E[x²] - E[x]²`).

## Benchmark

8-block residual MLP, `DIM=64`, `BATCH=256`, `SEQ=32`, double `vmap` (batch × seq):

| Variant | Time | HLO lines | Reduces | Broadcasts |
|---------|------|-----------|---------|------------|
| (a) `eqx.nn.LayerNorm` | 49.7 ms | 1385 | 132 | 342 |
| (b) `LayerNormNew` (drop-in) | 34.2 ms | 1011 | 104 | 238 |

The full benchmark script that produces these timings is [here](https://gist.github.com/CatchemAL/7d9301a389e710681d1d5bac6f44b506#file-eqx_layernorm_vmap-py).
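For readers who want a quick look without running the full gist, here is a minimal sketch of one way to count reduce ops. It is not the benchmark script: the function names (`norm_two_pass`, `norm_one_pass`, `count_reduces`), the toy shapes, and the sum-of-squares loss are all assumptions made for illustration, and it counts ops in the unoptimized lowered module text, so the absolute numbers will not match the table above.

```python
import re

import jax
import jax.numpy as jnp

EPS = 1e-5  # illustrative epsilon, not taken from the benchmark


def norm_two_pass(x):
    # Mirrors the mean + var formulation quoted in this issue.
    mean = jnp.mean(x, keepdims=True)
    variance = jnp.maximum(0.0, jnp.var(x, keepdims=True))
    return (x - mean) * jax.lax.rsqrt(variance + EPS)


def norm_one_pass(x):
    # Normalise over every axis of the per-example input.
    return jax.nn.standardize(x, axis=tuple(range(x.ndim)), epsilon=EPS)


def count_reduces(norm_fn, xs):
    # Double vmap (batch x seq) followed by grad, as in the benchmark setup.
    def loss(batch):
        return jnp.sum(jax.vmap(jax.vmap(norm_fn))(batch) ** 2)

    text = jax.jit(jax.grad(loss)).lower(xs).as_text()
    # Count standalone `reduce` ops in the lowered module text.
    return len(re.findall(r"\breduce\b", text))


xs = jax.random.normal(jax.random.PRNGKey(0), (4, 8, 16))
counts = {
    "two_pass": count_reduces(norm_two_pass, xs),
    "one_pass": count_reduces(norm_one_pass, xs),
}
print(counts)
```

Lowering only traces the function, so this runs quickly even on CPU. To compare post-optimization HLO instead, one could call `.compile()` on the lowered object before `.as_text()`.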

## Algorithm

`eqx.nn.LayerNorm.__call__` (in `_normalisation.py`) computes:

```python
mean = jnp.mean(x, keepdims=True)
variance = jnp.var(x, keepdims=True)        # internally: mean((x - mean)²)
variance = jnp.maximum(0.0, variance)
inv = jax.lax.rsqrt(variance + self.eps)
out = (x - mean) * inv
```

`jnp.var` is two-pass, whereas `jax.nn.standardize` uses a one-pass variance:

```python
mean = jnp.mean(x, axis, keepdims=True)
variance = jnp.mean(jnp.square(x), axis, keepdims=True) - jnp.square(mean)
```
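To make the difference concrete, the sketch below writes both variance formulas out in plain `jax.numpy` and compares them on a random vector. The helper names `two_pass_variance` and `one_pass_variance` are hypothetical and exist only for this illustration; neither is an Equinox or JAX API.

```python
import jax
import jax.numpy as jnp


def two_pass_variance(x):
    # Pass 1: the mean. Pass 2: the mean of squared deviations from it.
    mean = jnp.mean(x, keepdims=True)
    return jnp.mean(jnp.square(x - mean), keepdims=True)


def one_pass_variance(x):
    # E[x^2] - E[x]^2: both reductions read x directly, with no dependency
    # of the second reduction on the result of the first.
    mean = jnp.mean(x, keepdims=True)
    return jnp.mean(jnp.square(x), keepdims=True) - jnp.square(mean)


x = jax.random.normal(jax.random.PRNGKey(0), (64,))
v_two = two_pass_variance(x)
v_one = one_pass_variance(x)
max_abs_diff = float(jnp.max(jnp.abs(v_one - v_two)))
print(max_abs_diff)
```

The two results agree up to floating-point rounding for well-scaled inputs like this one. The one-pass form can lose precision when the mean is large relative to the spread, which is why the result is clipped at zero before the `rsqrt`.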

## Suggested fix

In `_normalisation.py`, replace the two-pass variance:

```python
mean = jnp.mean(x, keepdims=True)
variance = jnp.var(x, keepdims=True)
variance = jnp.maximum(0.0, variance)
inv = jax.lax.rsqrt(variance + self.eps)
out = (x - mean) * inv
```

with `jax.nn.standardize`:

```python
out = jax.nn.standardize(x, axis=tuple(range(x.ndim)), epsilon=self.eps)
```

This is a one-line change. `jax.nn.standardize` already handles the clipping (`jnp.clip(variance, 0)`), so `jnp.maximum` is no longer needed.
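As a rough standalone illustration of the proposed change, the sketch below compares the two formulations as free functions, including an affine step. `layer_norm_reference` and `layer_norm_standardize` are hypothetical names, and the `out * weight + bias` step is an assumption about how the scale and offset are applied; this is not the Equinox source.

```python
import jax
import jax.numpy as jnp


def layer_norm_reference(x, weight, bias, eps=1e-5):
    # Two-pass formulation, as quoted in this issue.
    mean = jnp.mean(x, keepdims=True)
    variance = jnp.maximum(0.0, jnp.var(x, keepdims=True))
    out = (x - mean) * jax.lax.rsqrt(variance + eps)
    return out * weight + bias


def layer_norm_standardize(x, weight, bias, eps=1e-5):
    # Proposed formulation: normalise over all axes with jax.nn.standardize.
    out = jax.nn.standardize(x, axis=tuple(range(x.ndim)), epsilon=eps)
    return out * weight + bias


key_x, key_w = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(key_x, (64,))
weight = jax.random.normal(key_w, (64,))
bias = jnp.zeros((64,))

ref = layer_norm_reference(x, weight, bias)
new = layer_norm_standardize(x, weight, bias)
outputs_match = bool(jnp.allclose(ref, new, atol=1e-4))
print(outputs_match)
```

The axes are passed as a tuple of ints here, which is a form `jax.nn.standardize` documents for `axis`.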

## Confirmation of Correctness

Here is a suite of [50 unit tests](https://gist.github.com/CatchemAL/c7e92e05538cf67903303894b6648a9a#file-layer_norm_tests-py) confirming that the outputs are identical, just computed faster.

## Versions

- equinox 0.13.4
- jax 0.9.0
- CPU (no GPU)

I would be happy to prepare a PR if useful.
