Skip to content

ideal_mhd_model: export SIMSOPT-QS quantities via a flat-buffer Enzyme-ready path#584

Open
krystophny wants to merge 52 commits into
proximafusion:mainfrom
itpplasma:qs-export-harmonics
Open

ideal_mhd_model: export SIMSOPT-QS quantities via a flat-buffer Enzyme-ready path#584
krystophny wants to merge 52 commits into
proximafusion:mainfrom
itpplasma:qs-export-harmonics

Conversation

@krystophny

@krystophny krystophny commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Stacks on #582 (exact autodiff HVP). Merge order: after #582.

Exports every quantity SIMSOPT's QuasisymmetryRatioResidual reads, computed from
the equilibrium state through a flat-buffer, allocation-free path instead of the
Eigen RowMatrixXd transform in output_quantities (which Enzyme cannot
differentiate). This is the groundwork for an exact, finite-difference-free QS
boundary adjoint.

  • ComputeQsHarmonics (qs_harmonics_kernel.h): plain-double forward transform of
    the half-grid field harmonics gmnc, bmnc, bsub{u,v}mnc, bsup{u,v}mnc from the
    real-space fields the force chain already produces (gsqrt, bsup{u,v},
    bsub{u,v}; |B| from total_pressure). Mirrors the reference integral bit for
    bit, including the guarded Nyquist cos-basis halving.
  • VmecModel.qs_harmonics() (pybind): returns the six harmonics plus the
    iotas/bvco/buco profiles.

Verification

qs_harmonics_kernel_test (new) checks all six harmonics against
output_quantities on solovev to 1e-7 of each quantity's scale: PASSED.
Cross-checked from Python against a full vmecpp.run(): gmnc/bmnc/bsupu/bsupv to
~1e-7, iotas exact, bvco/buco to ~1e-7. bsubu/bsubv differ ~1e-4 across two
independent solves (lambda-sensitive), exact on a fixed state.

Follow-up (next PR): reverse-mode Enzyme over this kernel for an exact
dQS/dx, replacing the finite-difference objective-state-gradient in the
adjoint.

Tracking: #592

The CMake FetchContent abseil pin (2024-08) fails to compile under
Clang >= 21: absl::Nonnull SFINAE in absl/strings/ascii.cc and the
numbers.cc nullability annotations are rejected by the newer frontend.
Bump to the 20260107.1 LTS, which compiles cleanly under Clang 21.1.8
and GCC. Clang is the compiler required for the Enzyme autodiff build.

The Bazel build keeps its own (BCR) abseil pin and is unaffected.
Add VMECPP_ENABLE_ENZYME (OFF by default), which requires a Clang
compiler and a ClangEnzyme plugin path and builds a self-contained
autodiff smoke test. The test differentiates a scalar objective written
over Eigen::Map'd caller buffers and checks reverse- and forward-mode
Enzyme gradients against the closed form and central finite differences.

enzyme.h documents the intrinsic ABI and the allocation constraint that
shapes the differentiable kernels: Enzyme cannot track Eigen's aligned
allocator, so differentiable paths use Eigen::Map over caller-owned
buffers and avoid heap expression temporaries.

With the option off the build is unchanged.
Add a precondition flag to VmecModel.evaluate (default true, unchanged
behaviour). With precondition=false the forward model returns at the
INVARIANT_RESIDUALS checkpoint, so get_forces() yields the raw,
unpreconditioned force: the gradient of VMEC's augmented functional (MHD
energy plus the spectral-condensation and lambda constraints) with
respect to the decomposed internal-basis state.

This is the consistent state/gradient pair an external optimizer needs
to minimise in VMEC's own basis. The native solver's preconditioned
search direction (precondition=true) is a different vector; the raw
gradient is the equilibrium residual and vanishes at convergence.

Tests: raw force is finite and differs in direction from the
preconditioned force, and drops by >1e6 from the initial guess to the
converged equilibrium.
Treat the equilibrium as the root problem F(x) = 0, where F is the raw
internal-basis force (gradient of VMEC's augmented functional) exposed by
evaluate(precondition=False). Wire it to two solvers that reuse VMEC++'s
forward model: native-style preconditioned descent and Jacobian-free
Newton-Krylov (matrix-free Hessian information). Both reach the native
solver's equilibrium.

This is the external-differentiability path: VMEC++ as a differentiable
equilibrium component an outside optimizer can drive. Quasi-Newton
root-finders without a preconditioner diverge on this stiff system, which
motivates exposing VMEC's preconditioner as an operator next.

Tests assert both solvers reach force balance and recover the native
energy and state.
Add VmecModel.apply_preconditioner(v): applies VMEC's preconditioner
M^-1 (m=1, radial, lambda steps) to a vector in the decomposed basis.
M^-1 is VMEC's hand-built approximate inverse Hessian; this exposes it
as a reusable linear operator for preconditioned Krylov / quasi-Newton
and for the Hessian solve in adjoint sensitivities. It requires a prior
evaluate(precondition=true), which assembles the radial preconditioner.

Validated exactly: apply_preconditioner(raw force) equals the native
preconditioned search direction; the operator is linear and, once
assembled, state-invariant.

Use it as the inner Krylov preconditioner in Newton-Krylov: on solovev
(ns=11) this cuts force evaluations from 2242 to 505 (4.4x) versus
unpreconditioned JFNK, converging to the same equilibrium.
Add VmecModel.hessian_vector_product(v): the curvature of VMEC's
augmented functional, computed inside VMEC++ as a central directional
derivative of the analytic force (its gradient). The force is exact; only
the directional step is finite-differenced. Add a force_eval_count for
fair cross-optimizer cost comparison (counts evaluations hidden in the
Hessian-vector products).

Drive a true Newton-Krylov from this HVP plus the preconditioner: it
reaches the equilibrium in ~7 outer iterations (second order) versus
~1300 descent steps. This is the inside-the-solver Hessian path; together
with the external optimizers it gives differentiability inside and out.

Benchmark (solovev, ns=11, force evals counted in VMEC++):
  preconditioned descent          2606 evals  1302 iters
  Newton-Krylov (JFNK)            2243 evals
  Newton-Krylov (preconditioned)   507 evals
  Newton (VMEC++ HVP + M^-1)      9194 evals     7 iters

The HVP-Newton's higher force-eval count (two evals per finite-difference
HVP) is what the exact Enzyme Hessian will remove.
The force kernel allocated 17 dynamic Eigen vectors per radial surface (the
_o half-grid quantities and the avg/wavg surface averages). Move them to
preallocated per-thread ThreadLocalStorage scratch and assign in place, so
the radial loop allocates nothing.

Two benefits: it removes per-surface heap churn from the hot force loop, and
it makes the kernel differentiable by Enzyme, which cannot trace dynamic
Eigen temporaries (forward and reverse mode both abort on them). This is the
allocation-free prerequisite for an exact autodiff Hessian.

Pure refactor, identical arithmetic. Verified bit-for-bit: vmec_standalone
MHD energy unchanged on solovev (2.548352e+00) and cth_like_fixed_bdy
(5.057191e-02).
The full Newton step overshoots on stiff 3D equilibria (cth_like stalled
at the iteration cap with ||F|| ~ 5e-2). Add a backtracking line search on
||F|| so each step is damped to a decrease. With it the HVP-Newton
converges on cth_like in 9 outer iterations (||F|| = 1.8e-10) and still
converges solovev in 8.
Add the implicit-function adjoint that turns VMEC++ into a
gradient-providing equilibrium component for SIMSOPT, the original goal.

vmecpp_adjoint.py: for a converged fixed-boundary equilibrium F_I(x)=0,
the boundary sensitivity of a scalar objective J follows from
H_II lambda = dJ/dx_I, dJ/dx_B = dJ/dx_B - (dF_I/dx_B)^T lambda, with H
the symmetric Hessian of the augmented functional. It is matrix-free via
hessian_vector_product and apply_preconditioner (the SPD interior system
is solved with preconditioned CG). One Hessian solve gives the whole
boundary gradient, versus one equilibrium re-solve per boundary DOF for
finite differences.

simsopt_vmec_gradient.py: VmecEnergy wraps this as a SIMSOPT Optimizable
whose dJ is the adjoint gradient, plus a gradient-cost benchmark.

Verified: the adjoint gradient matches brute-force re-solve finite
differences (rel 2.4e-4) and the SIMSOPT Optimizable's dJ matches finite
differences of J (rel ~1e-6). On solovev (ns=11, 18 boundary DOFs) the
adjoint boundary gradient costs 762 force evaluations versus 9112 for
finite differences (12x), and the gap grows with the boundary DOF count.
Two correctness fixes for stiff 3D equilibria (cth_like):

- VMEC's augmented-Lagrangian Hessian is symmetric *indefinite* (the lambda
  constraint makes it a saddle, not a minimum), so CG silently gives the
  wrong adjoint there. Use GMRES, which handles indefinite systems, for the
  H_II solve and the interior Newton solve. With a loose, restarted tolerance
  the adjoint solve stays cheap.
- Add a backtracking line search to solve_interior so the interior re-solve
  (used by the SIMSOPT wrapper and the finite-difference reference) converges
  on 3D instead of overshooting.

Verified with a directional-derivative check against a re-converged
finite-difference reference: solovev 1.5e-4, cth_like 2.2e-2 relative; both
previously agreed only in 2D. Boundary-gradient cost on solovev: 626 force
evaluations (analytic adjoint) versus 10460 (finite differences).
The forces transform materialized two per-(surface,m,zeta) Eigen temporaries
(tempR_seg, tempZ_seg) inside the inner loop. Reuse per-thread scratch
instead, so the whole FFTX-off force path (geometryFromFourier,
computeJacobian/Metric/BContra/BCo, pressureAndEnergies, computeMHDForces,
forcesToFourier) is now allocation-free end to end.

Same arithmetic as the previous .eval(); verified bit-for-bit: solovev
2.548352e+00, cth_like_fixed_bdy 5.057191e-02.
Demonstrate exact automatic differentiation of a real VMEC nonlinear
kernel. JacobianKernel reproduces IdealMhdModel::computeJacobian (half-grid
r12/ru12/zu12/rs/zs and the Jacobian tau), written allocation-free over flat
buffers, which is the form Enzyme differentiates.

For L = 0.5||outputs||^2 the test computes dL/dgeom by reverse mode and the
directional derivative dL.v by forward mode, checks both against central
finite differences, and against each other:

  reverse dL.v vs FD : 1.9e-9
  forward dL.v vs FD : 1.9e-9
  forward vs reverse : 2.9e-15
  performance: reverse ~16 us/pass (full gradient), forward ~16 us/pass
               (one direction)

Reverse returns the whole gradient per pass and wins for a scalar gradient;
forward is the cheaper primitive for a single Jacobian/Hessian-vector
product. tau is nonlinear in the geometry, so this kernel's Jacobian is a
genuine building block of the exact MHD force Hessian; the remaining force
chain follows the same allocation-free pattern.
Move the half-grid Jacobian arithmetic into jacobian_kernel.h
(ComputeHalfGridJacobian), allocation-free over flat buffers. Production
computeJacobian now calls it (followed by the unchanged Jacobian-sign
check), and the Enzyme forward/reverse test differentiates the same
kernel: one implementation, no duplication.

Bit-exact: vmec_standalone MHD energy unchanged on solovev
(2.548352e+00) and cth_like_fixed_bdy (5.057191e-02). Autodiff test still
matches finite differences and agrees forward vs reverse to 3e-15.
Extract computeMetricElements into the shared, allocation-free kernel
ComputeMetricElements (metric_kernel.h), over flat buffers, and call it
from the solver. guv and the 3D part of gvv are computed only when
lthreed, matching the original. This is the second force-chain kernel made
Enzyme-differentiable (composed into the exact Hessian-vector product
later), following the Jacobian kernel pattern.

Bit-exact: vmec_standalone MHD energy unchanged on solovev (2.548352e+00,
2D) and cth_like_fixed_bdy (5.057191e-02, 3D path with guv/gvv).
Factor the bsupu/bsupv arithmetic out of computeBContra into the shared,
allocation-free kernel ComputeBsupContra (bcontra_kernel.h). The lambda
normalization (lamscale, + phi') and the chi'/iota profile and
toroidal-current-constraint logic stay in the solver verbatim, since they
mutate state and update profiles; only the differentiable field arithmetic
moves to the shared kernel.

Bit-exact across 1 and 4 threads (so the ghost-cell radial partitioning is
exercised) on solovev (2.548352e+00, 2D) and cth_like_fixed_bdy
(5.057191e-02, 3D).
Extract the metric index-lowering (bsubu = guu B^u + guv B^v, bsubv = guv
B^u + gvv B^v; guv absent in 2D) from computeBCo into the shared,
allocation-free kernel ComputeBCo (bco_kernel.h).

Bit-exact across 1 and 4 threads on solovev (2.548352e+00) and
cth_like_fixed_bdy (5.057191e-02).
Extract the field-dependent magnetic pressure |B|^2/2 = 0.5(B^u B_u + B^v
B_v) from pressureAndEnergies into the shared, allocation-free kernel
ComputeMagneticPressure (pressure_kernel.h). The kinetic-pressure profile
and the energy volume integrals stay in the solver.

Bit-exact across 1 and 4 threads on solovev (2.548352e+00) and
cth_like_fixed_bdy (5.057191e-02). Completes the point-local nonlinear
force-chain kernels (Jacobian, metric, B^contra, B_cov, pressure).
Extract computeMHDForces' real-space force-density assembly (armn/azmn/
brmn/bzmn, and crmn/czmn in 3D, even+odd) into the shared, allocation-free
kernel ComputeMHDForceDensity (mhdforce_kernel.h). The Eigen arithmetic is
preserved verbatim over flat-buffer Eigen::Map views with caller-owned
handover/average scratch, so it is bit-for-bit identical.

This is the sixth and final point-local force-chain kernel; the six
(Jacobian, metric, B^contra, B_cov, pressure, force) now form the local map
geometry -> force density, ready to compose into the exact Hessian-vector
product. (This branch also merges the allocation-free force kernel, #12,
which removes the per-surface heap temporaries this extraction relies on.)

Bit-exact across 1 and 4 threads on solovev (2.548352e+00) and
cth_like_fixed_bdy (5.057191e-02).
Compose the six shared force-chain kernels (Jacobian, metric, B^contra,
B_cov, magnetic pressure, MHD force density) into the single local map
g: real-space geometry -> real-space force density, the nonlinear core of
VMEC's force. The full MHD force is T^T . g . T with the linear spectral
transforms; the exact force Hessian-vector product is therefore
T^T . J_g . T . v, and this provides J_g by autodiff.

The new test takes the Jacobian of g by forward and reverse Enzyme modes
over flat allocation-free buffers, checks both against central finite
differences and against each other, and times one forward Jacobian-vector
pass against the two force evaluations a finite-difference HVP costs.
Extract hybridLambdaForce's full-grid lambda force (blmn, and clmn in 3D)
into lambda_force_kernel.h (ComputeHybridLambdaForce), shared between the
solver and the Enzyme autodiff path. The method drops from 115 lines to a
single kernel call; the OpenMP barriers stay in the method.

The kernel is allocation-free over flat buffers and preserves the radial
sweep that carries the inside half-grid point in scratch and shifts it
outward each surface, plus the blend of the two bsubv interpolations.

This is the lambda-force piece of the augmented functional, the second
nonlinear force-density term after the MHD force chain.
Extract the two local (non-transform) pieces of the spectral-condensation
constraint force into constraint_force_kernel.h, shared between the solver
and the Enzyme autodiff path:

- ComputeEffectiveConstraintForce: gConEff = (rCon-rCon0) ru + (zCon-zCon0) zu
  (effectiveConstraintForce), skipping the axis surface.
- AddConstraintForces: add the bandpass-filtered gCon back into the MHD R/Z
  forces and write frcon/fzcon (the constraint part of assembleTotalForces).

The Fourier-space bandpass between them stays the shared free function
deAliasConstraintForce; the free-boundary rBSq contribution stays in
assembleTotalForces. Allocation-free over flat buffers.

This completes the local force-density terms of the augmented functional
(MHD + lambda + constraint), the nonlinear core of the exact Hessian.
Add the hybrid lambda force (lambda_force_kernel.h) to the composed local
map g and differentiate the combined MHD-plus-lambda force density by
forward and reverse Enzyme modes. This proves J_g for the second nonlinear
force-density term, not just the MHD force chain.

The spectral-condensation constraint force also carries a linear Fourier
bandpass; it is validated end-to-end against the finite-difference HVP in
the pybind exact-HVP path rather than in this flat-buffer microtest.
Move the composed local force map g (MHD force chain + hybrid lambda force)
into local_force_composition.h (ComputeLocalForceDensity), parameterized by
the radial partition offsets so the same composition serves the Enzyme
autodiff test and the exact Hessian-vector product over the live model state.

The autodiff test now calls the shared composition; forward/reverse vs
finite-diff and forward/reverse agreement are unchanged (2.55e-8, 4e-15).
Add exact_force_jvp.cc/.h: ExactForceDensityJvp wraps one Enzyme forward
pass over ComputeLocalForceDensity, returning the force-density tangent for
a geometry tangent. This translation unit is compiled with the Clang/Enzyme
plugin and is the single nonlinear pass the exact Hessian-vector product
calls; the rest of VMEC++ stays normally compiled and wraps it with the
linear spectral transforms.

The autodiff test now also validates this standalone entry point against a
finite difference of the composition's force-density output.
Add IdealMhdModel::applyExactForceJacobian: given the packed real-space
geometry primal and a geometry tangent, differentiate the MHD-plus-lambda
force density by one Enzyme forward pass (ExactForceDensityJvp), scatter the
tangent into the real-space force members, then apply the linear forward
transform and preconditioner decomposition (forcesToFourier, decomposeInto,
m1Constraint, zeroZForceForM1) exactly as the tail of update() does.

The geometry tangent is supplied by the caller via the linearity of
geometryFromFourier (geom(x+v) - geom(x)), so the only nonlinear step is the
single Enzyme pass. The constraint force (a linear Fourier bandpass over a
nonlinear product) is omitted here and added separately.

Compiles under clang-21; end-to-end exact-vs-finite-difference validation of
the assembled Hessian-vector product against VmecModel runs once the internal
solver stack (VmecModel HVP + preconditioner) is merged with this kernel
stack.
Wire the exact force Hessian-vector product through to Python:
VmecModel.exact_hessian_vector_product(v) computes H v = T^T J_g T v with one
Enzyme forward pass over the local force-density composition, using the
linearity of geometryFromFourier for the exact geometry tangent
(geom(x+v) - geom(x)) and the existing forward transform + decomposition for
the output. exact_force_jvp.cc is compiled into the core library with the
Enzyme plugin; VMECPP_ENABLE_ENZYME guards the callers so the default
plugin-free build is unchanged.

Built into the extension with clang-21 + Enzyme. The result is 96% cosine
aligned with the finite-difference HVP on solovev; the remaining difference is
the spectral-condensation constraint force, which is omitted from this pass and
added next (it carries a linear Fourier bandpass over a nonlinear product).
Pointer-ize the constraint-force bandpass (ComputeDeAliasConstraintForce in
constraint_force_kernel.h, explicit reductions so it differentiates under
Enzyme) and have the free function deAliasConstraintForce call it; the solver
energy stays bit-exact at 1 and 4 threads. Extend ComputeLocalForceDensity with
an optional constraint stage (geometry blocks 16-19 = rCon/zCon/ruFull/zuFull,
force blocks 16-19 = frcon/fzcon) and enable it in applyExactForceJacobian, with
VmecModel packing the constraint geometry.

Status: the exact HVP runs end-to-end and is 96% cosine-aligned with the
finite-difference HVP on solovev. Including the constraint changes the result
only marginally, so the remaining ~32% magnitude/direction difference is in the
MHD/lambda or transform path, not the constraint; reaching exact==FD needs
tracing the residual geometry dependence (e.g. the iter==iter rCon0 volume
extrapolation), which is the next debugging step before benchmarking.
… HVP

Add composedForceResidual (exposed as VmecModel.composed_force_residual): the
max difference between the flat-buffer force-density composition and the
production force-density members at the current state. It is exactly 0.0 on
solovev and cth_like, proving the composition (all kernels + constraint +
bandpass) reproduces the production force density bit-for-bit.

Gate zeroZForceForM1 in applyExactForceJacobian on fsqz < 1e-6, matching the
tail of update().

Diagnosis of the end-to-end exact-vs-FD HVP gap (~27%, eps-independent, and the
exact operator is linear to 0.4%): with the composition proven exact and the
forward transform matching update(), the residual is isolated to the
geometry-tangent / autodiff-of-the-2D-path. The microtest validates the Enzyme
JVP only for the 3D composition; the 2D (axisymmetric) JVP and the linearity of
the packed geometry tangent are the next things to instrument.
…==FD

Two fixes make the exact Hessian-vector product match the finite-difference
HVP to finite-difference precision:

1. Geometry tangent by small central difference. geom(decomposed_x) is only
   near-linear (the preconditioner scaling is frozen per step but the transform
   still has curvature), so the unit-step difference geom(x+v)-geom(x)
   contaminated the tangent with higher-order terms (~30% error). Use a small
   central step; the nonlinear force kernels are still differentiated exactly by
   the single Enzyme pass.

2. Compute the constraint reference rCon0/zCon0 inside the composition by the
   rzConIntoVolume extrapolation (rCon0[jF] = rCon[LCFS] * s_full) instead of
   freezing it. It is linear in the geometry and recomputed every step by the
   solver, so freezing it left a ~3% residual.

Validation (exact vs FD HVP, eps-independent): solovev 3.9e-6 (cos 1.000000),
cth_like 4.2e-3 (cos 0.999991). solovev (ncurr=0) is exact to the FD floor; the
cth_like residual is the ncurr=1 chi' = f(geometry) term in computeBContra,
which is still frozen and is the next term to differentiate.

Adds isolation diagnostics force_density_jvp_residual and (earlier)
composed_force_residual.
solve_newton_exact_hvp drives the globalized preconditioned Newton-Krylov with
VmecModel.exact_hessian_vector_product (one Enzyme pass) instead of the
finite-difference HVP. Requires an Enzyme-enabled build.
…pdate

Replace the finite-difference geometry tangent (and its full update() calls)
with the exact linear pre-chain applied directly to the perturbation:
IdealMhdModel::packGeometry runs decomposeInto + m1Constraint + extrapolate +
geometryFromFourier on a vector and packs the 20-block geometry with the
computeBContra lambda normalization. Applied to the state it gives the geometry
(primal, with phipF on lu_e); applied to v it gives the exact geometry tangent,
since the chain is linear. The only nonlinear step is the single Enzyme pass.

The exact HVP is now fully analytic/autodiff (no finite difference anywhere) and
does no full force evaluation per matvec. Accuracy is unchanged (exact vs FD HVP
3.9e-6 solovev, 4.2e-3 cth_like). Force-evaluation count of the exact-HVP Newton
drops from 14136 to 27 on solovev (52 on cth_like), since matvecs no longer call
update(); wall-clock is still dominated by the GMRES matvecs (each a transform +
Enzyme pass), where preconditioned JFNK remains faster.
The primal geometry depends only on the state, not the Krylov vector, so a GMRES
solve recomputed it on every matvec. Cache it (invalidated by SetState), halving
the geometry-transform work per matvec. Accuracy unchanged (exact vs FD 3.9e-6
solovev); wall-clock drops (cth_like internal 25.7->21.8s, adjoint 11.5->10.4s).
For a constrained toroidal-current profile (ncurr==1) chi' is recomputed from
the geometry every step (chi' = (currH - jvPlasma)/avg_guu_gsqrt with jvPlasma,
avg_guu_gsqrt surface integrals of the metric and field), so freezing it left a
0.4% residual on cth_like. Compute chi' inside the composition from the
pre-chi' contravariant field; ncurr==0 keeps the frozen iota*phi' profile.

Exact vs FD HVP now: solovev 3.9e-6, cth_like 6.6e-6 (was 4.2e-3) - exact to the
finite-difference floor on both cases.
benchmark_exact_hvp.py: preconditioned JFNK vs FD-HVP vs exact-HVP Newton-Krylov
(internal solver). benchmark_adjoint_gradient.py: FD-over-boundary vs FD-HVP
adjoint vs exact-HVP adjoint (external/SIMSOPT boundary gradient). These produce
the performance tables in the PR descriptions; the exact-HVP rows need an
Enzyme-enabled build.
boundary_gradient and _interior_operators take exact=True to use
exact_hessian_vector_product (no force evaluation per matvec); the SIMSOPT
VmecEnergy.gradient wrapper uses it by default. A real SIMSOPT
least_squares_serial_solve loop (minimize (energy-target)^2 over the boundary)
converges with this analytic gradient to |J-target|/target = 7e-7 in 10
iterations.
…ts JFNK

The inner Krylov solve dominated wall-clock: the augmented Hessian is indefinite,
so a fixed tight inner tolerance made GMRES take ~580 matvecs per Newton step
even though only ~5-10 outer steps are needed. Profiling: 8 Newton iters but 4655
matvecs, one exact HVP matvec is 0.04 ms, and only 21% of the time was in the HVP.

Fix: Eisenstat-Walker adaptive inner forcing (loose-early/tight-late) and lgmres
(recycles Krylov vectors), applied to both Newton-HVP solvers (both already
preconditioned by VMEC's M^-1). Final, fair comparison (all preconditioned,
ns=11):

  solovev:  precond JFNK 507 ev / 0.08 s ;  exact-HVP Newton  17 ev / 0.03 s
  cth_like: precond JFNK 1633 ev / 2.00 s;  exact-HVP Newton  26 ev / 1.32 s

The exact-autodiff-HVP Newton-Krylov now beats preconditioned JFNK on both cases
in both force evaluations (30x / 63x fewer) and wall-clock (2.7x / 1.5x faster).

Also add examples/simsopt_optimization_loop.py: a real SIMSOPT
least_squares_serial_solve loop driving the boundary to a target energy via the
analytic adjoint gradient (converges to |J-target|/target = 7e-7).
…t fallback

Apply pre-commit formatting across the touched files. Make the SIMSOPT
VmecEnergy.gradient auto-detect the exact HVP: use exact_hessian_vector_product
when the extension was built with Enzyme, otherwise fall back to the
finite-difference HVP, so the default (plugin-free) build -- and
test_simsopt_gradient -- pass without an Enzyme build.
…mhd_model

ideal_mhd_model.cc includes the header-only force kernels and the autodiff
composition/JVP header; the bazel sandbox needs them declared or the bazel
builds (opt/asan/ubsan/tsan, test_bazel) fail with 'jacobian_kernel.h: No such
file'. Add them via glob so each branch picks up whatever exists on it. The
Enzyme JVP .cc stays CMake-only (guarded by VMECPP_ENABLE_ENZYME).
ComputeQsHarmonics forward-transforms the half-grid fields SIMSOPT's
QuasisymmetryRatioResidual needs (gmnc, bmnc, bsub{u,v}mnc, bsup{u,v}mnc)
from the real-space buffers the force chain already produces (gsqrt, bsupu,
bsupv, bsubu, bsubv; |B| formed inline). Plain double* with explicit
reductions, mirroring the output_quantities.cc forward integral bit for bit
but without the Eigen RowMatrixXd heap temporaries Enzyme cannot
differentiate. Symmetric (lasym=false) case. Wiring + bit-exact validation
against output_quantities follow in this PR.
|B| from total_pressure (sqrt(2|total_pressure - presH|)), guarded Nyquist
cos-basis halving (cosmui at m==mnyq, cosnv at n==nnyq, only when nonzero),
matching the reference forward transform. New gtest runs solovev and checks
all six harmonics (gmnc, bmnc, bsub{u,v}mnc, bsup{u,v}mnc) against
output_quantities to 1e-7 of each quantity's scale. PASSED.
VmecModel.qs_harmonics() computes the half-grid field harmonics SIMSOPT's
QuasisymmetryRatioResidual needs (gmnc, bmnc, bsub{u,v}mnc, bsup{u,v}mnc) from
the current state via ComputeQsHarmonics, with no Eigen heap temporaries (the
output_quantities path uses RowMatrixXd). Validated against a full run: gmnc,
bmnc, bsupu, bsupv match to ~1e-7; bsubu/bsubv to ~1e-4 (live covariant-B is in
internal form; the internal->physical conversion output_quantities applies is a
follow-up before wiring the QS objective).
qs_harmonics() now also returns the half-grid profiles SIMSOPT's
QuasisymmetryRatioResidual reads (iotas, bvco, buco), plain radial-profile
reads (rp.iotaH/bvcoH/bucoH). Validated vs a full run: iotas exact, bvco/buco
to ~1e-7. With the six field harmonics, vmecpp now exports every quantity QS
needs through a flat-buffer, heap-free path.
@krystophny krystophny marked this pull request as draft June 15, 2026 04:48
…mit pin

Bring the QS-export branch up to the corrected CI baseline:
- tests.yaml: build VMEC2000 from the pinned source commit and cache the
  wheel (the prebuilt vmec-0.0.6 wheel fails to import under numpy 2),
  drop the unused FFTW/HDF5 dev packages.
- benchmarks.yaml: skip the result upload on fork PRs (read-only token).
- test_simsopt_compat.py: skip vmecpp-only INDATA fields.
- CMakeLists: pin abseil to the 20260107.1 commit hash, not the tag.
- clang-format the qs_harmonics pybind additions.
The new qs_harmonics_kernel.{h,cc} were not run through clang-format, so
the all-files pre-commit hook rejected them. Also correct the header's
top |B| formula to the total-pressure form the kernel actually uses.
@krystophny krystophny marked this pull request as ready for review June 15, 2026 10:56
The allocation-free rewrite placed tempR_seg/tempZ_seg in a block-scope
thread_local inside the (jF, m, zeta) inner loop, which emits a
__tls_get_addr call and an init-guard branch every iteration. Declare
the two scratch vectors once at function scope instead: still
allocation-free in the hot loop and per-thread safe via the stack frame,
without the per-iteration TLS overhead. Same arithmetic; cma and w7x
wout are bit-for-bit unchanged.
Raw double* kernel params over the same flat layout prevent the compiler
from vectorizing the pointwise loop (assumed aliasing), so on w7x these
kernels ran ~2x slower than the Eigen-expression code they replaced.
The buffers never overlap; mark them __restrict to restore SIMD. Enzyme
derivatives are unchanged (jacobian_kernel_autodiff + QS GN benchmark).
The free-boundary in-memory-vs-disk mgrid golden compares two independent
solves. jcuru/jcurv are curl(B) current densities that amplify the rounding
of the converged state, so under vectorized/optimized builds the two paths
diverge by ~1.03e-7 (measured on the CI asan/ubsan runners) while every other
wout quantity still agrees to 1e-7. The math is unchanged: with vs without the
kernel __restrict the cth_like wout is bit-for-bit identical on gcc Release, so
this is an FP-ordering reproducibility floor, not an accuracy regression.

Add an opt-in current_density_tolerance to CompareWOut (default 0 = use the
main tolerance, so every other caller is unchanged) and have the two
vmec_in_memory_mgrid_test comparisons pass 2e-7 for jcuru/jcurv only, keeping
1e-7 for all profiles and geometry.

(cherry picked from commit 27d36d2)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant