CUDA_ERROR_ILLEGAL_ADDRESS when TorchScript module returns CPU tensors with setOutputsForces(True)
Summary
On the CUDA platform, TorchForce with setOutputsForces(True) crashes
the OpenMM context with CUDA_ERROR_ILLEGAL_ADDRESS (700) whenever the
TorchScript module's forward() returns a forces tensor that is not
resident on the active CUDA device. The kernel-side cause is
getTensorPointer() returning a CPU data pointer that
CudaCalcTorchForceKernel::addForces then passes into
addForcesKernel as if it were a device pointer.
The call site and the kernel arrangement are both unchanged at master
HEAD as of this report.
Environment
- openmm-torch: 1.5.1 (cuda129py312he4cb518_5 from conda-forge)
- openmm: 8.5.1
- pytorch: 2.10.0 / libtorch: 2.10.0 (cuda129_mkl_hd6d2a1f_303)
- cuda-cudart: 12.9.79 (cuda-version 12.9)
- python: 3.12.13
- driver: NVIDIA 575.51.03
- GPU: Quadro GV100, compute capability 7.0
- OS: Rocky Linux 9 (kernel 5.14.0-611.54.1.el9_7.x86_64)
I also re-read platforms/cuda/src/CudaTorchKernels.cpp from master
(post-v1.5.1); getTensorPointer is unchanged, so the bug is still
live on master.
Minimal reproducer
"""Repro: CPU-resident forces from a setOutputsForces(True) module
crash the CUDA platform with CUDA_ERROR_ILLEGAL_ADDRESS."""
import numpy as np
import torch as pt
import openmm
from openmmtorch import TorchForce
class CPUOutputForce(pt.nn.Module):
    """Returns forces on CPU regardless of positions device."""
    def forward(self, positions):
        n = positions.shape[0]
        energy = pt.tensor([0.0])  # CPU
        forces = pt.zeros(n, 3)    # CPU
        forces[:, 0] = 5.0
        return energy, forces
module = pt.jit.script(CPUOutputForce())
module.save("/tmp/cpu_out.pt")
system = openmm.System()
system.addParticle(1.0)
tforce = TorchForce("/tmp/cpu_out.pt")
tforce.setOutputsForces(True)
system.addForce(tforce)
platform = openmm.Platform.getPlatformByName("CUDA")
ctx = openmm.Context(system, openmm.VerletIntegrator(0.001), platform)
ctx.setPositions(np.zeros((1, 3)))
print(f"platform={ctx.getPlatform().getName()}")
state = ctx.getState(getForces=True, getEnergy=True) # crashes here
print(f"forces[0]={state.getForces(asNumpy=True)[0]}")
Actual output
platform=CUDA
Traceback (most recent call last):
  File "/tmp/repro_naive.py", line 43, in <module>
    state = ctx.getState(getForces=True, getEnergy=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../openmm/openmm.py", line 6157, in getState
    state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
openmm.OpenMMException: Failed to synchronize the CUDA context:
CUDA_ERROR_ILLEGAL_ADDRESS (700) at .../platforms/cuda/src/CudaTorchKernels.cpp:184
The reported line in CudaTorchKernels.cpp is the second
cuCtxSynchronize() inside addForces(), the one that runs after
cu.executeKernel(addForcesKernel, ...). Kernel launches are
asynchronous, so the launch itself returns normally; the fault occurs
while the kernel executes and surfaces at the next sync.
Expected output
platform=CUDA
forces[0]=[1.19502868 0. 0.] # 5 kJ/mol/nm in +x, shown in kcal/mol/nm (5 / 4.184) as in the control
…which is what the control reproducer below produces.
Control: same setup, returned tensors moved to positions.device
"""Control: identical setup but tensors are allocated on positions.device.
Runs cleanly, integrates correctly, no crash."""
import numpy as np
import torch as pt
import openmm
from openmm.unit import kilocalorie_per_mole, nanometer
from openmmtorch import TorchForce
class DeviceAwareForce(pt.nn.Module):
    def forward(self, positions):
        n = positions.shape[0]
        energy = pt.zeros(1, device=positions.device, dtype=positions.dtype)
        forces = pt.zeros(n, 3, device=positions.device, dtype=positions.dtype)
        forces[:, 0] = 5.0
        return energy, forces
module = pt.jit.script(DeviceAwareForce())
module.save("/tmp/dev_aware.pt")
system = openmm.System()
system.addParticle(1.0)
tforce = TorchForce("/tmp/dev_aware.pt")
tforce.setOutputsForces(True)
system.addForce(tforce)
platform = openmm.Platform.getPlatformByName("CUDA")
ctx = openmm.Context(system, openmm.VerletIntegrator(0.001), platform)
ctx.setPositions(np.zeros((1, 3)))
print(f"platform={ctx.getPlatform().getName()}")
state = ctx.getState(getForces=True, getEnergy=True)
f = state.getForces(asNumpy=True).value_in_unit(kilocalorie_per_mole / nanometer)
print(f"forces[0]={f[0]}")
ctx.getIntegrator().step(25)
pos = ctx.getState(getPositions=True).getPositions(asNumpy=True).value_in_unit(nanometer)
print(f"after 25 steps pos[0]={pos[0]}")
Output:
platform=CUDA
forces[0]=[1.19502868 0. 0.]
after 25 steps pos[0]=[0.001625 0. 0.]
So the difference between crashing and working is exactly one
attribute of the tensors forward() returns: their .device.
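The two numbers in the control output can also be checked by hand. The sketch below assumes OpenMM's internal units (kJ/mol/nm for force, amu for mass, ps for time, so acceleration comes out in nm/ps²), the thermochemical calorie (1 kcal = 4.184 kJ), and a leapfrog update that advances the velocity before the position, starting from rest. The helper names are illustrative and not part of OpenMM or openmm-torch.

```python
KJ_PER_KCAL = 4.184  # thermochemical calorie


def kj_to_kcal(value_kj):
    """Convert a per-kJ quantity to the same quantity per kcal."""
    return value_kj / KJ_PER_KCAL


def leapfrog_position(force, mass, dt, steps):
    """Integrate one coordinate under a constant force, starting at rest at x = 0."""
    x, v = 0.0, 0.0
    for _ in range(steps):
        v += dt * force / mass  # velocity update first
        x += dt * v             # then position, using the new velocity
    return x


force_kcal = kj_to_kcal(5.0)
final_x = leapfrog_position(force=5.0, mass=1.0, dt=0.001, steps=25)

print(f"force = {force_kcal:.8f} kcal/mol/nm")
print(f"x after 25 steps = {final_x:.6f} nm")
```

Under those assumptions the script reproduces 1.19502868 and 0.001625, which suggests the control run is applying the module's forces unchanged.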
Root cause
CudaTorchKernels.cpp:123-132 (getTensorPointer):
static void* getTensorPointer(OpenMM::CudaContext& cu, torch::Tensor& tensor) {
    void* data;
    if (cu.getUseDoublePrecision()) {
        data = tensor.to(torch::kFloat64).data_ptr<double>();
    } else {
        data = tensor.to(torch::kFloat32).data_ptr<float>();
    }
    return data;
}
Tensor::to(ScalarType) changes dtype only — it does not move the
tensor to a CUDA device. If the input tensor lives on CPU,
data_ptr<float>() returns a host address. That address is then
threaded through to addForcesKernel as forceData:
void CudaCalcTorchForceKernel::addForces(torch::Tensor& forceTensor) {
    int numParticles = cu.getNumAtoms();
    void* forceData = getTensorPointer(cu, forceTensor); // ← host ptr if forceTensor is on CPU
    CHECK_RESULT(cuCtxSynchronize(), "...");
    {
        ContextSelector selector(cu);
        int paddedNumAtoms = cu.getPaddedNumAtoms();
        int forceSign = (outputsForces ? 1 : -1);
        void* forceArgs[] = {&forceData, &cu.getForce().getDevicePointer(),
                             &cu.getAtomIndexArray().getDevicePointer(),
                             &numParticles, &paddedNumAtoms, &forceSign};
        cu.executeKernel(addForcesKernel, forceArgs, numParticles); // ← reads host addr as device ptr
        CHECK_RESULT(cuCtxSynchronize(), "..."); // ← reports the illegal access
    }
}
The accompanying addForces CUDA kernel reads forceData[…] as a
device array, hitting an unmapped page. CUDA records the fault as a
sticky error on the context, and the next cuCtxSynchronize surfaces it.
The same shape of mistake almost certainly affects posData and
boxData upstream in prepareTorchInputs, but those tensors come from
OpenMM internally so they're always already on-device — the bug just
doesn't manifest there.
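The dtype-versus-device distinction can be observed from Python, where Tensor.to(dtype) corresponds to the C++ Tensor::to(ScalarType) overload used in getTensorPointer. This is a small illustration rather than plugin code, and the CUDA branch only runs when a GPU is visible.

```python
import torch

forces = torch.zeros(4, 3)            # CPU tensor, float32 by default
as_double = forces.to(torch.float64)  # dtype-only conversion

# The dtype changed, but the tensor is still on the CPU.
print(as_double.dtype, as_double.device, as_double.is_cuda)

# Moving to a GPU requires naming the device explicitly.
if torch.cuda.is_available():
    on_gpu = forces.to(device="cuda", dtype=torch.float64)
    print(on_gpu.dtype, on_gpu.device, on_gpu.is_cuda)
```

Taking the data pointer of as_double would therefore yield a host address, which is the situation addForcesKernel ends up in.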
Suggested fix
getTensorPointer should pin both device and dtype with a single
.to() that specifies a TensorOptions (or an explicit
torch::Device):
static void* getTensorPointer(OpenMM::CudaContext& cu, torch::Tensor& tensor) {
    const torch::Device device(torch::kCUDA, cu.getDeviceIndex());
    if (cu.getUseDoublePrecision()) {
        return tensor.to(device, torch::kFloat64).data_ptr<double>();
    } else {
        return tensor.to(device, torch::kFloat32).data_ptr<float>();
    }
}
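When the user's tensor is already on the correct device with the
correct dtype, .to() is a no-op (returns the same storage). When it
isn't, PyTorch performs the copy and the returned pointer is
guaranteed to be a device pointer. One caveat: in the copy case the
converted tensor is a temporary, so its storage goes back to PyTorch's
allocator when the expression ends. A robust version would keep the
converted tensor alive until the kernel has run, for example by
assigning the result back to tensor before taking the pointer.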
If you'd rather not silently copy, an explicit precondition would also
be defensible:
TORCH_CHECK(tensor.device().is_cuda() && tensor.device().index() == cu.getDeviceIndex(),
            "TorchForce: tensor returned to OpenMM must live on the active CUDA device (got ",
            tensor.device(), ", expected cuda:", cu.getDeviceIndex(), ").");
With this check, users of setOutputsForces(True) would get a clear
error message instead of a CUDA-context abort.
A combined approach (transparent .to() on outputs, explicit
TORCH_CHECK on inputs) is what I'd recommend, but either fix
individually resolves the reported crash. Happy to send a PR.
Workaround for users hitting this today
Make your forward() return tensors on positions.device:
def forward(self, positions):
    n = positions.shape[0]
    energy = pt.zeros(1, device=positions.device, dtype=positions.dtype)
    forces = pt.zeros(n, 3, device=positions.device, dtype=positions.dtype)
    ...
    return energy, forces
This is also good practice for setOutputsForces(False) modules,
since otherwise PyTorch silently copies host↔device per call.
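For a model that cannot easily be edited, one option is to wrap it instead. The sketch below is a hypothetical helper: OnPositionsDevice and CPUOutput are illustrative names, not part of openmm-torch, and it assumes the wrapped module returns an (energy, forces) tuple.

```python
from typing import Tuple

import torch as pt


class OnPositionsDevice(pt.nn.Module):
    """Hypothetical wrapper: returns the wrapped module's outputs on positions.device."""

    def __init__(self, inner: pt.nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, positions: pt.Tensor) -> Tuple[pt.Tensor, pt.Tensor]:
        energy, forces = self.inner(positions)
        # .to() returns the tensor unchanged when it is already on the target device.
        return energy.to(positions.device), forces.to(positions.device)


class CPUOutput(pt.nn.Module):
    """Stand-in for a model that always allocates its outputs on the CPU."""

    def forward(self, positions: pt.Tensor) -> Tuple[pt.Tensor, pt.Tensor]:
        forces = pt.zeros(positions.shape[0], 3)
        forces[:, 0] = 5.0
        return pt.zeros(1), forces


wrapped = OnPositionsDevice(CPUOutput())
positions = pt.zeros(2, 3)  # pass device="cuda" here where a GPU is available
energy, forces = wrapped(positions)
print(energy.device == positions.device, forces.device == positions.device)
```

The wrapper would be applied before scripting, along the lines of pt.jit.script(OnPositionsDevice(model)). When the inner model allocates on the CPU this still costs a host-to-device copy per call, so allocating on positions.device inside the model remains the better fix where that is possible.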