# `CUDA_ERROR_ILLEGAL_ADDRESS` when TorchScript module returns CPU tensors with `setOutputsForces(True)`

## Summary

On the CUDA platform, `TorchForce` with `setOutputsForces(True)` crashes
the OpenMM context with `CUDA_ERROR_ILLEGAL_ADDRESS (700)` whenever the
TorchScript module's `forward()` returns a `forces` tensor that is not
resident on the active CUDA device. The kernel-side cause is
[`getTensorPointer()`][gtp] returning a CPU data pointer that
[`CudaCalcTorchForceKernel::addForces`][add] then passes into
`addForcesKernel` as if it were a device pointer.

Both the call site and the kernel are unchanged at `master` HEAD as of
this report.

[gtp]: https://github.com/openmm/openmm-torch/blob/master/platforms/cuda/src/CudaTorchKernels.cpp#L123-L132
[add]: https://github.com/openmm/openmm-torch/blob/master/platforms/cuda/src/CudaTorchKernels.cpp#L172-L188

## Environment

- openmm-torch: **1.5.1** (`cuda129py312he4cb518_5` from conda-forge)
- openmm: 8.5.1
- pytorch: 2.10.0 / libtorch: 2.10.0 (`cuda129_mkl_hd6d2a1f_303`)
- cuda-cudart: 12.9.79 (cuda-version 12.9)
- python: 3.12.13
- driver: NVIDIA 575.51.03
- GPU: Quadro GV100, compute capability 7.0
- OS: Rocky Linux 9 (kernel 5.14.0-611.54.1.el9_7.x86_64)

I also re-read `platforms/cuda/src/CudaTorchKernels.cpp` from `master`
(post-v1.5.1); `getTensorPointer` is unchanged, so the bug is still
live on `master`.

## Minimal reproducer

```python
"""Repro: CPU-resident forces from a setOutputsForces(True) module
crash the CUDA platform with CUDA_ERROR_ILLEGAL_ADDRESS."""
import numpy as np
import torch as pt
import openmm
from openmmtorch import TorchForce


class CPUOutputForce(pt.nn.Module):
    """Returns forces on CPU regardless of positions device."""
    def forward(self, positions):
        n = positions.shape[0]
        energy = pt.tensor([0.0])          # CPU
        forces = pt.zeros(n, 3)            # CPU
        forces[:, 0] = 5.0
        return energy, forces


module = pt.jit.script(CPUOutputForce())
module.save("/tmp/cpu_out.pt")

system = openmm.System()
system.addParticle(1.0)

tforce = TorchForce("/tmp/cpu_out.pt")
tforce.setOutputsForces(True)
system.addForce(tforce)

platform = openmm.Platform.getPlatformByName("CUDA")
ctx = openmm.Context(system, openmm.VerletIntegrator(0.001), platform)
ctx.setPositions(np.zeros((1, 3)))
print(f"platform={ctx.getPlatform().getName()}")

state = ctx.getState(getForces=True, getEnergy=True)   # crashes here
print(f"forces[0]={state.getForces(asNumpy=True)[0]}")
```

### Actual output

```
platform=CUDA
Traceback (most recent call last):
  File "/tmp/repro_naive.py", line 43, in <module>
    state = ctx.getState(getForces=True, getEnergy=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../openmm/openmm.py", line 6157, in getState
    state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
openmm.OpenMMException: Failed to synchronize the CUDA context:
  CUDA_ERROR_ILLEGAL_ADDRESS (700) at .../platforms/cuda/src/CudaTorchKernels.cpp:184
```

The reported line in `CudaTorchKernels.cpp` is the **second**
`cuCtxSynchronize()` inside `addForces()`, the one that runs *after*
`cu.executeKernel(addForcesKernel, ...)`. The launch itself returns
without error; the kernel faults while executing, and the failure only
surfaces at the next sync.

### Expected output

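```
platform=CUDA
forces[0]=[1.19502868 0. 0.]   # 5 kJ/mol/nm in +x, shown in kcal/mol/nm
```

…which is what the control reproducer below produces. (The reproducer
above prints the force without that unit conversion, so it would report
the same force as 5 kJ/mol/nm.)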

## Control: same setup, returned tensors moved to `positions.device`

```python
"""Control: identical setup but tensors are allocated on positions.device.
Runs cleanly, integrates correctly, no crash."""
import numpy as np
import torch as pt
import openmm
from openmm.unit import kilocalorie_per_mole, nanometer
from openmmtorch import TorchForce


class DeviceAwareForce(pt.nn.Module):
    def forward(self, positions):
        n = positions.shape[0]
        energy = pt.zeros(1, device=positions.device, dtype=positions.dtype)
        forces = pt.zeros(n, 3, device=positions.device, dtype=positions.dtype)
        forces[:, 0] = 5.0
        return energy, forces


module = pt.jit.script(DeviceAwareForce())
module.save("/tmp/dev_aware.pt")

system = openmm.System()
system.addParticle(1.0)

tforce = TorchForce("/tmp/dev_aware.pt")
tforce.setOutputsForces(True)
system.addForce(tforce)

platform = openmm.Platform.getPlatformByName("CUDA")
ctx = openmm.Context(system, openmm.VerletIntegrator(0.001), platform)
ctx.setPositions(np.zeros((1, 3)))
print(f"platform={ctx.getPlatform().getName()}")

state = ctx.getState(getForces=True, getEnergy=True)
f = state.getForces(asNumpy=True).value_in_unit(kilocalorie_per_mole / nanometer)
print(f"forces[0]={f[0]}")

ctx.getIntegrator().step(25)
pos = ctx.getState(getPositions=True).getPositions(asNumpy=True).value_in_unit(nanometer)
print(f"after 25 steps pos[0]={pos[0]}")
```

Output:

```
platform=CUDA
forces[0]=[1.19502868 0. 0.]
after 25 steps pos[0]=[0.001625 0. 0.]
```

So the difference between crashing and working is **exactly one
attribute** of the tensors `forward()` returns: their `.device`.
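
As a sanity check that the control output is physically right and not just "no crash", the two printed numbers can be reproduced by hand. The sketch below uses only the standard library and assumes a leapfrog-style update (velocity first, then position), starting from rest. That is my reading of `VerletIntegrator`, not something taken from its source.

```python
# Hand-check of the two numbers printed by the control run.
KJ_PER_KCAL = 4.184            # thermochemical calorie

force_kj = 5.0                 # kJ/mol/nm, the value written in forward()
force_kcal = force_kj / KJ_PER_KCAL

mass = 1.0                     # amu, from system.addParticle(1.0)
dt = 0.001                     # ps, from VerletIntegrator(0.001)
accel = force_kj / mass        # nm/ps^2, since 1 kJ/mol = 1 amu*nm^2/ps^2

x, v = 0.0, 0.0                # particle starts at the origin, at rest
for _ in range(25):
    v += accel * dt            # velocity update (assumed leapfrog scheme)
    x += v * dt                # position update using the new velocity

print(f"{force_kcal:.8f} kcal/mol/nm")
print(f"{x:.6f} nm")
```

Both values agree with the control output above (1.19502868 and 0.001625), so the force is applied with the expected sign and magnitude once the tensor is on the device.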

## Root cause

[`CudaTorchKernels.cpp:123-132` (`getTensorPointer`)][gtp]:

```cpp
static void* getTensorPointer(OpenMM::CudaContext& cu, torch::Tensor& tensor) {
    void* data;
    if (cu.getUseDoublePrecision()) {
        data = tensor.to(torch::kFloat64).data_ptr<double>();
    } else {
        data = tensor.to(torch::kFloat32).data_ptr<float>();
    }
    return data;
}
```

`Tensor::to(ScalarType)` changes **dtype only** — it does not move the
tensor to a CUDA device. If the input `tensor` lives on CPU,
`data_ptr<float>()` returns a host address. That address is then
threaded through to `addForcesKernel` as `forceData`:

```cpp
void CudaCalcTorchForceKernel::addForces(torch::Tensor& forceTensor) {
    int numParticles = cu.getNumAtoms();
    void* forceData = getTensorPointer(cu, forceTensor);       // ← host ptr if forceTensor is on CPU
    CHECK_RESULT(cuCtxSynchronize(), "...");
    {
        ContextSelector selector(cu);
        int paddedNumAtoms = cu.getPaddedNumAtoms();
        int forceSign = (outputsForces ? 1 : -1);
        void* forceArgs[] = {&forceData, &cu.getForce().getDevicePointer(),
                             &cu.getAtomIndexArray().getDevicePointer(),
                             &numParticles, &paddedNumAtoms, &forceSign};
        cu.executeKernel(addForcesKernel, forceArgs, numParticles);  // ← reads host addr as device ptr
        CHECK_RESULT(cuCtxSynchronize(), "...");                     // ← reports the illegal access
    }
}
```

The accompanying [`addForces` CUDA kernel][k] reads `forceData[…]` as a
device array, but a host address is not a valid device address, so the
access faults. Kernel execution is asynchronous, so the error is only
reported by the next `cuCtxSynchronize`.

[k]: https://github.com/openmm/openmm-torch/blob/master/platforms/cuda/src/CudaTorchKernel.cu

The same unchecked-pointer pattern almost certainly applies to `posData`
and `boxData` upstream in `prepareTorchInputs`, but those tensors come
from OpenMM internally and are always already on-device, so the bug does
not manifest there.
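
The dtype-only behaviour of `.to()` can be checked from Python without a GPU. This is a minimal sketch using the Python `Tensor.to` API, which I am assuming mirrors the C++ `Tensor::to` overloads used in `getTensorPointer`; the variable names are illustrative only.

```python
import torch as pt

# A float32 CPU tensor, like the forces returned by the reproducer.
cpu_forces = pt.zeros(4, 3)

# Passing only a dtype converts the element type but leaves the device alone.
as_double = cpu_forces.to(pt.float64)
print(as_double.dtype, as_double.device)

# Passing device *and* dtype that already match does not copy:
# the result shares the original storage.
same = cpu_forces.to(pt.device("cpu"), pt.float32)
print(same.data_ptr() == cpu_forces.data_ptr())
```

The first call is the situation in the current `getTensorPointer`: the result is `float64` but still on the CPU, so its `data_ptr()` is a host address.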

## Suggested fix

`getTensorPointer` should pin both device and dtype with a single
`.to()` that specifies a `TensorOptions` (or an explicit
`torch::Device`):

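```cpp
static void* getTensorPointer(OpenMM::CudaContext& cu, torch::Tensor& tensor) {
    const torch::Device device(torch::kCUDA, cu.getDeviceIndex());
    // Assign the result back so that, if .to() had to copy, the new storage
    // stays alive after this function returns.
    if (cu.getUseDoublePrecision()) {
        tensor = tensor.to(device, torch::kFloat64);
        return tensor.data_ptr<double>();
    } else {
        tensor = tensor.to(device, torch::kFloat32);
        return tensor.data_ptr<float>();
    }
}
```

When the user's tensor is already on the correct device with the
correct dtype, `.to()` is a no-op (returns the same storage). When it
isn't, PyTorch performs the copy and the returned pointer is
guaranteed to be a device pointer. Assigning the result back to
`tensor` keeps that copy alive while the caller uses the pointer;
taking `data_ptr()` directly from the temporary would leave the pointer
referring to storage that has already been released.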

If you'd rather not silently copy, an explicit precondition would also
be defensible:

```cpp
TORCH_CHECK(tensor.device().is_cuda() && tensor.device().index() == cu.getDeviceIndex(),
            "TorchForce: tensor returned to OpenMM must live on the active CUDA device (got ",
            tensor.device(), ", expected cuda:", cu.getDeviceIndex(), ").");
```

With this check, users with `setOutputsForces(True)` would get a clear
error message instead of a CUDA-context abort.

A combined approach (transparent `.to()` on outputs, explicit
`TORCH_CHECK` on inputs) is what I'd recommend, but either fix
individually resolves the reported crash. Happy to send a PR.

## Workaround for users hitting this today

Make your `forward()` return tensors on `positions.device`:

```python
def forward(self, positions):
    n = positions.shape[0]
    energy = pt.zeros(1, device=positions.device, dtype=positions.dtype)
    forces = pt.zeros(n, 3, device=positions.device, dtype=positions.dtype)
    ...
    return energy, forces
```

This is also good practice for `setOutputsForces(False)` modules,
since otherwise PyTorch silently copies host↔device per call.
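
If the model itself cannot easily be edited, the same effect can probably be had by wrapping it before scripting. The sketch below is an untested-on-GPU illustration: `OutputsOnPositionsDevice` is a hypothetical helper name, not part of openmm-torch, and `CPUOutputForce` is a stand-in for the existing model.

```python
from typing import Tuple

import torch as pt


class CPUOutputForce(pt.nn.Module):
    """Stand-in for a model that allocates its outputs on the CPU."""

    def forward(self, positions: pt.Tensor) -> Tuple[pt.Tensor, pt.Tensor]:
        forces = pt.zeros([positions.shape[0], 3])   # CPU
        forces[:, 0] = 5.0
        return pt.zeros([1]), forces


class OutputsOnPositionsDevice(pt.nn.Module):
    """Hypothetical wrapper: moves both outputs to positions.device."""

    def __init__(self, inner: pt.nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, positions: pt.Tensor) -> Tuple[pt.Tensor, pt.Tensor]:
        energy, forces = self.inner(positions)
        # .to(device) returns the tensor unchanged when it is already there.
        return energy.to(positions.device), forces.to(positions.device)


scripted = pt.jit.script(OutputsOnPositionsDevice(CPUOutputForce()))
energy, forces = scripted(pt.zeros(2, 3))
print(forces.device)
```

The wrapped module can be saved with `scripted.save(...)` and loaded by `TorchForce` like any other. It still pays for a host-to-device copy on every call, so allocating on `positions.device` inside the model remains the better option where possible.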
