Summary
On the GPU delegate, aten.floor, aten.trunc and aten.ceil execute as the identity function (the input is returned unchanged), and aten.round rounds ties away from zero instead of to even. CPU executes all four correctly from the same .aimodel. Additionally, torch.div(x, 1.0, rounding_mode="floor") is simplified to identity at conversion time (the divisor-1 fold drops the rounding semantics), which removes the most natural in-graph workaround.
Environment
- coreai-torch 0.4.0, coreai-core 1.0.0b1 (cp312), torch 2.11.0
- macOS 27.0 (build 26A5353q), M4 Max
Minimal repro
import asyncio, shutil
from pathlib import Path
import torch
import coreai.runtime as rt
from coreai_torch import TorchConverter, get_decomp_table

class M(torch.nn.Module):
    def forward(self, x):
        return (x.floor(), x.trunc(), x.round(), x.ceil(),
                torch.div(x, 1.0, rounding_mode="floor"),
                torch.div(x * 2.0, 2.0, rounding_mode="floor"))  # the one that works

x = torch.tensor([0.3, 1.7, -0.4, -1.6, 2.0, -2.0, 0.5, -0.5])
names = ["floor", "trunc", "round", "ceil", "floordiv1", "floordiv2x"]
ep = torch.export.export(M().eval(), (x,)).run_decompositions(get_decomp_table())
prog = TorchConverter().add_exported_program(exported_program=ep, input_names=["x"], output_names=names).to_coreai()
prog.optimize()
out = Path("/tmp/floor_probe.aimodel"); shutil.rmtree(out, ignore_errors=True)
prog.save_asset(out, rt.AIModelAssetMetadata())

async def run(unit):
    opts = rt.SpecializationOptions.cpu_only() if unit == "cpu" else \
        rt.SpecializationOptions.from_preferred_compute_unit_kind(rt.ComputeUnitKind.gpu())
    m = await rt.AIModel.load(out, opts)
    res = await m.load_function("main")({"x": rt.NDArray(x.numpy())})
    print(unit, {n: res[n].numpy().tolist() for n in names})

asyncio.run(run("cpu"))
asyncio.run(run("gpu"))
Measured (gpu unit):
| op | got | expected |
|---|---|---|
| floor | [0.3, 1.7, -0.4, -1.6, 2, -2, 0.5, -0.5] (identity) | [0, 1, -1, -2, 2, -2, 0, -1] |
| trunc | identity | [0, 1, -0, -1, 2, -2, 0, -0] |
| ceil | identity | [1, 2, -0, -1, 2, -2, 1, -0] |
| round | [0, 2, 0, -2, 2, -2, 1, -1] (ties away) | [0, 2, -0, -2, 2, -2, 0, -0] (ties to even) |
| div(x, 1, floor) | identity (folded at conversion) | floor |
| div(2x, 2, floor) | correct | floor |
CPU: all correct.
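For anyone triaging without the runtime installed, the expected values in the table above can be cross-checked with the standard library alone. This is only a sketch: `round_half_away` is a hypothetical helper written for this comparison (it models the tie-breaking observed on GPU, not the delegate's actual kernel), Python floats are float64 rather than the repro's float32, and the conversions below do not preserve the sign of zero, so `-0` appears as `0.0`.

```python
import math

# Same probe values as the repro's input tensor.
xs = [0.3, 1.7, -0.4, -1.6, 2.0, -2.0, 0.5, -0.5]


def round_half_away(v: float) -> float:
    """Round to nearest, ties away from zero (hypothetical model of the GPU result)."""
    return math.copysign(math.floor(abs(v) + 0.5), v)


reference = {
    "floor": [float(math.floor(v)) for v in xs],
    "trunc": [float(math.trunc(v)) for v in xs],
    "ceil": [float(math.ceil(v)) for v in xs],
    # Python's built-in round() breaks ties to even, like torch.round.
    "round_half_even": [float(round(v)) for v in xs],
    "round_half_away": [round_half_away(v) for v in xs],
}

for name, values in reference.items():
    print(f"{name:16s} {values}")
```

The two rounding rows differ only at the tie inputs 0.5 and -0.5, which is where the measured GPU output diverges from the expected one.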
Impact
Any model computing integer cell coordinates on the GPU (bilinear/grid sampling, quantization, positional bucketing) silently produces wrong results. Hit while porting RF-DETR — the deformable-attention sampling floor turned the whole decoder into noise on GPU while CPU was bit-clean.
Workaround
torch.div(x * 2.0, 2.0, rounding_mode="floor"): floor-div with divisor ≠ 1 lowers correctly, and the ×2/2 power-of-two scale is exact in floating point (as long as x * 2.0 does not overflow).
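The exactness claim behind the workaround above can be sanity-checked without torch or the runtime. The sketch below is a scalar model only: `floor_via_div` is a hypothetical stand-in that uses Python's float `//` in place of `torch.div(..., rounding_mode="floor")`, and it runs in float64 rather than float32. It says nothing about how the GPU delegate lowers the op; it only checks that scaling by 2 and floor-dividing by 2 agrees with a plain floor.

```python
import math
import random


def floor_via_div(v: float) -> float:
    """Scalar model of torch.div(x * 2.0, 2.0, rounding_mode="floor")."""
    return (v * 2.0) // 2.0


# The probe values from the repro.
xs = [0.3, 1.7, -0.4, -1.6, 2.0, -2.0, 0.5, -0.5]
assert [floor_via_div(v) for v in xs] == [float(math.floor(v)) for v in xs]

# A wider randomized check, seeded so it is reproducible.
rng = random.Random(0)
for _ in range(100_000):
    v = rng.uniform(-1e6, 1e6)
    assert floor_via_div(v) == math.floor(v)

print("floor_via_div matches math.floor on all sampled inputs")
```

Multiplying by 2 only changes the exponent, so no mantissa bits are lost unless the product overflows; that is why the sampled range stays far from the float limits.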