GPU delegate executes aten.floor/trunc/ceil as identity; round uses away-from-zero ties; div(x,1,floor) folds to identity #10

@john-rocky

Summary

On the GPU delegate, aten.floor, aten.trunc and aten.ceil execute as the identity function (the input is returned unchanged), and aten.round rounds ties away from zero instead of to even. The CPU executes all four correctly from the same .aimodel. Additionally, torch.div(x, 1.0, rounding_mode="floor") is simplified to identity at conversion time (the divisor-1 fold drops the rounding semantics), which removes the most natural in-graph workaround.
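
For reference, the expected semantics of the four ops can be written out in plain Python. This is only a sketch using the standard library, not coreai or torch code; the helper names `round_half_even` and `round_half_away` are made up for illustration, and the input values are the ones from the repro in this issue.

```python
import math

def round_half_even(v: float) -> float:
    """Round to nearest, ties to even (the rule torch.round documents)."""
    return float(round(v))  # Python's built-in round() also uses ties-to-even

def round_half_away(v: float) -> float:
    """Round to nearest, ties away from zero (the behaviour reported on GPU)."""
    return math.copysign(math.floor(abs(v) + 0.5), v)

# Same probe values as the repro in this issue.
xs = [0.3, 1.7, -0.4, -1.6, 2.0, -2.0, 0.5, -0.5]

reference = {
    "floor": [float(math.floor(v)) for v in xs],
    "trunc": [float(math.trunc(v)) for v in xs],
    "ceil": [float(math.ceil(v)) for v in xs],
    "round": [round_half_even(v) for v in xs],
}
ties_away = [round_half_away(v) for v in xs]

for name, values in reference.items():
    print(name, values)
print("round (ties away)", ties_away)
```

The two rounding rules only differ on exact ties, which is why 0.5 and -0.5 are the only probe values that separate them.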

Environment

  • coreai-torch 0.4.0, coreai-core 1.0.0b1 (cp312), torch 2.11.0
  • macOS 27.0 (build 26A5353q), M4 Max

Minimal repro

import asyncio, shutil
from pathlib import Path
import torch
import coreai.runtime as rt
from coreai_torch import TorchConverter, get_decomp_table

class M(torch.nn.Module):
    def forward(self, x):
        return (x.floor(), x.trunc(), x.round(), x.ceil(),
                torch.div(x, 1.0, rounding_mode="floor"),
                torch.div(x * 2.0, 2.0, rounding_mode="floor"))  # the one that works

x = torch.tensor([0.3, 1.7, -0.4, -1.6, 2.0, -2.0, 0.5, -0.5])
names = ["floor", "trunc", "round", "ceil", "floordiv1", "floordiv2x"]
ep = torch.export.export(M().eval(), (x,)).run_decompositions(get_decomp_table())
prog = TorchConverter().add_exported_program(exported_program=ep, input_names=["x"], output_names=names).to_coreai()
prog.optimize()
out = Path("/tmp/floor_probe.aimodel"); shutil.rmtree(out, ignore_errors=True)
prog.save_asset(out, rt.AIModelAssetMetadata())

async def run(unit):
    opts = rt.SpecializationOptions.cpu_only() if unit == "cpu" else \
        rt.SpecializationOptions.from_preferred_compute_unit_kind(rt.ComputeUnitKind.gpu())
    m = await rt.AIModel.load(out, opts)
    res = await m.load_function("main")({"x": rt.NDArray(x.numpy())})
    print(unit, {n: res[n].numpy().tolist() for n in names})

asyncio.run(run("cpu"))
asyncio.run(run("gpu"))

Measured (gpu unit):

| op | got | expected |
| --- | --- | --- |
| floor | [0.3, 1.7, -0.4, -1.6, 2, -2, 0.5, -0.5] (identity) | [0, 1, -1, -2, 2, -2, 0, -1] |
| trunc | identity | [0, 1, -0, -1, 2, -2, 0, -0] |
| ceil | identity | [1, 2, -0, -1, 2, -2, 1, -0] |
| round | [0, 2, 0, -2, 2, -2, 1, -1] (ties away) | [0, 2, -0, -2, 2, -2, 0, -0] (ties to even) |
| div(x, 1, floor) | identity (folded at conversion) | floor |
| div(2x, 2, floor) | correct | floor |

CPU: all correct.
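
To turn the two printed dictionaries into a quick pass/fail report, a small comparison helper along these lines may be useful. It is a sketch in plain Python: `mismatches` is a hypothetical helper, and the `cpu` and `gpu` dictionaries below are typed in from the measurements in this issue (with "identity" expanded to the input values), not captured from the runtime.

```python
def mismatches(cpu: dict, gpu: dict, tol: float = 0.0) -> dict:
    """Map each output name to the element indices where gpu differs from cpu."""
    report = {}
    for name, ref in cpu.items():
        bad = [i for i, (a, b) in enumerate(zip(ref, gpu[name])) if abs(a - b) > tol]
        if bad:
            report[name] = bad
    return report

xs = [0.3, 1.7, -0.4, -1.6, 2.0, -2.0, 0.5, -0.5]
floor_ref = [0.0, 1.0, -1.0, -2.0, 2.0, -2.0, 0.0, -1.0]

# Values as reported in this issue; CPU matched the expected column.
cpu = {
    "floor": floor_ref,
    "trunc": [0.0, 1.0, -0.0, -1.0, 2.0, -2.0, 0.0, -0.0],
    "round": [0.0, 2.0, -0.0, -2.0, 2.0, -2.0, 0.0, -0.0],
    "ceil": [1.0, 2.0, -0.0, -1.0, 2.0, -2.0, 1.0, -0.0],
    "floordiv1": floor_ref,
    "floordiv2x": floor_ref,
}
gpu = {
    "floor": xs,                                           # identity
    "trunc": xs,                                           # identity
    "round": [0.0, 2.0, 0.0, -2.0, 2.0, -2.0, 1.0, -1.0],  # ties away
    "ceil": xs,                                            # identity
    "floordiv1": xs,                                       # folded to identity
    "floordiv2x": floor_ref,                               # correct
}

report = mismatches(cpu, gpu)
for name, idx in report.items():
    print(f"{name}: differs at indices {idx}")
```

Only the exact ties (indices 6 and 7) differ for round, while the identity cases differ at every non-integer input.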

Impact

Any model computing integer cell coordinates on the GPU (bilinear/grid sampling, quantization, positional bucketing) silently produces wrong results. I hit this while porting RF-DETR: the floor in deformable-attention sampling turned the whole decoder output into noise on GPU, while CPU was bit-clean.
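
As an illustration of the failure mode, here is a 1-D linear-interpolation sketch in plain Python. It is not the RF-DETR code: `sample_1d` and `broken_floor` are hypothetical names, and the `int()` cast used to index the array is an assumption about how a downstream gather might consume the coordinate.

```python
import math

def sample_1d(values, x, floor_fn):
    """Linearly interpolate `values` at fractional coordinate x."""
    cell = floor_fn(x)   # should be the integer cell coordinate
    w = x - cell         # fractional weight inside the cell
    i0 = int(cell)       # assumption: downstream index cast truncates
    return values[i0] * (1.0 - w) + values[i0 + 1] * w

def broken_floor(v):
    # Models the reported GPU behaviour: floor returns its input unchanged.
    return v

values = [10.0, 20.0, 30.0]
good = sample_1d(values, 1.7, math.floor)   # weight 0.7 between 20 and 30
bad = sample_1d(values, 1.7, broken_floor)  # weight collapses to 0.0

print("correct floor:", good)
print("identity floor:", bad)
```

With an identity floor the fractional weight is always zero, so every sample snaps to a single neighbour and the interpolation is lost without any error being raised.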

Workaround

Use torch.div(x * 2.0, 2.0, rounding_mode="floor"): floor-div with a divisor ≠ 1 lowers correctly, and scaling by 2 and back is exact in floating point as long as x * 2.0 does not overflow.
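
If trunc, ceil and round are needed as well, they can in principle be derived from that one working floor. The sketch below shows the arithmetic in plain Python, where `//` on floats plays the role of rounding_mode="floor". All function names are hypothetical, and whether the torch equivalents of the negation, comparison and select steps lower correctly on the GPU delegate is an untested assumption.

```python
def floor_via_div(v: float) -> float:
    # Mirrors torch.div(x * 2.0, 2.0, rounding_mode="floor") from this issue.
    return (v * 2.0) // 2.0

def ceil_via_div(v: float) -> float:
    # ceil(v) == -floor(-v)
    return -floor_via_div(-v)

def trunc_via_div(v: float) -> float:
    # Round toward zero: floor for non-negative input, ceil for negative input.
    return floor_via_div(v) if v >= 0 else ceil_via_div(v)

def round_half_even_via_div(v: float) -> float:
    f = floor_via_div(v)
    d = v - f  # distance above the lower integer, in [0, 1]
    if d < 0.5:
        return f
    if d > 0.5:
        return f + 1.0
    # Exact tie: keep f if it is even, otherwise step up to the even neighbour.
    f_is_even = floor_via_div(f / 2.0) * 2.0 == f
    return f if f_is_even else f + 1.0

xs = [0.3, 1.7, -0.4, -1.6, 2.0, -2.0, 0.5, -0.5]
derived = {
    "floor": [floor_via_div(v) for v in xs],
    "trunc": [trunc_via_div(v) for v in xs],
    "ceil": [ceil_via_div(v) for v in xs],
    "round": [round_half_even_via_div(v) for v in xs],
}
for name, values in derived.items():
    print(name, values)
```

The derived round returns +0.0 where the native op returns -0.0 (for example at -0.4 and -0.5); the values compare equal, but the sign bit differs.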
