GPU delegate executes aten.floor/trunc/ceil as identity; round uses away-from-zero ties; div(x,1,floor) folds to identity #10

## Summary

On the GPU delegate, `aten.floor`, `aten.trunc`, and `aten.ceil` execute as the **identity function** (the input is returned unchanged), and `aten.round` rounds ties away from zero instead of to even. The CPU unit executes all four correctly from the same `.aimodel`. In addition, `torch.div(x, 1.0, rounding_mode="floor")` is simplified to identity at conversion time (the divisor-1 fold drops the rounding semantics), which removes the most natural in-graph workaround.

## Environment

- coreai-torch 0.4.0, coreai-core 1.0.0b1 (cp312), torch 2.11.0
- macOS 27.0 (build 26A5353q), M4 Max

## Minimal repro

```python
import asyncio, shutil
from pathlib import Path
import torch
import coreai.runtime as rt
from coreai_torch import TorchConverter, get_decomp_table

class M(torch.nn.Module):
    def forward(self, x):
        return (x.floor(), x.trunc(), x.round(), x.ceil(),
                torch.div(x, 1.0, rounding_mode="floor"),
                torch.div(x * 2.0, 2.0, rounding_mode="floor"))  # the one that works

# Inputs cover positive/negative fractions, exact integers, and the ±0.5 ties.
x = torch.tensor([0.3, 1.7, -0.4, -1.6, 2.0, -2.0, 0.5, -0.5])
names = ["floor", "trunc", "round", "ceil", "floordiv1", "floordiv2x"]

# Export, decompose, convert, and optimize.
ep = torch.export.export(M().eval(), (x,)).run_decompositions(get_decomp_table())
prog = TorchConverter().add_exported_program(exported_program=ep, input_names=["x"], output_names=names).to_coreai()
prog.optimize()

# Save one asset; both compute units below load this same .aimodel.
out = Path("/tmp/floor_probe.aimodel")
shutil.rmtree(out, ignore_errors=True)
prog.save_asset(out, rt.AIModelAssetMetadata())

async def run(unit):
    opts = rt.SpecializationOptions.cpu_only() if unit == "cpu" else \
        rt.SpecializationOptions.from_preferred_compute_unit_kind(rt.ComputeUnitKind.gpu())
    m = await rt.AIModel.load(out, opts)
    res = await m.load_function("main")({"x": rt.NDArray(x.numpy())})
    print(unit, {n: res[n].numpy().tolist() for n in names})

asyncio.run(run("cpu"))
asyncio.run(run("gpu"))
```

Measured (gpu unit):

| op | got | expected |
|---|---|---|
| floor | `[0.3, 1.7, -0.4, -1.6, 2, -2, 0.5, -0.5]` (identity) | `[0, 1, -1, -2, 2, -2, 0, -1]` |
| trunc | identity | `[0, 1, -0, -1, 2, -2, 0, -0]` |
| ceil | identity | `[1, 2, -0, -1, 2, -2, 1, -0]` |
| round | `[0, 2, 0, -2, 2, -2, 1, -1]` (ties away) | `[0, 2, -0, -2, 2, -2, 0, -0]` (ties to even) |
| div(x, 1, floor) | identity (folded at conversion) | `[0, 1, -1, -2, 2, -2, 0, -1]` (floor) |
| **div(2x, 2, floor)** | **correct** | `[0, 1, -1, -2, 2, -2, 0, -1]` (floor) |

CPU unit: all six outputs match the expected column.
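
The expected column can be reproduced without the runtime. The sketch below recomputes it with Python's standard library (whose built-in `round` rounds ties to even) and labels a measured output list. `round_away` and `classify` are hypothetical helper names introduced here for illustration; they are not part of coreai.

```python
import math

x = [0.3, 1.7, -0.4, -1.6, 2.0, -2.0, 0.5, -0.5]

def round_away(v):
    """Round half away from zero (the tie rule observed on the GPU unit)."""
    return math.copysign(math.floor(abs(v) + 0.5), v)

# Reference results; Python's built-in round() rounds ties to even.
expected = {
    "floor": [math.floor(v) for v in x],
    "trunc": [math.trunc(v) for v in x],
    "round": [round(v) for v in x],
    "ceil": [math.ceil(v) for v in x],
}

def classify(name, got):
    """Label one measured output list (hypothetical helper)."""
    if got == expected[name]:
        return "correct"
    if got == x:
        return "identity"
    if name == "round" and got == [round_away(v) for v in x]:
        return "ties away from zero"
    return "other mismatch"

# Values copied from the "got" column of the table above.
print(classify("floor", x))                             # identity
print(classify("round", [0, 2, 0, -2, 2, -2, 1, -1]))   # ties away from zero
print(classify("round", expected["round"]))             # correct
```

Feeding the lists printed by the repro into `classify` gives a quick per-op verdict for each compute unit.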

## Impact

Any model that computes integer cell coordinates on the GPU (bilinear/grid sampling, quantization, positional bucketing) silently produces wrong results. I hit this while porting RF-DETR: the floor in deformable-attention sampling turned the whole decoder output into noise on GPU, while CPU was bit-clean.
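
To illustrate why an identity floor produces wrong numbers rather than an error, here is a minimal 1-D sketch of the cell/weight split that linear and bilinear sampling rely on. It is not RF-DETR code; `cell_and_weight` is a hypothetical name, and the identity lambda stands in for the GPU behavior reported above.

```python
import math

def cell_and_weight(coord, floor_fn):
    """Split a sampling coordinate into (cell index, interpolation weight)."""
    cell = floor_fn(coord)
    return cell, coord - cell

# Correct floor: integer cell, fractional weight toward the next cell.
print(cell_and_weight(2.75, math.floor))   # (2, 0.75)

# Identity "floor": the cell is no longer an integer and the weight collapses to 0.
print(cell_and_weight(2.75, lambda v: v))  # (2.75, 0.0)
```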

## Workaround

Use `torch.div(x * 2.0, 2.0, rounding_mode="floor")`: floor-div with divisor ≠ 1 lowers correctly, and scaling by 2 then dividing by 2 is exact in floating point (barring overflow), so the result equals `floor(x)`.
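
This workaround only covers floor. Assuming negation, comparison, and selection lower correctly on the GPU unit (not verified in this report), the other three ops can be expressed through a working floor. The sketch below checks those identities in plain Python, with `safe_floor` mirroring the ×2/2 floor-div on a single float; all function names are hypothetical. In a model, the tensor counterparts would be built from `torch.div(x * 2.0, 2.0, rounding_mode="floor")` and `torch.where`, and would need to be re-tested on the GPU delegate.

```python
import math

def safe_floor(v):
    # Mirrors torch.div(x * 2.0, 2.0, rounding_mode="floor") for one float.
    return (v * 2.0) // 2.0

def safe_ceil(v):
    # ceil(v) == -floor(-v)
    return -safe_floor(-v)

def safe_trunc(v):
    # Round toward zero: floor for non-negative values, ceil for negative ones.
    return safe_floor(v) if v >= 0 else safe_ceil(v)

def safe_round(v):
    # Round half to even, built only on floor.
    f = safe_floor(v)
    d = v - f                      # fractional part in [0, 1)
    if d < 0.5:
        return f
    if d > 0.5:
        return f + 1.0
    return f if f % 2.0 == 0.0 else f + 1.0   # tie: pick the even neighbor

# Repro inputs plus a few extra ties.
x = [0.3, 1.7, -0.4, -1.6, 2.0, -2.0, 0.5, -0.5, 1.5, 2.5, -1.5]
for v in x:
    assert safe_floor(v) == math.floor(v)
    assert safe_ceil(v) == math.ceil(v)
    assert safe_trunc(v) == math.trunc(v)
    assert safe_round(v) == round(v)
print("all identities hold on the sample inputs")
```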

