fill_holes no-op path leaves native CUDA cache allocated #33

@luca-888

Background

While investigating GPU memory growth in a long-running Trellis2 FastAPI service, we found that the process's baseline GPU memory usage kept increasing while torch.cuda.memory_reserved() stayed stable. This points to native CUDA allocations made outside the PyTorch caching allocator.
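The comparison described above can be sketched as a small helper. This is only an illustration: `native_overhead_mib` is a hypothetical name, not part of CuMesh or PyTorch, and because `torch.cuda.mem_get_info()` reports device-wide usage, the result also includes the CUDA context and any other processes on the same GPU. What matters is the trend over time, not the absolute value.

```python
MIB = 1024 * 1024


def native_overhead_mib(total_bytes, free_bytes, reserved_bytes):
    """Device memory in use that the PyTorch caching allocator does not hold.

    Hypothetical helper: (total - free) is what the driver reports as used,
    and reserved_bytes is what PyTorch's caching allocator has reserved.
    """
    used_bytes = total_bytes - free_bytes
    return (used_bytes - reserved_bytes) / MIB


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()
        free, total = torch.cuda.mem_get_info()
        reserved = torch.cuda.memory_reserved()
        # A value that keeps climbing while `reserved` stays flat suggests
        # allocations made outside the PyTorch allocator.
        print(f"non-PyTorch usage: {native_overhead_mib(total, free, reserved):.2f} MiB")
```

Logging this value periodically in the service, next to torch.cuda.memory_reserved(), is one way to tell the two kinds of growth apart.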

The issue was isolated to CuMesh::fill_holes().

This may be the cause of microsoft/TRELLIS.2#136.

Problem

CuMesh::fill_holes() can retain its internal connectivity/boundary cache on no-op / early-return paths.

Typical case:

  1. Call fill_holes() once on an open mesh. The hole is filled successfully.
  2. Call fill_holes() again on the same mesh. The mesh is already closed, so this should be a no-op.
  3. The second call still rebuilds internal cache such as edges and boundary adjacency.
  4. The function returns early without calling clear_cache().

As a result, the CuMesh object keeps extra native CUDA memory until clear_cache() is called manually or the object is destroyed.
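Until the early-return paths release the cache themselves, a caller-side workaround is to pair every fill_holes() call with clear_cache(). A minimal sketch, assuming only the two methods used in the repro script; the wrapper name `fill_holes_and_release` and the parameter name `limit` are hypothetical, and the argument is passed through positionally exactly as in the repro:

```python
def fill_holes_and_release(mesh, limit):
    """Call mesh.fill_holes(limit), then always drop the CuMesh cache.

    Hypothetical wrapper: `mesh` is expected to expose fill_holes() and
    clear_cache() as used in the repro script.
    """
    try:
        mesh.fill_holes(limit)
    finally:
        # Runs on the normal path, the no-op path, and on exceptions.
        mesh.clear_cache()
```

The trade-off is that the cache is rebuilt on the next operation that needs it, so this favors a stable memory baseline over speed.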

Repro Script

Environment:

  • GPU: NVIDIA H20
  • CUDA: 12.8
  • Python: 3.12.3
  • Mesh: generated 768 x 768 grid mesh

import gc
import torch
from cumesh import CuMesh


def mib(x):
    return x / 1024 / 1024


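def used_mib():
    # Device-wide usage as reported by the driver, so it also captures
    # allocations made outside the PyTorch caching allocator.
    torch.cuda.synchronize()
    free, total = torch.cuda.mem_get_info()
    return mib(total - free)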


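def cleanup():
    # Release unreferenced tensors and return PyTorch's cached blocks to the
    # driver, so the remaining usage reflects memory that is still held.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()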


def make_grid(n=768):
    y, x = torch.meshgrid(
        torch.linspace(0, 1, n + 1, device="cuda"),
        torch.linspace(0, 1, n + 1, device="cuda"),
        indexing="ij",
    )
    vertices = torch.stack([x, y, torch.zeros_like(x)], -1).reshape(-1, 3).contiguous()

    cy, cx = torch.meshgrid(
        torch.arange(n, device="cuda", dtype=torch.int64),
        torch.arange(n, device="cuda", dtype=torch.int64),
        indexing="ij",
    )
    base = cy * (n + 1) + cx
    f1 = torch.stack([base, base + 1, base + n + 2], -1)
    f2 = torch.stack([base, base + n + 2, base + n + 1], -1)
    faces = torch.stack([f1, f2], 2).reshape(-1, 3).to(torch.int32).contiguous()
    return vertices, faces


v, f = make_grid()
cleanup()
base = used_mib()

mesh = CuMesh()
mesh.init(v, f)
cleanup()
print(f"after init:        delta={used_mib() - base:+.2f} MiB E={mesh.num_edges} B={mesh.num_boundaries}")

mesh.fill_holes(9999.0)
cleanup()
print(f"after first fill:  delta={used_mib() - base:+.2f} MiB E={mesh.num_edges} B={mesh.num_boundaries}")

mesh.fill_holes(9999.0)
cleanup()
print(f"after second fill: delta={used_mib() - base:+.2f} MiB E={mesh.num_edges} B={mesh.num_boundaries}")

mesh.clear_cache()
cleanup()
print(f"after clear_cache: delta={used_mib() - base:+.2f} MiB E={mesh.num_edges} B={mesh.num_boundaries}")

Actual Result Before Fix

after init:        delta=+22.00 MiB E=0 B=0
after first fill:  delta=+24.00 MiB E=0 B=0
after second fill: delta=+162.00 MiB E=1774080 B=0
after clear_cache: delta=+24.00 MiB E=0 B=0

The second fill_holes() call should be a no-op, but E=1774080 shows that internal edge/connectivity cache is retained. Calling clear_cache() drops the memory back, confirming that the retained memory is CuMesh cache.
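The value E=1774080 can be cross-checked against the repro mesh. The sketch below counts the edges of the triangulated 768 x 768 grid and then adds one edge per boundary edge, which is what a fan fill around a single new vertex would contribute. The fan strategy is an assumption made for this check; the issue does not say how fill_holes() triangulates the hole.

```python
def grid_edge_count(n):
    """Unique edges of an n x n grid of quads, each split by one diagonal."""
    horizontal = n * (n + 1)  # n edges per row of vertices, n + 1 rows
    vertical = n * (n + 1)    # n edges per column of vertices, n + 1 columns
    diagonal = n * n          # one diagonal per quad
    return horizontal + vertical + diagonal


def boundary_edge_count(n):
    """Edges on the outer perimeter of the grid."""
    return 4 * n


n = 768
# Assumption: the first fill added one vertex connected to every boundary
# vertex, i.e. one new edge per boundary edge.
total_edges = grid_edge_count(n) + boundary_edge_count(n)
print(total_edges)  # 1774080
```

Under that assumption the count matches the reported value, which is consistent with the second call rebuilding the edge list for the whole, already closed mesh.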

Expected Result After Fix

after init:        delta=+22.00 MiB E=0 B=0
after first fill:  delta=+24.00 MiB E=0 B=0
after second fill: delta=+24.00 MiB E=0 B=0
after clear_cache: delta=+24.00 MiB E=0 B=0

The second no-op fill_holes() call should not leave connectivity/boundary cache behind.

Closes #32
