Background
While investigating GPU memory growth in a long-running Trellis2 FastAPI service, we found that the process GPU baseline increased while torch.cuda.memory_reserved() stayed stable. This suggests native CUDA allocations outside the PyTorch allocator.
The issue was isolated to CuMesh::fill_holes().
This may be the cause of microsoft/TRELLIS.2#136.
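The comparison described above can be sketched as follows. This is a minimal illustration, not the service's actual monitoring code: the helper names `untracked_mib` and `sample_untracked_mib` are hypothetical, and the sketch assumes the GPU is not shared with other processes, because `torch.cuda.mem_get_info()` reports device-wide usage.

```python
def untracked_mib(free_bytes, total_bytes, reserved_bytes):
    """Device memory in use that the PyTorch caching allocator does not account for."""
    used_bytes = total_bytes - free_bytes
    return (used_bytes - reserved_bytes) / (1024 * 1024)


def sample_untracked_mib():
    """Take one reading from the current CUDA device (requires torch with CUDA)."""
    import torch  # imported lazily so the pure helper above works without a GPU

    torch.cuda.synchronize()
    free, total = torch.cuda.mem_get_info()
    return untracked_mib(free, total, torch.cuda.memory_reserved())
```

The absolute value includes the CUDA context and other fixed overhead, so the useful signal is whether the reading keeps growing across requests while `torch.cuda.memory_reserved()` stays flat.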
Problem
CuMesh::fill_holes() can retain internal connectivity/boundary cache on no-op / early-return paths.
Typical case:
- Call fill_holes() once on an open mesh. The hole is filled successfully.
- Call fill_holes() again on the same mesh. The mesh is already closed, so this should be a no-op.
- The second call still rebuilds internal cache such as edges and boundary adjacency.
- The function returns early without calling clear_cache().
As a result, the CuMesh object keeps extra native CUDA memory until clear_cache() is called manually or the object is destroyed.
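Until the early-return paths release the cache themselves, a caller-side workaround is to call clear_cache() after every fill_holes() call. A minimal sketch, assuming only the two CuMesh methods named in this description; the helper name `fill_holes_and_clear` and the parameter name `max_hole_perimeter` are hypothetical and not part of CuMesh:

```python
def fill_holes_and_clear(mesh, max_hole_perimeter):
    """Call mesh.fill_holes() and always release cached connectivity afterwards."""
    try:
        return mesh.fill_holes(max_hole_perimeter)
    finally:
        # Runs whether fill_holes() did real work, returned early, or raised.
        mesh.clear_cache()
```

The trade-off is that any cached connectivity has to be rebuilt by the next operation that needs it, so this may cost extra compute in exchange for a stable memory baseline.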
Repro Script
Environment:
- GPU: NVIDIA H20
- CUDA: 12.8
- Python: 3.12.3
- Mesh: generated 768 x 768 grid mesh
import gc

import torch
from cumesh import CuMesh


def mib(x):
    return x / 1024 / 1024


def used_mib():
    # Device-wide usage, so native (non-PyTorch) CUDA allocations are included.
    torch.cuda.synchronize()
    free, total = torch.cuda.mem_get_info()
    return mib(total - free)


def cleanup():
    # Release everything PyTorch can release, so any remaining delta is held elsewhere.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()


def make_grid(n=768):
    # Build an open n x n grid mesh (two triangles per cell) in the z=0 plane.
    y, x = torch.meshgrid(
        torch.linspace(0, 1, n + 1, device="cuda"),
        torch.linspace(0, 1, n + 1, device="cuda"),
        indexing="ij",
    )
    vertices = torch.stack([x, y, torch.zeros_like(x)], -1).reshape(-1, 3).contiguous()
    cy, cx = torch.meshgrid(
        torch.arange(n, device="cuda", dtype=torch.int64),
        torch.arange(n, device="cuda", dtype=torch.int64),
        indexing="ij",
    )
    base = cy * (n + 1) + cx
    f1 = torch.stack([base, base + 1, base + n + 2], -1)
    f2 = torch.stack([base, base + n + 2, base + n + 1], -1)
    faces = torch.stack([f1, f2], 2).reshape(-1, 3).to(torch.int32).contiguous()
    return vertices, faces


v, f = make_grid()
cleanup()
base = used_mib()

mesh = CuMesh()
mesh.init(v, f)
cleanup()
print(f"after init: delta={used_mib() - base:+.2f} MiB E={mesh.num_edges} B={mesh.num_boundaries}")

# First call: the mesh is open, so the hole is actually filled.
mesh.fill_holes(9999.0)
cleanup()
print(f"after first fill: delta={used_mib() - base:+.2f} MiB E={mesh.num_edges} B={mesh.num_boundaries}")

# Second call: the mesh is already closed, so this should be a no-op.
mesh.fill_holes(9999.0)
cleanup()
print(f"after second fill: delta={used_mib() - base:+.2f} MiB E={mesh.num_edges} B={mesh.num_boundaries}")

mesh.clear_cache()
cleanup()
print(f"after clear_cache: delta={used_mib() - base:+.2f} MiB E={mesh.num_edges} B={mesh.num_boundaries}")
Actual Result Before Fix
after init: delta=+22.00 MiB E=0 B=0
after first fill: delta=+24.00 MiB E=0 B=0
after second fill: delta=+162.00 MiB E=1774080 B=0
after clear_cache: delta=+24.00 MiB E=0 B=0
The second fill_holes() call should be a no-op, but E=1774080 shows that internal edge/connectivity cache is retained. Calling clear_cache() drops the memory back, confirming that the retained memory is CuMesh cache.
Expected Result After Fix
after init: delta=+22.00 MiB E=0 B=0
after first fill: delta=+24.00 MiB E=0 B=0
after second fill: delta=+24.00 MiB E=0 B=0
after clear_cache: delta=+24.00 MiB E=0 B=0
The second no-op fill_holes() call should not leave connectivity/boundary cache behind.
Closes #32