TMA trace generates massively duplicated addresses — missing coordinate operand capture causes all warps/CTAs to resolve identical sector addresses #164

@carroto

Description

The trace_tma tool produces trace files pathologically inflated with duplicate DRAM sector addresses.
For example, I tested it on a single layer of an LLM workload: dram.trace grows to 77 million lines, of which only ~2.4M (3.1%) are unique. By contrast, the general memory tracer produces ~2,000 unique lines for a 6-layer workload.
The dominant duplication pattern is 16× per sector address — matching the number of active warps/CTAs executing the TMA instruction.
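To reproduce these numbers, a small counter along the following lines may help. It is only a sketch: it assumes the trace holds one record per line and treats each line as an opaque string, since the exact dram.trace record layout is not described here. `DupStats` and `count_duplicates` are hypothetical names, not part of the tool.

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical summary of a trace file (not part of trace_tma).
struct DupStats {
    uint64_t total = 0;         // all lines read
    uint64_t unique = 0;        // distinct lines
    uint64_t modal_repeat = 0;  // repetition count shared by the most distinct lines
};

// Assumption: one trace record per line; lines are compared as raw strings.
DupStats count_duplicates(std::istream& in) {
    std::unordered_map<std::string, uint64_t> seen;
    DupStats stats;
    std::string line;
    while (std::getline(in, line)) {
        ++stats.total;
        ++seen[line];
    }
    stats.unique = seen.size();

    // Histogram: repetition count -> number of distinct lines with that count.
    std::map<uint64_t, uint64_t> histogram;
    for (const auto& entry : seen) {
        ++histogram[entry.second];
    }
    uint64_t best = 0;
    for (const auto& bucket : histogram) {
        if (bucket.second > best) {
            best = bucket.second;
            stats.modal_repeat = bucket.first;
        }
    }
    return stats;
}
```

Passing a `std::ifstream` opened on dram.trace should give the total and unique line counts, and `modal_repeat` would be expected to come out as 16 if the pattern described above holds.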

The current instrumentation path captures only the descriptor:

  • nvbit_add_call_arg_tma_param_handle_and_size (host) — pushes uint8_t* handle + uint32_t size. No coordinate is captured.

  • Device side (inject_funcs.cu) — copies raw descriptor bytes into the channel record.

  • nvbit_parse_tma_dst_addrs / nvbit_parse_tma_src_addrs (host parsing) — accepts only opcode, tma_param_handle, tma_param_size. There is no parameter for runtime coordinates.

Because the coordinate register is never captured, the address resolver expands the descriptor's box geometry from default coordinates (0, 0, 0) for every access. Every warp in every CTA across every iteration that shares a descriptor produces identical addresses.
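The effect can be illustrated with a minimal host-side sketch. Everything in it is an assumption made for illustration: `TileDesc` is a simplified 2D stand-in and not the real descriptor layout, `expand_box` is not the tool's resolver, and the 32-byte sector size is assumed. It only shows that expanding the same box from (0, 0) for every access yields one box worth of sectors, while per-access coordinates yield distinct ones.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical, simplified 2D tile descriptor (NOT the real descriptor layout).
struct TileDesc {
    uint64_t base;        // global base address of the tensor
    uint64_t row_stride;  // bytes between consecutive rows
    uint32_t elem_size;   // bytes per element
    uint32_t box_w;       // box width in elements
    uint32_t box_h;       // box height in rows
};

constexpr uint64_t kSectorBytes = 32;  // assumed sector granularity

// Expand one box at element coordinates (x, y) into the sectors it touches.
std::vector<uint64_t> expand_box(const TileDesc& d, uint32_t x, uint32_t y) {
    std::set<uint64_t> sectors;
    for (uint32_t r = 0; r < d.box_h; ++r) {
        uint64_t row_begin = d.base
                           + static_cast<uint64_t>(y + r) * d.row_stride
                           + static_cast<uint64_t>(x) * d.elem_size;
        uint64_t row_end = row_begin + static_cast<uint64_t>(d.box_w) * d.elem_size;
        // Align the first address down to a sector boundary, then walk the row.
        for (uint64_t s = row_begin / kSectorBytes * kSectorBytes; s < row_end;
             s += kSectorBytes) {
            sectors.insert(s);
        }
    }
    return std::vector<uint64_t>(sectors.begin(), sectors.end());
}

// Number of distinct sectors touched by a sequence of accesses.
std::size_t count_unique_sectors(
        const TileDesc& d,
        const std::vector<std::pair<uint32_t, uint32_t>>& coords) {
    std::set<uint64_t> all;
    for (const auto& c : coords) {
        for (uint64_t s : expand_box(d, c.first, c.second)) {
            all.insert(s);
        }
    }
    return all.size();
}
```

With a 64x4 box of 2-byte elements, 16 accesses that all use (0, 0) touch the same 16 sectors, whereas 16 accesses at x = 0, 64, 128, ... touch 256 distinct sectors. This suggests the parsing functions would need the runtime coordinates as an additional input; how best to capture them on the device side is left open here.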
