The trace_tma tool produces trace files pathologically inflated with duplicate DRAM sector addresses.
For example, I tested it on an LLM workload with just one layer: dram.trace grows to 77 million lines, of which only ~2.4M (3.1%) are unique. By contrast, the general memory tracer produces ~2,000 unique lines for a 6-layer workload.
The dominant duplication pattern is 16× per sector address, which matches the number of active warps/CTAs executing the TMA instruction.
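The duplication figures above can be reproduced with a small script along these lines. This is only a sketch: it assumes dram.trace holds one sector address per line (the actual line format may differ), and the function name duplication_stats is made up for illustration.

```python
from collections import Counter


def duplication_stats(lines):
    """Summarize duplication in an iterable of trace lines.

    Assumes one address per line; blank lines are ignored.
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    total = sum(counts.values())
    unique = len(counts)
    # Map "times repeated" -> "number of distinct lines repeated that often".
    histogram = Counter(counts.values())
    dominant = histogram.most_common(1)[0][0] if histogram else 0
    return {
        "total": total,
        "unique": unique,
        "unique_ratio": unique / total if total else 0.0,
        "dominant_multiplicity": dominant,
    }
```

Passing an open file handle works because the function streams over lines, e.g. duplication_stats(open("dram.trace")). Only the unique lines are held in memory, so a 77M-line trace with ~2.4M unique entries should remain manageable.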
The current instrumentation path captures only the descriptor:
- nvbit_add_call_arg_tma_param_handle_and_size (host): pushes uint8_t* handle + uint32_t size. No coordinate is captured.
- Device side (inject_funcs.cu): copies raw descriptor bytes into the channel record.
- nvbit_parse_tma_dst_addrs / nvbit_parse_tma_src_addrs (host parsing): accepts only opcode, tma_param_handle, tma_param_size. There is no parameter for runtime coordinates.
Because the coordinate register is never captured, the address resolver expands the descriptor's box geometry from default coordinates (0, 0, 0) for every access. Every warp in every CTA across every iteration that shares a descriptor produces identical addresses.
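To illustrate the effect, here is a toy 2D model of box expansion. It is not the NVBit resolver: the 32-byte sector size, the row-major layout, and the names box_sectors and simulate_trace are all assumptions made for this sketch. It compares expanding every access from (0, 0) against expanding from the coordinates each issuer actually used.

```python
SECTOR_BYTES = 32  # assumed sector granularity for this toy model


def box_sectors(base, row_pitch, box_rows, box_cols, elem_size, coord):
    """Sector addresses covered by a box_rows x box_cols tile at coord=(row, col).

    row_pitch is in bytes; box_cols and col are in elements.
    """
    row0, col0 = coord
    sectors = set()
    for r in range(row0, row0 + box_rows):
        start = base + r * row_pitch + col0 * elem_size
        end = start + box_cols * elem_size  # exclusive
        first = start - start % SECTOR_BYTES  # align down to a sector boundary
        sectors.update(range(first, end, SECTOR_BYTES))
    return sorted(sectors)


def simulate_trace(coords_per_issue, use_runtime_coords, **desc):
    """Emit one list of sector addresses for a sequence of TMA issues.

    With use_runtime_coords=False, every issue is expanded from (0, 0),
    which mimics a resolver that never sees the coordinate registers.
    """
    lines = []
    for coord in coords_per_issue:
        effective = coord if use_runtime_coords else (0, 0)
        lines.extend(box_sectors(coord=effective, **desc))
    return lines
```

With 16 issuers sharing one descriptor and each loading a different tile, the default-coordinate trace in this model contains every sector address exactly 16 times, while the runtime-coordinate trace contains no repeats. That is consistent with the 16× pattern reported above, though the model does not prove that this is the only source of duplication.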