[VectorDistribute] Rework LDS operand promotion#24408
Conversation
There was a problem hiding this comment.
The changes in this file are mostly a code move to GPUNestedLayoutUtils to make computation of a derived_thread_config layout available as a shared helper.
| } | ||
|
|
||
| FailureOr<NestedLayoutAttr> | ||
| getDerivedThreadLayout(MLIRContext *context, ArrayRef<int64_t> workgroupSize, |
There was a problem hiding this comment.
This code was moved here from LLVMGPUConfigureTensorLayouts to make computation of a derived_thread_config layout available as a shared helper.
|
Failure of the gfx1100 pipeline test is expected, this PR needs to be rebased on top of #24402 once that lands. |
53bfe40 to
3753712
Compare
kuhar
left a comment
There was a problem hiding this comment.
Just a drive-by nit, I haven't had time to review the logic
3b0974d to
7d0aeb7
Compare
keshavvinayak01
left a comment
There was a problem hiding this comment.
Needs some changes.
| auto toLayout = IREE::VectorExt::ToLayoutOp::create(builder, op->getLoc(), | ||
| vector, *readLayout); | ||
|
|
||
| FailureOr<Value> copied = allocateTensorAndWriteVector( | ||
| builder, op->getLoc(), toLayout.getResult(), | ||
| allocationLayout.getUndistributedShape()); | ||
| if (failed(copied)) { | ||
| return failure(); | ||
| } | ||
|
|
||
| auto synced = | ||
| IREE::GPU::ValueBarrierOp::create(builder, op->getLoc(), *copied); | ||
| Value newRead = | ||
| readVectorFromTensor(builder, readType, synced.getResult(0)); |
There was a problem hiding this comment.
The expected IR is correct for the write-then-read roundtrip itself, but the new direct operand-promotion path does not insert the pre-write loop-iteration barrier that the existing shared_memory_conversion materialization path deliberately inserts. If promoted reads can appear in loop bodies and reuse the same workgroup allocation, this can race with reads from the previous iteration.
If we can prove these promoted read roundtrips never land in a loop, then I guess this is fine.
There was a problem hiding this comment.
I've added the barrier back (thanks for catching that) and updated the tests.
| funcOp.walk([](IREE::VectorExt::ToLayoutOp op) { | ||
| op.removeSharedMemoryConversionAttr(); | ||
| }); |
There was a problem hiding this comment.
A shared_memory_conversion = #iree_gpu.use_global_load_dma marker is propagated by the analysis, then erased, then ignored by the late materializer. The resulting IR may be valid, but it no longer performs the requested operand promotion.
Please do a lit-test to confirm this behaviour ?
There was a problem hiding this comment.
Added a test. I also have the follow-up adding support for use_global_load_dma already lined up locally, if you prefer, I can make it part of this PR.
There was a problem hiding this comment.
I think we should merge it here.
sommerlukas
left a comment
There was a problem hiding this comment.
Thanks for the feedback!
| funcOp.walk([](IREE::VectorExt::ToLayoutOp op) { | ||
| op.removeSharedMemoryConversionAttr(); | ||
| }); |
There was a problem hiding this comment.
Added a test. I also have the follow-up adding support for use_global_load_dma already lined up locally, if you prefer, I can make it part of this PR.
| auto toLayout = IREE::VectorExt::ToLayoutOp::create(builder, op->getLoc(), | ||
| vector, *readLayout); | ||
|
|
||
| FailureOr<Value> copied = allocateTensorAndWriteVector( | ||
| builder, op->getLoc(), toLayout.getResult(), | ||
| allocationLayout.getUndistributedShape()); | ||
| if (failed(copied)) { | ||
| return failure(); | ||
| } | ||
|
|
||
| auto synced = | ||
| IREE::GPU::ValueBarrierOp::create(builder, op->getLoc(), *copied); | ||
| Value newRead = | ||
| readVectorFromTensor(builder, readType, synced.getResult(0)); |
There was a problem hiding this comment.
I've added the barrier back (thanks for catching that) and updated the tests.
keshavvinayak01
left a comment
There was a problem hiding this comment.
Looks fine, please add the use_global_load_dma changes here.
Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
1c43e95 to
06c7ae7
Compare
|
@keshavvinayak01 I've added the async DMA handling to the PR. |
keshavvinayak01
left a comment
There was a problem hiding this comment.
Let's split the async_dma extension work into a follow up PR.
Rework how promotion of operands to LDS works in the VectorDistribute pipeline.
So far,
linalg.copyoperations were inserted early in the pipeline. Now, we skip the insertion oflinalg.copyfor operands (we keep the behavior for results). Instead, the analysis from #24227 propagates the promotion types upwards from the compute operations that actually configure the promotion of operands up to the operations accessing the data in memory (transfer_read,gather) when we reachGPUVectorAlloc. Based on the propagated information, the necessary promotion to LDS can be inserted.Layout conflicts are still resolved through an LDS roundtrip at the conflict point.
Assisted-by: Codex