[VectorDistribute] Rework LDS operand promotion #24408

Open
sommerlukas wants to merge 8 commits into iree-org:main from sommerlukas:new-lds-promotion

@sommerlukas
Contributor

Rework how promotion of operands to LDS works in the VectorDistribute pipeline.

Until now, linalg.copy operations were inserted early in the pipeline. With this change, we no longer insert linalg.copy for operands (the behavior for results is unchanged). Instead, once the pipeline reaches GPUVectorAlloc, the analysis from #24227 propagates the promotion types upwards, from the compute operations that configure operand promotion to the operations that access the data in memory (transfer_read, gather). The necessary promotion to LDS is then inserted based on the propagated information.
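
To make the upward propagation concrete, here is a minimal, self-contained C++ sketch of the idea on a toy def-use graph. It is not the analysis from #24227: `Kind`, `Node`, and `propagatePromotion` are hypothetical names, and the assumption that the walk passes through elementwise operations and stops at other compute operations is made only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Toy operation kinds (hypothetical; not IREE or MLIR types).
enum class Kind { Read, Elementwise, Compute };

struct Node {
  Kind kind;
  std::vector<int> operands;        // Indices of the nodes defining each operand.
  std::vector<bool> promoteOperand; // Compute only: should operand i be promoted?
};

// Returns, for each node, whether it is a Read whose data should be promoted.
// Promotion requests start at Compute nodes and are pushed upwards through
// Elementwise nodes (an assumption of this sketch) until a Read is reached.
std::vector<bool> propagatePromotion(const std::vector<Node> &graph) {
  std::vector<bool> promote(graph.size(), false);
  std::vector<bool> visited(graph.size(), false);
  std::vector<int> worklist;

  // Seed the worklist with the producers of operands marked for promotion.
  for (const Node &node : graph) {
    if (node.kind != Kind::Compute)
      continue;
    for (std::size_t i = 0; i < node.operands.size(); ++i) {
      if (i < node.promoteOperand.size() && node.promoteOperand[i])
        worklist.push_back(node.operands[i]);
    }
  }

  // Walk upwards along the def-use chain.
  while (!worklist.empty()) {
    int idx = worklist.back();
    worklist.pop_back();
    if (visited[idx])
      continue;
    visited[idx] = true;

    const Node &node = graph[idx];
    if (node.kind == Kind::Read) {
      promote[idx] = true; // The memory access is where promotion is inserted.
    } else if (node.kind == Kind::Elementwise) {
      for (int operand : node.operands)
        worklist.push_back(operand);
    }
    // Other Compute nodes stop the walk in this sketch.
  }
  return promote;
}
```

In this toy model, a compute node that requests promotion only for its first operand marks only the read feeding that operand, even if the read is reached through an intermediate elementwise node.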

Layout conflicts are still resolved through an LDS roundtrip at the conflict point.

Assisted-by: Codex

Contributor Author

The changes in this file are mostly a code move to GPUNestedLayoutUtils to make computation of a derived_thread_config layout available as a shared helper.

}

FailureOr<NestedLayoutAttr>
getDerivedThreadLayout(MLIRContext *context, ArrayRef<int64_t> workgroupSize,
Contributor Author

This code was moved here from LLVMGPUConfigureTensorLayouts to make computation of a derived_thread_config layout available as a shared helper.

@sommerlukas
Contributor Author

The failure of the gfx1100 pipeline test is expected; this PR needs to be rebased on top of #24402 once that lands.

Comment thread compiler/src/iree/compiler/Codegen/Common/GPU/GPUVectorAlloc.cpp Outdated
@sommerlukas sommerlukas requested a review from kuhar May 12, 2026 09:45
Member

@kuhar kuhar left a comment

Just a drive-by nit, I haven't had time to review the logic

Comment thread compiler/src/iree/compiler/Codegen/Common/GPU/GPUVectorAlloc.cpp Outdated
@sommerlukas sommerlukas requested a review from kuhar May 13, 2026 16:09
Contributor

@keshavvinayak01 keshavvinayak01 left a comment

Needs some changes.

Comment on lines +238 to +251
auto toLayout = IREE::VectorExt::ToLayoutOp::create(builder, op->getLoc(),
                                                    vector, *readLayout);

FailureOr<Value> copied = allocateTensorAndWriteVector(
    builder, op->getLoc(), toLayout.getResult(),
    allocationLayout.getUndistributedShape());
if (failed(copied)) {
  return failure();
}

auto synced =
    IREE::GPU::ValueBarrierOp::create(builder, op->getLoc(), *copied);
Value newRead =
    readVectorFromTensor(builder, readType, synced.getResult(0));
Contributor

@keshavvinayak01 keshavvinayak01 May 18, 2026

The expected IR is correct for the write-then-read roundtrip itself, but the new direct operand-promotion path does not insert the pre-write loop-iteration barrier that the existing shared_memory_conversion materialization path deliberately inserts. If promoted reads can appear in loop bodies and reuse the same workgroup allocation, this can race with reads from the previous iteration.

If we can prove these promoted read roundtrips never land in a loop, then I guess this is fine.
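
To illustrate the hazard, here is a minimal, self-contained C++ sketch that models a loop body as a sequence of accesses to one shared buffer. It is not IREE code: `LdsOp` and `hasUnsyncedHazard` are hypothetical names, and the model is deliberately conservative in assuming that any two accesses may touch the same location from different threads.

```cpp
#include <cstddef>
#include <vector>

// Toy model of the operations on a single shared (LDS) buffer.
enum class LdsOp { Write, Read, Barrier };

// Returns true if a write and another access to the buffer can occur without
// a barrier between them. When `inLoop` is true, the body is replayed twice so
// that the last access of one iteration is checked against the first access
// of the next iteration.
bool hasUnsyncedHazard(const std::vector<LdsOp> &body, bool inLoop) {
  const std::size_t n = body.size();
  if (n == 0)
    return false;

  const std::size_t steps = inLoop ? 2 * n : n;
  bool havePrev = false;       // Is there an access since the last barrier?
  LdsOp prev = LdsOp::Barrier; // Most recent access, valid only if havePrev.

  for (std::size_t i = 0; i < steps; ++i) {
    LdsOp op = body[i % n];
    if (op == LdsOp::Barrier) {
      havePrev = false; // A barrier orders everything before it.
      continue;
    }
    // Two accesses with no barrier in between conflict if either is a write.
    if (havePrev && (prev == LdsOp::Write || op == LdsOp::Write))
      return true;
    prev = op;
    havePrev = true;
  }
  return false;
}
```

In this model, the roundtrip `{Write, Barrier, Read}` is safe as straight-line code but is flagged inside a loop, because the read of one iteration is followed by the write of the next with no barrier in between. Adding a barrier before the write, `{Barrier, Write, Barrier, Read}`, removes the hazard.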

Contributor Author

I've added the barrier back (thanks for catching that) and updated the tests.

Comment on lines +278 to +280
funcOp.walk([](IREE::VectorExt::ToLayoutOp op) {
  op.removeSharedMemoryConversionAttr();
});
Contributor

A shared_memory_conversion = #iree_gpu.use_global_load_dma marker is propagated by the analysis, then erased, then ignored by the late materializer. The resulting IR may be valid, but it no longer performs the requested operand promotion.

Please add a lit test to confirm this behaviour.

Contributor Author

Added a test. I also have the follow-up that adds support for use_global_load_dma lined up locally; if you prefer, I can make it part of this PR.

Contributor

I think we should merge it here.

Contributor Author

@sommerlukas sommerlukas left a comment

Thanks for the feedback!

Contributor

@keshavvinayak01 keshavvinayak01 left a comment

Looks fine, please add the use_global_load_dma changes here.

Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
@sommerlukas
Contributor Author

@keshavvinayak01 I've added the async DMA handling to the PR.

Contributor

@keshavvinayak01 keshavvinayak01 left a comment

Let's split the async_dma extension work into a follow up PR.
