Merge dev-new into master by pthomadakis · Pull Request #115 · pnnl/COMET

pthomadakis · 2025-08-22T19:15:34Z

No description provided.

* Adding a mask domain to fix triangle counting performance (closes #88)
* Adding safety check
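As a rough illustration of why a mask domain can help triangle counting, the sketch below counts triangles as a masked product: entries of the product are computed only at positions where the mask (the lower-triangular adjacency matrix) is nonzero. This is a plain-Python analogy under that assumption, not the index tree implementation; `count_triangles_masked` and its arguments are hypothetical names.

```python
def count_triangles_masked(edges, n):
    """Count triangles as the sum over (i, j) in mask L of (L . L^T)[i][j]."""
    # lower[i] holds every neighbor j of i with j < i (strictly lower-triangular L).
    lower = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:
            lower[max(u, v)].add(min(u, v))

    total = 0
    for i in range(n):
        # The mask restricts work to positions where L[i][j] is nonzero,
        # instead of computing every product entry and discarding most of them.
        for j in lower[i]:
            # Common lower neighbors k of i and j close the triangle (k < j < i).
            total += len(lower[i] & lower[j])
    return total
```

Because k < j < i holds for every match, each triangle is counted exactly once.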

…3389a29722f29216a96
* Upgraded COMET and Triton to use LLVM v20.0.0 hash: 61f8a7f618901797ee8663389a29722f29216a96
* Changed triton submodule to track the correct commit
* Moved and skipped all of the unsupported cometpy test cases

[CI/CD]:
* Try to explicitly set the paths to comet and llvm for cometpy
* Build COMET backend again during cometpy test
* Make cometpy test independent of backend

Updated README file

… most of the logic in the generated code
* TC-to-TTGT optimization now works with dynamic inputs by implementing most of the logic in the generated code
* Skip pytest cases that are known to be failing in CI/CD
* Added clangd cache to git ignore
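For readers unfamiliar with the TC-to-TTGT idea (rewriting a tensor contraction as transpose, transpose, GEMM, transpose), here is a minimal plain-Python sketch. It assumes one specific contraction, C[a][b][c] = sum_d A[a][d][b] * B[d][c], for which only A needs a transpose; the function names are hypothetical, and the real pass also chooses among permutations.

```python
def matmul(X, Y):
    """Plain GEMM on lists of rows: (m, k) x (k, n) -> (m, n)."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]


def contract_direct(A, B):
    """Reference loop nest: C[a][b][c] = sum_d A[a][d][b] * B[d][c]."""
    na, nd, nb, nc = len(A), len(A[0]), len(A[0][0]), len(B[0])
    return [[[sum(A[a][d][b] * B[d][c] for d in range(nd)) for c in range(nc)]
             for b in range(nb)]
            for a in range(na)]


def contract_ttgt(A, B):
    """Same contraction expressed as transpose + GEMM + reshape."""
    na, nd, nb = len(A), len(A[0]), len(A[0][0])
    # Transpose A from (a, d, b) to (a, b, d) and flatten (a, b) into rows.
    A_mat = [[A[a][d][b] for d in range(nd)]
             for a in range(na) for b in range(nb)]
    # GEMM: (a*b, d) x (d, c) -> (a*b, c). B is already laid out as (d, c).
    C_mat = matmul(A_mat, B)
    # Fold the rows back into (a, b, c); no output transpose is needed here.
    return [C_mat[a * nb:(a + 1) * nb] for a in range(na)]
```

The payoff in a compiler setting is that the GEMM step can be handed to a tuned micro kernel, at the cost of the extra transposes.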

1. Better support for generating valid reductions for Triton, utilizing linalg operations as an intermediate step
2. Introduced a better algorithm to infer the dimension that a tensor should be expanded to
3. More robust handling of multi-kernel cases
4. Handle most arith elementwise operations without introducing a conversion for each explicitly
5. Created a new, generic pass for gpu host that leverages gpu dialect operations instead of directly lowering to vendor-specific ones.
6. Vendor-specific lowering passes now operate on the gpu dialect-based IR operations.
7. Thread-block sizes are now passed from the generated code instead of being hardcoded in GpuUtils.cpp
8. Removed code duplication
9. Handle cases where blocks of tensors have different shapes but elementwise operations are still valid (e.g., C[i][j] = A[i] + B[j], where i and j are blocks of different sizes)
10. Handle cases where blocks of tensors have the same shape but cannot be combined directly in elementwise operations (e.g., C[i][j] = A[i] + B[j], where i and j are blocks of the same size)
11. Generalized broadcasting operations
12. Added support for cases where tensors in a reduction have the same rank but a different shape or orientation.
13. Make sure parallel loops are generated before lowering to GPU/FPGA
14. Separate AMD and NVIDIA GPU support into different ifdefs
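Items 9 and 10 both concern the pattern C[i][j] = A[i] + B[j]. The plain-Python sketch below shows the intended semantics on small blocks, and why the same-size case cannot simply be treated as a lane-by-lane elementwise add. The helper names are hypothetical and this is not the Triton code the compiler generates.

```python
def broadcast_add(a_block, b_block):
    """C[i][j] = A[i] + B[j]: A is broadcast along columns, B along rows."""
    return [[a + b for b in b_block] for a in a_block]


def naive_elementwise_add(a_block, b_block):
    """Lane-by-lane add; only meaningful when both blocks index the same dimension."""
    return [a + b for a, b in zip(a_block, b_block)]


# Different block sizes (item 9): the result is a 2 x 3 block.
different = broadcast_add([1, 2], [10, 20, 30])

# Same block sizes (item 10): the shapes match, but the correct result is
# still 2 x 2, not the length-2 vector that a direct elementwise add gives.
same = broadcast_add([1, 2], [10, 20])
wrong = naive_elementwise_add([1, 2], [10, 20])
```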

* Enabled and verified correctness for parallel TC-to-TTGT pass.
* Added parallel transpose for OptDenseTranspose pass
* [COMETPY] Fixed bug that would prevent opt-dense-transpose from being passed to the backend

Currently falling behind numpy.

[COMET]
* Added the option to skip GPU data allocation and transfers in case the data are already on the device. Two options are provided:
  1. An option to pass --prepare-gpu-host
  2. Argument attributes that denote which inputs are already GPU-resident
* GPU execution engine now also uses CUDA runtime API to offload context management.
* Made hipGpuUtils consistent with CUDA

[COMETPY]
* Added support for in-place kernel input mutation, e.g.: C[:] = A + B
* Added interoperability with other GPU-targeting frontends like cupy, pytorch, and numba, allowing their ndarrays to be used on the device with zero-copy
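The distinction behind in-place mutation (`C[:] = A + B`) can be seen in plain Python, independent of cometpy: slice assignment writes into the caller's buffer, while a bare assignment only rebinds a local name. The sketch below uses lists and hypothetical function names; it does not call the cometpy API.

```python
def add_rebind(A, B, C):
    # Rebinds the local name C; the caller's buffer is left untouched.
    C = [a + b for a, b in zip(A, B)]
    return C


def add_in_place(A, B, C):
    # Slice assignment writes the result into the caller's existing buffer.
    C[:] = [a + b for a, b in zip(A, B)]


A, B = [1, 2, 3], [10, 20, 30]

out_rebind = [0, 0, 0]
add_rebind(A, B, out_rebind)      # out_rebind is still [0, 0, 0]

out_in_place = [0, 0, 0]
add_in_place(A, B, out_in_place)  # out_in_place now holds the sums
```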

Sparse output not supported.

* Minor fix to allow chains of matrix multiplications to lower using ttgt -> micro kernel
* Correctly choose the best permutation
* Move import for cupy within the test function to prevent it from running when no GPUs are desired

…r operations between them. (#103) When it is safe, the operations of the parent scf.parallel loop will be moved to the child and a single scf.parallel loop will be generated that iterates both iteration spaces.
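The loop merging described above can be pictured with ordinary Python loops. In this sketch, the operation between the two loops depends only on the parent index and has no side effects, which is the kind of case assumed to be "safe"; the function names and the specific computation are hypothetical.

```python
from itertools import product


def scale_rows_nested(x, scale):
    out = [[0] * len(x[0]) for _ in x]
    for i in range(len(x)):              # parent parallel loop
        s = scale[i] * 2                 # operation sitting between the two loops
        for j in range(len(x[0])):       # child parallel loop
            out[i][j] = s * x[i][j]
    return out


def scale_rows_merged(x, scale):
    out = [[0] * len(x[0]) for _ in x]
    # A single loop that iterates both iteration spaces at once.
    for i, j in product(range(len(x)), range(len(x[0]))):
        s = scale[i] * 2                 # moved into the body; recomputed per (i, j)
        out[i][j] = s * x[i][j]
    return out
```

The merged form recomputes the hoisted operation for every (i, j) pair, trading a little redundant work for a flat iteration space that maps more directly onto a GPU grid.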

* linalg.matvec will now lower to parallel loops

1. Improved and generalized parsing and MLIR code generation
2. Can now parse a limited set of Python operations like loops and loading/storing to numpy array elements
3. Handle scalars as inputs to cometpy kernels
4. Cache kernels to avoid JIT when possible
5. Explicitly set shape of output sparse tensors
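Item 4 (caching kernels to avoid JIT) follows a common pattern, sketched below under stated assumptions: the cache key is the function's code object plus the argument types, and `fake_compile` stands in for the real JIT step. All names here are hypothetical; the actual cometpy cache key and storage are not described in this PR.

```python
stats = {"compiles": 0}   # counts how often the (stand-in) JIT step runs
_kernel_cache = {}


def fake_compile(fn):
    """Stand-in for an expensive JIT compilation step."""
    stats["compiles"] += 1
    return fn


def run_cached(fn, *args):
    # Assumed key: same code and same argument types means the kernel is reusable.
    key = (fn.__code__, tuple(type(a) for a in args))
    kernel = _kernel_cache.get(key)
    if kernel is None:
        kernel = fake_compile(fn)
        _kernel_cache[key] = kernel
    return kernel(*args)


def add(a, b):
    return a + b
```

Calling `run_cached(add, 1, 2)` twice compiles once; calling it with floats afterwards triggers a second compilation because the key changes.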

* Further optimized the TC-to-TTGT optimization:
  1. The target hardware architecture is now queried only once, at compile time, instead of at runtime. Even though the overhead might be small for smaller tensors, it can play a significant role for larger ones, since the microkernel wrapper might be called millions of times.
  2. The micro kernel is able to handle any size for the k (reduce) dimension, so we do not need to emit a loop for it.
  3. Replaced function calls in blis_interface with macros to ensure that they are inlined.
  4. The values of alpha and beta from ta.mul are now passed directly to the microkernel; this allows us to avoid reading tensor C (the output) when beta == 0.0.
  5. When beta == 0.0 we no longer need to initialize the output tensor to zero, since the micro kernel will not read it.
  6. Made pointers in _mlir_ciface_linalg_matmul_viewsxs_viewsxs_viewsxs restrict.

  On Junction, these optimizations decrease the runtime from ~5-6 sec to ~1.6-2 sec when running intensli ([a, b, d] = [1024], [c] = [512]) with 32 threads.
* Opt-dense transpose pass now uses tensor slices properly
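Points 4 and 5 rest on the usual GEMM update C = alpha * (A @ B) + beta * C. A minimal plain-Python sketch of the beta == 0.0 shortcut follows; it is an illustration of the semantics only, not the BLIS micro kernel, and `gemm` is a hypothetical name.

```python
def gemm(alpha, A, B, beta, C):
    """Update C in place with alpha * (A @ B) + beta * C."""
    for i in range(len(A)):
        for j in range(len(B[0])):
            acc = sum(A[i][k] * B[k][j] for k in range(len(B)))
            if beta == 0.0:
                # C is never read, so it may hold uninitialized values.
                C[i][j] = alpha * acc
            else:
                C[i][j] = alpha * acc + beta * C[i][j]
    return C
```

Without the branch, multiplying stale contents of C by zero is not always harmless: `0.0 * nan` is still `nan`, so an uninitialized output buffer would have to be zeroed first.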

* Support explicitly setting block_size and parallel dims in GPU
* Re-enabled support for SPMM
* Added attribute in IndexTreeIndicesOp that is used to map an iteration space to a GPU block
* Modified ParallelLoopsToGpu to operate on scf::Forall instead of scf::Parallel
* Changed mlir generator to automatically generate i32 indices for sparse matrices when targeting GPU devices
* Updated cometPy to latest GPU changes
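To make the role of block_size concrete, the sketch below emulates how a one-dimensional iteration space is commonly split across GPU blocks and threads. It is a generic illustration assuming the standard ceiling-division launch with a bounds guard; `launch_1d` is a hypothetical helper, not a COMET function.

```python
def launch_1d(n, block_size, body):
    """Emulate a 1-D GPU launch over n iterations; returns the number of blocks."""
    num_blocks = (n + block_size - 1) // block_size   # ceiling division
    for block in range(num_blocks):
        for thread in range(block_size):
            i = block * block_size + thread           # global iteration index
            if i < n:                                 # guard for the partial last block
                body(i)
    return num_blocks
```

With n = 10 and block_size = 4, three blocks are launched and the last two threads of the final block do nothing.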

* [GPU] Avoid redundant data transfers depending on the type of access required in a kernel (read/write/read-write)
* [GPU] Added simple heuristic to detect cases where data are already on the device (because of a previous operation). Currently it's quite conservative, as it does not track memref aliases and will just copy if an alias is detected.
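One way to read the two points above is as a per-argument transfer decision. The sketch below is an assumption about how such a rule could look, not the actual pass: it assumes written data are always copied back to the host, and that any possible alias disables the residency shortcut. `plan_transfers` and its parameters are hypothetical.

```python
def plan_transfers(access, on_device=False, may_alias=False):
    """Return (copy_to_device, copy_to_host) for one kernel argument."""
    if access not in ("read", "write", "read-write"):
        raise ValueError(f"unknown access type: {access}")
    reads = access in ("read", "read-write")
    writes = access in ("write", "read-write")
    # Conservative rule: if aliasing cannot be ruled out, ignore residency info.
    resident = on_device and not may_alias
    copy_to_device = reads and not resident   # write-only data need no upload
    copy_to_host = writes                     # assumed: results are needed on the host
    return copy_to_device, copy_to_host
```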

* Introduced superbuild CMakeLists.txt
* Minor other adjustments to streamline building COMET
* Added automated build for GPU devices
* Added support for FPGA in the new CMakeLists.txt
* Updated README.md
* Updated CI script

johnpzh · 2025-08-22T20:12:50Z

Great!

AK2000 added 30 commits October 14, 2024 17:47

Begin working on IndexTree transformations

4dac7bc

Fixing some of the problems introduced on merge

20f916a

Resolved including of device mapping attribute

864c9f8

Fixed type inclusion, parsing and printing

0447561

V1 - Lower TA to new IndexTree ops, but removed everything else

cf505bd

Fixes to TA to change how file is included

3f4e849

Creating new block for index tree

d6bc7fd

Implement domain inference pass, fix to index ordering

48fe5fd

[WIP] Fragile version of index tree to SCF lowering

2d6ba90

Fix carrying tensors inside loop, refactor domain concretization

972ee7e

Adding TA to index tree patterns for elementwise operations

c29ee1f

[WIP] Trying to implement intersection op lowering

8d44653

[WIP] Got domain intersection working, but only with dense output

f25fb52

[WIP] Minor fix to ordering of reduce args

35cf1c5

[WIP] Beginning support for sparse output tensors with new index tree

04b31ba

[WIP] Inlined itree op, got hacky version of removing set op working

1c06a26

[WIP] Included lowering to LLVM, lowering print op does not work

00801b1

[WIP] Almost got print op lowering working

352da6a

[WIP] Fixed bufferization

9436939

[WIP] Generate symbolic pass for sparse tensor declarations

ac98adf

[WIP] Lots of changes for first try at symbolic domain pass and works…

935079d

…pace transformation pass

[WIP] Broke everything trying to redo tensor conversion infrastructure

0c2d98d

Changing a lot to create new sparse tensor types, and appropriate lowe…

bb32d70

…ring passes

Fixing some problems with tests, add pure ops

ac27cfb

Fixed inconsistencies in test suite

52495d4

Fixing more of the test cases

27dd373

Fixed dense transpose and print elapsed time

8bc6187

Fixing errors in typing

3f777dd

Fixing errors in typing and set op

8cb0d2a

Adding back ttgt pass

dc0e445

pthomadakis and others added 26 commits March 27, 2025 10:55

[GPU] Make sure CUDA context is destroyed before exiting

3b73299

[GPU] Ignore unsupported test cases for now

7339990

Adding a Fill and Zero Mask node to the index tree dialect (#89)

91f742b

* Adding a mask domain to fix triangle counting performance (closes #88)
* Adding safety check

Fixed or silenced all warnings

5c6eda1

[GPU] Enabled AMDGPU execution

525bc0b

Make sure FPGA and GPU can co-exist

f581fe4

Updated README file

Updated CuGpuUtils to match HIP new design

cf771a2

Fixed bug in TTGT optimization and changed code generation to directly…

ccf03b9

… output the best permutation when sizes are static.

[3D-CSF] Fixed bug in lowering CSF format for 3D tensor (issue #80)

d5fcce9

Parallel TC-to-TTGT and dense transpose

f922015

* Enabled and verified correctness for parallel TC-to-TTGT pass.
* Added parallel transpose for OptDenseTranspose pass
* [COMETPY] Fixed bug that would prevent opt-dense-transpose from being passed to the backend

Currently falling behind numpy.

Enabled parallel execution with fusion.

c440786

Sparse output not supported.

Disable parallel + fusion again as there is another bug in it -> scf …

1d8c997

…pass preventing it from working

Fixes for TCtoTTGT pass

c9fd9d9

* Minor fix to allow chains of matrix multiplications to lower using ttgt -> micro kernel
* Correctly choose the best permutation
* Move import for cupy within the test function to prevent it from running when no GPUs are desired

GPU lowering can now support nested scf.parallel operations with othe…

3feccf9

…r operations between them. (#103) When it is safe, the operations of the parent scf.parallel loop will be moved to the child and a single scf.parallel loop will be generated that iterates both iteration spaces.

linalg.matvec will now lower to parallel loops

4e0c29d

* linalg.matvec will now lower to parallel loops

Removed redundant tensor format attributes (e.g. Dense,CSR) (#106)

d0a0ba8

[COMETPY][GPU] Fixed bug in interoperability with other GPU frontends…

89aaa1d

… (cuPY, Pytorch) (#111)

Automate build (#114)

7c5ba6c

* Introduced superbuild CMakeLists.txt
* Minor other adjustments to streamline building COMET
* Added automated build for GPU devices
* Added support for FPGA in the new CMakeLists.txt
* Updated README.md
* Updated CI script

pthomadakis requested a review from johnpzh August 22, 2025 19:16

johnpzh approved these changes Aug 22, 2025

pthomadakis merged commit b80a7a7 into master Aug 22, 2025
5 of 6 checks passed

pthomadakis merged 171 commits into master from dev-new

4 participants
