Merge dev-new into master#115

Merged
pthomadakis merged 171 commits into master from dev-new on Aug 22, 2025
@pthomadakis

Collaborator

No description provided.

AK2000 added 30 commits October 14, 2024 17:47
pthomadakis and others added 26 commits March 27, 2025 10:55
* Adding a mask domain to fix triangle counting performance (closes #88)

* Adding safety check
…3389a29722f29216a96

* Upgraded COMET and Triton to use LLVM v20.0.0 hash: 61f8a7f618901797ee8663389a29722f29216a96
* Changed triton submodule to track the correct commit
* Moved and skipped all of the unsupported cometpy test cases

[CI/CD]:
* Try to explicitly set the paths to comet and llvm for cometpy
* Build COMET backend again during cometpy test
* Make cometpy test independent of backend
… most of the logic in the generated code

* TC-to-TTGT optimization now works with dynamic inputs by implementing most of the logic in the generated code
* Skip pytest cases that are known to be failing in CI/CD
* Added clangd cache to git ignore
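For readers unfamiliar with TTGT (transpose-transpose-GEMM-transpose), the idea can be sketched in plain Python. This is only an illustration, not the COMET pass: the contraction `C[a][b][c] = sum_d A[a][d][b] * B[d][c]`, the function names, and the nested-list layout are all assumptions chosen for the example. The sizes are ordinary runtime arguments, loosely mirroring the dynamic-input case.

```python
def contract_reference(A, B, na, nb, nc, nd):
    """Direct loop nest: C[a][b][c] = sum over d of A[a][d][b] * B[d][c]."""
    return [[[sum(A[a][d][b] * B[d][c] for d in range(nd))
              for c in range(nc)]
             for b in range(nb)]
            for a in range(na)]


def contract_ttgt(A, B, na, nb, nc, nd):
    """Same contraction, expressed as transpose + flatten + matrix multiply."""
    # Transpose and flatten: A[a][d][b] becomes row (a*nb + b), column d.
    A_mat = [[A[a][d][b] for d in range(nd)]
             for a in range(na) for b in range(nb)]
    # Plain GEMM: (na*nb x nd) times (nd x nc).
    C_mat = [[sum(row[d] * B[d][c] for d in range(nd)) for c in range(nc)]
             for row in A_mat]
    # Fold the flattened rows back into C[a][b][c]; no final transpose is
    # needed here because the output index order already matches.
    return [C_mat[a * nb:(a + 1) * nb] for a in range(na)]


# Sizes are only known at call time.
na, nb, nc, nd = 2, 3, 4, 5
A = [[[a + 2 * d + 3 * b for b in range(nb)] for d in range(nd)]
     for a in range(na)]
B = [[d - c for c in range(nc)] for d in range(nd)]
assert contract_ttgt(A, B, na, nb, nc, nd) == contract_reference(A, B, na, nb, nc, nd)
```

The benefit in a real compiler comes from handing the middle step to a tuned GEMM kernel; the sketch only shows that the reshaping preserves the result.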
1. Better support for generating valid reductions for Triton, utilizing linalg operations as an intermediate step
2. Introduced better algorithm to infer the dimension that a tensor should be expanded to
3. More robust handling of multi-kernel cases
4. Handle most arith elementwise operations without introducing a conversion for each explicitly
5. Created a new, generic pass for gpu host that leverages gpu dialect operations instead of directly lowering to vendor-specific ones.
6. Vendor-specific lowering passes now operate on the gpu dialect-based IR operations.
7. Thread-block sizes are now passed from the generated code instead of being hardcoded in GpuUtils.cpp
8. Removed code duplication
9. Handle cases where blocks of tensors have different shapes but elementwise operations are still valid (e.g., C[i][j] = A[i] + B[j], where i and j are blocks of different sizes)
10. Handle cases where blocks of tensors have the same shape but cannot be directly included in elementwise operations (i.e., C[i][j] = A[i] + B[j], where i, j are same size blocks)
11. Generalized broadcasting operations
12. Added support for cases where tensors in a reduction are same rank but different shape or different orientation.
13. Make sure parallel loops are generated before lowering to GPU/FPGA
14. Separate AMD and NVIDIA GPU support into different ifdefs
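Items 9 and 10 can be illustrated with a small blocked outer sum. The sketch below is plain Python with hypothetical helper names (`expand_rows`, `expand_cols`, `outer_sum_blocked`); it is not the generated Triton code. It shows why each 1-D block has to be expanded to a common 2-D tile before the elementwise add, whether the two blocks differ in size (item 9) or happen to have the same size (item 10, where adding the two 1-D blocks directly would be legal but would compute the wrong thing).

```python
def expand_rows(block, ncols):
    """Expand a 1-D block of length m into an m x ncols tile (repeat along columns)."""
    return [[x] * ncols for x in block]


def expand_cols(block, nrows):
    """Expand a 1-D block of length n into an nrows x n tile (repeat along rows)."""
    return [list(block) for _ in range(nrows)]


def outer_sum_blocked(A, B, bi, bj):
    """Compute C[i][j] = A[i] + B[j] one (bi x bj) tile at a time."""
    C = [[0] * len(B) for _ in A]
    for i0 in range(0, len(A), bi):
        for j0 in range(0, len(B), bj):
            a_blk = A[i0:i0 + bi]
            b_blk = B[j0:j0 + bj]
            # Broadcast both blocks to the same 2-D tile shape.
            a_tile = expand_rows(a_blk, len(b_blk))
            b_tile = expand_cols(b_blk, len(a_blk))
            # Now the add is a genuine elementwise operation on equal shapes.
            for di in range(len(a_blk)):
                for dj in range(len(b_blk)):
                    C[i0 + di][j0 + dj] = a_tile[di][dj] + b_tile[di][dj]
    return C


A = [1, 2, 3, 4]
B = [10, 20, 30, 40, 50, 60]
expected = [[a + b for b in B] for a in A]
assert outer_sum_blocked(A, B, 2, 3) == expected  # blocks of different sizes
assert outer_sum_blocked(A, B, 2, 2) == expected  # blocks of the same size
```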
… output the best permutation when sizes are static.
* Enabled and verified correctness for parallel TC-to-TTGT pass.
* Added parallel transpose for OptDenseTranspose pass
* [COMETPY] Fixed a bug that prevented opt-dense-transpose from being passed to the backend

Currently falling behind numpy.
* [COMET]
* Added the option to skip gpu data allocation and transfers, in case the data are already in the device.
  2 options are provided:
  1. An option to pass --prepare-gpu-host
  2. Argument attributes that denote which inputs are already GPU-resident
* GPU execution engine now also uses CUDA runtime API to offload context management.
* Made hipGpuUtils consistent with CUDA

[COMETPY]
* Added support for in-place kernel input mutation, e.g.:
C[:] = A + B
* Added interoperability with other GPU-targeting frontends such as cupy, pytorch, and numba, so that their ndarrays can be used on the device with zero-copy
Sparse output not supported.
* Minor fix to allow chains of matrix multiplications to lower using ttgt -> micro kernel
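The difference between `C[:] = A + B` and `C = A + B` is a general Python one, and a minimal sketch with plain lists makes it visible. This does not use cometpy; the function names are hypothetical and the lists stand in for array arguments.

```python
def add_rebind(C, A, B):
    # Rebinds the local name C to a new list; the caller's object is untouched.
    C = [a + b for a, b in zip(A, B)]
    return C


def add_in_place(C, A, B):
    # Slice assignment writes into the existing object the caller passed in.
    C[:] = [a + b for a, b in zip(A, B)]


A = [1, 2, 3]
B = [10, 20, 30]
out = [0, 0, 0]

add_rebind(out, A, B)
assert out == [0, 0, 0]      # caller sees no change

add_in_place(out, A, B)
assert out == [11, 22, 33]   # caller's buffer was mutated
```

In-place mutation is what lets a kernel write its result into a buffer owned by the caller, which is also the natural pattern when that buffer comes from another frontend.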

* Correctly choose the best permutation

* Move import for cupy within the test function to prevent it from running when no GPUs are desired
…r operations between them. (#103)

When it is safe, the operations of the parent scf.parallel loop are moved into the child, and a single scf.parallel loop is generated that iterates over both iteration spaces.
* linalg.matvec will now lower to parallel loops
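The loop merging described above can be pictured with ordinary Python loops standing in for scf.parallel. This is a conceptual sketch only: the loop bodies are made up, and "safe" is taken here to mean that the operation sitting between the two loops is side-effect free and depends only on the parent index, which is an assumption rather than the pass's actual legality check.

```python
from itertools import product


def nested(n, m):
    """Parent loop with an operation between it and the child loop."""
    C = [[0] * m for _ in range(n)]
    for i in range(n):          # parent parallel loop
        scale = i * i + 1       # operation between the two loops
        for j in range(m):      # child parallel loop
            C[i][j] = scale * j
    return C


def fused(n, m):
    """Single loop over the combined (i, j) iteration space."""
    C = [[0] * m for _ in range(n)]
    for i, j in product(range(n), range(m)):
        scale = i * i + 1       # moved into the child body; recomputed per (i, j)
        C[i][j] = scale * j
    return C


assert nested(3, 4) == fused(3, 4)
```

The fused form trades a little recomputation for one flat iteration space, which is easier to map onto a grid of parallel workers.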
1. Improved and generalized parsing and MLIR code generation
2. Can now parse a limited set of Python operations, such as loops and loading/storing numpy array elements
3. Handle scalars as inputs to cometpy kernels
4. Cache kernels to avoid JIT when possible
5. Explicitly set shape of output sparse tensors
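Kernel caching (item 4) might look roughly like the sketch below. The PR does not say how cometpy builds its cache key, so everything here is hypothetical: the `KernelCache` class, keying on the kernel source plus argument type names, and the `toy_compile` stand-in for the JIT step.

```python
class KernelCache:
    """Memoize compiled kernels so repeated calls skip the JIT step."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._cache = {}
        self.compilations = 0

    def get(self, source, arg_signature):
        key = (source, arg_signature)   # hypothetical cache key
        if key not in self._cache:
            self._cache[key] = self._compile_fn(source, arg_signature)
            self.compilations += 1
        return self._cache[key]


def signature_of(*args):
    """Hypothetical signature: just the type name of each argument."""
    return tuple(type(a).__name__ for a in args)


def toy_compile(source, arg_signature):
    """Stand-in for an expensive JIT compilation; returns a trivial 'kernel'."""
    return lambda x, y: x + y


cache = KernelCache(toy_compile)
k1 = cache.get("add", signature_of(1, 2))
k2 = cache.get("add", signature_of(3, 4))      # same types: cache hit
assert k1 is k2 and cache.compilations == 1
k3 = cache.get("add", signature_of(1.0, 2))    # new types: compiled again
assert cache.compilations == 2
```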
* Further optimized TC-to-TTGT optimization:
1. The target hardware architecture is now queried only once, at compile time, instead of at runtime. Although the overhead may be small for smaller tensors, it can be significant for larger ones, since the microkernel wrapper might be called millions of times
2. The micro kernel is able to handle any size for the k (reduce) dimension, thus we do not need to emit a loop for it.
3. Replaced function calls in blis_interface with macros to ensure that they are inlined.
4. The values of alpha and beta from ta.mul are now passed directly to the microkernel, which allows us to avoid reading tensor C (the output) when beta == 0.0
5. When beta == 0.0 we no longer need to initialize the output tensor to zero since the micro kernel will not read it.
6. Made pointers in _mlir_ciface_linalg_matmul_viewsxs_viewsxs_viewsxs restrict

On Junction, these optimizations decrease the runtime from about 5-6 sec to about 1.6-2 sec when running intensli ([a, b, d] = [1024], [c] = [512]) with 32 threads.
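Items 4 and 5 rely on the usual GEMM convention C = alpha * A * B + beta * C. A minimal pure-Python sketch (not the BLIS micro kernel, and with a hypothetical `gemm` signature) shows why skipping the read of C when beta == 0.0 also removes the need to zero-initialize it: here C starts out filled with NaN to mimic uninitialized memory, and the result is still correct.

```python
import math


def gemm(alpha, A, B, beta, C):
    """In place: C = alpha * (A @ B) + beta * C, without reading C when beta == 0."""
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):          # the whole reduce dimension is handled here
                acc += A[i][p] * B[p][j]
            if beta == 0.0:
                C[i][j] = alpha * acc   # C is write-only on this path
            else:
                C[i][j] = alpha * acc + beta * C[i][j]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[math.nan, math.nan], [math.nan, math.nan]]   # stands in for uninitialized output

gemm(1.0, A, B, 0.0, C)
assert C == [[19.0, 22.0], [43.0, 50.0]]

gemm(2.0, A, B, 1.0, C)                            # accumulate into the existing C
assert C == [[57.0, 66.0], [129.0, 150.0]]
```

Had the kernel computed `0.0 * C[i][j]` instead of branching, the NaN values would have propagated into the result.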


* Opt-dense transpose pass now uses tensor slices properly
* Support explicitly setting block_size and parallel dims in GPU
* Re-enabled support for SPMM
* Added attribute in IndexTreeIndicesOp that is used to map an iteration space to a GPU block
* Modified ParallelLoopsToGpu to operate on scf::Forall instead of scf::Parallel
* Changed mlir generator to automatically generate i32 indices for sparse matrices when targeting GPU devices
* Updated cometPy to latest GPU changes
* [GPU] Avoid redundant data transfers depending on the type of access required in a kernel (read/write/read-write)

* [GPU] Added simple heuristic to detect cases where data are already in the device (because of a previous operation).
Currently it's quite conservative as it does not track memref aliases and will just copy if an alias is detected.
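One plausible reading of the two [GPU] items above is a small decision table: copy to the device only if the kernel reads the buffer and it is not already resident, and copy back only if the kernel writes it. The sketch below encodes that reading. The function name, the access strings, and the `resident_on_device` flag are all hypothetical, and the real heuristic (including its conservative alias handling) is not reproduced.

```python
def plan_transfers(access, resident_on_device=False):
    """Decide which copies a kernel argument needs, given how the kernel uses it."""
    if access not in ("read", "write", "read-write"):
        raise ValueError(f"unknown access type: {access!r}")
    reads = access in ("read", "read-write")
    writes = access in ("write", "read-write")
    return {
        # Upload only if the kernel reads the data and it is not already there.
        "host_to_device": reads and not resident_on_device,
        # Download only if the kernel may have changed the data.
        "device_to_host": writes,
    }


assert plan_transfers("read") == {"host_to_device": True, "device_to_host": False}
assert plan_transfers("write") == {"host_to_device": False, "device_to_host": True}
assert plan_transfers("read-write") == {"host_to_device": True, "device_to_host": True}
assert plan_transfers("read", resident_on_device=True) == {
    "host_to_device": False,
    "device_to_host": False,
}
```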
* Introduced superbuild CMakeLists.txt
* Other minor adjustments to streamline building COMET
* Added automated build for GPU devices
* Added support for FPGA in the new CMakeLists.txt
* Updated README.md
* Updated CI script
@pthomadakis pthomadakis requested a review from johnpzh August 22, 2025 19:16
@johnpzh

johnpzh commented Aug 22, 2025

Collaborator

Great!

@johnpzh johnpzh left a comment

Collaborator

Great!

@pthomadakis pthomadakis merged commit b80a7a7 into master Aug 22, 2025
5 of 6 checks passed


4 participants