Merge dev-new into master #115
Merged
…pace transformation pass
* Adding a mask domain to fix triangle counting performance (closes #88)
* Adding safety check
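To illustrate why masking matters for triangle counting, here is a minimal pure-Python sketch. It is not COMET's index-tree implementation; it assumes the common masked formulation, in which the product L·Lᵀ of the strictly lower-triangular adjacency matrix L is evaluated only at positions where L itself is nonzero. The function name and the edge-list input format are hypothetical, and the input is assumed to have no self-loops.

```python
def count_triangles_masked(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    # lower[i] holds the neighbours of i with a smaller id,
    # i.e. the nonzero columns of row i of the lower-triangular matrix L.
    lower = {}
    for u, v in edges:
        hi, lo = max(u, v), min(u, v)
        lower.setdefault(hi, set()).add(lo)

    count = 0
    for i, row in lower.items():
        # Mask: only (i, j) positions where L[i][j] is nonzero are visited.
        for j in row:
            # (L @ L.T)[i][j] = number of k with L[i][k] and L[j][k] both set.
            count += len(row & lower.get(j, set()))
    return count


one_triangle = [(0, 1), (1, 2), (0, 2), (2, 3)]
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

Without the mask, every (i, j) pair would be evaluated and most of the work would be discarded afterwards.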
…3389a29722f29216a96
* Upgraded COMET and Triton to use LLVM v20.0.0 hash: 61f8a7f618901797ee8663389a29722f29216a96
* Changed Triton submodule to track the correct commit
* Moved and skipped all of the unsupported cometpy test cases

[CI/CD]:
* Try to explicitly set the paths to comet and llvm for cometpy
* Build COMET backend again during cometpy test
* Make cometpy test independent of backend
Updated README file
… most of the logic in the generated code
* TC-to-TTGT optimization now works with dynamic inputs by implementing most of the logic in the generated code
* Skip pytest cases that are known to be failing in CI/CD
* Added clangd cache to git ignore
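For readers unfamiliar with TTGT (transpose-transpose-GEMM-transpose), the following NumPy sketch shows the general idea on one contraction. It is not the code COMET generates; the contraction `C[a, b, c] = sum_d A[a, d, c] * B[d, b]` and the function name are chosen purely for illustration. Shapes are read at call time, which loosely mirrors the dynamic-input case.

```python
import numpy as np


def contract_ttgt(A, B):
    """C[a, b, c] = sum_d A[a, d, c] * B[d, b] via transpose, GEMM, transpose."""
    a, d, c = A.shape            # shapes are only known at run time
    d2, b = B.shape
    if d != d2:
        raise ValueError("contracted dimensions do not match")
    # Transpose A so the contracted index d is last, then flatten (a, c) into rows.
    A_mat = np.transpose(A, (0, 2, 1)).reshape(a * c, d)
    # B is already a (d, b) matrix, so no transpose is needed for it here.
    C_mat = A_mat @ B            # GEMM: (a*c, d) x (d, b) -> (a*c, b)
    # Unflatten to (a, c, b) and permute to the requested output order (a, b, c).
    return C_mat.reshape(a, c, b).transpose(0, 2, 1)


rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))
B = rng.standard_normal((4, 6))
C = contract_ttgt(A, B)
reference = np.einsum("adc,db->abc", A, B)
```

Choosing which indices to group and in what order is the "permutation" that the pass has to pick; this sketch hardcodes one choice.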
1. Better support for generating valid reductions for Triton, utilizing linalg operations as an intermediate step
2. Introduced a better algorithm to infer the dimension that a tensor should be expanded to
3. More robust handling of multi-kernel cases
4. Handle most arith elementwise operations without introducing a conversion for each explicitly
5. Created a new, generic pass for GPU host that leverages gpu dialect operations instead of directly lowering to vendor-specific ones
6. Vendor-specific lowering passes now operate on the gpu dialect-based IR operations
7. Thread-block sizes are now passed from the generated code instead of being hardcoded in GpuUtils.cpp
8. Removed code duplication
9. Handle cases where blocks of tensors are of different shape but elementwise operations are still valid (e.g., something like C[i][j] = A[i] + B[j], where i, j are different size blocks)
10. Handle cases where blocks of tensors have the same shape but cannot be directly included in elementwise operations (e.g., C[i][j] = A[i] + B[j], where i, j are same size blocks)
11. Generalized broadcasting operations
12. Added support for cases where tensors in a reduction are the same rank but different shape or different orientation
13. Make sure parallel loops are generated before lowering to GPU/FPGA
14. Separate AMD and NVIDIA GPU support into different ifdefs
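Items 9 and 10 are easier to see with a concrete case. The NumPy analogy below is only a sketch of the semantics, not the Triton code that COMET emits: each operand is expanded along the dimension it lacks before the elementwise add. The same-size case shows why shape compatibility alone is not enough.

```python
import numpy as np

# Different-size blocks: i has 3 elements, j has 4.
A = np.arange(3.0)            # indexed by i
B = np.arange(4.0) * 10.0     # indexed by j
# Expand A along j and B along i, then add elementwise.
C = A[:, None] + B[None, :]   # shape (3, 4), C[i][j] = A[i] + B[j]

# Same-size blocks: a direct add is shape-valid but computes A2[i] + B2[i],
# which is not the requested C[i][j] = A2[i] + B2[j].
A2 = np.arange(3.0)
B2 = np.arange(3.0) * 10.0
wrong = A2 + B2                      # shape (3,)
right = A2[:, None] + B2[None, :]    # shape (3, 3)
```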
… output the best permutation when sizes are static.
* Enabled and verified correctness for parallel TC-to-TTGT pass.
* Added parallel transpose for OptDenseTranspose pass
* [COMETPY] Fixed bug that would prevent opt-dense-transpose from being passed to the backend

Currently falling behind numpy.
[COMET]
* Added the option to skip GPU data allocation and transfers, in case the data are already on the device. Two options are provided:
  1. An option to pass --prepare-gpu-host
  2. Argument attributes that denote which inputs are already GPU-resident
* GPU execution engine now also uses CUDA runtime API to offload context management.
* Made hipGpuUtils consistent with CUDA

[COMETPY]
* Added support for in-place kernel input mutation, i.e.: C[:] = A + B
* Added interoperability with other GPU-targeting frontends like cupy, pytorch, numba, allowing their ndarrays to be used on the device with zero-copy
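The difference that `C[:] = A + B` makes can be shown with plain NumPy. This is only a sketch of the Python semantics that the in-place support relies on; it does not use cometpy, and both function names are hypothetical.

```python
import numpy as np


def add_rebind(A, B, C):
    # Rebinds the local name C; the caller's array is left untouched.
    C = A + B
    return C


def add_in_place(A, B, C):
    # Slice assignment writes into the buffer the caller passed in.
    C[:] = A + B


A = np.ones(4)
B = np.full(4, 2.0)

out1 = np.zeros(4)
add_rebind(A, B, out1)      # out1 still holds zeros

out2 = np.zeros(4)
add_in_place(A, B, out2)    # out2 now holds A + B
```

Writing into an existing buffer is also what makes zero-copy interoperability useful: an array owned by another frontend can receive the result without an extra allocation.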
Sparse output not supported.
…pass preventing it from working
* Minor fix to allow chains of matrix multiplications to lower using ttgt -> micro kernel
* Correctly choose the best permutation
* Move import for cupy within the test function to prevent it from running when no GPUs are desired
…r operations between them. (#103) When it is safe, the operations of the parent scf.parallel loop will be moved to the child and a single scf.parallel loop will be generated that iterates both iteration spaces.
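A rough Python analogy of this transformation, assuming the operation between the loops is side-effect free and depends only on the parent index (one plausible reading of "safe"; the actual legality check in the pass is not shown in this PR text):

```python
from itertools import product


def nested(n, m):
    out = [[0] * m for _ in range(n)]
    for i in range(n):            # parent loop
        scale = 2 * i             # operation sitting between the two loops
        for j in range(m):        # child loop
            out[i][j] = scale + j
    return out


def fused(n, m):
    out = [[0] * m for _ in range(n)]
    # One loop over the combined (i, j) iteration space.
    for i, j in product(range(n), range(m)):
        scale = 2 * i             # parent operation moved into the child body
        out[i][j] = scale + j
    return out
```

The moved operation is recomputed for every `j`, which trades a little redundant work for a single, larger parallel iteration space.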
* linalg.matvec will now lower to parallel loops
1. Improved and generalized parsing and MLIR code generation
2. Can now parse a limited set of Python operations like loops and loading/storing to numpy array elements
3. Handle scalars as inputs to cometpy kernels
4. Cache kernels to avoid JIT when possible
5. Explicitly set shape of output sparse tensors
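Item 4 (kernel caching) follows a familiar pattern, sketched below in plain Python. This is not cometpy's cache: the cache key (source text plus argument types), the `fake_compile` stand-in, and all names are assumptions made for illustration.

```python
stats = {"compiles": 0}
_kernel_cache = {}


def fake_compile(source):
    """Stand-in for an expensive JIT step; here it only runs Python's exec."""
    stats["compiles"] += 1
    namespace = {}
    exec(source, namespace)
    return namespace["kernel"]


def get_kernel(source, arg_types):
    # Assumed key: identical source and argument types can reuse a kernel.
    key = (source, arg_types)
    if key not in _kernel_cache:
        _kernel_cache[key] = fake_compile(source)
    return _kernel_cache[key]


SRC = "def kernel(a, b):\n    return a + b\n"
k1 = get_kernel(SRC, ("f64", "f64"))
k2 = get_kernel(SRC, ("f64", "f64"))   # cache hit, no recompilation
k3 = get_kernel(SRC, ("i32", "i32"))   # new signature, compiled again
```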
* Further optimized TC-to-TTGT optimization:
  1. The target hardware architecture is now queried only once, at compile time, instead of at runtime. Even though the overhead might be small for smaller tensors, it can play a significant role for larger ones, since the microkernel wrapper might be called millions of times
  2. The microkernel is able to handle any size for the k (reduce) dimension, thus we do not need to emit a loop for it.
  3. Replaced function calls in blis_interface with macros to ensure that they are inlined.
  4. The values of alpha and beta from ta.mul are now passed directly to the microkernel; this allows us to avoid reading tensor C (the output) when beta == 0.0
  5. When beta == 0.0 we no longer need to initialize the output tensor to zero, since the microkernel will not read it.
  6. Made pointers in _mlir_ciface_linalg_matmul_viewsxs_viewsxs_viewsxs restrict

  On Junction, these optimizations decrease the runtime from ~5-6 sec to ~1.6-2 sec when running intensli ([a, b, d] = [1024], [c] = [512]) with 32 threads.
* Opt-dense transpose pass now uses tensor slices properly
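Items 4 and 5 hinge on the GEMM form `C = alpha * A @ B + beta * C`. The NumPy sketch below is not the BLIS microkernel; it only shows why skipping the read of `C` when `beta == 0.0` removes the need to zero-initialize the output. The function name is hypothetical.

```python
import numpy as np


def gemm(alpha, A, B, beta, C):
    """C <- alpha * A @ B + beta * C, without reading C when beta == 0."""
    if beta == 0.0:
        C[:] = alpha * (A @ B)             # C is write-only on this path
    else:
        C[:] = alpha * (A @ B) + beta * C  # C is read and written
    return C


A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.eye(2)

# Deliberately fill the output with NaN to stand in for uninitialized memory.
C = np.full((2, 2), np.nan)
gemm(1.0, A, B, 0.0, C)

# The naive formula still reads C, and 0.0 * NaN is NaN.
naive = A @ B + 0.0 * np.full((2, 2), np.nan)
```

With the naive formula the garbage in `C` leaks into the result, which is why the output previously had to be zeroed first.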
* Support explicitly setting block_size and parallel dims in GPU
* Re-enabled support for SPMM
* Added attribute in IndexTreeIndicesOp that is used to map an iteration space to a GPU block
* Modified ParallelLoopsToGpu to operate on scf::Forall instead of scf::Parallel
* Changed mlir generator to automatically generate i32 indices for sparse matrices when targeting GPU devices
* Updated cometpy to latest GPU changes
… (cuPY, Pytorch) (#111)
* [GPU] Avoid redundant data transfers depending on the type of access required in a kernel (read/write/read-write)
* [GPU] Added simple heuristic to detect cases where data are already in the device (because of a previous operation). Currently it's quite conservative as it does not track memref aliases and will just copy if an alias is detected.
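One way to picture the access-based transfer rule is the small decision function below. It is a sketch under assumptions, not the pass itself: the function name, the string access modes, and the choice to always copy results back to the host are hypothetical.

```python
def plan_transfers(access, on_device=False):
    """Return (copy_in, copy_out) for a single kernel argument.

    access is one of "read", "write", "read-write".
    on_device marks data assumed to be device-resident already.
    """
    if access not in ("read", "write", "read-write"):
        raise ValueError(f"unknown access mode: {access!r}")
    # Host-to-device copy is only needed if the kernel reads the data
    # and the data is not already on the device.
    copy_in = access in ("read", "read-write") and not on_device
    # Device-to-host copy is only needed if the kernel writes the data
    # (assumption: results are always wanted back on the host).
    copy_out = access in ("write", "read-write")
    return copy_in, copy_out
```

For example, a write-only output skips the upload entirely, and a read-only input is never copied back.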
* Introduced superbuild CMakeLists.txt
* Minor other adjustments to streamline building COMET
* Added automated build for GPU devices
* Added support for FPGA in the new CMakeLists.txt
* Updated README.md
* Updated CI script
Collaborator
Great!
No description provided.