Insights: NVIDIA/cutlass
Overview
2 Pull requests merged by 2 people
- Fix SM90 beta=1 hang and stream-K launch errors
  #2172 merged Mar 13, 2025
- Blockwise Improvement and Programmatic Dependent Launch
  #2161 merged Mar 10, 2025
5 Pull requests opened by 5 people
- Update mma_atom.hpp
  #2159 opened Mar 9, 2025
- fix: the bug of example_simt_canonical
  #2160 opened Mar 10, 2025
- Fix sm100 gemm wrong static constexpr that breaks compilation on Windows
  #2167 opened Mar 13, 2025
- Fix CUTE_DEVICE for cast_smem_ptr_to_unit
  #2171 opened Mar 13, 2025
- [Doc]Fix typo of cute document. "test"-->"text"
  #2174 opened Mar 14, 2025
6 Issues closed by 6 people
- [QST] `wgmma_sm90.cu` tutorial pipelining
  #2173 closed Mar 14, 2025
- [QST] Permute in K mode for consistent LDSM results
  #2140 closed Mar 14, 2025
- [DOC] Possible typos in fundamental_types.md document
  #2091 closed Mar 14, 2025
- [BUG] "Got cutlass error: Error Internal at: " even though compilation is successful
  #2157 closed Mar 13, 2025
- [BUG] Unable to run CUTLASS example 65_distributed_gemm
  #2097 closed Mar 13, 2025
- [QST] Quickstart guide for 5080.
  #2165 closed Mar 12, 2025
8 Issues opened by 7 people
- [BUG] Segmentation fault with high N, C, K values in Conv2dFprop
  #2175 opened Mar 14, 2025
- [BUG] Cute gemm fails to dispatch to batched outer product
  #2170 opened Mar 13, 2025
- [QST]How can I implement W4A16 gemm kernel with cute api?
  #2168 opened Mar 13, 2025
- [QST] Adding new parameter to Conv2dFprop in Python
  #2166 opened Mar 12, 2025
- [QST] Variable size Gemm that can be Cuda graphed
  #2164 opened Mar 12, 2025
- [QST] if(runtime_value) (cute::copy or cute::gemm) Generates NaNs
  #2163 opened Mar 12, 2025
- [BUG] Possible race condition in sm90_gemm_array_tma_warpspecialized_cooperative
  #2162 opened Mar 11, 2025
- [DOC] CUTLASS INT4 GEMM: Missing SM89 Dispatch Configuration for L40S/4090
  #2158 opened Mar 8, 2025
30 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- [QST] Why tma_load.get_slice(0) here always need 0?
  #1929 commented on Mar 8, 2025 • 0 new comments
- [QST] Getting a template error trying to use cutlass's depthwise 2D convolution with pytorch
  #2077 commented on Mar 8, 2025 • 0 new comments
- [QST] in implicit gemm conv, why does not support split-k when group !=1 ?
  #2049 commented on Mar 9, 2025 • 0 new comments
- [QST] bfloat16 x int8 GEMM
  #1936 commented on Mar 11, 2025 • 0 new comments
- [BUG] Unaligned access in test/unit/gemm/threadblock/batched_gemv.cu
  #2003 commented on Mar 11, 2025 • 0 new comments
- [BUG] Tmem tiled copy with non power-of-2 size fails to compile
  #2094 commented on Mar 12, 2025 • 0 new comments
- [QST] How to apply StreamK to hopper warp specialized GEMM
  #2075 commented on Mar 12, 2025 • 0 new comments
- [QST] Adding a flag in Tensor Ref Class
  #2080 commented on Mar 12, 2025 • 0 new comments
- [QST]The Persistent Tile Scheduler in CUTLASS?
  #1685 commented on Mar 12, 2025 • 0 new comments
- [QST] How to define a new custom kernel
  #1930 commented on Mar 12, 2025 • 0 new comments
- [QST] Global variable inside conv2d kernel
  #1987 commented on Mar 12, 2025 • 0 new comments
- [QST] Question about example 69
  #2096 commented on Mar 13, 2025 • 0 new comments
- [QST] Why does GenerateSM90_TensorOp_16b_WGMMA_alignx_gemm not generate C.element = DataType.void?
  #2144 commented on Mar 13, 2025 • 0 new comments
- [QST]From index into a coordinate (or coordniate into a index), it has two different implementations, how should one distinguish and understand the scenarios for their use?
  #2128 commented on Mar 13, 2025 • 0 new comments
- [QST]How Does TMA Work in CUTLASS for Writing from Shared Memory to Global Memory?
  #2008 commented on Mar 13, 2025 • 0 new comments
- [BUG] wmma should be enabled w/ clang.
  #2006 commented on Mar 13, 2025 • 0 new comments
- [QST]Behavior of TMA Store and Wait Mechanism in CUTLASS
  #2002 commented on Mar 13, 2025 • 0 new comments
- [BUG] Funcionality TensorOp 80+ s8 * s8 + s32 => {s32, s8} not working
  #1981 commented on Mar 13, 2025 • 0 new comments
- [QST] Is there a Cutlass GEMM example to read inputs with custom padding?
  #1922 commented on Mar 13, 2025 • 0 new comments
- [BUG] Logic issue in nondeterministic reduction mode of Stream-K tile scheduler.
  #2027 commented on Mar 13, 2025 • 0 new comments
- [QST] Are there plans to add specialisations for Sm90?
  #1123 commented on Mar 14, 2025 • 0 new comments
- [BUG] calling cast_smem_ptr_to_uint(device fn) from make_gmma_desc(host device fn) is not allowed
  #1997 commented on Mar 14, 2025 • 0 new comments
- [QST]Is the Key Difference Between mbarrier and barrier Their Handling of Producer-Consumer Count?
  #1999 commented on Mar 14, 2025 • 0 new comments
- [QST]How to Handle Synchronization with Different Thread Counts for Producer and Consumer in CUTLASS?
  #1998 commented on Mar 14, 2025 • 0 new comments
- [BUG]Is tcgen05.fence supported by Cutlass-3.8.0 ?
  #2098 commented on Mar 14, 2025 • 0 new comments
- [QST] Permutation layout for contiguous stores
  #2127 commented on Mar 14, 2025 • 0 new comments
- [QST] Question about UniversalFMA
  #2101 commented on Mar 14, 2025 • 0 new comments
- [QST] FP8 with row-wise scaling on Ada-Lovelace
  #1937 commented on Mar 14, 2025 • 0 new comments
- Use Compile-Time Constexpr If
  #2035 commented on Mar 9, 2025 • 0 new comments
- Allow cluster sizes across m,n,k to be reported in cutlass profiler
  #2078 commented on Mar 14, 2025 • 0 new comments