Cutlass
CUDA Templates for Linear Algebra Subroutines and Solvers
Block Swizzle provides the mapping logic between a block in the physical memory of matrix C and the thread block identity. Block Swizzle effectively maps blocks in leading-dimension order (column major) to thread blocks in leading-dimension order (blockIdx.x). blockIdx.z is mapped to batch_count for batched GEMM.
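The mapping described above can be sketched in plain host-side C++. This is an illustrative sketch only, not the CUTLASS implementation: the names `Dim3`, `TileOffset`, `ceil_div`, `grid_shape` and `tile_offset` are hypothetical. It assumes a column-major C of size m x n split into tile_m x tile_n tiles, with blockIdx.x walking tiles along the leading (row) dimension, blockIdx.y walking tiles along the columns, and blockIdx.z selecting the batch.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; these are not CUTLASS types.
struct Dim3 { int x, y, z; };                // mirrors CUDA's blockIdx / gridDim layout
struct TileOffset { int row, col, batch; };  // top-left element of a C tile, plus batch index

// Round-up integer division.
inline int ceil_div(int a, int b) {
  assert(a >= 0 && b > 0);
  return (a + b - 1) / b;
}

// Grid shape for a column-major m x n matrix C tiled into tile_m x tile_n blocks.
// Assumption: x covers the leading (row) dimension, y the columns, z the batches.
inline Dim3 grid_shape(int m, int n, int tile_m, int tile_n, int batch_count) {
  return Dim3{ceil_div(m, tile_m), ceil_div(n, tile_n), batch_count};
}

// Identity-style mapping from a thread block index to the tile of C it owns.
inline TileOffset tile_offset(Dim3 block_idx, int tile_m, int tile_n) {
  return TileOffset{block_idx.x * tile_m,  // row offset from blockIdx.x
                    block_idx.y * tile_n,  // column offset from blockIdx.y
                    block_idx.z};          // batch index from blockIdx.z
}
```

With these assumptions, a 300 x 200 problem with 128 x 64 tiles and 4 batches would launch a 3 x 4 x 4 grid, and thread block (2, 1, 3) would own the tile starting at row 256, column 64 of batch 3.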