Cutlass
CUDA Templates for Linear Algebra Subroutines and Solvers
Files | |
file | clear_accumulators.h [code] |
Defines abstractions for efficiently clearing accumulator tiles. | |
file | device_gemm.h [code] |
Device-level GEMM implemented by more than one kernel. | |
file | device_gemm_traits.h [code] |
file | dgemm_traits.h [code] |
Defines structural traits of double-precision GEMM. | |
file | fp16_sgemm_multiply_add.h [code] |
Template implementing matrix multiply-add operations on fragments. | |
file | fp16_sgemm_traits.h [code] |
Defines structural properties of single-precision GEMM where any of the inputs and outputs may be fp16 or fp32. The accumulator type remains fp32. | |
file | gemm.h [code] |
Implements a software-pipelined efficient GEMM. | |
file | gemm_config.h [code] |
Defines properties of the GEMM computation that impose some constraints on the caller. | |
file | gemm_coord.h [code] |
GemmCoord is a structure derived from Coord<4> that specifies a location within the coordinate system of a GEMM problem. | |
file | gemm_desc.h [code] |
Implements a software-pipelined efficient GEMM. | |
file | gemm_epilogue.h [code] |
Implements the epilogue phase of the GEMM kernel that efficiently updates global memory with the computed matrix product. | |
file | gemm_epilogue_traits.h [code] |
Defines structural properties of the GEMM epilogue. | |
file | gemm_global_stream.h [code] |
Implements efficient loading of the thread block-level tile from global memory and storing to shared memory. | |
file | gemm_global_tile.h [code] |
Defines iterators for efficiently loading and storing to global memory. | |
file | gemm_operand.h [code] |
Defines constant expressions for mapping GEMM problem size and strides onto pitch-linear memory. | |
file | gemm_shared_stream.h [code] |
Defines abstractions for managing loading and storing fragments to shared memory in the efficient GEMM pipeline. | |
file | gemm_shared_tile.h [code] |
Defines iterators for efficiently loading and storing tiles to and from shared memory. | |
file | gemm_stream_pair.h [code] |
Defines a pair of GEMM tile streams. | |
file | gemm_traits.h [code] |
Defines structural properties of a complete GEMM computation. | |
file | hgemm_global_tile.h [code] |
Tile traits used to construct global tile iterator for HGEMM. This is intended to partition the thread block-level tile into 2D subtiles loaded by the threads and facilitate memory accesses larger than 16 bits. | |
file | hgemm_multiply_add.h [code] |
Specialization implementing multiply-add operation on half-precision floating point fragments. | |
file | hgemm_swizzle.h [code] |
Transposes a tile of 16-bit elements. Used by HGEMM to construct a K-strided layout in shared memory for multiplicands. | |
file | hgemm_traits.h [code] |
Defines structural properties of half-precision GEMM computation. | |
file | igemm_epilogue.h [code] |
Defines the epilogue phase of the GEMM computation for IGEMM, supporting integer and floating-point output matrix formats. | |
file | igemm_global_tile.h [code] |
Implements tile iterators to partition the thread block tile into 2D subtiles and efficiently load each. Applies a permute transformation to construct an 'interleaved K-strided' data layout in which 4-element dot products from the same K index are arranged in consecutive locations within shared memory. | |
file | igemm_multiply_add.h [code] |
Implements the matrix multiply-accumulate operation on 8-bit integer data using the DP4A instruction. | |
file | igemm_swizzle.h [code] |
Transposes a fragment of data containing packed 8-bit integer elements. | |
file | igemm_traits.h [code] |
Defines structural properties of mixed-precision integer GEMM. Multiplicands are assumed to be packed 8-bit integers, accumulators are assumed to be 32-bit signed integers, and output formats vary. | |
file | linear_scaling.h [code] |
Implements the BLAS linear scaling function alpha*AB + beta*C. | |
file | linear_scaling_device_ptr.h [code] |
Implements the BLAS linear scaling function alpha*AB + beta*C. | |
file | scalar_or_pointer.h [code] |
Implements the BLAS linear scaling function alpha*AB + beta*C. | |
file | sgemm_traits.h [code] |
Defines structural properties of single-precision GEMM. | |
file | thread_multiply_add.h [code] |
Template implementing matrix multiply-add operations on fragments. | |
file | gemm/threadblock_swizzle.h [code] |
Defines functors for mapping blockIdx to partitions of the GEMM computation. | |
file | wmma_gemm_epilogue_traits.h [code] |
Defines structural properties of WMMA GEMM's epilogue phase. | |
file | wmma_gemm_global_tile.h [code] |
Defines tile iterator traits for loading a thread block-level tile from global memory. | |
file | wmma_gemm_multiply_add.h [code] |
Implements the warp-level matrix multiply-accumulate operation using the CUDA WMMA API. | |
file | wmma_gemm_shared_tile.h [code] |
Defines iterator traits for efficiently loading and storing fragments to and from shared memory, specialized for WMMA GEMM. | |
file | wmma_gemm_traits.h [code] |
Defines structural properties of GEMM targeting the WMMA API in CUDA. | |
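
The linear_scaling.h entry in the listing describes the BLAS-style update alpha*AB + beta*C. As a rough illustration of that arithmetic only, the following host-side C++ sketch applies it element-wise to an already computed product. The name `LinearScalingSketch` and its members are hypothetical and are not taken from the CUTLASS headers; the real functor operates on fragments inside the GEMM epilogue.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration; this is NOT the CUTLASS LinearScaling class.
template <typename Scalar>
struct LinearScalingSketch {
  Scalar alpha;  // scales the accumulated product A*B
  Scalar beta;   // scales the source matrix C

  // Returns alpha * accum[i] + beta * source[i] for every element.
  // `accum` holds the already computed product A*B; `source` holds C.
  std::vector<Scalar> operator()(const std::vector<Scalar>& accum,
                                 const std::vector<Scalar>& source) const {
    assert(accum.size() == source.size());
    std::vector<Scalar> out(accum.size());
    for (std::size_t i = 0; i < accum.size(); ++i) {
      out[i] = alpha * accum[i] + beta * source[i];
    }
    return out;
  }
};
```

With alpha = 2 and beta = 0.5, an accumulator value of 1 and a source value of 4 would combine to 2*1 + 0.5*4 = 4.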
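
The igemm_multiply_add.h entry in the listing mentions the DP4A instruction, which computes a dot product of four packed 8-bit integers and adds it to a 32-bit accumulator. The sketch below emulates the signed variant on the host in plain C++ so the arithmetic can be inspected without a GPU. The helper names `pack4` and `dp4a_emulated` are assumptions made for this example; device code would normally use the CUDA `__dp4a` intrinsic instead.

```cpp
#include <cstdint>

// Packs four signed 8-bit values into one 32-bit word, x0 in the lowest byte.
// Hypothetical helper for illustration only.
inline std::uint32_t pack4(std::int8_t x0, std::int8_t x1,
                           std::int8_t x2, std::int8_t x3) {
  return static_cast<std::uint32_t>(static_cast<std::uint8_t>(x0)) |
         (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x1)) << 8) |
         (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x2)) << 16) |
         (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x3)) << 24);
}

// Host-side emulation of a signed 4-way 8-bit dot product with a
// 32-bit accumulate: returns c + sum over i of a[i] * b[i].
inline std::int32_t dp4a_emulated(std::uint32_t a, std::uint32_t b,
                                  std::int32_t c) {
  for (int i = 0; i < 4; ++i) {
    // Extract byte i and reinterpret it as a signed 8-bit value
    // (two's complement wrap, as on all mainstream compilers).
    std::int8_t ai = static_cast<std::int8_t>(
        static_cast<std::uint8_t>((a >> (8 * i)) & 0xFFu));
    std::int8_t bi = static_cast<std::int8_t>(
        static_cast<std::uint8_t>((b >> (8 * i)) & 0xFFu));
    c += static_cast<std::int32_t>(ai) * static_cast<std::int32_t>(bi);
  }
  return c;
}
```

For example, the packed vectors (1, -2, 3, -4) and (5, 6, -7, 8) have the dot product 5 - 12 - 21 - 32 = -60, so an initial accumulator of 10 would yield -50.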
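
The gemm/threadblock_swizzle.h entry in the listing refers to functors that map blockIdx to partitions of the GEMM computation. The simplest such mapping is an identity mapping, in which block (x, y) owns the output tile starting at (x * tile_m, y * tile_n). The following host-side C++ sketch shows that case together with the ceiling division used to size the grid. `Dim2`, `grid_for_problem`, and `tile_origin` are hypothetical names chosen for this example and do not reflect the functors actually defined in that header.

```cpp
#include <cassert>

// Hypothetical 2D index type standing in for CUDA's dim3 on the host.
struct Dim2 {
  int x;
  int y;
};

// Number of thread blocks needed to cover an m-by-n output matrix with
// tile_m-by-tile_n tiles (ceiling division in each dimension).
inline Dim2 grid_for_problem(int m, int n, int tile_m, int tile_n) {
  assert(tile_m > 0 && tile_n > 0);
  return Dim2{(m + tile_m - 1) / tile_m, (n + tile_n - 1) / tile_n};
}

// Identity mapping: block (x, y) owns the tile whose top-left element
// is at row x * tile_m and column y * tile_n.
inline Dim2 tile_origin(Dim2 block_idx, int tile_m, int tile_n) {
  return Dim2{block_idx.x * tile_m, block_idx.y * tile_n};
}
```

Other mappings can reorder which block handles which tile, for instance to change the memory access pattern across the grid, while keeping the same coverage of the output.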