Track: long paper (up to 4 pages)
Keywords: sparsity, compression, llm, pruning, N:M, efficiency
TL;DR: We propose CS256, a novel sparse format for LLM compression that matches the performance of unstructured sparsity while being more hardware-friendly.
Abstract: Storing the weights of large language models (LLMs) in GPU memory for local inference is challenging due to their size. While quantization has proven successful in reducing the memory footprint of LLMs, unstructured pruning introduces overhead by requiring the locations of the non-pruned weights to be encoded. This overhead hinders the efficient combination of quantization and unstructured pruning, especially at the small batch sizes common in inference scenarios. To address this, we propose the CS256 storage format, which offers a better balance between space efficiency and hardware acceleration than existing formats. CS256 partitions the weight matrix into tiles and uses a hierarchical indexing scheme to locate non-zero values, reducing the overhead of storing the sparsity pattern. Our preliminary results with one-shot pruning of LLMs show that CS256 matches the performance of unstructured sparsity while being more hardware-friendly.
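Illustrative sketch (not part of the submission): the abstract only outlines the layout, so the Python snippet below shows a generic tile-based sparse encoding in the spirit described (tiles plus a two-level index of per-tile offsets and within-tile positions). The tile size of 256, the index structure, and the names encode_cs256/decode_cs256 are assumptions made for exposition, not the format specified by the authors.

# Toy tile-wise sparse layout; all layout details here are assumptions, not the paper's CS256 spec.
import numpy as np

TILE = 256  # assumed tile width, suggested only by the name "CS256"

def encode_cs256(weights: np.ndarray):
    """Pack a 2-D weight matrix into a toy tile-wise sparse layout.

    Each row is split into tiles of TILE elements. For every tile we store:
      * a per-tile offset into the value array (coarse level of the index),
      * the within-tile positions of non-zeros as uint8 (fine level),
      * the non-zero values themselves.
    """
    rows, cols = weights.shape
    assert cols % TILE == 0, "for simplicity, require cols to be a multiple of TILE"

    values, positions, tile_offsets = [], [], [0]
    for r in range(rows):
        for t in range(cols // TILE):
            tile = weights[r, t * TILE:(t + 1) * TILE]
            nz = np.flatnonzero(tile)
            values.append(tile[nz])
            positions.append(nz.astype(np.uint8))  # local index 0..255 fits in one byte
            tile_offsets.append(tile_offsets[-1] + nz.size)

    return (np.concatenate(values),
            np.concatenate(positions),
            np.asarray(tile_offsets, dtype=np.int64))

def decode_cs256(values, positions, tile_offsets, shape):
    """Reconstruct the dense matrix from the toy layout (round-trip check only)."""
    rows, cols = shape
    dense = np.zeros(shape, dtype=values.dtype)
    tiles_per_row = cols // TILE
    for r in range(rows):
        for t in range(tiles_per_row):
            k = r * tiles_per_row + t
            start, end = tile_offsets[k], tile_offsets[k + 1]
            cols_in_tile = positions[start:end].astype(np.int64) + t * TILE
            dense[r, cols_in_tile] = values[start:end]
    return dense

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 1024)).astype(np.float32)
    w[rng.random(w.shape) < 0.7] = 0.0  # ~70% unstructured sparsity
    enc = encode_cs256(w)
    assert np.array_equal(decode_cs256(*enc, w.shape), w)

In a layout like this, the per-element position index costs one byte per non-zero (versus a full column index for generic unstructured formats), which is the kind of space/decoding trade-off the abstract attributes to tiling with hierarchical indexing.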
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Mike_Lasby1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 70