Compressed sparse tiles for memory-efficient unstructured and semi-structured sparsity

Published: 05 Mar 2025, Last Modified: 22 Apr 2025
Venue: SLLM
License: CC BY 4.0
Track: long paper (up to 4 pages)
Keywords: sparsity, compression, llm, pruning, N:M, efficiency
TL;DR: We propose CS256, a novel sparse format for LLM compression that matches the performance of unstructured sparsity while being more hardware-friendly.
Abstract: Storing the weights of Large Language Models (LLMs) in GPU memory for local inference is challenging due to their size. While quantization has proven successful in reducing the memory footprint of LLMs, unstructured pruning introduces overhead by requiring the locations of the non-pruned weights to be encoded. This overhead hinders the efficient combination of quantization and unstructured pruning, especially for the smaller batch sizes common in inference scenarios. To address this, we propose the CS256 storage format, which offers a better balance between space efficiency and hardware acceleration than existing formats. CS256 partitions the weight matrix into tiles and uses a hierarchical indexing scheme to locate non-zero values, reducing the overhead associated with storing sparsity patterns. Our preliminary results with one-shot pruning of LLMs show that CS256 matches the performance of unstructured sparsity while being more hardware-friendly. Our code is available at: https://github.com/mklasby/llm-compressor/tree/mklasby-cs256
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 70
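
The abstract describes CS256 only at a high level (fixed-size tiles plus a hierarchical index over non-zero locations), so the following is a hypothetical sketch rather than the authors' specification: the tile size of 256, the per-tile occupancy bitmap, the TiledSparseMatrix/compress/decompress names, and the use of NumPy are all assumptions made for illustration.

```python
# Hypothetical sketch of a tile-based compressed sparse format in the spirit of
# the CS256 description above. Layout details (tile size, bitmap index, field
# names) are assumptions, not the authors' specification.
import numpy as np
from dataclasses import dataclass


@dataclass
class TiledSparseMatrix:
    shape: tuple            # original (rows, cols); cols are padded to a tile multiple
    tile_size: int          # elements per tile (assumed 256, hence "CS256")
    tile_nnz: np.ndarray    # level-1 index: number of non-zeros per tile
    bitmaps: np.ndarray     # level-2 index: per-tile occupancy bitmap, packed bits
    values: np.ndarray      # non-zero values, concatenated tile by tile


def compress(weight: np.ndarray, tile_size: int = 256) -> TiledSparseMatrix:
    """Partition each row into contiguous tiles and record the non-zeros per tile."""
    rows, cols = weight.shape
    pad = (-cols) % tile_size
    padded = np.pad(weight, ((0, 0), (0, pad)))
    tiles = padded.reshape(rows, -1, tile_size)      # (rows, tiles_per_row, tile_size)
    mask = tiles != 0
    return TiledSparseMatrix(
        shape=(rows, cols),
        tile_size=tile_size,
        tile_nnz=mask.sum(axis=-1).astype(np.uint16).ravel(),
        bitmaps=np.packbits(mask.reshape(-1, tile_size), axis=-1),
        values=tiles[mask],
    )


def decompress(m: TiledSparseMatrix) -> np.ndarray:
    """Reconstruct the dense matrix from the tile bitmaps and packed values."""
    rows, cols = m.shape
    n_tiles = m.tile_nnz.size
    mask = np.unpackbits(m.bitmaps, axis=-1)[:, : m.tile_size].astype(bool)
    dense = np.zeros((n_tiles, m.tile_size), dtype=m.values.dtype)
    dense[mask] = m.values                           # scatter values back per tile
    return dense.reshape(rows, -1)[:, :cols]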
