Keywords: LLM Compression, Pruning, Wanda, SparseGPT, Deep Learning, AI
TL;DR: We present Thanos, a novel pruning method for LLMs that prunes weight matrices in a block-wise manner.
Abstract: This paper presents Thanos, a novel weight-pruning algorithm designed to reduce the memory footprint and enhance the computational efficiency of large language models (LLMs) by removing redundant weights while maintaining accuracy. Thanos introduces a block-wise pruning strategy with adaptive masks that dynamically adjust to weight importance, enabling flexible sparsity patterns and structured formats, such as n:m sparsity, optimized for hardware acceleration. Experimental evaluations demonstrate that Thanos achieves state-of-the-art performance in structured pruning and outperforms existing methods in unstructured pruning. By providing an efficient and adaptable approach to model compression, Thanos offers a practical solution for deploying large models in resource-constrained environments. The algorithm is publicly available for further research and application.
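To make the structured-sparsity format mentioned in the abstract concrete, the sketch below shows plain magnitude-based n:m pruning (e.g. 2:4, as supported by NVIDIA sparse tensor cores): within every group of m consecutive weights in a row, only the n largest-magnitude entries are kept. This is an illustration of the n:m output format only, not the Thanos algorithm itself, which selects weights block-wise using adaptive, importance-based masks; the function name and group layout are assumptions for this sketch.

```python
import numpy as np

def prune_n_m(W, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m
    consecutive entries along each row; zero out the rest.

    Illustrative magnitude criterion only -- Thanos instead uses
    block-wise adaptive masks based on weight importance."""
    W = np.asarray(W, dtype=float)
    rows, cols = W.shape
    assert cols % m == 0, "number of columns must be divisible by m"
    groups = W.reshape(rows, cols // m, m)            # (rows, n_groups, m)
    # indices of the (m - n) smallest-magnitude weights in each group
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=-1)     # zero the dropped weights
    return pruned.reshape(rows, cols)

W = np.array([[0.1, -2.0, 0.3, 4.0],
              [1.5, -0.2, 0.05, -3.0]])
print(prune_n_m(W))
# each row keeps its 2 largest-magnitude entries per group of 4
```

Because every group of 4 contains exactly 2 nonzeros, the pruned matrix can be stored compactly and executed efficiently on hardware with native 2:4 sparsity support.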
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 25458