Abstract: Strong Lottery Tickets (SLTs) are subnetworks within a randomly weighted network, uncovered by a binary mask called a supermask. They offer a promising approach to model compression because the weights themselves need not be stored: the effective subnetwork can be regenerated from a fixed random seed and the supermask. However, extending this approach to large language models (LLMs) is non-trivial due to the limited scalability and inefficient training dynamics of existing SLT methods. To address these challenges, we propose Adaptive Supermask (Ada-Sup), a scalable and efficient method for discovering high-quality multi-bit supermasks through a novel quantization-based approach. Building on this method, we introduce TicketLLM, a low-bit, sparse Transformer-based LLM architecture powered by Ada-Sup. Experimental results show that Ada-Sup can discover high-quality supermasks at significantly reduced training cost compared to previous methods, in both binary and multi-bit settings. Furthermore, at the same memory per connection, a 1.3B-parameter TicketLLM outperforms BitNet b1.58, achieving 0.08 lower perplexity while operating at a higher sparsity level (50% vs. 33%). These results highlight the potential of supermask-based methods for building lightweight LLMs. Code will be made available upon acceptance.
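To make the storage argument in the abstract concrete, here is a minimal sketch (not the paper's Ada-Sup implementation) of how an SLT's effective subnetwork can be regenerated from just a seed and a binary supermask; the function name, shapes, and use of NumPy are illustrative assumptions.

```python
# Minimal sketch of supermask-based weight regeneration (illustrative only,
# not the paper's method): the dense random weights are never stored, because
# a fixed seed reproduces them exactly, and the binary supermask then selects
# the effective subnetwork.
import numpy as np

def regenerate_effective_weights(seed: int, mask: np.ndarray) -> np.ndarray:
    """Recreate the effective (sparse) weights from a seed and a supermask."""
    rng = np.random.default_rng(seed)                 # fixed seed -> identical random weights
    random_weights = rng.standard_normal(mask.shape)  # regenerated on the fly, never stored
    return random_weights * mask                      # binary mask keeps only the subnetwork

# Only the seed and the (highly compressible) binary mask need to be stored.
seed = 42
mask = (np.random.default_rng(0).random((4, 4)) > 0.5).astype(np.float32)
w_eff = regenerate_effective_weights(seed, mask)
assert np.array_equal(w_eff, regenerate_effective_weights(seed, mask))  # deterministic
```

Under this view, a multi-bit supermask (as in Ada-Sup) would generalize the 0/1 mask entries to a small set of quantized levels, trading a few extra bits per connection for higher-quality subnetworks.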
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Brian_Kingsbury1
Submission Number: 5320