Unlocking the Potential of Extremely Low-Bit Sparse Transformers through Adaptive Multi-bit Supermasks and Random Weights
Keywords: Strong Lottery Ticket Hypothesis, Lottery Ticket Hypothesis, Large Language Models, Efficient Neural Networks, Pruning, Quantization
TL;DR: We present the first exploration of the Strong Lottery Ticket Hypothesis (SLTH) in Transformer-based LLMs, unlocking the potential of SLT for low-bit sparse Transformers.
Abstract: We propose Adaptive Supermask (Ada-Sup), a scalable and efficient method for discovering high-quality multi-bit supermasks in an extended Strong Lottery Ticket framework. Building on this method, we introduce TicketLLM, a Transformer-based model that combines pruning, quantization, and random weights to enable compact low-bit sparse representations. Experimental results show that Ada-Sup finds high-quality supermasks at significantly reduced training cost compared to previous methods, in both binary and multi-bit supermask settings. Furthermore, TicketLLM outperforms BitNet b1.58 on a 1.3B-parameter model with the same memory per connection, achieving 0.08 lower perplexity despite operating at a higher sparsity level (50\% vs. 33\%).
These results demonstrate the potential of leveraging supermasks and random weights as a practical and powerful alternative for building lightweight, scalable LLMs.
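The following is not part of the submission, only a minimal sketch of the general idea the abstract refers to: a frozen random weight matrix combined with a binary supermask (the classic Strong Lottery Ticket setting) or with a multi-bit supermask that also quantizes each kept connection. The codebook values, shapes, and sparsity here are chosen purely for illustration; the actual Ada-Sup procedure for learning such masks is not described in this abstract.

```python
import torch

torch.manual_seed(0)

# Frozen random weights: never trained, only masked (SLT setting).
W = torch.randn(256, 256)

# Binary supermask: keep a subset of the random weights as-is.
binary_mask = (torch.rand_like(W) > 0.5).float()
W_binary = W * binary_mask

# Illustrative multi-bit supermask: each connection selects one of a few
# quantized scale levels; level 0.0 prunes the connection, so pruning and
# quantization are expressed by a single per-connection code.
levels = torch.tensor([0.0, 0.5, 1.0, 2.0])   # hypothetical 2-bit codebook
codes = torch.randint(len(levels), W.shape)   # per-connection level index
W_multibit = W * levels[codes]

print(W_binary.count_nonzero().item(), W_multibit.count_nonzero().item())
```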
Submission Number: 24