Keywords: attention, pretraining, finetuning, classification
Abstract: State-of-the-art Extreme Multi-Label Text Classification models rely on multi-label attention to focus on key tokens in input text, but learning good attention weights is challenging.
We introduce PLANT — Pretrained and Leveraged Attention — a plug-and-play strategy for initializing attention.
PLANT works by planting label-specific attention using a pretrained Learning-to-Rank model guided by mutual information gain.
This architecture-agnostic approach integrates seamlessly with large language model backbones (e.g., Mistral, LLaMA, DeepSeek, and Phi-3).
PLANT outperforms state-of-the-art methods across tasks such as ICD coding, legal topic classification, and content recommendation.
Gains are especially pronounced in few-shot settings, with substantial improvements on rare labels. Ablation studies confirm that attention initialization is a key driver of these gains.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 15590
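The abstract describes planting label-specific attention from a pretrained ranking signal ordered by mutual information gain. Below is a minimal, hedged sketch of that general idea, not the authors' implementation: the function name `init_label_attention`, the matrix shapes, and the use of scikit-learn's `mutual_info_classif` as a stand-in for the paper's pretrained Learning-to-Rank model are all assumptions made for illustration.

```python
# Illustrative sketch only: approximates the pretrained Learning-to-Rank
# ranking with a mutual-information ranking over token counts.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def init_label_attention(token_counts, labels, token_embeddings, top_k=20):
    """Build one attention-seed vector per label.

    token_counts:     (num_docs, vocab_size) bag-of-tokens matrix
    labels:           (num_docs, num_labels) binary label matrix
    token_embeddings: (vocab_size, hidden_dim) embeddings from a pretrained backbone
    Returns:          (num_labels, hidden_dim) matrix used to initialize
                      label-specific attention parameters.
    """
    num_labels = labels.shape[1]
    hidden_dim = token_embeddings.shape[1]
    attention_init = np.zeros((num_labels, hidden_dim))
    for l in range(num_labels):
        # Rank vocabulary tokens by mutual information gain with label l.
        mi = mutual_info_classif(token_counts, labels[:, l], discrete_features=True)
        top_tokens = np.argsort(mi)[-top_k:]
        # "Plant" the label's attention as a weighted average of the
        # embeddings of its most informative tokens.
        weights = mi[top_tokens] / (mi[top_tokens].sum() + 1e-12)
        attention_init[l] = weights @ token_embeddings[top_tokens]
    return attention_init

# Toy usage with random data; in practice the embeddings would come from an
# LLM backbone (e.g., Mistral or LLaMA) and the ranking from the pretrained
# Learning-to-Rank model described in the abstract.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(100, 500))   # token counts
    Y = rng.integers(0, 2, size=(100, 10))    # binary labels
    E = rng.normal(size=(500, 64))            # token embeddings
    W = init_label_attention(X, Y, E, top_k=10)
    print(W.shape)  # (10, 64)
```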