Keywords: Explanation-Guided Learning, Text Classification, Large Language Models
Abstract: Small language models (SLMs) are widely used in tasks that require low latency and lightweight deployment, especially text classification. With the growing emphasis on interpretability and robustness, explanation-guided learning offers an effective framework that incorporates attribution-based supervision during training. However, how to derive general and reliable attribution priors remains an open challenge.
Upon analyzing representative attribution methods on classification tasks, we find that while these methods reliably highlight class-relevant tokens, they tend to focus on common keywords shared by semantically similar classes. Since such classes are already prone to confusion under standard training, the attributions fail to provide sufficiently discriminative cues, limiting their ability to help the model tell these classes apart. To address this challenge, we introduce Class-Aware Attribution Prior (CAP), a novel attribution prior extraction framework designed to guide language models in capturing fine-grained class distinctions, thereby producing more salient and discriminative attribution priors. Building on this, we propose CAP$_{Hybrid}$, which integrates priors from CAP with those from existing attribution techniques to form a more comprehensive and balanced supervisory signal. By aligning the model’s self-attribution with these enriched priors, our approach encourages the model to capture diverse decision-relevant features. Extensive experiments across full-data, few-shot, and adversarial settings demonstrate that our method consistently enhances both interpretability and robustness.
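The submission does not include an implementation here, but the alignment step the abstract describes can be illustrated with a minimal sketch: a standard classification loss plus a regularizer that pulls the model's per-token self-attribution toward a precomputed attribution prior. All names (`explanation_guided_loss`, `self_attr`, `prior_attr`, `lam`) and the choice of KL divergence as the alignment measure are illustrative assumptions, not the paper's actual formulation of CAP or CAP$_{Hybrid}$.

```python
import torch
import torch.nn.functional as F

def explanation_guided_loss(logits, labels, self_attr, prior_attr, lam=0.1):
    """Task loss plus an attribution-alignment regularizer (hypothetical).

    logits:     (B, C) class scores from the classifier.
    labels:     (B,)   gold class indices.
    self_attr:  (B, T) model self-attribution score per token
                (e.g., gradient-x-input saliency; an assumption here).
    prior_attr: (B, T) precomputed attribution prior per token.
    lam:        weight of the alignment term (assumed default).
    """
    ce = F.cross_entropy(logits, labels)
    # Normalize both attribution maps over the token dimension so the
    # alignment term compares relative importance, not raw magnitudes.
    log_p = F.log_softmax(self_attr, dim=-1)
    q = F.softmax(prior_attr, dim=-1)
    # KL(q || p) pulls the model's self-attribution toward the prior.
    align = F.kl_div(log_p, q, reduction="batchmean")
    return ce + lam * align

# Toy usage with random tensors (batch of 4, 10 classes, 32 tokens).
logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
self_attr = torch.randn(4, 32)
prior_attr = torch.randn(4, 32)
loss = explanation_guided_loss(logits, labels, self_attr, prior_attr)
```

In practice, `self_attr` would need to come from a differentiable attribution method (such as gradient × input) so that the alignment term can be backpropagated through the model during training.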
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15692