SHARP: Structured Hierarchical Attention Rank Projection for Efficient Language Model Distillation

18 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Model, Optimization, Knowledge Distillation, Multi-granularity Learning, Compression
TL;DR: SHARP addresses gradient interference in multi-granularity knowledge distillation by projecting token/head/layer-level attention patterns into orthogonal rank spaces, achieving 3-7% performance improvement over existing methods.
Abstract: Knowledge distillation has emerged as a crucial technique for compressing large language models into more deployable versions. While existing approaches focus on transferring knowledge at different length-based linguistic granularities (e.g., tokens, phrases, sequences), they often fail to capture the intrinsic hierarchical attention mechanisms that modern language models utilize. We propose SHARP (Structured Hierarchical Attention Rank Projection), a novel distillation framework that effectively transfers knowledge across different architectural granularities of transformer models. Our approach introduces an orthogonal rank space projection mechanism that decomposes attention patterns into token-level, head-level, and layer-level representations, enabling parallel optimization pathways across granularities while preventing gradient interference between complementary features. Through extensive experiments on both natural language generation (NLG) and understanding (NLU) tasks, with teacher models ranging from 350M to 6.7B parameters distilled into a 125M-parameter student, we demonstrate that SHARP consistently outperforms existing distillation methods, achieving an average 5.2\% improvement in perplexity across NLG tasks, with gains reaching 7.2\% for the largest teacher model (6.7B). The method shows particularly strong performance on NLG tasks, with consistent improvements across all model scales.
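The core idea in the abstract, projecting token-, head-, and layer-level attention representations into mutually orthogonal rank spaces so their optimization pathways cannot interfere, can be illustrated with a minimal sketch. Everything below (dimensions, per-granularity ranks, the use of a QR-derived shared basis) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

# Sketch of orthogonal rank-space projection (assumed construction):
# split the columns of one orthonormal basis into three disjoint groups,
# one per granularity, so the subspaces are orthogonal by design.
rng = np.random.default_rng(0)
d = 24                                        # feature dimension (assumed)
ranks = {"token": 4, "head": 4, "layer": 4}   # per-granularity ranks (assumed)

Q, _ = np.linalg.qr(rng.standard_normal((d, sum(ranks.values()))))
bases, start = {}, 0
for name, r in ranks.items():
    bases[name] = Q[:, start:start + r]       # orthonormal columns
    start += r

def project(x, B):
    """Orthogonal projection of x onto the column span of B."""
    return B @ (B.T @ x)

x = rng.standard_normal(d)                    # e.g., a flattened attention statistic
parts = {name: project(x, B) for name, B in bases.items()}

# Components from different granularities are mutually orthogonal, so a
# gradient step along one subspace leaves the others untouched -- the
# "no gradient interference" property the abstract describes.
for a in parts:
    for b in parts:
        if a != b:
            assert abs(parts[a] @ parts[b]) < 1e-8
```

The disjoint-column construction guarantees orthogonality without any extra regularization; a learned variant would instead have to penalize overlap between the subspaces.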
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 10790