Beyond Proxy Metrics: A New Evaluation Framework for LLM Compression by Directly Measuring Generative Faithfulness
Keywords: Efficient LLM, Model Compression, Benchmark
Abstract: Current evaluation methods for Large Language Model (LLM) compression rely on proxy metrics such as perplexity and curated benchmarks, which often correlate poorly with real-world generative performance. This discrepancy creates a significant gap between reported scores and practical utility. To address it, we introduce an evaluation framework that dispenses with such proxies and instead directly measures a compressed model's generative faithfulness to its uncompressed counterpart on real-world user queries. The core of our framework is Conditional Generation Accuracy (CGA), a novel metric that employs a teacher-forcing paradigm to assess whether the compressed model replicates the original model's next-token prediction at each step, conditioned on the ground-truth prefix. Because every step is conditioned on the same prefix, this approach avoids the cascading errors that confound traditional text-similarity measures. We apply this framework to a comprehensive evaluation of nine mainstream compression methods across models from 7B to 32B parameters and context lengths up to 24K tokens. Our results establish a clear performance hierarchy and reveal distinct scaling laws with respect to model size and context length: for instance, while most methods improve with model size, quantization and KV cache dropping degrade as contexts grow longer, whereas a sparse attention baseline uniquely improves. Our work provides a more rigorous and reliable foundation for benchmarking LLM compression. To promote transparent and reproducible progress, we have open-sourced our benchmark code at https://anonymous.4open.science/r/llm-fidbench/README.md and will launch a public leaderboard.
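The CGA idea described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation: the vocabulary size, sequence length, the Gaussian perturbation standing in for compression error, and the argmax-agreement scoring are all assumptions for demonstration. The key point it shows is that both models are scored against the same fixed prefix at every position, so a mismatch at one step cannot cascade into later steps.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len = 50, 12

# Toy stand-ins for per-position next-token logits under teacher forcing:
# at every step both models condition on the same ground-truth prefix,
# so each position is scored independently and errors cannot compound.
ref_logits = rng.normal(size=(seq_len, vocab))        # uncompressed model
comp_logits = ref_logits + 0.5 * rng.normal(size=(seq_len, vocab))  # compressed model (reference + noise, an assumption)

ref_pred = ref_logits.argmax(axis=-1)    # original model's next-token choices
comp_pred = comp_logits.argmax(axis=-1)  # compressed model's next-token choices

# CGA here: fraction of positions where the compressed model reproduces
# the original model's prediction.
cga = float((ref_pred == comp_pred).mean())
print(f"CGA = {cga:.2f}")
```

Contrast this with free-running generation, where a single divergent token changes the prefix for every subsequent step and text-similarity scores then conflate that one error with all of its downstream consequences.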
Primary Area: datasets and benchmarks
Submission Number: 7657