Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We present Agent Compression Benchmark to evaluate the Agentic Performance of Compressed LLMs.
Abstract: Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks focus narrowly on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring agentic capabilities: workflow generation, tool use/function calling, long-context understanding, and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) 4-bit quantization (GPTQ, AWQ) and 50% pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5-7B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%--3% drop) but degrades real-world application accuracy by 10%--15%. We introduce ERank, Top-k Ranking Correlation, and Energy to systematize analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios, bridging the gap between algorithmic efficiency and real-world applicability.
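The abstract names Top-k Ranking Correlation as one of the analysis tools. The paper's exact formulation is not given here, so the following is only a minimal sketch of one plausible reading: rank agreement (Kendall's tau) between how a reference model and its compressed counterpart order the reference model's top-k tokens. The function and argument names are illustrative, not the paper's API.

```python
import numpy as np

def topk_rank_correlation(logits_ref: np.ndarray, logits_cmp: np.ndarray, k: int = 10) -> float:
    """Kendall-tau agreement between the reference and compressed model
    over the reference model's top-k tokens (hypothetical sketch)."""
    topk = np.argsort(logits_ref)[::-1][:k]   # top-k token ids of the reference model
    r_ref = logits_ref[topk]
    r_cmp = logits_cmp[topk]
    concordant = discordant = 0
    for i in range(k):
        for j in range(i + 1, k):
            # Pairs ordered the same way in both models are concordant.
            s = (r_ref[i] - r_ref[j]) * (r_cmp[i] - r_cmp[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (k * (k - 1) / 2)

# Identical logits give perfect agreement (+1); a fully reversed
# ordering gives perfect disagreement (-1).
logits = np.arange(10, dtype=float)
print(topk_rank_correlation(logits, logits.copy(), k=5))   # → 1.0
print(topk_rank_correlation(logits, -logits, k=5))         # → -1.0
```

A value near 1 would indicate the compressed model preserves the original token ranking; values near 0 or below would flag ranking drift even when top-1 accuracy looks unchanged.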
Lay Summary: Large language models (LLMs) like ChatGPT and Gemini are incredibly powerful but require massive computing resources, making them expensive and slow to run. To make them more efficient, researchers use **compression techniques**—methods that shrink these models while trying to preserve their intelligence. However, most existing benchmarks only test whether compressed models can still answer questions or write text well. But what if we want these models to **act as autonomous agents**—planning tasks, using tools, understanding long conversations, or solving real-world problems?

This paper introduces **ACBench**, the first benchmark designed to measure how well compressed LLMs perform in **agent-like scenarios**. The study evaluates:

- **Workflow planning** (breaking complex tasks into steps),
- **Tool use** (calling APIs or external software),
- **Long-context understanding** (remembering and retrieving information from long documents), and
- **Real-world applications** (handling tasks in robotics, gaming, or finance).

The results show that **4-bit quantization** (a compression method) works well for planning and tool use, with only a small performance drop (1–3%). However, it struggles more on real-world tasks, where accuracy can drop by **10–15%**.

The paper also introduces new ways to analyze compression effects:

- **ERank** (measuring how much the model's internal structure changes),
- **Top-k ranking correlation** (checking whether compressed models make predictions similar to the original model's), and
- **Energy-based analysis** (evaluating confidence levels in responses).

Key findings:

- ✅ **Quantization (e.g., GPTQ, AWQ) works better than pruning** for maintaining agent-like abilities.
- ❌ **Distilled models** (from the DeepSeek-R1 series) often perform worse on agent tasks, despite being good at reasoning.
- 📉 **Long-context understanding degrades** when models are heavily compressed, especially beyond 32K tokens.
This research helps developers choose the best compression methods for AI agents, balancing efficiency with performance. It also highlights that **not all compression techniques are equal**—some preserve reasoning and planning abilities better than others. For more details, check out the full paper and code at: [https://github.com/pprp/ACBench](https://github.com/pprp/ACBench).
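The lay summary describes ERank as measuring how much a model's internal structure changes under compression. The paper's exact definition is not reproduced on this page; a common choice for such a measure is the *effective rank* of a representation matrix (the entropy of its normalized singular values, per Roy & Vetterli, 2007), sketched below under that assumption.

```python
import numpy as np

def effective_rank(X: np.ndarray) -> float:
    """Effective rank of a matrix: exp of the Shannon entropy of its
    normalized singular-value distribution (one plausible ERank proxy)."""
    s = np.linalg.svd(X, compute_uv=False)
    p = s / s.sum()                    # normalize singular values to a distribution
    p = p[p > 0]                       # drop zeros to avoid log(0)
    entropy = -(p * np.log(p)).sum()   # Shannon entropy of the distribution
    return float(np.exp(entropy))

# The 4x4 identity has four equal singular values, so its
# effective rank is ≈ 4; a rank-1 matrix scores ≈ 1.
print(effective_rank(np.eye(4)))
print(effective_rank(np.outer(np.arange(1.0, 5.0), np.arange(1.0, 4.0))))
```

Comparing this quantity on hidden states before and after quantization or pruning would quantify how much compression flattens the representation spectrum, independent of any downstream task score.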
Link To Code: https://github.com/pprp/ACBench
Primary Area: Deep Learning->Large Language Models
Keywords: Agent, LLM, Quantization, Pruning
Submission Number: 6400