Maximizing LLM Efficiency Through Optimization Strategies

Published: 04 Jul 2025, Last Modified: 22 Jul 2025
Venue: KDD 2025 Workshop on Inference Optimization for GenAI (Poster)
License: CC BY 4.0
Keywords: Inference optimization, Model Pruning, Compression
TL;DR: This paper evaluates inference optimization techniques for large language models, demonstrating that strategic combinations of methods can maintain performance with dramatically reduced computational costs.
Abstract: As Large Language Models (LLMs) scale in size, their capabilities dramatically improve, but this scaling simultaneously introduces substantial computational barriers to efficient inference. While various optimization methods exist, including model pruning, knowledge distillation, and quantization, their effectiveness and interaction effects remain insufficiently characterized across deployment scenarios. In this work, we perform comprehensive comparisons between inference optimization techniques for LLMs, systematically evaluating their impact on model performance and computational efficiency. Our experiments with Llama3 and Qwen models reveal that knowledge distillation effectively mitigates performance degradation from pruning, while caching and hardware acceleration provide complementary benefits. Most significantly, we find that optimally combining these approaches enables smaller models to achieve performance comparable to models 4× larger while reducing inference latency by up to 100×.
Submission Number: 1
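
The abstract combines model pruning, knowledge distillation, quantization, and caching at a high level. As a rough illustration only, and not the paper's actual pipeline, the sketch below stacks unstructured magnitude pruning, post-training dynamic INT8 quantization, and KV caching on a small causal LM in PyTorch. The model name, pruning ratio, and quantization settings are illustrative assumptions; knowledge distillation is omitted because it requires a separate training loop.

```python
# Hedged sketch: one plausible way to combine pruning, quantization, and KV caching
# for a small causal LM on CPU. This is NOT the paper's exact configuration; the
# model name, 30% pruning ratio, and INT8 dynamic quantization are assumptions.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-0.5B"  # assumed small model; the paper evaluates Llama3 and Qwen variants

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32)

# 1) Unstructured L1-magnitude pruning of all linear layers (30% of weights zeroed).
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the mask into the weights

# 2) Post-training dynamic INT8 quantization of the linear layers (CPU inference).
model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# 3) KV caching during autoregressive generation (use_cache=True reuses past keys/values).
inputs = tokenizer("Inference optimization matters because", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32, use_cache=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The relative gain from each step depends on hardware and sequence length; the 4× quality parity and up-to-100× latency reduction reported in the abstract refer to the paper's own combination of techniques, not to this sketch.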