Energy-Efficient Inference with Small Language Models: A Comparative Study on Code Generation, Classification, and Environmental Impact
Abstract: Large language models (LLMs) are widely deployed in enterprise applications for code completion, email classification, and sentiment analysis. Despite their strong performance, their high computational requirements lead to substantial energy consumption at inference time. Can a small language model (SLM) with three billion parameters (Qwen2.5-3B-Instruct)
perform comparably on structured, high-frequency tasks while delivering a significantly lower environmental impact?
We evaluated the SLM on three enterprise workloads: code generation on the HumanEval benchmark (164 tasks), HR email routing (1,339 examples), and binary sentiment analysis (100 samples). We measured output quality, inference latency, throughput, GPU memory utilization, and energy consumption. The SLM achieved a 72.6% pass rate on code generation and 86% accuracy on sentiment analysis while consuming 388–647× less energy per query than GPT-4o on code tasks and 210–1,333× less on classification tasks. Scaled to an organizational context, replacing LLMs with task-specific SLMs for 10,000 daily code completions, 100,000 sentiment queries, and 50,000 monthly email classifications would save 52,642 kWh annually, reducing CO2 emissions by 23.1 metric tons. An SLM-first deployment strategy is thus a practical path to
sustainable AI with substantial energy savings.
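As a sanity check on the abstract's headline figures, the grid carbon intensity implied by the reported savings can be backed out directly; the sketch below uses only the two numbers stated in the abstract, and the resulting factor of roughly 0.44 kg CO2/kWh is an inference on our part, not a value the paper states.

```python
# Back out the grid emission factor implied by the abstract's figures.
annual_kwh_saved = 52_642   # reported annual energy savings (kWh)
co2_tons_saved = 23.1       # reported annual CO2 reduction (metric tons)

# Implied emission factor in kg CO2 per kWh
factor_kg_per_kwh = co2_tons_saved * 1000 / annual_kwh_saved
print(f"Implied grid intensity: {factor_kg_per_kwh:.3f} kg CO2/kWh")
```

The implied factor (about 0.439 kg CO2/kWh) is consistent with commonly cited average grid emission factors, which lends plausibility to the conversion used in the study.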
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Binhang_Yuan1
Submission Number: 7785