Energy-Efficient Inference with Small Language Models: A Comparative Study on Code Generation, Classification, and Environmental Impact
Abstract: Large language models (LLMs) are widespread in enterprise applications for code completion, email classification, and sentiment analysis. While these models deliver strong performance, their computational demands lead to substantial energy consumption during inference. Can a smaller language model (SLM) with three billion parameters (Qwen2.5-3B-Instruct) perform comparably on structured, high-frequency tasks while providing a significantly lower environmental impact?
We tested an SLM on three enterprise workloads: code generation on the HumanEval benchmark (164 tasks), HR email routing (1,339 examples), and binary sentiment analysis (872 samples from SST-2). We recorded output quality, inference latency, throughput, GPU memory utilization, and energy consumption using direct power measurements via nvidia-smi. For a controlled comparison, we also evaluated a 14-billion-parameter LLM (Phi-3-medium-4k-instruct, 4-bit quantized) on the same three tasks under identical hardware and experimental conditions. The SLM achieved 73.2% Pass@1 on code generation, 90.5% accuracy on sentiment analysis, and 79.9% on email classification. Under identical conditions, the SLM consumed 3-11× less energy per query than the 14B LLM across all tasks while achieving higher accuracy. When compared against cloud-deployed frontier models using published inference cost estimates, the energy reduction scales to 388-1,333×, reflecting the compounded effects of model scale and datacenter overhead. Scaled to an organizational context, replacing a locally hosted 14B LLM with an SLM for 10,000 daily code completions, 100,000 daily sentiment queries, and 50,000 monthly email classifications would save approximately 5,500 kWh annually, reducing CO2 emissions by 2.4 metric tons. Savings relative to cloud-deployed frontier models would be proportionally larger. An SLM-first deployment strategy is a practical path to sustainable AI with significant energy savings.
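The kWh-to-CO2 conversion behind the annual-impact claim can be sanity-checked with a short arithmetic sketch. The 5,500 kWh figure is from the abstract; the grid emission factor of 0.43 kg CO2 per kWh is an illustrative average-grid assumption not stated in the abstract:

```python
# Back-of-envelope check of the abstract's annual-impact claim.
# ANNUAL_KWH_SAVED comes from the abstract; the emission factor is
# an assumed average-grid intensity, not a value from the paper.
ANNUAL_KWH_SAVED = 5_500
GRID_EMISSION_FACTOR_KG_PER_KWH = 0.43  # assumption: typical grid mix

co2_kg = ANNUAL_KWH_SAVED * GRID_EMISSION_FACTOR_KG_PER_KWH
co2_metric_tons = co2_kg / 1_000
print(f"Estimated annual CO2 reduction: {co2_metric_tons:.1f} metric tons")
```

With this assumed factor the result lands at roughly 2.4 metric tons, consistent with the figure reported above; the exact value depends on the regional grid mix.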
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We addressed all reviewer concerns: (1) added a controlled 14B LLM baseline (Phi-3-medium) under identical conditions, (2) replaced proxy energy estimates with direct nvidia-smi power measurements, (3) expanded the sentiment dataset from 100 to 872 samples (full SST-2), (4) corrected annual impact calculations to reflect measured energy differences, and (5) clarified the distinction between measured local savings (3-11×) and estimated cloud comparisons (388-1,333×, from the literature).
Assigned Action Editor: ~Binhang_Yuan1
Submission Number: 7785