Towards Automatically Optimizing Retrieval Augmented AI Systems

Published: 30 Oct 2025, Last Modified: 04 Nov 2025, Venue: MLForSys2025, License: CC BY 4.0
Keywords: Retrieval-Augmented Generation, RAG, Energy-Efficient Inference, ML for Systems, Configuration Optimization, Pareto Efficiency, Energy Profiling, Large Language Models, Pipeline Auto-Tuning
Abstract: Large Language Models (LLMs) are increasingly deployed in real-world systems, with Retrieval-Augmented Generation (RAG) now a dominant production workload. Yet LLM deployments are energy-intensive: inference accounts for over 90% of the model lifecycle in cloud workloads. We show that RAG workflows with near-identical accuracy can differ drastically in energy consumption, a property we call “workflow fungibility.” For example, pairing Llama3-8B with stronger retrievers matches the accuracy of Llama3-70B while using over 5× less energy. To study this effect, we profile retrieval and generation configurations across FinanceBench and FRAMES, mapping the joint accuracy–energy landscape. Our results reveal configurations whose accuracy differs by at most 3% yet whose energy consumption differs by up to 20.2×, exposing large hidden opportunities for efficiency. We further demonstrate that lightweight regressors can predict accuracy from a small set of configuration knobs, enabling prediction-guided pruning of the design space. These findings establish workflow fungibility as a key lever for sustainable RAG and point toward systematic, energy-aware configuration as a critical direction for retrieval-based LLM systems.
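To make the prediction-guided pruning idea concrete, here is a minimal Python sketch. The configuration knobs (generator size, a retriever quality score, top-k, chunk size), the synthetic accuracy and energy profiles, the random-forest regressor, and the 3% accuracy band are all illustrative assumptions standing in for the paper's measured FinanceBench/FRAMES data and exact method.

```python
# Minimal sketch of prediction-guided pruning of a RAG design space.
# Everything below is a stand-in: real knobs, profiled accuracy/energy
# measurements, and the paper's regressor may differ.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical design space: each row encodes configuration knobs, e.g.
# generator size (B params), retriever quality score, top-k, chunk size.
configs = np.array([
    [size, retr, k, chunk]
    for size in (8, 70)            # generator parameter count (B)
    for retr in (0.3, 0.6, 0.9)    # stand-in retriever quality score
    for k in (1, 5, 10)            # retrieved passages per query
    for chunk in (256, 512)        # chunk size in tokens
])

# Synthetic profiling results; in practice these come from measured runs.
accuracy = (
    0.5
    + 0.002 * configs[:, 0]        # bigger generators help a little
    + 0.3 * configs[:, 1]          # retriever quality helps a lot
    + 0.005 * configs[:, 2]        # more passages help marginally
    + rng.normal(0, 0.01, len(configs))
)
energy = configs[:, 0] * (1 + 0.05 * configs[:, 2])  # toy Wh per query

# Fit a lightweight regressor mapping knobs -> accuracy on a profiled subset.
train = rng.choice(len(configs), size=len(configs) // 2, replace=False)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(configs[train], accuracy[train])

# Prediction-guided pruning: keep configurations predicted to fall within
# 3% of the best predicted accuracy, then take the cheapest survivor.
pred = model.predict(configs)
survivors = np.flatnonzero(pred >= pred.max() - 0.03)
best = survivors[np.argmin(energy[survivors])]
print("chosen config:", configs[best], "est. energy (Wh):", energy[best])
```

The design point the sketch illustrates: because the regressor only needs a handful of profiled configurations to rank the rest, most of the design space never has to be measured, which is what makes exploiting workflow fungibility cheap in practice.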
Submission Number: 65