Abstract: Generative AI, and large language models (LLMs) in particular, are now widely used for document processing tasks such as question answering and document summarization. Operating or invoking LLMs for these use cases imposes substantial costs on enterprises.
In this work, we propose optimizing the usage costs of LLMs in a quality-aware manner for document summarization tasks. Specifically, we exploit the variability of LLM performance across different types and formats of data to maximize output quality while keeping expected cost under a budget and latency within a threshold. This presents two challenges: 1) estimating the output quality of LLMs at runtime without invoking each LLM, and 2) optimally allocating queries to LLMs so that the objectives are optimized and the constraints are satisfied. We propose a model to predict the output quality of LLMs on text summarization, followed by a linear programming (LP) rounding algorithm to optimize the selection of LLMs. We study the problems both theoretically and empirically. Our methods reduce costs by $40\%$ to $90\%$ while improving quality by $4\%$ to $7\%$. In addition to the quantitative results, we show through a user study that our quality estimation model largely aligns with human preferences.
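To make the allocation step concrete, below is a minimal sketch of how an LP relaxation with naive rounding could look for quality-aware LLM selection. This is not the paper's exact formulation: the quality matrix, per-call cost and latency figures, and the per-query rounding rule here are all illustrative assumptions, and the paper's dedicated LP rounding algorithm is replaced by a simple argmax over fractional mass. The sketch uses `scipy.optimize.linprog`.

```python
import numpy as np
from scipy.optimize import linprog

def select_llms(quality, cost, latency, budget, latency_cap):
    """Fractional LP over assignments x[i, j], then naive per-query rounding.

    quality[i, j]: assumed predicted quality of LLM j on query i.
    cost[j], latency[j]: assumed per-call cost and latency of LLM j.
    """
    n_queries, n_llms = quality.shape
    # Mask out LLMs whose per-call latency exceeds the threshold.
    feasible = np.broadcast_to(latency <= latency_cap, quality.shape)
    # Objective: maximize total predicted quality (linprog minimizes).
    c = -np.where(feasible, quality, 0.0).ravel()
    # Budget: total per-call cost of all assignments stays under budget.
    A_ub = np.tile(np.asarray(cost, dtype=float), n_queries)[None, :]
    b_ub = np.array([budget])
    # Each query is assigned exactly one unit of (fractional) LLM mass,
    # spread only over its latency-feasible LLMs.
    A_eq = np.zeros((n_queries, n_queries * n_llms))
    for i in range(n_queries):
        A_eq[i, i * n_llms:(i + 1) * n_llms] = feasible[i].astype(float)
    b_eq = np.ones(n_queries)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * (n_queries * n_llms))
    assert res.success, "LP infeasible: budget or latency cap too tight"
    x = res.x.reshape(n_queries, n_llms)
    # Naive rounding stand-in: pick the LLM carrying the most fractional
    # mass for each query (the paper's rounding algorithm is not given here).
    return np.where(feasible, x, -1.0).argmax(axis=1)

# Hypothetical usage: three queries, two LLMs (expensive/high-quality vs. cheap).
quality = np.array([[0.9, 0.7], [0.6, 0.8], [0.95, 0.5]])
choice = select_llms(quality, cost=np.array([1.0, 0.2]),
                     latency=np.array([2.0, 0.5]), budget=1.5, latency_cap=3.0)
```

Under these assumed numbers, the budget admits only one call to the expensive LLM, so the relaxation routes it to the query where the quality gap is largest and serves the rest with the cheap model.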
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Lijun_Zhang1
Submission Number: 4163