Abstract: Understanding the reliability of large language models (LLMs) has recently garnered significant attention. Given LLMs' propensity to hallucinate, as well as their high sensitivity to prompt design, it is already challenging to predict the performance of an individual LLM. However, the problem becomes more complex for compound LLM systems such as cascades, where in addition to each model's standalone performance, we must understand how the error rates of different models interact. In this paper, we present a probabilistic model for the joint performance distribution of a sequence of LLMs, which enables a framework for rationally tuning the confidence thresholds of an LLM cascade using continuous optimization. Compared to selecting confidence thresholds using Bayesian optimization, our parametric Markov-copula model yields more favorable error-cost trade-offs, improving the area under the error-cost curve by 4.3% on average for cascades with $k\geq 3$ models. In the low-sample regime with $n \leq 30$ training examples, the performance improvement widens to 10.2%, suggesting that our framework's inductive assumptions about the interactions between the error rates of different LLMs enhance sample efficiency. Overall, our Markov-copula model provides a rational basis for tuning LLM cascade performance and points to the potential of probabilistic methods in analyzing LLM systems.
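To make the cascade mechanism in the abstract concrete, here is a minimal sketch of confidence-threshold routing, assuming each model exposes a scalar confidence in $[0,1]$. The names `run_cascade`, `models`, and `thresholds` are hypothetical illustrations, and the paper's contribution concerns how the thresholds are chosen (via the parametric Markov-copula model and continuous optimization), not this routing loop itself.

```python
def run_cascade(prompt, models, thresholds):
    """Route a prompt through a cascade of models ordered by cost.

    Each element of `models` is a callable returning (answer, confidence),
    with confidence in [0, 1]. `thresholds` holds one deferral threshold
    per non-terminal model.
    """
    for model, tau in zip(models[:-1], thresholds):
        answer, confidence = model(prompt)
        if confidence >= tau:
            return answer  # confident enough: stop early and save cost
    # Otherwise fall through to the final (most capable) model,
    # which always answers.
    answer, _ = models[-1](prompt)
    return answer


# Hypothetical two-model cascade: the cheap model defers when unsure.
cheap = lambda p: ("42", 0.55)
strong = lambda p: ("42", 0.99)
print(run_cascade("What is 6 * 7?", [cheap, strong], thresholds=[0.8]))
```

Raising a threshold escalates more queries (lower error, higher cost); lowering it does the opposite. Sweeping the threshold vector traces out the error-cost curve whose area the paper's framework optimizes.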
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
We thank reviewers jx1j, Z6N8, and 6MPw for their helpful feedback. Our second revision improves the manuscript in the following ways:
- We added Figure 5 clarifying the AUC performance metric for evaluating a cascade's error-cost trade-off
- We added Appendix F verifying the rank correlations between LLMs by recomputing them conditional on correct/incorrect answers
- We added a Bayesian optimization baseline to Section 4.4
- We added Section 4.4.1 evaluating performance of our method with very few training examples ($n \leq 30$)
- We added Section 4.4.2 analyzing the sensitivity of Rational Tuning’s performance gains to the Cramér-von Mises goodness-of-fit statistics
- We added Section 4.4.4 giving practical tips for applying our Rational Tuning framework
- We updated the abstract to reflect performance comparisons with the new Bayesian optimization baseline, as well as our method’s strong performance in the low-sample regime
Assigned Action Editor: Aditya Menon
Submission Number: 4225