3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency

Published: 11 Nov 2025 · Last Modified: 16 Jan 2026 · DAI Oral · CC BY 4.0
Keywords: AI inference, Large language models (LLMs), Inference scaling, Inference efficiency, Optimization
Abstract: AI inference scaling is often tuned through 1D heuristics (a fixed number of reasoning passes) or 2D bivariate trade-offs (e.g., performance vs. compute), which fail to account for cost and latency constraints jointly. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraint-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods for this 3D multi-objective optimization (MOO) problem. Framing inference scaling as an MOO problem yields a feasible region that 1D and 2D optimizations fail to capture, enabling environment-adaptive selection of the inference-scaling parameter k. Results show that knee-point optimization achieves the best balance, while accuracy maximization remains favorable when precision is prioritized. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational contexts.
Submission Number: 13
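
For intuition, the sketch below shows one way knee-point selection over candidate inference-scaling values k can be implemented. It is a minimal illustration under stated assumptions, not the paper's implementation: the knee criterion used here (smallest normalized distance to the ideal point on the Pareto front) is one common heuristic, and the candidate accuracy/cost/latency numbers are hypothetical.

```python
# Illustrative sketch (not the paper's exact method): choose an inference-scaling
# value k by knee-point selection over (accuracy, cost, latency) candidates.
import numpy as np

def pareto_front(points):
    """Return indices of non-dominated rows.
    Objectives: maximize accuracy (col 0); minimize cost (col 1) and latency (col 2)."""
    p = points.astype(float).copy()
    p[:, 0] = -p[:, 0]  # negate accuracy so every objective is "smaller is better"
    keep = []
    for i in range(len(p)):
        dominated = any(
            np.all(p[j] <= p[i]) and np.any(p[j] < p[i])
            for j in range(len(p)) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

def knee_point(points):
    """Knee heuristic: Pareto point closest to the ideal point after min-max scaling."""
    front = pareto_front(points)
    p = points[front].astype(float)
    p[:, 0] = -p[:, 0]
    lo, hi = p.min(axis=0), p.max(axis=0)
    scaled = (p - lo) / np.where(hi > lo, hi - lo, 1.0)
    dist = np.linalg.norm(scaled, axis=1)  # ideal point is the origin after scaling
    return front[int(np.argmin(dist))]

# Hypothetical candidates: rows are (accuracy, cost in $, latency in s) for k = 1..5,
# with diminishing accuracy gains but roughly doubling cost and latency per step.
candidates = np.array([
    [0.62, 0.01, 0.8],
    [0.71, 0.02, 1.5],
    [0.76, 0.04, 2.9],
    [0.78, 0.08, 5.6],
    [0.79, 0.16, 11.0],
])
best_k = knee_point(candidates) + 1
print(f"knee-point choice: k = {best_k}")  # k = 3 for these illustrative numbers
```

With these numbers every candidate is Pareto-optimal (accuracy rises while cost and latency rise too), so a 2D performance-vs-compute view alone cannot single out a k; the knee criterion picks the point where further scaling buys little accuracy per unit of added cost and latency.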