Keywords: Agent Orchestration, Constrained Reinforcement Learning, Cost-Aware Decision Making, Parameter-Efficient Learning, Low-Rank Adaptation
TL;DR: We propose LoRA-Guided PPO as a hybrid approach for cost-aware agent orchestration. Across two benchmarks, our hybrid achieves the lowest cost-per-success among high-success methods, outperforming both supervised LoRA and PPO alone.
Abstract: A fundamental challenge in multi-agent reasoning systems is budget-aware allocation: deciding which sub-agents to invoke across multiple steps while balancing success against computational and monetary cost. We formalize this setting as a cost-constrained sequential decision problem and propose a hybrid policy that integrates parameter-efficient pretraining with reinforcement learning. Specifically, a LoRA adapter captures cost-sensitive priors from heuristic traces, and Proximal Policy Optimization (PPO) fine-tunes only this low-rank subspace. Restricting updates to the adapter stabilizes optimization, improves sample efficiency, and preserves allocation thrift while enabling sequential credit assignment. Empirical evaluation suggests that this framework improves the efficiency-accuracy trade-off. On a ToolBench-style benchmark, the hybrid achieves perfect success while reducing cost-per-success (CPS) by 12% relative to PPO (21.40 vs. 24.30). In the synthetic FlightPlanner setting, it achieves the lowest CPS (7.71) among high-success methods, compared with Rule-based (10.58) and supervised LoRA (11.70). Our results demonstrate that combining parameter-efficient fine-tuning with RL yields controllers that are both adaptive and budget-aware, providing a practical recipe for efficient reasoning under real-world resource constraints.
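
Illustration: the central mechanism of the abstract, freezing the base policy and letting PPO update only the LoRA factors, can be sketched in a few lines of PyTorch. This is a minimal sketch under assumptions of our own: the class and variable names, the layer dimensions, and the per-call cost penalty folded into the advantage are illustrative rather than the paper's implementation, and the supervised pretraining of the adapter on heuristic traces is omitted.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen base linear layer plus a trainable low-rank update, so PPO
    # gradients touch only the adapter factors A (in x r) and B (r x out).
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # base weights stay frozen
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale

def ppo_clip_loss(new_logp, old_logp, advantage, clip_eps=0.2):
    # Standard clipped PPO surrogate; cost-awareness enters through the
    # advantage (e.g., task reward minus a per-call cost penalty).
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()

# Only parameters with requires_grad=True (the LoRA factors) are optimized.
policy_head = LoRALinear(nn.Linear(512, 16))      # 16 candidate sub-agents (illustrative)
optimizer = torch.optim.Adam(
    (p for p in policy_head.parameters() if p.requires_grad), lr=1e-4
)

In this sketch, restricting the optimizer to the adapter's parameters is what realizes the "fine-tunes only this low-rank subspace" claim: the pretrained, cost-sensitive base mapping is preserved exactly, while PPO adjusts a small number of weights for sequential credit assignment.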
Submission Number: 184