Keywords: Large Language Models, Black-Box LLMs, LLM Reasoning, Reinforcement Learning
Abstract: For large language models deployed through black-box APIs, recurring inference
costs often dominate one-time training costs, motivating composed agentic systems
that amortize expensive reasoning into reusable intermediate representations. We
study a broad class of such systems, termed Guide–Core Policies (GCoP), in which
a guide model generates a structured strategy that is executed by a black-box core
model. This abstraction subsumes base, supervised, and advisor-style approaches,
which differ primarily in how the guide is trained. We formalize GCoP under
a cost-sensitive utility objective and show that end-to-end performance is
governed by guide-averaged executability: the probability that a strategy can be
faithfully followed by the core. Our analysis reveals that existing instantiations of
GCoP often fail to optimize executability under deployment constraints, leading to
brittle strategies and inefficient computation. Guided by these insights, we propose
ExecTune, a principled training recipe that combines teacher-guided acceptance
sampling, supervised fine-tuning, and structure-aware reinforcement learning
to directly optimize syntactic validity, execution success, and cost efficiency.
Across mathematical reasoning and code-generation benchmarks, GCoP with
ExecTune improves accuracy by up to **9.2%** over prior state-of-the-art baselines
while reducing inference cost by up to **22.4%**. GCoP with ExecTune enables
Claude Haiku-3.5 to surpass Sonnet-3.5 on math and code tasks and comes within
**1.7%** absolute accuracy of Sonnet 4 at **38%** lower cost. Beyond efficiency, GCoP
enables modular adaptation by updating guides without retraining the core.
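
To make the abstraction concrete, here is a minimal sketch of the guide-core control flow; the `GCoP` wrapper and the `guide`/`core` callables are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of a Guide-Core Policy (GCoP): a guide model emits a
# structured strategy once, and a black-box core model executes it.
# All names below are hypothetical, introduced only for this example.
from dataclasses import dataclass
from typing import Callable

@dataclass
class GCoP:
    guide: Callable[[str], str]      # tunable guide: query -> structured strategy
    core: Callable[[str, str], str]  # black-box core: (query, strategy) -> answer

    def __call__(self, query: str) -> str:
        strategy = self.guide(query)       # expensive reasoning, amortized into a reusable artifact
        return self.core(query, strategy)  # cheap, strategy-conditioned execution

# Usage with stand-in callables:
policy = GCoP(
    guide=lambda q: f"Plan: decompose '{q}' into subgoals, then verify each step.",
    core=lambda q, s: f"[core answer to '{q}' following: {s}]",
)
print(policy("Solve 17 * 24"))
```

Because the guide and core are decoupled, swapping in a retrained guide updates the system without touching the black-box core, which is the modular-adaptation property noted above.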
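Likewise, one plausible formalization of the cost-sensitive utility objective and of guide-averaged executability, in notation the abstract leaves implicit (guide $g$, core $\pi$, query $x$, strategy $s$, output $y$; all assumed here):

```latex
% Sketch only: the notation (g, \pi, x, s, y, R, C, \lambda) is assumed,
% not taken verbatim from the paper.
\[
  U(g) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; s \sim g(\cdot\mid x),\; y \sim \pi(\cdot\mid x,s)}
  \bigl[\, R(x, y) - \lambda\, C(s, y) \,\bigr],
\]
where $R$ is task reward, $C$ is inference cost, and $\lambda > 0$ prices cost
against accuracy. Guide-averaged executability is then the probability,
averaged over the guide's strategies, that the core can faithfully follow one:
\[
  E(g) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; s \sim g(\cdot\mid x)}
  \bigl[\, \Pr\bigl(\pi \text{ faithfully executes } s \mid x\bigr) \,\bigr].
\]
```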
Submission Number: 115