An Architecture Search Framework for Inference-Time Techniques

Jon Saad-Falcon; Adrian Gamarra Lafuente; Shlok Natarajan; Nahum Maru; Hristo Todorov; Etash Kumar Guha; E. Kelly Buchanan; Mayee F Chen; Neel Guha; Christopher Re; Azalia Mirhoseini

Published: 08 Mar 2025, Last Modified: 13 Apr 2025 | SSI-FM Oral | CC BY 4.0

Keywords: inference-time techniques, test-time scaling, machine learning, natural language processing

TL;DR: We present Archon, a modular framework for designing inference-time architectures that outperforms top language models like OpenAI's o1, GPT-4o, and Claude 3.5 Sonnet on various benchmarks by optimally combining LLMs and inference techniques.

Abstract: Inference-time techniques, such as repeated sampling and iterative revision, are emerging as powerful ways to enhance large language models (LLMs) at test time. However, best practices for building systems that combine these techniques remain underdeveloped, for three reasons: we have a limited understanding of how useful each technique is across models and tasks, we know little about how the techniques interact, and the search space for combining them is massive. To address these challenges, we introduce Archon, a modular and automated framework for selecting and combining inference-time techniques and LLMs. Given a compute budget and a set of available LLMs, Archon explores a large design space to discover optimized configurations tailored to target benchmarks. It can design custom or general-purpose architectures that advance the Pareto frontier of accuracy vs. maximum token budget compared to top-performing baselines. Across instruction-following, reasoning, and coding tasks, we show that Archon can leverage additional inference compute budget to design systems that outperform frontier models such as OpenAI’s o1, GPT-4o, and Claude 3.5 Sonnet by an average of 15.1%.
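As a rough illustration of the problem setup described in the abstract (and not of Archon's actual search procedure, which this page does not specify), the sketch below exhaustively enumerates a tiny design space under a token budget and extracts a Pareto frontier of score vs. cost. Everything in it is an assumption made for illustration: the `Config` fields, the `token_cost` model, the model names, and the `toy_evaluate` scoring function are all hypothetical stand-ins for a real benchmark evaluation.

```python
import math
from itertools import product
from typing import Callable, NamedTuple


class Config(NamedTuple):
    model: str      # hypothetical model name
    samples: int    # number of repeated samples drawn
    revisions: int  # iterative revision passes applied to each sample


def token_cost(cfg: Config, tokens_per_call: int = 500) -> int:
    # Assumed cost model: one call per sample, plus one call per revision pass.
    return cfg.samples * (1 + cfg.revisions) * tokens_per_call


def toy_evaluate(cfg: Config) -> float:
    # Hypothetical stand-in for running a benchmark; returns a made-up score.
    base = {"small": 0.5, "large": 0.6}[cfg.model]
    return base + 0.05 * math.log2(cfg.samples) + 0.02 * cfg.revisions


def search(models, sample_opts, revision_opts, budget: int,
           evaluate: Callable[[Config], float]):
    """Exhaustive search: return the best-scoring config within the budget."""
    best, best_score = None, float("-inf")
    for m, s, r in product(models, sample_opts, revision_opts):
        cfg = Config(m, s, r)
        if token_cost(cfg) > budget:
            continue  # skip configurations that exceed the token budget
        score = evaluate(cfg)
        if score > best_score:
            best, best_score = cfg, score
    return best, best_score


def pareto_frontier(points):
    """Given (cost, score) pairs, keep those no cheaper-or-equal point beats."""
    frontier = []
    # Sort by cost ascending; at equal cost, put the highest score first.
    for cost, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if not frontier or score > frontier[-1][1]:
            frontier.append((cost, score))
    return frontier
```

Calling `search(["small", "large"], [1, 2, 4, 8], [0, 1, 2], budget=4000, evaluate=toy_evaluate)` returns the highest-scoring configuration whose assumed token cost fits the budget. Exhaustive enumeration is only workable for a design space this small; the abstract's point about a massive search space is exactly why a real system would need a more efficient search strategy.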

Submission Number: 6
