TL;DR: We present Archon, a modular framework for designing inference-time architectures that optimally combine LLMs and inference techniques; the resulting systems outperform top language models like OpenAI's o1, GPT-4o, and Claude 3.5 Sonnet on various benchmarks.
Abstract: Inference-time techniques, such as repeated sampling or iterative revisions, are emerging as powerful ways to enhance large language models (LLMs) at test time. However, best practices for developing systems that combine these techniques remain underdeveloped due to our limited understanding of the utility of each technique across models and tasks, the interactions between them, and the massive search space for combining them. To address these challenges, we introduce Archon, a modular and automated framework for optimizing the process of selecting and combining inference-time techniques and LLMs. Given a compute budget and a set of available LLMs,
Archon explores a large design space to discover optimized configurations tailored to target benchmarks. It can design custom or general-purpose architectures that advance the Pareto frontier of accuracy vs. maximum token budget compared to top-performing baselines. Across instruction-following, reasoning, and coding tasks, we show that Archon can leverage additional inference compute budget to design systems that outperform frontier models such as OpenAI’s o1, GPT-4o, and Claude 3.5 Sonnet by an average of 15.1%.
Lay Summary: Large language models like GPT-4 or Claude are powerful — but they get even better if you let them generate multiple answers, vote on the best one, or combine ideas from different models. These tricks, called inference-time techniques, can boost accuracy at test time without retraining the model. The challenge is that there are many such techniques, and figuring out which ones to use — and in what order — is a complex puzzle.
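One of the simplest inference-time techniques mentioned above is sampling several answers and voting on the most common one. A minimal sketch (the `answers` list stands in for completions sampled from a real LLM, which this example does not call):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among multiple model samples.

    A toy illustration of self-consistency-style voting; in practice the
    inputs would be completions sampled from an LLM at temperature > 0.
    """
    counts = Counter(answers)
    best_answer, _ = counts.most_common(1)[0]
    return best_answer

# e.g. five sampled answers to the same math problem
print(majority_vote(["42", "41", "42", "42", "17"]))  # -> 42
```

Voting costs extra inference compute (one call per sample) but requires no retraining, which is exactly the trade-off these techniques exploit.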
That’s where Archon comes in. Archon is an open-source framework that acts like an AI systems architect. You give it a compute budget, a set of models, and a target task — like solving math problems or writing code — and it automatically designs the best way to combine techniques like generation ensembling, ranking, fusion, and verification.
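Conceptually, an Archon-style architecture chains such techniques as layers, where each layer transforms a pool of candidate answers. The sketch below is a deliberately simplified, hypothetical view of that chaining (the layer functions are toy stand-ins, not Archon's actual API):

```python
def run_pipeline(layers, prompt):
    """Apply a sequence of inference-time 'layers' to a candidate pool.

    Each layer maps (prompt, candidates) -> new candidates, mirroring how
    stages like generation, ranking, and fusion could be composed.
    """
    candidates = []
    for layer in layers:
        candidates = layer(prompt, candidates)
    return candidates

# Toy layers standing in for real LLM calls:
def generate(prompt, _):
    return [f"draft-{i}" for i in range(3)]   # generation ensemble

def rank(prompt, cands):
    return sorted(cands)[:2]                  # keep top-ranked candidates

def fuse(prompt, cands):
    return [" + ".join(cands)]                # merge survivors into one answer

print(run_pipeline([generate, rank, fuse], "solve x+1=2"))
# -> ['draft-0 + draft-1']
```

The key design question Archon automates is which layers to include and in what order, given the models and compute budget available.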
Archon uses smart search algorithms (like Bayesian optimization) to explore thousands of possible configurations and find high-performing combinations. It can even tailor its design for specific tasks or discover general-purpose strategies that work across many benchmarks.
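To make the configuration search concrete, here is a minimal sketch using plain random search over a hypothetical space of pipeline knobs; Archon itself uses smarter strategies such as Bayesian optimization, and the knobs and scoring function below are illustrative assumptions, not Archon's real interface:

```python
import random

# Hypothetical search space: samples drawn per query, and whether to add
# fusion and verification layers. Purely illustrative.
SEARCH_SPACE = {
    "num_samples": [1, 5, 10],
    "use_fusion": [False, True],
    "use_verifier": [False, True],
}

def evaluate(config):
    # Stand-in for running the configured pipeline on a dev set and
    # measuring accuracy; here a deterministic toy score.
    score = 0.5 + 0.03 * config["num_samples"]
    score += 0.05 if config["use_fusion"] else 0.0
    score += 0.04 if config["use_verifier"] else 0.0
    return score

def search(budget=8, seed=0):
    """Sample `budget` random configurations and keep the best one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(budget):
        cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        s = evaluate(cfg)
        if s > best_score:
            best_cfg, best_score = cfg, s
    return best_cfg, best_score
```

Bayesian optimization improves on this by using earlier evaluations to decide which configuration to try next, which matters when each evaluation means running a full pipeline over a benchmark.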
In our experiments, Archon's designs outperformed frontier models like GPT-4o and Claude 3.5 Sonnet by an average of 15.1%, by spending additional inference compute wisely. By helping users get more out of existing models, Archon makes advanced AI both more powerful and more accessible.
Link To Code: https://github.com/ScalingIntelligence/Archon
Primary Area: Deep Learning->Large Language Models
Keywords: inference-time techniques, test-time scaling, machine learning, natural language processing
Submission Number: 12728