TL;DR: We propose SOAR, a method that learns program induction by integrating language models into a self-improving evolutionary loop, outperforming prior program-induction methods on the ARC-AGI benchmark.
Abstract: Many program synthesis tasks prove too challenging for even state-of-the-art language models to solve in single attempts. Search-based evolutionary methods offer a promising alternative by exploring solution spaces iteratively, but their effectiveness remains limited by the fixed capabilities of the underlying generative model.
We propose SOAR, a method that learns program synthesis by integrating language models into a self-improving evolutionary loop.
SOAR alternates between (1) an evolutionary search that uses an LLM to sample and refine candidate solutions, and (2) a hindsight learning phase that converts search attempts into valid problem-solution pairs used to fine-tune the LLM's sampling and refinement capabilities—enabling increasingly effective search in subsequent iterations.
On the challenging ARC-AGI benchmark, SOAR achieves significant performance gains across model scales and iterations, leveraging positive transfer between the sampling and refinement fine-tuning tasks. These improvements carry over to test-time adaptation, enabling SOAR to solve 52% of the public test set.
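The alternating loop described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `StubLLM`, `evolutionary_search`, `hindsight_learning`, the scoring scheme, and all parameter values are assumptions introduced here for clarity.

```python
import random

class StubLLM:
    """Toy stand-in for the fine-tunable LLM (illustrative assumption)."""
    def __init__(self):
        self.training_pairs = []

    def sample(self, task):
        # Propose a candidate program; score is a random stand-in for fitness.
        return {"program": f"solve_{task}_v{random.randint(0, 99)}",
                "score": random.random()}

    def refine(self, task, candidate):
        # A refinement step; here it simply nudges the score upward.
        return {"program": candidate["program"] + "+fix",
                "score": min(1.0, candidate["score"] + 0.1)}

    def finetune(self, pairs):
        # Stand-in for fine-tuning on (task, program) pairs.
        self.training_pairs.extend(pairs)

def evolutionary_search(llm, tasks, n_samples=4, n_refinements=2):
    """Phase 1: sample candidate programs, then refine the best ones."""
    attempts = []
    for task in tasks:
        candidates = [llm.sample(task) for _ in range(n_samples)]
        for _ in range(n_refinements):
            best = max(candidates, key=lambda c: c["score"])
            candidates.append(llm.refine(task, best))
        attempts.extend((task, c) for c in candidates)
    return attempts

def hindsight_learning(llm, attempts):
    """Phase 2: convert attempts into valid problem-solution pairs.

    Hindsight view: even a program that fails the original task correctly
    solves the task defined by its own input/output behavior, so every
    search attempt can yield training data.
    """
    pairs = [(task, cand["program"]) for task, cand in attempts]
    llm.finetune(pairs)

def soar(llm, tasks, iterations=3):
    """Alternate search and hindsight fine-tuning across iterations."""
    for _ in range(iterations):
        attempts = evolutionary_search(llm, tasks)
        hindsight_learning(llm, attempts)
    return llm
```

Each iteration grows the fine-tuning corpus from all attempts, successful or not, which is what lets search improve in later iterations.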
Lay Summary: Creating computer programs automatically, a task known as program synthesis, is often too difficult for even the most advanced language models to accomplish in one go. While search-based methods that try out many possibilities offer an alternative, they are held back by the unchanging abilities of the AI model they rely on.
We introduce SOAR, a new method that helps language models learn to write programs more effectively. SOAR works by having the language model go through a two-step process repeatedly. First, it searches for and improves potential program solutions. Second, it learns from these attempts, using them as examples to fine-tune its own ability to find and refine solutions in the future. This creates a cycle where the AI continuously gets better at the task.
When tested on ARC-AGI, a challenging benchmark, SOAR showed significant improvements. The AI's ability to both generate initial program ideas and to refine them got better with each cycle. This allowed SOAR to successfully solve 52% of the publicly available test tasks, demonstrating its effectiveness.
Primary Area: Deep Learning->Large Language Models
Keywords: Program synthesis; Self-Improvement; Search; ARC; LLM
Submission Number: 11517