CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: CodeSteer, a method that augments LLM capabilities by guiding LLM code/text generation, and SymBench, a benchmark for evaluating symbolic tasks.
Abstract: Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct SymBench, a comprehensive benchmark comprising 37 symbolic tasks with adjustable complexity, and synthesize datasets of 12k multi-turn guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with newly designed multi-turn supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, even outperforming the best existing LLMs OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8) across all 37 tasks (28 seen, 9 unseen). Although trained for GPT-4o, CodeSteer generalizes well, providing an average 41.8-point performance boost on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks. Models, datasets, and code are available at https://github.com/yongchao98/CodeSteer-v1.0 and https://huggingface.co/yongchao98.
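For intuition, here is a minimal sketch (ours, not the released CodeSteer API) of the multi-turn guidance loop the abstract describes: a small steering model chooses between code and text each turn, the larger task LLM generates accordingly, and the checkers decide whether to accept the answer or request another round. All helper functions below are placeholder stand-ins, not functions from the repository.

```python
# Minimal sketch (not the released CodeSteer API) of the multi-turn
# guidance loop: a steering model picks "code" or "text" each turn,
# the larger task LLM generates accordingly, and checkers accept or retry.
# The first three helpers are placeholder stand-ins for real model calls.
import contextlib
import io


def codesteer_guide(question, history):
    """Placeholder steering model: choose a generation mode and a hint."""
    return "code", "Use Python to compute the answer exactly."


def task_llm_generate(question, mode, hint):
    """Placeholder task LLM (e.g. GPT-4o) producing code or plain text."""
    return "print(sum(range(1, 101)))" if mode == "code" else "5050"


def answer_is_valid(question, answer):
    """Placeholder for the symbolic and self-answer checkers."""
    return answer.strip() != ""


def run_code(snippet):
    """Execute generated code and capture its stdout (sandbox this in practice)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(snippet, {})
    return buf.getvalue().strip()


def solve_with_guidance(question, max_turns=5):
    """Alternate between guidance and generation until a check passes."""
    answer, history = "", []
    for _ in range(max_turns):
        mode, hint = codesteer_guide(question, history)
        output = task_llm_generate(question, mode, hint)
        answer = run_code(output) if mode == "code" else output
        if answer_is_valid(question, answer):
            return answer
        history.append((mode, output, answer))
    return answer  # best effort after max_turns


print(solve_with_guidance("What is the sum of the integers from 1 to 100?"))  # -> 5050
```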
Lay Summary: When reading papers on LLM-based agents, I have always been curious why many test tasks are clearly better solved with code search/reasoning, yet we simply ask LLMs to solve them via pure text generation. For example, the question 'Which is bigger, 9.11 or 9.9?' is easily solved if we just prompt LLMs to use code to answer. LLMs like ChatGPT can write essays or solve math problems, but they often struggle to decide when to use plain language versus actual computer code. Our project, CodeSteer, helps these models make better decisions between code and text when solving complex problems. We created a set of 37 tasks called SymBench, which test symbolic reasoning, such as solving puzzles or manipulating equations. We also generated a large amount of training data to help models learn when to switch between code and text. We then fine-tuned a powerful model using this data and added extra tools to help it double-check its answers. When we combined CodeSteer with OpenAI's GPT-4o, the model's performance significantly improved, beating the best existing systems. Even when used with other models like Claude and GPT-3.5, CodeSteer made them much better at solving symbolic problems. Our work shows that with the right guidance, AI models can fully use symbolic computing to solve difficult tasks more reliably and efficiently.
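To make the '9.11 vs 9.9' point concrete, here is a two-line illustration (ours, not from the paper) of how trivially code settles a question that text-only reasoning often gets wrong:

```python
# Illustrative only (not part of the CodeSteer release): the "9.11 vs 9.9"
# question answered by code instead of text reasoning.
a, b = 9.11, 9.9
print("9.9 is bigger" if b > a else "9.11 is bigger")  # -> 9.9 is bigger
```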
Link To Code: https://github.com/yongchao98/CodeSteer-v1.0
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Code Interpreter, Code/text generation, Symbolic computing, Model fine-tuning, Model reasoning and planning
Submission Number: 9207