Keywords: multimodal models, reasoning, tool use
TL;DR: TACO: Learning Multi-modal Models to Reason and Act with Synthetic Chains-of-Thought-and-Action
Abstract: While open-source multi-modal language models perform well on simple question answering tasks, they often fail on complex questions that require multiple capabilities, such as fine-grained recognition, visual grounding, and reasoning, and that demand multi-step solutions. We present TACO, a family of multi-modal large action models designed to improve performance on such complex, multi-step, and multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA), executes intermediate steps by invoking external tools such as OCR, depth estimation, and a calculator, and then integrates both the thoughts and action outputs to produce coherent responses. To train TACO, we create a large dataset of over 1M synthetic CoTA traces generated with GPT-4o and Python programs. We then experiment with various data filtering and mixing techniques and obtain a final subset of 293K high-quality CoTA examples. This dataset enables TACO to learn complex reasoning and action paths, surpassing existing models trained on instruction-tuning data that contains only direct answers. TACO outperforms the instruction-tuned baseline across 8 benchmarks, achieving a 3.9% improvement on average, with gains of up to 20% on MM-Vet tasks involving OCR, mathematical reasoning, and spatial reasoning.
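To make the inference behavior described in the abstract concrete, the following is a minimal, hypothetical Python sketch of a chain-of-thought-and-action loop: the model alternates between emitting a thought, calling an external tool (e.g., OCR or a calculator), and reading the tool's output back into its context, until it produces a final answer. The function and tool names (`model_step`, `run_cota`, `ocr`, `calculator`) and the step format are illustrative assumptions, not TACO's actual interface.

```python
# Hypothetical sketch of a chain-of-thought-and-action (CoTA) inference loop.
# Tool names, the step schema, and `model_step` are illustrative assumptions,
# not the authors' implementation.

from typing import Callable, Dict, List


def ocr(image_region: str) -> str:
    """Stand-in OCR tool: would return text found in the given image region."""
    return "TOTAL: 42"


def calculator(expression: str) -> str:
    """Stand-in calculator tool: evaluates a simple arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}, {}))


TOOLS: Dict[str, Callable[[str], str]] = {"OCR": ocr, "Calculator": calculator}


def model_step(history: List[str]) -> dict:
    """Placeholder for the multi-modal model: given the trace so far, return
    either a tool call or a final answer. A real model would generate this
    structured step from the image and question."""
    trace = "\n".join(history)
    if "Observation" not in trace:
        return {"thought": "Read the total from the receipt.",
                "action": "OCR", "input": "receipt"}
    if "Action: Calculator" not in trace:
        return {"thought": "Double the extracted total.",
                "action": "Calculator", "input": "42 * 2"}
    return {"thought": "I have the result.", "answer": "84"}


def run_cota(question: str, max_steps: int = 5) -> str:
    """Interleave thoughts and tool actions until a final answer is produced."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model_step(history)
        history.append(f"Thought: {step['thought']}")
        if "answer" in step:
            return step["answer"]
        observation = TOOLS[step["action"]](step["input"])
        history.append(f"Action: {step['action']}({step['input']})")
        history.append(f"Observation: {observation}")
    return "No answer within step budget."


if __name__ == "__main__":
    print(run_cota("What is twice the total shown on the receipt?"))
```

In this sketch the model's intermediate thoughts and the tool observations accumulate in a shared trace, which mirrors the abstract's description of integrating both thoughts and action outputs before answering.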
Submission Number: 37