Keywords: multimodal models, reasoning, tool use
TL;DR: TACO: Learning Multi-modal Models to Reason and Act with Synthetic Chains-of-Thought-and-Action
Abstract: While open-source multi-modal language models perform well on simple question answering tasks, they often fail on complex questions that require multiple capabilities, such as fine-grained recognition, visual grounding, and reasoning, and that demand multi-step solutions. We present TACO, a family of multi-modal large action models designed to improve performance on such complex, multi-step, and multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA), executes intermediate steps by invoking external tools such as OCR, depth estimation, and a calculator, and then integrates both the thoughts and action outputs to produce coherent responses. To train TACO, we create a large dataset of over 1M synthetic CoTA traces generated with GPT-4o and Python programs. We then experiment with various data filtering and mixing techniques and obtain a final subset of 293K high-quality CoTA examples. This dataset enables TACO to learn complex reasoning and action paths, surpassing existing models trained on instruction-tuning data that contains only direct answers. TACO outperforms the instruction-tuned baseline across 8 benchmarks, achieving a 3.9% improvement on average, with gains of up to 20% on MM-Vet tasks involving OCR, mathematical reasoning, and spatial reasoning.
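To make the inference behavior described in the abstract concrete, the following is a minimal, hypothetical Python sketch of a chain-of-thought-and-action loop: the model alternates between emitting a thought, calling an external tool (e.g., OCR or a calculator), and reading the tool's output back into its context, until it produces a final answer. The function and tool names (`model_step`, `run_cota`, `ocr`, `calculator`) and the step format are illustrative assumptions, not TACO's actual interface.

```python
# Hypothetical sketch of a chain-of-thought-and-action (CoTA) inference loop.
# Tool names, the step schema, and `model_step` are illustrative assumptions,
# not the authors' implementation.

from typing import Callable, Dict, List


def ocr(image_region: str) -> str:
    """Stand-in OCR tool: would return text found in the given image region."""
    return "TOTAL: 42"


def calculator(expression: str) -> str:
    """Stand-in calculator tool: evaluates a simple arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}, {}))


TOOLS: Dict[str, Callable[[str], str]] = {"OCR": ocr, "Calculator": calculator}


def model_step(history: List[str]) -> dict:
    """Placeholder for the multi-modal model: given the trace so far, return
    either a tool call or a final answer. A real model would generate this
    structured step from the image and question."""
    trace = "\n".join(history)
    if "Observation" not in trace:
        return {"thought": "Read the total from the receipt.",
                "action": "OCR", "input": "receipt"}
    if "Action: Calculator" not in trace:
        return {"thought": "Double the extracted total.",
                "action": "Calculator", "input": "42 * 2"}
    return {"thought": "I have the result.", "answer": "84"}


def run_cota(question: str, max_steps: int = 5) -> str:
    """Interleave thoughts and tool actions until a final answer is produced."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model_step(history)
        history.append(f"Thought: {step['thought']}")
        if "answer" in step:
            return step["answer"]
        observation = TOOLS[step["action"]](step["input"])
        history.append(f"Action: {step['action']}({step['input']})")
        history.append(f"Observation: {observation}")
    return "No answer within step budget."


if __name__ == "__main__":
    print(run_cota("What is twice the total shown on the receipt?"))
```

In this sketch the model's intermediate thoughts and the tool observations accumulate in a shared trace, which mirrors the abstract's description of integrating both thoughts and action outputs before answering.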
Submission Number: 37