Keywords: Reasoning+LLM
Abstract: Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially for rendered table images. To address this gap, we introduce \textbf{Visual-TableQA}, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is \textbf{modular, scalable, and fully autonomous}: multiple reasoning LLMs collaborate across distinct roles of generation, validation, and inspiration. \textbf{Visual-TableQA} comprises 2.5k richly structured LaTeX-rendered tables and 9k reasoning-intensive QA pairs, all produced at a cost of under \$100. To promote diversity and creativity, the pipeline performs \textbf{multi-model collaborative data generation} via \textbf{cross-model prompting (‘inspiration’)} and LLM-jury filtering: stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on \textbf{Visual-TableQA} generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset’s synthetic nature. The full pipeline and resources are publicly available.
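The abstract describes a three-role generation loop (inspiration, generation, and validation via an LLM jury). Below is a minimal sketch of how such a loop could be wired together; it is not the authors' released pipeline. All role names, prompts, the `call`-style model interface, and the two-vote jury threshold are hypothetical placeholders chosen for illustration.

```python
# Hypothetical sketch of a multi-model collaborative generation loop:
# a stronger "seeder" model proposes table layouts/topics (inspiration),
# cheaper generator models elaborate them into LaTeX tables and QA pairs,
# and an LLM jury filters the results. Not the authors' implementation.
from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # any text-in/text-out model client

@dataclass
class Sample:
    latex_table: str
    qa_pairs: List[str]

def inspire(seeder: LLM, topic: str) -> str:
    """Strong model sketches a table layout and topic ('inspiration')."""
    return seeder(f"Propose a complex table layout and topic about: {topic}")

def generate(generator: LLM, seed: str) -> Sample:
    """Weaker model elaborates the seed into a LaTeX table plus QA pairs."""
    table = generator(f"Write a LaTeX table following this sketch:\n{seed}")
    qa = generator(f"Write 3 reasoning-intensive QA pairs about this table:\n{table}")
    return Sample(latex_table=table, qa_pairs=[qa])

def jury_accepts(jurors: List[LLM], sample: Sample, min_votes: int = 2) -> bool:
    """LLM-jury filtering: keep the sample only if enough jurors approve."""
    prompt = (
        "Answer YES or NO: is this table valid LaTeX and are the QA pairs "
        f"answerable from it?\n{sample.latex_table}\n{sample.qa_pairs}"
    )
    votes = sum("YES" in juror(prompt).upper() for juror in jurors)
    return votes >= min_votes

def build_dataset(seeder: LLM, generators: List[LLM], jurors: List[LLM],
                  topics: List[str]) -> List[Sample]:
    dataset = []
    for topic in topics:
        seed = inspire(seeder, topic)            # cross-model prompting
        for gen in generators:                   # each generator elaborates the seed
            sample = generate(gen, seed)
            if jury_accepts(jurors, sample):     # validation before inclusion
                dataset.append(sample)
    return dataset
```

The key design point reflected here is the separation of roles: seeding and judging can use stronger (more expensive) models sparingly, while bulk elaboration runs on cheaper models, which is consistent with the abstract's low reported generation cost.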
Supplementary Material: pdf
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 25623