Beyond NL2Code: A Systematic Survey of Multimodal Code Intelligence

TMLR Paper7832 Authors

08 Mar 2026 (modified: 25 Mar 2026) · Under review for TMLR · CC BY 4.0

Abstract: While Large Language Models (LLMs) have revolutionized text-to-code synthesis, conventional text-centric paradigms fail to capture the dense spatial hierarchies and structural constraints inherent in real-world visual contexts, such as user interfaces and scientific plots. To bridge this gap, Multimodal Code Intelligence has emerged as a pivotal domain, empowering Vision-Language Models (VLMs) to translate visual perception into precise executable code. This paper presents a structured taxonomy of this rapidly evolving landscape, categorizing the literature into four foundational domains: Graphical User Interfaces, Scientific Visualization, Structured Graphics, and Frontiers Frameworks. Within this framework, we analyze tasks ranging from mainstream web and chart synthesis to complex emerging scenarios, such as programmatic visual manipulation and code-to-video generation. From our analysis of existing benchmarks and methodologies, we identify four pivotal technical shifts that may shape future research: the transition from imitation-based training to reward-driven optimization, the progression from static synthesis toward dynamic interaction, the move toward unified, general-purpose models, and the shift from chat-based systems to autonomous agents. We envision this survey as a foundational guide to accelerate future advancements in multimodal code intelligence. The field is rapidly shifting from merely extracting basic functional logic to synthesizing high-fidelity, aesthetically refined, and dynamically interactive outputs through iterative refinement. Ultimately, we posit that code constitutes the universal action space for multimodal general intelligence. By empowering AI systems to translate complex visual intent into executable logic and autonomously navigate digital environments, visually-grounded code generation marks a definitive breakthrough toward autonomous software agents. A continuously updated project and resources associated with this survey have been released in an anonymous repository: https://anonymous.4open.science/r/Awesome-Multimodal-LLM-for-Code-2031

Submission Type: Long submission (more than 12 pages of main content)

Assigned Action Editor: ~Alessandro_Sordoni1

Submission Number: 7832
