Visual-Enhanced Multimodal Framework for Flexible Job Shop Scheduling Problem

Peng Zhao, Zhiguang Cao, Di Wang, Wen Song, Wei Pang, You Zhou, Yuan Jiang

Published: 27 Oct 2025, Last Modified: 27 Jan 2026
Venue: MM '25: Proceedings of the 33rd ACM International Conference on Multimedia
License: CC BY-SA 4.0
Abstract: Multimodal models leverage complementary information across modalities to enrich feature representations. While visual information shows potential in representing structure for some combinatorial optimization problems (COPs), its application to complex scheduling problems such as the Flexible Job Shop Scheduling Problem (FJSP) remains underexplored. Current learning-based FJSP solvers predominantly rely on handcrafted state features. This dependence can lead to inconsistencies and may not fully capture the problem's intricate dynamics. Crucially, these methods overlook visual modalities. Visual representations offer a distinct advantage by inherently capturing the global topological structure and complex resource interactions within the FJSP state. Unlike localized handcrafted features, this holistic, structural view provides a richer foundation for understanding scheduling complexity and making informed decisions. To overcome these limitations, we introduce the AO-framework, which leverages visual information, known for representing topological structures and providing richer state representations. This multimodal feature fusion approach enhances handcrafted state features by integrating insights from visual data. Our core contribution is a novel fusion mechanism utilizing orthogonal projection and local attention. Unlike traditional methods that often rely on simple concatenation of visual data, our method reduces redundancy by projecting global image-derived features onto local handcrafted features. This process extracts distinct information inherent to the visual modality, significantly improving the quality and complementarity of the resulting state features and enabling more informed scheduling decisions. To our knowledge, the AO-framework represents the first multimodal framework applied to scheduling problems, demonstrating the significant potential of visual information in this domain. Extensive experiments across various FJSP solvers and datasets confirm that our framework yields substantial enhancements in solution quality, decision-making capabilities, and generalization.
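The abstract describes the fusion only in words: image-derived features are projected onto handcrafted features so that redundant information can be removed. One plausible reading, sketched below in NumPy, is to subtract from each visual feature vector its component parallel to the matching handcrafted vector and keep the orthogonal residual. This is an illustrative assumption, not the authors' implementation: the function names `orthogonal_residual` and `fuse`, the feature sizes, and the final concatenation are all hypothetical, and the local attention step is omitted.

```python
import numpy as np


def orthogonal_residual(visual, handcrafted, eps=1e-8):
    """Return the part of `visual` orthogonal to `handcrafted`, row by row.

    Assumed reading of "orthogonal projection": project each visual vector
    onto its handcrafted counterpart and subtract that parallel component.
    """
    # Projection coefficient <v, h> / <h, h> for each row.
    dot_vh = np.sum(visual * handcrafted, axis=-1, keepdims=True)
    dot_hh = np.sum(handcrafted * handcrafted, axis=-1, keepdims=True)
    parallel = (dot_vh / (dot_hh + eps)) * handcrafted
    return visual - parallel


def fuse(visual, handcrafted):
    """Concatenate handcrafted features with the non-redundant visual residual."""
    residual = orthogonal_residual(visual, handcrafted)
    return np.concatenate([handcrafted, residual], axis=-1)


# Hypothetical sizes: 5 operations, 8-dimensional features per modality.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))  # handcrafted state features
v = rng.normal(size=(5, 8))  # image-derived features aligned to each operation
fused = fuse(v, h)
```

Under this reading, the residual carries only what the visual modality adds beyond the handcrafted features, which is the contrast the abstract draws with plain concatenation of the raw visual features.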