MSM: Multi-Scale Mamba in Multi-Task Dense Prediction

25 Sept 2024 (modified: 26 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: multi-task learning, representation learning, multiscale, mamba
Abstract: High-quality visual representations are crucial for success in multi-task dense prediction. The Mamba architecture, initially designed for natural language processing, has garnered interest for its potential in computer vision due to its efficient modeling of long-range dependencies. However, when applied to multi-task dense prediction, it reveals inherent limitations. Unlike text, which admits diverse tokenization strategies, images can be partitioned into tokens at many possible scales, and the choice of scale matters. In multi-task dense prediction, each task may require a specific level of granularity in scene structure. Unfortunately, existing vision Mamba designs, which segment images into patches at a single fixed scale, cannot satisfy these requirements, leading to sub-optimal performance. This paper proposes a simple yet effective Multi-Scale Mamba (MSM) for multi-task dense prediction. Firstly, we employ a novel Multi-Scale Scanning (MS-Scan) module to establish global feature relationships at various scales. This module enhances the model's capability to deliver a comprehensive visual representation by integrating information across scales. Secondly, we adaptively merge task-shared information from multiple scales across different task branches. This design not only meets the diverse granularity demands of various tasks but also facilitates more nuanced cross-task feature interactions. Extensive experiments on two challenging benchmarks, i.e., NYUD-V2 and PASCAL-Context, show the superiority of our MSM over its state-of-the-art competitors in multi-task dense prediction.
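
The abstract describes two components: scanning image tokens at several patch scales and adaptively merging the resulting multi-scale features per task branch. The sketch below is a minimal illustration of that idea, written only from the abstract; `MambaBlock` is a stand-in placeholder (a GRU, not a real selective state-space scan), and the patch scales, module names, and softmax-gated fusion are assumptions, not the authors' implementation.

```python
# Illustrative sketch of multi-scale scanning + per-task adaptive fusion.
# NOT the paper's code: `MambaBlock` is a placeholder sequence mixer, and all
# hyperparameters (scales, gating scheme) are assumed for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MambaBlock(nn.Module):
    """Placeholder sequence mixer; a real version would be a selective SSM (Mamba) scan."""
    def __init__(self, dim: int):
        super().__init__()
        self.mixer = nn.GRU(dim, dim, batch_first=True)  # stand-in for the Mamba scan
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                                # x: (B, L, C)
        y, _ = self.mixer(x)
        return self.norm(x + y)


class MSScan(nn.Module):
    """Scan image features at several patch scales; return one feature map per scale."""
    def __init__(self, dim: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.blocks = nn.ModuleList([MambaBlock(dim) for _ in scales])

    def forward(self, feat):                             # feat: (B, C, H, W)
        B, C, H, W = feat.shape
        outs = []
        for s, block in zip(self.scales, self.blocks):
            # Coarsen to scale s, flatten to a token sequence, scan, restore the layout.
            x = F.avg_pool2d(feat, kernel_size=s) if s > 1 else feat
            tokens = x.flatten(2).transpose(1, 2)        # (B, h*w, C)
            tokens = block(tokens)
            x = tokens.transpose(1, 2).reshape(B, C, H // s, W // s)
            outs.append(F.interpolate(x, size=(H, W), mode="bilinear", align_corners=False))
        return outs                                      # list of (B, C, H, W), one per scale


class TaskAdaptiveFusion(nn.Module):
    """Each task branch mixes the multi-scale features with its own learned softmax gates."""
    def __init__(self, num_scales: int, num_tasks: int):
        super().__init__()
        self.gates = nn.Parameter(torch.zeros(num_tasks, num_scales))

    def forward(self, scale_feats):                      # list of (B, C, H, W)
        stacked = torch.stack(scale_feats, dim=0)        # (S, B, C, H, W)
        weights = self.gates.softmax(dim=-1)             # (T, S)
        return [torch.einsum("s,sbchw->bchw", w, stacked) for w in weights]


if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)
    fused = TaskAdaptiveFusion(num_scales=3, num_tasks=2)(MSScan(64)(feat))
    print([f.shape for f in fused])                      # two task-specific (2, 64, 64... ) wait
```

Usage note: in this sketch each task branch receives its own scale-weighted combination of the shared multi-scale features, which is one plausible reading of "adaptively merge task-shared information from multiple scales across different task branches"; the actual MSM fusion mechanism may differ.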
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4534