MTSNet: Joint Feature Adaptation and Enhancement for Text-Guided Multi-view Martian Terrain Segmentation
Abstract: Martian terrain segmentation plays a crucial role in the autonomous navigation and safe driving of Mars rovers as well as in the global analysis of Martian geological landforms. However, most deep learning-based segmentation models cannot effectively handle the highly unstructured and unbalanced terrain distribution of the Martian surface, leading to inadequate adaptability and generalization. In this paper, we propose a novel multi-view Martian Terrain Segmentation framework (MTSNet) that develops an efficient Martian Terrain text-Guided Segment Anything Model (MTG-SAM) and combines it with a tailored Local Terrain Feature Enhancement Network (LTEN) to capture intricate terrain details. Specifically, the proposed MTG-SAM is equipped with a Terrain Context attention Adapter Module (TCAM) to efficiently and effectively unleash the model's adaptability and transferability on Mars-specific terrain distributions.
The Local Terrain Feature Enhancement Network (LTEN) is designed to compensate for the limitations of MTG-SAM in capturing fine-grained local terrain features of the Martian surface. A simple yet efficient Gated Fusion Module (GFM) is then introduced to dynamically merge the global contextual features from the MTG-SAM encoder with the locally refined features from the LTEN module for comprehensive terrain feature learning. Moreover, MTSNet accepts terrain-specific text as prompts, resolving the efficiency issue of existing methods that require costly annotation of bounding boxes or foreground points. Experimental results on the AI4Mars and ConeQuest datasets demonstrate that MTSNet effectively learns the unique Martian terrain feature distribution and achieves state-of-the-art performance on multi-view terrain segmentation from both the Mars rover and satellite remote sensing perspectives. Code is available at https://github.com/raoxuefeng/mtsnet.
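A minimal sketch of how a gated fusion step like the GFM described above might merge the global MTG-SAM encoder features with the local LTEN features; this is an assumed illustration, not the authors' implementation, and the module name, channel size, and layer choices are placeholders.

```python
import torch
import torch.nn as nn

class GatedFusionModule(nn.Module):
    """Illustrative gated fusion of a global and a local feature map."""
    def __init__(self, channels: int):
        super().__init__()
        # Gate predicted from the concatenated feature maps (1x1 conv + sigmoid).
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, global_feat: torch.Tensor, local_feat: torch.Tensor) -> torch.Tensor:
        # g in [0, 1] decides, per position and channel, how much each branch contributes.
        g = self.gate(torch.cat([global_feat, local_feat], dim=1))
        return g * global_feat + (1.0 - g) * local_feat

# Example: fuse two 256-channel feature maps of spatial size 64x64.
gfm = GatedFusionModule(channels=256)
fused = gfm(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
```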
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This paper presents a text-guided multi-view Martian Terrain Segmentation framework (MTSNet) consisting of an efficient Martian Terrain text-Guided Segment Anything Model (MTG-SAM) and a tailored Local Terrain Feature Enhancement Network (LTEN) to capture intricate terrain details. The integration of text prompts with visual data for terrain segmentation is a clear example of multimodal processing, and using text prompts to guide segmentation leverages linguistic information alongside visual information, enhancing the model's understanding of multimedia content. In summary, the paper's use of multimodal (visual and textual) data for terrain segmentation, its contribution to multimedia content analysis, and its application to remote sensing and autonomous navigation make it consistent with the themes and interests of multimedia/multimodal processing. It showcases how multimodal fusion-based models can be used to interpret complex multimedia data in challenging environments, demonstrating the effectiveness and advancement of such methods.
Supplementary Material: zip
Submission Number: 3939