Abstract: Recently, transformer-based techniques incorporating superpoints have become prevalent in 3D instance segmentation. However, they often suffer from over-segmentation, which is especially noticeable for large objects. In addition, unreliable superpoint mask predictions further compound this issue. To address these challenges, we propose a novel framework called MSTA3D. It leverages multi-scale feature representations and introduces a twin-attention mechanism to capture them effectively. Furthermore, MSTA3D integrates a box query with a box regularizer, providing a complementary spatial constraint alongside semantic queries. Experimental evaluations on the ScanNetV2, ScanNet200, and S3DIS datasets demonstrate that our approach surpasses state-of-the-art 3D instance segmentation methods.
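To make the abstract's architectural terms concrete, the following is a minimal, hypothetical sketch of a twin-attention decoder in which semantic queries and box queries cross-attend to multi-scale superpoint features. All module names, hyperparameters, and head designs (e.g., `TwinAttentionLayer`, the 6-parameter box head) are illustrative assumptions, not the authors' actual MSTA3D implementation.

```python
# Hypothetical sketch only: semantic and box queries attending to
# multi-scale superpoint features. Not the authors' code.
import torch
import torch.nn as nn


class TwinAttentionLayer(nn.Module):
    """One decoder layer: two parallel ("twin") cross-attention branches
    over one feature scale, followed by joint self-attention."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.sem_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.box_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, sem_q, box_q, feats):
        # Each query set attends to the same superpoint features.
        sem_q = self.norm(sem_q + self.sem_cross(sem_q, feats, feats)[0])
        box_q = self.norm(box_q + self.box_cross(box_q, feats, feats)[0])
        # Joint self-attention lets box queries act as a spatial
        # constraint on the semantic queries.
        q = torch.cat([sem_q, box_q], dim=1)
        q = self.norm(q + self.self_attn(q, q, q)[0])
        return q.split([sem_q.shape[1], box_q.shape[1]], dim=1)


class MSTA3DSketch(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_scales=3, num_layers=6):
        super().__init__()
        self.sem_q = nn.Embedding(num_queries, dim)
        self.box_q = nn.Embedding(num_queries, dim)
        self.layers = nn.ModuleList(
            [TwinAttentionLayer(dim) for _ in range(num_layers)]
        )
        self.num_scales = num_scales
        self.mask_head = nn.Linear(dim, dim)  # dot-product superpoint mask logits
        self.box_head = nn.Linear(dim, 6)     # axis-aligned box: center + size

    def forward(self, multi_scale_feats):
        # multi_scale_feats: list of (B, N_s, dim) superpoint features per scale.
        B = multi_scale_feats[0].shape[0]
        sem_q = self.sem_q.weight.unsqueeze(0).expand(B, -1, -1)
        box_q = self.box_q.weight.unsqueeze(0).expand(B, -1, -1)
        for i, layer in enumerate(self.layers):
            feats = multi_scale_feats[i % self.num_scales]  # cycle through scales
            sem_q, box_q = layer(sem_q, box_q, feats)
        # Mask logits against the finest-scale superpoints; boxes from box queries.
        masks = torch.einsum("bqd,bnd->bqn", self.mask_head(sem_q), multi_scale_feats[0])
        boxes = self.box_head(box_q)
        return masks, boxes


if __name__ == "__main__":
    # Toy example: three feature scales with different superpoint counts.
    feats = [torch.randn(2, n, 256) for n in (400, 200, 100)]
    masks, boxes = MSTA3DSketch()(feats)
    print(masks.shape, boxes.shape)  # (2, 100, 400), (2, 100, 6)
```

The cycling over scales and the shared layer norm are simplifications chosen to keep the sketch short; the actual interaction between the box regularizer and the mask predictions is described in the paper itself.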
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Media Interpretation
Relevance To Conference: 3D instance segmentation is the task of identifying and separating individual objects within a 3D scene, which involves detecting object boundaries and assigning a unique label to each identified object. Its role in computer vision has grown rapidly with the increasing demand for 3D perception in applications such as augmented/virtual reality, autonomous driving, robotics, and indoor scanning.
Supplementary Material: zip
Submission Number: 1440