TransCues: Boundary and Reflection-empowered Pyramid Vision Transformer for Semantic Transparent Object Segmentation

16 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: semantic segmentation, transparent object segmentation, pyramidal vision transformer
Submission Guidelines: I certify that this submission complies with the ICLR 2024 submission instructions.
TL;DR: We present a novel pyramidal transformer architecture with two object cues that significantly advances semantic transparent object segmentation and achieves strong performance across a variety of benchmark datasets.
Abstract: Although glass is a prevalent material in everyday life, most semantic segmentation methods struggle to distinguish it from opaque materials. We propose $\textbf{TransCues}$, a pyramidal transformer encoder-decoder architecture that segments transparent objects from a color image. To distinguish between glass and non-glass regions, our transformer architecture builds on two important visual cues: boundary and reflection feature learning. We implement this idea by introducing a Boundary Feature Enhancement (BFE) module paired with a boundary loss and a Reflection Feature Enhancement (RFE) module that decomposes reflections into foreground and background layers. We empirically show that these two modules can be used together effectively, leading to improved overall performance on various benchmark datasets. In addition to binary segmentation of glass and mirror objects, we further demonstrate that our method works well for generic semantic segmentation with both glass and non-glass labels. Our method outperforms the state-of-the-art methods by a large margin on diverse datasets, achieving $\textbf{+4.2}$\% mIoU on Trans10K-v2, $\textbf{+5.6}$\% mIoU on MSD, $\textbf{+10.1}$\% mIoU on RGBD-Mirror, $\textbf{+13.1}$\% mIoU on TROSD, and $\textbf{+8.3}$\% mIoU on Stanford2D3D, demonstrating the effectiveness and efficiency of our method.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 705