Abstract: The goal of referring image segmentation is to identify the object referred to by an input natural language expression. Previous methods support only English descriptions, even though Chinese is also widely used around the world, which limits the potential applications of this task. Therefore, we propose to extend existing datasets with Chinese descriptions and preprocessing tools for training and evaluating bilingual referring segmentation models. In addition, previous methods lack the ability to collaboratively learn channel-wise and spatial-wise cross-modal attention to properly align the visual and linguistic modalities. To tackle these limitations, we propose a Linguistic Excitation module to excite image channels guided by language information and a Linguistic Aggregation module to aggregate multimodal information based on image-language relationships. Since different levels of features from the visual backbone encode rich visual information, we also propose a Cross-Level Attentive Fusion module to fuse multilevel features gated by language information. Extensive experiments on four English and Chinese benchmarks show that our bilingual referring image segmentation model outperforms previous methods.
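Intuitively, the Linguistic Excitation module described above can be read as language-conditioned channel gating, similar to a squeeze-and-excitation block driven by a sentence embedding instead of pooled visual features. Below is a minimal PyTorch sketch of that idea under stated assumptions; the class name `LinguisticExcitation`, the tensor shapes, and all parameters are illustrative choices by the editor, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class LinguisticExcitation(nn.Module):
    """Channel-wise excitation of visual features conditioned on a sentence embedding.

    Hypothetical sketch: the language vector is projected to per-channel gates
    that rescale the visual feature map, i.e., language decides which channels
    to emphasize or suppress.
    """

    def __init__(self, visual_channels: int, lang_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(lang_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, visual_channels),
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, visual_feat: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, C, H, W); lang_feat: (B, lang_dim) pooled sentence embedding
        weights = self.gate(lang_feat)                  # (B, C)
        weights = weights.unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return visual_feat * weights                    # language-excited channels


# Usage example: excite ResNet stage-4 features (C=2048) with a 768-d text embedding
if __name__ == "__main__":
    module = LinguisticExcitation(visual_channels=2048, lang_dim=768)
    v = torch.randn(2, 2048, 20, 20)
    l = torch.randn(2, 768)
    print(module(v, l).shape)  # torch.Size([2, 2048, 20, 20])
```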