Abstract: Referring expression comprehension (REC) aims to locate the target object described by an expression. We observe that most graph-based REC methods focus only on establishing relations between all objects in an image and the given expression during graph construction, while ignoring the relationships between objects of the same category. As a result, these methods are sub-optimal at locating the target object, particularly when it is surrounded by objects of similar categories. Moreover, during reasoning, many objects irrelevant to the expression are considered, which introduces significant noise. To address these issues, this paper proposes a new graph-based group division network (GBGDN). Unlike existing work, our method partitions the constructed graphs into several sub-graphs based on the categories of objects and expressions. Within each sub-graph, the common visual features of the objects are strengthened through a feature enhancement strategy. The enhanced sub-graphs and expressions are then processed jointly by a filtering-based reasoning module designed to reduce the influence of unrelated nodes in each sub-graph, enabling more accurate reasoning and matching. Experimental results on several datasets, including RefCOCO/+/g, Flickr30K Entities, RefClef, and Ref-reasoning, show that the proposed method outperforms existing approaches. Importantly, our method requires no pre-training.
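The three stages named in the abstract (category-based partitioning, per-group feature enhancement, and filtering before matching) can be illustrated with a toy sketch. This is not the GBGDN implementation: the function names, the dictionary layout of `objects`, the mean-blending used for enhancement, and the cosine-similarity threshold used for filtering are all assumptions chosen only to make the data flow concrete.

```python
import math
from collections import defaultdict


def partition_by_category(objects):
    """Group object indices by category label: one hypothetical sub-graph per category."""
    groups = defaultdict(list)
    for idx, obj in enumerate(objects):
        groups[obj["category"]].append(idx)
    return dict(groups)


def enhance_group(features, alpha=0.5):
    """Blend each feature with the group mean (a stand-in for feature enhancement)."""
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    return [[(1 - alpha) * f[d] + alpha * mean[d] for d in range(dim)] for f in features]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_and_match(objects, expr_category, expr_vector, threshold=0.0):
    """Keep only the sub-graph matching the expression category, drop low-scoring
    nodes (assumed filtering rule), and return the index of the best match."""
    groups = partition_by_category(objects)
    candidates = groups.get(expr_category, [])
    if not candidates:
        return None
    enhanced = enhance_group([objects[i]["feature"] for i in candidates])
    scored = [(cosine(f, expr_vector), i) for f, i in zip(enhanced, candidates)]
    kept = [(score, i) for score, i in scored if score >= threshold]
    return max(kept)[1] if kept else None


# Toy input: two "person" objects and one "dog" with made-up 2-D features.
objects = [
    {"category": "person", "feature": [1.0, 0.0]},
    {"category": "person", "feature": [0.0, 1.0]},
    {"category": "dog", "feature": [0.0, 1.0]},
]
best = filter_and_match(objects, "person", [0.0, 1.0])
```

In this toy input the "dog" object has the same feature as the second "person", but it is never scored because it falls in a different group, which is the intuition behind restricting reasoning to category-consistent sub-graphs.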