Abstract: Highlights
• A graph-based framework is proposed for cross-modal aggregation and disentanglement.
• Multi-granularity semantic consistency learning measures the consistency between the original and disentangled representations.
• Extensive experiments on the Flickr30K and MS-COCO datasets demonstrate our method’s superiority.