Abstract: Highlights
• We propose a knowledge projector network that treats the prior knowledge in CLIP (Radford et al., 2021) as a teacher to guide the slot attention generation process.
• An adaptive weighted fusion module incorporates global features into the slot representations.
• An effective similarity calculation method is proposed for comparison with fine-grained image–text matching methods. The results indicate that our method outperforms CLIP and the most recent image–text alignment algorithms.
External IDs: dblp:journals/ipm/DongZHZLH26