Abstract: Highlights•We present a hierarchical visual-semantic interaction (HVSI) method to exploit multi-scale interaction of visual and semantic features.•The proposed HVSI method is an end-to-end framework and does not depend on any pre-trained language model.•A visual-semantic alignment module is proposed to alleviate score gap between visual and semantic features.•Extensive experiments on multiple benchmarks show that HVSI achieves state-of- the-art or competitive performances.
Loading