Abstract: Highlights
• A novel relationship-guided vision-language Transformer is proposed for FAR.
• An image-text cross-attention enhances the interaction between text and image tokens, and a token selection mechanism reduces interference from the image background (a minimal sketch follows this list).
• An image-text alignment loss is designed for further modality alignment.
• Experiments verify the superiority of our method, especially with limited labeled data.
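The second highlight combines cross-attention from text tokens to image tokens with a token-selection step. The sketch below illustrates one plausible form of that combination under a PyTorch setting; the class name `ImageTextCrossAttention`, the dimensions, the keep ratio, and the attention-score selection criterion are illustrative assumptions, not the paper's actual design.

```python
# A minimal sketch (not the authors' implementation) of image-text
# cross-attention followed by a simple top-k image-token selection.
import torch
import torch.nn as nn


class ImageTextCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, keep_ratio: float = 0.5):
        super().__init__()
        # Text tokens act as queries; image tokens act as keys and values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.keep_ratio = keep_ratio  # assumed fraction of image tokens to keep

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor):
        # text_tokens: (B, T, D), image_tokens: (B, N, D)
        fused, attn_w = self.cross_attn(text_tokens, image_tokens, image_tokens)

        # Token selection (assumed criterion): keep the image tokens that
        # receive the most attention from the text side, dropping tokens
        # that are likely background.
        scores = attn_w.mean(dim=1)                        # (B, N) relevance per image token
        k = max(1, int(self.keep_ratio * image_tokens.size(1)))
        idx = scores.topk(k, dim=-1).indices               # (B, k) retained token indices
        idx = idx.unsqueeze(-1).expand(-1, -1, image_tokens.size(-1))
        selected = image_tokens.gather(1, idx)             # (B, k, D)
        return fused, selected


if __name__ == "__main__":
    # Random features standing in for ViT / text-encoder outputs.
    B, T, N, D = 2, 8, 49, 256
    module = ImageTextCrossAttention(dim=D)
    fused, selected = module(torch.randn(B, T, D), torch.randn(B, N, D))
    print(fused.shape, selected.shape)  # (2, 8, 256) and (2, 24, 256)
```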