Abstract: Highlights
• Introduce MSVG, a novel framework for visual grounding in remote sensing.
• Propose an MTAM module for multi-stage visual–textual feature alignment.
• Propose a VEFM module that refines correlation, ensuring precise localization.
• Achieve new SOTA results on both the RefCOCO and DIOR-RSVG datasets.