Abstract: With the development of large language models (LLMs) and remote sensing technology, visual language (VL) tasks in remote sensing have attracted increasing research attention. Commonly used VL datasets typically focus on the overall scene of an image and lack descriptions of instance-level details such as location, size, and category. More importantly, these datasets do not allow users to directly select regions of an image and ask questions about them. However, instance-level question answering grounded in such information is of great significance for target extraction in practical applications. In this manuscript, we build an innovative and challenging dataset, FAIR1M-GQA. It enables a model to learn directly from text inputs and text outputs that both contain region coordinates directly linked to fine-grained objects in remote sensing images. We conduct experiments on our dataset to verify the feasibility of the task and provide benchmark results.