MISTA: A Large-Scale Dataset for Multi-Modal Instruction Tuning on Aerial Images

Hao Wu, Ke Lu, Yuqiu Li, Junhao Huang, Jian Xue

Published: 01 Jan 2024 · Last Modified: 17 May 2025 · ICME 2024 · CC BY-SA 4.0
Abstract: This paper introduces MISTA, a novel dataset for visual instruction tuning on aerial imagery, designed to extend large multi-modal models to remote sensing applications. Derived from the DOTA-v2.0 aerial object detection benchmark, MISTA uniformly processes high-resolution images into 2048×2048-pixel patches, yielding a detailed and complex dataset tailored to remote sensing analysis. To construct the dataset, we design an automated annotation pipeline that employs advanced language models such as GPT-4 and LLaVA-1.5 to generate diverse and specialized instruction-following data. The annotations cover multiple instruction types, namely multi-turn conversation, detailed description, and complex reasoning, each reflecting the intricacies of remote sensing tasks. Subdividing aerial images into individually annotated sub-patches enriches the dataset and enables more granular analysis of visual content. As a robust foundation for multi-modal model development in remote sensing, MISTA sets the stage for future research and further applications in the field.
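As a rough illustration of the kind of pipeline the abstract describes, the sketch below tiles a high-resolution aerial image into 2048×2048-pixel patches and wraps a generated annotation in a LLaVA-style instruction record. The tiling scheme, file layout, and field names here are assumptions for illustration only, not the authors' released MISTA format.

```python
# Minimal sketch, assuming a DOTA-style image file and a LLaVA-style JSON
# schema; field names and tiling details are illustrative, not the MISTA spec.
import json
from pathlib import Path

from PIL import Image

PATCH = 2048  # MISTA standardizes patches to 2048x2048 pixels


def tile_image(image_path: Path, out_dir: Path) -> list[Path]:
    """Split a high-resolution aerial image into non-overlapping 2048x2048 patches.

    Border regions smaller than a full patch are simply dropped in this sketch.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    img = Image.open(image_path)
    patches = []
    for top in range(0, img.height - PATCH + 1, PATCH):
        for left in range(0, img.width - PATCH + 1, PATCH):
            patch = img.crop((left, top, left + PATCH, top + PATCH))
            patch_path = out_dir / f"{image_path.stem}_{top}_{left}.png"
            patch.save(patch_path)
            patches.append(patch_path)
    return patches


def make_record(patch_path: Path, instruction_type: str,
                question: str, answer: str) -> dict:
    """Wrap one model-generated annotation as an instruction-following record."""
    return {
        "image": patch_path.name,
        # One of the three instruction types mentioned in the abstract:
        # "conversation", "detail", or "reasoning".
        "type": instruction_type,
        "conversations": [
            {"from": "human", "value": f"<image>\n{question}"},
            {"from": "gpt", "value": answer},
        ],
    }


if __name__ == "__main__":
    # Hypothetical filenames; replace with actual DOTA-v2.0 images and
    # answers produced by the annotation models (GPT-4 / LLaVA-1.5).
    patches = tile_image(Path("P0001.png"), Path("patches"))
    records = [
        make_record(p, "detail",
                    "Describe this aerial scene in detail.",
                    "placeholder answer from the annotation model")
        for p in patches
    ]
    Path("mista_sample.json").write_text(json.dumps(records, indent=2))
```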