TL;DR: GeoPixel is a high-resolution remote sensing multimodal model enabling pixel-level grounding, accompanied by GeoPixelD, a dataset curated via a semi-automated, scalable pipeline that integrates prior-informed visual prompting with state-of-the-art LMMs.
Abstract: Recent advances in large multimodal models (LMMs) have recognized fine-grained grounding as an imperative factor in visual understanding and dialogue. However, the benefits of such representation in LMMs are limited to the natural image domain, and these models perform poorly for remote sensing (RS). The distinct overhead viewpoint, scale variation, and presence of small objects in high-resolution RS imagery present unique challenges for region-level comprehension. Moreover, the development of grounded conversation capability in LMMs within RS is hindered by the lack of granular, RS domain-specific grounded data. Addressing these limitations, we propose GeoPixel - the first end-to-end high-resolution RS-LMM that supports pixel-level grounding. This capability allows fine-grained visual perception by generating interleaved masks in conversation. GeoPixel supports up to 4K HD resolution in any aspect ratio, making it ideal for high-precision RS image analysis. To support grounded conversation generation (GCG) in RS imagery, we curate a visually grounded dataset, GeoPixelD, through a semi-automated pipeline that utilizes set-of-marks prompting and spatial priors tailored for RS data to methodically control the data generation process. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks. Our methodological ablation studies validate the effectiveness of each component in the overall architecture. Our code and data will be publicly released.
Lay Summary: Modern artificial intelligence (AI) systems are getting better at understanding both pictures and language at the same time. These models are called large multimodal models (LMMs). But most of these models are trained on everyday photos, and they don’t work well with satellite or aerial images used in remote sensing (RS) that show the Earth from above and help with tasks like mapping, urban planning, and environmental monitoring.
We propose GeoPixel, an LMM designed especially for remote sensing that can perform “pixel grounding”: it can look at a satellite image, understand what’s in it, and then link specific words to the exact parts of the image they describe, down to the pixel level. This allows for a detailed and accurate understanding of what’s in the image.
Here’s how GeoPixel stands out:
- It works with high-resolution images (up to 4K quality), which is important for spotting small objects like cars from above.
- It can generate detailed descriptions of remote sensing images while also providing segmentation masks that highlight exactly where objects are in the image.
- We also created an RS dataset called GeoPixelD. This includes over 600,000 objects and associated descriptions, helping the model learn to understand and describe remote sensing images accurately.
- Compared to existing models, GeoPixel performs better at identifying and describing multiple objects in a single image and is especially good at linking language with specific visual features.
GeoPixel outperformed other models in describing scenes and identifying objects in complex, high-resolution remote sensing images. It can help with practical applications such as disaster response, environmental monitoring, and infrastructure planning by offering more precise, grounded, and interpretable information.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/mbzuai-oryx/GeoPixel
Primary Area: Applications->Computer Vision
Keywords: Large Multimodal Models (LMMs), Remote Sensing (RS), Vision Language Models (VLMs), Computer Vision, Segmentation, Pixel Grounding
Flagged For Ethics Review: true
Submission Number: 6282