Keywords: Computer Vision, Remote Sensing, Geospatial AI, Human-in-the-loop
Abstract: Map generation tasks feature extensive non-structural *vectorized data* (e.g., points, polylines, and polygons) and thus pose significant challenges to common pixel-wise generative models. Conventional approaches use multiple stages, first segmenting these features at the pixel level and then vectorizing them in post-processing, with errors and complexity compounding at each stage. Motivated by the recent success of auto-regressive language modeling, we propose the first map foundation model, named Map Auto-Regressor (MARS), that is capable of generating both multi-polyline road networks and polygon buildings in a unified manner. To train MARS, we collected what is, to our knowledge, the largest multi-class map extraction dataset, totaling 3.4M examples, which we call MAP-3M. Across four road and building datasets, MARS matches or outperforms multistage baselines. Additionally, we develop a "Chat with MARS" capability that enables interactive human-in-the-loop map generation and correction, supported by the auto-regressive nature of our end-to-end approach.
We release our MAP-3M dataset and project demo page at (1) https://huggingface.co/datasets/bag-lab/MAP-3M and (2) https://huggingface.co/spaces/bag-lab/MARS, respectively.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10088