Abstract: A soundscape is defined by the acoustic environment a person perceives at a location. In this work, we propose a framework for mapping soundscapes across the Earth. Since soundscapes involve sound distributions that span varying spatial scales, we represent locations with multi-scale satellite imagery and learn a joint representation among this imagery, audio, and text. To capture the inherent uncertainty in the soundscape of a location, we additionally design the representation space to be probabilistic. We also fuse ubiquitous metadata (including geolocation, time, and data source) to enable learning of spatially and temporally dynamic representations of soundscapes. We demonstrate the utility of our framework by creating large-scale soundscape maps that integrate both audio and text with temporal control. To facilitate future research on this task, we also introduce a large-scale dataset, GeoSound, containing over 300k geotagged audio samples paired with both low- and high-resolution satellite imagery. We demonstrate that our method outperforms the existing state-of-the-art on both GeoSound and the SoundingEarth dataset. Our dataset and code will be made available at TBD.
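To make the abstract's framing more concrete, below is a minimal sketch (in PyTorch) of one way a probabilistic multimodal embedding space could be set up: each modality branch predicts a Gaussian embedding (mean and log-variance), and matched pairs are aligned with an InfoNCE-style objective over sampled embeddings. All module names, dimensions, and the specific matching loss are illustrative assumptions, not the submission's exact architecture.

```python
# Sketch of a probabilistic multimodal embedding space (illustrative only;
# names, dimensions, and the loss are assumptions, not the paper's method).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProbabilisticHead(nn.Module):
    """Maps a modality feature vector to a Gaussian embedding (mean, log-variance)."""
    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, embed_dim)
        self.logvar = nn.Linear(in_dim, embed_dim)

    def forward(self, x: torch.Tensor):
        return self.mu(x), self.logvar(x)


def sample_embeddings(mu, logvar, n_samples: int = 8):
    """Draw reparameterized samples from the Gaussian embedding: (S, B, D)."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn(n_samples, *mu.shape, device=mu.device)
    return mu.unsqueeze(0) + eps * std.unsqueeze(0)


def contrastive_loss(z_a, z_b, temperature: float = 0.07):
    """InfoNCE-style loss over averaged sampled embeddings (one possible
    probabilistic matching objective among several)."""
    a = F.normalize(z_a.mean(dim=0), dim=-1)  # (B, D)
    b = F.normalize(z_b.mean(dim=0), dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    B, feat_dim, embed_dim = 4, 512, 128  # hypothetical sizes
    image_head = ProbabilisticHead(feat_dim, embed_dim)  # satellite-imagery branch
    audio_head = ProbabilisticHead(feat_dim, embed_dim)  # audio branch
    img_feat, aud_feat = torch.randn(B, feat_dim), torch.randn(B, feat_dim)
    z_img = sample_embeddings(*image_head(img_feat))
    z_aud = sample_embeddings(*audio_head(aud_feat))
    print(contrastive_loss(z_img, z_aud).item())
```

In this kind of setup, the predicted variance lets a location with an ambiguous or highly variable soundscape occupy a broader region of the embedding space, which is one way to represent the uncertainty the abstract refers to.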
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Experience] Multimedia Applications, [Content] Media Interpretation
Relevance To Conference: We focus on soundscape mapping. Our framework combines information from audio, text, overhead imagery, and available metadata. Using our framework, we can build large-scale soundscape maps using audio or textual queries. We believe our work will interest the multimedia community, particularly those focused on remote-sensing multimedia applications. Our proposed framework for learning a multimodal probabilistic embedding space should inspire other multimodal learning frameworks, especially where the correspondence between modalities is naturally noisy.
Supplementary Material: zip
Submission Number: 5080