AMG-Embedding: a Self-Supervised Embedding Approach for Audio Identification

Published: 20 Jul 2024, Last Modified: 05 Aug 2024. Venue: MM2024 Poster. License: CC BY 4.0
Abstract: Audio identification aims to retrieve exact matches from a vast music repository given a query audio snippet. The need for specificity and granularity has traditionally led fingerprinting approaches to represent music audio with numerous short, fixed-duration, overlapping segment (shingle) features. However, fingerprinting constrains scalability and efficiency, because hundreds or even thousands of embeddings are generated to represent a typical music track. In this paper, we present a self-supervised approach called Angular Margin Guided Embedding (AMG-Embedding). AMG-Embedding is built on a traditional fingerprinting encoder and represents variable-duration, non-overlapping segments as embeddings through a two-stage embedding and class-level learning process. AMG-Embedding significantly reduces the number of generated embeddings while still achieving highly specific fragment-level audio identification. Experimental results demonstrate that AMG-Embedding achieves retrieval accuracy comparable to the underlying fingerprinting approach while consuming less than one tenth of its storage and retrieval time. These efficiency gains position our approach as a promising solution for scalable and efficient audio identification systems.
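As a rough illustration of why overlapping fixed-duration shingles produce so many embeddings per track, the sketch below counts embeddings under both segmentation schemes. The track length, window, hop, and average segment duration are hypothetical values chosen for the example and are not taken from the paper; the function names are likewise illustrative.

```python
import math


def count_overlapping_segments(track_s: float, window_s: float, hop_s: float) -> int:
    """Number of fixed-duration windows that fit in a track when sliding by hop_s."""
    if track_s < window_s:
        return 0
    return math.floor((track_s - window_s) / hop_s) + 1


def count_nonoverlapping_segments(track_s: float, avg_segment_s: float) -> int:
    """Number of back-to-back segments needed to cover a track.

    Variable-duration segments are approximated here by their average length.
    """
    return math.ceil(track_s / avg_segment_s)


# Hypothetical settings: a 4-minute track, 1 s windows with a 0.5 s hop,
# versus non-overlapping segments averaging 10 s each.
fixed = count_overlapping_segments(240.0, 1.0, 0.5)
variable = count_nonoverlapping_segments(240.0, 10.0)
print(fixed, variable)
```

Under these assumed settings the index holds far fewer entries per track, which is the kind of reduction that would lower both storage and search time; the actual ratio reported in the abstract depends on the authors' own segment durations.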
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Systems] Data Systems Management and Indexing
Relevance To Conference: This work contributes to multimedia processing by addressing the task of audio identification within vast music repositories. Traditional approaches to audio identification rely on fixed-duration fingerprinting techniques, which scale poorly because they generate numerous embeddings per track. Our proposed method, Angular Margin Guided Embedding (AMG-Embedding), is a self-supervised approach that addresses these limitations. By representing variable-duration, non-overlapping segments as embeddings through a two-stage embedding and class-level learning process, AMG-Embedding achieves highly specific fragment-level audio identification while significantly reducing the number of generated embeddings. Our experimental results demonstrate that AMG-Embedding offers retrieval accuracy comparable to traditional fingerprinting approaches while consuming substantially less storage and retrieval time. These efficiency gains position our approach as a promising solution for scalable and efficient audio identification systems. It thereby advances multimedia processing by improving the effectiveness and efficiency of audio retrieval in multimedia repositories.
Submission Number: 5298