Abstract: This paper addresses the problem of multi-speaker sound source localization (SSL). Previously, we developed a neural network that integrates the von Mises-Bernoulli (vM-B) distribution into a Residual Network (ResNet) architecture, enabling robust SSL by leveraging the periodicity of phase information to mitigate environmental noise. Building on this, we proposed a Video Vision Transformer (ViViT)-based method for SSL, which not only demonstrated superior robustness in real-world environments compared to the vM-B ResNet but also achieved higher localization accuracy. However, both methods were limited to single-source localization. In this study, we extend these approaches by modifying the ViViT architecture to accommodate multi-source sound localization. Additionally, we incorporate the von Mises-Bernoulli distribution into the ViViT framework to further enhance robustness against varying environmental conditions. Experimental results confirm the effectiveness of the proposed method.
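As a rough illustration of how a von Mises-Bernoulli output encoding can support multi-source localization (the paper's exact formulation is not reproduced here), the sketch below builds a per-azimuth-bin target vector whose bins near each true source direction approach 1, so a network can be trained with per-bin binary cross-entropy (Bernoulli likelihood). The bin count, concentration parameter kappa, and the per-bin maximum used to combine multiple sources are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def vmb_target(source_azimuths_deg, n_bins=72, kappa=8.0):
    """Encode ground-truth azimuths as a von Mises-shaped multi-label target.

    Each azimuth bin receives a value in [0, 1]; bins close to a true source
    approach 1. Hypothetical parameters, for illustration only.
    """
    bin_centers = np.deg2rad(np.arange(n_bins) * (360.0 / n_bins))
    target = np.zeros(n_bins)
    for az in source_azimuths_deg:
        az_rad = np.deg2rad(az)
        # von Mises kernel centred on the source direction, peak normalised to 1
        kernel = np.exp(kappa * (np.cos(bin_centers - az_rad) - 1.0))
        # multi-source case: combine kernels with a per-bin maximum
        target = np.maximum(target, kernel)
    return target

if __name__ == "__main__":
    t = vmb_target([30.0, 135.0])
    print(np.argsort(t)[-4:])  # indices of the most strongly activated bins
```

The circular cosine term in the kernel respects the periodicity of azimuth, which is the same property the abstract attributes to the phase-based vM-B modelling.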