Abstract: We propose an online neural diarization method based on TS-VAD, a model that performs remarkably well on highly overlapped speech. We introduce online VBx to supply TS-VAD with the target-speaker embeddings it requires. First, while the amount of observed data is still insufficient, only online VBx is run, in order to accumulate speaker information. Afterwards, a separate offline subsystem extracts i-vectors from core samples, and these i-vectors are used for TS-VAD online decoding. Finally, we devise a speaker selection strategy that allows TS-VAD to handle an unknown number of speakers. We evaluate our system on the AliMeeting dataset. The experimental results demonstrate that our online method can effectively handle highly overlapped audio.
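The staged control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name OnlineDiarizer, the warm-up threshold min_frames, the capacity max_speakers, and the rule of keeping the speakers with the most accumulated frames are all hypothetical stand-ins, since the abstract does not specify how sufficiency is measured or how speakers are selected. The VBx clustering, i-vector extraction, and TS-VAD model themselves are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class OnlineDiarizer:
    # All names and thresholds below are illustrative assumptions.
    min_frames: int = 3      # hypothetical amount of audio needed before TS-VAD starts
    max_speakers: int = 4    # hypothetical number of speakers TS-VAD can take as input
    buffered: int = 0
    speaker_frames: dict = field(default_factory=dict)

    def step(self, chunk_frames: int, vbx_label: str) -> str:
        """Consume one chunk and report which stage would decode it."""
        # Online VBx always runs, so speaker information keeps accumulating.
        self.buffered += chunk_frames
        self.speaker_frames[vbx_label] = (
            self.speaker_frames.get(vbx_label, 0) + chunk_frames
        )
        # Until enough audio has been seen, only the VBx output is used.
        if self.buffered < self.min_frames:
            return "vbx_only"
        return "ts_vad"

    def select_speakers(self) -> list:
        """Assumed selection rule: keep the speakers with the most frames."""
        ranked = sorted(
            self.speaker_frames,
            key=lambda s: (-self.speaker_frames[s], s),  # ties broken by label
        )
        return ranked[: self.max_speakers]
```

In this sketch, step would be called once per incoming chunk with the label assigned by online VBx, and select_speakers would be consulted before each TS-VAD pass to decide which speakers' i-vectors to feed the model.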