Abstract: We investigate the efficiency of two very different spoken term detection approaches for transcription when the available data is insufficient
to train a robust speech recognition system.
This work is grounded in a very low-resource
language documentation scenario where only
a few minutes of recording have been transcribed for a given language so far. Experiments on two oral languages show that a pretrained universal phone recognizer, fine-tuned
with only a few minutes of target language
speech, can be used for spoken term detection through searches in phone confusion networks with a lexicon expressed as a finite-state
automaton. Experimental results show
that a phone-recognition-based approach provides better overall performance than Dynamic Time Warping when working with clean
data, and highlight the benefits of each method for two types of speech corpora.
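
As a rough illustration of what searching a phone confusion network for a lexicon entry can look like, the sketch below scores one term against every window of consecutive slots. It is a simplification under stated assumptions, not the system evaluated here: the network is assumed to be a list of slots mapping phones to posteriors, the lexicon entry is treated as a linear acceptor (one phone per slot, no insertions or deletions), and the names `detect_term`, `confusion_network`, `term_phones`, and `floor` are hypothetical.

```python
import math


def detect_term(confusion_network, term_phones, floor=1e-6):
    """Find the best-scoring occurrence of one term in a confusion network.

    confusion_network: list of slots, each a dict mapping phone -> posterior.
    term_phones: list of phones spelling one lexicon entry (assumed format).
    floor: posterior assigned to phones absent from a slot (assumed value).
    Returns (start_slot, log_score), or (None, -inf) if the term is too long.
    """
    n, m = len(confusion_network), len(term_phones)
    best_start, best_score = None, -math.inf
    for start in range(n - m + 1):
        # Sum log-posteriors of the term's phones over consecutive slots.
        score = 0.0
        for offset, phone in enumerate(term_phones):
            posterior = confusion_network[start + offset].get(phone, floor)
            score += math.log(posterior)
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Toy network with four slots; the posteriors are made up for illustration.
network = [
    {"a": 0.9, "o": 0.1},
    {"k": 0.6, "g": 0.4},
    {"a": 0.7, "e": 0.3},
    {"t": 0.8, "d": 0.2},
]
start, score = detect_term(network, ["g", "a"])
```

Because the search reads posteriors from every alternative in a slot, a term can be detected even when its phones are not the top hypothesis, as with "g" in the second slot of the toy network. A full finite-state lexicon would additionally share prefixes across entries and could model insertions and deletions, which this sketch omits.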