Improved phonotactic language identification using random forest language models

Xiaorui Wang, Shijin Wang, Jiaen Liang, Bo Xu

Published: 2008, Last Modified: 15 May 2025, ICASSP 2008, CC BY-SA 4.0

Abstract: Recently, a new language model, the random forest language model (RFLM), has been proposed and has shown encouraging results in speech recognition tasks. In this paper, we applied the RFLM to language identification tasks. We proposed a shared backoff smoothing method to deal with the data sparseness problem. Experiments were conducted on a subset of the NIST 2003 language recognition evaluation data. The RFLM obtained a 15.7% relative error rate reduction compared with the standard trigram LM. The RFLM can be used as a counterpart to the n-gram LM and the BTLM for system fusion. We also empirically studied the relation between system performance and the number of trees in an RFLM.
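
For readers unfamiliar with the two quantities mentioned in the abstract, the following is a hedged sketch and not the paper's own notation. It assumes the usual random forest LM formulation, in which each of M randomized decision trees maps a history h to an equivalence class and the forest averages the per-tree probabilities, and it assumes the conventional definition of relative error rate reduction. The symbols M, \Phi_j, P_j, E_trigram and E_RFLM are introduced here for illustration only.

```latex
% Assumed general form of a random forest LM: average over M tree-based models,
% where \Phi_j(h) is the equivalence class of history h under tree j.
P_{\mathrm{RF}}(w \mid h) = \frac{1}{M} \sum_{j=1}^{M} P_j\bigl(w \mid \Phi_j(h)\bigr)

% Assumed conventional definition of relative error rate reduction,
% with the 15.7% figure taken from the abstract.
\frac{E_{\mathrm{trigram}} - E_{\mathrm{RFLM}}}{E_{\mathrm{trigram}}} = 15.7\%
```

Under this reading, the "tree numbers" study in the abstract corresponds to varying M.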