A Better Phone Set for the TIMIT Dataset Discovered in Clustering of Listen, Attend and Spell


Sep 20, 2018 (edited Sep 10, 2019). NIPS 2018 Workshop IRASL Blind Submission. Readers: Everyone
  • Keywords: LAS, attention, LSTM, cluster, phone, conditional entropy, acoustic modeling, speech
  • Abstract: Listen, Attend and Spell (LAS) maps a sequence of acoustic spectra directly to a sequence of graphemes, with no explicit internal representation of phones. This paper asks whether LAS can be used as a scientific tool, to discover the phone set of a language whose phone set may be controversial or unknown. Phonemes have a precise linguistic definition, but phones may be defined in any manner that is convenient for speech technology: we propose that a practical phone set is one that can be inferred from speech following certain procedures, but that is also highly predictive of the word sequence. We demonstrate that such a phone set can be inferred by clustering the hidden node activation vectors of an LAS model during training, thus encouraging the model to learn a hidden representation characterized by acoustically compact clusters that are nevertheless predictive of the word sequence. We further define a metric for the quality of a phone set (sum of conditional entropy of the phone set given graphemes, and given acoustics), and demonstrate that according to this metric, the clustered-LAS phone set is better than the original TIMIT phone set.
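
The phone-set quality metric named in the abstract (conditional entropy of the phone set given graphemes, plus conditional entropy given acoustics) might be estimated along the following lines. This is a minimal sketch, not the authors' implementation: it assumes each frame or token already carries a discrete phone label, an aligned grapheme label, and a quantized acoustic code. The alignment, the quantization of the acoustics, and the names `conditional_entropy` and `phone_set_score` are all hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def conditional_entropy(targets, conditions):
    """Empirical H(target | condition) in bits, from paired discrete labels."""
    if len(targets) != len(conditions):
        raise ValueError("targets and conditions must have the same length")
    n = len(targets)
    joint = Counter(zip(conditions, targets))  # counts of (condition, target) pairs
    marginal = Counter(conditions)             # counts of each condition value
    h = 0.0
    for (cond, _target), count in joint.items():
        # p(cond, target) * log2 p(target | cond)
        h -= (count / n) * math.log2(count / marginal[cond])
    return h


def phone_set_score(phones, graphemes, acoustic_codes):
    """Sum of H(phone | grapheme) and H(phone | acoustic code).

    Assumption: acoustics are represented as discrete codes (for example,
    cluster indices); the source does not specify how they are discretized.
    """
    return (conditional_entropy(phones, graphemes)
            + conditional_entropy(phones, acoustic_codes))


if __name__ == "__main__":
    # Toy labels, invented purely for illustration.
    phones = ["a", "a", "b", "b"]
    graphemes = ["x", "x", "y", "y"]   # fully determines the phone
    acoustic_codes = [0, 0, 0, 0]      # carries no information about the phone
    print(phone_set_score(phones, graphemes, acoustic_codes))
```

In this toy case the graphemes determine the phone exactly, so the first term is zero, while the constant acoustic code leaves one bit of uncertainty over two equally likely phones.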