Abstract: Listen, Attend and Spell (LAS) maps a sequence of acoustic spectra directly to a sequence of graphemes, with no explicit internal representation of phones. This paper asks whether LAS can be used as a scientific tool to discover the phone set of a language whose phone set may be controversial or unknown. Phonemes have a precise linguistic definition, but phones may be defined in any manner that is convenient for speech technology: we propose that a practical phone set is one that can be inferred from speech following certain procedures, but that is also highly predictive of the word sequence. We demonstrate that such a phone set can be inferred by clustering the hidden-node activation vectors of an LAS model during training, thus encouraging the model to learn a hidden representation characterized by acoustically compact clusters that are nevertheless predictive of the word sequence. We further define a metric for the quality of a phone set (the sum of two conditional entropies: that of the graphemes given the phones, and that of the phones given the acoustics), and demonstrate that, according to this metric, the clustered-LAS phone set is comparable to the original TIMIT phone set. Specifically, the clustered-LAS phone set is closer to the acoustics, whereas the original TIMIT phone set is closer to the text.
Keywords: Listen, Attend and Spell; sequence-to-sequence attention mechanisms; clustering; phones; entropy
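
As an illustration of the phone-set quality metric mentioned in the abstract, the sketch below estimates one of its two terms, the conditional entropy of graphemes given phones, from empirical counts. It is a minimal sketch under stated assumptions, not the paper's procedure: the function name `conditional_entropy`, the cluster labels `k1` and `k2`, and the toy alignment are all hypothetical, and it assumes that phones and graphemes have already been aligned into discrete pairs. The second term (phones given acoustics) involves continuous acoustic input and would presumably have to be estimated from a model's posterior distribution, which is not shown here.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Empirical conditional entropy H(Y | X) in bits from (x, y) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)                      # counts of each (x, y)
    marginal_x = Counter(x for x, _ in pairs)   # counts of each x
    h = 0.0
    for (x, _y), count in joint.items():
        # p(x, y) * log2 p(y | x), accumulated with a minus sign
        h -= (count / n) * math.log2(count / marginal_x[x])
    return h


# Hypothetical toy alignment of (phone cluster, grapheme) pairs.
aligned = [("k1", "c"), ("k1", "k"), ("k2", "a"), ("k2", "a")]
h_g_given_p = conditional_entropy(aligned)
print(f"H(G | P) = {h_g_given_p:.3f} bits")
```

A lower value means the phone labels predict the graphemes more reliably; under the abstract's definition, the full metric would add the corresponding estimate for phones given acoustics.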