Sense Element in Continuous Speech: Evidence from Lhasa Tibetan Speech Synthesis

Yiqing Zu, Chen Lu, Ngo drup, Ronghua Zhu, Chenning Liu, Pengfei Shao, 'Bum Thr Klu, Xiao Zhang, Guoping Hu

20 Dec 2022 · OpenReview Archive Direct Upload · Readers: Everyone

Abstract: In tonal languages, semantic elements can be combined into a unit characterized by tone sandhi. A multisyllabic unit with tone sandhi has the same status as a monosyllabic element. Such a speech unit is termed the sense element, or SE. Because the semantic concepts expressed by utterances are multidimensional while sounds unfold linearly in time, SEs operate at the same working level even though their internal grammatical structures differ. As running elements in language production and comprehension, SEs are uttered one after another in speech. An MOS (mean opinion score) rating is conducted to compare synthesized Lhasa Tibetan speech generated by two models: the dictionary-entry model and the SE-entry model. The results show that the model using SEs as input units obtains an MOS of 4.25, which is 0.82 points higher than the traditional model using dictionary entries as input. A Lhasa Tibetan speech database of 2,475 sentences, 3.95 hours in total, is used in this experiment. The database is annotated on two levels: i) tonal annotation, which includes the annotation of tonal value and tone sandhi domain, and ii) grammatical annotation, which includes word segmentation, POS tagging and the annotation of function words. Aligning the tonal and grammatical annotations shows that, among 25,265 word boundaries, 15% are inconsistent with tone sandhi boundaries: the 85% that are consistent are found in content words, such as nouns and adjectives, while inconsistency occurs only in verb phrases. For example, verb phrases like “negation adverb + verb” and “verb + topic marker” are grouped into a single tone sandhi domain. It is suggested that when homographic characters play different roles in sentences, the difference is exhibited through varying forms of tonal realization. Specifically, in Lhasa Tibetan there are three tonal patterns for homographic disyllables (tone sandhi, citation tone plus citation tone, and citation tone plus tone loss), which express various semantic and syntactic relations between the two syllables. Segmenting input texts into SE sequences would enable better alignment of text and speech and result in correctly synthesized speech.
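The comparison of word boundaries with tone sandhi boundaries described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the function names, the representation of boundaries as syllable indices, and the toy sentence are all assumptions made for illustration only.

```python
def boundary_consistency(word_ends, sandhi_ends):
    """Share of word boundaries that coincide with a tone sandhi domain boundary.

    Both arguments are iterables of syllable indices after which a unit ends
    (a hypothetical representation, not the paper's annotation format).
    """
    word_ends = set(word_ends)
    sandhi_ends = set(sandhi_ends)
    if not word_ends:
        raise ValueError("no word boundaries given")
    return len(word_ends & sandhi_ends) / len(word_ends)


def inconsistent_boundaries(word_ends, sandhi_ends):
    """Word boundaries that fall inside a tone sandhi domain, in sorted order."""
    return sorted(set(word_ends) - set(sandhi_ends))


# Hypothetical toy sentence: word boundaries after syllables 1, 2, 4 and 5,
# but syllables 2-4 form a single sandhi domain (as a "negation adverb + verb"
# phrase might), so the word boundary after syllable 2 has no sandhi boundary.
word_ends = [1, 2, 4, 5]
sandhi_ends = [1, 4, 5]

rate = boundary_consistency(word_ends, sandhi_ends)
mismatches = inconsistent_boundaries(word_ends, sandhi_ends)
print(f"consistency: {rate:.2f}, inconsistent word boundaries: {mismatches}")
```

In this toy case three of the four word boundaries coincide with a sandhi boundary. The mismatched boundary is the kind of position where an SE-based segmentation would merge two dictionary words into one input unit.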
