Abstract: The objective of this work is to build virtual talking avatars of characters fully automatically from TV shows. From this unconstrained data, we show how to capture a character’s style of speech, visual appearance and language in an effort to construct an interactive avatar of the person and effectively immortalize them in a computational model. We make three contributions: (i) a complete framework for producing a generative model of the audiovisual and language characteristics of characters from TV shows; (ii) a novel method for aligning transcripts to video using the audio; and (iii) a fast audio segmentation system for silencing non-spoken audio from TV shows. Our framework is demonstrated using all 236 episodes from the TV series Friends ($\approx$ 97 h of video) and is shown to generate novel sentences as well as character-specific speech and video.