Abstract: The problem of automatic Visual Language IDentification (VLID), i.e. identifying the language being spoken from visual information alone, without any audio, is studied. The proposed method employs facial landmarks automatically detected in a video. A convex optimisation problem is formulated to jointly find both the discriminative representation (a soft-histogram over a set of lip shapes) and the classifier. A 10-fold cross-validation on a dataset of 644 videos collected from youtube.com yields an accuracy of 73% in pairwise discrimination between English and French (chance level: 50%). A study with 10 videos suggests that the proposed method performs better than the average human at discriminating between the two languages.
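To make the representation mentioned in the abstract concrete, the following is a minimal sketch of a soft-histogram encoding over a set of lip shapes: each frame's lip-landmark descriptor is softly assigned to prototype shapes and the assignments are averaged into a fixed-length video descriptor. The prototype shapes, the descriptor dimensionality, and the softness parameter `beta` are illustrative assumptions; in the proposed method the representation is learned jointly with the classifier via convex optimisation, which this sketch does not attempt.

```python
import numpy as np

def soft_histogram(frames, prototypes, beta=1.0):
    """Encode a video as a soft-histogram over prototype lip shapes.

    frames:     (T, d) array of per-frame lip-landmark descriptors
    prototypes: (K, d) array of prototype lip shapes (assumed fixed here;
                the paper learns the representation jointly with the classifier)
    beta:       softness of the assignment (illustrative parameter)
    """
    # Squared Euclidean distance between every frame and every prototype
    d2 = ((frames[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    # Soft assignment of each frame to the prototypes (softmax over -beta * d2)
    w = np.exp(-beta * d2)
    w /= w.sum(axis=1, keepdims=True)
    # Average over frames -> fixed-length descriptor summing to 1
    return w.mean(axis=0)

# Example with hypothetical sizes: 120 frames of 40-D landmark features, 32 prototypes
rng = np.random.default_rng(0)
video = rng.normal(size=(120, 40))
shapes = rng.normal(size=(32, 40))
h = soft_histogram(video, shapes)
print(h.shape, h.sum())  # (32,) 1.0
```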