Lip-Text Verification Using Multivariate Time Series Lip Motion Features

Turi Abu; Polina Konovalova

28 Apr 2026 (modified: 28 Apr 2026) | THU 2026 Spring ANM Submission | CC BY 4.0

Keywords: Lip reading verification, Time series classification

TL;DR: Time-series lip motion feature extraction for lip-reading verification

Abstract: Lipreading has recently gained attention not only as a complementary modality to audio-based speech processing, but also as a promising tool for security-oriented applications such as identity verification, liveness detection, and audio-visual consistency checking. By analyzing the visual patterns of a speaker’s lip movements, these systems can help determine whether spoken content matches the observed facial motion, making them particularly useful in scenarios where audio may be spoofed, manipulated, or unavailable. Early research in lipreading relied on handcrafted visual features, including pixel intensities, geometric lip contours, and Active Appearance Models (AAMs). More recent approaches, however, leverage deep learning techniques to learn representations directly from raw video data. While these methods have demonstrated strong performance, they often depend on large-scale datasets and tend to produce high-dimensional, less interpretable feature representations. In this proposal, we investigate an alternative formulation of lipreading for verification tasks. Specifically, we propose to model lip-text verification as a multivariate time series learning problem using compact geometric features extracted from facial landmarks. Instead of processing raw images, we derive a low-dimensional representation that captures the temporal dynamics of lip articulation. These features are designed to be interpretable, computationally efficient, and robust to speaker variation through normalization. We propose a hybrid deep learning architecture combining temporal convolution and recurrent sequence modeling to learn discriminative patterns from the extracted time series. The system is designed to verify whether a given lip motion sequence corresponds to a claimed text sequence. This proposal outlines the feature design, modeling approach, and experimental plan for evaluating the effectiveness of this representation.
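To make the feature idea in the abstract concrete, the following is a minimal sketch of how per-frame facial landmarks might be turned into a compact, normalized multivariate time series. It is an illustration under stated assumptions, not the authors' feature set: the landmark keys (`left_eye`, `right_eye`, `mouth_left`, `mouth_right`, `lip_top`, `lip_bottom`), the three chosen features (mouth width, mouth height, aspect ratio), the inter-ocular scaling, and the per-channel z-normalization are all hypothetical choices made for this example.

```python
import math
from statistics import mean, pstdev


def _dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def frame_features(lm):
    """Compute three geometric lip features for one frame.

    `lm` is a dict of (x, y) landmark coordinates. The key names are
    hypothetical; a real landmark detector will use its own indexing.
    """
    # Assumption: dividing by inter-ocular distance removes the effect of
    # face size and camera distance.
    scale = _dist(lm["left_eye"], lm["right_eye"])
    if scale == 0:
        raise ValueError("degenerate frame: eye landmarks coincide")
    width = _dist(lm["mouth_left"], lm["mouth_right"]) / scale
    height = _dist(lm["lip_top"], lm["lip_bottom"]) / scale
    aspect = height / width if width > 0 else 0.0
    return [width, height, aspect]


def lip_time_series(frames):
    """Map a list of per-frame landmark dicts to a T x 3 feature sequence.

    Each channel is z-normalized over the sequence (an assumed form of
    speaker normalization); constant channels are set to zero.
    """
    series = [frame_features(lm) for lm in frames]
    normalized_channels = []
    for channel in zip(*series):
        mu, sigma = mean(channel), pstdev(channel)
        normalized_channels.append(
            [(v - mu) / sigma if sigma > 0 else 0.0 for v in channel]
        )
    # Transpose back to one feature vector per frame.
    return [list(row) for row in zip(*normalized_channels)]
```

A sequence produced this way has one row per video frame and one column per feature, which is the input shape a temporal convolution or recurrent model would typically consume.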

Submission Number: 4
