Bridging Semantics Across Modalities: Decoupled Representation Learning for Audio-Visual Speech Recognition
Highlights:
• A unified speech recognition framework for noise robustness and unseen speakers.
• Offers insight into cross-modal linguistic semantic alignment and fusion.
• Tailored constraints facilitate modality- and speaker-invariant representations.
• Promising audio-visual speech recognition results across datasets.
DOI: 10.1016/j.knosys.2025.114722