Keywords: Segmentation, Multimodal Learning, Real-time MRI, Vocal Tract
TL;DR: VocSegMRI is a multimodal framework that integrates real-time MRI with acoustic and phonological class for precise vocal tract segmentation.
Registration Requirement: Yes
Abstract: Accurate segmentation of articulatory structures in real-time MRI (rtMRI) remains challenging, as existing methods rely primarily on visual cues and overlook complementary information from synchronized speech signals. We propose VocSegMRI, a multimodal framework integrating video, audio, and phonological inputs via cross-attention fusion and a contrastive learning objective that improves cross-modal alignment and segmentation precision. Evaluated on USC-75 and further validated via zero-shot transfer on USC-TIMIT, VocSegMRI outperforms unimodal and multimodal baselines, with ablations confirming the contribution of each component.
Visa & Travel: Yes
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 14
Loading