VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI

26 Mar 2026 (modified: 16 Apr 2026), MIDL 2026 Short Papers Submission, CC BY 4.0
Keywords: Segmentation, Multimodal Learning, Real-time MRI, Vocal Tract
TL;DR: VocSegMRI is a multimodal framework that integrates real-time MRI with acoustic and phonological inputs for precise vocal tract segmentation.
Registration Requirement: Yes
Abstract: Accurate segmentation of articulatory structures in real-time MRI (rtMRI) remains challenging, as existing methods rely primarily on visual cues and overlook complementary information from synchronized speech signals. We propose VocSegMRI, a multimodal framework integrating video, audio, and phonological inputs via cross-attention fusion and a contrastive learning objective that improves cross-modal alignment and segmentation precision. Evaluated on USC-75 and further validated via zero-shot transfer on USC-TIMIT, VocSegMRI outperforms unimodal and multimodal baselines, with ablations confirming the contribution of each component.
Visa & Travel: Yes
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 14