Abstract: A healthy voice is crucial for verbal communication and hence for daily as well as professional life. The basis for a healthy voice is the sound-producing vocal folds in the larynx. A hallmark of healthy vocal fold oscillation is the symmetric motion of the left and right vocal fold. Clinically, videoendoscopy is applied to assess the symmetry of the oscillation, which is evaluated subjectively.
High-speed videoendoscopy, an emerging method that allows quantification of the vocal fold oscillation, is so far employed mainly in research because of the large amount of data it produces and the complex, semi-automatic analysis it requires. In this study, we provide a comprehensive evaluation of methods that detect the glottal midline fully automatically. We used a biophysical model to simulate different vocal fold oscillations, extended the openly available BAGLS dataset with manual annotations, utilized both simulations and annotated endoscopic images to train deep neural networks at different stages of the analysis workflow, and compared these to established computer vision algorithms.
We found that classical computer vision algorithms perform well in detecting the glottal midline from glottis segmentation data, but are outperformed by deep neural networks on this task. We further propose GlottisNet, a multi-task neural architecture that simultaneously predicts both the opening between the vocal folds and the symmetry axis. By fully automating segmentation and midline detection, GlottisNet is a major step towards the clinical applicability of quantitative, deep learning-assisted laryngeal endoscopy.
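The abstract only outlines the idea of jointly predicting the glottal opening and the symmetry axis. The following is a minimal, illustrative PyTorch sketch of such a multi-task design, assuming a shared encoder with a segmentation decoder and a keypoint-regression head for the midline endpoints; layer sizes, the midline parameterization, and the loss weighting are assumptions, not the authors' GlottisNet implementation.

```python
# Illustrative sketch only: shared-encoder, two-head network in the spirit of a
# multi-task glottis segmentation + midline prediction model. All layer sizes,
# the head designs, and the loss weighting are assumptions, not GlottisNet.
import torch
import torch.nn as nn


class MultiTaskGlottisModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared encoder: downsamples the endoscopic frame to a feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Segmentation head: upsamples back to input resolution, 1-channel mask.
        self.seg_head = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 2, stride=2),
        )
        # Midline head: regresses the two endpoints of the symmetry axis
        # (e.g. anterior and posterior point), i.e. four coordinates.
        self.midline_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 4),
        )

    def forward(self, x):
        features = self.encoder(x)
        seg_logits = self.seg_head(features)   # glottal area mask (logits)
        midline = self.midline_head(features)  # (x1, y1, x2, y2) endpoints
        return seg_logits, midline


if __name__ == "__main__":
    model = MultiTaskGlottisModel()
    frames = torch.randn(2, 1, 256, 256)       # batch of grayscale frames
    seg_logits, midline = model(frames)
    # Joint training would combine a segmentation and a regression loss, e.g.
    # loss = bce_with_logits(seg_logits, mask) + lam * mse(midline, gt_points).
    print(seg_logits.shape, midline.shape)     # (2, 1, 256, 256) and (2, 4)
```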