Abstract: Glottal Closure Instants (GCIs) correspond to the temporal lo-
cations of significant excitation to the vocal tract occurring dur-
ing the production of voiced speech. GCI detection from speech
signals is a well-studied problem given its importance in speech
processing. Most of the existing approaches for GCI detection
adopt a two-stage approach: (i) transformation of the speech signal
into a representative signal in which GCIs are better localized, and
(ii) extraction of GCIs from the representative signal obtained in
the first stage. The former stage is accomplished using signal
processing techniques based on the principles of speech production,
and the latter with heuristic algorithms such as dynamic programming
and peak picking. These methods are thus task-specific and rely on
the methods used for representative signal
extraction. In this paper, by contrast, we formulate the GCI detection
problem from a representation learning perspective, where an
appropriate representation is implicitly learned from the raw speech
samples. Specifically, GCI detection is cast as a supervised
multi-task learning problem, solved using a deep convolutional
neural network that jointly optimizes a classification and
regression cost. The learning capability is demonstrated with
several experiments on standard datasets. The results compare
well with state-of-the-art algorithms while performing better
in the presence of real-world non-stationary noise.
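
As a rough illustration of what jointly optimizing a classification and a regression cost can look like, the sketch below combines a binary cross-entropy term (does a frame contain a GCI?) with a squared-error term (where in the frame is it?). The abstract does not specify the actual loss, so the function name joint_gci_loss, the frame-wise formulation, the normalized offset target, and the weight parameter are all assumptions made for this example only.

```python
import numpy as np

def joint_gci_loss(p_gci, offset_pred, has_gci, offset_true, weight=1.0, eps=1e-7):
    """Hypothetical multi-task cost: classification term + regression term.

    p_gci       : predicted probability that each frame contains a GCI
    offset_pred : predicted GCI position within the frame (assumed normalized to [0, 1])
    has_gci     : 1 if the frame truly contains a GCI, else 0
    offset_true : true normalized position (ignored where has_gci == 0)
    weight      : assumed trade-off factor between the two terms
    """
    p = np.clip(np.asarray(p_gci, dtype=float), eps, 1.0 - eps)
    y = np.asarray(has_gci, dtype=float)

    # Classification term: binary cross-entropy averaged over all frames.
    bce = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

    # Regression term: squared error, counted only on frames that contain a GCI.
    sq_err = (np.asarray(offset_pred, dtype=float)
              - np.asarray(offset_true, dtype=float)) ** 2
    mse = np.sum(y * sq_err) / max(np.sum(y), 1.0)

    return bce + weight * mse
```

Masking the regression term with has_gci keeps frames without a GCI from contributing a meaningless location error; in a real network both terms would be computed from two output heads sharing the same convolutional features.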