Deep neural networks for automatic speaker recognition do not learn supra-segmental temporal features
Abstract: Highlights
• The literature explains speaker recognition in neural nets by the modeling of voice dynamics.
• Diagnostic: We quantify how well deep learning models actually capture dynamics.
• Observation: State-of-the-art deep nets do not model speaker prosody; they ignore it.
• Interpretation as “cheating”: Achieving high performance without putting in due effort.
• Outlook: Increasing task difficulty biases models towards prosody, but not enough.