Note that the presented architecture works at the frame level, meaning that each single frame (plus its corresponding context) is fed-forward through the network, obtaining a class posterior probability for all of the target languages. This fact makes the DNNs particularly suitable for real-time applications because, unlike other approaches (i.e. i-vectors), we can potentially make a decision about the language at each new frame. Indeed, at each frame, we can combine the evidence from past frames to get a single similarity score between the test utterance and the targetlanguages. A simple way of doing this combination is to assume that frames are independent and multiply the posterior estimates of the last layer. The score sl for language l of a given test utterance is computed by multiplying the output probabilities pl obtained for all of its frames; or equivalently, accumulating the logs as:(6)sl=1N∑t=1Nlogp(Ll|xt​, θ)where p(Ll|xt​, θ) represents the class probability output for the language l corresponding to the input example at time t, xt by using the DNN defined by parameters θ.
