In the paper 'Distilling Word Embeddings: An Encoding Approach', the authors mention that matching softmax can also be applied together with the standard cross-entropy loss (with one-hot ground truth) or, more elaborately, that the teacher model's effect can decline in an annealing fashion as the student model becomes more aware of the data. This is supported by another related paper that you have read. Provide the full name of that paper.