Abstract: User experiences can be made more engaging by incorporating surprise. For example, online shoppers may like to view unique products. In this paper we propose an approach for detecting surprising documents, such as product titles. As the concept of surprise is subjective, there is currently no principled method for measuring the surprisingness score of a document. We present such a method: an unsupervised approach for automatically discovering surprising documents in an unlabeled corpus. Our approach is based on a probabilistic model of surprise and on the construction of effective distributional word embeddings, which can be adapted to the semantic context in which a word appears. As the performance of our model does not degrade with the length of the document, it is particularly well suited for very short documents (even a single sentence). We evaluate our model in both supervised and unsupervised settings, demonstrating its state-of-the-art performance on two real-world data sets: a collection of e-commerce products from eBay, and a corpus of NSF proposals. These experiments show that our surprisingness score exhibits high correlation with human-annotated labels.