Prob2Vec: Mathematical Semantic Embedding for Problem Retrieval in Adaptive Tutoring

Du Su; Ali Yekkehkhany; Yi Lu; Wenmiao Lu

Prob2Vec: Mathematical Semantic Embedding for Problem Retrieval in Adaptive Tutoring

Du Su, Ali Yekkehkhany, Yi Lu, Wenmiao Lu

27 Sept 2018 (modified: 05 May 2023)ICLR 2019 Conference Blind SubmissionReaders: Everyone

Abstract: We propose a new application of embedding techniques to problem retrieval in adaptive tutoring. The objective is to retrieve problems similar in mathematical concepts. There are two challenges: First, like sentences, problems helpful to tutoring are never exactly the same in terms of the underlying concepts. Instead, good problems mix concepts in innovative ways, while still displaying continuity in their relationships. Second, it is difficult for humans to determine a similarity score consistent across a large enough training set. We propose a hierarchical problem embedding algorithm, called Prob2Vec, that consists of an abstraction and an embedding step. Prob2Vec achieves 96.88\% accuracy on a problem similarity test, in contrast to 75\% from directly applying state-of-the-art sentence embedding methods. It is surprising that Prob2Vec is able to distinguish very fine-grained differences among problems, an ability humans need time and effort to acquire. In addition, the sub-problem of concept labeling with imbalanced training data set is interesting in its own right. It is a multi-label problem suffering from dimensionality explosion, which we propose ways to ameliorate. We propose the novel negative pre-training algorithm that dramatically reduces false negative and positive ratios for classification, using an imbalanced training data set.

Keywords: personalized learning, e-learning, text embedding, Skip-gram, imbalanced data set, data level classification methods

TL;DR: We propose the Prob2Vec method for problem embedding used in a personalized e-learning tool in addition to a data level classification method, called negative pre-training, for cases where the training data set is imbalanced.

11 Replies

Loading