- Abstract: In recent years, softmax together with its fast approximations has become the de-facto loss function for deep neural networks with multiclass predictions. However, softmax is used in many problems that do not fully fit the multiclass framework and where the softmax assumption of mutually exclusive outcomes can lead to biased results. This is often the case for applications such as language modeling, next event prediction and matrix factorization, where many of the potential outcomes are not mutually exclusive, but are more likely to be independent conditionally on the state. To this end, for the set of problems with positive and unlabeled data, we propose a relaxation of the original softmax formulation, where, given the observed state, each of the outcomes are conditionally independent but share a common set of negatives. Since we operate in a regime where explicit negatives are missing, we create an adversarially-trained model of negatives and derive a new negative sampling and weighting scheme which we denote as Cooperative Importance Sampling (CIS). We show empirically the advantages of our newly introduced negative sampling scheme by pluging it in the Word2Vec algorithm and benching it extensively against other negative sampling schemes on both language modeling and matrix factorization tasks and show large lifts in performance.
- Keywords: Negative Sampling, Sampled Softmax, Word embeddings, Adversarial Networks
- TL;DR: Defining a partially mutual exclusive softmax loss for postive data and implementing a cooperative based sampling scheme