Abstract: Learning distributed representations, or embeddings, that encode the relational similarity patterns among objects is a relevant task in machine learning. A popular way to learn the embedding matrices $X, Y$ is to optimize a loss function involving the term ${\rm SoftMax}(XY^T)$. The cost of computing this term, however, grows quadratically with the problem size, making it a computationally heavy solution. In this article, we propose a linear-time heuristic approximation of the normalization constants of ${\rm SoftMax}(XY^T)$ for embedding vectors with bounded norms. On several pre-trained embedding datasets, we show that the proposed estimation method achieves accuracy comparable to or higher than that of competing methods. Building on this result, we design an efficient and task-agnostic algorithm that learns the embeddings by optimizing the cross entropy between the softmax and a set of probability distributions given as inputs. The proposed algorithm is interpretable and easily adapted to arbitrary embedding problems. On a few use cases, we observe similar or better performance and lower computational time than comparable ``2Vec'' algorithms.
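To make the complexity claim concrete, the following is a minimal NumPy sketch (ours, not the authors' code) of the naive, exact computation of the row-wise normalization constants of ${\rm SoftMax}(XY^T)$. The sizes `n`, `d` and all variable names are hypothetical; the point is that forming the $n \times n$ score matrix is what makes the exact computation quadratic.

```python
import numpy as np

# Illustrative sketch only (hypothetical sizes, not the paper's method):
# the naive computation of Z_i = sum_j exp(<x_i, y_j>) touches every
# (i, j) pair, hence O(n^2 d) time and O(n^2) memory.
rng = np.random.default_rng(0)
n, d = 2_000, 128                           # hypothetical problem size
X = rng.normal(size=(n, d)) / np.sqrt(d)    # embedding rows with bounded norms
Y = rng.normal(size=(n, d)) / np.sqrt(d)

S = X @ Y.T                  # n x n score matrix: the quadratic bottleneck
Z = np.exp(S).sum(axis=1)    # normalization constants Z_i
P = np.exp(S) / Z[:, None]   # rows of SoftMax(X Y^T), each summing to 1
```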
Submission Length: Regular submission (no more than 12 pages of main content)
Supplementary Material: zip
Changes Since Last Submission: We thank the reviewers for their insightful comments. We have revised the paper to address them and believe the changes greatly improve the article's readability. Several issues were related to slight terminology misuses and a lack of clarity. We summarize the main changes below.
* We changed the title and substituted "attention scores" with "softmax scores". We did not make this modification lightly, and we would like to justify our choice. The term ${\rm SoftMax}(XY^T)$ appears in transformers, and the expression "attention scores" is used in the literature (https://openreview.net/pdf?id=R8sQPpGCv0). For this reason, we felt it was a good name for this quantity. We note that we never claimed to have designed an efficient transformer; in fact, the word "transformer" appeared only once in the whole paper, in reference to related work. In this respect, we firmly believe we did not claim anything beyond what we actually did. Nonetheless, all three reviewers raised concerns and were, to some extent, confused by this nomenclature. If all the reviewers were uncomfortable with it, then any reader may be, which is not our intention. We therefore changed the title and renamed the quantity, clearly stating and repeating that our contributions are: an efficient method to normalize ${\rm SoftMax}(XY^T)$, and an efficient 2Vec-type algorithm based on this normalization technique. We believe this type of contribution falls squarely within the scope of TMLR, and we thank the reviewers for acknowledging the relevance of the problem we considered and of our contribution.
* We extensively reworked the abstract, introduction, and main result sections, ensuring their coherence. We also tried to better motivate our embedding problem, emphasizing its use cases.
* In particular, we rephrased the main section to guide the reader through our argument, moving the theorems to the appendix.
* We also updated and corrected the figures according to the reviewers' requests.
----
## Camera-ready update
We expanded Section 2.2 (Empirical evaluation) by comparing the performance of our estimation method against additional baselines based on sampling, top-k estimates, and low-rank approximations; a generic sampling baseline is sketched below. We also slightly reworked the related-work part of the introduction to make it coherent with the new version of Section 2.2.
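As a concrete illustration of the sampling family of baselines mentioned above, here is a minimal sketch of a uniform-sampling estimator of the normalization constants. This is our generic reading of such a baseline, not necessarily the exact variant evaluated in Section 2.2; all sizes and names are hypothetical.

```python
import numpy as np

# Uniform-sampling baseline sketch (hypothetical sizes, not necessarily the
# paper's exact variant): estimate Z_i from m << n sampled columns of Y,
# trading the O(n^2 d) exact cost for O(n m d).
rng = np.random.default_rng(1)
n, d, m = 2_000, 128, 64                    # m = number of sampled columns
X = rng.normal(size=(n, d)) / np.sqrt(d)
Y = rng.normal(size=(n, d)) / np.sqrt(d)

idx = rng.choice(n, size=m, replace=False)  # uniform sample of columns
Z_hat = (n / m) * np.exp(X @ Y[idx].T).sum(axis=1)  # unbiased estimate

Z = np.exp(X @ Y.T).sum(axis=1)             # exact values, for comparison
print(np.median(np.abs(Z_hat - Z) / Z))     # typical relative error
```

The estimator is unbiased under uniform sampling, but its variance grows with the spread of the scores, which is what motivates comparing it against top-k and low-rank alternatives.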
Code: https://github.com/lorenzodallamico/EDRep/
Assigned Action Editor: ~Manzil_Zaheer1
Submission Number: 3597