Understanding Compositionality in Data Embeddings

TMLR Paper 4009 Authors

19 Jan 2025 (modified: 14 Apr 2025) · Rejected by TMLR · CC BY 4.0
Abstract: Embeddings are often difficult for humans to interpret, raising potential safety concerns. To address this, we analyze embeddings of different data structures, such as words, sentences, and graphs, and interpret them in a human-understandable manner. This study investigates algebraic relations, specifically additive ones, between pairs of vectors that represent entities known to be similar with respect to a particular feature. To this end, we apply two methods: (1) Correlation-based Compositionality Detection, which measures the correlation between known attributes of objects and their embeddings, and (2) Additive Compositionality Detection, which decomposes embeddings into an additive combination of vectors representing specific attributes. We evaluate embeddings from various models, layers, and training stages to explore their capacity to encode compositional relationships. Sentence embeddings, for example, can be interpreted as the sum of underlying conceptual components. Similarly, word embeddings can be interpreted as capturing a combination of semantic and morphological information, and graph embeddings in recommender systems reflect the sum of a user's demographic attributes. Across all three types of data, the relationships between structured entities are encoded as vector operations in the embeddings, with a simple operation such as addition playing a central role in expressing compositionality. Code will be publicly available on GitHub upon acceptance.
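For concreteness, the following is a minimal, hypothetical sketch (not the authors' released code, which is promised upon acceptance) of how the two detection methods described in the abstract could be instantiated. The embedding matrix, attribute matrix, and dimensions are synthetic assumptions for illustration only.

```python
# Hedged sketch of the two detection methods from the abstract, assuming an
# embedding matrix E (n_items x d) and a binary attribute matrix A (n_items x k)
# indicating which of k known attributes each item carries.
import numpy as np

rng = np.random.default_rng(0)
n_items, d, k = 200, 64, 5
A = rng.integers(0, 2, size=(n_items, k)).astype(float)   # hypothetical attribute labels
W_true = rng.normal(size=(k, d))                          # hidden per-attribute directions
E = A @ W_true + 0.1 * rng.normal(size=(n_items, d))      # synthetic stand-in for real embeddings

# (1) Correlation-based Compositionality Detection:
# correlate each attribute indicator with each embedding dimension.
corr = np.array([[np.corrcoef(A[:, j], E[:, i])[0, 1] for i in range(d)]
                 for j in range(k)])
print("max |correlation| per attribute:", np.abs(corr).max(axis=1).round(2))

# (2) Additive Compositionality Detection:
# solve E ~= A @ W by least squares, i.e. express each embedding as a sum of
# attribute vectors, then check how well the additive model reconstructs E.
W_hat, *_ = np.linalg.lstsq(A, E, rcond=None)
E_hat = A @ W_hat
r2 = 1 - ((E - E_hat) ** 2).sum() / ((E - E.mean(0)) ** 2).sum()
print(f"additive reconstruction R^2: {r2:.3f}")
```

With real embeddings, a high reconstruction quality under the purely additive model would indicate that the attributes are encoded compositionally, in the sense the abstract describes.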
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=WN9zvxi1Cr
Changes Since Last Submission: Dear Action Editor,

Thank you for your thoughtful feedback and the opportunity to revise our manuscript. We are grateful for the constructive comments and have made significant efforts to address the concerns raised. Below is a concise summary of our revisions.

1. Addressing overlap with previous work. We have resolved overlap concerns by clearly delineating prior work from our novel contributions. Results from previous studies are now explicitly described as foundational, and all new experiments and analyses are clearly identified as distinct contributions of this manuscript. Figures and examples shared with earlier work have been removed or annotated in the appendix, and their context has been reframed to emphasize how this submission extends beyond the prior findings.

2. New experiments and analyses. We have conducted additional experiments to provide more robust analyses (a minimal sketch of the sentence-embedding setup appears after this letter):
- Sentence embeddings: We designed a more challenging experiment that decomposes sentence embeddings into multiple constituent concepts, rather than the simpler decomposition into subject, verb, and object used previously. This task is inherently more complex, as sentence embeddings arise from interactions between individual tokens, so linear combinations of these concepts are not necessarily expected, making this a more demanding and insightful test of additive compositionality.
- Advanced models: Additive compositionality has been evaluated across more advanced models, including SBERT, GPT, and LLaMA.
- Other baselines: We also analyzed different SBERT layers and the CLS token at various training stages of BERT; these serve as additional baselines.
- Knowledge graph embeddings: We evaluated additive compositionality in knowledge graph embeddings, examining various scoring-function-based embeddings and their performance across different training stages.
Across the different types of data, relationships between structured entities, such as morphological or semantic components in words, concepts in sentences, and attributes in graph nodes, are encoded as vector operations in embeddings, with a simple operation such as addition playing a central role in expressing compositionality.

3. Presentation and writing improvements. We have thoroughly revised the manuscript to address typos, formatting issues, and redundancies.

We hope these revisions effectively address the concerns raised and improve the quality and originality of our submission. We are grateful for your feedback and guidance, and we remain open to any further suggestions to strengthen the manuscript.

Thank you for your time and consideration.

Sincerely,
Authors
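The following is an illustrative sketch, under assumed choices, of the kind of sentence-embedding probe described above: comparing an SBERT embedding of a full sentence against the sum of embeddings of its constituent concepts. The model name, example sentence, and concept segmentation are hypothetical and not taken from the paper.

```python
# Hedged sketch of probing additive compositionality in sentence embeddings:
# does the embedding of a sentence align with the sum of embeddings of the
# concepts it is composed of? Model and inputs below are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice

sentence = "A man is playing a guitar in the park"
concepts = ["a man", "playing a guitar", "in the park"]  # hand-picked constituent concepts

sent_vec = model.encode(sentence)          # embedding of the full sentence
concept_vecs = model.encode(concepts)      # embeddings of the individual concepts
additive_vec = concept_vecs.sum(axis=0)    # additive reconstruction from concepts

cos = float(np.dot(sent_vec, additive_vec) /
            (np.linalg.norm(sent_vec) * np.linalg.norm(additive_vec)))
print(f"cosine(sentence, sum of concepts) = {cos:.3f}")
```

In this kind of probe, a high cosine similarity between the sentence embedding and the summed concept embeddings would be evidence for the additive compositionality that the experiments investigate.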
Assigned Action Editor: ~Massimiliano_Mancini1
Submission Number: 4009
