VibeSpace: Automatic vector embedding creation for arbitrary domains and mapping between them using large language models

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: unsupervised representation learning, vector embeddings, large language models, recommender systems
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We fully automate data collection and vector embedding creation with LLMs; we also learn mappings between vector spaces.
Abstract: We present VibeSpace, a method for the fully unsupervised construction of interpretable embedding spaces applicable to arbitrary domains. By leveraging the knowledge contained within large language models, our method automates otherwise costly data acquisition and assesses the similarity of entities, allowing for meaningful and interpretable positioning within vector spaces. Our approach is also capable of learning intelligent mappings between vector space representations of non-overlapping domains, enabling a novel form of cross-domain similarity analysis. First, we demonstrate that our data collection methodology yields comprehensive and rich datasets across multiple domains, including songs, books, and movies. Second, we show that our method yields single-domain embedding spaces which are separable by various domain-specific features. These representations provide a solid foundation upon which we can develop classifiers and initialise recommender systems, demonstrating our method's utility as a data-free solution to the cold-start problem. Further, these spaces can be interactively queried to obtain semantic information about different regions of each embedding space. Lastly, we argue that by exploiting the unique capabilities of current state-of-the-art large language models, we produce cross-domain mappings that capture contextual relationships between heterogeneous entities which may not be attainable through traditional methods. The presented method facilitates the creation of embedding spaces for any domain, circumventing the need to collect and calibrate sensitive user data while providing deeper insights into, and better interpretations of, multi-domain data.
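
The submission page carries no implementation details, but the pipeline the abstract describes (elicit pairwise similarity judgments from an LLM, embed entities so that distances reflect those judgments, then fit a mapping between the embedding spaces of two domains) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `llm_similarity` is a hypothetical stand-in for a real LLM query (replaced here by token overlap so the sketch runs end to end), and the MDS embedding and least-squares cross-domain map are one plausible realisation, not the paper's actual method.

```python
# Hypothetical sketch of the pipeline described in the abstract:
# (1) elicit pairwise similarities (stand-in for an LLM query),
# (2) embed entities with multidimensional scaling so that distances
#     mirror LLM-judged dissimilarity,
# (3) fit a linear map between two domains' spaces from anchor pairs.
import itertools

import numpy as np
from sklearn.manifold import MDS


def llm_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for asking an LLM 'how similar are a and b?'
    (score in [0, 1]); here simple token overlap so the example runs."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


def build_embedding(entities: list[str], dim: int = 2) -> np.ndarray:
    """Embed entities so pairwise distances reflect LLM dissimilarity."""
    n = len(entities)
    dist = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        d = 1.0 - llm_similarity(entities[i], entities[j])
        dist[i, j] = dist[j, i] = d
    mds = MDS(n_components=dim, dissimilarity="precomputed", random_state=0)
    return mds.fit_transform(dist)


def fit_cross_domain_map(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares linear map W with src @ W ~= dst, fitted on anchor
    pairs (e.g., row i of src and dst describe related entities)."""
    W, *_ = np.linalg.lstsq(src, dst, rcond=None)
    return W
```

A linear least-squares map is only one choice for the cross-domain step; the "intelligent mappings" the abstract refers to may well be nonlinear or LLM-mediated, and a production version would batch and cache the LLM similarity calls rather than query all O(n²) pairs.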
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6456