Kv2vec: A Distributed Representation Method for Key-value Pairs from Metadata Attributes

Chenxu Niu, Wei Zhang, Suren Byna, Yong Chen

Published: 2022, Last Modified: 05 Jan 2026HPEC 2022EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Distributed representation methods for words have been developed for years, and numerous methods exist, such as word2vec, GloVe, and fastText. However, they are not designed for key-value pairs, which is an important data pattern and widely used in many scenarios. For example, metadata attributes of scientific files consist of a collection of key-value pairs. In this research, we propose kv2vec, a method that captures relationships between keys and values and represents key-value pairs in dense vectors. The fundamental idea of the kv2vec method is utilizing recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units to convert each key-value pair to a distributed vector representation. This new method overcomes the weaknesses of existing embedding models for representing key-value pairs as vectors. Moreover, it can be integrated into dataset search solutions through querying metadata attributes for self-describing file formats that are widely used in HPC systems. We evaluate the kv2vec method with multiple real-world datasets, and the results show that kv2vec outperforms existing models.