Abstract: Advances in information technology have made many kinds of data available for a wide range of purposes. Although this abundance is desirable, it makes dataset retrieval time-consuming and complex. Conventional dataset search methods require unified metadata and keywords that represent the datasets. In other words, they assume that users already know the terms used in the datasets and the fields in the metadata. To address this issue, we propose a topic-based search method that does not rely on metadata and is intended especially for users who lack knowledge about the datasets. Topic-based search finds datasets using abstract keywords, described as topics, rather than exact keywords. In this paper, we focus on table data, which consist of column names and data values and are widely used for storing data. As a preliminary analysis, we collected and analyzed public datasets available on Japanese data portals to clarify the features of the datasets that a dataset search should cover. The analysis revealed that many general and common keywords are used as column names and that it is difficult to implement dataset search using column names alone. Therefore, based on these results, we convert the datasets into embeddings so that both column names and data values can be used to extract topics. The experimental results show that topics can be extracted from datasets using a topic modeling method and that the resulting search returns better results than a search method using exact keywords.