DUNKS: Chunking and Summarizing Large and Heterogeneous Data for Dataset Search

Qiaosheng Chen, Xiao Zhou, Zhiyang Zhang, Gong Cheng

Published: 2024, Last Modified: 15 Jan 2026. Venue: ISWC (2) 2024. License: CC BY-SA 4.0

Abstract: With the vast influx of open data on the Web, dataset search has become a trending research problem that is crucial to data discovery and reuse. Existing methods for dataset search either use only the unstructured metadata of datasets and ignore their actual data, or handle structured data in a single format such as RDF despite the diverse formats of open data. In this paper, to cope with the size of large datasets, we decompose RDF data into data chunks. Then, to fit large chunks within the limited input capacity of dense ranking models based on pre-trained language models, we propose a multi-chunk summarization method that extracts representative data from representative chunks. Moreover, to handle heterogeneous data formats beyond RDF, we transform other formats into chunks so that they can be processed in a uniform way. Experiments on two test collections for dataset search demonstrate the effectiveness of our dense ranking over summarized data chunks.
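The chunk-then-summarize idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how chunks are formed or how representativeness is measured, so everything below is an assumption made for illustration: triples are grouped by subject, and both chunks and triples are scored by predicate frequency. The function names chunk_by_subject and summarize, and the parameters max_chunks and max_triples_per_chunk, are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict


def chunk_by_subject(triples):
    """Group (subject, predicate, object) triples into chunks keyed by subject.

    Assumption: one chunk per subject. The paper's actual chunking
    criterion is not given in the abstract.
    """
    chunks = defaultdict(list)
    for s, p, o in triples:
        chunks[s].append((s, p, o))
    return dict(chunks)


def summarize(chunks, max_chunks, max_triples_per_chunk):
    """Keep a bounded number of chunks and a bounded number of triples per chunk.

    Assumption: "representative" is approximated by predicate frequency
    over the whole dataset. This is a stand-in heuristic, not the
    authors' summarization method.
    """
    pred_freq = Counter(p for ts in chunks.values() for _, p, _ in ts)

    def chunk_score(ts):
        # A chunk scores higher when it uses predicates common in the dataset.
        return sum(pred_freq[p] for _, p, _ in ts)

    # Sort by descending score; break ties by chunk key for determinism.
    ranked = sorted(chunks.items(), key=lambda kv: (-chunk_score(kv[1]), kv[0]))

    summary = {}
    for key, ts in ranked[:max_chunks]:
        # Within a chunk, prefer triples with frequent predicates.
        top = sorted(ts, key=lambda t: (-pred_freq[t[1]], t))
        summary[key] = top[:max_triples_per_chunk]
    return summary


if __name__ == "__main__":
    triples = [
        ("ex:a", "rdf:type", "ex:City"),
        ("ex:a", "ex:name", "A"),
        ("ex:a", "ex:pop", "10"),
        ("ex:b", "rdf:type", "ex:City"),
        ("ex:b", "ex:name", "B"),
        ("ex:c", "ex:note", "x"),
    ]
    chunks = chunk_by_subject(triples)
    for key, ts in summarize(chunks, max_chunks=2, max_triples_per_chunk=2).items():
        print(key, ts)
```

The two budgets play the role of the input-length limit of a dense ranking model: the summary is bounded in size regardless of how large the original dataset is. Data in other formats could be fed through the same functions once converted to triples or an analogous record form.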

External IDs: dblp:conf/semweb/ChenZZC24