DUNKS: Chunking and Summarizing Large and Heterogeneous Data for Dataset Search

Published: 01 Jan 2024, Last Modified: 16 Sept 2025
Venue: ISWC (2) 2024
License: CC BY-SA 4.0
Abstract: With the vast influx of open data on the Web, dataset search has become an active research problem that is crucial to data discovery and reuse. Existing methods for dataset search either use only the unstructured metadata of datasets and ignore the actual data, or handle structured data in a single format such as RDF despite the diverse formats of open data. In this paper, to cope with the size of large datasets, we decompose RDF data into data chunks. Then, to fit large chunks within the limited input capacity of dense ranking models based on pre-trained language models, we propose a multi-chunk summarization method that extracts representative data from representative chunks. Moreover, to handle heterogeneous data formats beyond RDF, we transform other formats into chunks so that they are processed in a uniform way. Experiments on two test collections for dataset search demonstrate the effectiveness of our dense ranking over summarized data chunks.
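The chunk-then-summarize idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not specify how chunks are formed or how representativeness is scored, so the sketch assumes one chunk per RDF subject, treats a chunk's predicate set as its "pattern", and ranks by simple frequency. The names `chunk_by_subject`, `summarize`, `max_chunks`, and `max_triples_per_chunk` are hypothetical.

```python
from collections import Counter, defaultdict

Triple = tuple[str, str, str]  # (subject, predicate, object)


def chunk_by_subject(triples: list[Triple]) -> dict[str, list[Triple]]:
    """Assumption: one chunk per subject, holding all of its triples."""
    chunks: dict[str, list[Triple]] = defaultdict(list)
    for s, p, o in triples:
        chunks[s].append((s, p, o))
    return dict(chunks)


def summarize(chunks: dict[str, list[Triple]],
              max_chunks: int = 2,
              max_triples_per_chunk: int = 2) -> list[str]:
    """Pick one chunk per frequent predicate pattern, then keep its
    triples with the most frequent predicates (an assumed heuristic)."""
    pattern_of = {s: frozenset(p for _, p, _ in ts) for s, ts in chunks.items()}
    pattern_freq = Counter(pattern_of.values())
    predicate_freq = Counter(p for ts in chunks.values() for _, p, _ in ts)

    summary: list[str] = []
    seen_patterns: set[frozenset[str]] = set()
    # Most common pattern first; ties broken by subject for determinism.
    ranked = sorted(chunks, key=lambda s: (-pattern_freq[pattern_of[s]], s))
    for s in ranked:
        if pattern_of[s] in seen_patterns:
            continue  # this pattern is already represented by another chunk
        seen_patterns.add(pattern_of[s])
        top = sorted(chunks[s],
                     key=lambda t: (-predicate_freq[t[1]], t[1], t[2]))
        top = top[:max_triples_per_chunk]
        summary.append(" ; ".join(f"{ts} {tp} {to}" for ts, tp, to in top))
        if len(summary) == max_chunks:
            break
    return summary


# Toy data, invented for illustration only.
triples = [
    ("ex:a", "rdf:type", "ex:City"), ("ex:a", "ex:name", "Paris"),
    ("ex:b", "rdf:type", "ex:City"), ("ex:b", "ex:name", "Rome"),
    ("ex:c", "rdf:type", "ex:River"), ("ex:c", "ex:length", "777"),
    ("ex:c", "ex:name", "Seine"),
]
chunks = chunk_by_subject(triples)
summary = summarize(chunks)
```

Each string in `summary` is a short textual rendering of one representative chunk, which is the kind of bounded input a dense ranking model with a limited token budget could encode. Under the same assumptions, rows of a CSV file or objects in a JSON file could be mapped to triples first and then passed through the same two functions.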