Retrieval-augmented Query Reformulation for Heterogeneous Research Asset Retrieval in Virtual Research Environment
Abstract: Discovering and reusing research assets such as datasets and computational notebooks is crucial for building research workflows in data-centric studies. The rapid growth of research assets in scientific communities provides scientists with great opportunities to enhance research efficacy but also poses significant challenges in finding suitable materials for specific tasks. Scientists, especially those focusing on cross-disciplinary research, often find it difficult to formulate effective queries to retrieve desired resources. Previous work has proposed query reformulation methods to increase the efficiency of research asset search. However, it relies on existent knowledge graphs and is constrained to computational notebooks only. As research assets utilized by data analytic workflows are in essence heterogeneous, i.e., of distinct kinds and from diversified sources, query reformulation methods in this regard should consider the relationship between different types of research assets. To address the above challenges, we propose a retrieval-augmented query reformulation method for heterogeneous research asset retrieval. It is developed in the context of a Notebook-based virtual research environment (VRE) and offers query reformulation services to other VRE components. We demonstrate the effectiveness of the proposed query reformulation service with experiments on dataset and notebook retrieval. Up till now, we have indexed 8,954 datasets and 18,158 notebooks. The experimental results show that the proposed service can create useful query suggestions.
Loading