LLM + Vector Data: Coupling of Large Language Models with Vector Data Management for Enhancing Data Science

Arijit Khan, Yuxiang Wang, Weixi Zhang, Yao Tian, M. Tamer Özsu

Published: 2025, Last Modified: 07 Jan 2026, ICDEW 2025, CC BY-SA 4.0

Abstract: The emergence of generative AI (GenAI) is a major driving force behind the modern data science ecosystem, a field that exploits data as the central asset for actionable insights. Analogously, GenAI is a form of artificial intelligence that learns from massive datasets to generate new data, showcasing human-like creativity in text, images, code, speech, and video. Two critical pillars of GenAI technology are large language models (LLMs) and vector data. In particular, LLMs are a category of GenAI models that focus on generating new text content. At the same time, there is an upsurge of dense, high-dimensional, billion-scale vector data from deep learning models that embed complex data, e.g., text, multimedia, graphs, and tables, into vector representations that aim to preserve semantic similarity. Since LLMs operate on vector data at various stages, namely pre-training, fine-tuning, inference, and retrieval-augmented generation (RAG), coupling large language models with vector data management is essential for enhancing data science services with cross-modal data querying and generation. This coupling creates new opportunities and challenges in areas such as accuracy, consistency, efficiency, scalability, privacy, fairness, explainability, data regulations, software-hardware collaboration, and cloud-native systems. The workshop aims to advance the understanding of how LLMs and vector data management can cooperatively contribute to data science solutions.

External IDs: dblp:conf/icde/KhanWZTO25