Keywords: Large Language Models, Graph Representation Learning, Vector Tokenization, Scalable Graph Representation
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in diverse language-centric tasks, yet their application to structured graph data presents unique challenges, particularly in efficiently tokenizing graph elements. While graphs offer powerful structural representations, existing methods for interfacing them with LLMs, such as creating distinct token embeddings for every node, face significant scalability limitations: the input vocabulary for the LLM grows linearly with the number of nodes, hindering applicability to large-scale graphs. Drawing inspiration from vector quantization's success in compressing information in domains like audio and vision, we introduce a novel approach to representing graph node features for LLMs. Our method, GraphQ-LM, employs Residual Vector Quantization (RVQ) to encode continuous node features into a compact sequence of discrete tokens drawn from fixed-size codebooks. These "graph tokens," which carry structural feature information, are seamlessly integrated with the textual attributes of nodes and their neighborhoods, forming a rich, multimodal input for the LLM. By aligning the codebook's embedding dimension with that of the LLM and jointly training the RVQ module with the LLM, we learn graph-aware representations optimized for downstream tasks such as node classification. Extensive experiments demonstrate that GraphQ-LM not only achieves state-of-the-art performance but also offers a scale-free tokenization strategy whose vocabulary size does not grow with the number of nodes.
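As a rough illustration (not the paper's implementation), the sketch below shows how residual vector quantization can map a continuous node-feature vector to a short sequence of discrete codes drawn from fixed-size codebooks, the mechanism the abstract describes. The class name `ResidualVectorQuantizer`, the parameters `num_stages`, `codebook_size`, and `dim`, and the PyTorch framing are assumptions made for this example only.

```python
import torch
import torch.nn as nn


class ResidualVectorQuantizer(nn.Module):
    """Illustrative RVQ: each stage quantizes the residual left by the
    previous stage, producing one discrete code per stage."""

    def __init__(self, num_stages: int, codebook_size: int, dim: int):
        super().__init__()
        # One fixed-size codebook per stage; `dim` would be chosen to
        # match the LLM's embedding dimension.
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_stages)]
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, dim) continuous node features
        residual = x
        quantized = torch.zeros_like(x)
        codes = []
        for codebook in self.codebooks:
            # Nearest codeword to the current residual.
            dists = torch.cdist(residual, codebook.weight)  # (batch, codebook_size)
            idx = dists.argmin(dim=-1)                       # (batch,)
            selected = codebook(idx)                         # (batch, dim)
            quantized = quantized + selected
            residual = residual - selected
            codes.append(idx)
        # `codes`: compact sequence of discrete graph tokens per node.
        # `quantized`: reconstructed embedding that could be fed to the LLM.
        return torch.stack(codes, dim=-1), quantized


# Hypothetical usage: 8 nodes with 4096-dim features, 4 codes per node.
rvq = ResidualVectorQuantizer(num_stages=4, codebook_size=256, dim=4096)
node_feats = torch.randn(8, 4096)
codes, recon = rvq(node_feats)   # codes: (8, 4), recon: (8, 4096)
```

In practice, jointly training such a quantizer with an LLM would also require a straight-through gradient estimator and commitment/codebook losses, which are omitted here for brevity.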
Primary Area: learning on graphs and other geometries & topologies
Submission Number: 22086