Efficient processing of SPARQL queries over GraphFrames

Ramazan Ali Bahrami, Jayati Gulati, Muhammad Abulaish

Published: 2017, Last Modified: 06 Jan 2026, WI 2017, CC BY-SA 4.0

Abstract: As large-scale data management systems store ever more voluminous data, there is a growing need for efficient data analytics techniques that support knowledge discovery at different levels of granularity. The Resource Description Framework (RDF), developed mainly for the Semantic Web, is arguably a good option for graph databases that handle large volumes of real-world data. RDF models information as triples <subject, predicate, object> and is a useful way to store graph data (also known as linked data), with each edge stored as one triple. Because a large amount of linked data exists, mostly in the form of graphs, graph mining has attracted researchers from different fields who work on the efficient handling (storage, indexing, retrieval, etc.) of graph data. As a result, APIs such as GraphX and GraphFrames have been developed to facilitate relational queries over graph data. Although GraphX is older than GraphFrames, and SPARQL query processing over GraphX has been explored by some researchers, to the best of our knowledge SPARQL query processing over GraphFrames has not yet been explored. In this paper, we present an initial study of a query-specific search space pruning and query optimization approach for processing SPARQL queries efficiently over GraphFrames. The experimental results, in terms of low query response time, are encouraging and motivate further research in this direction.
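To make the ideas in the abstract concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation and does not use GraphFrames. It only illustrates how RDF triples can be read as labelled edges, and how a query-specific pruning step can shrink the search space before a basic graph pattern is matched. The sample triples, the pattern, and the helper names `prune` and `match` are all hypothetical, and the sketch assumes that every predicate in the query is a constant.

```python
# Hypothetical toy data: each RDF triple <subject, predicate, object>
# is treated as one labelled edge (subject -> object, labelled by predicate).
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
    ("carol", "worksAt", "acme"),
    ("bob", "livesIn", "delhi"),
]

# A basic graph pattern, analogous to the SPARQL query
#   SELECT ?x ?y WHERE { ?x knows ?y . ?y worksAt acme }
# Terms starting with "?" are variables; predicates are assumed constant.
pattern = [("?x", "knows", "?y"), ("?y", "worksAt", "acme")]


def prune(triples, pattern):
    """Keep only the triples whose predicate occurs in the query."""
    predicates = {p for _, p, _ in pattern}
    return [t for t in triples if t[1] in predicates]


def match(triples, pattern):
    """Naive nested-loop matching of the pattern; returns variable bindings."""
    bindings = [{}]
    for s, p, o in pattern:
        extended = []
        for binding in bindings:
            for ts, tp, to in triples:
                if tp != p:
                    continue
                candidate = dict(binding)
                consistent = True
                for term, value in ((s, ts), (o, to)):
                    if term.startswith("?"):
                        # Bind the variable, or check an existing binding.
                        if candidate.setdefault(term, value) != value:
                            consistent = False
                            break
                    elif term != value:
                        # A constant term must equal the triple's value.
                        consistent = False
                        break
                if consistent:
                    extended.append(candidate)
        bindings = extended
    return bindings


pruned = prune(triples, pattern)
results = match(pruned, pattern)
print(len(triples), "triples before pruning,", len(pruned), "after")
print(results)
```

Pruning does not change the answer here; it only reduces the number of edges the matcher has to scan. In a GraphFrames setting, the analogous step would presumably be a filter on the edge DataFrame before pattern matching, but that mapping is an assumption and is not taken from the paper.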