A semantics-aware storage framework for scalable processing of knowledge graphs on Hadoop

HyeongSik Kim, Padmashree Ravindra, Kemafor Anyanwu

Published: 2017, Last Modified: 14 Oct 2023 | IEEE BigData 2017 | Readers: Everyone

Abstract: Knowledge graphs are graph-based data models that employ named nodes and edges to capture differentiation among entities and relationships in richly diverse data collections, such as those in the biomedical domain. The flexibility of knowledge graphs allows heterogeneous collections to be linked and integrated in precise ways. However, the resulting data models often have irregular structures that are not easy to manage using platforms for structured, schema-first data models like the relational model. To facilitate exchange, interoperability and reuse of data, standards such as the Resource Description Framework (RDF) have been increasingly adopted for representation. Domains such as biomedicine now have large collections of publicly available RDF graphs as well as benchmark workloads. To achieve scalability in data processing, some efforts are being made to build on distributed processing platforms such as Hadoop and Spark. However, while some distributed graph platforms have emerged for certain classes of mining workloads for non-semantic graphs (without typed edges and nodes), knowledge graph processing, which often involves ontological inferencing, continues to be plagued by scalability and efficiency challenges. In this paper, we present the design of a Hadoop-based storage architecture for knowledge graphs that overcomes some of the challenges of big RDF data processing. The rationale of the design strategy is to go beyond the traditional approach of exploiting only the structural properties of graphs during storage and to also exploit the semantic properties of knowledge graphs. Our system SemStorm is a Hadoop-based indexed, polymorphic, signatured file organization that supports efficient storage of data collections with significant data heterogeneity. Naive storage models for such data place more demands on metadata management than traditional systems can support.
The polymorphic file organization is further coupled with a nested, column-oriented file format to enable discriminatory data access based on queries. A major hallmark of SemStorm is that it enables semantic awareness in the storage framework. The idea is to exploit the knowledge represented in the ontologies that accompany data to optimize data storage models, for example by identifying and managing (sometimes implicit) data redundancies. Another major advantage of SemStorm is that it derives optimized storage models for data autonomically, i.e., without user input. Extensive experiments conducted on real-world and synthetic benchmark datasets show that SemStorm is up to 10X faster than existing approaches.
