Abstract: This article develops Hercules, a system for entity resolution (ER), conflict resolution (CR), timeliness deduction (TD), and missing value/link imputation (MI) in graphs. It proposes GCR+s, a class of graph cleaning rules (GCR) that support not only predicates for ER and CR but also temporal orders to deduce timeliness and data extraction to impute missing data. As opposed to previous graph rules, GCR+s are defined with a dual graph pattern to accommodate irregular structures of schemaless graphs and adopt patterns of a star form to reduce the complexity. We show that while the implication and satisfiability problems are intractable for GCR+s, it is in polynomial time to detect and correct errors with GCR+s. Underlying Hercules, we train a ranking model to predict the temporal orders on attributes and embed it as a predicate of GCR+s. We provide an algorithm for discovering GCR+s by combining the generations of patterns and predicates. We also develop a method for conducting ER, CR, TD, and MI in the same process to improve the overall quality of graphs by leveraging their interactions and chasing with GCR+s; we show that the method has the Church–Rosser property under certain conditions. Using real-life and synthetic graphs, we empirically verify that Hercules is 53% more accurate than the state-of-the-art graph cleaning systems and performs comparably in efficiency and scalability.
Loading