Abstract: A wide range of applications use semi-structured data. Such data are heterogeneous and schema-less, i.e., they do not follow a predefined schema. This lack of structure makes the data difficult to use, since many applications depend on a schema to perform their tasks. We therefore propose CoFFee, a schema mining approach that, given a set of heterogeneous schemas, provides a summarized schema containing a set of core attributes. To this end, CoFFee uses a strategy that combines the co-occurrence and frequency of attributes: it models a set of entity schemas as a graph and uses centrality metrics to capture the co-occurrence between attributes. We evaluated CoFFee on data extracted from six DBpedia classes and compared it with two state-of-the-art approaches. The results show that CoFFee produces a summarized schema of good quality, outperforming the baselines by an average of 22% in F1 score.
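The general idea of combining attribute frequency with graph-based co-occurrence can be illustrated with a minimal sketch. This is not the CoFFee algorithm itself: the abstract does not specify which centrality metrics are used or how they are combined with frequency, so the weighted degree centrality, the linear mix controlled by `alpha`, the `threshold` cutoff, the function name `summarize_schema`, and the toy attribute names are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def summarize_schema(schemas, alpha=0.5, threshold=0.5):
    """Return a set of "core" attributes from a list of entity schemas.

    Illustrative only: scores each attribute by a linear mix of its
    relative frequency and its normalized weighted degree in the
    attribute co-occurrence graph (both choices are assumptions).
    """
    n = len(schemas)
    freq = Counter()   # attribute -> number of schemas containing it
    cooc = Counter()   # (attr_a, attr_b) -> number of schemas containing both

    for schema in schemas:
        attrs = sorted(set(schema))
        freq.update(attrs)
        # Each pair of attributes in the same schema is an edge in the graph.
        for a, b in combinations(attrs, 2):
            cooc[(a, b)] += 1

    # Weighted degree centrality: sum of edge weights incident to a node.
    degree = Counter()
    for (a, b), weight in cooc.items():
        degree[a] += weight
        degree[b] += weight
    max_degree = max(degree.values(), default=0)

    scores = {}
    for attr, count in freq.items():
        rel_freq = count / n
        centrality = degree[attr] / max_degree if max_degree else 0.0
        scores[attr] = alpha * rel_freq + (1 - alpha) * centrality

    return {attr for attr, score in scores.items() if score >= threshold}


# Hypothetical entity schemas (attribute sets) for a single class.
example_schemas = [
    {"name", "birthDate", "birthPlace"},
    {"name", "birthDate"},
    {"name", "team"},
    {"name", "birthDate", "height"},
]
core = summarize_schema(example_schemas)
print(sorted(core))  # → ['birthDate', 'name']
```

In this toy input, `name` and `birthDate` are both frequent and strongly connected in the co-occurrence graph, so they survive the cutoff, while attributes that appear in a single schema do not.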
External IDs:dblp:conf/sbbd/NetoMBS22