Abstract: Most current knowledge discovery systems use only attribute-value information. However, relational information between objects is also an important part of the knowledge hidden in today’s databases. Two domains rich in such information are chemical structures and domains where objects are related in space and time. Inductive Logic Programming (ILP) discovery systems handle relational data, but require the data to be expressed in a subset of first-order logic. We are investigating the application of the graph-based relational discovery system SUBDUE (Cook, Holder, and Djoko 1996) in structural domains. Input to SUBDUE is a graph with labeled vertices and directed or undirected labeled edges. SUBDUE performs a beam search of the space of all possible subgraphs of the input graph. The search is guided by the minimum description length (MDL) principle, looking for subgraphs (substructures) with many instances that can be used to compress the original data and represent structural knowledge. We applied SUBDUE to the task of identifying structural patterns that distinguish carcinogenic from noncarcinogenic chemical compounds available from the National Toxicology Program (ntpserver.niehs.nih.gov). Each atom in a compound is represented as a vertex with directed edges to other vertices, where the edge labels specify whether the target vertex holds the atom name, type, or partial charge. Bonds between atoms are represented as undirected edges between the atom vertices. We divided the data into a training set (268 compounds) and a testing set (30 compounds). SUBDUE found a substructure containing a bromine atom that occurred in 134 of the 143 carcinogenic training compounds and in only 24 of the 125 noncarcinogenic training compounds. This same substructure was found in 15 of the 19 carcinogenic testing compounds and in only 4 of the 11 noncarcinogenic testing compounds. The results are similar to those of ILP systems like PROGOL (Srinivasan et al. 1997).
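To make the reported counts easier to compare across the two splits, the following minimal Python sketch converts them into rates. It assumes, purely for illustration, that the presence of the substructure is read as a prediction of carcinogenicity; the abstract reports only the raw counts, so the accuracy figure is a derived quantity and the names `counts` and `summarize` are hypothetical.

```python
# Counts taken from the abstract: (compounds containing the substructure, total compounds).
counts = {
    "train": {"carcinogenic": (134, 143), "noncarcinogenic": (24, 125)},
    "test": {"carcinogenic": (15, 19), "noncarcinogenic": (4, 11)},
}


def summarize(split):
    """Turn raw occurrence counts into coverage rates and an implied accuracy.

    Assumption (not from the abstract): a compound containing the
    substructure is predicted carcinogenic, and all others are predicted
    noncarcinogenic.
    """
    pos_hit, pos_total = split["carcinogenic"]
    neg_hit, neg_total = split["noncarcinogenic"]
    true_pos = pos_hit                # carcinogenic compounds containing the pattern
    true_neg = neg_total - neg_hit    # noncarcinogenic compounds lacking the pattern
    return {
        "coverage_pos": pos_hit / pos_total,
        "coverage_neg": neg_hit / neg_total,
        "accuracy": (true_pos + true_neg) / (pos_total + neg_total),
    }


for name, split in counts.items():
    stats = summarize(split)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Under this reading, the substructure covers a much larger share of the carcinogenic compounds than of the noncarcinogenic ones in both splits, with the gap narrower on the small testing set.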
We are experimenting with a new concept-learning version of SUBDUE that finds substructures compressing the positive data without compressing the negative data. Preliminary results show that the new version is competitive with the predominantly concept-learning ILP systems.
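The compound encoding described in the abstract can be sketched as a small labeled graph. This is an illustrative Python sketch only, not SUBDUE's actual input format: the `Graph` class, the `add_atom` and `add_bond` helpers, the vertex label "atom", the edge labels "name", "type", "charge", and "bond", and the atom type and charge values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Labeled graph with a mix of directed and undirected edges."""
    vertices: dict = field(default_factory=dict)  # vertex id -> label
    edges: list = field(default_factory=list)     # (src, dst, label, directed)

    def add_vertex(self, label):
        vertex_id = len(self.vertices) + 1
        self.vertices[vertex_id] = label
        return vertex_id

    def add_edge(self, src, dst, label, directed):
        self.edges.append((src, dst, label, directed))


def add_atom(graph, name, atom_type, charge):
    """Add one atom vertex with directed edges to its name, type, and charge."""
    atom = graph.add_vertex("atom")
    for edge_label, value in (("name", name), ("type", atom_type), ("charge", charge)):
        graph.add_edge(atom, graph.add_vertex(value), edge_label, directed=True)
    return atom


def add_bond(graph, atom_a, atom_b, label="bond"):
    """Add an undirected edge between two atom vertices."""
    graph.add_edge(atom_a, atom_b, label, directed=False)


# Hypothetical two-atom fragment; the type and charge values are made up.
g = Graph()
carbon = add_atom(g, "c", "10", "-0.1")
bromine = add_atom(g, "br", "94", "-0.2")
add_bond(g, carbon, bromine)

print(len(g.vertices), "vertices,", len(g.edges), "edges")
```

Keeping the atom attributes as separate vertices, rather than folding them into a single vertex label, means a discovered substructure can mention only some attributes of an atom (for example the name without the partial charge).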