Abstract: Data mining on large relational databases has gained popularity, and its significance is well recognized. However, the performance of SQL-based data mining is known to fall behind that of specialized implementations, owing to the prohibitive cost of extracting knowledge and the lack of suitable declarative query language support. Frequent pattern mining is the foundation of several essential data mining tasks. These facts motivated us to develop original SQL-based approaches for mining frequent patterns. In this work, we investigate SQL-based approaches to the problem of finding frequent patterns in a transaction table. Most existing SQL-based approaches are Apriori-like. However, such methods may perform poorly because of the costly candidate-generation-and-test operation, especially when mining datasets with prolific and/or long patterns. We develop a class of efficient SQL-based pattern growth methods for mining frequent patterns. What these approaches have in common is that they use a divide-and-conquer method to decompose the mining task and then a pattern growth method to avoid the combinatorial problem inherent in the candidate-generation-and-test approach. SQL-based Apriori algorithms require either several scans over the data or many complex joins between the input tables. Our SQL-based algorithms, in contrast, avoid both multiple passes over the large original input table and complex joins between tables. A comprehensive performance study, conducted on a DBMS (IBM DB2 UDB EEE V8), compares SQL-based frequent pattern mining approaches based on Apriori with the approaches proposed in this thesis. The empirical results show that our algorithms perform efficiently.
Moreover, since most major database systems have recently added support for parallelization, this thesis examines how efficiently SQL-based frequent pattern mining can be parallelized and sped up on parallel database systems.