Language-Model Based Informed Partition of Databases to Speed Up Pattern Mining

Published: 01 Jan 2024, Last Modified: 16 Feb 2025 · Proc. ACM Manag. Data 2024 · CC BY-SA 4.0
Abstract: Extracting interesting patterns from data is the main objective of Data Mining. In this context, Frequent Itemset Mining has proven useful for providing insights from transactional databases, which can in turn be used to understand the structure of Knowledge Graphs. Despite many advances in the field, the NP-hard nature of the problem means that the main approaches still struggle with large databases that have large, sparse vocabularies, such as those obtained from graph propositionalizations. Parallel algorithms have been proposed, but so far none has targeted this source of complexity (i.e., vocabulary size). In this paper, we therefore propose to parallelize frequent itemset mining algorithms by partitioning the database horizontally (i.e., transaction-wise) without neglecting the available vertical information (i.e., item-wise). Instead of relying on pure item co-appearance metrics, we advocate a different approach: modeling databases as documents, where each transaction is a sentence and each item a word. This allows us to apply recent language modeling techniques (i.e., word embeddings) to obtain a continuous representation of the database, cluster it into partitions, and apply any mining algorithm to them. We show how our proposal leads to informed partitions with reduced vocabulary size and reduced entropy (i.e., disorder), which enhances scalability and speeds up mining even in very large databases with sparse vocabularies. A thorough experimental evaluation over both synthetic and real datasets demonstrates the benefits of our proposal.
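The pipeline the abstract describes — embed items as words, represent each transaction as a sentence, cluster the continuous representations, and mine each partition separately — can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: it substitutes a co-occurrence/SVD embedding for the language-model embeddings the paper uses, a minimal k-means for the clustering step, and a made-up seven-transaction database; all function names (`embed_items`, `embed_transactions`, `kmeans`) are hypothetical.

```python
import numpy as np

def embed_items(transactions, dim=2):
    """Item vectors from a co-occurrence matrix via truncated SVD
    (a simple stand-in for the word-embedding step in the paper)."""
    vocab = sorted({item for t in transactions for item in t})
    idx = {item: i for i, item in enumerate(vocab)}
    cooc = np.zeros((len(vocab), len(vocab)))
    for t in transactions:
        for a in t:
            for b in t:
                if a != b:
                    cooc[idx[a], idx[b]] += 1.0
    u, s, _ = np.linalg.svd(cooc)
    return vocab, u[:, :dim] * s[:dim]

def embed_transactions(transactions, vocab, item_vecs):
    """Each transaction ('sentence') is the mean of its item ('word') vectors."""
    idx = {item: i for i, item in enumerate(vocab)}
    return np.array([item_vecs[[idx[i] for i in t]].mean(axis=0)
                     for t in transactions])

def kmeans(points, k, iters=10):
    """Plain Lloyd's algorithm with deterministic farthest-point init."""
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(axis=2),
                           axis=1)
        centers = np.array([points[labels == j].mean(axis=0)
                            if (labels == j).any() else centers[j]
                            for j in range(k)])
    return labels

# Hypothetical toy database: two transaction groups with disjoint vocabularies.
db = [["a", "b", "c"], ["a", "b"], ["a", "b"], ["b", "c"],
      ["x", "y", "z"], ["x", "y"], ["y", "z"]]
vocab, item_vecs = embed_items(db)
labels = kmeans(embed_transactions(db, vocab, item_vecs), k=2)
partitions = [[t for t, l in zip(db, labels) if l == j] for j in range(2)]
# Each partition covers fewer items than the full database, so any
# off-the-shelf frequent-itemset miner can run on it independently.
```

The point of the sketch is the final property: because transactions landing in the same cluster tend to share items, each partition has a smaller, denser vocabulary than the whole database, which is exactly what makes per-partition mining cheaper and parallelizable.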