Efficient Permutation Testing for Significant Sequential Patterns

Sam Pinxteren, Toon Calders

2021 (modified: 13 Jan 2022), SDM 2021, Readers: Everyone

Abstract: Mining frequent patterns in sequential data to better understand the dynamics of the sequences often yields too many patterns for the output to be interpretable. To contain this overload, methods for filtering sequential patterns on surprisingness or informativeness have been developed. As our experiments show, however, these methods are often biased towards short patterns, and most are not based on solid statistical grounds. In this paper, we propose a new way to test the significance of sequential patterns. The test computes a p-value for the support of a given sequential pattern under a null model that randomly permutes all sequences of the database. This p-value can be used, for instance, to filter out sequential patterns whose frequencies can be attributed to bursts; that is, a few sequences in which the frequency of certain items is much higher without necessarily introducing meaningful sequential patterns. The main contribution of our paper is an efficient algorithm to compute the probability that a given subsequence appears in a sequence when that sequence is randomly permuted. From this probability we derive p-values for the support of the subsequence (or pattern). We perform quantitative experiments on synthetic data that confirm the superiority of the permutation-based significance test in dealing with long patterns and bursts, and qualitative experiments on well-known textual benchmarks showing that our method produces a natural pattern ranking.
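To make the two quantities in the abstract concrete, here is a minimal Python sketch. It is not the paper's algorithm: the occurrence probability is computed by brute-force enumeration of all permutations, which is only feasible for very short sequences, whereas the paper's contribution is an efficient method. The p-value step further assumes that sequences are permuted independently, so that the support under the null model is a sum of independent Bernoulli variables (a Poisson binomial distribution); this is an assumption of the sketch and may differ from the paper's derivation. The function names are hypothetical.

```python
from itertools import permutations


def contains_subsequence(sequence, pattern):
    """Return True if `pattern` occurs in `sequence` as a (gapped) subsequence."""
    it = iter(sequence)
    # `item in it` advances the iterator, so items must match in order.
    return all(item in it for item in pattern)


def occurrence_probability(sequence, pattern):
    """Chance that `pattern` is a subsequence of a uniformly random permutation
    of `sequence`. Brute force over all n! orderings (illustration only)."""
    total = 0
    hits = 0
    for perm in permutations(sequence):
        total += 1
        hits += contains_subsequence(perm, pattern)
    return hits / total


def support_p_value(probabilities, observed_support):
    """P(support >= observed_support), assuming each sequence contains the
    pattern independently with the given probability (Poisson binomial)."""
    dist = [1.0]  # dist[k] = probability that exactly k sequences match so far
    for p in probabilities:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1 - p)
            new[k + 1] += mass * p
        dist = new
    return sum(dist[observed_support:])


# Toy database: per-sequence probabilities feed the support p-value.
database = ["aab", "ab", "ba"]
probs = [occurrence_probability(seq, "ab") for seq in database]
p_value = support_p_value(probs, observed_support=2)
```

In this toy database the observed support of "ab" is 2, since it occurs in "aab" and "ab" but not in "ba". A small p-value would indicate that the pattern occurs in more sequences than the item frequencies alone would explain.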
