Learning from Imbalanced Data: Application to Bank Fraud Detection (Apprentissage dans un contexte déséquilibré : une application à la détection de fraude)

Published: 01 Jan 2019, Last Modified: 07 Dec 2023
Abstract: Fraud and anomaly detection, or more generally learning in an imbalanced context, is a task very often encountered in industrial applications. Detecting these anomalies is a major challenge in today's society because of their potential economic consequences. BLITZ Business Services faces this type of problem in the fight against check fraud: frauds represent 0.4% of transactions but millions of euros of losses per year for its customers. Dealing with fraud data, and more generally with imbalanced data, is a difficult task for most current learning algorithms because frauds are heavily under-represented relative to non-frauds. The techniques are as diverse as the frauds encountered, ranging from sampling strategies, representation learning, and the optimization of measures suited to an imbalanced context, to classification algorithms that combine the advantages of several of the former. This thesis is deliberately eclectic, like the state of the art it builds on, and is organized around two main axes: (i) a geometric approach, in which we propose metric learning algorithms for classification, and (ii) a cost-sensitive approach, which we use for both theoretical and practical purposes.

Our first contribution learns local models around known frauds in order to build risky areas. It rests on the assumption that a new fraud is very likely to occur in the neighborhood of a known one. A theoretical study accompanies the algorithm to guarantee that the number of false positives it generates remains controlled.

In our second contribution, we propose a version of the k-Nearest Neighbors algorithm adapted to the imbalanced context. We analyze how the distance from a new query to a fraud should be modified, with the modification tuned by cross-validation, in order to optimize a measure suited to this context: the F-measure.

This measure is at the heart of our third contribution, which is mainly theoretical. We derive a bound on the optimal F-measure from the pseudo-linearity of this measure, the errors made by the learned hypotheses, and a cost-sensitive approach. The resulting bounds are then used to build an iterative algorithm for optimizing the F-measure that is at least as efficient as its competitors.

Our fourth and final contribution is industrial: it combines tree-based models and cost sensitivity to improve BLITZ's existing system by offering a profit-optimization system for its customers.
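The risky areas of the first contribution can be pictured as regions surrounding known frauds inside which new transactions are treated as suspicious. A minimal sketch, assuming a fixed-radius ball around each known fraud as a stand-in for the local models actually learned in the thesis:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def risky_area_flags(known_frauds, X_query, radius=1.0):
        """Flag queries that fall inside a ball centered on any known fraud.
        The fixed radius is an illustrative assumption; the thesis learns
        local models around each fraud instead."""
        nn = NearestNeighbors(n_neighbors=1).fit(known_frauds)
        dist, _ = nn.kneighbors(X_query)  # distance to the closest known fraud
        return (dist.ravel() <= radius).astype(int)  # 1 = inside a risky area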
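The F-measure, central to the second and third contributions, is the harmonic-mean trade-off between precision and recall. Writing P for the number of actual frauds and TP, FP, FN for the true positive, false positive, and false negative counts of a hypothesis:

$$ F_\beta \;=\; \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP} \;=\; \frac{(1+\beta^2)\,(P - FN)}{(1+\beta^2)\,P - FN + FP}. $$

The second form is a ratio of two affine functions of the error counts (FP, FN); this is the pseudo-linearity property from which the third contribution derives its bound on the optimal F-measure.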
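The abstract does not spell out how the query-to-fraud distance is modified in the second contribution. A minimal sketch of one plausible instantiation, assuming distances to fraud examples are shrunk by a factor gamma chosen by cross-validation to maximize the F-measure:

    import numpy as np
    from sklearn.metrics import f1_score
    from sklearn.model_selection import StratifiedKFold

    def gamma_knn_predict(X_train, y_train, X_query, k=3, gamma=0.5):
        """k-NN prediction after shrinking distances to frauds (label 1) by gamma."""
        preds = []
        for x in X_query:
            d = np.linalg.norm(X_train - x, axis=1)
            d[y_train == 1] *= gamma                 # frauds appear closer than they are
            nn = np.argsort(d)[:k]
            preds.append(int(2 * y_train[nn].sum() > k))  # majority vote
        return np.array(preds)

    def tune_gamma(X, y, k=3, grid=np.linspace(0.1, 1.0, 10), n_splits=5):
        """Pick the gamma maximizing cross-validated F1, as the abstract suggests."""
        best_gamma, best_f1 = 1.0, -1.0
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
        for g in grid:
            scores = [f1_score(y[te], gamma_knn_predict(X[tr], y[tr], X[te], k=k, gamma=g))
                      for tr, te in skf.split(X, y)]
            if np.mean(scores) > best_f1:
                best_gamma, best_f1 = g, float(np.mean(scores))
        return best_gamma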
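For the fourth contribution, the abstract only states that tree-based models and cost sensitivity are combined into a profit-optimization system. A minimal sketch of such a decision layer, assuming an illustrative cost model (a caught fraud saves its amount, every alert costs a fixed investigation fee) that is not BLITZ's actual one:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def best_profit_threshold(y_true, proba, amounts, investigation_cost=10.0):
        """Scan alert thresholds on fraud probabilities and keep the most profitable."""
        thresholds = np.unique(proba)
        profits = []
        for t in thresholds:
            alert = proba >= t
            saved = amounts[alert & (y_true == 1)].sum()   # fraud amounts recovered
            cost = investigation_cost * alert.sum()        # fee paid for every alert
            profits.append(saved - cost)
        return thresholds[int(np.argmax(profits))]

    # Usage: fit any tree-based model, then calibrate the threshold on held-out data.
    # model = RandomForestClassifier(class_weight="balanced").fit(X_train, y_train)
    # t = best_profit_threshold(y_val, model.predict_proba(X_val)[:, 1], amounts_val)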