Abstract: In many domains, high-quality data are used as a foundation for decision-making. An essential component of data quality assessment is anomaly detection. We describe and empirically evaluate the design and implementation of a framework for data quality testing over real-world streams in a large-scale telecommunication network. The approach is both general—relying on general-purpose measures borrowed from information theory and statistics—and scalable—using anomaly detection pipelines executed in a distributed setting over state-of-the-art big data streaming and batch processing infrastructures. We empirically evaluate our system and discuss its merits and limitations by comparing it to existing anomaly detection techniques, demonstrating its high accuracy and efficiency, as well as its scalability when parallelizing operations across a large number of nodes.