The BTWorld use case for big data analytics: Description, MapReduce logical workflow, and empirical evaluation

Tim Hegeman, Bogdan Ghit, Mihai Capota, Jan Hidders, Dick H. J. Epema, Alexandru Iosup

2013 (modified: 26 Aug 2022)IEEE BigData 2013Readers: Everyone

Abstract: The commoditization of big data analytics, that is, the deployment, tuning, and future development of big data processing platforms such as MapReduce, relies on a thorough understanding of relevant use cases and workloads. In this work we propose BTWorld, a use case for time-based big data analytics that is representative for processing data collected periodically from a global-scale distributed system. BTWorld enables a data-driven approach to understanding the evolution of BitTorrent, a global file-sharing network that has over 100 million users and accounts for a third of today's upstream traffic. We describe for this use case the analyst questions and the structure of a multi-terabyte data set. We design a MapReduce-based logical workflow, which includes three levels of data dependency - inter-query, inter-job, and intra-job - and a query diversity that make the BTWorld use case challenging for today's big data processing tools; the workflow can be instantiated in various ways in the MapReduce stack. Last, we instantiate this complex workflow using Pig-Hadoop-HDFS and evaluate the use case empirically. Our MapReduce use case has challenging features: small (kilobytes) to large (250 MB) data sizes per observed item, excellent (10 <sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">-6</sup> ) and very poor (10 <sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sup> ) selectivity, and short (seconds) to long (hours) job duration.

0 Replies