Abstract: The study of online social networks has become a major research topic over the last decade, and many aspects of these networks and the behavior of their users have been investigated. Most research effort has been directed at Twitter, which grants researchers limited data access and provides detailed information. Recently, however, important social and economic phenomena such as WallStreetBets and Antiwork have originated on Reddit, which has thus become an important field of investigation in its own right. Due to its open nature, all Reddit data is available for study. As a consequence, in contrast to Twitter, where it is difficult to obtain large amounts of data, the main challenge in researching Reddit is handling the vast amounts of data that are freely available. Here, we present the Reddit Dataset Stream Pipeline (RDSP), a simple and efficient parallel system based on Akka Streams that is capable of processing the entire Reddit dataset. We demonstrate how to build massive temporal graphs between subreddits from a parallel streamed dataset. We investigate the generated graphs and present experimental results. Moreover, we publish both the datasets and the codebase to invite researchers from different fields to contribute to and profit from this work.