Finally, a Downloadable Test Collection of Tweets

2017 (modified: 12 Nov 2022), SIGIR 2017, Readers: Everyone
Abstract: Because Twitter's terms of service forbid redistribution of content, creating publicly downloadable collections of tweets for research purposes has been a perpetual problem for the research community. Some collections are distributed by making available the ids of the tweets that comprise the collection and providing tools to fetch the actual content; this approach has scalability limitations. In other cases, evaluation organizers have set up APIs that provide access to collections for specific tasks, without exposing the underlying content. This is a workable solution, but it is difficult to sustain over the long term since someone has to maintain the APIs. We have noticed that the non-profit Internet Archive has been making available for public download captures of the so-called Twitter "spritzer" stream, which is the same source as the Tweets2013 collection used in the TREC 2013 and 2014 Microblog Tracks. We analyzed both datasets in terms of content overlap and retrieval baselines to show that the Internet Archive data can serve as a drop-in replacement for the Tweets2013 collection, thereby providing the research community with, finally, a downloadable collection of tweets. Beyond this finding, we also study the impact of tweet deletions over time and how they affect the test collections.