Data and Resources Paper: A Multi-granularity Decade-Long Geo-Tagged Twitter Dataset for Spatial Computing

Published: 01 Jan 2023, Last Modified: 14 May 2025, Venue: SIGSPATIAL/GIS 2023, License: CC BY-SA 4.0
Abstract: This paper presents a publicly accessible, large-scale geo-tagged Twitter dataset comprising 95.8 million tweets from 247 countries and spanning Jan. 2012 to Dec. 2021. To systematically extract this dataset from more than 57.18 TB of raw tweets, we employed parallel computing on a 40-node cluster with 480 CPU cores. Unlike most existing Twitter datasets, ours includes tweet locations at four levels of granularity, user profile locations at two levels of granularity, and tweet text languages, enabling personalized queries. To make the dataset openly accessible, we have designed an interactive online query system (https://sigspatial.yunhefeng.me) and provide free-to-use JSON APIs (https://github.com/ResponsibleAILab/unt-geotweet-api) for customized queries that retrieve tweet IDs in three modes: tweet coordinate, tweet text-based location, and user location. Users can then use https://github.com/ResponsibleAILab/unt-tweet-rehydration to download the complete tweet information. Furthermore, we demonstrate the practical utility of the dataset through two applications: human movement modeling and geo-aware Large Language Model (LLM) tuning. Our geo-tagged Twitter dataset, together with the accompanying query system and APIs, contributes to the research community and opens up avenues for multidisciplinary investigations and the advancement of knowledge.
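As an illustration of the kind of per-tweet filtering such an extraction involves, the following is a minimal sketch, not the authors' pipeline. It assumes raw tweets are stored one JSON object per line in the Twitter API v1.1 format, where `coordinates`, `place`, `lang`, and `user.location` are the relevant fields. The function name `extract_geo` and the keys of the returned dictionary are hypothetical, and the sketch does not attempt to reproduce the four granularity levels mentioned in the abstract.

```python
import json
from typing import Optional


def extract_geo(raw_line: str) -> Optional[dict]:
    """Return the geo-related fields of one raw tweet, or None if it has no geotag.

    Assumes the Twitter API v1.1 tweet object layout (an assumption, not
    something stated in the abstract).
    """
    tweet = json.loads(raw_line)
    coords = tweet.get("coordinates")  # GeoJSON point: [longitude, latitude]
    place = tweet.get("place")         # Twitter place object, may be None

    # Skip tweets that carry neither an exact point nor a tagged place.
    if coords is None and place is None:
        return None

    lon, lat = coords["coordinates"] if coords else (None, None)
    return {
        "id": tweet["id_str"],
        "lon": lon,
        "lat": lat,
        "place_type": place.get("place_type") if place else None,
        "country_code": place.get("country_code") if place else None,
        "lang": tweet.get("lang"),
        "user_location": (tweet.get("user") or {}).get("location"),
    }


if __name__ == "__main__":
    # Made-up example record, for illustration only.
    sample = json.dumps({
        "id_str": "1",
        "lang": "en",
        "coordinates": {"type": "Point", "coordinates": [-97.13, 33.21]},
        "place": {"place_type": "city", "country_code": "US"},
        "user": {"location": "Denton, TX"},
    })
    print(extract_geo(sample))
```

Because the function depends only on a single input line, it could be mapped over many files independently, which is one reason this style of filtering suits a multi-node cluster.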