Abstract: Social media data is often used to gauge the opinion of online communities by predicting either sentiment or stances (e.g., political), to mention just two typical use cases. However, such analyses assume that the data samples truly represent the demographics of the overall community, in both number and characteristics, which in most cases is not true. As a result, extrapolating these results to larger populations usually does not work. This happens because social media data is inherently biased, mainly due to two facts: (1) not all people are equally active on social media platforms, and most of them are passive; and (2) there are demographic biases in gender and age, among other attributes. Hence, the questions of how representative the data is and whether it is possible to make it representative are of crucial importance. We also discuss related issues, such as using public samples of mostly private platforms, as well as typical errors in the analysis of such data.