Quantifying Depressed Social Media During COVID-19: Information Retrieval With ML & NLP

Brent D Davis; Dawn Estes McKnight; Rumi Chunara; Daniel J Lizotte; Alona Fyshe

04 Sept 2020 (modified: 24 May 2023). Submitted to NLP-COVID19-EMNLP. Readers: Everyone

Keywords: COVID-19, NLP, IR, Information Retrieval, AIR, ML, Machine Learning, Depression, Mental Health, Social Media

TL;DR: An approach using ML & NLP to quantify the amount of depressed social media associated with COVID-19

Abstract: The ongoing pandemic continues to disrupt the normal functioning of society in numerous ways, and symptoms of depression are on the rise. In this work, we explored how analysis of social media, using Twitter and Reddit, can reveal changes in the number of authors presenting depressive symptoms. We first assessed the level of depressive symptoms expressed in a large set of tweets. While there have been some efforts to identify depressive symptoms in tweets, they are limited in scope and typically do not account for contemporary online discourse surrounding the experience of depression. To ensure that our assessment accounted for contemporary discourse, we extracted recent posts from /r/Depression, where symptoms and experience are a main topic of discussion. To further ensure that our assessment accounted for language that expresses depressive symptoms in a variety of contexts, rather than only when explicitly discussing the experience of depression, we also extracted all of the other Reddit posts of users who posted in /r/Depression. Specifically, we collected every post made anywhere on Reddit by /r/Depression authors during November and December 2019 (the most recent two months available in their entirety on Pushshift). We then trained a GloVe word embedding on these posts. Using the resulting word vectors, we trained author representations with the usr2vec method for both our /r/Depression authors and a sampled set of control users, who act as a contrast to our archetypal examples. This produces a high-dimensional representation of each user, based on a composite of the word representations we trained previously. We then used a linear-kernel support vector machine (SVM) to find a separating hyperplane between the high-dimensional representations of users who post in /r/Depression and those of the control set not active in /r/Depression.
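The user-representation and SVM steps described above can be illustrated with a minimal sketch. It rests on several assumptions: the word vectors and users are invented toy data, plain averaging of word vectors is substituted for usr2vec (which learns user vectors differently), and the SVM is trained with a simple subgradient loop, not with whatever solver the authors used. The names `mean_vector`, `train_linear_svm`, and `predict` are hypothetical.

```python
# Toy sketch only: the word vectors and users below are invented.

def mean_vector(words, word_vecs):
    """Average a user's word vectors (a simple stand-in, not usr2vec itself)."""
    dim = len(next(iter(word_vecs.values())))
    known = [word_vecs[w] for w in words if w in word_vecs]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on lam/2 * ||w||^2 + mean hinge loss."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators add to the hinge subgradient
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (archetype side) or -1 (control side) of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical 2-d word embedding (a real GloVe embedding has many more dimensions).
word_vecs = {
    "hopeless": [1.0, 0.0],
    "tired": [0.8, 0.2],
    "game": [0.0, 1.0],
    "goal": [0.2, 0.8],
}
archetype_users = [["hopeless", "tired"], ["tired", "tired", "hopeless"]]
control_users = [["game", "goal"], ["goal", "goal", "game"]]

X = [mean_vector(u, word_vecs) for u in archetype_users + control_users]
y = [1, 1, -1, -1]  # +1 = archetype (posts in the subreddit), -1 = control

w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print("decision direction:", w, "bias:", b)
print("predictions:", predictions)
```

The learned weight vector `w` is the normal to the separating hyperplane; in this sketch it plays the role of the decision direction.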
From here, we could use the SVM to directly classify unseen user representations; however, this is prone to bias, the classifications are challenging to explain, and training a representation for every new user is computationally expensive. We instead extracted vocabulary strongly associated with users who post in /r/Depression by computing the cosine similarity between every word representation in the vocabulary of our word embedding and the decision direction the SVM produces. We took the most aligned words and used them to form a query for retrieving content written by depressed users. These words can be visualized and reviewed, mitigating bias and improving explainability. We call this method "Archetype-based Information Retrieval" (AIR); our work is an example of using AIR to find depression-associated content, building on a similar approach for finding posts about substance abuse. We created a query from the 200 most closely aligned words and used BM25 to assign a score to tweets from the Mega-COV and official Twitter COVID-19 datasets. We treated the top-scoring quartile of tweets from our search as posts that indicate depressive symptoms. We sorted the tweets by the time they were posted and looked for changes over time in the frequency of high-scoring matches to our query. We then ran topic models (Latent Dirichlet Allocation, Contextual) on tweets grouped by the month in which they were posted and looked for consistencies and changes over time in the topics discovered by these automated approaches. Future work will explore ties between social media metrics and traditional, offline metrics. We intend to group tweets by geotags and look for corresponding trends; it is an open question whether the local, municipal, provincial, federal or international situation regarding COVID-19 is the primary stressor for individuals. This study lays the foundation for AIR as a tool for investigating COVID-19 impacts on mental health.
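The retrieval steps described above (rank the vocabulary by cosine similarity to the decision direction, build a query, score documents with BM25, keep the top quartile) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the embedding, the decision direction, and the tokenized tweets are invented toy data, the query uses 2 words instead of 200, and the BM25 variant shown (parameters `k1=1.5`, `b=0.75`, with a non-negative IDF) is one common formulation that may differ from the one actually used.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_aligned_words(word_vecs, direction, k):
    """Return the k words whose vectors align best with the direction."""
    ranked = sorted(word_vecs, key=lambda w: cosine(word_vecs[w], direction),
                    reverse=True)
    return ranked[:k]


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every tokenized document against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical 2-d embedding and SVM decision direction (toy values).
word_vecs = {
    "hopeless": [1.0, 0.0],
    "tired": [0.8, 0.2],
    "today": [0.5, 0.5],
    "goal": [0.2, 0.8],
    "game": [0.0, 1.0],
}
decision_direction = [1.0, -1.0]

# Invented, pre-tokenized stand-ins for tweets.
tweets = [
    ["feeling", "hopeless", "and", "tired", "today"],
    ["great", "game", "today"],
    ["so", "tired", "of", "lockdown"],
    ["what", "a", "goal"],
    ["new", "recipe", "today"],
    ["working", "from", "home"],
    ["watching", "the", "game"],
    ["sunny", "walk", "outside"],
]

query = top_aligned_words(word_vecs, decision_direction, k=2)
scores = bm25_scores(query, tweets)

# Keep the top-scoring quartile of tweets as candidate matches.
n_keep = max(1, len(tweets) // 4)
top_quartile = sorted(range(len(tweets)), key=lambda i: scores[i], reverse=True)[:n_keep]
print("query:", query)
print("top-quartile tweet indices:", top_quartile)
```

Because the query is an explicit word list, it can be printed and reviewed before any tweets are scored, which is the explainability benefit the abstract describes.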
