Linking Survey and Social Media Data: Natural Language Processing for Bridging the Gap Between Open Access and Data Protection

Conor Gaughan; Rachel Gibson; Alexandru Cernat; Marta Cantijoch; Riza Batista-Navarro

Published: 26 Jul 2025, Last Modified: 06 Oct 2025, NLPOR 2025, CC BY 4.0

Keywords: surveys, social media, linked data, ethics, consent, disclosive, NLP

TL;DR: Our paper uses natural language processing to generate new variables of interest from social media data that can be linked with survey data for open sharing.

Submission Type: Non-Archival

Abstract: The open release of social media data is problematic for both ethical and legal reasons, and the publicly searchable nature of social media text poses a serious risk of disclosure. The risk is especially high when social media data are linked with participant survey responses, which are likely to contain sensitive information. This work-in-progress paper outlines a standardised procedure for extracting anonymised variables from participant social media data that can be safely shared with other researchers. Using two pre-existing datasets that link participant survey data with their X (formerly Twitter) profiles during the US 2020 and 2024 election campaigns, we use NLP methods to extract 126 variables that describe the structural and semantic nature of the social media text. In doing so, we aim to demonstrate how these new variables can be used to enhance public opinion research, for example in the prediction of socio-demographic and attitudinal characteristics.
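
As a rough illustration of the kind of extraction the abstract describes, the sketch below reduces one participant's posts to a handful of aggregate structural variables and returns no raw text. It is a minimal sketch under stated assumptions, not the authors' procedure: the function name `structural_variables`, the regular expressions, and the five variables shown are all hypothetical and are not taken from the paper's 126 variables.

```python
import re
from statistics import mean

# Hypothetical patterns chosen for illustration; not the authors' definitions.
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
WORD = re.compile(r"[A-Za-z']+")


def structural_variables(posts):
    """Summarise one participant's posts as aggregate numbers only."""
    if not posts:
        return {"n_posts": 0}

    words_per_post, hashtags, mentions, urls = [], [], [], []
    vocabulary = set()
    total_words = 0

    for post in posts:
        # Count URLs, then strip them so fragments are not read as hashtags.
        urls.append(len(URL.findall(post)))
        text = URL.sub(" ", post)
        hashtags.append(len(HASHTAG.findall(text)))
        mentions.append(len(MENTION.findall(text)))

        # Strip mentions and hashtags before counting ordinary words.
        text = HASHTAG.sub(" ", MENTION.sub(" ", text))
        words = [w.lower() for w in WORD.findall(text)]
        words_per_post.append(len(words))
        vocabulary.update(words)
        total_words += len(words)

    # Only numeric summaries are returned; the post text itself is discarded.
    return {
        "n_posts": len(posts),
        "mean_words_per_post": mean(words_per_post),
        "mean_hashtags_per_post": mean(hashtags),
        "mean_mentions_per_post": mean(mentions),
        "mean_urls_per_post": mean(urls),
        "type_token_ratio": len(vocabulary) / total_words if total_words else 0.0,
    }
```

Each returned dictionary could then be joined to a survey record on a participant identifier. Whether a given set of aggregates is actually non-disclosive is a separate question that a sketch like this does not answer.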

Submission Number: 17
