Abstract: Author profiling is the task of inferring characteristics about
individuals by analyzing content they share. Supervised machine
learning still dominates automatic systems that perform this task,
despite the popularity of prompting large language models to address
natural language understanding tasks. One reason is that the
classification instances consist of large numbers of posts,
potentially an entire user profile, which may exceed the maximum
input length of Transformer models. Even if a model supports a large
context window, processing all posts makes API-accessed black-box
systems costly and slow, in addition to the issues that come with
such "needle-in-a-haystack" tasks. To mitigate this limitation, we
propose a new author profiling method that first distinguishes
relevant from irrelevant content and then performs the actual
profiling on the relevant data only. To circumvent the
need for relevance-annotated data, we optimize this relevance filter
via reinforcement learning with a reward function that utilizes the
zero-shot capabilities of large language models. We evaluate our
method for Big Five personality trait prediction on two Twitter
corpora. On publicly available real-world data with a skewed label
distribution, our method achieves efficacy similar to using all
posts in a user profile, but with a substantially shorter context.
An evaluation on a version of these data balanced with artificial
posts shows that filtering for relevant posts significantly improves
prediction accuracy.
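The abstract gives no implementation details; purely as an illustration, the sketch below shows one way such a pipeline could look, assuming PyTorch, per-post Bernoulli relevance decisions trained with REINFORCE, and a stub in place of the LLM-based reward. The names `llm_zero_shot_reward` and `featurize`, and the hashed bag-of-words features, are hypothetical placeholders, not the paper's actual components.

```python
import torch

# Minimal REINFORCE-style sketch of the described two-stage pipeline
# (NOT the authors' implementation): a learned filter keeps relevant
# posts, and a reward stub stands in for a zero-shot LLM judgment, so
# no relevance-annotated data is required.

def llm_zero_shot_reward(kept_posts: list[str]) -> float:
    """Placeholder (assumption): in the paper's setup, a large language
    model would score the filtered posts zero-shot."""
    return 1.0 if kept_posts else 0.0

def featurize(post: str, dim: int = 64) -> torch.Tensor:
    """Toy hashed bag-of-words features, for self-containment only."""
    vec = torch.zeros(dim)
    for tok in post.lower().split():
        vec[hash(tok) % dim] += 1.0
    return vec

scorer = torch.nn.Linear(64, 1)                 # per-post relevance policy
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-3)

def reinforce_step(posts: list[str]) -> float:
    feats = torch.stack([featurize(p) for p in posts])
    keep_prob = torch.sigmoid(scorer(feats)).squeeze(-1)  # P(post is relevant)
    dist = torch.distributions.Bernoulli(keep_prob)
    mask = dist.sample()                                  # 1 = keep post
    kept = [p for p, m in zip(posts, mask) if m.item() > 0]
    reward = llm_zero_shot_reward(kept)                   # LLM reward stand-in
    loss = -(dist.log_prob(mask).sum() * reward)          # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward

# Example: one update over a small toy profile.
reinforce_step(["I love planning every detail",
                "lol random meme",
                "feeling anxious today"])
```

Under these assumptions, the filtered subset would then be passed to a downstream trait classifier, which keeps the context given to any API-accessed model short.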
Paper Type: Long
Research Area: Computational Social Science and Cultural Analytics
Research Area Keywords: psycho-demographic trait prediction, NLP tools for social analysis
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 87