Multitask Transformer Models for Demographic and Industry Profiling on Long-Form Blog Texts

TMLR Paper 7092 Authors

21 Jan 2026 (modified: 16 Feb 2026) · Under review for TMLR · CC BY 4.0
Abstract: We address the challenge of multitask author profiling on long-form blog text by developing four transformer-based models that jointly predict gender, age group, and industry. Using a cleaned version of the Blog Authorship Corpus, preprocessed by merging industry tags into fourteen categories and applying standard text normalization, we explore document-length handling strategies spanning inputs of 192 to 500 tokens, including long-context encoding, BART-based summarization, and chunked processing with prediction fusion. Our experiments show that multitask learning consistently outperforms strong single-task baselines, with the largest gains on industry prediction. We further find that broader input context yields more reliable predictions, while alternative representations emphasize complementary stylistic and topical cues. Together, these findings provide a comprehensive analysis of text-length effects in multitask author profiling and highlight the importance of contextual breadth for robust demographic inference.
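The setup described in the abstract lends itself to a compact illustration. Below is a minimal, hypothetical sketch of one such model: a shared BERT-style encoder with one linear classification head per task, plus chunked inference fused by averaging per-task logits. The 192-token chunk length matches the lower end of the input range quoted above; the encoder name, head sizes (e.g., fourteen industry categories from the abstract, three age groups), class names, and the mean-logit fusion rule are illustrative assumptions, not the authors' reported configuration.

```python
# Hypothetical sketch of a multitask author profiler with chunked
# prediction fusion. Assumptions (not from the paper): bert-base-uncased
# encoder, [CLS] pooling, three age groups, mean-logit fusion.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class MultitaskProfiler(nn.Module):
    """Shared transformer encoder with one classification head per task."""

    def __init__(self, encoder_name="bert-base-uncased",
                 n_gender=2, n_age=3, n_industry=14):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict({
            "gender": nn.Linear(hidden, n_gender),
            "age": nn.Linear(hidden, n_age),
            "industry": nn.Linear(hidden, n_industry),
        })

    def forward(self, input_ids, attention_mask):
        # Use the [CLS] vector as the shared document representation.
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        return {task: head(cls) for task, head in self.heads.items()}


def predict_long_document(model, tokenizer, text, chunk_len=192):
    # Chunked processing with prediction fusion: split the document into
    # fixed-length token windows, score every window, then average the
    # per-task logits across windows before taking the argmax.
    enc = tokenizer(text, max_length=chunk_len, truncation=True,
                    return_overflowing_tokens=True,
                    padding="max_length", return_tensors="pt")
    model.eval()
    with torch.no_grad():
        logits = model(enc["input_ids"], enc["attention_mask"])
    return {task: out.mean(dim=0).argmax(-1).item()
            for task, out in logits.items()}


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultitaskProfiler()
print(predict_long_document(model, tokenizer, "Long blog post text ..."))
```

In this sketch, the other two strategies named in the abstract would presumably slot into the same pipeline: the BART-based summarization variant would compress the document to a single input before encoding, and the long-context variant would raise the window length toward the 500-token end of the stated range, leaving the shared heads unchanged.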
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Cedric_Archambeau1
Submission Number: 7092