Combining Demographic Tabular Data with BERT Outputs for Multilabel Text Classification in Higher Education Survey Data

Kevin Chovanec, John Fields, Praveen Madiraju

Published: 2023, Last Modified: 15 Jun 2024. IEEE Big Data 2023. License: CC BY-SA 4.0

Abstract: Higher Education Institutions (HEIs) often possess rich text data in the form of student surveys. However, because these data are expensive to process, many universities have not yet capitalized on this resource. When working with student text data, researchers often first want to label student responses with common categories of interest, a task in Natural Language Processing known as Multi-label Text Classification (MLTC). BERT and other Large Language Models have produced state-of-the-art results on MLTC tasks; yet because MLTC generally presents challenges of data scarcity and data sparsity, accuracy often remains too low to fully automate the task. Unlike many common MLTC datasets, these student survey data can usually be paired with rich tabular data, both academic and demographic. In this paper, we show that a fusion approach combining tabular data with BERT outputs derived from student responses significantly improves model performance, increasing label ranking average precision from 0.75 to 0.84. The paper thus contributes to the open academic discussion of whether fusing tabular demographic data with BERT outputs improves performance, and also offers a practical approach for HEIs to automate survey labeling and thus incorporate more student text data into institutional research.
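As a rough illustration of the two ideas the abstract names, the sketch below shows one simple way to fuse text-model outputs with tabular features, and how label ranking average precision (LRAP) is computed. It is not the authors' implementation: the abstract does not specify the fusion architecture, so plain concatenation is assumed here, and the function names `fuse` and `lrap` and all input values are hypothetical. The LRAP function follows the standard definition of the metric, with the convention (also used by scikit-learn) that a sample with no true labels scores 1.0.

```python
from typing import List, Sequence


def fuse(text_outputs: Sequence[Sequence[float]],
         tabular_features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Assumed fusion strategy: concatenate each response's text-model
    output vector with that student's tabular feature vector."""
    return [list(t) + list(x) for t, x in zip(text_outputs, tabular_features)]


def lrap(y_true: Sequence[Sequence[int]],
         y_score: Sequence[Sequence[float]]) -> float:
    """Label ranking average precision.

    For every true label of a sample, take the fraction of labels ranked
    at or above it that are also true, average over the sample's true
    labels, then average over samples.
    """
    total = 0.0
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, t in enumerate(truth) if t]
        if not relevant:
            # Convention: a sample with no true labels contributes 1.0.
            total += 1.0
            continue
        sample_sum = 0.0
        for j in relevant:
            # Rank of label j: how many labels score at least as high.
            rank = sum(1 for s in scores if s >= scores[j])
            # How many of those highly ranked labels are true labels.
            hits = sum(1 for k in relevant if scores[k] >= scores[j])
            sample_sum += hits / rank
        total += sample_sum / len(relevant)
    return total / len(y_true)


if __name__ == "__main__":
    # Hypothetical values: 2-dimensional text outputs, 3 tabular features.
    fused = fuse([[0.1, 0.9]], [[1.0, 0.0, 3.5]])
    print(fused)

    # Two samples, three candidate labels each.
    y_true = [[1, 0, 0], [0, 0, 1]]
    y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]
    print(lrap(y_true, y_score))
```

In a fused model of this kind, the concatenated vectors would feed a multilabel classifier whose per-label scores are then evaluated with LRAP; a score of 1.0 means every true label is ranked above every false one.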