Abstract: Questionnaire data serve as a valuable resource across numerous scientific domains, offering insights into human behavior, health, and social trends. Traditional downsampling-based representation learning methods, such as standardization and one-hot encoding, reformat these data into tabular structures that discard semantic richness and obscure inter-sample and inter-feature relationships. To address this limitation, this paper introduces SemantiQ, an upsampling-based representation learning framework that embeds questionnaire responses into a unified semantic space. SemantiQ leverages retrieval-augmented generation and large language models to transform question text, option text, and external knowledge into semantically enriched natural language statements, which are then encoded into semantic embeddings and refined through multi-stage training and test-time training. Experiments on multiple real-world datasets show that SemantiQ outperforms state-of-the-art baselines.
Loading