Abstract: Humans can learn to solve new tasks by inducing high-level strategies from example solutions to similar problems and then adapting these strategies to solve unseen problems. Can we use large language models to induce such high-level structure from example documents or solutions? We introduce fLSA, a foundation-model-based Latent Semantic Analysis method that iteratively clusters and tags document segments based on document-level contexts. These tags can be used to model the latent structure of given documents and for hierarchical sampling of new texts. Our experiments on story writing, math, and multi-step reasoning datasets demonstrate that fLSA tags are more informative in reconstructing the original texts than existing tagging methods. Moreover, when used for hierarchical sampling, fLSA tags help expand the output space in directions that lead to correct solutions more often than direct sampling or hierarchical sampling with existing tagging methods.
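To make the iterative cluster-and-tag idea concrete, below is a minimal sketch of one possible fLSA-style loop. The abstract does not specify implementation details, so everything here is an illustrative assumption: the paragraph-level segmentation, the TF-IDF + k-means clustering backend, and the hypothetical `describe_cluster` hook (an LLM call that summarizes a cluster of segments into a short tag) are stand-ins, not the authors' actual method.

```python
# Sketch of an fLSA-style iterative cluster-and-tag loop (illustrative only).
# Assumptions: paragraph segmentation, TF-IDF + k-means clustering, and a
# user-supplied `describe_cluster` hook that wraps an LLM call.

from typing import Callable

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def segment(document: str) -> list[str]:
    """Assumed segmentation: split a document into paragraph-level segments."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def flsa(documents: list[str],
         describe_cluster: Callable[[list[str]], str],
         n_tags: int = 8,
         n_iters: int = 3) -> dict[str, str]:
    """Iteratively cluster segments and label each cluster with a short tag."""
    segments, contexts = [], []
    for doc in documents:
        for seg in segment(doc):
            segments.append(seg)
            contexts.append(doc)  # document-level context for each segment

    vectorizer = TfidfVectorizer(max_features=2048)
    for _ in range(n_iters):
        # Represent each segment together with its (tag-augmented) context.
        X = vectorizer.fit_transform(
            s + " " + c for s, c in zip(segments, contexts)
        )
        labels = KMeans(n_clusters=n_tags, n_init=10).fit_predict(X)

        # Ask the LLM hook for a human-readable tag per cluster.
        tags = {}
        for k in range(n_tags):
            members = [s for s, l in zip(segments, labels) if l == k]
            tags[k] = describe_cluster(members)

        # Append the current tag to each segment's context so the next
        # iteration can refine the clusters using the induced structure.
        contexts = [c + " [TAG] " + tags[l] for c, l in zip(contexts, labels)]

    return {s: tags[l] for s, l in zip(segments, labels)}
```

The resulting segment-to-tag map could then serve as the latent structure for hierarchical sampling, e.g. sampling a tag sequence first and conditioning text generation on it; that downstream step is likewise not shown here.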
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: topic modeling, knowledge tracing/discovering/inducing
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 2811