Abstract: In this paper, we propose a novel Rhetorical Structure Index (RSI) to measure the structural importance of a word or a phrase. Unlike TF-IDF and other content-driven measurements, RSI identifies words or phrases that are structural cues in an unstructured document. We show structurally motivated features with high RSI values are more useful than content-driven features for applications such as segmenting unstructured lecture transcripts into meaningful segments. Experiments show that using RSI significantly improves the segmentation accuracy compared to TF-IDF, a traditional content-based feature weighting scheme.
0 Replies
Loading