Clustering and Cleaning of Word Usage Graphs

ACL ARR 2024 August Submission460 Authors

16 Aug 2024 (modified: 17 Sept 2024), ACL ARR 2024 August Submission, CC BY 4.0
Abstract: Word Usage Graphs (WUGs) represent human judgments about semantic proximity between word uses as weighted undirected graphs. WUGs pose specific challenges, such as incompleteness and noise, to clustering algorithms. We are the first to systematically compare multiple graph clustering algorithms for WUGs and find that the Weighted Stochastic Block Model performs comparably to or outperforms the current state-of-the-art. We further test various graph cleaning strategies to improve the quality of the remaining cluster assignments while minimizing data loss. With better clustering and cleaning methods, we hope to help researchers improve the quality of their WUGs without additional manual annotation. We publish clustered and cleaned graphs for further research.
Paper Type: Long
Research Area: Semantics: Lexical and Sentence-Level
Research Area Keywords: word usage graphs, word sense induction, human-annotated, clustering, cleaning, semantic proximity, word-in-context
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: German
Submission Number: 460