Clustering and Cleaning of Word Usage Graphs

16 Aug 2024 (modified: 17 Sept 2024) | ACL ARR 2024 August Submission | CC BY 4.0

Abstract: Word Usage Graphs (WUGs) represent human judgments about semantic proximity between word uses as a weighted undirected graph. WUGs pose specific challenges to clustering algorithms, such as incompleteness and noise. We are the first to systematically compare multiple graph clustering algorithms for WUGs and find that the Weighted Stochastic Block Model performs comparably to or better than the current state of the art. We further test various graph cleaning strategies to improve the quality of the remaining cluster assignments while minimizing data loss. With better clustering and cleaning methods, we hope to help researchers improve the quality of their WUGs without additional manual annotation. We publish clustered and cleaned graphs for further research.

Paper Type: Long

Research Area: Semantics: Lexical and Sentence-Level

Research Area Keywords: word usage graphs, word sense induction, human-annotated, clustering, cleaning, semantic proximity, word-in-context

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis

Languages Studied: German

Submission Number: 460
