CLUSTERBERT: MULTI-STAGE FINE-TUNING OF TRANSFORMERS FOR DEEP TEXT CLUSTERING

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission
Keywords: Text Clustering, Deep Clustering, Transformer, Sentence Embedding
Abstract: Transformer models were originally designed for text generation, classification, and sequence labelling, where they have achieved new state-of-the-art results. Recent deep clustering methods learn cluster-friendly spaces for complex data and thereby outperform traditional clustering algorithms, especially on images and graphs. We propose ClusterBERT, an unsupervised algorithm that combines the strengths of both approaches. By tightly integrating transformer-based sentence representation learning with clustering, our method discovers a cluster-friendly representation of text data that retains useful semantic information. ClusterBERT is a multi-stage procedure consisting of domain adaptation, clustering, and hardening of the clusters. Starting from an initial representation obtained from transformer models, ClusterBERT learns a cluster-friendly space for text data by jointly optimizing a reconstruction loss and a clustering loss. Our experiments demonstrate that ClusterBERT outperforms state-of-the-art text clustering methods.
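The abstract describes the joint objective only at a high level, so the following is a minimal, illustrative sketch of a combined reconstruction-plus-clustering loss over sentence embeddings, assuming a DEC-style formulation with a Student's t-kernel and an auxiliary target distribution for the hardening stage. All names, hyperparameters, and the use of random stand-in embeddings are assumptions for illustration, not details taken from ClusterBERT.

```python
# Hypothetical sketch: joint reconstruction + clustering objective on sentence
# embeddings, assuming a DEC-style clustering loss. Not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClusteringAutoencoder(nn.Module):
    def __init__(self, input_dim=768, latent_dim=32, n_clusters=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))
        # Learnable cluster centroids in the latent space.
        self.centroids = nn.Parameter(torch.randn(n_clusters, latent_dim))

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        # Soft cluster assignments via a Student's t-kernel (as in DEC).
        dist_sq = torch.cdist(z, self.centroids) ** 2
        q = (1.0 + dist_sq) ** -1
        q = q / q.sum(dim=1, keepdim=True)
        return x_hat, q

def target_distribution(q):
    # Sharpened auxiliary target used to "harden" the soft assignments.
    weight = q ** 2 / q.sum(dim=0)
    return weight / weight.sum(dim=1, keepdim=True)

# Toy usage with random stand-in sentence embeddings
# (e.g., mean-pooled transformer outputs in a real pipeline).
x = torch.randn(64, 768)
model = ClusteringAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
gamma = 0.1  # clustering-loss weight (assumed value)

for step in range(5):
    x_hat, q = model(x)
    p = target_distribution(q).detach()
    recon_loss = F.mse_loss(x_hat, x)
    cluster_loss = F.kl_div(q.log(), p, reduction="batchmean")
    loss = recon_loss + gamma * cluster_loss  # joint objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this kind of setup, the reconstruction term keeps the latent space semantically faithful to the input embeddings while the KL term pulls points toward confident cluster assignments; how ClusterBERT weights and stages these terms across domain adaptation, clustering, and hardening is specified in the paper itself.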
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representation learning