Investigating Data Augmentations in Unsupervised Sentence Embeddings for Biomedical Text

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
Abstract: Unsupervised sentence representation learning is crucial in NLP, with contrastive learning showing notable success. This study focuses on sentence embeddings in the biomedical domain, employing BERT-base-uncased and Chinese-BERT-wwm-ext for English and Chinese text, respectively. We evaluate our models on the BIOSSES and ChineseBLUE benchmarks, presenting the first investigation into data augmentation methods for enhancing contrastive learning in biomedical NLP. Our findings reveal that general-purpose pre-trained BERT-base models excel at biomedical tasks when fine-tuned on domain-specific texts. By applying various data augmentation techniques, we improve the contrastive learning of biomedical sentence embeddings. Results show a 4.34% increase over unsup-SimCSE in average Spearman's correlation on BIOSSES, as well as improvements on ChineseBLUE tasks, surpassing state-of-the-art unsup-SimCSE scores. We also establish that augmentation methods that preserve sentence constituents, such as punctuation insertion and MixCSE instance weighting, yield superior outcomes.
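The punctuation-insertion augmentation mentioned above can be sketched as follows. This is a minimal illustration of how such an augmentation might generate a positive pair for contrastive learning while preserving every original word; the insertion `ratio` and the set of punctuation marks are illustrative assumptions, not values from the paper.

```python
import random

def punctuation_insert(sentence, ratio=0.3, seed=None,
                       marks=(",", ".", ";", ":", "?", "!")):
    """Insert punctuation marks at random inter-token positions.

    All original tokens are kept in order, so the augmented view is a
    constituent-preserving positive example of the input sentence.
    The ratio and mark set here are assumptions for illustration.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    n_insert = max(1, int(len(tokens) * ratio))
    # Candidate slots: before each token and after the last one.
    slots = rng.sample(range(len(tokens) + 1),
                       min(n_insert, len(tokens) + 1))
    out = []
    for i, tok in enumerate(tokens):
        if i in slots:
            out.append(rng.choice(marks))
        out.append(tok)
    if len(tokens) in slots:
        out.append(rng.choice(marks))
    return " ".join(out)
```

In an unsup-SimCSE-style setup, the original sentence and its augmented view would be encoded separately and treated as a positive pair in the contrastive objective, with other in-batch sentences serving as negatives.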
Paper Type: long
Research Area: Semantics: Sentence-level Semantics, Textual Inference and Other areas
Languages Studied: Chinese, English