CoLeM: A framework for semantic interpretation of Russian-language tables based on contrastive learning

Published: 22 Jun 2025, Last Modified: 22 Jun 2025 · ACL-SRW 2025 Poster · CC BY 4.0
Keywords: Russian-language tables, tabular data, semantic table interpretation, column type annotation, knowledge graphs, self-supervised learning, contrastive learning, table representations
TL;DR: We propose CoLeM, a new framework for annotating table columns based on contrastive learning. We train the CoLeM model on a corpus of Russian-language web tables and establish new state-of-the-art performance with a micro-F1 score of up to 97%.
Abstract: Tables are extensively used to represent and store data; however, they often lack the explicit semantics necessary for machine interpretation of their contents. Semantic table interpretation is essential for integrating structured data with knowledge graphs, yet existing methods face challenges with Russian-language tables due to limited labeled data and linguistic peculiarities. This paper introduces a contrastive learning approach that minimizes reliance on manual labeling and improves the accuracy of column annotation for rare semantic types. The proposed method adapts contrastive learning to tabular data through augmentations and employs a distilled multilingual BERT model trained on the unlabeled RWT corpus (comprising 7.4 million columns). The resulting table representations are incorporated into the RuTaBERT pipeline, reducing computational overhead. Experimental results demonstrate a micro-F1 score of 97% and a macro-F1 score of 92%, surpassing several baseline approaches. These findings demonstrate the effectiveness of the proposed method in addressing data sparsity and handling unique features of the Russian language. The results further confirm that contrastive learning captures semantic similarities among columns without explicit supervision, which is particularly vital for rare semantic types.
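To make the approach concrete, the sketch below illustrates contrastive pretraining over serialized table columns with a distilled multilingual BERT and an NT-Xent (InfoNCE) loss. The paper itself does not spell out the exact augmentations, serialization format, or checkpoint; the choices here (cell dropout and order shuffling as augmentations, `distilbert-base-multilingual-cased` as the encoder, `" | "`-joined cells as input) are assumptions for illustration only, not the authors' implementation.

```python
import random
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Assumed distilled multilingual BERT checkpoint; the paper's exact model is not specified here.
MODEL_NAME = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def augment(cells):
    # Two cheap, illustrative column augmentations: random cell dropout, then order shuffling.
    # The augmentation set actually used by CoLeM may differ.
    cells = random.sample(cells, k=max(1, int(0.8 * len(cells))))
    random.shuffle(cells)
    return cells

def embed(columns):
    # Serialize each column (a list of cell strings) and mean-pool token embeddings.
    texts = [" | ".join(c) for c in columns]
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)        # masked mean pooling

def nt_xent(z1, z2, tau=0.1):
    # NT-Xent / InfoNCE: two augmented views of the same column are positives,
    # all other columns in the batch act as negatives.
    z = F.normalize(torch.cat([z1, z2]), dim=1)        # (2B, H)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float("-inf"))                  # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# One illustrative training step on a toy batch of Russian-language columns.
columns = [["Москва", "Санкт-Петербург", "Казань"], ["1147", "1703", "1005"]]
z1 = embed([augment(list(c)) for c in columns])
z2 = embed([augment(list(c)) for c in columns])
loss = nt_xent(z1, z2)
loss.backward()
```

In this setup, no column-type labels are needed: the loss only requires that two augmented views of the same column embed closer together than views of different columns, which is how contrastive pretraining can capture semantic similarity from the unlabeled RWT corpus before the representations are passed to a downstream annotation pipeline such as RuTaBERT.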
Archival Status: Archival
Paper Length: Long Paper (up to 8 pages of content)
Submission Number: 177