Cleaning Semi-Structured Errors in Open Data Using Large Language Models

Manuel Mondal, Julien Audiffren, Ljiljana Dolamic, Gérôme Bovet, Philippe Cudré-Mauroux

Published: 2024, Last Modified: 06 Dec 2024, SDS 2024, CC BY-SA 4.0

Abstract: Many datasets suffer from errors, which makes data cleaning, the process of rectifying them, very time-consuming. The most commonly studied errors are inaccuracies in data values or labels (data errors) and formatting issues that hinder parsing (structured errors). We focus on a distinct category, Semi-Structured Errors (SSEs), which occur when both the data values and the structure are correct but the values are misplaced within the structure; accounting for them requires complex, time-consuming parsing rules and exceptions to those rules. In this work, we explore the capabilities of Large Language Models to clean SSEs and show promising experimental results on a public dataset of Swiss federal law.
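
To make the definition of an SSE concrete, the following is a minimal sketch of a misplaced value and of the kind of hand-written rule needed to repair it. The record layout, the field names, the sample text, and the `repair` helper are all hypothetical illustrations; none of them come from the paper or its dataset, and this is not the authors' method.

```python
import re

# Hypothetical record layout; field names are illustrative, not from the paper.
clean = {
    "number": "Art. 1",
    "title": "Purpose",
    "text": "This Act governs data protection.",
}

# Same values and a valid structure, but the title has slipped into the text field.
misplaced = {
    "number": "Art. 1",
    "title": "",
    "text": "Purpose This Act governs data protection.",
}

# One hand-written rule: a single capitalised word followed by "This Act"
# at the start of the text is assumed to be a misplaced title.
TITLE_IN_TEXT = re.compile(r"^(?P<title>[A-Z][a-z]+) (?P<rest>This Act .*)$")


def repair(record):
    """Return a copy of the record with a misplaced title moved back, if the rule matches."""
    fixed = dict(record)
    if not fixed["title"]:
        match = TITLE_IN_TEXT.match(fixed["text"])
        if match:
            fixed["title"] = match.group("title")
            fixed["text"] = match.group("rest")
    return fixed
```

The rule covers exactly one pattern. A multi-word title, or a body that does not begin with "This Act", would each need a further rule or exception, which is the rule-writing burden the abstract describes.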