In-Database Text Classification with BornSQL

Published: 24 Mar 2026, Last Modified: 25 Mar 2026, EDBT 2026, Everyone, CC BY-NC-ND 4.0
Abstract: The integration of databases and machine learning promises to enhance various aspects of data management, analysis, and application. However, in-database machine learning (In-DB ML) is not easily portable across database management systems, and current approaches are typically limited to training and inference, whereas modern machine learning pipelines often also involve continuous learning, unlearning, and explainability. This paper presents BornSQL, an In-DB ML algorithm based on the Born Classifier [7] and implemented exclusively through standard SQL queries. BornSQL handles categorical data and is particularly well suited to the classification of textual data. Further contributions of BornSQL are i) incremental learning, to efficiently update the model when new data become available in the database, ii) unlearning, when selected data need to be excluded due to privacy issues, and iii) global/local explainability, to assess the importance of a feature/attribute in determining the classification result. We illustrate the usage and scalability of the algorithm using a benchmark database consisting of 2,359,828 scientific publications divided into three classes and composed of 3,942,559 features. The training time is linear in the number of publications, and the average inference time for a publication is 1 millisecond in our experimental environment. We discuss potential applications such as cost-effective model serving, exploratory data analysis, and data privacy.