Statistical determination of numerical codes in mixed databases

Angel Kuri-Morales; Raul Galindo-Hernandez

16 Jul 2019 (modified: 05 May 2023), RIIAA 2019 Conference Submission, Readers: Everyone

Keywords: Machine learning, multivariate approximation, central limit theorem, neural networks

Abstract: Machine learning (ML) algorithms extract information from databases and use it to build models that can solve future problems. Only a small share of existing ML algorithms deal with non-numerical (categorical) data. This area has received less attention from researchers, mainly because categorical data are harder to map into metric spaces. Once data lie in a metric space, richer information inherent in the data can be obtained. This work addresses the problem of transforming categorical data into numerical data, specifically in mixed databases (MDBs), which contain both types of data. The categorical variables must be transformed into numerical variables while preserving the patterns embedded in the MDB, so that the wide range of algorithms that handle only numerical data can then be applied. A statistical approach is taken to transform the categorical attributes of an MDB into numerical attributes; the solution builds on the central limit theorem and multivariate approximation. Once the MDB has been transformed into a fully numerical database, it is ready for metric-based computational intelligence algorithms: a neural network is trained on it and the results are compared with 11 classic machine learning algorithms. Higher accuracy is achieved on several structurally different databases.
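
The abstract does not spell out the encoding procedure, so the following is only a rough illustration of what "replacing category labels with numeric codes while preserving a pattern in the data" can look like. It is not the authors' algorithm. The random search, the score (one minus the squared correlation with a numeric attribute), the helper names `pearson_r` and `search_codes`, and the toy table are all assumptions made for this sketch.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def search_codes(categories, numeric, n_trials=2000, seed=0):
    """Random search for one numeric code per category label.

    Hypothetical helper: each trial draws a code in [0, 1) for every label
    and scores it by 1 - r^2, the fraction of variance a straight-line fit
    between the encoded column and the numeric column leaves unexplained.
    The lowest-scoring set of codes is kept.
    """
    rng = random.Random(seed)
    labels = sorted(set(categories))
    best_codes, best_err = None, float("inf")
    for _ in range(n_trials):
        codes = {label: rng.random() for label in labels}
        encoded = [codes[c] for c in categories]
        err = 1.0 - pearson_r(encoded, numeric) ** 2
        if err < best_err:
            best_codes, best_err = codes, err
    return best_codes, best_err


# Toy mixed table (made-up values): one categorical and one numeric column.
size = ["small", "large", "medium", "small", "large", "medium", "small", "large"]
weight = [1.1, 9.8, 5.2, 0.9, 10.1, 4.9, 1.0, 10.3]

codes, err = search_codes(size, weight)
encoded_size = [codes[s] for s in size]  # fully numerical version of the column
```

In this sketch the selected codes place `medium` between `small` and `large`, so the encoded column keeps the ordering implied by the numeric attribute. An arbitrary label-to-integer mapping would not guarantee that.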
