Statistical determination of numerical codes in mixed databases

Jul 16, 2019 Submission readers: everyone
  • Keywords: Machine learning, multivariate approximation, central limit theorem, neural networks
  • Abstract: Machine learning (ML) algorithms extract information from databases and use it to build models that can solve future problems. Only a small share of existing ML algorithms deal with non-numerical (categorical) data. This area has received less attention from researchers, mainly because categorical data are harder to map into metric spaces. When data lie in a metric space, richer information inherent in the data can be obtained. This work addresses the problem of transforming categorical data into numerical data, specifically in mixed databases (MDBs), which contain both types of data. The categorical variables must be transformed into numerical variables in a way that preserves the patterns embedded in the MDB, so that the wide range of algorithms that handle only numerical data can subsequently be applied. A statistical approach is taken to transform the categorical attributes of an MDB into numerical attributes; the central limit theorem and multivariate approximation are the foundations of the solution. Once the MDB has been transformed into a fully numerical database, metric-based computational intelligence algorithms can be applied to it. A neural network is then trained on the transformed database, and its results are compared with those of 11 classic machine learning algorithms. Higher accuracy is achieved on several structurally different databases.