Implications of Different Encodings of Binned Data when Clustering

Nathan Phelps

Published: 03 Dec 2025, Last Modified: 13 Feb 2026 | Journal of Classification | Everyone | CC BY 4.0

Abstract: When using clustering to uncover patterns in a dataset, a data analyst must make several decisions. In some cases, one of those decisions is how to handle binned data (e.g., age or income bands), which is a common data type collected in surveys. A binned variable can be encoded as a nominal, ordinal, or interval-scaled variable (e.g., using each bin’s midpoint), and it is not clear which of these encodings, if any, should be preferred. We examined the impact of these encodings on the results of four clustering algorithms: partitioning around medoids (PAM) with Gower’s distance, K-prototypes, a latent class model, and KAMILA. The comparison used several simulated datasets and three household finance survey datasets from North America. We found that the optimal encoding depends on the clustering algorithm. We recommend the nominal encoding for latent class models, the ordinal encoding for K-prototypes and KAMILA (although the results were less definitive for these two), and the midpoint encoding for PAM.
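To make the three encodings concrete, the following is a minimal sketch in plain Python. The age bands, their bounds, and the function names are hypothetical and chosen only for illustration; they are not taken from the paper's datasets or code, and the one-hot form is just one common way to represent a nominal variable.

```python
# Hypothetical age bands: (label, lower bound, upper bound).
# These values are illustrative assumptions, not data from the paper.
BINS = [("18-24", 18, 24), ("25-34", 25, 34), ("35-44", 35, 44), ("45-54", 45, 54)]
LABELS = [label for label, _, _ in BINS]


def encode_nominal(label):
    """One-hot vector: the bins are treated as unordered categories."""
    return [1 if label == other else 0 for other in LABELS]


def encode_ordinal(label):
    """1-based rank of the bin: order is kept, spacing between bins is not."""
    return LABELS.index(label) + 1


def encode_midpoint(label):
    """Midpoint of the bin's bounds: the variable is treated as interval-scaled."""
    _, low, high = BINS[LABELS.index(label)]
    return (low + high) / 2


# Three hypothetical survey responses encoded three ways.
responses = ["25-34", "45-54", "18-24"]
nominal = [encode_nominal(r) for r in responses]
ordinal = [encode_ordinal(r) for r in responses]
midpoint = [encode_midpoint(r) for r in responses]
```

The same responses thus reach the clustering algorithm as unordered indicators, as ranks, or as numeric values, which is why the choice of encoding can change the distances or likelihoods the algorithm works with.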