Structural Quantile Normalization: a general, differentiable feature scaling technique balancing Gaussian approximation and structural preservation
Feature scaling is an essential practice in modern machine learning, both as a preprocessing step and as an integral part of model architectures, such as batch and layer normalization in artificial neural networks. Its primary goal is to align feature scales, preventing larger-valued features from dominating model learning, especially in algorithms that rely on distance metrics, gradient-based optimization, or regularization. Additionally, many algorithms benefit from or require input data approximating a standard Gaussian distribution, making "Gaussianization" a further objective. Lastly, an ideal scaling method should be general, i.e., applicable to any input distribution, and differentiable, to facilitate seamless integration into gradient-optimized models. Although differentiable and general, traditional linear methods, such as standardization and min-max scaling, cannot reshape distributions beyond scale and offset. Existing nonlinear methods, although more effective at Gaussianizing data, either lack general applicability (e.g., power transformations) or introduce excessive distortions that can obscure intrinsic data patterns (e.g., quantile normalization); moreover, present nonlinear methods are not differentiable. We introduce Structural Quantile Normalization (SQN), a general and differentiable scaling method that enables balancing Gaussian approximation with structural preservation. We also introduce Fast-SQN, a more performance-efficient variant with the same properties. We show that SQN is a generalized augmentation of standardization and quantile normalization. Using the real-world "California Housing" dataset, we demonstrate that Fast-SQN outperforms state-of-the-art methods, including classical and ordered quantile normalization and the Box-Cox and Yeo-Johnson transformations, across key error metrics (RMSE, MAE, MdAE) when used for preprocessing. Finally, we demonstrate our transformation's differentiability and compatibility with gradient-based optimization using the real-world "Gas Turbine Emission" dataset and propose a methodology for integrating SQN into deep networks.
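The paper body defines SQN precisely; purely as intuition for the "generalized augmentation of standardization and quantile normalization" claim above, the following minimal Python sketch blends the two endpoint transforms with a hypothetical weight alpha (our name, not the paper's notation), so that alpha = 0 preserves structure linearly and alpha = 1 fully Gaussianizes the ranks.

    # Illustrative sketch only, not the paper's SQN algorithm: a convex blend
    # of standardization and Gaussian-target quantile normalization. The
    # weight `alpha` is a hypothetical parameter introduced here for clarity.
    import numpy as np
    from scipy.stats import norm, rankdata

    def sqn_like(x: np.ndarray, alpha: float = 0.5) -> np.ndarray:
        # Linear endpoint: classical z-score standardization, which
        # preserves the shape of the input distribution.
        z = (x - x.mean()) / x.std()
        # Nonlinear endpoint: classical quantile normalization, mapping
        # empirical quantiles through the standard Gaussian inverse CDF.
        ranks = rankdata(x) / (len(x) + 1)   # empirical CDF values in (0, 1)
        q = norm.ppf(ranks)                  # Gaussianized values
        # alpha trades structural preservation (alpha -> 0) against
        # Gaussian approximation (alpha -> 1).
        return (1.0 - alpha) * z + alpha * q

Unlike the paper's SQN, this naive rank-based blend is not differentiable in x (rankdata is piecewise constant); it only illustrates the interpolation idea stated in the abstract.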