# Stable Anisotropic Regularization
Given the success of Large Language Models (LLMs), there has been considerable interest in studying the properties of model activations. The literature overwhelmingly agrees that LLM representations are dominated by a few "outlier dimensions" with exceedingly high variance and magnitude. Several studies in Natural Language Processing have sought to mitigate the impact of such outlier dimensions and force LLMs to be isotropic (i.e., have uniform variance across all dimensions in embedding space). Isotropy is thought to be a desirable property for LLMs that improves model performance and more closely aligns textual representations with human intuition. However, many of the claims regarding isotropy in NLP have been based on the average cosine similarity of embeddings, which has recently been shown to be a flawed measure of isotropy. In this paper, we propose I-STAR: IsoScore*-based STable Anisotropic Regularization, a novel regularization method that can be used to increase or decrease levels of isotropy in embedding space during training. I-STAR uses IsoScore*, the first accurate measure of isotropy that is both differentiable and stable on mini-batch computations. In contrast to several previous works, we find that decreasing isotropy in contextualized embeddings improves performance on the majority of tasks and models considered in this paper.

## Roadmap to this Repo
This repository consists of the following primary files:

- **run_model.py**: Contains code to train all models on all tasks considered in our paper, including training with CosReg and I-STAR. Use run_model.sh to specify the training arguments and hyperparameters. The regularizer parameter accepts one of three arguments: "None" (no regularization), "cos_sim" (CosReg), and "istar" (I-STAR). Running run_model.py will automatically save the best checkpoints and call run_analysis after the final training epoch. Please note that the code contains infrastructure to be connected to wandb and is set up for distributed training with PyTorch DDP.
- **regularizers.py**: Contains the full code for IsoScore* and CosReg to be used while training models. Additionally, you can use the isocore_star() function to calculate the IsoScore* values of a given point cloud.
- **training_utils.py**: Contains all DataLoaders and preprocessing functions.
- **analysis.py**: Contains functions used to calculate several statistics about model representations, such as the IsoScore* values, TwoNN intrinsic dimensionality estimates, mean, standard deviation, and average cosine similarity.

## Measuring Isotropy with IsoScore*
IsoScore* is inspired by IsoScore (https://aclanthology.org/2022.findings-acl.262/) and is equivalent to IsoScore when no covariance matrix shrinkage is applied. The complete code for IsoScore* can be found in regularizers.py. To calculate IsoScore* for a point cloud of data, set is_eval=True when calling isocore_star(). Please note that if the number of points is smaller than the dimensionality of the ambient vector space, the sample covariance matrix is rank-deficient and the resulting IsoScore* value will be artificially low unless you perform RDA (Regularized Discriminant Analysis) shrinkage on the covariance matrix.
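
To make the computation concrete, here is a minimal NumPy sketch of the score without shrinkage (the case in which IsoScore* coincides with IsoScore). It is an illustration based on the published IsoScore definition, not the code in regularizers.py: the function names `isoscore_from_covariance` and `isoscore_from_points` are hypothetical, and the sketch is not differentiable the way a PyTorch implementation used for training would be.

```python
import numpy as np


def isoscore_from_covariance(cov: np.ndarray) -> float:
    """Isotropy score in [0, 1] computed from a covariance matrix (hypothetical helper)."""
    n = cov.shape[0]
    # Eigenvalues of the covariance are the variances along the principal components.
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # clip tiny negative values
    # Rescale so the eigenvalue vector has the same norm as the all-ones vector.
    normalized = np.sqrt(n) * eigvals / np.linalg.norm(eigvals)
    # Isotropy defect: distance to the all-ones vector, scaled to lie in [0, 1].
    defect = np.linalg.norm(normalized - 1.0) / np.sqrt(2.0 * (n - np.sqrt(n)))
    # Fraction of dimensions that are used uniformly, then rescaled to [0, 1].
    phi = (n - defect**2 * (n - np.sqrt(n))) ** 2 / n**2
    return float((n * phi - 1.0) / (n - 1.0))


def isoscore_from_points(points: np.ndarray) -> float:
    """Score for a point cloud of shape (num_points, dim), with no shrinkage."""
    return isoscore_from_covariance(np.cov(points, rowvar=False))


rng = np.random.default_rng(0)
dim = 64
many_points = rng.standard_normal((10_000, dim))  # far more points than dimensions
few_points = rng.standard_normal((16, dim))       # fewer points than dimensions

print(isoscore_from_points(many_points))  # close to 1 for isotropic Gaussian data
print(isoscore_from_points(few_points))   # artificially low: covariance has rank <= 15
```

Both samples come from the same isotropic Gaussian, so the low score in the second case reflects only the rank-deficient sample covariance. That is the situation shrinkage is meant to correct.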

## I-STAR Regularization
To train a given model with I-STAR regularization, you can call run_model.sh with the --regularizer "istar" option. I-STAR regularization introduces two additional hyperparameters that can be tuned:


+ **Shrinkage Parameter (Zeta)**: The shrinkage parameter determines the balance between information from the mini-batch sample covariance matrix and the stable shrinkage covariance matrix. Setting the --zeta argument to 0 calculates IsoScore* values entirely from the mini-batch covariance matrix, effectively disabling shrinkage and making IsoScore* equivalent to IsoScore. On the other hand, setting --zeta to 1 disregards the covariance of the mini-batch and calculates IsoScore* values based solely on the shrinkage matrix.
+ **Tuning Parameter (Lambda)**: The tuning parameter can be set to any real number. Positive values of the tuning parameter encourage isotropy in model representations, while negative values force the representations to be even less isotropic.
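
The roles of the two hyperparameters can be sketched as follows. This is an illustration under two assumptions, not the code in regularizers.py: that values of zeta between 0 and 1 blend the two covariance matrices linearly, and that the penalty added to the task loss has the form lambda * (1 - IsoScore*). Both are consistent with the endpoint and sign behavior described above. The function names `shrink_covariance` and `istar_penalty` are hypothetical.

```python
import numpy as np


def shrink_covariance(batch_cov: np.ndarray, shrinkage_cov: np.ndarray, zeta: float) -> np.ndarray:
    """Blend the mini-batch covariance with the shrinkage covariance (assumed linear blend)."""
    if not 0.0 <= zeta <= 1.0:
        raise ValueError("zeta must lie in [0, 1]")
    return (1.0 - zeta) * batch_cov + zeta * shrinkage_cov


def istar_penalty(isoscore_value: float, lam: float) -> float:
    """Assumed penalty term: lam * (1 - IsoScore*), added to the task loss."""
    return lam * (1.0 - isoscore_value)


batch_cov = np.diag([4.0, 1.0])  # noisy estimate from a single mini-batch
shrinkage_cov = np.eye(2)        # stable estimate, e.g. from a larger sample

print(shrink_covariance(batch_cov, shrinkage_cov, zeta=0.0))  # mini-batch covariance only
print(shrink_covariance(batch_cov, shrinkage_cov, zeta=1.0))  # shrinkage covariance only

# Positive lambda: a higher IsoScore* gives a smaller penalty, so isotropy is encouraged.
print(istar_penalty(0.9, lam=1.0) < istar_penalty(0.2, lam=1.0))    # True
# Negative lambda: a lower IsoScore* gives a smaller penalty, so isotropy is discouraged.
print(istar_penalty(0.2, lam=-1.0) < istar_penalty(0.9, lam=-1.0))  # True
```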

You have the flexibility to perform I-STAR regularization on any layer of the model by adjusting the --layer argument in run_model.sh. In the paper, we set --layer "all", which means we calculate a single IsoScore* penalty by considering the union of the hidden states from all layers of the model.
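
One way to picture the "all" setting is to flatten each layer's hidden states into a set of token vectors and stack the sets into a single point cloud, from which one score is computed. The sketch below assumes hidden states of shape (batch, sequence length, hidden size) and ignores padding tokens. The helper name `pool_hidden_states` is hypothetical, and the handling in run_model.py may differ.

```python
import numpy as np


def pool_hidden_states(hidden_states: list) -> np.ndarray:
    """Stack per-layer hidden states into one point cloud of shape (points, hidden_dim)."""
    flattened = [h.reshape(-1, h.shape[-1]) for h in hidden_states]
    return np.concatenate(flattened, axis=0)


rng = np.random.default_rng(0)
num_layers, batch_size, seq_len, hidden_dim = 3, 2, 5, 8
layer_outputs = [
    rng.standard_normal((batch_size, seq_len, hidden_dim)) for _ in range(num_layers)
]

cloud = pool_hidden_states(layer_outputs)
print(cloud.shape)  # (30, 8): 3 layers * 2 sequences * 5 tokens, each an 8-dim vector
```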
