### Overview
This module extracts and analyzes the internal representations of models. We
support three levels of representation:

1. Sparse autoencoder representations of the MLPs.
2. Raw activations of the MLPs.
3. Simple embedding representations at the last layer of the transformer.

We explore how much information about the reward models can be extracted at
each of these representation levels.
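
To make the three levels concrete, here is a minimal, dependency-free sketch on
a toy single-block model. It is not this module's API: `extract_representations`,
the weight names, and the dimensions are all hypothetical. It also assumes a
common SAE encoder form, `ReLU(W_enc h + b_enc)`, and takes the "embedding" to
be the last-layer state at the final token, neither of which is specified above.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def extract_representations(tokens, W_in, W_out, W_enc, b_enc):
    """Return the three representation levels for a toy one-block model.

    tokens: list of residual-stream vectors, one per token.
    All weights are hypothetical stand-ins for a trained model and SAE.
    """
    sae_codes = []  # level 1: sparse autoencoder codes of the MLP activations
    raw_mlp = []    # level 2: raw MLP hidden activations
    outputs = []    # per-token output of the (only, hence last) layer
    for x in tokens:
        h = relu(matvec(W_in, x))  # MLP hidden activation
        raw_mlp.append(h)
        # Assumed SAE encoder: ReLU(W_enc @ h + b_enc)
        sae_codes.append(relu([a + b for a, b in zip(matvec(W_enc, h), b_enc)]))
        # Residual connection: layer output = input + MLP output
        outputs.append([a + b for a, b in zip(x, matvec(W_out, h))])
    # Level 3 (assumption): embedding = last-layer state at the final token
    return {"sae": sae_codes, "mlp": raw_mlp, "embedding": outputs[-1]}


rng = random.Random(0)
d_model, d_mlp, d_sae, n_tokens = 4, 8, 16, 3
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(n_tokens)]
reps = extract_representations(
    tokens,
    rand_matrix(d_mlp, d_model, rng),  # W_in
    rand_matrix(d_model, d_mlp, rng),  # W_out
    rand_matrix(d_sae, d_mlp, rng),    # W_enc
    [-0.5] * d_sae,                    # b_enc: negative bias encourages sparsity
)
print(len(reps["sae"][0]), len(reps["mlp"][0]), len(reps["embedding"]))  # 16 8 4
```

In a real model these tensors would typically be captured with a tensor
library's hooks rather than recomputed by hand. The sketch only shows how the
three levels relate to one another.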