# 1. Features as a Decomposition

* There is significant empirical evidence suggesting that neural networks have interpretable linear directions in activation space.

![1696735494605](image/Decomposing_with_Lan_Models/1696735494605.png)

![1696739319846](image/Decomposing_with_Lan_Models/1696739319846.png)

* If linear directions are interpretable, it's natural to think there's some "basic set" of meaningful directions which more complex directions can be created from. We call these directions features, and they're what we'd like to decompose models into.
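* The activation vector is decomposed as a linear combination of features. Features are more general than neurons: a feature can be any direction in activation space, not only a single neuron's basis direction.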
* Superposition Hypothesis:

  ![1696735846434](image/Decomposing_with_Lan_Models/1696735846434.png)

  ![1696735865247](image/Decomposing_with_Lan_Models/1696735865247.png)
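The decomposition and the superposition hypothesis can be made concrete with a small toy sketch in plain Python. It uses the following assumptions, none of which come from the notes: feature directions are random unit vectors, the sizes `DIM=256` and `N_FEATURES=1024` are hypothetical, and all helper names are illustrative. The sketch builds an activation vector as a sum of feature directions, x = Σ f_i · d_i. It uses 4x more features than dimensions, which is only possible because the directions are almost (not exactly) orthogonal. When only a few features are active, each one can still be read back with a dot product, with some small interference.

```python
import math
import random

def unit_vector(dim, rng):
    """Sample a random direction on the unit sphere in R^dim."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def compose(feature_acts, directions, dim):
    """Activation vector x = sum_i f_i * d_i (only nonzero f_i contribute)."""
    x = [0.0] * dim
    for f, d in zip(feature_acts, directions):
        if f:
            for j in range(dim):
                x[j] += f * d[j]
    return x

rng = random.Random(0)
DIM, N_FEATURES = 256, 1024  # hypothetical sizes: 4x more features than dimensions
directions = [unit_vector(DIM, rng) for _ in range(N_FEATURES)]

# Random directions in high dimensions are only "almost" orthogonal:
# |cos| between two of them is typically on the order of 1/sqrt(DIM).
overlaps = [abs(dot(directions[0], d)) for d in directions[1:]]

# Sparse input: only three features are active at once.
ACTIVE = (3, 100, 1000)
feature_acts = [0.0] * N_FEATURES
for i in ACTIVE:
    feature_acts[i] = 1.0
x = compose(feature_acts, directions, DIM)

# Read every feature back with a dot product. Active features read close to 1;
# inactive ones pick up only small interference from the non-orthogonal directions.
readout = [dot(x, d) for d in directions]
inactive_max = max(abs(r) for i, r in enumerate(readout) if i not in ACTIVE)
print(round(max(overlaps), 3), [round(readout[i], 2) for i in ACTIVE], round(inactive_max, 2))
```

If many features were active at the same time, the interference terms would add up and the readout would become unreliable. This is why sparsity matters for superposition.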

# 2. What Makes a Good Decomposition

## 2.1 What Is a Good Decomposition

![1696737735165](image/Decomposing_with_Lan_Models/1696737735165.png)
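Two properties of a candidate decomposition can be given rough numerical proxies: how sparse the feature activations are, and how faithfully the features reconstruct the original activations. The sketch below assumes two simple stand-ins. These are mean L0 (average number of active features per example) and fraction of variance unexplained by the reconstruction. They are not necessarily the criteria listed in the figure above, and the function names are hypothetical.

```python
def mean_l0(codes):
    """Average number of nonzero feature activations per example (lower = sparser)."""
    return sum(sum(1 for c in row if c != 0.0) for row in codes) / len(codes)

def fraction_variance_unexplained(xs, x_hats):
    """Squared reconstruction error divided by the total variance of xs around their mean."""
    n, dim = len(xs), len(xs[0])
    mean = [sum(x[j] for x in xs) / n for j in range(dim)]
    err = sum((x[j] - xh[j]) ** 2 for x, xh in zip(xs, x_hats) for j in range(dim))
    var = sum((x[j] - mean[j]) ** 2 for x in xs for j in range(dim))
    return err / var

# Tiny hand-made example: 3 activation vectors, their reconstructions,
# and the feature activations (codes) that produced them.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_hats = [[0.9, 0.0], [0.0, 1.0], [1.0, 0.9]]
codes = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.0], [1.0, 0.0, 0.7]]

l0 = mean_l0(codes)                           # (1 + 1 + 2) / 3
fvu = fraction_variance_unexplained(xs, x_hats)  # 0.02 / (4/3)
print(l0, fvu)
```

The two numbers pull against each other. Allowing more active features per example usually lowers the reconstruction error, so a decomposition is judged on the trade-off rather than on either metric alone.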

## 2.2 What a Good Decomposition Would Allow Us to Do

![1696737835191](image/Decomposing_with_Lan_Models/1696737835191.png)

# 3. Using Sparse Autoencoders to Find Good Decompositions


![1696739468031](image/Decomposing_with_Lan_Models/1696739468031.png)
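The forward pass and loss of a sparse autoencoder can be sketched in dependency-free Python. This is a minimal sketch of a common formulation, not a definitive implementation. The decoder bias is subtracted from the input before encoding, the encoder is a ReLU layer over an overcomplete dictionary, and the decoder is linear. The loss is squared reconstruction error plus an L1 penalty on the feature activations. The following are assumptions made for illustration: the sizes, the unit-norm decoder columns, initializing the encoder as the decoder's transpose, and the `l1_coeff` value. Training (minimizing this loss by gradient descent over many activation vectors) is not shown.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into nonnegative feature activations f, then decode back to x_hat."""
    x_centered = [xi - bi for xi, bi in zip(x, b_dec)]        # subtract decoder bias
    pre = matvec(W_enc, x_centered)                            # W_enc: N_FEATURES x DIM
    f = [max(0.0, p + b) for p, b in zip(pre, b_enc)]         # ReLU encoder
    x_hat = [s + b for s, b in zip(matvec(W_dec, f), b_dec)]  # W_dec: DIM x N_FEATURES
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff):
    """Squared reconstruction error plus an L1 penalty that pushes f toward sparsity."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(fi) for fi in f)
    return recon + l1_coeff * sparsity

rng = random.Random(0)
DIM, N_FEATURES = 8, 32  # hypothetical: dictionary 4x wider than the activation

# Decoder columns are the learned feature directions; normalize each to unit norm (assumed).
W_dec = [[rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)] for _ in range(DIM)]
for k in range(N_FEATURES):
    norm = math.sqrt(sum(W_dec[j][k] ** 2 for j in range(DIM)))
    for j in range(DIM):
        W_dec[j][k] /= norm

# Assumed initialization: encoder starts as the transpose of the decoder.
W_enc = [[W_dec[j][k] for j in range(DIM)] for k in range(N_FEATURES)]
b_enc = [0.0] * N_FEATURES
b_dec = [0.0] * DIM

x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=1e-3)
print(sum(1 for fi in f if fi > 0), "active features, loss =", round(loss, 4))
```

In this formulation the decoder columns play the role of the feature directions d_i from Section 1, and f holds the feature activations. The L1 term is what favors decompositions in which only a few features are active for any given input.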