## Interpretability beyond feature attribution

One of the better-known papers that uses a model's internal activations to extract further information about the model is [Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors](https://arxiv.org/abs/1711.11279) (TCAV).

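The method trains a linear classifier on a layer's activations to separate examples of a concept from random examples, then combines the classifier's weight vector (the concept activation vector, or CAV) with the gradients of the model's output with respect to that layer to determine how sensitive the model's predictions are to the concept.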

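For reference, the two quantities involved can be written out in the paper's notation, which is not otherwise used in this section: $f_l(x)$ is the activation of layer $l$ for input $x$, $h_{l,k}$ maps those activations to the logit of class $k$, $v_C^l$ is the CAV for concept $C$ at layer $l$, and $X_k$ is the set of inputs of class $k$.

$$
S_{C,k,l}(x) = \nabla h_{l,k}\big(f_l(x)\big) \cdot v_C^l
$$

$$
\mathrm{TCAV}_{Q_{C,k,l}} = \frac{\left|\{x \in X_k : S_{C,k,l}(x) > 0\}\right|}{|X_k|}
$$

The first is the directional derivative of the class logit along the CAV. The second is the fraction of class examples for which that derivative is positive.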

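A basic sketch of TCAV in PyTorch-style pseudocode is as follows (helpers such as `train_model`, `capture_layers_of_model` and `LinearModel` are placeholders, not library functions):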

```python
import torch

model = train_model(model, data_train)

# defined concepts
concepts = data_train.get_concept("stripes")
non_concepts = random_data_like(concepts)


# labels for the linear classifier: 1 for concept examples, 0 for random examples
y_train = torch.cat([torch.ones(len(concepts)), torch.zeros(len(non_concepts))])


capture = capture_layers_of_model(model, [conv1, conv2, ...])
# concat the concepts and non-concepts so we can generate the activations together
_ = model(torch.cat((concepts, non_concepts), 0))


# Here we train a linear model to get the concept activation vectors
cav = {}
for layer in capture.keys():
    # for each layer we are 'testing' we get the activations and train a linear classifier, then save the CAV
    activations = capture[layer].activations
    activations = activations.reshape(len(activations), -1).detach().numpy()
    linear_model = LinearModel()

    linear_model.fit(activations, y_train)
    cav[layer] = linear_model.coef_.reshape(-1)


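# Here we calculate the TCAV score itself, which shows how sensitive the prediction is to the concept at each layer.
# Note: the paper scores examples of the class under test; this sketch uses held-out concept examples instead.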

withheld_concepts = data_test.get_concept("stripes")
preds = model(withheld_concepts)

for layer in capture.keys():
    capture[layer].capture_gradients()

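# backpropagate only the logit of the class under test;
# `target_class` is a placeholder for that class index, and `preds` is assumed to have shape (batch, classes)
y = torch.zeros_like(preds)
y[:, target_class] = 1
preds.backward(y)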

cav_sensitivity_scores = {}
for layer in capture.keys():
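    grad = capture[layer].grad
    grad = grad.reshape(len(grad), -1)
    # the CAV is a NumPy array (from the linear model), so convert it before the matrix product
    cav_tensor = torch.as_tensor(cav[layer], dtype=grad.dtype)
    # directional derivative of the class logit along the CAV, one value per example
    cav_sensitivity_scores[layer] = grad @ cav_tensor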

tcav_scores = {}
for layer, scores in cav_sensitivity_scores.items():
    # fraction of examples with a positive directional derivative
    tcav_scores[layer] = (scores > 0).float().mean().item()
```
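
Because the listing above relies on undefined helpers, the scoring step can also be shown in isolation. The following is a minimal, dependency-free sketch, not the paper's implementation: the function names are made up for illustration, the activations and gradients are invented numbers, and the CAV is taken as the normalised difference of class means, a simpler stand-in for the trained linear classifier. In practice the gradients would come from autograd, as in the listing above.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def difference_of_means_cav(concept_acts, random_acts):
    """Simplified stand-in for the linear classifier (an assumption of this sketch):
    unit vector pointing from the mean random activation to the mean concept activation."""
    diff = [c - r for c, r in zip(mean_vector(concept_acts), mean_vector(random_acts))]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]


def tcav_score(gradients, cav):
    """Fraction of examples whose directional derivative along the CAV is positive."""
    sensitivities = [dot(g, cav) for g in gradients]
    return sum(s > 0 for s in sensitivities) / len(sensitivities)


# Made-up flattened activations at one layer (illustrative values only).
concept_acts = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.1]]
random_acts = [[0.1, 0.0], [-0.1, 0.3], [0.0, -0.3]]
cav = difference_of_means_cav(concept_acts, random_acts)

# Made-up per-example gradients of the class logit w.r.t. the same layer.
gradients = [[0.5, 1.0], [-0.2, 3.0], [0.1, -1.0], [-0.4, 0.0]]
score = tcav_score(gradients, cav)
print(cav, score)
```

With these numbers the CAV points along the first activation dimension, and two of the four gradients have a positive component in that direction, so the score is 0.5. A score near 0.5 on real data would suggest the concept has no consistent effect on the class logit at that layer.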
