
# DePass Attribution Toolkit

This repository provides an implementation of DePass, a modular and extensible attribution framework for analyzing transformer-based language models. DePass supports arbitrary-granularity attribution by allowing initialization and propagation of attribution signals from any component within a transformer model. In this implementation, we demonstrate several representative use cases including token-level, neuron-level, module-level (MLP or attention), head-level, and custom subspace-level attributions, enabling fine-grained interpretability of internal mechanisms across attention and feedforward layers.

## File Structure

- `cite_functions.py`: Core implementation of the `attr_state_manager` class, supporting various attribution granularities and propagation strategies.
- `Demo.ipynb`: Demonstrates usage with HuggingFace-compatible LLaMA/Qwen models, including token-level, model component-level (e.g., MLP, attention, neurons), and subspace-level attribution with custom initialization.
- `datasets/counterfact_info.json`: Contains prompts, factual info, injected incorrect info, and correct info, used for subspace-level input attribution to analyze which input parts activate specific subspaces.
- `datasets/truthfulqa_info.json`: Derived from TruthfulQA; contains questions with factually correct and incorrect options, supporting evaluation of attribution consistency under adversarial or ambiguous inputs.
- `datasets/counterfact_multilingual.json`: Provides fact-based prompts in multiple languages with aligned semantics and language-specific forms, enabling multilingual subspace attribution experiments and cross-lingual consistency analysis.

## Environment Setup

Tested with the following major packages:

- `torch==2.4.1+cu121`
- `transformers==4.44.2`
- `numpy==1.26.3`

Ensure GPU support (CUDA 12.1) is available for best performance.

```bash
pip install torch==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.44.2 numpy==1.26.3
```

You may also need `tqdm` for progress bar visualization.

## Quick Start

In `Demo.ipynb`, the typical workflow includes:

1. **Model and Tokenizer Loading**:
   Load a pretrained model (e.g., LLaMA, Qwen) and tokenizer using HuggingFace `transformers`.

2. **Attribution Manager Instantiation**:
   ```python
   from cite_functions import attr_state_manager
   manager = attr_state_manager(model_name="llama", model=model, tokenizer=tokenizer)
   ```

3. **Token-Level Attribution**:
   ```python
   attribute_state, states = manager.get_last_layer_attribute_state(prompt)
   ```

4. **Module-Level Attribution (e.g., MLP layer)**:
   ```python
   attribute_state = manager.get_layer_module_attribution_state(prompt, start_layer_idx=5, type="mlp")
   ```

5. **Subspace-Level Attribution**:
   Users can define a custom initialization tensor for a given layer and propagate it:
   ```python
   attribute_state = manager.get_subspace_attribute_state(prompt, start_layer_idx, custom_tensor)
   ```
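
Step 1 above can be sketched as follows. This is a minimal example rather than the notebook's exact code: the helper name `load_model_and_tokenizer`, the `float16` dtype, and the default device are assumptions chosen for illustration, and the model identifier is left for the caller to supply.

```python
def load_model_and_tokenizer(model_id, device="cuda"):
    """Load a causal LM and its tokenizer from a Hub ID or a local path.

    Hypothetical helper, not part of the toolkit.
    """
    # Imported inside the function so that defining the helper
    # does not itself require the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Half precision is an assumption made here to reduce GPU memory use.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    model = model.to(device).eval()  # inference only, no dropout
    return model, tokenizer


# Usage (requires a downloaded or locally available LLaMA/Qwen checkpoint):
# model, tokenizer = load_model_and_tokenizer("<hub-id-or-local-path>")
# manager = attr_state_manager(model_name="llama", model=model, tokenizer=tokenizer)
```

The returned `model` and `tokenizer` are the objects passed to `attr_state_manager` in step 2.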

## Attribution Outputs

The output attribution tensors produced by DePass vary by use case but follow the general format:

```
(N, *, D)
```

Where:
- `N`: sequence length (number of tokens)
- `*`: number of attribution components, determined by the decomposition granularity:
  - `M`: number of user-defined components (e.g., selected neurons, module parts, or embedding subspaces)
  - `N`: one component per input token, giving full token-to-token attribution (when analyzing inter-token propagation)
- `D`: hidden size of the model

This flexible structure enables arbitrary initialization and propagation schemes across the transformer layers.
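
To make the `(N, *, D)` layout concrete, the sketch below reduces a toy attribution tensor to one scalar score per component by projecting each component's vector at a chosen token position onto a direction in hidden space. The function `component_scores`, the choice of direction (for example, the unembedding vector of a target token), and the random toy data are assumptions for illustration; the toolkit may compute scores differently.

```python
import numpy as np


def component_scores(attribution_state, direction, position=-1):
    """Project each component at one token position onto a direction.

    attribution_state: array of shape (N, M, D)
    direction:         array of shape (D,)
    returns:           array of shape (M,), one score per component
    """
    attribution_state = np.asarray(attribution_state)
    direction = np.asarray(direction)
    if attribution_state.ndim != 3:
        raise ValueError("expected an array of shape (N, M, D)")
    if attribution_state.shape[-1] != direction.shape[0]:
        raise ValueError("hidden sizes of state and direction differ")
    # (M, D) @ (D,) -> (M,)
    return attribution_state[position] @ direction


# Toy data standing in for a real attribution tensor.
rng = np.random.default_rng(0)
N, M, D = 4, 3, 8
toy_state = rng.normal(size=(N, M, D))
toy_direction = rng.normal(size=D)

scores = component_scores(toy_state, toy_direction)
print(scores.shape)  # (3,)
```

Because the projection is linear, the scores sum to the projection of the summed components. A PyTorch tensor can be converted first with `tensor.detach().float().cpu().numpy()`.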



## Notes
- Internally uses PyTorch hooks to capture intermediate activations and control attention behavior.
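
For readers unfamiliar with the mechanism, the following is a minimal sketch of capturing intermediate activations with PyTorch forward hooks. It uses a small toy network, and the helper `capture_activations` is hypothetical; it does not reproduce the hooks in `cite_functions.py` and does not cover controlling attention behavior.

```python
import torch
from torch import nn


def capture_activations(model, module_names):
    """Register forward hooks that store the output of each named module."""
    captured = {}
    handles = []
    modules = dict(model.named_modules())
    for name in module_names:
        def hook(module, inputs, output, name=name):
            # Some modules return tuples; keep the first element in that case.
            tensor = output[0] if isinstance(output, tuple) else output
            captured[name] = tensor.detach()
        handles.append(modules[name].register_forward_hook(hook))
    return captured, handles


# Toy network standing in for a transformer; submodules are named "0", "1", "2".
toy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))
captured, handles = capture_activations(toy, ["0", "2"])

with torch.no_grad():
    out = toy(torch.randn(4, 8))

# Remove the hooks once the activations have been collected.
for handle in handles:
    handle.remove()

print({name: tuple(t.shape) for name, t in captured.items()})
```

Removing the handles after use keeps later forward passes free of side effects.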