# Lazy Attention

Lazy Attention is a focused attention mechanism that addresses two long-standing issues in standard Transformer attention: representational collapse and attention sink. Instead of treating these failures independently, Lazy Attention views them as two sides of the same problem: improper attention distribution. The layer sharpens attention when multiple relevant tokens compete (overload) and relaxes it when none are informative (underload), yielding a more faithful contextual representation.

## Motivation
- **Attention overload**: When several tokens receive similarly high weights, the contextualized representation collapses to an indistinguishable mixture.
- **Attention underload**: When no token is semantically useful, classic softmax still forces a distribution, creating spurious attention sinks.
- **Unified perspective**: Both issues stem from forcing dense distributions; Lazy Attention adapts the focus level instead of always committing to a full softmax.
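
The two failure modes can be reproduced with an ordinary softmax. The snippet below is a small illustration in plain Python (no PyTorch required); the logit values are made up for the example and are not taken from any experiment.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Overload: three equally relevant tokens compete, so the weight is split
# evenly among them and the output becomes a blend of all three.
overload = softmax([4.0, 4.0, 4.0, -4.0])

# Underload: every logit is strongly negative (nothing is relevant),
# yet the weights must still sum to 1, so the mass has to go somewhere.
underload = softmax([-9.0, -9.0, -9.0, -9.0])

print([round(w, 3) for w in overload])   # [0.333, 0.333, 0.333, 0.0]
print([round(w, 3) for w in underload])  # [0.25, 0.25, 0.25, 0.25]
```

In the underload case a trained model has no way to express "attend to nothing", which is the pressure that tends to produce sink tokens.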

## Key Ideas
- **Positional discrimination**: Head- and dimension-specific positional biases encourage heads to specialize, preventing overloaded, redundant attention maps.
- **Elastic-Softmax**: A relaxed normalization that subtracts an adaptive offset before softmax, allowing the layer to suppress irrelevant tokens entirely.
- **Attention sparsity**: The combination yields up to 59.58% sparsity in attention maps while maintaining language modeling quality.
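
To make the Elastic-Softmax idea concrete, here is a minimal sketch in plain Python. It is not the code in `fla/layers/lazy_attention.py`. The exact placement and parameterization of the offset are not spelled out above, and a constant shift of the logits leaves a plain softmax unchanged, so this sketch assumes the offset is subtracted from the normalized weights and the result is clipped at zero. The function names `elastic_softmax` and `sparsity`, and the fixed offset value, are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def elastic_softmax(logits, offset):
    """Hypothetical relaxed normalization: softmax, subtract offset, clip at zero.

    The result is not renormalized, so a row may sum to less than 1 (or to 0).
    """
    return [max(w - offset, 0.0) for w in softmax(logits)]

def sparsity(rows):
    """Fraction of entries in the attention map that are exactly zero."""
    flat = [w for row in rows for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)

scores = [
    [5.0, 0.0, 0.0, 0.0],  # one clearly relevant token
    [0.0, 0.0, 0.0, 0.0],  # nothing stands out
]
offset = 0.3  # assumed constant; the README describes the real offset as adaptive

dense = [softmax(row) for row in scores]
relaxed = [elastic_softmax(row, offset) for row in scores]

print(sparsity(dense))    # 0.0
print(sparsity(relaxed))  # 0.875
```

In this toy case the first row keeps only its dominant token, and the second row, which has no informative token, drops to all zeros instead of being forced into a uniform distribution.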

## Implementation Highlights
- `fla.layers.LazyAttention` is a drop-in PyTorch module that replaces the previous `SWAttention` implementation.
- Integrated into the `LazyAttentionModel`/`LazyAttentionForCausalLM` stack located in `fla/models/lazy_attention`.
- Works with Flash Attention when available but falls back to pure PyTorch kernels for portability.
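
The optional-dependency fallback described above usually follows a simple pattern. The sketch below shows one common way to do it with the standard library; the helper name `pick_attention_backend` is hypothetical, and the check assumes Flash Attention is installed under the package name `flash_attn`. How `fla` itself selects a kernel may differ.

```python
import importlib.util

def pick_attention_backend():
    """Return 'flash_attn' if that package is importable, else 'pytorch'."""
    # find_spec returns None when a top-level package is not installed,
    # so nothing is imported and no exception is raised on a missing dependency.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attn"
    return "pytorch"

backend = pick_attention_backend()
print(f"attention backend: {backend}")
```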

## Installing
```bash
pip install -e .
```
The project targets Python 3.10+ and PyTorch 2.2 or newer. Optional extras (Flash Attention, Triton kernels, etc.) can be installed for additional speedups.
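
If an install misbehaves, a quick check of the stated minimum versions can help. This is a small standalone sketch, not a script shipped with the repository; `version_tuple` and `check_environment` are hypothetical helper names.

```python
import re
import sys
from importlib import metadata

def version_tuple(text, parts=2):
    """Leading numeric components of a version string, e.g. '2.2.1+cu121' -> (2, 2)."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:parts])

def check_environment():
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("PyTorch is not installed")
    else:
        if version_tuple(torch_version) < (2, 2):
            problems.append(f"PyTorch 2.2+ is required, found {torch_version}")
    return problems

for problem in check_environment():
    print(problem)
```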

## Quick Start
Instantiate the layer directly:
```python
import torch
from fla.layers import LazyAttention

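# 32 query heads share 8 key/value heads (grouped-query attention).
attn = LazyAttention(hidden_size=2048, num_heads=32, num_kv_heads=8)
inputs = torch.randn(2, 512, 2048)  # (batch, seq_len, hidden_size)
# The layer returns a 3-tuple; only the first two entries are used here.
outputs, weights, _ = attn(inputs)
print(outputs.shape)  # torch.Size([2, 512, 2048])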
```

Or build a language model with 🤗 Transformers tooling:
```python
from fla.models import LazyAttentionConfig, LazyAttentionForCausalLM

config = LazyAttentionConfig(hidden_size=2048, num_heads=32, num_hidden_layers=24)
model = LazyAttentionForCausalLM(config)
```
The resulting blocks internally use `LazyAttention`, enabling direct comparison with baseline Transformers by swapping only the attention layer.

## Repository Map
- `fla/layers/lazy_attention.py`: Core Lazy Attention module.
- `fla/models/lazy_attention/`: Configurations and model wrappers that employ Lazy Attention.
- `examples/`: Minimal training and inference scripts.
- `benchmarks/`: Throughput and memory benchmarks for attention variants.
- `tests/`: Unit tests covering kernels, modules, and integration paths.
- `notebooks/`: Interactive analysis, including bias visualization for Lazy Attention.

## Experiments & Evaluation
- Lazy Attention eliminates attention sink artifacts on synthetic probe tasks measuring residual focus on BOS tokens.
- Achieves perplexity competitive with standard attention on open language modeling benchmarks while producing sparse attention maps.
- Compatible with the evaluation utilities in `evals/` and the benchmark harness under `benchmarks/`.
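
One simple way to quantify residual focus on the BOS token is the average attention weight that later queries place on position 0. The sketch below illustrates that idea in plain Python; it is an assumed metric for illustration, not the probe used in `evals/`, and the function name `bos_attention_mass` and both example maps are made up.

```python
def bos_attention_mass(weights):
    """Average attention weight placed on position 0 (BOS) by queries at positions >= 1.

    `weights` is one head's causal attention map as a list of rows,
    where weights[i][j] is how much query i attends to key j.
    """
    later_queries = weights[1:]  # query 0 can only attend to itself
    return sum(row[0] for row in later_queries) / len(later_queries)

# A sink-like head: every query parks most of its weight on BOS.
sink_like = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.1, 0.1],
]
# A focused head: later queries ignore BOS entirely.
focused = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.2, 0.8],
]

print(round(bos_attention_mass(sink_like), 2))  # 0.85
print(round(bos_attention_mass(focused), 2))    # 0.0
```

A value near zero across heads and layers would indicate that the sink has been removed; values near one indicate a strong sink.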

## Contributing
Issues and pull requests are welcome. If you add a variant of Lazy Attention or run new experiments, include reproducible configs under `legacy/training/configs/` or `training/` and update the README accordingly.

## Citation
If you build upon Lazy Attention in academic work, please cite the accompanying paper once it is available. A BibTeX entry will be provided here after publication.

---
Maintained by the Lazy Attention authors. Join our community Discord (link in `utils/links.md`) for discussions and updates.
