# *FlexPrefill*: Supplementary Materials

This repository contains the supplementary material for the paper "*FlexPrefill*: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference" submitted to ICLR. It consists of the algorithm implementation file.

## Key Components

*FlexPrefill* is implemented with Triton 3.0.0, a language and compiler for writing efficient GPU kernels.

### *FlexPrefill* Attention

The main implementation of *FlexPrefill* can be found in the `flex_prefill_attention` function. This function orchestrates the context-aware sparse attention mechanism, which consists of three main steps:

1. **Sparse Pattern Determination**: Algorithm 1 (refer to the paper) determines whether to use the *Query-Aware* pattern or fall back to the *Vertical-Slash* pattern for each attention head.

2. **Sparse Index Selection**: Given the pattern chosen in step 1 and the cumulative attention threshold γ, the sparse index set **S** to be computed for each attention head is obtained by:
   - Algorithm 2 (refer to the paper) for *Query-Aware* pattern
   - Algorithm 3 (refer to the paper) for *Vertical-Slash* pattern

3. **Sparse Attention Computation**: Sparse attention is computed for each attention head over its sparse index set **S**, and the final attention result is returned.

For more detailed explanations, please refer to Algorithm 1 in our paper.
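The role of the cumulative attention threshold γ in step 2 can be illustrated with a small sketch. It is not the repository's Triton implementation and does not reproduce Algorithm 2 or 3. It only shows the general idea of keeping the smallest set of indices whose normalized scores sum to at least γ. The function name `select_by_cumulative_threshold` and the example scores are hypothetical.

```python
def select_by_cumulative_threshold(scores, gamma):
    """Return the smallest set of indices whose normalized scores sum to >= gamma.

    Illustrative only: `scores` stands in for the estimated attention mass
    of each candidate block for a single attention head (an assumption).
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    # Visit indices from the largest score to the smallest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected, covered = [], 0.0
    for i in order:
        selected.append(i)
        covered += scores[i] / total
        if covered >= gamma:
            break
    # Return indices in ascending order for readability.
    return sorted(selected)


# Hypothetical attention mass per block for one head.
block_scores = [0.02, 0.55, 0.08, 0.30, 0.05]
print(select_by_cumulative_threshold(block_scores, gamma=0.9))
```

With these made-up scores, the three largest blocks already cover more than 90% of the mass, so the remaining two would be skipped. A larger γ keeps more indices and moves the computation closer to dense attention.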

## Usage

To use *FlexPrefill*, import the `flex_prefill_attention` function and call it with your query, key, and value tensors, along with the desired hyperparameters:

```python
output = flex_prefill_attention(
    q, k, v,
    gamma=0.9,
    tau=0.1,
    min_budget=None,
    max_budget=None,
    softmax_scale=None,
    block_size=128
)
```

### Parameters

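- `q`, `k`, `v`: Query, key, and value tensors
- `gamma`: Cumulative attention threshold γ used for sparse index selection (`0.9` in the example above)
- `tau`: Threshold used in sparse pattern determination to choose between the *Query-Aware* and *Vertical-Slash* patterns (`0.1` in the example above)
- `min_budget`: Minimum computation budget (optional; `None` in the example above)
- `max_budget`: Maximum computation budget (optional; `None` in the example above)
- `softmax_scale`: Scale factor for the softmax computation (optional; `None` in the example above)
- `block_size`: Size of the attention blocks (`128` in the example above)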
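To make `block_size` and the two budget parameters more concrete, here is a minimal sketch of how a sequence splits into blocks and how optional bounds could limit a selection size. It rests on two assumptions that this README does not confirm: that `None` means "no bound", and that the budgets bound the number of selected items. The helpers `num_blocks` and `clamp_budget` are hypothetical and are not part of the implementation file.

```python
def num_blocks(seq_len, block_size=128):
    """Number of blocks needed to cover seq_len tokens (the last may be partial)."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-seq_len // block_size)


def clamp_budget(selected, min_budget=None, max_budget=None):
    """Clamp a selection size to optional bounds.

    Assumption: None means the corresponding bound is not applied.
    """
    if min_budget is not None:
        selected = max(selected, min_budget)
    if max_budget is not None:
        selected = min(selected, max_budget)
    return selected


print(num_blocks(4096))                    # 4096 tokens split into blocks of 128
print(clamp_budget(3, min_budget=8))       # raised to the lower bound
print(clamp_budget(500, max_budget=256))   # capped at the upper bound
print(clamp_budget(40))                    # no bounds: unchanged
```

Whether the actual budgets are counted in tokens or in blocks is not stated here, so check the implementation file before relying on either interpretation.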
