# HOBA: Higher-Order Block-Diagonal Attention Unrolling for Transformer

**Anonymous Submission to ICLR 2026**

## Abstract

Transformers with 2D self-attention are powerful but computationally intensive, especially for long sequences, because of their quadratic complexity. Sparse attention methods attempt to alleviate this cost by restricting attention patterns, but they often compromise explainability and fail to generalize well to global dependencies. We therefore propose Higher-Order Block-Diagonal Attention (HOBA), a novel transformer variant that models triplet interactions using 3D attention tensors and block-diagonal unrolling. HOBA captures richer patterns within and across blocks and models long-range dependencies efficiently, without high computational cost. We train the HOBA student model by knowledge distillation, with RoBERTa as the teacher. We evaluate HOBA on five NLP tasks across seven benchmark datasets, comparing it against Full-3D (no block or cross-block), standard 2D attention, and sparse mechanisms including Longformer, BigBird, Local, and Dilated attention. We further isolate the contributions of block structure and higher-order interactions, confirming HOBA’s superiority over both dense and sparse baselines. We also demonstrate that allowing cross-block interaction yields significant accuracy gains by enhancing long-range token dependencies.

## Repository Structure

```
├── agnews/                     # AG News 4-class classification
├── imdb2/                      # IMDB binary sentiment classification  
├── MNLI/                       # Multi-Genre Natural Language Inference
├── squad/                      # SQuAD reading comprehension
├── sst2/                       # Stanford Sentiment Treebank binary
├── Trec/                       # TREC question classification 
├── yelp/                       # Yelp Polarity sentiment (512 seq len)
└── HobaWithoutCrossBlock.py    # HOBA variant without cross-block communication
```

Each dataset folder contains:
- `HobaSST2.py` (or equivalent): **Proposed HOBA mechanism**
- `VanillaSST2.py` (or equivalent): **Baseline vanilla attention**

### Higher-Order Block-Diagonal Attention (HOBA)

Our approach introduces a novel attention mechanism that:

1. **Higher-Order Interactions**: Extends standard attention Q·K^T to incorporate third-order tensor products Q·K₁·K₂
2. **Block Diagonal Structure**: Partitions sequences into overlapping blocks for computational efficiency
3. **Cross-Block Communication**: Enables information flow between blocks for long-range dependencies
4. **Knowledge Distillation**: Uses pretrained RoBERTa as teacher to guide student model training
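
The distillation step in item 4 can be illustrated with a common soft-target objective. This README does not state the exact loss used in the scripts, so the following is only a sketch under assumptions: the function names, the `temperature` and `alpha` parameters, and the weighting scheme are hypothetical and follow the usual temperature-scaled formulation rather than the authors' code.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Hypothetical loss: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between softened teacher and student distributions,
    # averaged over the batch.
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1).mean()
    # Cross-entropy of the student against the hard labels (temperature 1).
    probs = softmax(student_logits)
    ce = -np.log(probs[np.arange(len(labels)), labels]).mean()
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce
```

With `alpha=1.0` the loss reduces to the soft-target term alone, which is zero when the student reproduces the teacher's logits exactly.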

### Mathematical Formulation

Standard attention: `Attention(Q,K,V) = softmax(QK^T/√d)V`

Our HOBA mechanism: `Attention(Q,K₁,K₂,V₁,V₂) = softmax(Q⊗K₁⊗K₂/√d)(V₁⊗V₂)`

Here ⊗ denotes tensor product operations computed within the block-diagonal structure.
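
To make the formula concrete, here is a minimal NumPy sketch of third-order attention restricted to diagonal blocks. It is not the repository's implementation, and several choices are assumptions made for illustration: the function name `third_order_block_attention` is hypothetical, blocks are non-overlapping, there is no cross-block communication, the softmax is taken jointly over all key pairs (j, k) in a block, and `V₁⊗V₂` is read as the elementwise product of the two value vectors.

```python
import numpy as np


def third_order_block_attention(q, k1, k2, v1, v2, block_size):
    """q, k1, k2, v1, v2: arrays of shape (seq_len, d). Returns (seq_len, d)."""
    seq_len, d = q.shape
    out = np.zeros((seq_len, d), dtype=float)
    for start in range(0, seq_len, block_size):
        s = slice(start, min(start + block_size, seq_len))
        # Trilinear scores inside the block:
        # scores[i, j, k] = sum_c q[i, c] * k1[j, c] * k2[k, c] / sqrt(d)
        scores = np.einsum("ic,jc,kc->ijk", q[s], k1[s], k2[s]) / np.sqrt(d)
        b = scores.shape[0]
        # Joint softmax over all (j, k) key pairs for each query i.
        flat = scores.reshape(b, -1)
        flat = flat - flat.max(axis=1, keepdims=True)
        weights = np.exp(flat)
        weights /= weights.sum(axis=1, keepdims=True)
        weights = weights.reshape(scores.shape)
        # Weighted sum of elementwise value products v1[j] * v2[k].
        out[s] = np.einsum("ijk,jc,kc->ic", weights, v1[s], v2[s])
    return out
```

Each block of length b builds a b×b×b score tensor, so under these assumptions the cost grows with the cube of the block size rather than the cube of the full sequence length, which is the motivation for the block-diagonal restriction.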

### Requirements

```bash
# Quote each requirement so the shell does not treat ">" as output redirection
pip install "torch>=1.9.0"
pip install "transformers>=4.20.0"
pip install "datasets==2.16.1"
pip install "numpy==1.25.2"
pip install "pandas>=1.3.0"
pip install "tqdm>=4.62.0"
pip install "scikit-learn>=1.0.0"
```

### Dataset Download

**Important**: All datasets will be automatically downloaded from HuggingFace when running the scripts. No manual dataset preparation is required.

### Running Experiments

**Important**: Before running any experiments, navigate to the specific dataset directory and ensure you have appropriate permissions to create result folders.

#### Step 1: Navigate to dataset directory
```bash
cd sst2/  # or agnews/, Trec/, yelp/, etc. (names are case-sensitive on Linux/macOS)
```

#### Step 2: Run experiments
```bash
# Baseline (Vanilla Attention)
python VanillaSST2.py

# Proposed Method (HOBA)
python HobaSST2.py
```

#### Example for different datasets:
```bash
# Starting from the repository root

# AG News
cd agnews/
python VanillaAGNews.py
python HobaAGNews.py

# TREC
cd ../Trec/
python VanillaTREC.py
python HobaTREC.py

# Yelp Polarity (512 sequence length)
cd ../yelp/
python VanillaYelp.py
python HobaYelp.py
```


**Note**: You can modify hyperparameters by editing the main function in each script before running.


#### Variant without Cross-Block Communication:
```bash
# From root directory
python HobaWithoutCrossBlock.py
```

### Output and Results

Each experiment automatically creates and saves results in:
- `results/models/`: Trained model checkpoints  
- `results/metrics/`: Performance metrics, training logs, and confusion matrices
- JSON reports with detailed per-class analysis

**Note**: Ensure write permissions in the current directory for result folder creation.
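
For readers who want to reproduce this layout in their own scripts, the sketch below creates the two result folders and writes a JSON report. It is an illustration only: the helper name `save_metrics`, the file name `metrics.json`, and the shape of the `metrics` dictionary are hypothetical and may differ from what the repository scripts actually write.

```python
import json
from pathlib import Path


def save_metrics(metrics, root="results", name="metrics.json"):
    """Create results/models and results/metrics, then write metrics as JSON."""
    root = Path(root)
    # Checkpoint folder (left empty here; a training script would fill it).
    (root / "models").mkdir(parents=True, exist_ok=True)
    metrics_dir = root / "metrics"
    metrics_dir.mkdir(parents=True, exist_ok=True)
    path = metrics_dir / name
    path.write_text(json.dumps(metrics, indent=2))
    return path
```

A `PermissionError` raised by `mkdir` here is the failure mode the note above warns about.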

## Hardware Requirements

- **Minimum**: 8GB GPU memory
- **Recommended**: 16GB+ GPU memory for optimal performance
- **CPU**: Compatible but significantly slower


## Reproducibility

All experiments use a fixed random seed (42) and deterministic operations where possible, so that results are reproducible.
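
A typical way to fix seeds in a PyTorch project is sketched below. The helper name `set_seed` is hypothetical and the scripts in this repository may seed differently. NumPy and PyTorch are imported lazily so the snippet also runs where they are not installed.

```python
import os
import random


def set_seed(seed=42):
    """Seed Python, NumPy and PyTorch RNGs (the latter two only if installed)."""
    random.seed(seed)
    # Only affects Python subprocesses started after this call,
    # not the hashing of the current process.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
        # Prefer deterministic cuDNN kernels over autotuned ones.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass
```

Even with these settings, some GPU kernels remain nondeterministic, which is why results are reproducible only "where possible".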

---

**Note**: This is an anonymous submission. Author information and institutional affiliations have been omitted for the peer review process.