# StreamAgent: A Proactive Decision Agent for Streaming Video Understanding

## 💡 Highlights

* 📲 **Proactive StreamAgent Design**: A novel agent integrates task semantics and historical observations to **anticipate future task-relevant temporal and spatial cues**, enabling proactive and goal-driven responses in streaming video understanding.

* ⚡ **Efficient Memory via Hierarchical KV-Cache**: Introduces a streaming KV-cache mechanism that **decouples short-term fine-grained perception from long-term semantic memory**, allowing **selective recall** and efficient computation even on long video streams.
* 📊 **State-of-the-Art Streaming VideoQA Performance**: Achieves **SOTA results** on multiple streaming and long video understanding benchmarks (e.g., OVO-Bench, StreamingBench), significantly outperforming existing online baselines with **lower latency and memory cost**.

## 🔥 News

**[July 22, 2025]** Code is released!

## 📌 Overview

![overview](asset/overview.png)

## 🧩 Key Components

### 📦 Incremental Memory Update

- Converts long video streams into compact Markovian memory states.
- Maintains **short-term memory on GPU** and **long-term abstract memory on CPU**.
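The two-tier idea above can be illustrated with a minimal sketch. This is not the repository's implementation: the class name `TwoTierMemory`, its parameters, and the mean-pooling summary are all assumptions made for illustration, and the "GPU" and "CPU" tiers are modeled as two plain Python containers. The update is Markovian in the sense that the next state depends only on the current state and the incoming frame.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TwoTierMemory:
    """Hypothetical two-tier memory (illustrative only, not the StreamAgent API)."""

    short_capacity: int = 4   # frames kept at full detail (stands in for the GPU tier)
    chunk_size: int = 2       # number of evicted frames merged into one summary
    short_term: deque = field(default_factory=deque)
    long_term: list = field(default_factory=list)   # stands in for the CPU tier
    pending: list = field(default_factory=list)     # evicted frames awaiting a summary

    def update(self, frame_feature: list[float]) -> None:
        """Fold one new frame into memory using only the current state."""
        self.short_term.append(frame_feature)
        if len(self.short_term) > self.short_capacity:
            # The oldest fine-grained frame leaves the short-term tier.
            self.pending.append(self.short_term.popleft())
        if len(self.pending) == self.chunk_size:
            # Assumption: element-wise mean as a placeholder "abstract" summary.
            dim = len(self.pending[0])
            summary = [
                sum(frame[i] for frame in self.pending) / self.chunk_size
                for i in range(dim)
            ]
            self.long_term.append(summary)
            self.pending.clear()


memory = TwoTierMemory()
for t in range(8):
    memory.update([float(t), float(2 * t)])
```

After eight frames, the four most recent frames remain in `short_term` and the four older ones have been compressed into two summaries in `long_term`, so memory grows much more slowly than the stream itself.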

### 🔮 Proactive Event Anticipation

- Leverages a **multimodal world model** to forecast future event sequences.
- Evaluates multiple plan trajectories using an A*-inspired heuristic.
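As a rough illustration of what an A*-inspired evaluation could look like, the sketch below ranks candidate plans by `f = g + h`, where `g` is the cost already spent and `h` is a heuristic estimate of the remaining cost. The `PlanTrajectory` type, its fields, and the example costs are hypothetical; the actual scoring used by StreamAgent may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanTrajectory:
    """Hypothetical candidate plan (illustrative only)."""

    steps: tuple[str, ...]
    cost_so_far: float          # g: e.g. frames or tool calls already consumed
    estimated_remaining: float  # h: heuristic guess of the cost still needed


def f_score(plan: PlanTrajectory) -> float:
    """A*-style priority: known cost plus estimated remaining cost."""
    return plan.cost_so_far + plan.estimated_remaining


def select_plan(candidates: list[PlanTrajectory]) -> PlanTrajectory:
    """Return the candidate with the lowest f-score."""
    if not candidates:
        raise ValueError("at least one candidate plan is required")
    return min(candidates, key=f_score)


candidates = [
    PlanTrajectory(("answer_now",), cost_so_far=1.0, estimated_remaining=5.0),
    PlanTrajectory(("wait", "zoom_in", "answer"), cost_so_far=2.0, estimated_remaining=1.5),
    PlanTrajectory(("wait", "wait", "answer"), cost_so_far=3.0, estimated_remaining=2.0),
]
best = select_plan(candidates)
```

With these made-up costs, the plan that waits and zooms in before answering has the lowest combined score, which mirrors the idea of deferring a response until the relevant evidence is expected to appear.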

### 🛠️ Tool-Augmented Reasoning

- Selects and applies tools **based on reasoning intent** and video content.
- Enables fine-grained observation through bounding box selection and region zoom-in.
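A region zoom-in tool can be sketched as a crop followed by an upscale. The function below is an assumption-laden illustration rather than the repository's tool: `zoom_in`, the `(top, left, bottom, right)` box convention with exclusive end coordinates, and the nearest-neighbour upscaling are all choices made here for brevity, and the frame is a plain nested list rather than a tensor.

```python
def zoom_in(frame: list[list[int]], box: tuple[int, int, int, int], scale: int = 2) -> list[list[int]]:
    """Crop `box` from `frame` and upscale it by pixel repetition (hypothetical helper).

    `box` is (top, left, bottom, right) with exclusive bottom/right, an assumed convention.
    """
    top, left, bottom, right = box
    height, width = len(frame), len(frame[0])
    # Clamp the box to the frame so out-of-range coordinates do not fail silently.
    top, bottom = max(0, top), min(height, bottom)
    left, right = max(0, left), min(width, right)
    if top >= bottom or left >= right:
        raise ValueError("bounding box does not overlap the frame")

    crop = [row[left:right] for row in frame[top:bottom]]
    zoomed = []
    for row in crop:
        widened = [pixel for pixel in row for _ in range(scale)]  # repeat columns
        zoomed.extend(list(widened) for _ in range(scale))        # repeat rows
    return zoomed


# A toy 4x4 "frame" whose pixel value encodes its position.
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
patch = zoom_in(frame, box=(1, 1, 3, 3), scale=2)
```

Here the central 2x2 region is enlarged to 4x4, giving a downstream model more pixels for the selected region.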

### 💾 Streaming KV-Cache Engine

- Decouples **short-term fine-grained perception** from **long-term semantics**.
- Dynamically selects KV entries per layer based on **attention importance**.
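Per-layer selection by attention importance can be pictured as a top-k filter over cached positions. The sketch below assumes each layer already has one importance score per cached token (for example, accumulated attention weight); the function name `select_kv_indices` and the fixed `budget` parameter are hypothetical and not taken from the codebase.

```python
def select_kv_indices(attention_scores: list[list[float]], budget: int) -> list[list[int]]:
    """For each layer, keep the `budget` cached positions with the highest importance.

    `attention_scores[layer][position]` is an assumed per-token importance score.
    Kept indices are returned in temporal order so the cache layout stays consistent.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    selected = []
    for layer_scores in attention_scores:
        k = min(budget, len(layer_scores))
        ranked = sorted(
            range(len(layer_scores)),
            key=lambda i: layer_scores[i],
            reverse=True,
        )
        selected.append(sorted(ranked[:k]))
    return selected


scores = [
    [0.10, 0.50, 0.20, 0.90],  # layer 0
    [0.70, 0.10, 0.60, 0.05],  # layer 1
]
kept = select_kv_indices(scores, budget=2)
```

Because the ranking is computed independently for each layer, different layers can retain different positions, which is the property the bullet above describes.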

## 🎯 Supported Tasks

StreamAgent supports a wide range of streaming and long-form video tasks:

- 📺 **Streaming Video QA**: Real-time, query-aware, and delay-tolerant QA.
- 🧠 **Memory-Based Tasks**: Episodic tracing, sequential recall, hallucination detection.
- 🔮 **Future-Oriented Tasks**: Active planning and response deferral.

## 🚀 Getting Started

1. **Install the required environment**

```bash
pip install -e .
```

2. **Run the script**

```bash
bash run.sh
```

## 💡 Design Insight

> “While reactive agents answer what’s seen, StreamAgent waits for the right moment — predicting, planning, and acting.”

StreamAgent reframes streaming video understanding, shifting it from passive observation to **strategic anticipation and active reasoning**.

## 📬 Contact

For questions, issues, or collaborations, please open a GitHub issue or reach out to the maintainers.
