# MTBench: A Multimodal Time Series Benchmark

**MTBench** ([Hugging Face](https://huggingface.co/collections/afeng/mtbench-682577471b93095c0613bbaa), [GitHub](https://github.com/Graph-and-Geometric-Learning/MTBench), [arXiv](https://arxiv.org/pdf/2503.16858)) is a suite of multimodal datasets for evaluating large language models (LLMs) on temporal and cross-modal reasoning tasks in the **finance** and **weather** domains.

Each benchmark instance aligns high-resolution time series (e.g., stock prices, weather data) with textual context (e.g., news articles, QA prompts), enabling research into temporally grounded and multimodal understanding.




## Finance News-Driven Question Answering (QA)

This dataset introduces a **reasoning-intensive QA benchmark** that evaluates an LLM’s ability to jointly interpret financial news text and corresponding stock time-series data. It is designed to go beyond traditional classification or forecasting tasks by requiring **causal inference**, **correlation assessment**, and **evidence-based decision making**.


The dataset includes two core QA tasks:

#### 1. **Correlation Prediction**

Models are asked to determine how strongly a news article is correlated with future stock price movement. This reflects real-world complexity, where news does not always directly align with market outcomes.

- **3-class labels**: Positive, Neutral, Negative correlation
- **5-class labels**: Strong Positive, Moderate Positive, No Correlation, Moderate Negative, Strong Negative

Labels are generated by prompting GPT-4o with access to ground-truth price changes, ensuring consistency with observed trends.

This task challenges LLMs to assess both **sentiment alignment** and **magnitude of influence**, pushing beyond surface-level sentiment classification.
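The relationship between the two label schemes can be sketched in code. The snippet below is a minimal illustration, not part of the official evaluation: the 5-to-3 class mapping and the helper names `collapse_label` and `accuracy` are assumptions made for this example.

```python
# Hypothetical mapping from the 5-class scheme to the 3-class scheme.
# This grouping is an assumption for illustration, not taken from the dataset.
FIVE_TO_THREE = {
    "Strong Positive": "Positive",
    "Moderate Positive": "Positive",
    "No Correlation": "Neutral",
    "Moderate Negative": "Negative",
    "Strong Negative": "Negative",
}


def collapse_label(label: str) -> str:
    """Map a 5-class label to its assumed 3-class counterpart."""
    key = label.strip()
    # Tolerate a trailing " Correlation" suffix, e.g. "Strong Positive Correlation".
    if key not in FIVE_TO_THREE and key.endswith(" Correlation"):
        key = key[: -len(" Correlation")]
    return FIVE_TO_THREE[key]


def accuracy(predictions, references) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy usage with made-up predictions.
gold = [collapse_label("Strong Positive Correlation"), collapse_label("No Correlation")]
pred = ["Positive", "Negative"]
print(accuracy(pred, gold))
```

Exact-match accuracy is only one possible score; a metric that penalizes "Strong" versus "Moderate" confusions less than sign errors may suit the 5-class setting better.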

#### 2. **Multiple-Choice QA**

Each sample presents a **question with four answer choices** grounded in both:
- Financial news content
- Historical and future stock price time-series

The correct answer is validated using causal logic, textual evidence, and observed market behavior. Distractors (incorrect options) are designed to reflect common reasoning failures, such as over-reliance on superficial trends or misinterpretation of sentiment.

This tests the model’s ability to:
- Understand nuanced financial text
- Integrate it with quantitative market behavior
- Identify misleading claims and infer causal relationships


### Dataset Format

Each sample includes:

- `input_window` / `output_window`: Stock prices surrounding the publication event (5-min granularity)
- `input_timestamps` / `output_timestamps`: UNIX timestamps for time-series alignment
- `text`: Full article content with metadata (`published_utc`, `article_url`)
- `news_price_correlation`: Correlation label (e.g., `"Strong Positive Correlation"`)
- `MCQA`: Multiple-choice question with four options and the correct answer
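As a rough sketch of how these fields fit together, the snippet below pairs timestamps with prices and computes the price change across the forecast horizon. It assumes the windows are flat lists of numbers and that the timestamps are in seconds; the toy values and the helper names `align_series` and `window_return` are hypothetical.

```python
from datetime import datetime, timezone


def align_series(timestamps, prices):
    """Pair each UNIX timestamp (assumed to be in seconds) with its price."""
    if len(timestamps) != len(prices):
        raise ValueError("timestamps and prices must have the same length")
    return [
        (datetime.fromtimestamp(ts, tz=timezone.utc), price)
        for ts, price in zip(timestamps, prices)
    ]


def window_return(sample) -> float:
    """Percent change from the last input price to the last output price."""
    last_seen = sample["input_window"][-1]
    last_future = sample["output_window"][-1]
    return (last_future - last_seen) / last_seen * 100.0


# Toy sample with made-up values; only the field names follow the list above.
sample = {
    "input_window": [100.0, 100.5, 101.0],
    "input_timestamps": [1700000000, 1700000300, 1700000600],
    "output_window": [101.5, 102.0],
    "output_timestamps": [1700000900, 1700001200],
}

history = align_series(sample["input_timestamps"], sample["input_window"])
for when, price in history:
    print(when.isoformat(), price)
print(round(window_return(sample), 3))
```

If the real rows store timestamps in milliseconds or as strings, `align_series` would need a conversion step before calling `datetime.fromtimestamp`.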


### Example QA Prompt

**Question**:  
*Which of the following statements about RHI’s stock price and the given financial analysis is correct?*

**Options**:  
A. The market is bearish due to a price drop after the news.  
B. Investors are losing confidence despite positive earnings forecasts.  
C. The stock showed sustained upward momentum, indicating confidence. ✅  
D. Price rise was purely speculative, undermining the Zacks Rank.
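A question like this could be turned into a text prompt and scored along the following lines. This is a minimal sketch: the internal layout of the `MCQA` field is not specified here, so the letter-keyed `options` dict, the prompt wording, and the helpers `build_prompt` and `extract_choice` are all assumptions.

```python
import re


def build_prompt(question, options):
    """Format a question and its four options (dict keyed by letter) as one prompt."""
    lines = [f"Question: {question}", "Options:"]
    for letter in "ABCD":
        lines.append(f"{letter}. {options[letter]}")
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)


def extract_choice(response):
    """Return the first standalone capital letter A-D in the response, or None.

    This is a naive parser: a response that begins with the article "A"
    would be misread as choice A.
    """
    match = re.search(r"\b([ABCD])\b", response.strip())
    return match.group(1) if match else None


# Hypothetical representation of the example above.
question = (
    "Which of the following statements about RHI's stock price "
    "and the given financial analysis is correct?"
)
options = {
    "A": "The market is bearish due to a price drop after the news.",
    "B": "Investors are losing confidence despite positive earnings forecasts.",
    "C": "The stock showed sustained upward momentum, indicating confidence.",
    "D": "Price rise was purely speculative, undermining the Zacks Rank.",
}

prompt = build_prompt(question, options)
model_response = "The answer is C."  # stand-in for a real model output
print(extract_choice(model_response) == "C")
```

In a real evaluation the prompt would also need to carry the article text and the price series; how those are serialized is left open here.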

---

This QA dataset offers a **rich testing ground for multimodal financial reasoning**, bridging textual analysis with numerical forecasting and market interpretation.

## 📦 Other MTBench Datasets

### 🔹 Finance Domain

- [`MTBench_finance_news`](https://huggingface.co/datasets/afeng/MTBench_finance_news)  
  20,000 articles with URL, timestamp, context, and labels

- [`MTBench_finance_stock`](https://huggingface.co/datasets/afeng/MTBench_finance_stock)  
  Time series of 2,993 stocks (2013–2023)

- [`MTBench_finance_aligned_pairs_short`](https://huggingface.co/datasets/afeng/MTBench_finance_aligned_pairs_short)  
  2,000 news–series pairs  
  - Input: 7 days @ 5-min  
  - Output: 1 day @ 5-min

- [`MTBench_finance_aligned_pairs_long`](https://huggingface.co/datasets/afeng/MTBench_finance_aligned_pairs_long)  
  2,000 news–series pairs  
  - Input: 30 days @ 1-hour  
  - Output: 7 days @ 1-hour

- [`MTBench_finance_QA_short`](https://huggingface.co/datasets/afeng/MTBench_finance_QA_short)  
  490 multiple-choice QA pairs  
  - Input: 7 days @ 5-min  
  - Output: 1 day @ 5-min

- [`MTBench_finance_QA_long`](https://huggingface.co/datasets/afeng/MTBench_finance_QA_long)  
  490 multiple-choice QA pairs  
  - Input: 30 days @ 1-hour  
  - Output: 7 days @ 1-hour

### 🔹 Weather Domain

- [`MTBench_weather_news`](https://huggingface.co/datasets/afeng/MTBench_weather_news)  
  Regional weather event descriptions

- [`MTBench_weather_temperature`](https://huggingface.co/datasets/afeng/MTBench_weather_temperature)  
  Meteorological time series from 50 U.S. stations

- [`MTBench_weather_aligned_pairs_short`](https://huggingface.co/datasets/afeng/MTBench_weather_aligned_pairs_short)  
  Short-range aligned weather text–series pairs

- [`MTBench_weather_aligned_pairs_long`](https://huggingface.co/datasets/afeng/MTBench_weather_aligned_pairs_long)  
  Long-range aligned weather text–series pairs

- [`MTBench_weather_QA_short`](https://huggingface.co/datasets/afeng/MTBench_weather_QA_short)  
  Short-horizon QA with aligned weather data

- [`MTBench_weather_QA_long`](https://huggingface.co/datasets/afeng/MTBench_weather_QA_long)  
  Long-horizon QA for temporal and contextual reasoning



## 🧠 Supported Tasks

MTBench supports a wide range of multimodal and temporal reasoning tasks, including:

- 📈 **News-aware time series forecasting**
- 📊 **Event-driven trend analysis**
- ❓ **Multimodal question answering (QA)**
- 🔄 **Text-to-series correlation analysis**
- 🧩 **Causal inference in financial and meteorological systems**
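For the forecasting task, predictions over the output window can be compared with the ground-truth series using standard error metrics. The sketch below shows mean absolute error (MAE) and mean absolute percentage error (MAPE) as common choices; it is an illustration only and not necessarily the evaluation protocol used by the benchmark.

```python
def mae(pred, true) -> float:
    """Mean absolute error between two equal-length sequences."""
    if len(pred) != len(true) or not true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mape(pred, true) -> float:
    """Mean absolute percentage error, in percent; assumes no true value is zero."""
    if len(pred) != len(true) or not true:
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * sum(abs((p - t) / t) for p, t in zip(pred, true)) / len(true)


# Toy forecast with made-up values.
true_prices = [100.0, 104.0]
pred_prices = [101.0, 103.0]
print(mae(pred_prices, true_prices))
print(mape(pred_prices, true_prices))
```

MAPE is scale-free, which makes it convenient for comparing stocks at different price levels, but it is undefined when a true value is zero, as can happen with temperatures.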



## 📄 Citation

If you use MTBench in your work, please cite:

```bibtex
@article{chen2025mtbench,
  title={MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering},
  author={Chen, Jialin and Feng, Aosong and Zhao, Ziyu and Garza, Juan and Nurbek, Gaukhar and Qin, Cheng and Maatouk, Ali and Tassiulas, Leandros and Gao, Yifeng and Ying, Rex},
  journal={arXiv preprint arXiv:2503.16858},
  year={2025}
}
```
