# MTBench
### Dataset Distribution

Distributions of financial news impact duration and financial news categories:

<div align="center">
  <img src="assets/finance_duration.png" alt="Finance Duration Distribution" width="48%"/>
  <img src="assets/finance_type.png" alt="Finance Report Type Distribution" width="48%"/>
</div>

Distributions of severe weather event durations and event types:

<div align="center" style="display: flex; justify-content: space-between;">
  <img src="assets/weather_duration.png" alt="Weather Duration Distribution" width="48%"/>
  <img src="assets/weather_event.png" alt="Weather Event Distribution" width="48%"/>
</div>

### Evaluation

To evaluate models on MTBench, you need to:

1. Set up API keys for LLMs in `evaluation/api_call.py`
2. Choose the domain, the evaluation task, and the setting
3. Run the corresponding evaluation script

For example, to evaluate time series trend classification on financial data, set the arguments in `evaluation/finance/run_trend_classification.sh`:

```bash
API_NAME="gpt-4o"  # choose the LLM to be evaluated
MODE="combined"    # choose the input type, select from ["timeseries_only", "combined"]
IN_DAYS=30         # length of input time series
OUT_DAYS=7         # length of output time series

python trend_classification.py \
    --dataset_folder="../../data/processed/finance/aligned_in${IN_DAYS}days_out${OUT_DAYS}days" \
    --save_path="../../results/finance/trend_classification_in${IN_DAYS}_out${OUT_DAYS}/${API_NAME}_${MODE}" \
    --model="$API_NAME" \
    --mode="$MODE"
```

Then run the evaluation script:

```bash
cd evaluation/finance
bash run_trend_classification.sh
```

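Results are saved to the directory given by `--save_path`. With the settings above, this is `results/finance/trend_classification_in30_out7/gpt-4o_combined`, relative to the repository root.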
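
To compare the two input types, the same command can be run once per `MODE` value. The following is a minimal sketch, not part of the repository: it reuses only the flags shown in `run_trend_classification.sh`, assumes it is run from `evaluation/finance`, and introduces a hypothetical `DRY_RUN` switch and `run_mode` helper so the commands can be inspected before any API calls are made.

```shell
#!/bin/sh
# Sketch: run trend classification for both input modes.
# DRY_RUN and run_mode are hypothetical helpers, not part of MTBench.
API_NAME="gpt-4o"
IN_DAYS=30
OUT_DAYS=7
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print each command, 0 = execute it

run_mode() {
    mode="$1"
    # Build the command from the same flags used in run_trend_classification.sh.
    set -- python trend_classification.py \
        --dataset_folder="../../data/processed/finance/aligned_in${IN_DAYS}days_out${OUT_DAYS}days" \
        --save_path="../../results/finance/trend_classification_in${IN_DAYS}_out${OUT_DAYS}/${API_NAME}_${mode}" \
        --model="${API_NAME}" \
        --mode="${mode}"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

for MODE in timeseries_only combined; do
    run_mode "$MODE"
done
```

Because `--save_path` includes the mode, each run writes to its own directory. Set `DRY_RUN=0` to execute the commands instead of printing them.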
