# PreDiff: Leveraging Data Priors to Enhance Time Series Generation with Scarce Samples
<img src="https://img.shields.io/badge/python-3.9-blue">
<img src="https://img.shields.io/badge/pytorch-2.0-orange">

> **Abstract:** The fundamental motivation for time series generation tasks lies in addressing the pervasive challenge of data scarcity. However, we have identified a critical limitation: existing time series generation models are prone to substantial performance degradation when trained on limited data. To tackle this issue, we propose a novel framework that integrates data priors to enhance the robustness and generalization of time series generation under data-scarce conditions. Our framework is structured around a two-stage pipeline: pre-training and fine-tuning. In the pre-training stage, the model is trained on synthetic time series datasets to learn data priors, which encode the fundamental statistical properties and temporal dynamics of time series data. Subsequently, during the fine-tuning stage, the model is refined using a small-scale target dataset to adapt to the specific distribution of the target domain. Extensive experimental evaluations demonstrate that our framework mitigates performance degradation caused by data scarcity, achieving state-of-the-art results in time series generation tasks. This work not only advances the field of time series modeling but also provides a scalable solution for real-world applications where data availability is often limited.

![two-stage-strategy](two-stage-strategy.png)

Time series generation (TSG) is crucial across domains such as finance, energy, and healthcare, yet state-of-the-art models—particularly diffusion models—depend heavily on large, high-quality datasets, which are often unavailable due to privacy constraints, high acquisition costs, or event rarity. This leads to severe performance degradation under data-scarce conditions. To address this challenge, we propose **PreDiff**, a two-stage diffusion-based framework that leverages large-scale time series datasets as **data priors**: the model is first pre-trained on synthetic data to learn a general prior distribution, then fine-tuned on limited target data to adapt to the target distribution. Experiments show significant performance improvements under extreme data scarcity. Our contributions include identifying and analyzing the overlooked failure of current TSG models in low-data regimes, introducing a flexible framework that integrates various diffusion architectures and prior datasets, and demonstrating that larger and more feature-aligned priors lead to better performance.

<p align="center">
  <img src="figures/Figure1.png" alt="Figure 1">
  <br>
  <b>Figure 1</b>: Performance drop of time series generation models on the Stock dataset as the training set size varies.
</p>

## Dataset Preparation

All four real-world datasets (Stocks, ETTh1, Energy, and fMRI) can be obtained from [Google Drive](https://drive.google.com/file/d/11DI22zKWtHjXMnNGPWNUbyGz-JiEtZy6/view?usp=sharing). Please download **dataset.zip**, unzip it, and copy the contents to the `./Data` folder in this repository.
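
If you prefer to script the unzip step, the following is a minimal sketch using only the Python standard library. It assumes `dataset.zip` has already been downloaded to the repository root; the helper name `extract_dataset` is hypothetical, and because the internal layout of the archive is not documented here, you may still need to move the extracted files so that they sit directly under `./Data`.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path="dataset.zip", target_dir="./Data"):
    """Extract the downloaded archive into target_dir and list its members.

    Hypothetical helper: both default paths are assumptions based on the
    instructions above, not values taken from the repository code.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create ./Data if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())
```

Calling `extract_dataset()` from the repository root returns the list of extracted member names, which makes it easy to check where the dataset files ended up.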


## Running the Code

The code requires conda3 (or miniconda3) and one CUDA-capable GPU. The instructions below explain how to run the code in this repository.

## Environment & Libraries

The full list of required libraries is provided in `requirements.txt` in this repo. Please create a virtual environment with `conda` or `venv`, activate it, and run

~~~bash
(myenv) $ pip install -r requirements.txt
~~~
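
Since the code expects one CUDA-capable GPU, it can be worth confirming that PyTorch sees the device before starting the notebook. The check below is a minimal sketch, not part of the repository; the function name `describe_device` is hypothetical, and it relies only on the standard `torch.cuda.is_available()` and `torch.cuda.get_device_name()` calls.

```python
def describe_device():
    """Return a short status string about PyTorch and CUDA availability."""
    try:
        import torch  # installed via requirements.txt (assumption)
    except ImportError:
        return "PyTorch is not installed in this environment"
    if torch.cuda.is_available():
        # Report the name of the first visible GPU.
        return f"CUDA GPU found: {torch.cuda.get_device_name(0)}"
    return "PyTorch is installed, but no CUDA GPU is visible"


print(describe_device())
```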

## Pre-training, Fine-tuning & Sampling

We provide the complete pre-training, fine-tuning, sampling, and evaluation pipeline for the 10% Stocks dataset in `pre_DiffusionTS.ipynb`, so that this setting can be reproduced end to end.

To run the notebook, add your conda virtual environment as a selectable kernel in Jupyter Notebook as follows:

### **Firstly, Activate your conda virtual environment**

```bash
conda activate your_env_name
```

Replace `your_env_name` with the name of your environment.

------

### **Secondly, Install ipykernel inside the environment**

```bash
(myenv) $ pip install ipykernel
```

(You may skip this step if it is already installed.)

------

### **Thirdly, Register the environment as a Jupyter kernel**

```bash
(myenv) $ python -m ipykernel install --user --name your_env_name --display-name "Python (your_env_name)"
```

Explanation:

- **--name**: the internal kernel name that Jupyter uses to identify the kernel spec
- **--display-name**: the name shown in the Jupyter Notebook kernel menu (customizable)

Example:

```bash
(myenv) $ python -m ipykernel install --user --name diffusionts --display-name "Python (PreDiff)"
```

------

### **Finally, Launch Jupyter Notebook / Lab**

```bash
jupyter notebook
```

or

```bash
jupyter lab
```

Then, in **Kernel → Change Kernel**, you should see the display name you registered. With the example above, this is:

```
Python (PreDiff)
```

After that, the experimental results in the paper can be reproduced by running the code provided in `pre_DiffusionTS.ipynb`.

## Visualization and Evaluation

After sampling, the synthetic data and the original data are stored as `.npy` files under the *output* folder. These files can be read directly to calculate quantitative metrics such as the discriminative, predictive, correlational, and context-FID scores. You can also reproduce the visualization results using t-SNE or kernel plotting. All of the evaluation code can be found in the `./Utils` folder. Please refer to the `.ipynb` tutorial files in this repo for more detailed implementations.
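
Before running the full metrics, a quick sanity check on the saved arrays can catch shape or scaling problems early. The sketch below is not one of the metrics listed above and is not taken from `./Utils`; the function name `summarize_gap` is hypothetical, and it assumes both arrays have the shape `(n_samples, seq_len, n_features)`.

```python
import numpy as np


def summarize_gap(real, fake):
    """Compare per-feature mean and standard deviation of two sample sets.

    Assumes arrays shaped (n_samples, seq_len, n_features); the number of
    samples may differ, but seq_len and n_features must match.
    """
    if real.ndim != 3 or fake.ndim != 3:
        raise ValueError("expected 3-D arrays (n_samples, seq_len, n_features)")
    if real.shape[1:] != fake.shape[1:]:
        raise ValueError(f"shape mismatch: {real.shape[1:]} vs {fake.shape[1:]}")
    # Pool all time steps of all samples, keeping the feature axis.
    real_flat = real.reshape(-1, real.shape[-1])
    fake_flat = fake.reshape(-1, fake.shape[-1])
    return {
        "mean_gap": np.abs(real_flat.mean(axis=0) - fake_flat.mean(axis=0)),
        "std_gap": np.abs(real_flat.std(axis=0) - fake_flat.std(axis=0)),
    }
```

To use it, load the two files with `np.load(...)` (substituting the actual file names found in the *output* folder) and pass the arrays to `summarize_gap`. Large gaps usually indicate that one array is normalized and the other is not.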

**Note:** All metrics can be found in the `./Experiments` folder. By default, every dataset except Sine (which does not need normalization) has its normalized form saved in `{...}_norm_truth.npy`. Therefore, when you run the Jupyter notebook for a dataset other than Sine, uncomment and adapt the corresponding code at the beginning of the notebook.

### Main Results

<p align="center">
  <b>Table 1</b>: The comprehensive comparison of
PreDiff against state-of-the-art time series
generation models on the Stock dataset at varying data availability levels (100%, 70%, 40%,
and 10%). Red text denotes the best results, and
blue text denotes the second-best.
  <br>
  <img src="figures/Table1.png" alt="Table 1">
</p>

<p align="center">
  <b>Table 2</b>: Comprehensive comparison of
PreDiff against six baselines across four
datasets at 10% data availability. Lower values
indicate better performance.
  <br>
  <img src="figures/Table2.png" alt="Table 2">
</p>


## Acknowledgement

We are grateful to the following GitHub repositories for their valuable code bases:

https://github.com/Y-debug-sys/Diffusion-TS

https://github.com/abudesai/timeVAE

https://github.com/lmnt-com/diffwave

https://github.com/ermongroup/CSDI

https://github.com/jsyoon0823/TimeGAN
