# Time Series Generation

SDForger is a versatile methodology designed to enable the generation of time series using LLMs. Starting from a few observations of multiple time series, the approach employs an efficient embedding representation to transform the time series into tabular data, which is then converted into text. SDForger leverages fine-tuning to learn meaningful patterns within the computed embeddings. At inference time, it produces new textual embeddings decoded back into fully synthetic time series data that mimic the original data’s statistical properties and temporal dynamics.

## Structure of SDForger

SDForger generates data in four stages:

1. **Time-Series Pattern Extraction via Functional PCA** \
   SDForger applies functional principal components analysis to extract dominant patterns in time series data and embed them into a structured tabular format.

2. **Template-Guided Textual Representation for LLM Fine-Tuning** \
   Utilizes a structured template to transform embedding tables into textual descriptions, preparing them for large language model (LLM) fine-tuning.

3. **Inference Step for Generation** \
   Employs a guided inference approach to generate structural embeddings.

4. **Refinement through Decoding and Filtering** \
   Implements a decoding mechanism followed by a filtering step to ensure high-quality output.
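
The four stages above can be illustrated with a minimal sketch. It is not the SDForger implementation: it uses plain PCA (via SVD) on discretized curves as a stand-in for functional PCA, the text template and the helper names (`fit_embedding`, `row_to_text`, `text_to_row`, `decode`) are hypothetical, and the LLM fine-tuning and sampling step is replaced by simply re-parsing the training rows.

```python
import numpy as np

def fit_embedding(curves, k):
    """Project curves of shape (n_samples, length) onto the top-k principal components."""
    mean = curves.mean(axis=0)
    # Rows of vt are orthonormal basis curves, ordered by explained variance.
    _, _, vt = np.linalg.svd(curves - mean, full_matrices=False)
    basis = vt[:k]
    scores = (curves - mean) @ basis.T  # one row of k coefficients per curve
    return mean, basis, scores

def row_to_text(row):
    """Hypothetical template, e.g. 'value_0 is 0.1234, value_1 is -0.5678'."""
    return ", ".join(f"value_{i} is {v:.4f}" for i, v in enumerate(row))

def text_to_row(text):
    """Parse a templated sentence back into a coefficient vector."""
    return np.array([float(part.split(" is ")[1]) for part in text.split(", ")])

def decode(mean, basis, scores):
    """Map coefficient rows back to curves."""
    return mean + scores @ basis

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 250)
# Toy training set: 30 noisy sinusoids of length 250 with random phase.
curves = np.stack([
    np.sin(2 * np.pi * (t + rng.uniform())) + 0.05 * rng.normal(size=t.size)
    for _ in range(30)
])

mean, basis, scores = fit_embedding(curves, k=3)   # stage 1: embedding table
texts = [row_to_text(r) for r in scores]           # stage 2: text for fine-tuning
# Stage 3 stand-in: a fine-tuned LLM would sample new sentences here.
parsed = np.stack([text_to_row(s) for s in texts])
reconstructed = decode(mean, basis, parsed)        # stage 4: decode (no filtering)
```

In a real run, stage 3 would yield sentences that differ from the training rows, and the filtering step would discard rows that fail to parse or fall outside the accepted range.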

## Setup

To run SDForger on a CUDA device, preferably an A100 (the machine used for the paper's results), please run:
```shell
cd SDForger
conda env create -f sdforgerpy310cuda.yaml
conda activate sdforgerpy310cuda
```

To run SDForger on your local MPS machine, please run:
```shell
cd SDForger
conda env create -f sdforgerpy310mps.yaml
conda activate sdforgerpy310mps
```

To run TTM evaluation:
```shell
git clone "https://github.com/ibm-granite/granite-tsfm.git" 
cd granite-tsfm
pip install ".[notebooks]"
```

## Data specification

The datasets are available [here](resources/). The folder is divided into:

1. **Univariate** focuses on learning from a single time series to generate plausible alternative versions. This is useful for simulating counterfactual histories, seasonal variations, or stress-test scenarios in domains like finance, weather, and demand forecasting. \
We provide the bikesharing dataset.

2. **Multisample** aims to produce new instances by combining patterns from multiple existing time series. This setting reflects scenarios such as generating experimental samples, weather profiles, or patient trajectories from heterogeneous observations. It emphasizes diversity and generalization in data-rich contexts. \
We provide the ecl dataset.

3. **Multivariate** evaluates the ability to jointly generate multiple interdependent channels. It reflects real-world settings, such as energy systems, traffic flows, or sensor networks, where channel interactions and cross-correlations are crucial for realism and downstream utility. \
We provide the bikesharing dataset.

## Example: Univariate Settings + TSG evaluation

The default configuration for univariate augmentation is [here](sources/config/config.yaml). It corresponds to the univariate augmentation of the variable `cnt` of the bikesharing dataset.

First, run data augmentation. Then, run TSG evaluation:
```shell
cd SDForger
python sources/run_data_augmentation.py --config sources/config/config.yaml
python sources/run_TSG_evaluation.py --config sources/config/config.yaml
```

Dataset specification
- `data_name`: `bikesharing`, `ecl` ...
- `data_train_channels`: some examples are (`cnt` for univariate `bikesharing`), (`[cnt, temp, hum]` for multivariate `bikesharing`) and (`MT_` for multisample `ecl`) 
- `data_train_param`: `[data_length, num_samples]`, some examples are (`[3000, 1]` for univariate), (`[250, 30]` for multisample)

Global specification
- `seed`: random seed
- `save_results`: (`True`, `False`) whether to save the results
- `evaluation`:
  - `generated_data_path`: default is `output/new_data.npy`
  - `train_data_path`: default is `output/train_data.npy`
- `create_train_val_test`: (`True`, `False`) used for TTM finetuning
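
As a quick sanity check before running the full evaluation, the saved arrays can be compared on a few summary statistics. The sketch below is only an illustration: `summary` and `compare` are hypothetical helpers, the layout of the saved arrays is not specified here, and placeholder arrays are used so that the snippet runs without the output files.

```python
import numpy as np

def summary(x):
    """Global summary statistics of an array of any shape."""
    x = np.asarray(x, dtype=float)
    return {
        "mean": float(x.mean()),
        "std": float(x.std()),
        "min": float(x.min()),
        "max": float(x.max()),
    }

def compare(train, generated):
    """Difference (generated minus train) for each summary statistic."""
    a, b = summary(train), summary(generated)
    return {key: b[key] - a[key] for key in a}

# With real outputs, one would load the default paths first, e.g.
#   train = np.load("output/train_data.npy")
#   generated = np.load("output/new_data.npy")
# Placeholder arrays (assumed shapes) so the sketch runs on its own:
rng = np.random.default_rng(0)
train = rng.normal(size=(30, 250))
generated = rng.normal(size=(100, 250))

report = compare(train, generated)
for name, diff in report.items():
    print(f"{name}: {diff:+.4f}")
```

Large differences in these statistics suggest a problem with generation or decoding; small differences do not by themselves guarantee that the temporal dynamics match.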

SDForger specification
- `sdforger_augmentation_strategy`: (`univariate`, `multisample`, `multivariate`)
- `sdforger_batch`: batch size used to fine-tune the LLM; the default is 32 for `gpt2` and 16 for `granite` and `Phi`
- `sdforger_embedding_dim`: embedding dimension per channel (`3`, `5`, `7`, `auto`)
- `sdforger_embedding_type`: `fica`, `fpc`
- `sdforger_float_type`: `float32`
- `sdforger_init_value`: (`True`, `False`), default is `False` but can be turned on when using `base_template` 
- `sdforger_learning_rate`: `8.0e-05`
- `sdforger_llm`: `gpt2`, `ibm-granite/granite-3.0-2b-base`, `microsoft/Phi-3.5-mini-instruct`
- `sdforger_max_generations`: maximum number of curves to generate, for paper set min=max=`100`
- `sdforger_min_generations`: minimum number of curves to generate, for paper set min=max=`100`
- `sdforger_minimum_windows_length`: windows length (`250`)
- `sdforger_minimum_windows_number`: windows number (`30`)
- `sdforger_norms_diversity_threshold`: `0.0`; the stopping criterion is turned off for the paper evaluation, but in practice it is used to stop generation dynamically when the diversity criterion is not satisfied
- `sdforger_permute`: (`True`, `False`), default is `True`, whether or not to permute columns of tabular representation
- `sdforger_text_template`: (`fim_template`, `base_template`), default is `fim_template`, used in the paper
- `sdforger_train_epochs`: `200`
- `sdforger_train_splitting`: (`minimize-overlap`, `maximize-overlap`)
- `sdforger_variance_explained`: variance used to fix embedding_dim, only used if `sdforger_embedding_dim` is `auto`
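
To give an intuition for the `auto` setting, the sketch below picks the smallest number of components whose cumulative explained variance reaches a target. This is an assumption about how `sdforger_variance_explained` fixes the dimension, not the actual SDForger code; `auto_embedding_dim` is a hypothetical helper and plain PCA is used in place of functional PCA.

```python
import numpy as np

def auto_embedding_dim(curves, variance_explained):
    """Smallest k whose top-k components explain at least the requested variance share."""
    centered = curves - curves.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(ratio)
    # First index where the cumulative share reaches the target, as a count.
    k = int(np.searchsorted(cumulative, variance_explained)) + 1
    # Guard against floating-point round-off when the target is close to 1.
    return min(k, len(singular_values))

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 100)
# Toy curves built from one strong and one weak pattern.
strong = 10.0 * rng.normal(size=(50, 1)) * np.sin(2 * np.pi * t)
weak = rng.normal(size=(50, 1)) * np.cos(2 * np.pi * t)
toy_curves = strong + weak

k_low = auto_embedding_dim(toy_curves, 0.5)      # loose target
k_high = auto_embedding_dim(toy_curves, 0.9999)  # strict target
```

A stricter target can only keep or increase the dimension, which in turn lengthens each textual row the LLM has to learn.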

## Example: Multivariate Settings + TTM evaluation

The default configuration for multivariate augmentation is [here](sources/config/config_ttm.yaml). It corresponds to the multivariate augmentation of the variables `cnt`, `temp`, `hum` of the bikesharing dataset.

First, run data augmentation. Then, run TTM evaluation:
```shell
cd SDForger
python sources/run_data_augmentation.py --config sources/config/config_ttm.yaml
python sources/run_TTM_evaluation.py --config sources/config/config_ttm.yaml
```

TTM specification
- `TTM`:
  - `TTM_MODEL_REVISION`: `main`
  - `TTM_column_specifiers`: column specifier object for TTM finetuning
  - `TTM_model`: `ibm-granite/granite-timeseries-ttm-r2`
  - `context_length`: (`512`) context length of the TTM version used
  - `forecast_length`: (`96`) forecast length of the TTM version used
- `create_train_val_test`: (`True`, `False`) used for TTM finetuning