# STT-LLM: Structural-Temporal Tokenization for Adapting LLMs to Longitudinal Profiles


## Abstract
Large Language Models (LLMs) have shown strong generalization across natural language tasks but remain underexplored for longitudinal biomedical profiles. In sports doping analytics, two critical challenges arise: (i) sequence prediction for early detection of prohibited substance use, and (ii) anomaly detection for identifying doping-related deviations. We propose STT-LLM, a structural-temporal tokenization framework that adapts LLMs to longitudinal analysis without modifying the backbone architecture. STT-LLM constructs joint embeddings that capture both temporal dynamics and pathway-based interactions, which are then transformed into LLM-compatible tokens by specialized structural and temporal tokenizers. We evaluate our approach on real-world longitudinal steroid datasets from athletes, where STT-LLM consistently outperforms LLM baselines. In addition, we present a case study in which STT-LLM provides contextual reasoning that aligns more closely with expert assessments than that of baseline models. These results highlight the effectiveness of embedding-guided tokenization for adapting LLMs to longitudinal data.

## Key Features
- **Integration of Structural-Temporal Information**: Utilizes structural-temporal embeddings to effectively represent the intricate relationships in longitudinal clinical profiles.
- **Anomaly Detection**: Provides a robust mechanism for identifying anomalies in longitudinal data using binary classification.
- **Sequence Prediction**: Predicts the features of a sample based on the sequence of previous samples.
- **Contextual Reasoning**: When a sample is classified as an anomaly, the model explains why.
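
As a rough illustration of the sequence prediction setup, the sketch below pairs each window of previous samples with the sample that follows it. The function name `make_windows`, the `history` parameter, and the toy values are assumptions made for this example and are not taken from the repository scripts.

```python
from typing import List, Tuple

Sample = List[float]  # one sample = one vector of metabolite measurements


def make_windows(profile: List[Sample], history: int) -> List[Tuple[List[Sample], Sample]]:
    """Pair each run of `history` consecutive samples with the sample that follows it."""
    if history < 1:
        raise ValueError("history must be at least 1")
    pairs = []
    for end in range(history, len(profile)):
        pairs.append((profile[end - history:end], profile[end]))
    return pairs


# Toy profile: 4 samples with 2 features each (values are made up).
profile = [[1.0, 0.5], [1.1, 0.6], [0.9, 0.4], [3.0, 2.5]]
for context, target in make_windows(profile, history=2):
    print(len(context), "previous samples ->", target)
```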
---
## Methodology
The STT-LLM architecture integrates:
- **Prompt Decoder**: Extracts task-specific instructions and organizes the clinical time-series data.
- **Structural-Temporal Embeddings**: Capture structural correlations and temporal dependencies.
- **Structural and Temporal Tokenization**: Processes both the structural and temporal dimensions of the data.
---
### Model Architecture
- **Structural Component**: Encodes graph structures to represent structural relationships.
- **Temporal Component**: Employs multi-head self-attention for dynamic temporal dependencies.
- **Structural-Temporal Embedding**: Combines structural and temporal embeddings into a cohesive representation.
  
  ![Model Architecture](img/model.png)  
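
The sketch below illustrates, under simplifying assumptions, how a structural and a temporal view of the same profile might be combined. It is not the repository's implementation: it uses neighbour averaging over a hypothetical feature graph in place of a learned graph encoder, a single attention head without learned weights in place of multi-head self-attention, and plain concatenation for the joint embedding. All function names and values are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def structural_embedding(x: Matrix, adjacency: Matrix) -> Matrix:
    """Average each feature with its graph neighbours (self-loop included)."""
    n = len(adjacency)
    out = []
    for sample in x:
        row = []
        for i in range(n):
            neighbours = [sample[j] for j in range(n) if adjacency[i][j] or i == j]
            row.append(sum(neighbours) / len(neighbours))
        out.append(row)
    return out


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_embedding(x: Matrix) -> Matrix:
    """Single-head scaled dot-product self-attention over time steps (no learned weights)."""
    scale = math.sqrt(len(x[0]))
    out = []
    for query in x:
        weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in x])
        out.append([sum(w * value[d] for w, value in zip(weights, x)) for d in range(len(query))])
    return out


def joint_embedding(x: Matrix, adjacency: Matrix) -> Matrix:
    """Concatenate the structural and temporal embeddings of each time step."""
    s = structural_embedding(x, adjacency)
    t = temporal_embedding(x)
    return [si + ti for si, ti in zip(s, t)]


# Toy input: 3 time steps, 3 features; feature 0 is linked to 1, and 1 to 2.
x = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0], [0.0, 1.0, 5.0]]
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
z = joint_embedding(x, adjacency)
print(len(z), "time steps,", len(z[0]), "dimensions each")
```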
---
## Datasets

### Real-World Data:
- Includes longitudinal steroid profiles with 6 metabolites per sample.

### Dummy Data:
- Because the real-world data is confidential, we provide the following dummy data instead:
  - **Train Dataset**: `data/training_data.csv`
  - **Evaluation Dataset**: `data/evaluation_data.csv`
  - **Labels**: `data/labels.csv`
  - **Reasoning Dataset**: `data/reasoning_data.csv`
- These files allow testing the STT-LLM pipeline, although results will differ from the real-world experiments.
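
A minimal sketch of how longitudinal rows could be grouped into per-profile sequences before training is shown below. The column layout of the dummy CSV files is not documented here, so the column names `profile_id` and `sample_index`, the feature columns, and the helper `group_profiles` are all hypothetical; adjust them to the actual file headers.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List


def group_profiles(
    rows: Iterable[Dict[str, str]], id_col: str, order_col: str
) -> Dict[str, List[List[float]]]:
    """Group rows by profile and sort each profile chronologically."""
    grouped = defaultdict(list)
    for row in rows:
        # Every column other than the id and order columns is treated as a feature.
        features = [float(v) for k, v in row.items() if k not in (id_col, order_col)]
        grouped[row[id_col]].append((int(row[order_col]), features))
    return {
        pid: [f for _, f in sorted(samples, key=lambda s: s[0])]
        for pid, samples in grouped.items()
    }


# In-memory stand-in for a CSV file; the column names and values are invented.
# For a real file, wrap an opened file handle in csv.DictReader instead.
demo = io.StringIO(
    "profile_id,sample_index,m1,m2\n"
    "A,2,1.2,0.4\n"
    "A,1,1.0,0.5\n"
    "B,1,0.8,0.3\n"
)
profiles = group_profiles(csv.DictReader(demo), id_col="profile_id", order_col="sample_index")
print(profiles)
```

Sorting by an explicit order column keeps each profile chronological even if the rows in the file are shuffled.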

---

## Prerequisites

To set up the environment, install dependencies:
```bash
pip install -r requirements.txt
```
---

## Folder Structure

```plaintext
├── classification.py          # Fine-tunes the model for anomaly detection
├── prediction.py              # Fine-tunes the model for sequence prediction
├── contextual_reasoning.py    # Fine-tunes the model to explain detected anomalies
├── requirements.txt           # Python dependencies
├── data
│   ├── training_data.csv      # Dummy training dataset
│   ├── evaluation_data.csv    # Dummy evaluation dataset
│   ├── labels.csv             # Dummy labels
│   └── reasoning_data.csv     # Dummy reasoning dataset
└── img
    └── model.png              # Model architecture diagram
```
---

## Training

To train the STT-LLM model for a specific task:

1. Make sure a GPU with sufficient memory is available.
2. Enter your Hugging Face token in the script.
3. Run the script for the task:
   ```bash
   python3 classification.py   # or prediction.py, or contextual_reasoning.py
   # If `python3` is not on your PATH:
   python classification.py
   ```
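
Step 2 asks for the Hugging Face token to be entered in the script. As an alternative that avoids committing a secret to version control, the token could be read from an environment variable. The helper below is a sketch under that assumption: `read_hf_token` is a hypothetical function that does not exist in the provided scripts, and using it would require editing them to call it.

```python
import os


def read_hf_token(var_name: str = "HF_TOKEN") -> str:
    """Return the token stored in an environment variable, or fail with a clear message."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable before training.")
    return token


# Usage inside a training script (after exporting HF_TOKEN in the shell):
#     token = read_hf_token()
```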
   
