# STT-LLM: Structural-Temporal Tokenization for Adapting LLMs to Longitudinal Clinical Profiles


## Abstract
Large Language Models have shown strong generalization across natural language tasks but remain underexplored for longitudinal clinical profiles. In sports anti-doping, biological profiles are analyzed to support early detection of prohibited substance use and identification of anomalous biological patterns, both of which require joint modeling of temporal dynamics and metabolic relationships. We propose STT-LLM, a structural-temporal tokenization framework that adapts LLMs to longitudinal clinical analysis without modifying their backbone architectures. STT-LLM constructs biologically grounded structural-temporal embeddings and transforms them into LLM-compatible tokens via specialized tokenizers that explicitly encode pathway structure and temporal evolution. We evaluate STT-LLM on real-world longitudinal datasets from athletes, showing consistent improvements over native LLM tokenization strategies in sequence prediction and anomaly detection. In addition, we present a case study where STT-LLM provides contextual reasoning that aligns more closely with expert assessments compared to baseline models. These results highlight tokenization as a key bottleneck and opportunity for adapting LLMs to clinical data.

## Key Features
- **Integration of Structural-Temporal Information**: Uses structural-temporal embeddings to represent both the relationships between metabolites and their evolution over time in clinical time-series data.
- **Anomaly Detection**: Identifies anomalous samples in longitudinal data via binary classification.
- **Sequence Prediction**: Predicts the features of a sample based on the sequence of previous samples.
- **Contextual Reasoning**: When a sample is classified as an anomaly, the model explains why it is anomalous.
---
## Methodology
The STT-LLM architecture integrates:
- **Prompt Decoder**: Extracts task-specific instructions and organizes the clinical time-series data.
- **Structural-Temporal Embeddings**: Capture structural correlations and temporal dependencies.
- **Structural & Temporal Tokenization**: Processes both the structural and temporal dimensions of the data.
---
### Model Architecture
- **Structural Component**: Encodes graph structures to represent structural relationships.
- **Temporal Component**: Employs multi-head self-attention for dynamic temporal dependencies.
- **Structural-Temporal Embedding**: Combines structural and temporal embeddings into a single representation.
  ![Model Architecture](readme/model.png)  
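
The three components above can be illustrated with a minimal, dependency-free sketch. It is not the repository's implementation: `neighbor_mean`, `self_attention`, and `structural_temporal_embedding` are hypothetical names, the attention is single-head with no learned weights (the model itself uses multi-head self-attention), and combining the two views by concatenation is an assumption made for illustration.

```python
import math

def neighbor_mean(features, adjacency):
    """Structural step: average each node's value with its graph neighbours."""
    n = len(features)
    out = []
    for i in range(n):
        # The node itself (self-loop) plus every node it is connected to.
        idx = [i] + [j for j in range(n) if j != i and adjacency[i][j]]
        out.append(sum(features[j] for j in idx) / len(idx))
    return out

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(sequence):
    """Temporal step: single-head scaled dot-product self-attention.

    Queries, keys, and values are the inputs themselves (no learned weights).
    """
    d = len(sequence[0])
    out = []
    for q in sequence:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in sequence]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, sequence)) for c in range(d)])
    return out

def structural_temporal_embedding(profile, adjacency):
    """Concatenate the structural and temporal views of each time step."""
    structural = [neighbor_mean(sample, adjacency) for sample in profile]
    temporal = self_attention(structural)
    return [s + t for s, t in zip(structural, temporal)]

# Toy example: 3 metabolites connected in a chain, 2 samples over time.
chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
embedding = structural_temporal_embedding([[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]], chain)
```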
---
## Datasets

### Real-World Data:
- Includes longitudinal steroid profiles with 6 metabolites per sample.

### Dummy Data:
- Due to data confidentiality, we provide the following dummy data:
  - **Train Dataset**: `dummy_data/training_data.csv`
  - **Evaluation Dataset**: `dummy_data/evaluation_data.csv`
  - **Labels**: `dummy_data/labels.csv`
  - **Reasoning Dataset**: `dummy_data/reasoning_data.csv`
- These files allow testing the STT-LLM pipeline, although results will differ from the real-world experiments.
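
As a rough sketch of how such a CSV could be turned into model inputs for sequence prediction, the snippet below groups rows into one chronologically ordered profile per athlete and builds (history, next-sample) pairs. The column names `athlete_id` and `sample_date`, the helper names, and the assumption that all remaining columns are numeric metabolite values are hypothetical; check the header of the dummy files before adapting it.

```python
import csv
import io
from collections import defaultdict

def load_profiles(csv_file, id_col="athlete_id", time_col="sample_date"):
    """Group rows into one chronologically ordered profile per athlete.

    Column names are assumptions; every other column is read as a float.
    """
    profiles = defaultdict(list)
    for row in csv.DictReader(csv_file):
        values = [float(v) for k, v in row.items() if k not in (id_col, time_col)]
        profiles[row[id_col]].append((row[time_col], values))
    # ISO-formatted dates sort correctly as plain strings.
    return {
        pid: [vals for _, vals in sorted(samples, key=lambda s: s[0])]
        for pid, samples in profiles.items()
    }

def history_target_pairs(profile, history_len=2):
    """Sliding windows: the previous `history_len` samples predict the next one."""
    return [
        (profile[i - history_len:i], profile[i])
        for i in range(history_len, len(profile))
    ]

# In-memory stand-in for a file such as dummy_data/training_data.csv.
demo = io.StringIO(
    "athlete_id,sample_date,m1,m2\n"
    "A,2021-02-01,2.0,0.5\n"
    "A,2021-01-01,1.0,0.4\n"
    "A,2021-03-01,3.0,0.6\n"
)
profiles = load_profiles(demo)
pairs = history_target_pairs(profiles["A"], history_len=2)
```

To read a real file instead, pass an open handle, for example `load_profiles(open("dummy_data/training_data.csv", newline=""))`, after adjusting the column names.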

---

## Prerequisites

To set up the environment, install dependencies:
```bash
pip install -r requirements.txt
```
---

## Folder Structure

```plaintext
├── Classification.py          # Fine-tunes the model for the anomaly detection task
├── Prediction.py              # Fine-tunes the model for the sequence prediction task
├── Contextual_Reasoning.py    # Fine-tunes the model to explain detected anomalies
├── requirements.txt           # Python dependencies
├── dummy_data
│   ├── training_data.csv      # Dummy training dataset
│   ├── evaluation_data.csv    # Dummy evaluation dataset
│   ├── labels.csv             # Dummy labels
│   ├── reasoning_data.csv     # Dummy reasoning dataset
├── readme
│   ├── model.png              # Model architecture diagram
```
---

## Training

To train the STT-LLM model for a specific task:

1. Make sure you have a GPU and enough memory.
2. Enter your Hugging Face token in the script.
3. Run the script:
   ```bash
   python3 Classification.py # Prediction.py or Contextual_Reasoning.py
   # Or
   python Classification.py # Prediction.py or Contextual_Reasoning.py
   ```
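
If you would rather not hardcode the Hugging Face token in a script, one option is to read it from an environment variable. The helper below is a hypothetical sketch, not part of the repository; `get_hf_token` and the choice of `HF_TOKEN` as the variable name are assumptions, and the scripts would need a small edit to call it.

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or fail with a clear message."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the script."
        )
    return token
```

The returned value can then be passed wherever the script expects the token, for example `huggingface_hub.login(token=get_hf_token())` if the `huggingface_hub` package is installed.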
   
