# Clinical Time Series Tokenization Research

This repository contains the implementation and experimental code for the paper "Rethinking Tokenization for Clinical Time Series: When Less is More" (ML4H 2025).

## Overview

This work presents a systematic evaluation of tokenization approaches for clinical time series modeling, comparing Triplet and TextCode tokenization strategies across four clinical prediction tasks using the MIMIC-IV dataset.

## Key Research Contributions

### 1. Triplet Tokenization Ablations
- **Time2Vec Implementation**: Time encoding that combines one linear component with learnable sinusoidal (periodic) components
- **LeTE Implementation**: Learnable time embeddings with Fourier and spline components
- **Component Ablations**: Systematic removal of time and value features to isolate predictive signals
- **Code-only Variants**: Minimal tokenization using only medical codes
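
As a rough illustration of the Time2Vec idea listed above, the sketch below computes the embedding of a single scalar time in plain Python. It is not the code in `triplet_encoder_time2vec.py`: the function name `time2vec`, the embedding size, and the randomly drawn `omega` and `phi` values are assumptions made for illustration, and in a real model these parameters would be learned.

```python
import math
import random


def time2vec(tau, omega, phi):
    """Return a Time2Vec-style embedding of the scalar time `tau`.

    Component 0 is linear: omega[0] * tau + phi[0].
    Components 1..k are periodic: sin(omega[i] * tau + phi[i]).
    """
    if len(omega) != len(phi) or not omega:
        raise ValueError("omega and phi must be non-empty and equally long")
    linear = omega[0] * tau + phi[0]
    periodic = [math.sin(w * tau + p) for w, p in zip(omega[1:], phi[1:])]
    return [linear] + periodic


# Hypothetical parameters (assumption): drawn at random here, learned in practice.
rng = random.Random(0)
dim = 4
omega = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
phi = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

# Embed a time delta of 2.5 (the unit, e.g. hours, is also an assumption).
embedding = time2vec(2.5, omega, phi)
print(embedding)
```

The linear component lets the embedding represent non-periodic progression of time, while the sine components can capture recurring patterns at learned frequencies.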

### 2. TextCode Tokenization Improvements
- **Flexible TextCode Encoder**: Support for both trainable and frozen language model encoders
- **Enhanced Code Mappings**: Complete coverage of medical code descriptions (100%, versus 25% in the baseline)
- **Multi-scale Encoders**: Evaluation across 15M to 600M parameter language models
- **Domain Comparison**: Clinical vs general-domain pretrained encoders
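
One way to make the mapping-coverage figure concrete is sketched below. It assumes coverage means the fraction of distinct codes that have a non-empty text description; the README does not state the exact definition, and the function name `description_coverage` and the example codes are hypothetical.

```python
def description_coverage(codes, code_to_text):
    """Fraction of distinct codes that map to a non-empty description.

    Assumption: coverage is counted over unique codes, not over events.
    """
    unique_codes = set(codes)
    if not unique_codes:
        return 0.0
    covered = sum(
        1 for code in unique_codes if code_to_text.get(code, "").strip()
    )
    return covered / len(unique_codes)


# Hypothetical vocabulary and mapping, for illustration only.
vocabulary = ["CODE_A", "CODE_B", "CODE_C", "CODE_D", "CODE_A"]
partial_mapping = {"CODE_A": "example description", "CODE_B": "   "}

coverage = description_coverage(vocabulary, partial_mapping)
print(f"coverage: {coverage:.0%}")
```

Whitespace-only descriptions are treated as missing, since they would give a text encoder nothing to embed.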

### 3. Experimental Framework
- **Controlled Comparisons**: Systematic variation along mapping coverage, training approach, encoder scale, and domain axes
- **Statistical Rigor**: Paired Wilcoxon tests with Bonferroni correction across 10 random seeds
- **Reproducible Pipeline**: Standardized MEDS-Torch framework with transformer encoders
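
The statistical procedure above can be sketched with the standard library alone. The example below is a minimal illustration rather than the analysis code used for the paper: it computes an exact two-sided Wilcoxon signed-rank p-value by enumerating all sign patterns (feasible for 10 paired seeds) and then applies a Bonferroni correction. All function names and the per-seed scores are placeholders invented for this example.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank_exact(a, b):
    """Exact two-sided p-value for paired samples; zero differences are dropped."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    half_total = sum(ranks) / 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = abs(w_plus - half_total)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    patterns = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        patterns += 1
        if abs(w - half_total) >= observed - 1e-12:
            extreme += 1
    return extreme / patterns


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    return [min(1.0, p * len(p_values)) for p in p_values]


# Placeholder per-seed scores (NOT results from the paper), 10 seeds each.
variant_a = [0.81, 0.83, 0.80, 0.84, 0.82, 0.85, 0.79, 0.83, 0.82, 0.84]
variant_b = [0.80, 0.81, 0.80, 0.82, 0.83, 0.82, 0.78, 0.80, 0.81, 0.81]

raw_p = wilcoxon_signed_rank_exact(variant_a, variant_b)
# Assume, for illustration, that this is one of four task-level comparisons.
adjusted = bonferroni([raw_p, 0.20, 0.04, 0.50])
print(raw_p, adjusted)
```

In practice `scipy.stats.wilcoxon` provides the same test; the enumeration here only keeps the sketch dependency-free.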

## Repository Structure

### Research Code Variants
- `triplet_encoder_time2vec.py` - Time2Vec time encoding implementation
- `triplet_encoder_lete.py` - LeTE (Learnable Time Embeddings) implementation
- `triplet_encoder_code_only.py` - Code-only ablation (no time/value features)
- `triplet_encoder_no_time.py` - No-time ablation variant
- `triplet_encoder_no_value.py` - No-value ablation variant
- `textcode_encoder_flexible.py` - Flexible TextCode encoder with trainable/frozen modes

### Experiment Scripts
- `experiment_baseline_multiseed.sh` - Baseline Triplet experiments
- `experiment_time2vec_multiseed.sh` - Time2Vec experiments
- `experiment_lete.sh` - LeTE experiments
- `experiment_code_only.sh` - Code-only ablation experiments
- `experiment_no_time.sh` - No-time ablation experiments
- `experiment_no_value.sh` - No-value ablation experiments
- `experiment_flexible_textcode.sh` - TextCode optimization experiments

## Key Findings

1. **Time Features**: Explicit time encodings provide no consistent statistically significant benefit across clinical tasks
2. **Value Features**: Show task-dependent importance, affecting mortality but not readmission prediction
3. **Frozen Encoders**: Substantially outperform their trainable counterparts while requiring fewer trainable parameters
4. **Code Information**: Emerges as the most critical predictive signal in clinical time series

## Dataset and Framework

- **Dataset**: MIMIC-IV processed into MEDS format
- **Tasks**: In-hospital mortality, ICU mortality, post-discharge mortality, 30-day readmission
- **Framework**: MEDS-Torch with transformer encoders
- **Evaluation**: AUROC over 10 random seeds, with paired Wilcoxon significance tests and Bonferroni correction
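
For readers unfamiliar with the metric, the sketch below shows one way to compute AUROC and summarize it across seeds. It uses the pairwise-ranking definition of AUROC and is not the evaluation code of MEDS-Torch; the helper names `auroc` and `summarize` and the toy inputs are assumptions for illustration.

```python
from statistics import mean, stdev


def auroc(labels, scores):
    """AUROC as the probability that a random positive example scores
    higher than a random negative one (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize(per_seed_aurocs):
    """Mean and sample standard deviation of AUROC across seeds."""
    return mean(per_seed_aurocs), stdev(per_seed_aurocs)


# Toy labels and predicted risks (illustrative only).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))
```

The double loop is quadratic in the number of examples, which is fine for a demonstration; a rank-based implementation such as `sklearn.metrics.roc_auc_score` is preferable on full test sets.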

## Reproducibility

All experiments use the standardized MEDS-Torch pipeline with consistent hyperparameters and evaluation protocols. The code preserves the original research branches as separate implementations to enable direct comparison of different tokenization approaches.

---

*This research demonstrates that simpler, more parameter-efficient tokenization approaches can achieve competitive performance in clinical time series modeling, challenging assumptions about the necessity of complex temporal and value feature encodings.*