# HMM-GLM Sports

A Python framework for modeling latent performance states in sports using Hidden Markov Models (HMMs) and Generalized Linear Models (GLMs).

## Overview

HMM-GLM Sports analyzes sports performance data with a hybrid model: a Hidden Markov Model tracks an athlete's latent performance state over time, and a Generalized Linear Model predicts outcomes conditional on that state. The approach suits sports analytics because athletes' performances are often temporally dependent and driven by unobserved changes in state.

Key features:

- **Latent State Modeling**: Identify and analyze hidden performance states that drive observed outcomes.
- **Context-Aware Transitions**: Model how game context affects state transitions.
- **Class Imbalance Handling**: Weighting strategies that address class imbalance in sports data, where positive outcomes are often rare.
- **Multimodal Data Integration**: Combine spatiotemporal, biomechanical, and physiological data.
- **Sport-Specific Adjustments**: Special handling for sport-specific factors (e.g., goalie influence in hockey).
- **Model Variants**: Multiple model variants for different analytical needs.
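
To make the hybrid idea concrete, here is a minimal generative sketch. It is not the framework's implementation. It assumes a first-order Markov chain over `K` latent states whose transition probabilities depend on a context vector through a softmax, and a state-specific logistic regression for a binary outcome. All names (`transition_matrix`, `simulate_sequence`, the weight arrays) are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def transition_matrix(context, base_logits, context_weights):
    """Context-dependent transition matrix (assumed parameterization).

    base_logits:     (K, K) logits that apply when the context is all zeros
    context_weights: (K, K, C) shift of each logit per context feature
    context:         (C,) context vector for the current time step
    """
    logits = base_logits + context_weights @ context  # shape (K, K)
    return softmax(logits)

def simulate_sequence(X, contexts, start_probs, base_logits,
                      context_weights, glm_coefs, rng):
    """Sample latent states and binary outcomes for one sequence."""
    T = X.shape[0]
    states = np.empty(T, dtype=int)
    outcomes = np.empty(T, dtype=int)
    for t in range(T):
        if t == 0:
            probs = start_probs
        else:
            A = transition_matrix(contexts[t], base_logits, context_weights)
            probs = A[states[t - 1]]
        states[t] = rng.choice(len(start_probs), p=probs)
        # State-specific logistic GLM: P(y=1 | state k, x) = sigmoid(x . beta_k)
        p_success = 1.0 / (1.0 + np.exp(-X[t] @ glm_coefs[states[t]]))
        outcomes[t] = rng.random() < p_success
    return states, outcomes

rng = np.random.default_rng(0)
K, C, D, T = 3, 2, 4, 50          # states, context dims, features, time steps
X = rng.normal(size=(T, D))
contexts = rng.normal(size=(T, C))
start_probs = np.full(K, 1.0 / K)
base_logits = 2.0 * np.eye(K)     # larger diagonal: states tend to persist
context_weights = rng.normal(scale=0.5, size=(K, K, C))
glm_coefs = rng.normal(size=(K, D))

states, outcomes = simulate_sequence(
    X, contexts, start_probs, base_logits, context_weights, glm_coefs, rng
)
A_example = transition_matrix(contexts[1], base_logits, context_weights)
```

Each row of `A_example` is a probability distribution over next states, so the same previous state can lead to different transition odds in different game contexts.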

## Installation

```bash
# Clone the repository
git clone https://github.com/username/hmm-glm-sports.git
cd hmm-glm-sports

# Create and activate a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install the package and dependencies
pip install -e .
```

You can also install with specific feature sets:

```bash
# Install with data crawler dependencies
pip install -e ".[crawlers]"

# Install with machine learning extensions
pip install -e ".[ml]"

# Install with development tools
pip install -e ".[dev]"

# Install all dependencies
pip install -e ".[all]"
```

### Requirements

The main requirements are:
- Python 3.8+
- NumPy
- Pandas
- SciPy
- scikit-learn
- matplotlib
- seaborn
- hmmlearn
- statsmodels

For a complete list of dependencies, see `requirements.txt`.

## Quick Start

```python
from src.core.hmm_glm import CategoricalHMMComponent, LogisticGLMComponent, HMMGLMModel
from src.evaluation import evaluate_model, plot_confusion_matrix, compare_with_baseline

# Create HMM-GLM model
hmm_comp = CategoricalHMMComponent(n_states=3, n_categories=2)
glm_comp = LogisticGLMComponent()
model = HMMGLMModel(hmm_component=hmm_comp, glm_component=glm_comp)

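# Fit the model. X_train and y_train are the feature matrix and binary outcomes;
# sequences_train holds one sequence ID per row (prepare these from your own data).
model.fit(X_train, y_train, sequences=sequences_train)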

# Evaluate the model
metrics = evaluate_model(model, X_test, y_test, sequences=sequences_test)
print(metrics)  # {'accuracy': 0.85, 'auc': 0.92, ...}

# Visualize results
import matplotlib.pyplot as plt
y_pred = model.predict(X_test, sequences=sequences_test)  # class labels, not probabilities
fig, ax = plt.subplots(figsize=(8, 6))
plot_confusion_matrix(y_test, y_pred, ax=ax)
plt.show()

# Compare with baseline model
comparison = compare_with_baseline(model, X_test, y_test, sequences=sequences_test)
print(comparison)
```
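
The Quick Start assumes that `X_train`, `y_train`, `sequences_train`, and their test counterparts already exist. For a dry run without real data, a sketch like the following builds synthetic placeholders. It assumes `sequences` is a per-row array of sequence IDs, as in the converted-data example later in this README; the helper name `make_synthetic_data` is hypothetical.

```python
import numpy as np

def make_synthetic_data(n_sequences, seq_len, n_features, seed=0):
    """Random features, binary outcomes, and per-row sequence IDs."""
    rng = np.random.default_rng(seed)
    n_rows = n_sequences * seq_len
    X = rng.normal(size=(n_rows, n_features))
    # The outcome depends weakly on the first feature so there is some signal
    p = 1.0 / (1.0 + np.exp(-X[:, 0]))
    y = (rng.random(n_rows) < p).astype(int)
    # Rows of the same sequence are contiguous and share one ID
    sequences = np.repeat(np.arange(n_sequences), seq_len)
    return X, y, sequences

X_train, y_train, sequences_train = make_synthetic_data(80, 10, 5, seed=0)
X_test, y_test, sequences_test = make_synthetic_data(20, 10, 5, seed=1)
```

The synthetic outcomes carry no latent-state structure, so this only checks that the pipeline runs end to end; it says nothing about model quality.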

## Examples

The `examples` directory contains several example scripts demonstrating different aspects of the framework:

- `basic_usage.py`: Basic usage of the HMM-GLM framework.
- `multimodal_integration.py`: Integrating multiple data modalities.
- `context_aware_transitions.py`: Using context-aware transition matrices.
- `class_imbalance.py`: Handling class imbalance with various weighting strategies.
- `nhl_adjustments.py`: Applying NHL-specific adjustments for goalie influence.
- `model_evaluation.py`: Comprehensive model evaluation and comparison.

## Project Structure

```
hmm-glm-sports/
├── src/
│   ├── core/
│   │   ├── context_transitions/  # Context-aware transition matrices
│   │   ├── weighting/            # Class imbalance handling
│   │   └── hmm_glm/              # Core HMM-GLM implementation
│   ├── data/
│   │   ├── crawlers/             # Data collection from sports APIs
│   │   ├── converters/           # Data conversion for HMM-GLM
│   │   └── ...                   # Data loading and preprocessing
│   ├── evaluation/               # Model evaluation tools
│   │   ├── metrics.py            # Performance metrics
│   │   ├── visualization.py      # Result visualization
│   │   └── comparison.py         # Model comparison utilities
│   ├── features/                 # Feature engineering
│   └── models/                   # Model variants
├── examples/                     # Example scripts
├── experiments/                  # Sport-specific experiments
│   ├── mlb/
│   ├── nba/
│   └── nhl/
└── docs/                         # Documentation
```

## Model Evaluation

The framework provides comprehensive tools for evaluating and comparing HMM-GLM models:

### Performance Metrics

```python
from src.evaluation import evaluate_model, calculate_per_state_metrics

# Calculate overall metrics
metrics = evaluate_model(model, X, y, sequences)
print(f"Accuracy: {metrics['accuracy']:.3f}")
print(f"AUC: {metrics['auc']:.3f}")
print(f"Brier Score: {metrics['brier_score']:.3f}")
print(f"State Diversity: {metrics['state_diversity']:.3f}")

# Calculate per-state metrics
state_metrics = calculate_per_state_metrics(model, X, y, sequences)
print(state_metrics)
```
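
For readers unfamiliar with two of the metrics printed above, the sketch below shows how they could be computed by hand. The Brier score follows its standard definition. This README does not define `state_diversity`, so the normalized-entropy version here is only one plausible reading and may differ from what `evaluate_model` reports. Both function names are hypothetical.

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

def normalized_state_entropy(states, n_states):
    """Entropy of state occupancy scaled to [0, 1] (ASSUMED definition)."""
    counts = np.bincount(np.asarray(states), minlength=n_states)
    freqs = counts / counts.sum()
    nonzero = freqs[freqs > 0]            # skip empty states to avoid log(0)
    entropy = -np.sum(nonzero * np.log(nonzero))
    return float(entropy / np.log(n_states))

y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4]
score = brier_score(y_true, y_prob)

balanced = normalized_state_entropy([0, 1, 2, 0, 1, 2], n_states=3)
collapsed = normalized_state_entropy([0, 0, 0, 0], n_states=3)
```

Lower Brier scores indicate better-calibrated probabilities. Under the assumed entropy definition, a value near 1 means the decoded states are used evenly and a value near 0 means the model has collapsed onto a single state.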

### Visualization

```python
from src.evaluation import (
    plot_confusion_matrix, plot_roc_curve, 
    plot_state_transitions, plot_feature_importance
)

# Plot confusion matrix
plot_confusion_matrix(y, model.predict(X, sequences))

# Plot ROC curve
plot_roc_curve(y, model.predict_proba(X, sequences))

# Plot state transitions
plot_state_transitions(model.hmm_component.model.transmat_)

# Plot feature importance by state
feature_names = [f"Feature_{i}" for i in range(X.shape[1])]
plot_feature_importance(model, feature_names)
```

### Model Comparison

```python
from src.evaluation import compare_models, compare_with_baseline, statistical_significance_test

# Compare multiple models
models = [model1, model2, model3]
model_names = ["HMM-GLM (3 states)", "HMM-GLM (5 states)", "HMM-GLM (Dynamic)"]
comparison = compare_models(models, model_names, X, y, sequences)
print(comparison)

# Compare with baseline
baseline_comparison = compare_with_baseline(model, X, y, sequences)
print(baseline_comparison)

# Statistical significance test
from sklearn.linear_model import LogisticRegression
baseline = LogisticRegression().fit(X, y)
test_results = statistical_significance_test(model, baseline, X, y, sequences)
print(f"p-value: {test_results['p_value']:.4f}")
print(f"Better model: {test_results['better_model']}")
```
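
This README does not say which test `statistical_significance_test` runs. As an illustration of what a paired comparison of two classifiers can look like, the sketch below implements an exact two-sided McNemar test using only the standard library. The function name `mcnemar_exact` is hypothetical, and the framework may use a different test.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Only discordant pairs matter: cases where exactly one model is correct.
    Under the null hypothesis, each discordant pair is equally likely to
    favor either model, so the count follows Binomial(n, 0.5).
    """
    triples = list(zip(y_true, pred_a, pred_b))
    a_only = sum(1 for y, a, b in triples if a == y and b != y)
    b_only = sum(1 for y, a, b in triples if a != y and b == y)
    n = a_only + b_only
    if n == 0:
        return 1.0, a_only, b_only        # the models never disagree
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2.0 * tail)
    return p_value, a_only, b_only

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
pred_a = list(y_true)                      # model A is right everywhere
pred_b = [0, 1, 0, 0, 1, 1, 0, 0, 1, 1]    # model B misses the first five
p_value, a_only, b_only = mcnemar_exact(y_true, pred_a, pred_b)
```

In this toy case there are five discordant pairs, all favoring model A, which gives a p-value of 0.0625. McNemar's test treats observations as independent, which events inside the same sequence are not, so results on sequential data should be read with caution and computed on held-out data.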

## Documentation

For detailed documentation, please refer to the `docs` directory or the docstrings in the code.

## Data Collection

The framework includes utilities for collecting play-by-play data from various sports leagues:

### MLB Data Collection

```python
from src.data.crawlers import MLBCrawler

# Initialize the crawler
mlb_crawler = MLBCrawler(output_dir="data/mlb", log_file="logs/mlb_crawler.log")

# Crawl data for a specific season and teams
mlb_crawler.crawl_season(
    year=2022,
    start_date="2022-04-07",  # 2022 Opening Day
    end_date="2022-10-05",    # Regular season end
    teams=["NYY", "BOS", "LAD", "HOU"]  # Specific teams (optional)
)

# Merge the collected data
merged_data = mlb_crawler.merge_season_data(
    year=2022,
    output_file="data/mlb/mlb_2022_merged.csv"
)
```

### NBA Data Collection

```python
from src.data.crawlers import NBACrawler

# Initialize the crawler
nba_crawler = NBACrawler(output_dir="data/nba", log_file="logs/nba_crawler.log")

# Crawl data for a specific season and teams
nba_crawler.crawl_season(
    season="2021-22",
    start_date="2021-10-19",  # 2021-22 Opening night
    end_date="2022-04-10",    # Regular season end
    teams=["LAL", "GSW", "BOS", "MIA"],  # Specific teams (optional)
    include_tracking=True  # Include player tracking data
)

# Merge the collected data
merged_data = nba_crawler.merge_season_data(
    season="2021-22",
    output_file="data/nba/nba_2021_22_merged.csv"
)
```

### NHL Data Collection

```python
from src.data.crawlers import NHLCrawler

# Initialize the crawler
nhl_crawler = NHLCrawler(output_dir="data/nhl", log_file="logs/nhl_crawler.log")

# Crawl data for a specific season and teams
nhl_crawler.crawl_season(
    year=2022,  # 2021-22 season
    start_date="2021-10-12",  # 2021-22 Opening night
    end_date="2022-04-29",    # Regular season end
    teams=["TOR", "MTL", "NYR", "EDM"]  # Specific teams (optional)
)

# Merge the collected data (shots only)
shots_data = nhl_crawler.merge_season_data(
    year=2022,
    output_file="data/nhl/nhl_2021_22_shots.csv"
)
```

### Data Sources

- **MLB**: Baseball Savant (Statcast) and MLB GameDay API
- **NBA**: NBA Stats API and Basketball-Reference
- **NHL**: NHL Stats API and Hockey-Reference

> **Note**: These crawlers are provided for research purposes only. Please respect the terms of service of each data provider and implement appropriate rate limiting when collecting data.
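
As a sketch of what rate limiting can look like, the class below enforces a minimum interval between successive requests using only the standard library. Whether the crawlers in this repository accept such a limiter is not documented here; `MinIntervalLimiter` and `fetch_all` are hypothetical names, and the demo uses a stand-in fetch function instead of real network calls.

```python
import time

class MinIntervalLimiter:
    """Block so that successive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()

def fetch_all(urls, fetch, limiter):
    """Call `fetch(url)` for each URL, pausing between requests."""
    results = []
    for url in urls:
        limiter.wait()
        results.append(fetch(url))
    return results

# Demo with a stand-in fetch function (no network access)
limiter = MinIntervalLimiter(min_interval=0.05)
start = time.monotonic()
pages = fetch_all(["a", "b", "c"], fetch=lambda u: u.upper(), limiter=limiter)
elapsed = time.monotonic() - start
```

For real crawling, pick the interval from each provider's published limits and consider adding retries with backoff for failed requests.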

### Data Conversion

After collecting data using the crawlers, you need to convert it to a format compatible with the HMM-GLM framework:

```python
from src.data import load_and_convert_crawled_data

# Convert NHL data
nhl_df, nhl_metadata = load_and_convert_crawled_data(
    data_path="data/nhl",
    sport="nhl",
    output_path="data/nhl/nhl_converted.csv",
    min_sequence_length=3
)

# Convert NBA data
nba_df, nba_metadata = load_and_convert_crawled_data(
    data_path="data/nba",
    sport="nba",
    output_path="data/nba/nba_converted.csv",
    min_sequence_length=3
)

# Convert MLB data
mlb_df, mlb_metadata = load_and_convert_crawled_data(
    data_path="data/mlb",
    sport="mlb",
    output_path="data/mlb/mlb_converted.csv",
    min_sequence_length=3
)

# Print conversion statistics
print(f"NHL: Converted {nhl_metadata['n_original_rows']} rows into {nhl_metadata['n_converted_rows']} rows")
print(f"NHL: Created {nhl_metadata['n_sequences']} sequences")

print(f"NBA: Converted {nba_metadata['n_original_rows']} rows into {nba_metadata['n_converted_rows']} rows")
print(f"NBA: Created {nba_metadata['n_sequences']} sequences")

print(f"MLB: Converted {mlb_metadata['n_original_rows']} rows into {mlb_metadata['n_converted_rows']} rows")
print(f"MLB: Created {mlb_metadata['n_sequences']} sequences")
```

The conversion process:

1. **Loads raw data** from the specified directory
2. **Preprocesses** the data based on sport-specific requirements
3. **Engineers features** relevant for HMM-GLM modeling
4. **Creates sequences** by grouping events (e.g., by player and game)
5. **Formats** the data for direct use with HMM-GLM models

The converted data includes:
- **Sequence IDs**: For grouping related events
- **Sequence positions**: For maintaining temporal order
- **Engineered features**: Context-aware features for improved modeling
- **Binary outcomes**: Target variables for prediction
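
The sequence-creation step can be sketched as follows. This is a dependency-free illustration on plain dictionaries, not the converter's actual code. The column names `player_id`, `game_id`, `event_time`, and `sequence_pos` are assumptions; `sequence_id` and `is_goal` match names used elsewhere in this README.

```python
from collections import defaultdict

def build_sequences(events, min_sequence_length=3):
    """Group events by (player_id, game_id), order them, and drop short groups.

    Returns new event dicts with `sequence_id` and `sequence_pos` added.
    """
    groups = defaultdict(list)
    for event in events:
        groups[(event["player_id"], event["game_id"])].append(event)

    converted = []
    next_id = 0
    for key in sorted(groups):
        group = sorted(groups[key], key=lambda e: e["event_time"])
        if len(group) < min_sequence_length:
            continue  # too few events to say anything about state dynamics
        for pos, event in enumerate(group):
            converted.append({**event, "sequence_id": next_id, "sequence_pos": pos})
        next_id += 1
    return converted

events = [
    {"player_id": 1, "game_id": 10, "event_time": 30, "is_goal": 0},
    {"player_id": 2, "game_id": 10, "event_time": 12, "is_goal": 0},
    {"player_id": 1, "game_id": 10, "event_time": 5, "is_goal": 1},
    {"player_id": 1, "game_id": 10, "event_time": 48, "is_goal": 0},
    {"player_id": 2, "game_id": 10, "event_time": 40, "is_goal": 1},
]
rows = build_sequences(events, min_sequence_length=3)
```

Player 1 has three events and becomes one sequence ordered by time; player 2 has only two events and is dropped, which mirrors the effect of `min_sequence_length=3` in the conversion calls above.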

### Using Converted Data with HMM-GLM

```python
from src.core.hmm_glm import CategoricalHMMComponent, LogisticGLMComponent, HMMGLMModel
from src.data import load_crawled_data

# Load already converted data
df = load_crawled_data(
    data_path="data/nhl/nhl_converted.csv",
    sport="nhl",
    convert=False  # Data is already converted
)

# Extract features, outcomes, and sequences
feature_cols = [col for col in df.columns if col.startswith('feature_')]
X = df[feature_cols].values
y = df['is_goal'].values
sequences = df['sequence_id'].values
# Rows per sequence, in sorted sequence_id order (not used below; handy for sanity checks)
sequence_lengths = df.groupby('sequence_id').size().values

# Create and fit HMM-GLM model
hmm_comp = CategoricalHMMComponent(n_states=3, n_categories=2)  # same keyword as in Quick Start
glm_comp = LogisticGLMComponent()
model = HMMGLMModel(hmm_component=hmm_comp, glm_component=glm_comp)
model.fit(X, y, sequences=sequences)

# Make predictions
y_pred = model.predict(X, sequences=sequences)
```
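
The example above fits and predicts on the same rows. To evaluate on held-out data without leaking events from one sequence into both sets, the split should be made on sequence IDs, not on individual rows. The following is a minimal sketch of such a split; `split_by_sequence` is a hypothetical helper, not part of the framework.

```python
import numpy as np

def split_by_sequence(sequences, test_fraction=0.2, seed=0):
    """Return boolean train/test row masks that never split a sequence."""
    rng = np.random.default_rng(seed)
    unique_ids = np.unique(sequences)
    shuffled = rng.permutation(unique_ids)
    n_test = max(1, int(round(test_fraction * len(unique_ids))))
    test_ids = shuffled[:n_test]
    test_mask = np.isin(sequences, test_ids)
    return ~test_mask, test_mask

demo_sequences = np.repeat(np.arange(10), 4)   # 10 sequences of 4 rows each
train_mask, test_mask = split_by_sequence(demo_sequences, test_fraction=0.2)

train_ids = set(demo_sequences[train_mask].tolist())
test_ids = set(demo_sequences[test_mask].tolist())
```

With real data, the masks index all three arrays in the same way, for example `X[train_mask]`, `y[train_mask]`, and `sequences[train_mask]`.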

## Citation

If you use this framework in your research, please cite:

```
@inproceedings{anonymous2025from,
  title={From On-Field Actions to Internal States: A Latent Variable Framework for Analyzing Athlete Performance},
  author={Anonymous, A.},
  year={2025},
  booktitle={Proceedings of AI Agents for Science}
}
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.

