## ELSA: Embedding Local Spatial Autocorrelation

This repository contains a Python implementation of the Embedding Local Spatial Autocorrelation (ELSA) metric, as proposed in the paper "ELSA: LOCAL SPATIAL AUTOCORRELATION OF EMBEDDINGS" (anonymous ICLR 2026 submission).

ELSA adapts the classic Local Moran's I statistic for use with high-dimensional embedding vectors, which are common data structures in modern AI and machine learning. Instead of analyzing a single scalar value per location, ELSA measures the spatial autocorrelation of complex data types like images or text that have been converted into embeddings.

This implementation is written as a Python class and is inspired by the design of spatial statistics tools found in the PySAL ecosystem, particularly `esda.moran`. We also provide an example notebook in `example.ipynb` as well as links to our experimental datasets.

### Files

* `elsa.py`: Contains the `ELSA` class, which is the core of this implementation.

* `example.ipynb`: An example notebook that walks through the workflow.

### Features

* Calculates the ELSA statistic for each location.

* Uses cosine similarity to measure the relationship between an embedding and the global average embedding.

* Identifies spatial clusters (hotspots, coldspots) and spatial outliers.

* Assesses statistical significance through a conditional permutation test, generating pseudo-p-values.
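
The features above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `elsa.py`: it assumes the statistic takes the Local Moran's I form (the standardized value at a location multiplied by the weighted average of its neighbors' values), applied to cosine similarities with the global mean embedding. The function name `elsa_sketch` and the dense weights matrix `W` are hypothetical, and the actual implementation may use different scaling.

```python
import numpy as np

def elsa_sketch(x, W):
    """Hypothetical sketch of a Local Moran-style statistic on embeddings.

    x: (n, d) array of embeddings.
    W: (n, n) dense, row-standardized weights with a zero diagonal (assumed).
    """
    mean_emb = x.mean(axis=0)
    # Cosine similarity between each embedding and the global mean embedding
    cos = (x @ mean_emb) / (np.linalg.norm(x, axis=1) * np.linalg.norm(mean_emb))
    # Standardize the similarities to zero mean and unit variance
    z = (cos - cos.mean()) / cos.std()
    # Spatial lag: weighted average of the neighbors' standardized similarities
    lag = W @ z
    # Local statistic: positive when a location and its neighbors deviate the same way
    return z, z * lag

# Toy data: 9 locations on a 3x3 grid with hand-built Rook weights
rng = np.random.default_rng(0)
x = rng.random((9, 4))
side = 3
W = np.zeros((9, 9))
for i in range(9):
    r, c = divmod(i, side)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < side and 0 <= cc < side:
            W[i, rr * side + cc] = 1.0
W /= W.sum(axis=1, keepdims=True)  # row-standardize

z, e = elsa_sketch(x, W)
print(e.round(3))
```

Under this assumed form, the sign of `z` and the sign of the spatial lag together determine which quadrant a location falls into.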

### Installation

To use this module, you'll need a few common scientific computing libraries.

```bash
pip install numpy pandas libpysal scikit-learn
```

### How to Use

The main component is the `ELSA` class. Initialize it with an array of embeddings and a PySAL spatial weights object (`libpysal.weights.W`).

#### Example

Here is a simple example using randomly generated data to demonstrate the workflow.

```python
import numpy as np
import pandas as pd
from libpysal.weights import lat2W
from elsa import ELSA

# 1. Generate some dummy data
# Let's imagine we have 100 locations in a 10x10 grid
# Each location has a 64-dimensional embedding vector
n_observations = 100
embedding_dim = 64
np.random.seed(42)
embeddings = np.random.rand(n_observations, embedding_dim)
# Normalize embeddings
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Create (row, column) grid coordinates for each location; the DataFrame holds the results later
coords = np.array([(i // 10, i % 10) for i in range(n_observations)])
df = pd.DataFrame(coords, columns=['lat', 'lon'])

# 2. Create a spatial weights matrix (W)
# Here we'll use a simple Queen contiguity matrix for a regular grid.
# lat2W expects (nrows, ncols) and defaults to Rook contiguity,
# so pass rook=False to get Queen contiguity.
w = lat2W(10, 10, rook=False)  # 10x10 grid = 100 observations
w.transform = 'r' # Row-standardize the weights

# 3. Compute ELSA
elsa_results = ELSA(embeddings, w, permutations=999)

# 4. Inspect the results
print("ELSA statistics (first 5):")
print(elsa_results.e[:5])

print("\nStandardized cosine similarities (z) (first 5):")
print(elsa_results.z[:5])

print("\nPseudo p-values (first 5):")
print(elsa_results.p_sim[:5])

print("\nQuadrant classifications (first 5):")
print(elsa_results.q[:5])
# Quadrant meanings:
# 1: High-High (hotspot)
# 2: Low-Low (coldspot)
# 3: Low-High (outlier)
# 4: High-Low (outlier)

# You can add the results to your original dataframe for analysis or plotting
df['elsa_e'] = elsa_results.e
df['elsa_p_sim'] = elsa_results.p_sim
df['elsa_q'] = elsa_results.q

print("\nDataFrame with results (first 5 rows):")
print(df.head())
```

You can also directly run the example in `example.ipynb` to see how it works.

### Understanding the ELSA Class

#### Initialization

`elsa.ELSA(x, w, permutations=999)`:

* `x`: A `numpy.ndarray` of shape `(n, d)`, where n is the number of locations and d is the embedding dimension.

* `w`: A `libpysal.weights.W` object representing the spatial relationships between locations.

* `permutations`: The number of permutations to run for the significance test. If 0, no p-values are calculated.
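
To illustrate what the `permutations` argument controls, here is a minimal sketch of a conditional permutation test. It is an assumption-laden illustration rather than the code in `elsa.py`: it assumes a Local Moran-style statistic `z_i * sum_j(w_ij * z_j)`, a dense weights matrix, and the folded one-sided pseudo-p-value convention used by PySAL's local statistics. The function name `conditional_pseudo_p` is hypothetical.

```python
import numpy as np

def conditional_pseudo_p(z, W, permutations=999, seed=0):
    """Hypothetical sketch of a conditional permutation test.

    For each location i, z[i] stays fixed while its neighbors' values are
    redrawn at random (without replacement) from the other n - 1 locations.
    """
    rng = np.random.default_rng(seed)
    n = z.shape[0]
    observed = z * (W @ z)
    p_sim = np.empty(n)
    for i in range(n):
        nbrs = np.flatnonzero(W[i])   # indices of i's neighbors
        w_i = W[i, nbrs]              # their weights
        others = np.delete(z, i)      # all values except z[i]
        sims = np.empty(permutations)
        for k in range(permutations):
            draw = rng.choice(others, size=nbrs.size, replace=False)
            sims[k] = z[i] * (w_i @ draw)
        # Count simulated statistics at least as large as the observed one,
        # then fold so the smaller tail is used
        larger = np.sum(sims >= observed[i])
        larger = min(larger, permutations - larger)
        p_sim[i] = (larger + 1.0) / (permutations + 1.0)
    return p_sim

# Toy data: 20 locations on a ring, each with two equally weighted neighbors
n = 20
z = np.random.default_rng(1).standard_normal(n)
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

p = conditional_pseudo_p(z, W, permutations=99)
print(p.round(2))
```

With this convention the smallest attainable pseudo p-value is `1 / (permutations + 1)`, so more permutations give finer resolution at the cost of runtime.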

#### Key Attributes

After initialization, the `ELSA` object will have several useful attributes:

* `e`: An array of the ELSA statistics for each location.

    * Strongly positive values indicate an embedding is surrounded by similar embeddings (part of a homogeneous cluster).

    * Strongly negative values indicate an embedding is surrounded by dissimilar embeddings (a spatial outlier).

    * Values near zero indicate no significant local spatial pattern.

* `z`: The standardized cosine similarity of each embedding to the global mean embedding. A positive value (`z > 0`) means the observation is more similar to the average embedding than a typical location is; a negative value (`z < 0`) means it is less similar.

* `p_sim`: The pseudo p-value for each ELSA statistic, obtained from the conditional permutation test. A low p-value (e.g., `< 0.05`) suggests that the observed local pattern is unlikely to arise if the embeddings were arranged randomly in space.

* `q`: An array classifying each location into one of four quadrants, which helps in identifying hotspots, coldspots, and spatial outliers.
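
A common next step is to combine `p_sim` and `q` so that only statistically significant locations receive a cluster label. The sketch below shows one way to do this with pandas. The arrays are made-up stand-ins for `elsa_results.e`, `elsa_results.p_sim`, and `elsa_results.q`, the helper `summarize` and the threshold `alpha` are hypothetical, and the quadrant codes follow the comments in the example above.

```python
import numpy as np
import pandas as pd

# Quadrant codes as listed in the example above
QUADRANT_LABELS = {
    1: "High-High (hotspot)",
    2: "Low-Low (coldspot)",
    3: "Low-High (outlier)",
    4: "High-Low (outlier)",
}

def summarize(e, p_sim, q, alpha=0.05):
    """Label each location by quadrant, keeping only significant ones."""
    out = pd.DataFrame({"e": e, "p_sim": p_sim, "q": q})
    out["significant"] = out["p_sim"] < alpha
    # Keep the quadrant label where significant, otherwise mark as not significant
    out["label"] = out["q"].map(QUADRANT_LABELS).where(
        out["significant"], "not significant"
    )
    return out

# Made-up values standing in for the attributes of an ELSA result
e = np.array([1.2, -0.8, 0.05, 0.9])
p_sim = np.array([0.01, 0.03, 0.40, 0.20])
q = np.array([1, 4, 2, 1])

summary = summarize(e, p_sim, q)
print(summary)
```

Because one test is run per location, a stricter `alpha` or a multiple-testing correction may be appropriate when there are many locations.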

### Datasets

The experiments in the paper use the following datasets:

#### Labeled datasets

* Population, Elevation, Forest cover and Nightlights datasets are accessed via the `torchspatial` benchmark. They can be found at: [https://github.com/seai-lab/TorchSpatial](https://github.com/seai-lab/TorchSpatial)

#### Unlabeled datasets

* The Im2GPS3k and YFCC4K datasets are both available at: [https://github.com/lugiavn/revisiting-im2gps](https://github.com/lugiavn/revisiting-im2gps)