# Guaranteed Non-local (GNL) Cumulene Dataset [v0.1]

## Description:
Datasets designed to test non-local effects often yield unexpectedly high accuracy when evaluated with local models, complicating the assessment of model non-locality. The dataset introduced here is based on cumulenes, whose non-local ground-state energy precludes accurate representation by local models. The training set contains geometry-optimized cumulenes with 3-10, 13, and 14 carbon atoms, which are then rattled and rotated at various angles. The test set contains cumulenes created in a similar fashion with the same numbers of carbons (In Domain), as well as cumulenes of unseen lengths that are not present in the training set (Out Domain: 11, 12 and 15, 16 carbons). This tests both how well a model captures the local and non-local complexity of cumulene chains and how well it extrapolates to longer, unseen chains.

Cumulenes are chains of double-bonded carbon atoms terminated with two hydrogen atoms at each end. They exhibit pronounced non-local behavior as a result of strong electron delocalization. Small changes in chain length and in the relative angle between the terminating hydrogen atoms can result in large changes in the energy of the system, as visually represented in Figure~\ref{fig:cumulenes-combined}. These structures are known to illustrate the limited expressivity of local models~\citep{unke2021machine} and are similar to the k-chains introduced by~\cite{joshi2023expressive} in the context of the geometric WL test. The So3krates framework~\citep{frank2022so3krates} used global attention to capture the angular trends of cumulenes with fixed length. This dataset instead tests learning on cumulenes of multiple lengths and extrapolation to longer chains, capturing length and angle trends simultaneously.
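
The relative angle between the terminating hydrogen atoms mentioned above can be quantified as a dihedral angle H-C...C-H across the chain. The following is a minimal sketch in plain Python, not the procedure used to build the dataset. The function name `dihedral_deg` and the toy coordinates are illustrative assumptions, and the atom ordering inside the `.xyz` files is not specified here, so the indices of the terminal atoms would have to be determined by the user.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for the four points p0-p1-p2-p3."""
    b0 = _sub(p0, p1)          # from first terminal carbon to its hydrogen
    b1 = _sub(p2, p1)          # chain axis between the two terminal carbons
    b2 = _sub(p3, p2)          # from second terminal carbon to its hydrogen
    norm = math.sqrt(_dot(b1, b1))
    axis = tuple(x / norm for x in b1)
    # Project both C-H vectors onto the plane perpendicular to the chain axis
    v = _sub(b0, tuple(_dot(b0, axis) * x for x in axis))
    w = _sub(b2, tuple(_dot(b2, axis) * x for x in axis))
    return math.degrees(math.atan2(_dot(_cross(axis, v), w), _dot(v, w)))


# Toy coordinates (hypothetical, not taken from the dataset):
# one hydrogen and one terminal carbon at each end of a straight chain.
h_a, c_a = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)
c_b, h_b = (0.0, 0.0, 2.6), (0.0, 1.0, 2.6)
angle = dihedral_deg(h_a, c_a, c_b, h_b)  # hydrogens twisted by a quarter turn
```

When working with ASE `Atoms` objects, `Atoms.get_dihedral` provides the same kind of measurement once the four atom indices are known.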

## Content:
1. Training (200)
2. Validation (50)
3. Test (170)

## Test set:
1. In domain: n_c = 3-10,13,14 
2. Out domain (S): n_c = 11,12
3. Out domain (L): n_c = 15,16

These subsets contain 50, 60, and 60 configurations, respectively.

## Training and Comparison:
For comparison purposes, any local MPNN part of a proposed architecture should be restricted to a 6 Å receptive field (e.g. two message-passing layers with a 3 Å cutoff). This dataset constitutes a chemically relevant toy system for benchmarking and comparing different approaches to including non-local effects.
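
The receptive-field restriction can be checked with simple arithmetic: the receptive field of a local MPNN is roughly the number of message-passing layers times the cutoff radius. The sketch below also compares that budget with an estimated carbon backbone length. The carbon-carbon spacing of 1.3 Å is an approximate value assumed here for illustration, not a number taken from the dataset, and the function names are hypothetical.

```python
MAX_RECEPTIVE_FIELD = 6.0   # Angstrom, the limit stated above
CC_SPACING = 1.3            # Angstrom, ASSUMED approximate C=C distance


def receptive_field(n_layers, cutoff):
    """Approximate receptive field of a local MPNN in Angstrom."""
    return n_layers * cutoff


def backbone_length(n_c, spacing=CC_SPACING):
    """Estimated end-to-end carbon distance of a straight chain of n_c carbons."""
    return (n_c - 1) * spacing


def ends_within_reach(n_c, field=MAX_RECEPTIVE_FIELD):
    """True if the two terminal carbons are closer than the receptive field."""
    return backbone_length(n_c) < field


field = receptive_field(n_layers=2, cutoff=3.0)   # the example setup above
short_chain_ok = ends_within_reach(5)             # about 5.2 Angstrom
long_chain_ok = ends_within_reach(6)              # about 6.5 Angstrom
```

Under this assumed spacing, a local model within the 6 Å budget would not connect the two ends of chains with roughly six or more carbons, which is the regime where the relative orientation of the terminal hydrogens can only be captured non-locally.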

## Format:
The data is saved as a `.xyz` text file, following the extended xyz format. It can be read easily with the ASE (Atomic Simulation Environment) Python package, as well as with many DFT codes and visualisers.

The energy and forces can be found under the `energy` and `forces` keys. The same values are also stored under `dft_energy` and `dft_forces`, which identify the ground-truth labels explicitly so that they cannot easily be overwritten by ASE. The values in `dft_energy` and `dft_forces` are identical to those in `energy` and `forces`.

To access the data, use for example:

```python
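from ase.io import read

# ':' reads every frame in the file and returns a list of ase.Atoms objects
traj = read('gnl-v0.1-train.xyz', ':')

# Per-configuration energies and per-atom forces; identical values are
# stored under the 'dft_energy' and 'dft_forces' keys
energy = [at.info['energy'] for at in traj]
forces = [at.arrays['forces'] for at in traj]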
```
