# Aggregator Fusion Kernels

This directory contains implementations of the aggregator fusion kernels for CPU and GPU (CUDA).
For now we cover only the inference-only case with 1 head and 1 base, although further code may be added at a later date.
In principle this can be generalised to training time (on the forward pass at least). However, this would require `torch_sparse` (and `torch_scatter`) to be re-architected, as they currently use template specialization for each type of aggregator and implement the autograd functions separately.
This is something to be discussed with the library maintainers.
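
As a rough illustration of the idea behind aggregator fusion, the sketch below computes several aggregators in a single pass over a CSR adjacency structure, so each neighbour's features are read once rather than once per aggregator. It is a plain-Python sketch and not the code in this directory: the function name `fused_aggregate`, the choice of sum/mean/max/min, and the convention of returning zeros for isolated nodes are all assumptions made for the example.

```python
import math


def fused_aggregate(rowptr, col, x):
    """Compute sum, mean, max and min over each node's neighbours in one pass.

    rowptr, col: CSR adjacency (rowptr has num_nodes + 1 entries).
    x: list of per-node feature vectors (lists of floats), all the same length.
    Hypothetical helper for illustration; not part of torch_sparse/torch_scatter.
    """
    num_nodes = len(rowptr) - 1
    dim = len(x[0]) if x else 0
    out = {
        name: [[0.0] * dim for _ in range(num_nodes)]
        for name in ("sum", "mean", "max", "min")
    }
    for i in range(num_nodes):
        start, end = rowptr[i], rowptr[i + 1]
        degree = end - start
        if degree == 0:
            continue  # assumption: isolated nodes keep all-zero aggregates
        acc_sum = [0.0] * dim
        acc_max = [-math.inf] * dim
        acc_min = [math.inf] * dim
        for e in range(start, end):
            row = x[col[e]]  # each neighbour row is loaded once per edge
            for f in range(dim):
                v = row[f]
                acc_sum[f] += v
                if v > acc_max[f]:
                    acc_max[f] = v
                if v < acc_min[f]:
                    acc_min[f] = v
        out["sum"][i] = acc_sum
        out["mean"][i] = [s / degree for s in acc_sum]
        out["max"][i] = acc_max
        out["min"][i] = acc_min
    return out


# Usage: node 0 has neighbours 1 and 2, node 1 has neighbour 0, node 2 is isolated.
rowptr = [0, 2, 3, 3]
col = [1, 2, 0]
x = [[1.0, 2.0], [3.0, -1.0], [5.0, 4.0]]
result = fused_aggregate(rowptr, col, x)
```

An unfused version would repeat the loop over edges once per aggregator; the fused loop trades a few extra accumulators for a single traversal of the sparse structure.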

It may be useful to investigate whether [TVM](https://tvm.apache.org/) can help us with automated kernel optimization, especially when it comes to handling multiple heads/bases. However, this lies beyond the scope of the original paper. The overall principle should also be usable for other architectures such as PNA.
