- Keywords: bioinformatics, optimal transport, kernel methods, attention, transformers
- Abstract: We address the problem of learning on sets of features, motivated by the need of performing pooling operations in long biological sequences of varying sizes, with long-range dependencies, and possibly few labeled data. To address this challenging task, we introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference. Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost. Our aggregation technique admits two useful interpretations: it may be seen as a mechanism related to attention layers in neural networks, or it may be seen as a scalable surrogate of a classical optimal transport-based kernel. We experimentally demonstrate the effectiveness of our approach on biological sequences, achieving state-of-the-art results for protein fold recognition and detection of chromatin profiles tasks, and, as a proof of concept, we show promising results for processing natural language sequences. We provide an open-source implementation of our embedding that can be used alone or as a module in larger learning models at https://github.com/claying/OTK.
- One-sentence Summary: We propose a new, trainable embedding for large sets of features such as biological sequences, and demonstrate its effectiveness.
- Supplementary Material: zip
- Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
- Code: [![github](/images/github_icon.svg) claying/OTK](https://github.com/claying/OTK)
- Data: [GLUE](https://paperswithcode.com/dataset/glue), [SST](https://paperswithcode.com/dataset/sst)