The token parser and manipulator: a next-generation Deep Learning architecture

Deep Learning is an excellently scalable approach for processing unstructured, high-dimensional, raw sensory signals. It is so good at this that these very properties have also become its most popular criticism. At the moment, deep learning is mostly just a giant correlation machine, devouring enormous amounts of data to recognise hidden patterns, but still lacking the human-like systematic generalisation required in many reasoning tasks. Symbolic AI, on the other hand, possesses these abilities by design, but relies on handcrafted symbols that have already been abstracted away from the raw information. Among the many approaches to combining the best of both worlds, I am most excited about end-to-end trainable architectures with a perception module that structures the raw input and a reasoning module that operates on top of these symbol-like vectors. While there is still a lot of work to do before such a system becomes practically relevant, in this blog post we will take a look at the paper Contrastive Learning of Structured World Models, an early paper that offers a glimpse into such an architecture through a concrete implementation.

Replace symbols with vectors and logic with algebra.

The field of Neuro-symbolic AI naturally emerged with the goal of combining the complementary advantages of neural networks and symbolic methods. These hybrid models promise the ability to learn representations from unstructured data at scale using Deep Learning, while also performing logical reasoning drawing on the wealth of symbolic AI literature. Much research has since been done on the interface between vector representation outputs and symbolic inputs. However, many of the same problems of the symbolic approach still remain when moving from human-designed symbols to machine-learned vectors.

Instead of this hybrid approach, many researchers envision a conceptually simpler one, in which the symbol-manipulation reasoning module (i.e. system 2) is learned jointly, in an end-to-end fashion, with the module that processes unstructured signals (system 1).

Figure 1: A conceptual framework for dynamically learning and binding unstructured information into symbol-like tokens to facilitate reasoning in neural networks. Figure adapted from [Greff et al.] (permission pending for publication).

This vision is usually succinctly put:

“Replace symbols with vectors and logic with algebra”. – Yann LeCun


Note: Whether a system like this is still called “symbolic AI” or not is left as a semantic exercise for the celebrated researchers. In this blog, we will adopt the term “token” from the Transformer literature to refer to these symbol-like representation vectors.


Contrastive Learning of Structured World Models

While such an architecture might sound promising, it can be hard to see what it actually looks like. The Contrastive Learning of Structured World Models (C-SWM) paper by [Kipf et al.] materialises this vision through very simple building blocks drawn from a wide range of Deep Learning subfields. First, let's decompose the title of the paper to get a sense of its motivation and approach.

A world model refers to a compressed representation of the raw sensory signal, which is assumed to come from a coherent and consistent environment with many independent and interacting agents: the world. This is in contrast to the task-specific representations usually obtained with standard supervised learning. A learned simulation of the world can enable a system to reason (counterfactually) and plan more efficiently, which is the goal of Model-based Reinforcement Learning.

A structured world model implies a world model that explicitly captures the structures of the world in its representation, as opposed to implicitly encoding them all in different subspaces of a single output vector.

Contrastive Learning of this structured world model refers to a self-supervised approach that enables the model to be learned end to end from raw observations only, as all world models should be. After all, God does not give us the answer to the ultimate question of life, the universe and everything either. 1

In this section, I will go through each individual module and highlight how they fit into the overall architecture. I won't go into implementation details, because they are quite simple, not particularly important for this blog, and the authors have done an excellent job of explaining them in the original paper.

The overall architecture

Similar to the blueprint in Figure 1 above, the architecture of C-SWM is divided into three parts:

  • The object extractor segregates raw unstructured information into distinct object feature maps.
  • The object encoder transforms these feature maps into a consistent set of object representations.
  • The relational Graph Neural Network (GNN) manipulates these symbols, taking on the role of the composition module.

All three components are tied together through a contrastive objective, and trained end-to-end in a self-supervised manner.

Figure 2: The architecture of C-SWM. The object extractor segregates information at the object level of abstraction. The object encoder maps the segregated information into a common object representation format. The transition model processes these tokens as the basic building blocks for modelling the dynamics of the world. Finally, the entire framework is trained end-to-end via a self-supervised contrastive objective. Figure adapted from the original paper, with annotations.

We can think of the first two modules, the object extractor and the object encoder, as the token parser/learner, whose goal is to produce a set of compatible representations for each input. The GNN operating on this set of tokens can be treated as the token manipulator, the counterpart of traditional symbolic AI methods, which takes on the role of reasoning computation.

The token learner/parser

The world has many different structures, so what kind of structure should we impose on our model? In this paper, the level of abstraction chosen for the tokens is the object level. 2 Object-centric representation learning is a subfield that chases this very objective [Workshop1] [Workshop2].

The object extractor

The first step in finding structure in unstructured data is to figure out which parts of the input should be grouped into the same object. This information segregation task is handled in this paper by the object extractor.

Since the input is a sequence of images, the authors opted for a standard Convolutional Neural Network. This network processes the entire image at once and outputs a set of feature maps, with each feature map belonging to a different entity, i.e. “object”.
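To make this concrete, here is a minimal sketch of what such an extractor could look like in PyTorch. The layer sizes, the number of object slots and the sigmoid activation are illustrative assumptions of this sketch, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class ObjectExtractor(nn.Module):
    """A CNN mapping an image to K object feature maps (sketch only)."""

    def __init__(self, in_channels=3, num_objects=5):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            # One output channel per object: each channel is treated as the
            # feature map of a distinct entity in the scene.
            nn.Conv2d(32, num_objects, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, obs):
        # obs: (batch, C, H, W) -> (batch, K, H, W), one map per object slot
        return self.cnn(obs)
```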

The object encoder

After the information segregation step, each feature map is further encoded to obtain the set of object representations. In this case, the object encoder is simply an MLP that takes in the flattened feature maps output by the object extractor.

Crucially, the MLP is reused across all feature maps. This innocuous choice enforces that all object tokens lie in the same vector space, enhancing the consistency and compatibility between object tokens for downstream manipulation.
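A minimal sketch of this shared encoder, with illustrative dimensions, might look as follows. Note how a single MLP is applied to every one of the K slots:

```python
import torch.nn as nn

class ObjectEncoder(nn.Module):
    """One MLP applied independently to each object feature map (sketch)."""

    def __init__(self, feature_map_dim=50 * 50, hidden_dim=512, token_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_map_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, token_dim),
        )

    def forward(self, feature_maps):
        # feature_maps: (batch, K, H, W)
        batch, k, h, w = feature_maps.shape
        flat = feature_maps.view(batch, k, h * w)
        # The same weights process every slot, so all K tokens land in a
        # single shared vector space.
        return self.mlp(flat)  # (batch, K, token_dim)
```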

The token manipulator

This set of structured, high-level tokens is then manipulated by a neural network counterpart of the logic module in symbolic AI. In this paper, the token manipulator is implemented by a simple graph neural network, under the assumption that the object tokens form a fully connected graph (i.e. each object can potentially interact with every other object).

This GNN shares the same node and edge update functions across all object tokens, similar to how the object encoder MLP is shared between them. This choice not only promotes efficiency but also acts as a constraint encouraging interoperability between tokens during training.
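Here is a minimal sketch of such a transition GNN over a fully connected object graph. The dimensions are illustrative, self-edges are kept for simplicity, and the per-object action input that C-SWM conditions on is omitted for brevity:

```python
import torch
import torch.nn as nn

class TransitionGNN(nn.Module):
    """One shared edge MLP and one shared node MLP over all objects (sketch)."""

    def __init__(self, token_dim=32, hidden_dim=512):
        super().__init__()
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * token_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.node_mlp = nn.Sequential(
            nn.Linear(token_dim + hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, token_dim),
        )

    def forward(self, tokens):
        # tokens: (batch, K, D)
        batch, k, d = tokens.shape
        # Build all K x K ordered pairs: the fully connected interaction graph.
        src = tokens.unsqueeze(2).expand(batch, k, k, d)
        dst = tokens.unsqueeze(1).expand(batch, k, k, d)
        messages = self.edge_mlp(torch.cat([src, dst], dim=-1))
        # Sum incoming messages per object, then predict a per-object update.
        aggregated = messages.sum(dim=1)  # (batch, K, hidden_dim)
        return self.node_mlp(torch.cat([tokens, aggregated], dim=-1))
```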

This is, in my opinion, one of the most important advantages of this architecture over hybrid Neuro-symbolic approaches: many interface constraints are handled transparently through a simple choice of parameter sharing.

The object-factorised contrastive objective

Up until now, all the structure, the “objectness”, we wish to impose on the representation is only vaguely expressed in the sense of a “set” of output token vectors, instead of a single (possibly larger) output vector. There is evidence that, given an appropriate structure of the representation, this object abstraction will emerge as a byproduct of optimisation [Burgess, 2019]. I tend to believe that, since it is easier to understand a scene in terms of its constituent objects, the optimisation process will converge to that solution first, given an appropriate architecture.

In C-SWM, to promote the disentanglement of features into objects more explicitly, the authors further employed an object-factorised contrastive loss. This is a simple modification of the original max-margin contrastive loss that takes into account the fact that the model outputs a set of tokens instead of a single one.
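A minimal sketch of this factorised hinge loss could look as follows: squared distances are averaged over the K object slots, the positive pair is the transition-model prediction versus the true next-state tokens, and the negative sample is the encoding of a random other state (the σ and γ defaults follow the released code, but treat all values here as illustrative):

```python
import torch

def factorised_energy(pred_tokens, target_tokens, sigma=0.5):
    # Squared distance per object slot, averaged over the K slots.
    diff = pred_tokens - target_tokens          # (batch, K, D)
    return diff.pow(2).sum(dim=-1).mean(dim=-1) / (2 * sigma ** 2)

def contrastive_loss(z, delta, z_next, z_neg, gamma=1.0, sigma=0.5):
    # Positive term: the predicted transition z + delta should match the
    # tokens actually observed at the next time step.
    positive = factorised_energy(z + delta, z_next, sigma)
    # Negative term: tokens from a random other state should stay at least
    # gamma away from the next-state tokens (max-margin hinge).
    negative = factorised_energy(z_neg, z_next, sigma)
    return (positive + torch.relu(gamma - negative)).mean()
```

Tying it to the sketches above, a training step would roughly be `loss = contrastive_loss(z, gnn(z), z_next, z_neg)`, where each set of tokens comes from the encoder applied to the extractor's output.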

But…but, it's just evaluated on toy datasets!

Since this paper doesn't fit neatly into a well-defined task and benchmark, the authors had to come up with their own evaluation suite.

Figure 3: The five datasets C-SWM is evaluated on. Each dataset aims to highlight an important component of the architecture.

The 2D Shapes and 3D Blocks datasets demonstrate that a simple Convolutional Network can capture object-centric representations in both simple and more challenging environments. The Pong and Space Invaders game environments and the 3-body physics simulation really highlight the importance of the token manipulator module on top of the learned object-centric tokens.

Even though the authors have successfully convinced the ICLR reviewers of their vision through this carefully selected suite of environments, I can still hear the Reviewer 2 among you raising an eyebrow about “mah toy datasets”.

Indeed, we have yet to see a fully integrated architecture like this “beat SoTA”, but there has been a lot of progress on each of the individual components.

Object-centric representation

A lot of progress has been made in the past few years on learning object-centric representations of complex visual scenes without supervision: from seminal work on images such as MONet [Burgess, 2019] and Slot Attention [Locatello et al., 2020] toward more powerful models on video such as SIMONe [Kabra, 2021] and SAVi [Kipf et al., 2021].

Neural Reasoning

Alongside scaling up the pattern-recognition side of neural networks, there is also a lot of interest in extending their reasoning capabilities, mostly in the form of Graph Neural Networks [Battaglia, 2018]. A notable effort in this space is termed Neural Algorithmic Reasoning [Veličković, 2021], in which traditional algorithms are transformed into their differentiable counterparts.

State of the Art models are secretly token learners and manipulators!?

Even though I believe we won't see this type of architecture in the SOTA spotlight anytime soon, if we squint our eyes enough, the current state-of-the-art models kind of already are token learners and manipulators?!

In 2022, we cannot mention SOTA without the Transformer architecture [Vaswani, 2017]. Originally designed for text processing, taking discrete language tokens as input, it can be seen as a pure token manipulator. Unsurprisingly, the Transformer also processes a fully connected graph of its input tokens.

The Vision Transformer (ViT) [Dosovitskiy, 2020], which has challenged CNNs in their own backyard, the image domain, is the same Transformer acting as token manipulator, but with a simple patch-based token parser.
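If we write down what that parser does, it is strikingly simple. Here is a minimal, illustrative sketch of the patchify-and-project step (the patch size and embedding width follow common ViT-Base settings, but are assumptions of this sketch):

```python
import torch
import torch.nn as nn

def patchify(images, patch_size=16):
    # images: (batch, C, H, W) -> (batch, num_patches, C * P * P)
    b, c, h, w = images.shape
    p = patch_size
    patches = images.unfold(2, p, p).unfold(3, p, p)  # (b, c, h/p, w/p, p, p)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)

# One shared linear projection turns every patch into a token, after which a
# standard Transformer takes over as the token manipulator.
patch_embed = nn.Linear(3 * 16 * 16, 768)
tokens = patch_embed(patchify(torch.randn(1, 3, 224, 224)))  # (1, 196, 768)
```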

The example architecture I like the most is DETR [Carion, 2020], where the token parser is still a CNN, to take advantage of the locality structure of images, combined with a more complex Transformer-based token manipulator that performs object detection.

Thoughts

With all the astonishing results of scaling up deep learning models still leaving a bitter taste in the mouths of many AI researchers [Sutton, 2019], some can't help but be repulsed when hearing about adding more “structure” to a model. While it might be possible that a GPT-10 could internally represent and manipulate all the structures of the world in its massive vector space, I can't help but wonder how much less efficient it would be for not incorporating some sort of soft structure in the model itself.

From the perspective of a researcher whose job is to inject inductive biases into models, I believe the weak inductive biases present in the architecture above will leave a sweet aftertaste, not just bitterness. After all, even if our GPT-10 overlord needs none of this “structureness”, I can only hope the token parser and manipulator above will help speed up the research spiral between inductive biases and scaling, so that we can have a more advanced and, hopefully, benevolent AI faster.

Overall, given the simple yet elegantly composed components used in this paper, the C-SWM model really inspired and convinced me, and I hope now you too, that there is a lot more to come in the future of Deep Learning research.

References

Kipf, Thomas, Elise van der Pol, and Max Welling. “Contrastive Learning of Structured World Models.” International Conference on Learning Representations. 2020.

Greff, Klaus, Sjoerd van Steenkiste, and Jürgen Schmidhuber. “On the binding problem in artificial neural networks.” arXiv preprint arXiv:2012.05208 (2020).

Object-Oriented Learning (OOL): Perception, Representation, and Reasoning. ICML 2020 Workshop. July 17, 2020, Virtual. https://oolworkshop.github.io/

Object Representations for Learning and Reasoning. NeurIPS 2020 Workshop. December 11, 2020, Virtual. https://orlrworkshop.github.io

Burgess, Christopher P., et al. “MONet: Unsupervised Scene Decomposition and Representation.” arXiv preprint arXiv:1901.11390 (2019).

Kabra, Rishabh, et al. “SIMONe: View-Invariant, Temporally-Abstracted Object Representations via Unsupervised Video Decomposition.” arXiv preprint arXiv:2106.03849 (2021).

Kipf, Thomas, et al. “Conditional Object-Centric Learning from Video.” arXiv preprint arXiv:2111.12594 (2021).

Locatello, Francesco, et al. “Object-centric learning with slot attention.” arXiv preprint arXiv:2006.15055 (2020).

Battaglia, Peter W., et al. “Relational inductive biases, deep learning, and graph networks.” arXiv preprint arXiv:1806.01261 (2018).

Veličković, Petar, and Charles Blundell. “Neural Algorithmic Reasoning.” arXiv preprint arXiv:2105.02761 (2021).

Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.

Dosovitskiy, Alexey, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).

Carion, Nicolas, et al. “End-to-end object detection with transformers.” European Conference on Computer Vision. Springer, Cham, 2020.

Sutton, Rich. “The Bitter Lesson” March 13, 2019. http://incompleteideas.net/IncIdeas/BitterLesson.html

Footnotes

  1. Spoiler alert, it’s 42! 

  2. “Object” in this paper (and the whole subfield of object-centric representation) is understood to be context-dependent, a “you know it when you see it” kind of thing. Let's not go down the philosophical rabbit hole of asking “What is an object?”.