Alongside scaling up the pattern-recognition side of neural networks, there is also a lot of interest in extending their reasoning capability, mostly in the form of Graph Neural Networks [Battaglia, 2018]. A notable effort in this space is Neural Algorithmic Reasoning [Veličković, 2021], in which traditional algorithms are transformed into their differentiable counterparts.
Even though I believe we won’t see this type of architecture in the SOTA spotlight anytime soon, if we squint our eyes enough, the current state-of-the-art models kind of already are?!
In 2022, we cannot mention any SOTA without the Transformer architecture [Vaswani, 2017]. Originally built for text, taking discrete language tokens as input, it can be seen as a pure token manipulator. Unsurprisingly, the Transformer also processes a fully connected graph over its input tokens: every token attends to every other.
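To make the "fully connected graph" view concrete, here is a minimal sketch (illustrative, not the original implementation) of a single self-attention step, where the attention matrix plays the role of content-dependent edge weights between every pair of tokens; the dimensions are arbitrary assumptions:

```python
import torch
import torch.nn.functional as F

def self_attention(tokens, w_q, w_k, w_v):
    """tokens: (num_tokens, dim); w_q/w_k/w_v: (dim, dim) projection matrices."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # (num_tokens, num_tokens) matrix: edge weights of the fully connected
    # token graph, computed from the tokens' content rather than fixed a priori.
    attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v  # each token aggregates "messages" from all tokens

dim = 16
tokens = torch.randn(10, dim)                # 10 toy tokens
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(tokens, w_q, w_k, w_v)  # (10, 16)
```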
The Vision Transformer (ViT) [Dosovitskiy, 2020], which has challenged CNNs on the image domain, their own backyard, is the same Transformer token manipulator, just paired with a simple patch-based token parser.
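A rough sketch of what such a patch-based token parser could look like; the patch size and embedding dimension below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

patch, dim = 16, 64
image = torch.randn(1, 3, 224, 224)          # (batch, channels, H, W)

# Extract non-overlapping 16x16 patches -> (1, 3*16*16, num_patches)
patches = nn.functional.unfold(image, kernel_size=patch, stride=patch)
patches = patches.transpose(1, 2)            # (1, 196, 768)

# The "parser" is essentially a linear map from flattened patch to token
to_token = nn.Linear(3 * patch * patch, dim)
tokens = to_token(patches)                   # (1, 196, 64) -> fed to a Transformer
```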
Another example, and the architecture I like the most, is DETR [Carion, 2020], where the token parser is still a CNN to take advantage of the locality structure of images, combined with a more complex Transformer-based token manipulator that performs object detection.
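Here is a hedged sketch of that parser/manipulator split; the backbone, dimensions, and number of object queries are assumptions for illustration, and details such as positional encodings and the prediction heads are omitted:

```python
import torch
import torch.nn as nn
import torchvision

dim, num_queries = 256, 100
backbone = torchvision.models.resnet50(weights=None)
backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop pool + fc head

image = torch.randn(1, 3, 800, 800)
feat = backbone(image)                       # (1, 2048, 25, 25): CNN token parser
feat = nn.Conv2d(2048, dim, 1)(feat)         # project features to token dim
tokens = feat.flatten(2).permute(2, 0, 1)    # (625 tokens, batch, dim)

queries = torch.randn(num_queries, 1, dim)   # learned object queries
transformer = nn.Transformer(d_model=dim)    # encoder-decoder token manipulator
objects = transformer(tokens, queries)       # (100, 1, 256): one slot per detection
# Small MLP heads on `objects` would then predict class labels and boxes.
```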
With all the astonishing results of scaling up deep learning models, which still leave a bitter taste in the mouths of many AI researchers [Sutton, 2019], some can’t help but be repulsed when they hear about adding more “structure” to the model. While it might be possible that GPT-10 can internally represent and manipulate all the structures of the world in its massive vector space, I can’t help but wonder how much less efficient it would be for not incorporating some sort of soft structure in the model itself.
From the perspective of a researcher whose job is to inject inductive biases into models, I believe the weak inductive biases present in the architectures above will leave a sweet aftertaste, not just bitterness. After all, even if our GPT-10 overlord will need none of this “structure”, I can only hope the token parsers and manipulators above will help speed up the research spiral between inductive biases and scaling, so that we can have a more advanced, and hopefully benevolent, AI sooner.
Overall, given the simple yet elegantly composed components used in this paper, the C-SWM model really inspired and convinced me, and I hope now you too, that there is a lot more to come in the future of Deep Learning research.
Deep Learning is an excellently scalable approach for processing unstructured, high-dimensional, raw sensory signals. It is so good at this that these very properties have also become its most popular criticism. At the moment, deep learning is mostly a giant correlation machine, devouring enormous amounts of data to recognise hidden patterns, but still lacking the human-like systematic generalisation required in many reasoning tasks. Symbolic AI, on the other hand, possesses these abilities by design, but relies on handcrafted symbols that have already been abstracted away from the raw information. Among the many approaches to combining the best of both worlds, I am most excited about end-to-end trainable architectures with a perception module that structures the raw input and a reasoning module that operates on top of these symbol-like vectors. While there is still a lot of work to do before such a system becomes practically relevant, in this blog post we will take a look at the paper Contrastive Learning of Structured World Models, an early work that offers a glimpse into such an architecture through a concrete implementation.
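As a preview of the shape of such a system, here is a minimal, illustrative sketch (not the authors' code) of a perception module producing symbol-like slot vectors, a relational reasoning module predicting their dynamics, and a simplified contrastive loss. All sizes and module choices are assumptions, and the real model also conditions the transition on the agent's action, which is left out here for brevity:

```python
import torch
import torch.nn as nn

K, D = 5, 32                                 # number of object slots, slot dim

class Perception(nn.Module):
    """CNN that parses a raw image into K symbol-like object-slot vectors."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, K, 3, stride=2, padding=1))
        self.mlp = nn.Sequential(nn.Linear(16 * 16, 64), nn.ReLU(), nn.Linear(64, D))

    def forward(self, img):                  # img: (B, 3, 64, 64)
        masks = self.cnn(img)                # (B, K, 16, 16): one feature map per slot
        return self.mlp(masks.flatten(2))    # (B, K, D)

class Transition(nn.Module):
    """Relational reasoning over the fully connected slot graph."""
    def __init__(self):
        super().__init__()
        self.edge = nn.Sequential(nn.Linear(2 * D, 64), nn.ReLU(), nn.Linear(64, D))
        self.node = nn.Sequential(nn.Linear(2 * D, 64), nn.ReLU(), nn.Linear(64, D))

    def forward(self, z):                    # z: (B, K, D)
        send = z.unsqueeze(2).expand(-1, K, K, D)
        recv = z.unsqueeze(1).expand(-1, K, K, D)
        msg = self.edge(torch.cat([send, recv], -1)).sum(2)  # aggregate messages
        return self.node(torch.cat([z, msg], -1))            # predicted change of z

perceive, transit = Perception(), Transition()
obs, next_obs = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
z, z_next = perceive(obs), perceive(next_obs)
pred_next = z + transit(z)                   # predict next latent state
neg = z_next[torch.randperm(8)]              # negatives from a shuffled batch
# Simplified contrastive hinge: pull the prediction toward the true next state,
# push it away from the negative (the paper's exact objective differs in details).
loss = ((pred_next - z_next) ** 2).mean() \
     + torch.relu(1.0 - ((pred_next - neg) ** 2).mean())
loss.backward()
```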