Low-Rank Tensor-Network Encodings for Video-to-Action Behavioral Cloning

Published: 08 Apr 2024, Last Modified: 08 Apr 2024. Accepted by TMLR.
Abstract: We describe a tensor-network latent-space encoding approach for increasing the scalability of behavioral cloning of a video game player’s actions entirely from video streams of the gameplay. Specifically, we address challenges associated with the high computational requirements of traditional deep-learning-based encoders, such as convolutional variational autoencoders, that prohibit their use on widely available hardware or for large-scale data. Our approach uses tensor networks instead of deep variational autoencoders for this purpose, and it yields significant speedups with no loss of accuracy. Empirical results on ATARI games demonstrate that our approach reduces the time it takes to encode data and train a predictor using the encodings (a speedup of between 2.6× and 9.6× compared to autoencoders or variational autoencoders). Furthermore, the tensor train encoding can also be trained efficiently on CPU, which leads to training times comparable to or better than those of the autoencoder and variational autoencoder trained on GPU (0.9× to 5.4× faster). These results suggest significant possibilities for mitigating the need for cost- and time-intensive hardware when training deep-learning architectures for behavioral cloning.
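To make the idea of a tensor train (TT) encoding concrete, the following is a minimal sketch of the standard TT-SVD decomposition applied to a single array. It is an illustration only and not necessarily the algorithm used in the paper; the function names `tt_svd` and `tt_reconstruct`, the `max_rank` parameter, and the example tensor shape are all assumptions introduced here.

```python
import numpy as np


def tt_svd(tensor, max_rank):
    """Split `tensor` into tensor-train cores via sequential truncated SVDs.

    Hypothetical helper: each core has shape (r_prev, n_k, r_next),
    with every TT rank capped at `max_rank`.
    """
    dims = tensor.shape
    cores = []
    rank = 1
    # Unfold so that the first mode indexes the rows.
    mat = tensor.reshape(rank * dims[0], -1)
    for k in range(len(dims) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        new_rank = min(max_rank, len(s))
        # The left singular vectors become the k-th core.
        cores.append(u[:, :new_rank].reshape(rank, dims[k], new_rank))
        # Carry the remainder (singular values times right vectors) forward.
        mat = s[:new_rank, None] * vt[:new_rank]
        rank = new_rank
        mat = mat.reshape(rank * dims[k + 1], -1)
    cores.append(mat.reshape(rank, dims[-1], 1))
    return cores


def tt_reconstruct(cores):
    """Contract the cores back into a full tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.reshape([c.shape[1] for c in cores])


# Usage on a small random array standing in for a video frame (assumed shape).
rng = np.random.default_rng(0)
frame = rng.standard_normal((4, 5, 6))
cores = tt_svd(frame, max_rank=2)
approx = tt_reconstruct(cores)
```

Capping the ranks trades reconstruction accuracy for a smaller representation. When `max_rank` is at least as large as every unfolding rank of the input, the reconstruction matches the original up to floating-point error.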
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: 1) Added a section with more discussion of the inference time of the TT compared to the AE / VAE. 2) Added details on the computational complexity of the TT.
Assigned Action Editor: ~Pin-Yu_Chen1
Submission Number: 2018