NeRVA: Joint Implicit Neural Representations for Videos and Audios

Published: 01 Jan 2024, Last Modified: 12 Nov 2025 · ICME 2024 · CC BY-SA 4.0
Abstract: Neural fields, also known as implicit neural representations (INRs), have recently proven effective at representing video content. However, video content typically includes audio, and existing INR works do not represent both modalities. Since INRs for audio are not well explored, we first propose NeRA, a novel neural representation that encodes audio with a neural network. We then propose NeRVA, a novel neural representation that jointly encodes both video and audio. We represent multimedia as a neural network that takes a timestamp as input and outputs the corresponding RGB image frame and the audio samples. The proposed architecture combines Multi-Layer Perceptron (MLP) and convolutional blocks. We also demonstrate that our joint representation of multimedia content outperforms representing its components individually (an improvement of +2 dB PSNR for video and a 10× improvement in FAD for audio).
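The idea of a timestamp-conditioned network with a shared trunk and per-modality heads can be illustrated with a minimal NumPy sketch. This is a hypothetical illustration of the input/output interface only: all layer sizes, the positional encoding, the nearest-neighbor upsampling standing in for the paper's convolutional blocks, and the 1600-samples-per-frame audio length (e.g. 16 kHz at 10 fps) are assumptions, not the authors' actual architecture.

```python
import numpy as np

# Hypothetical sketch of a NeRVA-style joint representation.
# Shapes, layer sizes, and the encoding are assumptions for illustration.

def positional_encoding(t, num_freqs=8):
    """Map a scalar timestamp in [0, 1] to sin/cos frequency features."""
    freqs = 2.0 ** np.arange(num_freqs)
    angles = np.pi * freqs * t
    return np.concatenate([np.sin(angles), np.cos(angles)])  # (2*num_freqs,)

rng = np.random.default_rng(0)

# Shared MLP trunk: timestamp encoding -> hidden feature vector
W1 = rng.standard_normal((64, 16)) * 0.1
W2 = rng.standard_normal((64, 64)) * 0.1

def trunk(t):
    h = np.maximum(W1 @ positional_encoding(t), 0.0)  # ReLU
    return np.maximum(W2 @ h, 0.0)

# Video head: feature -> low-res RGB map, then upsample to the frame size
# (nearest-neighbor repeat stands in for learned convolutional upsampling).
Wv = rng.standard_normal((3 * 8 * 8, 64)) * 0.1

def video_head(h, H=32, W=32):
    lowres = (Wv @ h).reshape(3, 8, 8)
    return np.repeat(np.repeat(lowres, H // 8, axis=1), W // 8, axis=2)

# Audio head: feature -> the audio samples spanning this frame's duration
Wa = rng.standard_normal((1600, 64)) * 0.1  # assumed 1600 samples per frame

def audio_head(h):
    return np.tanh(Wa @ h)  # bound samples to [-1, 1]

# One forward pass: a single timestamp yields both modalities at once.
t = 0.5
h = trunk(t)
frame, samples = video_head(h), audio_head(h)
print(frame.shape, samples.shape)  # (3, 32, 32) (1600,)
```

Because the trunk is shared between the two heads, gradients from both reconstruction losses shape the same features during training, which is one plausible reading of why the joint representation can outperform separate per-modality networks.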