UniINR: Unifying Spatial-Temporal INR for RS Video Correction, Deblur, and Interpolation with an Event Camera

18 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: implicit neural representation, rolling shutter camera, deblur, frame interpolation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: UniINR, recovering arbitrary frame-rate sharp GS frames from an RS blur image and paired event data.
Abstract: Images captured by rolling shutter (RS) cameras under fast camera motion often contain obvious image distortions and blur, which can be modeled as a row-wise combination of a sequence of global shutter (GS) frames within the exposure time. Naturally, recovering high-frame-rate sharp GS frames from an RS blur image requires simultaneously considering RS correction, deblur, and frame interpolation. Tackling this task is nontrivial, and to the best of our knowledge, no feasible solutions exist thus far. A naive way is to decompose the whole process into separate tasks and simply cascade existing methods; however, this results in cumulative errors and noticeable artifacts. Event cameras enjoy many advantages, e.g., high temporal resolution, making them promising for our problem. To this end, we propose the \textbf{first} and novel approach, named \textbf{UniINR}, to recover arbitrary frame-rate sharp GS frames from an RS blur image and paired event data. Our key idea is \textit{unifying spatial-temporal implicit neural representation (INR) to directly map position and time coordinates to RGB values, thereby addressing the interlocking degradations in the image restoration process}. Specifically, we introduce a spatial-temporal implicit encoding (STE) to convert an RS blur image and events into a spatial-temporal representation (STR). To query a specific sharp frame (GS or RS), we embed the exposure time into the STR and decode the embedded features pixel by pixel to recover a sharp frame. Our method features a lightweight model with only \textbf{$0.379M$} parameters, and it also enjoys high inference efficiency, achieving $2.83\,ms$ per frame in $31\times$ frame interpolation of an RS blur frame. Extensive experiments show that our method significantly outperforms prior methods.
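Note on the formation model referenced in the abstract (our paraphrase, using notation not in the submission): the RS blur image can be approximated as a row-wise temporal average of latent sharp GS frames, $B_{RS}(x, y) \approx \frac{1}{N}\sum_{i=1}^{N} I^{GS}_{t_i(y)}(x, y)$, where $t_i(y)$ are $N$ sample times within the exposure window of row $y$, which shifts linearly with the row index.

To make the coordinate-to-RGB mapping concrete, below is a minimal PyTorch sketch of what a spatial-temporal INR decoder of this kind could look like. It is an illustration under our own assumptions, not the paper's implementation: the class name `INRDecoder`, the feature dimensions, and the random stand-in STR features are all hypothetical; the actual STE/STR architecture is defined in the paper and supplementary material.

```python
# Illustrative sketch only, NOT the authors' implementation: it shows the
# coordinate-and-time-to-RGB mapping described in the abstract. All names
# (INRDecoder, feat_dim, str_features) are hypothetical.
import torch
import torch.nn as nn

class INRDecoder(nn.Module):
    """Maps a per-pixel STR feature plus a query timestamp t to an RGB value."""
    def __init__(self, feat_dim: int = 64, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # RGB output
        )

    def forward(self, str_features: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # str_features: (H*W, feat_dim) spatial-temporal features, one per pixel
        # t: (H*W, 1) normalized timestamps in [0, 1]; constant for a GS query,
        #    varying linearly with the row index for an RS query.
        return self.mlp(torch.cat([str_features, t], dim=-1))

# Querying one sharp GS frame at t = 0.5 for a 4x4 toy "image":
H, W, F = 4, 4, 64
decoder = INRDecoder(feat_dim=F)
feats = torch.randn(H * W, F)             # stand-in for the encoded STR
t_gs = torch.full((H * W, 1), 0.5)        # one global timestamp for all pixels
rgb = decoder(feats, t_gs).view(H, W, 3)  # decoded sharp GS frame

# For an RS query, the timestamp instead varies with the row:
rows = torch.arange(H).repeat_interleave(W).float() / (H - 1)
t_rs = rows.unsqueeze(1)                  # (H*W, 1), row-dependent times
rs_frame = decoder(feats, t_rs).view(H, W, 3)
```

Because decoding is a per-pixel MLP query over a shared STR, interpolating at arbitrary frame rates amounts to evaluating the same decoder at additional timestamps, which is consistent with the lightweight model size and per-frame inference cost reported in the abstract.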
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1042