### Computational Visualization Center

Video imputation is essential for on-the-fly error recovery in adaptive streaming, compression, and super-resolution. Most prior methods rely on generative adversarial networks (GANs): given two consecutive video frames, the model attempts to generate intermediate frames that form a coherent video sequence. In this work, we study a factorizable variational autoencoder based on a second-order ordinary differential equation solver (Ode2-FacVAE), which accounts for separate covariances across the spatial, spectral, and temporal dimensions of the video in order to improve the visual quality achieved by previously proposed video-interpolation methods.

### Problem Formulation

Let $X=\{x_{t_i}\}_{i=0}^N$, where $x_{t_i} \in \mathbb{R}^d$, be an ordered sequence of video frames. Let $I=\{0,1,\ldots,N\}$ be the set of frame indices, where $t_i \in \mathbb{R}^+$ for $i \in I$ and $t_i > t_j$ for all $i,j \in I$ with $i > j$. Let $t \in \mathbb{R}^+$ be a continuous time variable. We are tasked with recovering $x_t$ from $X$. We do this by learning independently factorized latent-space manifolds that are locally linear and whose latent space captures the covariance structure between nearby frames: $\textbf{Objective: } \max p(x_t \mid X).$

- **Video Imputation:** Video can be damaged. The aim of this task is to identify damaged video frames and impute the missing or damaged pixels.
- **Video Extrapolation:** $\max p(x_t \mid X)$ where $t > t_N$.
- **Video Interpolation:** $\max p(x_t \mid X)$ where $t_i < t < t_{i+1}$ for some $i \in \{0,\ldots,N-1\}$.
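The three task regimes are distinguished purely by where the query time $t$ falls relative to the observed frame times. A minimal sketch (the helper `classify_query` is hypothetical, for illustration only):

```python
from bisect import bisect_left

def classify_query(t: float, times: list[float]) -> str:
    """Classify a query time against the observed frame times
    (hypothetical helper illustrating the three task regimes)."""
    if t in times:
        return "imputation"      # frame exists but may be damaged
    if t > times[-1]:
        return "extrapolation"   # t > t_N: predict beyond the sequence
    j = bisect_left(times, t)
    assert 0 < j < len(times)    # t_i < t < t_{i+1} for some i
    return "interpolation"

times = [0.0, 1.0, 2.0, 3.0]
print(classify_query(1.5, times))  # interpolation
print(classify_query(4.0, times))  # extrapolation
```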

### Methodology

We are given a video stream $x_{0:N}=\{x_0,x_1,\ldots,x_N\}$. Suppose that the generative process producing $x$ is described by a latent space that factorizes into $L$ independent subspaces. That is,

$$z = (z^1, \ldots, z^L), \qquad p(z) = \prod_{l=1}^{L} p(z^l).$$
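Numerically, independence across subspaces means the joint log-prior is the sum of the per-subspace log-priors. A small sketch (standard-normal priors and equal subspace sizes are illustrative assumptions, not the paper's choices):

```python
import numpy as np

def log_std_normal(x):
    # log N(x; 0, I), summed over dimensions
    return float(np.sum(-0.5 * (x**2 + np.log(2 * np.pi))))

def factorized_log_prior(z, L):
    """log p(z) = sum_l log p(z^l) under independent subspace priors
    (sketch: standard-normal priors, equal subspace sizes)."""
    return sum(log_std_normal(zl) for zl in np.split(z, L))

z = np.ones(12)
# Independence => splitting into subspaces leaves the total log-prior unchanged.
assert np.isclose(factorized_log_prior(z, 3), log_std_normal(z))
```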

Further, suppose that the data dynamics governing the transition $x_k \to x_{k+1}$ are determined by second-order dynamics in the latent spaces with a memoryless (Markov) property, such that

$$p(z_{k+1}^l \mid z_k^l, z_{k-1}^l, \ldots, z_0^l) = p(z_{k+1}^l \mid z_k^l), \qquad l = 1, \ldots, L.$$

Let $z_k^l=(s_k^l,v_k^l)$, where $\frac{\partial s_k^l}{\partial t}=v_k^l$ is the velocity of the latent position $s_k^l$ and $\frac{\partial v_k^l}{\partial t}=f(s_k^l,v_k^l)=a_k^l$ is its acceleration. While the true dynamics in ODE2VAE were modeled with a Bayesian neural network (BNN) \cite{yildiz2019ode2vae}, in our approach the velocity function is deterministic (although a BNN can easily be substituted).
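The deterministic second-order dynamics can be rolled out with an explicit integrator. A minimal sketch using forward Euler (the actual solver and the learned acceleration field $f$ are assumptions; a toy harmonic $f(s,v)=-s$ stands in for the network):

```python
import numpy as np

def integrate_second_order(s, v, f, dt, steps):
    """Euler-integrate the latent second-order ODE:
    ds/dt = v,  dv/dt = f(s, v)   (deterministic sketch)."""
    for _ in range(steps):
        a = f(s, v)        # acceleration from the (learned) field f
        s = s + dt * v     # advance position with current velocity
        v = v + dt * a     # advance velocity with current acceleration
    return s, v

# Toy example: harmonic acceleration f(s, v) = -s, integrated to t = 1.
s, v = integrate_second_order(np.array([1.0]), np.array([0.0]),
                              lambda s, v: -s, dt=0.01, steps=100)
```

With this toy $f$, the rollout tracks the analytic solution $s(t)=\cos t$, $v(t)=-\sin t$ up to the Euler discretization error.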

Our objective is to maximize the log-likelihood of observing the whole sequence of video frames, $\log p(x_{0:N})$, which is intractable. We can make this tractable by maximizing the evidence lower bound (ELBO) of $\log p(x_{0:N})$:

$$\log p(x_{0:N}) \;\ge\; \mathbb{E}_{q(z_{0:N}\mid x_{0:N})}\!\left[\log p(x_{0:N}\mid z_{0:N})\right] \;-\; \mathrm{KL}\!\left(q(z_{0:N}\mid x_{0:N})\,\|\,p(z_{0:N})\right),$$
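For a Gaussian encoder, the ELBO splits into a reconstruction term plus a closed-form KL to a standard-normal prior. A minimal single-sample sketch (the unit-variance decoder and the variable names are illustrative assumptions):

```python
import numpy as np

def gaussian_elbo(x, x_recon, mu, log_var):
    """Single-sample ELBO sketch: Gaussian log-likelihood of x under a
    unit-variance decoder, minus the closed-form KL between
    N(mu, diag(exp(log_var))) and N(0, I)."""
    recon = -0.5 * np.sum((x - x_recon) ** 2 + np.log(2 * np.pi))
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return recon - kl

# Perfect reconstruction and q equal to the prior: KL term vanishes.
x = np.zeros(4)
elbo = gaussian_elbo(x, x, mu=np.zeros(4), log_var=np.zeros(4))
```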

and where $q(z_i^l \mid z_{i-1}^l)$ is either one or zero since $f$ is deterministic (this can be relaxed), and a suitable prior is

where $\Delta t$ is the temporal difference between frames and $\{\gamma_1, \gamma_2, \gamma_3\}$ are hyperparameters. Also, the prior is memoryless: $p(z_i^l \mid z_{i-1}^l, z_{i-2}^l, \ldots) = p(z_i^l \mid z_{i-1}^l)$, which can be relaxed if need be.
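As an illustration, a discretized memoryless transition prior of this kind could place a Gaussian on each latent component (position, velocity, acceleration), centered on its Euler-propagated previous value with variances $\gamma_1, \gamma_2, \gamma_3$. The exact form of the paper's prior is an assumption here:

```python
import numpy as np

def transition_log_prior(z_prev, z_curr, dt, gammas):
    """Sketch of a Markov transition prior: each component of
    z = (s, v, a) gets a Gaussian centered on its Euler-propagated
    previous value, with variances gamma_1..gamma_3 (assumed form)."""
    s0, v0, a0 = z_prev
    s1, v1, a1 = z_curr
    means = [s0 + dt * v0, v0 + dt * a0, a0]
    lp = 0.0
    for x, m, g in zip([s1, v1, a1], means, gammas):
        lp += float(np.sum(-0.5 * ((x - m) ** 2 / g
                                   + np.log(2 * np.pi * g))))
    return lp

# A transition that exactly follows the Euler propagation maximizes the prior.
z_prev = (np.array([0.0]), np.array([1.0]), np.array([0.0]))
z_curr = (np.array([0.1]), np.array([1.0]), np.array([0.0]))
lp = transition_log_prior(z_prev, z_curr, dt=0.1, gammas=(0.1, 0.1, 0.1))
```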

### Results