Computational Visualization Center

A cross-disciplinary effort to develop and improve the technologies for computational modeling, simulation, analysis, and visualization.

Learning an Optimal Policy for Video Imputation


Video imputation is essential for on-the-fly error recovery in adaptive streaming, compression, and super-resolution. Most prior methods rely on generative adversarial networks (GANs): given two consecutive video frames, the model generates spatially interpolated frames to form a coherent video sequence. In this work, we study a factorizable variational autoencoder based on a second-order ordinary differential equation solver (Ode2-FacVAE), which accounts for separate covariances among the spatial, spectral, and temporal components of video frames in order to improve on the visual performance of previously proposed video interpolation methods.

Problem Formulation

Let $X=\{x_{t_i}\}_{i=1}^N$, where $x_{t_i} \in \mathbb{R}^d$, be an ordered sequence of video frames. Let $I=\{0,1,\dots,N\}$ be the frame index set, where $t_i \in \mathbb{R}^+$ for $i \in I$ and $t_i - t_j > 0$ for all $i,j \in I$ with $i > j$. Let $t \in \mathbb{R}^+$ be continuous and represent time. We are tasked with recovering $x_t$ from $X$. We do this by learning independently factorized latent-space manifolds that are locally linear and whose latent space captures the covariance structure between nearby frames:

$$\textbf{Objective: } \max p(x_t \mid X).$$

  • Video Imputation: Video can be damaged. The aim of this task is to identify damaged video frames and impute the missing or damaged pixels.
  • Video Extrapolation: $\max p(x_t \mid X)$ where $t > t_N$.
  • Video Interpolation: $\max p(x_t \mid X)$ where $t_i < t < t_{i+1}$ for some $i \in \{0,\dots,N-1\}$.
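The three tasks differ only in where the query time $t$ falls relative to the observed timestamps. A small illustrative helper (the function name and return labels are hypothetical, not part of the method) makes the distinction concrete:

```python
# Hypothetical helper: classify a continuous query time t against the
# observed, strictly increasing frame timestamps t_1 < ... < t_N.

def classify_query(t, timestamps):
    """Return which recovery task max p(x_t | X) corresponds to."""
    if t > timestamps[-1]:
        return "extrapolation"          # t > t_N
    for a, b in zip(timestamps, timestamps[1:]):
        if a < t < b:
            return "interpolation"      # t_i < t < t_{i+1}
    return "imputation"                 # t coincides with a (possibly damaged) frame

print(classify_query(1.5, [0.0, 1.0, 2.0, 3.0]))  # interpolation
print(classify_query(4.2, [0.0, 1.0, 2.0, 3.0]))  # extrapolation
```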

Methodology

We are given a video stream $x_{0:N}=\{x_0,x_1,\dots,x_N\}$. Suppose that the generative process producing $x$ is described by a latent space that can be factorized into $L$ independent subspaces. That is,

$$p(z_{0:N}^1, z_{0:N}^2, \dots, z_{0:N}^L) = p(z_{0:N}^1)\, p(z_{0:N}^2) \cdots p(z_{0:N}^L)$$
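Under this independence assumption, the joint latent log-density is simply a sum of per-subspace terms. A minimal NumPy sketch, where the standard-normal `log_p_subspace` is an illustrative stand-in for the learned per-subspace density:

```python
import numpy as np

def log_p_subspace(z_l):
    # Illustrative per-subspace density: log N(z_l | 0, I)
    return -0.5 * np.sum(z_l**2 + np.log(2 * np.pi))

def log_p_joint(z_subspaces):
    # p(z^1,...,z^L) = prod_l p(z^l)  =>  log p = sum_l log p(z^l)
    return sum(log_p_subspace(z_l) for z_l in z_subspaces)

rng = np.random.default_rng(0)
zs = [rng.standard_normal(4) for _ in range(3)]   # L = 3 subspaces

# With an isotropic density, evaluating the concatenated vector jointly
# gives the same result as summing the factorized terms.
assert np.isclose(log_p_subspace(np.concatenate(zs)), log_p_joint(zs))
```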

Further, suppose that the data dynamics governing the transition $x_k \to x_{k+1}$ are determined by second-order dynamics in the latent spaces with a memoryless property, such that

$$p(z_{0:N}^l) = p(z_0^l)\, p(z_1^l \mid z_0^l) \cdots p(z_N^l \mid z_{N-1}^l)$$

Let $z_k^l = (s_k^l, v_k^l)$, where $\frac{\partial s_k^l}{\partial t} = v_k^l$ is the velocity of the latent position $s_k^l$ and $\frac{\partial v_k^l}{\partial t} = f(s_k^l, v_k^l) = a_k^l$ is its acceleration. While the true dynamics in ODE2VAE were modeled with a BNN \cite{yildiz2019ode2vae}, in our approach the velocity function is deterministic (although a BNN can easily be substituted).
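A minimal sketch of how the latent state $(s, v)$ is advanced between frames under these second-order dynamics, using explicit Euler integration; the damped-spring `f` here is an illustrative placeholder for the learned acceleration network:

```python
import numpy as np

def f(s, v):
    # Placeholder acceleration a = f(s, v): a damped spring, standing in
    # for the learned (deterministic) dynamics network.
    return -s - 0.1 * v

def integrate(s, v, dt, steps):
    """Euler-integrate ds/dt = v, dv/dt = f(s, v) to advance the latent state."""
    for _ in range(steps):
        a = f(s, v)
        s = s + dt * v
        v = v + dt * a
    return s, v

s0, v0 = np.ones(2), np.zeros(2)
s1, v1 = integrate(s0, v0, dt=0.01, steps=100)   # latent state one frame later
```

In practice a higher-order solver (e.g. Runge-Kutta) would replace the Euler step, but the state layout $(s, v)$ and the update structure are the same.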

Our objective is to maximize the likelihood of observing the whole sequence of video frames, $\log p(x_{0:N})$, which is intractable. We make this tractable by maximizing the ELBO of $\log p(x_{0:N})$:

$$\begin{aligned} \mathrm{ELBO} &= \mathbb{E}_{q} \log \Big( \frac{p(x_{0:N}, z_{0:N}^1, \dots, z_{0:N}^L)}{q(z_{0:N}^1, \dots, z_{0:N}^L \mid x_{0:N})} \Big) \\ &= \mathbb{E}_{q} \Big[ \log p(x_{0:N} \mid z_{0:N}^1, \dots, z_{0:N}^L) + \log \Big( \frac{p(z_{0:N}^1, \dots, z_{0:N}^L)}{q(z_{0:N}^1, \dots, z_{0:N}^L \mid x_{0:N})} \Big) \Big] \\ &= \mathbb{E}_{q} \Big[ \log p(x_{0:N} \mid z_{0:N}^1, \dots, z_{0:N}^L) + \sum_{l=1}^L \log \Big( \frac{p(z_0^l)}{q(z_0^l \mid x_0)} \Big) + \sum_{l=1}^L \sum_{i=1}^N \log \Big( \frac{p(z_i^l \mid z_{i-1}^l)}{q(z_i^l \mid z_{i-1}^l)} \Big) \Big] \\ &= \underbrace{\mathbb{E}_q \log p(x_{0:N} \mid z_{0:N}^1, \dots, z_{0:N}^L)}_{\text{reconstruction error}} + \sum_{l=1}^L \Big( \underbrace{\sum_{i=1}^N \mathbb{E}_q \log p(z_i^l \mid z_{i-1}^l)}_{\text{prior regularizes latent movement}} - \underbrace{\mathrm{KL}\big( q(z_0^l \mid x_0) \,\|\, p(z_0^l) \big)}_{\text{VAE KL term on first frame}} \Big) \end{aligned}$$

and where $q(z_i^l \mid z_{i-1}^l)$ is either one or zero since $f$ is deterministic (this can be relaxed), and the transition prior is

$$p(z_i^l \mid z_{i-1}^l) = p\!\left( \begin{pmatrix} s_i^l \\ v_i^l \end{pmatrix} \middle|\, \begin{pmatrix} s_{i-1}^l \\ v_{i-1}^l \end{pmatrix} \right) = \mathcal{N}\!\left( \begin{pmatrix} s_i^l \\ v_i^l \end{pmatrix} \middle|\, \begin{pmatrix} s_{i-1}^l \\ v_{i-1}^l \end{pmatrix},\ \begin{pmatrix} (\gamma_1 \Delta t)^2 I & (\gamma_3 \Delta t)^2 I \\ (\gamma_3 \Delta t)^2 I & (\gamma_2 \Delta t)^2 I \end{pmatrix} \right)$$

where $\Delta t$ is the temporal difference between frames and $\{\gamma_1, \gamma_2, \gamma_3\}$ are hyperparameters. The prior is also memoryless (Markov), $p(z_i^l \mid z_{i-1}^l) = p(z_i^l \mid z_{i-1}^l, z_{i-2}^l, \dots)$, which can be relaxed if need be.
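As a sketch, this transition prior can be evaluated directly as a Gaussian log-density with the $(\gamma \Delta t)^2$ block covariance; the $\gamma$ values, the one-dimensional latent size, and the use of the previous state as the mean are illustrative assumptions here, not values from the paper:

```python
import numpy as np

def transition_log_prob(z_prev, z_curr, dt, g1=1.0, g2=1.0, g3=0.5, dim=1):
    """log N((s_i, v_i) | (s_{i-1}, v_{i-1}), block covariance with (gamma*dt)^2 terms).

    Centering on the previous state (an assumption) makes large latent
    jumps between frames unlikely, which is how the prior regularizes
    latent movement in the ELBO above.
    """
    s_p, v_p = map(np.asarray, z_prev)
    s_c, v_c = map(np.asarray, z_curr)
    cov = np.block([
        [(g1 * dt) ** 2 * np.eye(dim), (g3 * dt) ** 2 * np.eye(dim)],
        [(g3 * dt) ** 2 * np.eye(dim), (g2 * dt) ** 2 * np.eye(dim)],
    ])
    diff = np.concatenate([s_c - s_p, v_c - v_p])
    _, logdet = np.linalg.slogdet(cov)
    k = 2 * dim
    return -0.5 * (k * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

lp_small = transition_log_prob(([0.0], [0.0]), ([0.01], [0.01]), dt=0.1)
lp_large = transition_log_prob(([0.0], [0.0]), ([1.0], [1.0]), dt=0.1)
assert lp_small > lp_large   # small latent moves are more probable under the prior
```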

Results