Video imputation is essential for on-the-fly error recovery in adaptive streaming, compression, and super-resolution. Most prior methods rely on generative adversarial networks (GANs): given two consecutive video frames, the model generates spatially interpolated frames to form a coherent video sequence. In this work, we study a factorizable variational auto-encoder built on a second-order ordinary differential equation solver (Ode2-FacVAE) that models separate covariance structure across the spatial, spectral, and temporal dimensions of the video, in order to improve on the visual quality of previously proposed video interpolation methods.
Problem Formulation
Let $X = \{x_{t_i}\}_{i=0}^{N}$, where $x_{t_i} \in \mathbb{R}^d$, be an ordered sequence of video frames. Let $I = \{0, 1, \dots, N\}$ be the frame index set, where $t_i \in \mathbb{R}^+$ for $i \in I$ and $t_i - t_j > 0$ for all $i, j \in I$ with $i > j$. Let $t \in \mathbb{R}^+$ be continuous and represent time. We are tasked with recovering $x_t$ from $X$. We do this by learning independently factorized latent-space manifolds that are locally linear and whose latent space captures the covariance structure between nearby frames:
Objective: $\max\, p(x_t \mid X)$.
Video Imputation: Video can be damaged. The aim of this task is to identify damaged video frames and impute the missing or damaged pixels.
Video Extrapolation: $\max\, p(x_t \mid X)$ where $t > t_N$.
Video Interpolation: $\max\, p(x_t \mid X)$ where $t_i < t < t_{i+1}$ for some $i \in \{0, \dots, N-1\}$.
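As a lightweight illustration (hypothetical helper names, not part of the proposed method), the three tasks differ only in where the query time $t$ falls relative to the observed frame times:

\begin{verbatim}
# Hypothetical sketch: the three tasks reduce to querying p(x_t | X) at
# different times t. Names are illustrative, not from this work.
import numpy as np

def query_times(frame_times: np.ndarray, task: str) -> np.ndarray:
    """Pick query times t for a task, given observed frame times t_0..t_N."""
    if task == "imputation":
        # query at the observed times; damaged frames are re-generated
        return frame_times
    if task == "interpolation":
        # query midpoints between consecutive frames (t_i < t < t_{i+1})
        return (frame_times[:-1] + frame_times[1:]) / 2.0
    if task == "extrapolation":
        # query beyond the last observed frame (t > t_N)
        dt = frame_times[-1] - frame_times[-2]
        return frame_times[-1] + dt * np.arange(1, 4)
    raise ValueError(task)

ts = np.arange(10, dtype=np.float64)     # frames observed at t = 0, 1, ..., 9
print(query_times(ts, "interpolation"))  # [0.5 1.5 ... 8.5]
\end{verbatim}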
Methodology
We are given a video stream $x_{0:N} = \{x_0, x_1, \dots, x_N\}$. Suppose that the generative process that produces $x$ is described by a latent space that can be factorized into $L$ independent subspaces. That is,
\[
p(z_{0:N}^1, \dots, z_{0:N}^L) = \prod_{l=1}^{L} p(z_{0:N}^l).
\]
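In implementation terms, the corresponding approximate posterior $q(z^1, \dots, z^L \mid x)$ can mirror this factorization; the sketch below (an assumed PyTorch-style encoder, not necessarily the exact architecture used here) emits one mean and log-variance block per subspace:

\begin{verbatim}
# Illustrative sketch (assumed architecture): q(z^1,...,z^L | x) factorizes as
# a product of L independent diagonal Gaussians.
import torch
import torch.nn as nn

class FactorizedEncoder(nn.Module):
    def __init__(self, d: int, L: int, latent_dim: int):
        super().__init__()
        self.L, self.latent_dim = L, latent_dim
        self.net = nn.Sequential(
            nn.Linear(d, 256), nn.ReLU(),
            nn.Linear(256, 2 * L * latent_dim),  # mean + log-variance per subspace
        )

    def forward(self, x: torch.Tensor):
        h = self.net(x)                              # (batch, 2*L*latent_dim)
        h = h.view(-1, self.L, 2 * self.latent_dim)  # one block per subspace l
        mu, logvar = h.chunk(2, dim=-1)              # each (batch, L, latent_dim)
        return mu, logvar
\end{verbatim}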
Further, suppose that the data dynamics governing the transition $x_k \to x_{k+1}$ are determined by second-order dynamics in the latent spaces with a memoryless (Markov) property, so that
\[
p(z_{0:N}^l) = p(z_0^l)\, p(z_1^l \mid z_0^l) \cdots p(z_N^l \mid z_{N-1}^l).
\]
Let $z_k^l = (s_k^l, v_k^l)$, where $\frac{\partial s_k^l}{\partial t} = v_k^l$ is the velocity of the latent position $s_k^l$ and $\frac{\partial v_k^l}{\partial t} = f(s_k^l, v_k^l) = a_k^l$ is its acceleration. While the true dynamics in ODE2VAE were modeled with a BNN \cite{yildiz2019ode2vae}, in our approach the dynamics function $f$ is deterministic (although a BNN can easily be substituted).
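A minimal sketch of these second-order dynamics, assuming a simple explicit Euler discretization (an off-the-shelf ODE solver, or the BNN drift of ODE2VAE, could be substituted), is:

\begin{verbatim}
# Illustrative sketch: deterministic second-order dynamics z^l = (s^l, v^l)
# with ds/dt = v and dv/dt = f(s, v) = a, integrated by explicit Euler steps.
import torch
import torch.nn as nn

class SecondOrderDynamics(nn.Module):
    def __init__(self, latent_dim: int):
        super().__init__()
        # acceleration field a = f(s, v): a deterministic MLP (a BNN could be swapped in)
        self.f = nn.Sequential(
            nn.Linear(2 * latent_dim, 128), nn.Tanh(),
            nn.Linear(128, latent_dim),
        )

    def step(self, s, v, dt: float):
        """One Euler step of the coupled position/velocity system."""
        a = self.f(torch.cat([s, v], dim=-1))
        return s + dt * v, v + dt * a

    def rollout(self, s0, v0, n_steps: int, dt: float):
        """Integrate forward from (s_0, v_0); returns latent positions s_1..s_n."""
        s, v, out = s0, v0, []
        for _ in range(n_steps):
            s, v = self.step(s, v, dt)
            out.append(s)
        return torch.stack(out, dim=1)  # (batch, n_steps, latent_dim)
\end{verbatim}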
Our objective is to maximize the likelihood of observing the whole sequence of video frames, $\log p(x_{0:N})$, which is intractable. We can make this tractable by maximizing the evidence lower bound (ELBO) of $\log p(x_{0:N})$:
\begin{align*}
\mathrm{ELBO}
&= \mathbb{E}_q \log\!\left( \frac{p(x_{0:N}, z_{0:N}^1, \dots, z_{0:N}^L)}{q(z_{0:N}^1, \dots, z_{0:N}^L \mid x_{0:N})} \right) \\
&= \mathbb{E}_q \left[ \log p(x_{0:N} \mid z_{0:N}^1, \dots, z_{0:N}^L) + \log\!\left( \frac{p(z_{0:N}^1, \dots, z_{0:N}^L)}{q(z_{0:N}^1, \dots, z_{0:N}^L \mid x_{0:N})} \right) \right] \\
&= \mathbb{E}_q \left[ \log p(x_{0:N} \mid z_{0:N}^1, \dots, z_{0:N}^L) + \sum_{l=1}^{L} \log\!\left( \frac{p(z_0^l)}{q(z_0^l \mid x_0)} \right) + \sum_{l=1}^{L} \sum_{i=1}^{N} \log\!\left( \frac{p(z_i^l \mid z_{i-1}^l)}{q(z_i^l \mid z_{i-1}^l)} \right) \right] \\
&= \underbrace{\mathbb{E}_q \log p(x_{0:N} \mid z_{0:N}^1, \dots, z_{0:N}^L)}_{\text{reconstruction error}}
+ \sum_{l=1}^{L} \Bigg( \underbrace{\sum_{i=1}^{N} \mathbb{E}_q \log p(z_i^l \mid z_{i-1}^l)}_{\text{prior regularizes latent movement}}
- \underbrace{\mathrm{KL}\!\left( q(z_0^l \mid x_0) \,\big\|\, p(z_0^l) \right)}_{\text{VAE KL term on first frame}} \Bigg),
\end{align*}
where $q(z_i^l \mid z_{i-1}^l)$ is either one or zero since $f$ is deterministic (this can be relaxed).
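To make the three terms concrete, the sketch below assembles them for a minibatch; the shapes, the unit-variance Gaussian observation model, and the transition log-density callable are assumptions for illustration rather than the exact choices made in this work:

\begin{verbatim}
# Illustrative sketch of the ELBO above; names, shapes, and the unit-variance
# Gaussian likelihood are assumptions, not the exact choices of this work.
import torch
from torch.distributions import Normal, kl_divergence

def elbo(x, x_recon, mu0, logvar0, s_traj, v_traj, trans_logprob):
    # x, x_recon:     (batch, N+1, d)     observed / reconstructed frames
    # mu0, logvar0:   (batch, L, k)       parameters of q(z_0^l | x_0)
    # s_traj, v_traj: (batch, L, N+1, k)  latent trajectories per subspace
    # trans_logprob:  callable returning log p(z_i^l | z_{i-1}^l), (batch, L, N)

    # reconstruction error (unit-variance Gaussian likelihood, as an assumption)
    recon = Normal(x_recon, 1.0).log_prob(x).sum(dim=(-1, -2))

    # VAE KL term on the first frame, summed over the L subspaces
    q0 = Normal(mu0, (0.5 * logvar0).exp())
    p0 = Normal(torch.zeros_like(mu0), torch.ones_like(mu0))
    kl0 = kl_divergence(q0, p0).sum(dim=(-1, -2))

    # prior regularizer on latent movement: sum_l sum_i log p(z_i^l | z_{i-1}^l)
    trans = trans_logprob(s_traj, v_traj).sum(dim=(-1, -2))

    return (recon + trans - kl0).mean()  # maximize this ELBO
\end{verbatim}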
A superior prior on the transitions depends on $\Delta t$, the temporal difference between frames, and on hyperparameters $\{\gamma_1, \gamma_2, \gamma_3\}$. We also assume a memoryless prior, $p(z_i^l \mid z_{i-1}^l) = p(z_i^l \mid z_{i-1}^l, z_{i-2}^l, \dots)$, which can be relaxed if need be.
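Purely as an illustration of how $\Delta t$ and $\{\gamma_1, \gamma_2, \gamma_3\}$ could enter such a prior (an assumed form, not the one proposed in this work), one option is a transition density centred on an Euler step of the latent position and velocity, with an additional penalty on large accelerations:
% Assumed illustrative form only; not the prior actually proposed in this work.
\[
  p(z_i^l \mid z_{i-1}^l) \;\propto\;
  \exp\!\Big(
      -\gamma_1 \,\big\| s_i^l - \big(s_{i-1}^l + v_{i-1}^l \,\Delta t\big) \big\|^2
      \;-\; \gamma_2 \,\big\| v_i^l - \big(v_{i-1}^l + a_{i-1}^l \,\Delta t\big) \big\|^2
      \;-\; \gamma_3 \,\big\| a_{i-1}^l \big\|^2
  \Big).
\]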