SGV-Page

Abstract

With the increasing use of video data across a wide range of domains including medical imaging, computer vision, and online streaming platforms, efficient and compact video representation is essential for cost-effective storage without sacrificing video fidelity. Recent methods in Deformable 2D Gaussian Splatting (D2GV) represent video using a canonical set of 2D Gaussians that are deformed over time to render individual frames. Compared to existing techniques in Implicit Neural Representations (INRs), D2GV achieves faster training and rendering times with strong video fidelity. However, storing and deforming Gaussian primitives independently ignores the spatial and temporal similarities among local Gaussians across frames. To exploit these similarities, we incorporate anchor-based neural Gaussians to utilize INR-based parameterization of Gaussian primitives for compact storage. We partition the video sequence into fixed-length subsequences to enable parallel training and linear scalability as the number of frames increases. For each subsequence, a canonical set of anchors is initialized across a structured grid, each governing a group of local Gaussian primitives. From stored anchor features, corresponding shape and color attributes of local Gaussians are predicted via two lightweight multi-layer perceptrons (MLPs). A third MLP is incorporated to predict deformations from anchored canonical frames to the individual frames across time. Our design improves compression ratios without significantly reducing fidelity while maintaining similar training and decoding times.

Method

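We partition each video sequence into fixed-length segments, so that segments can be trained in parallel and cost scales linearly with the number of frames. Each segment uses $N$ anchors positioned on a structured grid. Each anchor stores a position $\mathbf{x}_a$, a feature vector $\mathbf{f}_a$, $K$ offsets $\boldsymbol{\delta}$, an offset scale $\mathbf{s}_o$, and a scaling factor $\mathbf{s}_a$ applied to the predicted Gaussian scales: $$ \mathbf{A} = \{ \mathbf{x}_a \in \mathbb{R}^2, \mathbf{f}_a \in \mathbb{R}^D, \boldsymbol{\delta} \in \mathbb{R}^{K \times 2}, \mathbf{s}_o \in \mathbb{R}^2, \mathbf{s}_a \in \mathbb{R}^2 \} $$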
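The segmentation and anchor layout can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `stride` parameter, the choice of cell centres for anchor placement, and the handling of a shorter final segment are all assumptions.

```python
import numpy as np

def partition_frames(num_frames, segment_len):
    """Split frame indices [0, num_frames) into consecutive fixed-length segments.

    Assumption: the last segment is simply shorter if num_frames is not a
    multiple of segment_len.
    """
    return [range(start, min(start + segment_len, num_frames))
            for start in range(0, num_frames, segment_len)]

def init_anchor_grid(height, width, stride):
    """Place one anchor at the centre of every stride x stride cell.

    Returns an (N, 2) array of (x, y) anchor positions in pixel coordinates.
    """
    ys = np.arange(stride / 2, height, stride)
    xs = np.arange(stride / 2, width, stride)
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx.ravel(), gy.ravel()], axis=-1)

segments = partition_frames(num_frames=10, segment_len=4)
anchors = init_anchor_grid(height=4, width=6, stride=2)
```

Because each segment owns its own canonical anchors, the segments share no parameters in this sketch and could be optimized independently.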

The positions of $K$ associated Gaussians are computed as: $$ \{\boldsymbol{\mu}^{(k)}\}_{k=0}^{K-1} = \mathbf{x}_a + \{\boldsymbol{\delta}^{(k)}\}_{k=0}^{K-1} \odot \mathbf{s}_o $$
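As a small illustration of the position equation, the sketch below evaluates it for a batch of anchors with NumPy broadcasting. The function name and the array shapes are assumptions made for this example.

```python
import numpy as np

def gaussian_positions(x_a, delta, s_o):
    """Compute mu^(k) = x_a + delta^(k) * s_o (elementwise) for every anchor.

    x_a:   (N, 2)    anchor positions
    delta: (N, K, 2) per-anchor offsets
    s_o:   (N, 2)    per-anchor offset scales
    Returns an (N, K, 2) array of Gaussian centres.
    """
    return x_a[:, None, :] + delta * s_o[:, None, :]

# One anchor with K = 2 offsets.
x_a = np.array([[0.5, 0.5]])
delta = np.array([[[1.0, 0.0], [0.0, -1.0]]])
s_o = np.array([[0.1, 0.2]])
mu = gaussian_positions(x_a, delta, s_o)
```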

$\text{MLP}_c$ predicts weighted colors for the $K$ associated Gaussians. $\text{MLP}_{\Sigma}$ predicts the scaling and rotation parameters $\mathbf{s}_{\text{base}}, \theta$. Building the covariance $\Sigma$ from these parameters through the factorization below guarantees that it is positive semi-definite: $$ \Sigma = RS(RS)^\top; \quad R = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}, \quad S = \begin{bmatrix} s_1 & 0 \\ 0 & s_2 \end{bmatrix}, \quad (s_1, s_2) = \mathbf{s}_{\text{base}} \odot \mathbf{s}_a $$
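The covariance construction can be checked numerically. The sketch below builds $\Sigma$ for a single Gaussian. The function name is hypothetical, and `s_base` and `theta` are passed in directly rather than predicted by an MLP.

```python
import numpy as np

def covariance_2d(s_base, s_a, theta):
    """Build Sigma = (R S)(R S)^T from a rotation angle and two scales."""
    s1, s2 = s_base * s_a                      # (s1, s2) = s_base * s_a, elementwise
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s],
                  [s,  c]])                    # 2D rotation matrix
    S = np.diag([s1, s2])                      # axis-aligned scaling
    M = R @ S
    return M @ M.T                             # symmetric PSD by construction

sigma_axis = covariance_2d(np.array([2.0, 3.0]), np.array([0.5, 1.0]), theta=0.0)
sigma_rot = covariance_2d(np.array([2.0, 3.0]), np.array([0.5, 1.0]), theta=np.pi / 2)
```

Since $\Sigma = MM^\top$ for $M = RS$, we have $\mathbf{v}^\top \Sigma \mathbf{v} = \lVert M^\top \mathbf{v} \rVert^2 \ge 0$ for any $\mathbf{v}$, which is why no extra constraint on the predicted parameters is needed for positive semi-definiteness.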

$\gamma(\cdot)$ is the positional encoding function, where $p$ is either a position $\boldsymbol{\mu}$ or a time $t$, normalized to $(0,1]$, and $L$ is the number of encoding frequencies: $$ \gamma(p) = (\sin(2^k \pi p), \cos(2^k \pi p))_{k=0}^{L-1} $$
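A minimal NumPy sketch of this encoding is given below. The function name and the interleaved output ordering $(\sin_0, \cos_0, \sin_1, \cos_1, \dots)$ are assumptions chosen to match the formula as written. The input may be a scalar time or an array of coordinates.

```python
import numpy as np

def positional_encoding(p, L):
    """Encode p with L frequencies: (sin(2^k pi p), cos(2^k pi p)) for k = 0..L-1.

    p: scalar or array of values normalised to (0, 1]
    Returns an array of shape p.shape + (2 * L,).
    """
    p = np.asarray(p, dtype=np.float64)
    freqs = (2.0 ** np.arange(L)) * np.pi          # (L,) frequencies 2^k * pi
    angles = p[..., None] * freqs                  # (..., L)
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)  # (..., L, 2)
    return enc.reshape(*p.shape, 2 * L)            # interleave sin/cos per frequency

gamma_t = positional_encoding(0.5, L=2)                       # scalar time
gamma_mu = positional_encoding(np.array([0.25, 1.0]), L=4)    # 2D position
```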

Gaussian primitives from the canonical frame are deformed to render individual frames across time. $\text{MLP}_{\Delta}$ predicts the position and color deformations $d\boldsymbol{\mu}$ and $d\boldsymbol{c}$ for the Gaussians at frame $t$: $$ \boldsymbol{\mu}' = \boldsymbol{\mu} + d\boldsymbol{\mu}, \quad \boldsymbol{c}' = \boldsymbol{c} + d\boldsymbol{c} $$ Following [2], the final pixel color $\boldsymbol{C}$ is then computed as: $$ \boldsymbol{C} = \sum_{i \in I} \boldsymbol{c}'_i G_i $$ where $G_i$ is the spatial density of the $i$-th Gaussian evaluated at the pixel position $\mathbf{x}$: $$ G(\mathbf{x}) = \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}')^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}')\right) $$
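The deformation and accumulation steps can be sketched for a single pixel as follows. This is an illustrative NumPy version under stated assumptions: the function name is hypothetical, the deformations `d_mu` and `d_c` are passed in rather than predicted by $\text{MLP}_{\Delta}$, and every Gaussian passed in is treated as belonging to the set $I$.

```python
import numpy as np

def render_pixel(x, mu, c, sigma, d_mu, d_c):
    """Accumulate the colour of one pixel from M deformed 2D Gaussians.

    x:     (2,)      pixel position
    mu:    (M, 2)    canonical Gaussian centres
    c:     (M, 3)    canonical colours
    sigma: (M, 2, 2) covariance matrices
    d_mu:  (M, 2)    position deformations for the current frame
    d_c:   (M, 3)    colour deformations for the current frame
    """
    mu_t = mu + d_mu                               # mu' = mu + d_mu
    c_t = c + d_c                                  # c'  = c  + d_c
    diff = x[None, :] - mu_t                       # (M, 2)
    sigma_inv = np.linalg.inv(sigma)               # batched 2x2 inverses
    maha = np.einsum("mi,mij,mj->m", diff, sigma_inv, diff)
    G = np.exp(-0.5 * maha)                        # spatial density per Gaussian
    return (c_t * G[:, None]).sum(axis=0)          # C = sum_i c'_i G_i

x = np.array([0.0, 0.0])
mu = np.zeros((1, 2))
c = np.array([[1.0, 0.0, 0.0]])
sigma = np.eye(2)[None]
# The Gaussian moves one unit away from the pixel and gains some green.
C = render_pixel(x, mu, c, sigma,
                 d_mu=np.array([[1.0, 0.0]]),
                 d_c=np.array([[0.0, 0.5, 0.0]]))
```

With an identity covariance and a unit shift, the squared Mahalanobis distance is 1, so the deformed color is attenuated by $e^{-1/2}$.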

References

*SGI is an unreleased paper that renders single images using anchored neural Gaussians.

[1] M. Liu et al. "D2GV: Deformable 2D Gaussian Splatting for Video Representation in 400FPS." arXiv preprint arXiv:2503.05600, 2025.

[2] L. Zhu, G. Lin, J. Chen, X. Zhang, Z. Jin, Z. Wang, and L. Yu. "Large Images are Gaussians: High-quality large image representation with levels of 2D Gaussian splatting." In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 10977–10985, 2025.

[3] A. Mercat, M. Viitanen, and J. Vanne. "UVG dataset: 50/120fps 4K sequences for video codec analysis and development." In Proceedings of the 11th ACM Multimedia Systems Conference, pp. 297–302, 2020.

†Videos used: Bosphorus, Beauty, SetGo, Bee, Yacht, Jockey, Shake.

SGV: Deforming Structured 2D Gaussians for Efficient and Compact Video Representation
