Gaussian Head Avatars: A Summary



Method diagram for Relightable Gaussian Codec Avatars. Image reproduced directly from the arXiv paper.

Relightable Gaussian Codec Avatars. Shunsuke Saito, Gabriel Schwartz, Tomas Simon, Junxuan Li, Giljoo Nam. arXiv preprint, 6 December 2023. Link

This next paper is probably the one that has generated the most hype. It comes from Meta’s Reality Labs. In addition to being animatable, these models can also be relit, making them easier to composite into different scenes. Since Meta has made a big bet on the ‘metaverse’, I expect this may lead to a product fairly soon. The paper builds upon the already popular Codec Avatars, using Gaussian splatting.

Mesh Fitting

Unfortunately, the mesh reconstruction algorithm used by Meta is a bit more complex, and it builds upon several of the company’s previous papers. It is sufficient to say, however, that they can reconstruct a tracked mesh with consistent topology that is temporally consistent across frames. They use a very expensive and complex capture rig to do this.

CVAE Training — Before Gaussians

The CVAE is based on this version from the paper “Deep relightable appearance models for animatable faces”. Image reproduced from the paper.

Meta’s previous approach is based on a CVAE (Conditional Variational Autoencoder). This takes in the tracked mesh and the average texture and encodes them into a latent vector. After reparameterization, this latent is decoded into the mesh and a set of features used to reproduce the texture. The objective of this current paper is to use a similar model, but with Gaussians.
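To make that structure concrete, here is a minimal sketch of such a CVAE in PyTorch. The layer sizes, module names and input resolutions are placeholders of my own, not values from the paper.

```python
import torch
import torch.nn as nn

class CodecAvatarCVAE(nn.Module):
    """Minimal sketch of the conditional VAE described above.
    Layer sizes and module names are illustrative, not taken from the paper."""

    def __init__(self, n_verts=5000, tex_res=256, latent_dim=256):
        super().__init__()
        # Encoder: tracked mesh vertices + average texture -> latent distribution
        self.encoder = nn.Sequential(
            nn.Linear(n_verts * 3 + 3 * tex_res * tex_res, 1024),
            nn.ReLU(),
            nn.Linear(1024, 2 * latent_dim),  # predicts mean and log-variance
        )
        # Separate decoders: latent -> mesh vertices and appearance features
        self.geometry_decoder = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(), nn.Linear(1024, n_verts * 3)
        )
        self.appearance_decoder = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(), nn.Linear(1024, 3 * tex_res * tex_res)
        )

    def forward(self, verts, avg_texture):
        x = torch.cat([verts.flatten(1), avg_texture.flatten(1)], dim=1)
        mu, log_var = self.encoder(x).chunk(2, dim=1)
        # Reparameterization trick: sample the latent while keeping gradients
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.geometry_decoder(z), self.appearance_decoder(z), mu, log_var
```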

CVAE With Gaussians

To extend this model to Gaussian splatting, a few changes need to be made. The encoder, however, is left unchanged: it still takes in the vertices V of the tracked mesh and an average texture. The geometry and appearance of the avatar are decoded separately, with the geometry represented as a set of Gaussians. One of the more interesting parts of the paper is the representation of Gaussians in UV space. A UV texture map is defined for the mesh template, so that each pixel in the texture map (texel) corresponds to a point on the mesh surface. In this paper, each texel defines a Gaussian. Instead of an absolute position, each texel’s Gaussian is defined by its displacement from the corresponding mesh surface point, e.g. a texel tied to the eyebrow defines a Gaussian that moves with the eyebrow. Each texel also stores a rotation, scale and opacity, as well as a roughness (σ) and SH coefficients for RGB colour and monochrome, as sketched below.
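The per-texel parameters can be thought of as a stack of UV-space images, one channel group per attribute. The resolution and channel counts below are my own illustrative choices (9 coefficients for 3rd-order SH, 25 for 5th-order), not necessarily the paper’s exact layout.

```python
import torch

# Hedged sketch: per-texel Gaussian parameters stored as UV-space images.
uv_res = 1024  # one Gaussian per texel on a 1024x1024 UV map (resolution is illustrative)

gaussian_maps = {
    "displacement": torch.zeros(3, uv_res, uv_res),      # offset from the texel's mesh surface point
    "rotation":     torch.zeros(4, uv_res, uv_res),      # quaternion per texel
    "scale":        torch.zeros(3, uv_res, uv_res),      # anisotropic scale per texel
    "opacity":      torch.zeros(1, uv_res, uv_res),
    "roughness":    torch.zeros(1, uv_res, uv_res),      # sigma, used for the specular term
    "sh_rgb":       torch.zeros(3 * 9, uv_res, uv_res),  # 3rd-order (9-coefficient) RGB SH
    "sh_mono":      torch.zeros(25, uv_res, uv_res),     # 5th-order (25-coefficient) monochrome SH
}

def texel_world_position(base_points: torch.Tensor, displacement: torch.Tensor) -> torch.Tensor:
    """Each Gaussian's centre is the mesh-surface point its texel maps to, plus a decoded offset,
    so Gaussians move with the surface (e.g. a texel on the eyebrow follows the eyebrow)."""
    return base_points + displacement
```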

The image on the left-hand side represents the geometry on the right. Each texel represents a Gaussian. Image reproduced directly from the arXiv paper.

In addition to the Gaussians, the decoder also predicts a surface normal map and visibility maps. These are all combined using approximations of the rendering equation. What follows is a very rough explanation that is almost certainly incomplete, as I’m not an expert on lighting.

From left to right: normal map, specular lighting map, diffuse lighting map, albedo and final lit appearance. Image reproduced directly from the arXiv paper.

The diffuse component of the light is computed using spherical harmonics. Each Gaussian has an albedo (ρ) and SH coefficients (d). Usually, SH coefficients are only represented up to the 3rd order; however, this is not enough to represent shadows. To balance detail against storage, the authors use 3rd-order RGB coefficients but 5th-order monochrome (grayscale) ones. In addition to diffuse lighting, the paper also models specularities (e.g. reflections) by assigning a roughness to each Gaussian and using the decoded normal maps. If you’re interested in exactly how this works, I would recommend reading the paper and supplementary materials.
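Purely as an illustration of how the two sets of SH coefficients might combine, the sketch below evaluates a diffuse term as dot products between per-Gaussian SH coefficients and the environment light’s SH coefficients. This is my own simplification, not the paper’s exact shading model.

```python
import torch

def diffuse_shading_sketch(albedo, sh_rgb, sh_mono, light_sh):
    """Illustrative-only sketch of SH-based diffuse shading (not the paper's formula).

    albedo:   (N, 3)     per-Gaussian albedo rho
    sh_rgb:   (N, 9, 3)  3rd-order (9-coefficient) RGB SH per Gaussian
    sh_mono:  (N, 25)    5th-order (25-coefficient) monochrome SH per Gaussian
    light_sh: (25, 3)    SH coefficients of the environment light
    """
    # Colour from the low-order RGB coefficients...
    rgb_term = (sh_rgb * light_sh[:9].unsqueeze(0)).sum(dim=1)               # (N, 3)
    # ...plus higher-frequency (e.g. shadow-like) detail from the monochrome coefficients.
    mono_term = (sh_mono.unsqueeze(-1) * light_sh.unsqueeze(0)).sum(dim=1)   # (N, 3)
    return albedo * (rgb_term + mono_term)
```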

Finally, a separate decoder also predicts the vertices of the template mesh. All models are trained together using reconstruction losses at both the image and mesh level, along with a variety of regularisation losses. The result is an extremely high-quality avatar with control over both expression and lighting.
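A hedged sketch of what that combined objective might look like; the specific loss terms and weights are placeholders of mine, not the paper’s actual values.

```python
def total_loss(rendered, target_image, pred_verts, tracked_verts, reg_terms,
               w_img=1.0, w_mesh=1.0, w_reg=0.1):
    """Placeholder combination of image-level and mesh-level reconstruction plus regularisers."""
    image_loss = (rendered - target_image).abs().mean()        # image-level reconstruction
    mesh_loss = (pred_verts - tracked_verts).square().mean()   # mesh-level reconstruction
    return w_img * image_loss + w_mesh * mesh_loss + w_reg * sum(reg_terms)
```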

TL;DR: Represent Gaussians as UV-space images, decompose the lighting and model it explicitly, and improve Codec Avatars using this Gaussian representation.
