FreeStyleGAN: Free-view Editable Portrait Rendering with the Camera Manifold

ACM Transactions on Graphics (SIGGRAPH Asia 2021)
Thomas Leimkühler and George Drettakis
Inria and Université Côte d'Azur

We introduce a new approach that generates images with StyleGAN for a precisely defined 3D camera. This enables faces synthesized with StyleGAN to be used in 3D free-viewpoint rendering, while retaining all the semantic editing provided by GAN methods. Our method takes as input multiple views of a person (examples in (a)), which are used to reconstruct a coarse 3D mesh (b). To render a novel view, we identify the closest camera that corresponds to an image StyleGAN can generate (c). We lift this view to 3D and obtain free-viewpoint renderings with arbitrary camera models, which allows the integration of our renderings into synthetic 3D scenes (d). We inherit the high-quality semantic editing capabilities of StyleGAN (smile or aging, (e)) and enable stereoscopic rendering (f). Our method can be integrated into any rendering pipeline and, for the first time, marries generative image modeling with traditional rendering.

News

March 2021: Our codebase has been upgraded! It now also contains:

  • COLMAP support for camera calibration and geometry reconstruction,
  • OpenGL-based training on headless machines,
  • Support for Windows and Linux,
  • Minor fixes and improvements.

Abstract

Current Generative Adversarial Networks (GANs) produce photorealistic renderings of portrait images. Embedding real images into the latent space of such models enables high-level image editing. While recent methods provide considerable semantic control over the (re-)generated images, they can only generate a limited set of viewpoints and cannot explicitly control the camera. Such 3D camera control is required for 3D virtual and mixed reality applications.

In our solution, we use a few images of a face to perform 3D reconstruction, and we introduce the notion of the GAN camera manifold, the key element that allows us to precisely define the range of images the GAN can reproduce in a stable manner. We train a small, face-specific neural implicit representation network to map a captured face to this manifold and complement it with a warping scheme to obtain free-viewpoint novel-view synthesis. We show how our approach, thanks to its precise camera control, enables the integration of a pre-trained StyleGAN into standard 3D rendering pipelines, allowing, e.g., stereo rendering or the consistent insertion of faces in synthetic 3D environments. Our solution provides the first truly free-viewpoint rendering of realistic faces at interactive rates, using only a small number of casual photos as input, while simultaneously allowing semantic editing capabilities such as facial expression or lighting changes.
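To make the camera-manifold idea concrete, the minimal sketch below projects an arbitrary target camera onto a set of reproducible cameras. The parameterization (azimuth, elevation, distance around the face) and the bounds are illustrative assumptions for this sketch, not the exact formulation or values used in the paper or the codebase.

import numpy as np

# Hypothetical parameter ranges the GAN can reproduce; not values from the paper.
MANIFOLD_BOUNDS = np.array([
    [-0.6, 0.6],   # azimuth (radians): camera left/right of the face
    [-0.3, 0.3],   # elevation (radians): camera below/above the face
    [ 0.8, 2.0],   # distance to the face (scene units)
])

def camera_to_params(position, face_center):
    """Express a camera position as (azimuth, elevation, distance) around the face."""
    offset = position - face_center
    distance = np.linalg.norm(offset)
    azimuth = np.arctan2(offset[0], offset[2])
    elevation = np.arcsin(offset[1] / distance)
    return np.array([azimuth, elevation, distance])

def project_to_manifold(params):
    """Snap camera parameters to the closest point inside the manifold bounds."""
    return np.clip(params, MANIFOLD_BOUNDS[:, 0], MANIFOLD_BOUNDS[:, 1])

# A target camera outside the manifold is replaced by its closest manifold camera.
target = camera_to_params(np.array([1.5, 0.2, 1.0]), face_center=np.zeros(3))
manifold_camera = project_to_manifold(target)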

Video

Method

Overview of our method.

Observing that the StyleGAN portrait model can only synthesize a limited range of views, we define a camera manifold which models the corresponding subspace of camera parameters (right). Images from cameras on the manifold (top left) can be generated using latent manipulations. To move away from the manifold, we render a flow field (bottom left) to warp the manifold view, obtaining free-view rendering (center). The flow field is often parallax-free, as perspective effects are already generated in the manifold view.
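The warp itself is a standard backward warp driven by a dense flow field; the sketch below illustrates only that step, assuming the flow has already been obtained by rendering the coarse proxy mesh into both the manifold camera and the target camera. Function and variable names are illustrative, not taken from the codebase.

import numpy as np
from scipy.ndimage import map_coordinates

def warp_with_flow(manifold_image, flow):
    """Backward-warp the GAN image rendered at the manifold camera to the target view.

    manifold_image: (H, W, 3) image generated by StyleGAN at the manifold camera.
    flow: (H, W, 2) field; flow[y, x] is the (dy, dx) offset into the manifold image
          that target pixel (y, x) should sample from.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    src = [ys + flow[..., 0], xs + flow[..., 1]]
    channels = [map_coordinates(manifold_image[..., c], src, order=1, mode="nearest")
                for c in range(manifold_image.shape[-1])]
    return np.stack(channels, axis=-1)

# With a zero (identity) flow, the warp leaves the manifold view unchanged.
image = np.random.rand(256, 256, 3).astype(np.float32)
assert np.allclose(warp_with_flow(image, np.zeros((256, 256, 2))), image, atol=1e-5)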

Free-view Camera Control

Our StyleGAN portrait renderings are generated using physically meaningful cameras. They can therefore be combined with other rendering techniques – path tracing in this case.
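The sketch below illustrates what a physically meaningful camera buys us: a single pinhole camera (look-at extrinsics plus a vertical field of view), written here as standard OpenGL-style matrices, can be handed both to a path tracer and to the face renderer so the two images line up when composited. The helper names and conventions are illustrative assumptions, not part of our codebase.

import numpy as np

def look_at(eye, target, up=np.array([0.0, 1.0, 0.0])):
    """World-to-camera (view) matrix for a camera at `eye` looking at `target`."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up); right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    view = np.eye(4)
    view[0, :3], view[1, :3], view[2, :3] = right, true_up, -forward
    view[:3, 3] = -view[:3, :3] @ eye
    return view

def perspective(fov_y_deg, aspect, near=0.1, far=100.0):
    """Standard OpenGL-style perspective projection matrix."""
    f = 1.0 / np.tan(np.radians(fov_y_deg) / 2.0)
    proj = np.zeros((4, 4))
    proj[0, 0] = f / aspect
    proj[1, 1] = f
    proj[2, 2] = (far + near) / (near - far)
    proj[2, 3] = 2 * far * near / (near - far)
    proj[3, 2] = -1.0
    return proj

# The same view/projection pair drives both the path tracer and the face rendering.
view = look_at(eye=np.array([0.0, 1.6, 2.0]), target=np.array([0.0, 1.6, 0.0]))
proj = perspective(fov_y_deg=40.0, aspect=16 / 9)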

View-consistent Editing

We inherit all semantic editing capabilities from StyleGAN and use the method of Härkönen et al. [2020] to demonstrate view-consistent portrait manipulations.
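As a rough illustration of why such edits stay view-consistent, the sketch below applies a single GANSpace-style direction in latent space with a fixed strength before rendering every camera. The renderer call is a placeholder for the actual interface, and the W+ latent shape is the usual StyleGAN dimension, assumed here for illustration only.

import numpy as np

def edit_latent(w, direction, strength):
    """Shift a W+ latent (num_layers, 512) along a unit-normalized edit direction."""
    return w + strength * direction / np.linalg.norm(direction)

def render_edited_sequence(render_view, w, cameras, direction, strength):
    """Apply one fixed semantic edit, then render all target cameras with the edited latent."""
    w_edited = edit_latent(w, direction, strength)
    return [render_view(w_edited, camera) for camera in cameras]

# Placeholder renderer and data, just to make the sketch executable.
render_view = lambda w, camera: np.zeros((1024, 1024, 3))
w = np.random.randn(18, 512)
smile_direction = np.random.randn(18, 512)   # stands in for a GANSpace PCA direction
frames = render_edited_sequence(render_view, w, [None, None, None], smile_direction, strength=2.0)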

BibTeX

@article{FreeStyleGAN2021,
	author    = {Thomas Leimk\"uhler and George Drettakis},
	title     = {FreeStyleGAN: Free-view Editable Portrait Rendering with the Camera Manifold},
	journal   = {ACM Transactions on Graphics (SIGGRAPH Asia)},
	publisher = {ACM},
	volume    = {40},
	number    = {6},
	year      = {2021},
	doi       = {10.1145/3478513.3480538}
}

Acknowledgments and Funding

This research was funded by the ERC Advanced grant FUNGRAPH No 788065. The authors are grateful to the OPAL infrastructure from Université Côte d'Azur for providing resources and support. The authors thank Ayush Tewari, Ohad Fried, and Siddhant Prakash for help with comparisons, Adrien Bousseau, Ayush Tewari, Julien Philip, Miika Aittala, and Stavros Diolatzis for proofreading earlier drafts, the anonymous reviewers for their valuable feedback, and all participants who helped capture the face datasets.