Conference on Computer Vision and Pattern Recognition (CVPR 2022)

Authors
You Xie, Technical University of Munich
Huiqi Mao, National University of Singapore
Angela Yao, National University of Singapore
Nils Thuerey, Technical University of Munich

Abstract

We propose a novel approach to generate temporally coherent UV coordinates for loose clothing. Our method is not constrained by human body outlines and can capture loose garments and hair. We implemented a differentiable pipeline to learn UV mapping between a sequence of RGB inputs and textures via UV coordinates. Instead of treating the UV coordinates of each frame separately, our data generation approach connects all UV coordinates via feature matching for temporal stability. Subsequently, a generative model is trained to balance the spatial quality and temporal stability. It is driven by supervised and unsupervised losses in both UV and image spaces. Our experiments show that the trained models output high-quality UV coordinates and generalize to new poses. Once a sequence of UV coordinates has been inferred by our model, it can be used to flexibly synthesize new looks and modified visual styles. Compared to existing methods, our approach reduces the computational workload to animate new outfits by several orders of magnitude.

Links
Preprint
Code
Video

Motivation

Example mapping from $I_t$ to $T_t$ via $P^r_t$, and back to $I'_t$. $P^r_t$ cannot fully recover the image and misses the skirt, hair, and shoulder regions. In addition, colors inside the body region are partially incorrect. To improve the spatial and temporal quality of UV coordinate maps, we first implement a differentiable pipeline that learns the UV mapping between a sequence of RGB inputs and textures via UV coordinates.
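The differentiable core of this pipeline is the texture lookup $\mathcal{W}(T, P)$ that reconstructs an image from a texture map and per-pixel UV coordinates. Below is a minimal PyTorch sketch of such a lookup; the tensor shapes, the $[0,1]$ UV convention, and the name warp_texture are our own illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn.functional as F

def warp_texture(texture, uv):
    """Differentiable lookup W(T, P): sample texture T at per-pixel UV coordinates P.

    texture: (B, C, Ht, Wt) texture map T
    uv:      (B, 2, H, W)   UV coordinates in [0, 1]; channel 0 = u (width), 1 = v (height)
    returns: (B, C, H, W)   reconstruction, e.g. I'_t = W(T_t, P_t)
    """
    # grid_sample expects a (B, H, W, 2) grid with coordinates in [-1, 1]
    grid = uv.permute(0, 2, 3, 1) * 2.0 - 1.0
    return F.grid_sample(texture, grid, mode="bilinear", align_corners=True)

# toy usage with random data
texture = torch.rand(1, 3, 256, 256)
uv = torch.rand(1, 2, 512, 512)
image = warp_texture(texture, uv)  # (1, 3, 512, 512)
```

Because gradients flow through both the texture and the UV coordinates, the same lookup can serve for UV optimization and for image-space training losses.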

UV Extension & Optimization

a) Raw UV coordinates, b) after applying the UV extension, and c) after optimization. The UV extension maps missing parts such as the dress into the correct regions of $T_t$, while the UV optimization brings $I'_t$ closer to $I_t$.
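As a rough illustration of the optimization step, the UV coordinates can be refined by gradient descent through the lookup sketched above so that the reconstruction matches the input frame. This is only a sketch with an L1 photometric term and placeholder tensors; the actual objective and regularizers of the paper are not reproduced here.

```python
# sketch: refine UVs so that W(T_t, P) matches I_t, reusing warp_texture from above
image_t   = torch.rand(1, 3, 512, 512)                    # placeholder for I_t
texture_t = torch.rand(1, 3, 256, 256)                    # placeholder for T_t
uv_opt    = torch.rand(1, 2, 512, 512, requires_grad=True)  # extended UVs to refine

optimizer = torch.optim.Adam([uv_opt], lr=1e-3)
for step in range(500):
    optimizer.zero_grad()
    recon = warp_texture(texture_t, uv_opt)               # I'_t
    loss = F.l1_loss(recon, image_t)                      # photometric term only
    loss.backward()
    optimizer.step()
```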

UV Temporal Relocation

Overview and results of temporal UV generation. a) Approximate feature matching is achieved via the optical flow (OF) from $T_o$ to $T_t$. b) RGB matching is applied to correct coordinates affected by errors in the OF. Images in c) are generated with $P^o_t$ and $T_o$, i.e., $I'_{t,T_o} = \mathcal{W}(T_o, \omega_{I}(P^o_t))$. Images in d) are generated analogously with $P^f_t$. Green and blue arrows track the two patterns across the images. After the temporal relocation step, the results are noticeably more temporally coherent.
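One way to realize the relocation is to treat the matching result as a dense offset field on the texture and to look it up at each pixel's UV position, again with the differentiable warp from above. The flow direction, normalization, and names below are illustrative assumptions, not the exact procedure of the paper.

```python
def relocate_uv(uv_t, flow_t_to_o):
    """Relocate per-frame UVs so they index a single reference texture T_o.

    uv_t:        (B, 2, H, W)   UVs into the per-frame texture T_t, in [0, 1]
    flow_t_to_o: (B, 2, Ht, Wt) correspondences from T_t to T_o, as normalized offsets
    """
    offset = warp_texture(flow_t_to_o, uv_t)   # sample the offset at each UV location
    return (uv_t + offset).clamp(0.0, 1.0)     # P^o_t: UVs now indexing T_o
```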

Model Training

A generative model is trained to balance spatial quality and temporal stability. It is driven by supervised and unsupervised losses in both UV and image space. The trained model $G$ generates temporally coherent UV coordinates that capture loose clothing from off-the-shelf human-pose UV estimates such as SMPL and DensePose.
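A rough sketch of such a training objective is given below, assuming a toy stand-in for the generator $G$, reference UVs from the data-generation stage, and the warp_texture lookup defined above; the temporal and adversarial terms of the actual loss are omitted, and all data is random placeholder content.

```python
import torch.nn as nn

# toy stand-in for G: maps raw DensePose UVs (2 ch) to completed UVs (2 ch)
G = nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 2, 3, padding=1))
opt = torch.optim.Adam(G.parameters(), lr=1e-4)
lambda_img = 1.0                                   # illustrative weight

# placeholder batch: raw UVs, reference UVs, ground-truth frame, reference texture
p_raw, uv_ref = torch.rand(1, 2, 512, 512), torch.rand(1, 2, 512, 512)
image_gt, texture_o = torch.rand(1, 3, 512, 512), torch.rand(1, 3, 256, 256)

for step in range(100):
    uv_pred = torch.sigmoid(G(p_raw))              # completed UVs in [0, 1]
    loss_uv  = F.l1_loss(uv_pred, uv_ref)          # supervised term in UV space
    loss_img = F.l1_loss(warp_texture(texture_o, uv_pred), image_gt)  # image-space term
    loss = loss_uv + lambda_img * loss_img         # temporal / adversarial terms omitted
    opt.zero_grad()
    loss.backward()
    opt.step()
```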

Results

Comparisons between DensePose UVs $P^r_t$ and optimized UVs $P^o_t$. Here, we only show the skirt region of $T_t$ to highlight the differences. $P^o_t$ preserves most of the skirt information in $T_t$, and the $I'_t$ reconstructed from $P^o_t$ is closer to $I_t$ than the one from $P^r_t$. The quantitative evaluation likewise shows that our results after UV optimization are closer to the reference.
Generated UV coordinates allow us to recover entire sequences from a constant texture map. Virtual try-on and modifications of the look can be easily achieved with minimal computation via a simple lookup.
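In code, the try-on step amounts to re-running the lookup with a different texture; no per-frame optimization or network evaluation is required. The texture and UV sequence below are placeholders, and warp_texture is the sketch from above.

```python
new_texture = torch.rand(1, 3, 256, 256)                      # e.g. a new outfit texture
uv_sequence = [torch.rand(1, 2, 512, 512) for _ in range(8)]  # inferred UVs per frame
frames = [warp_texture(new_texture, uv) for uv in uv_sequence]
```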

To summarize, our main contributions are:

  • a model-agnostic method to extend UV coordinates to capture the complete appearance of the human body,
  • an approach to train neural networks that generate completed and temporally coherent UV coordinates without the need for ground truth, and
  • a highly efficient way to generate virtual try-on videos with arbitrary clothing styles and textures.