# Learning Temporal Coherence via Self-Supervision for GAN-based Video Generation

## The Supplemental Web-Page

This document contains the following chapters:
1. Overview
2. Results
3. Evaluations
4. Loss Ablation Study
5. Spatio-temporal Adversarial Equilibrium Analysis
6. tOF Visualization
7. Triplet Visualization

## 1. Overview

#### For VSR, even under-resolved structures in the input can lead to realistic and coherent outputs with the help of our temporal supervision.

Bridge scene from Tears of Steel. Left: Low-Resolution Input, right: TecoGAN Output.

#### For UVT, our model can learn temporal and spatial consistency simultaneously. It generates realistic details that change naturally over time.

(Input) Trump to Obama (TecoGAN Output)

(Input) Obama to Trump (TecoGAN Output)

(Input) LR Smoke to HR Smoke (TecoGAN Output)

(Input) Smoke Simulations to Real Captures (TecoGAN)

## 2. Results

#### Face comparison with previous work (CycleGAN and RecycleGAN, Fig.8 of the paper, 0.5x speed)

Inputs
CycleGAN
RecycleGAN
STC-V2V [Park et al.]
TecoGAN

## 5. Spatio-temporal Adversarial Equilibrium Analysis

#### Input Ablation for UVT Dst (Fig.8 and Sec.4 in the paper, 0.5x speed)

Baseline: 3 original frames
vid2vid [Wang2018] variant: 3 original frames + estimated motions
Concat version: 3 original frames + 3 warped frames
TecoGAN: {current frame}x3 OR 3 warped frames OR 3 original frames
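
The four discriminator input variants labeled above differ only in what is stacked along the channel axis. The following is a minimal sketch of that difference, not the authors' implementation. The names `make_frame`, `concat_channels`, and `discriminator_input` are hypothetical, frames are plain nested lists, and two points are assumptions not stated in the labels: that the "current frame" is the middle frame of the triplet, and that the motion variant receives one 2-channel flow field per neighboring frame.

```python
from typing import List, Sequence

Plane = List[List[float]]  # one H x W channel
Frame = List[Plane]        # channels of one frame (3 for RGB, 2 for a flow field)


def make_frame(channels: int, height: int, width: int, value: float = 0.0) -> Frame:
    """Create a constant dummy frame, used here only for illustration."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def concat_channels(*groups: Sequence[Frame]) -> Frame:
    """Stack the channels of every frame in every group into one input."""
    stacked: Frame = []
    for group in groups:
        for frame in group:
            stacked.extend(frame)
    return stacked


def discriminator_input(variant, original, warped=(), flows=(), triplet="original"):
    """Assemble the discriminator input for one of the four labeled variants."""
    if variant == "baseline":   # 3 original frames
        return concat_channels(original)
    if variant == "vid2vid":    # 3 original frames + estimated motions
        return concat_channels(original, flows)
    if variant == "concat":     # 3 original frames + 3 warped frames
        return concat_channels(original, warped)
    if variant == "tecogan":    # exactly one triplet type at a time
        choices = {
            "static": [original[1]] * 3,  # assumption: middle frame is "current"
            "warped": warped,
            "original": original,
        }
        return concat_channels(choices[triplet])
    raise ValueError(f"unknown variant: {variant}")
```

Under these assumptions and with RGB frames, the baseline and every TecoGAN triplet type yield 9 channels, the motion variant 13, and the concat version 18. The TecoGAN variant therefore keeps the baseline input size and changes only which triplet type is fed in.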

## 7. Triplet Visualization

#### The following videos provide an intuition for why the proposed curriculum learning for UVT is important to leverage our spatio-temporal discriminators. The different types of triplets differ in what information they provide for learning.

Original triplets

#### Here we show 3 looped original frames to visualize the motion contained in unwarped triplets. On the left and in the middle, we show the generated original triplets of the DsOnly model and the TecoGAN model. On the right, we show the original triplet of selected ground-truth frames. These examples show that the original triplets contain complex spatial and temporal information, which leads to a very challenging learning task: classifying temporal changes as natural versus fake.

Warped triplets

#### Next, we highlight several regions in the corresponding warped triplets. In the warped triplets, the eyebrows, eyes, noses and mouths are better aligned. This makes the job easier for the discriminators by supplying information “in-place” wherever possible. However, since the flow estimator F is mainly trained to align natural images and the estimated motion fields contain approximation errors, the unnatural collar motion (green box) of the DsOnly results cannot be compensated by the warping. With better alignment, such subtle changes are easier to detect. The same holds for the jittering of the hair (red box) and the flickering motion on the cheek (yellow box) in the DsOnly results. These artifacts are successfully removed in the TecoGAN results.

#### Below, the sample from above is shown on the left as A and a new sample B on the right, for all triplet variants (from left to right: original, warped, and static triplets):

Columns: Original A, Warped A, Static A, Original B, Warped B, Static B
Rows: DsOnly Generated Triplets, TecoGAN Generated Triplets, Ground-Truth Selected Triplets
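
The original, warped, and static triplets discussed in this chapter can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's code. The helpers `warp` and `build_triplets` are hypothetical names, images are grayscale nested lists, the flow is assumed to be a backward flow that gives, for each output pixel, the offset `(dx, dy)` to sample from in the neighboring frame, and nearest-neighbor sampling with border clamping stands in for whatever interpolation the actual flow estimator F is paired with.

```python
from typing import Dict, List, Tuple

Image = List[List[float]]               # grayscale H x W image
Flow = List[List[Tuple[float, float]]]  # per-pixel (dx, dy) sampling offsets


def warp(image: Image, flow: Flow) -> Image:
    """Backward-warp an image: output[y][x] = image[y + dy][x + dx]."""
    height, width = len(image), len(image[0])
    out: Image = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = flow[y][x]
            # Nearest-neighbor lookup, clamped to the image border.
            sx = min(max(int(round(x + dx)), 0), width - 1)
            sy = min(max(int(round(y + dy)), 0), height - 1)
            row.append(image[sy][sx])
        out.append(row)
    return out


def build_triplets(prev: Image, cur: Image, nxt: Image,
                   flow_prev: Flow, flow_next: Flow) -> Dict[str, List[Image]]:
    """Return the three triplet variants for one current frame."""
    return {
        # Unaligned frames: full spatial and temporal complexity.
        "original": [prev, cur, nxt],
        # Neighbors aligned to the current frame: content is "in-place".
        "warped": [warp(prev, flow_prev), cur, warp(nxt, flow_next)],
        # Current frame repeated: no temporal change at all.
        "static": [cur, cur, cur],
    }
```

For a single bright pixel that moves one step to the right per frame and a flow that matches this motion exactly, both warped neighbors coincide with the current frame. Any remaining difference inside a warped triplet then comes from flow errors or from changes the flow cannot explain, which is the kind of residual highlighted by the colored boxes above.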