Bridge scene from Tears of Steel. Left: low-resolution input; right: TecoGAN output.
(Input) Trump to Obama (TecoGAN Output)
(Input) Obama to Trump (TecoGAN Output)
(Input) LR Smoke to HR Smoke (TecoGAN Output)
(Input) Smoke Simulations to Real Captures (TecoGAN)
| VSR | Calendar | | | Foliage | | |
|---|---|---|---|---|---|---|
| Methods | FRVSR | DUF | TecoGAN | FRVSR | DUF | TecoGAN |
| PSNR↑ | 23.94 | 24.16 | 23.28 | 26.35 | 26.45 | 24.26 |
| LPIPS↓ | 0.2976 | 0.3074 | 0.1515 | 0.3242 | 0.3492 | 0.1902 |
| tLP↓ | 0.01064 | 0.01596 | 0.0178 | 0.01644 | 0.02034 | 0.00894 |
| tOF↓ | 0.1552 | 0.1146 | 0.1357 | 0.1489 | 0.1356 | 0.1238 |
| UVT [tLP↓, tOF↓] | Trump→Obama | Obama→Trump | AVG |
|---|---|---|---|
| CycleGAN | [0.0176, 0.7727] | [0.0277, 1.1841] | [0.0234, 0.9784] |
| RecycleGAN | [0.0111, 0.8705] | [0.0248, 1.1237] | [0.0179, 0.9971] |
| TecoGAN | [0.0120, 0.6155] | [0.0191, 0.7670] | [0.0156, 0.6913] |
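For reference, the temporal metrics in the two tables above can be read as follows: tOF compares the motion estimated between consecutive frames of a result with the motion estimated between the corresponding frames of the reference sequence, and tLP compares the LPIPS perceptual change between consecutive frames in the same way. The sketch below is a minimal approximation of these metrics, assuming OpenCV's Farneback flow and the `lpips` package; the exact flow method, color handling, and normalization of the official evaluation may differ.

```python
# Minimal sketch of the temporal metrics (assumed definitions):
#   tOF = mean | OF(g_{t-1}, g_t) - OF(b_{t-1}, b_t) |
#   tLP = mean | LPIPS(g_{t-1}, g_t) - LPIPS(b_{t-1}, b_t) |
# where g are generated frames and b are reference frames.
import cv2
import numpy as np
import torch
import lpips  # pip install lpips

lpips_net = lpips.LPIPS(net="alex")

def to_tensor(frame_uint8):
    """HxWx3 uint8 RGB -> 1x3xHxW float tensor in [-1, 1], as expected by lpips."""
    t = torch.from_numpy(frame_uint8).permute(2, 0, 1).float() / 127.5 - 1.0
    return t.unsqueeze(0)

def flow(a, b):
    """Dense Farneback flow between two uint8 RGB frames (stand-in flow method)."""
    ga = cv2.cvtColor(a, cv2.COLOR_RGB2GRAY)
    gb = cv2.cvtColor(b, cv2.COLOR_RGB2GRAY)
    return cv2.calcOpticalFlowFarneback(ga, gb, None, 0.5, 3, 15, 3, 5, 1.2, 0)

def temporal_metrics(gen_frames, ref_frames):
    """Average tOF and tLP over two sequences (lists of HxWx3 uint8 arrays)."""
    tof, tlp = [], []
    for t in range(1, len(gen_frames)):
        f_gen = flow(gen_frames[t - 1], gen_frames[t])
        f_ref = flow(ref_frames[t - 1], ref_frames[t])
        tof.append(np.abs(f_gen - f_ref).mean())
        d_gen = lpips_net(to_tensor(gen_frames[t - 1]), to_tensor(gen_frames[t])).item()
        d_ref = lpips_net(to_tensor(ref_frames[t - 1]), to_tensor(ref_frames[t])).item()
        tlp.append(abs(d_gen - d_ref))
    return float(np.mean(tof)), float(np.mean(tlp))
```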
The UVT task yields similar conclusions to the VSR ablation above: while DsOnly can improve temporal coherence by relying on the frame-recurrent input, the temporal adversarial learning in Dst and TecoGAN is the key to correct spatio-temporal cycle consistency. Without the PP loss, Dst shows undesirable smoke in empty regions. The full TecoGAN model avoids such artificial temporal accumulation. (0.5x speed)
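The PP term referenced above is the ping-pong loss, which counteracts exactly this kind of temporal accumulation: the frame-recurrent generator is unrolled over a forward-backward ("ping-pong") ordering of the input sequence, and the outputs of the two directions are pulled together with an L2 term. The following is a minimal PyTorch sketch under that reading; `generator_step` and `init_prev` are hypothetical stand-ins for one recurrent generator call and its initial state.

```python
import torch

def ping_pong_loss(frames, generator_step, init_prev):
    """Schematic ping-pong loss (hypothetical interface).

    frames:         list of input frames [x_1, ..., x_n]
    generator_step: callable (x_t, prev_output) -> g_t  (one recurrent call)
    init_prev:      initial previous-output tensor (e.g. zeros)
    """
    # Build the ping-pong ordering x_1,...,x_n,...,x_1 and unroll the recurrent
    # generator over it once, so the backward half continues from the forward state.
    pp_inputs = frames + frames[-2::-1]
    outputs, prev = [], init_prev
    for x in pp_inputs:
        prev = generator_step(x, prev)
        outputs.append(prev)

    n = len(frames)
    fwd = outputs[:n]              # forward results g_1, ..., g_n
    bwd = outputs[n - 1:][::-1]    # backward results g'_1, ..., g'_n (re-aligned)

    # L2 consistency between forward and backward results for each frame,
    # which suppresses artifacts that accumulate over long roll-outs.
    return sum(torch.mean((f - b) ** 2) for f, b in zip(fwd, bwd))
```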
The UVT DsDtPP model contains two generators and four discriminator networks, and is very difficult to balance in practice. By weighting the temporal adversarial losses from Dt with 0.6 and the spatial one from Ds with 1.0, the DsDtPP model can yield performance similar to the Dst model (shown on the right). The proposed Dst architecture is the better choice in practice, as it learns a natural balance of temporal and spatial components by itself and requires fewer resources.
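As an illustration of the manual balancing described above, a generator-side combination of the two adversarial terms could look like the sketch below, with the spatial loss from Ds weighted 1.0 and the temporal loss from Dt weighted 0.6. The function and variable names are placeholders rather than the exact terms of the released code.

```python
import torch
import torch.nn.functional as F

# Hypothetical weights matching the balancing described above.
W_SPATIAL = 1.0   # adversarial loss from the spatial discriminator Ds
W_TEMPORAL = 0.6  # adversarial loss from the temporal discriminator Dt

def generator_adv_loss(ds_logits_fake, dt_logits_fake):
    """Generator-side GAN loss combining the spatial (Ds) and temporal (Dt)
    discriminator outputs with the manually tuned weights."""
    loss_spatial = F.binary_cross_entropy_with_logits(
        ds_logits_fake, torch.ones_like(ds_logits_fake))
    loss_temporal = F.binary_cross_entropy_with_logits(
        dt_logits_fake, torch.ones_like(dt_logits_fake))
    return W_SPATIAL * loss_spatial + W_TEMPORAL * loss_temporal
```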
DsOnly also shows coherent but undesirable motion, e.g. on Obama's right collar. His eyes barely blink, indicating that spatio-temporal cycle consistency cannot be properly established without temporal supervision.
Original triplets

Warped triplets
Next, we highlight several regions of the corresponding warped triplets. In the warped triplets, the eyebrows, eyes, noses, and mouths are better aligned. This makes the discriminators' job easier by supplying information "in-place" wherever possible. However, since the flow estimator F is mainly trained to align natural images and the estimated motion fields contain approximation errors, the unnatural collar motion (green box) in the DsOnly results cannot be compensated. With this better alignment, such subtle changes become easier to detect. The same holds for the jittering of the hair (red box) and the flickering motion on the cheek (yellow box) in the DsOnly results. These artifacts are successfully removed in the TecoGAN results.
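As a rough sketch of how such a warped triplet can be assembled, the previous and next frames can be warped toward the middle frame with the motion fields from the flow estimator F before the three frames are stacked for the discriminator. The PyTorch snippet below assumes a `flow_estimator(reference, source)` returning per-pixel displacements in pixels; the warping itself is standard bilinear backward warping via `grid_sample`.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Warp `frame` (N,C,H,W) toward the reference view using a dense flow
    field (N,2,H,W) given in pixels; bilinear sampling via grid_sample."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij")
    # Displaced sampling positions, normalized to [-1, 1] for grid_sample.
    gx = (xs.unsqueeze(0) + flow[:, 0]) / (w - 1) * 2 - 1
    gy = (ys.unsqueeze(0) + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack((gx, gy), dim=-1)  # (N,H,W,2)
    return F.grid_sample(frame, grid, mode="bilinear", align_corners=True)

def warped_triplet(prev_f, cur_f, next_f, flow_estimator):
    """Align the previous and next frames to the middle frame with the
    estimated motion fields, then stack the triplet along the channel axis
    so the discriminator sees spatially aligned ("in-place") information."""
    flow_prev = flow_estimator(cur_f, prev_f)  # motion from current to previous
    flow_next = flow_estimator(cur_f, next_f)  # motion from current to next
    prev_w = backward_warp(prev_f, flow_prev)
    next_w = backward_warp(next_f, flow_next)
    return torch.cat([prev_w, cur_f, next_w], dim=1)
```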
| | Original A | Warped A | Static A | Original B | Warped B | Static B |
|---|---|---|---|---|---|---|
| DsOnly Generated Triplets | | | | | | |
| TecoGAN Generated Triplets | | | | | | |
| Ground-Truth Selected Triplets | | | | | | |