0. Introduction
- Tasks:
  - Image Synthesis: the task of creating new images from some form of image description.
 
1. GAN (2014)
- Introduction: a new framework for estimating generative models via an adversarial process.
- Method: simultaneously train two models: a generative model $G$ that captures the data distribution, and a discriminative model $D$ that estimates the probability that a sample came from the training data rather than from $G$.
$$ \min_{G} \max_{D} V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1-D(G(z)))] $$
- In terms of the discriminator: $D(x)$ should be 1 and $D(G(z))$ should be 0, so $D$ is trained to maximize both terms.
- In terms of the generator: the first term can be ignored since it is independent of $G$, and $D(G(z))$ should be 1, so $G$ is trained to minimize the second term.
- In terms of implementation: we need two optimizers, one for $G$ and one for $D$, and G_loss and D_loss are defined separately (a minimal sketch follows).
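A minimal PyTorch sketch of this two-optimizer setup, assuming a generator `G` and a discriminator `D` (both `nn.Module`s, with `D` outputting a probability), a `dataloader` of real images, and a noise dimension `z_dim`; the generator step uses the common non-saturating form (maximize $\log D(G(z))$ instead of minimizing $\log(1-D(G(z)))$):

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()                                   # binary cross-entropy on D's probability output
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)    # one optimizer for the discriminator
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)    # one optimizer for the generator

for real in dataloader:                              # real: a batch of training images
    z = torch.randn(real.size(0), z_dim)
    fake = G(z)

    # Discriminator step: push D(x) toward 1 and D(G(z)) toward 0
    D_loss = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(real.size(0), 1))
    opt_D.zero_grad(); D_loss.backward(); opt_D.step()

    # Generator step: push D(G(z)) toward 1
    G_loss = bce(D(fake), torch.ones(real.size(0), 1))
    opt_G.zero_grad(); G_loss.backward(); opt_G.step()
```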
2. Conditional GAN (2014)
- Introduction: the conditional version of generative adversarial nets, constructed by simply feeding the data $y$ we wish to condition on to both the generator and the discriminator.
- Method: feed $y$ into both the discriminator and the generator as an additional input layer.
$$ \min_{G} \max_{D} V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x,y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1-D(G(z,y),y))] $$
 
# Pseudocode: FC, relu, lrelu, tanh, sigmoid and concat stand for standard framework ops.
def generator(z, y):
  # concatenate the noise vector z with the condition y (e.g. a one-hot label)
  input = concat([z, y], 1)
  layer1 = relu(FC(input, 128))
  layer2 = tanh(FC(layer1, 784))    # 784 = 28x28 image, scaled to [-1, 1]
  return layer2
def discriminator(x, y):
  # the discriminator is conditioned on y as well
  input = concat([x, y], 1)
  layer1 = lrelu(FC(input, 128))
  layer2 = sigmoid(FC(layer1, 1))   # sigmoid (not tanh) so the output is a probability
  return layer2
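For a class label, the condition $y$ is typically a one-hot vector concatenated with the noise (generator) or the image (discriminator); an illustrative call against the pseudocode above, with an assumed batch size of 64 and MNIST dimensions:

```python
import torch
import torch.nn.functional as F

z = torch.randn(64, 100)                         # 64 noise vectors
labels = torch.randint(0, 10, (64,))             # desired digit classes
y = F.one_hot(labels, num_classes=10).float()    # 64 x 10 condition vectors
fake_images = generator(z, y)                    # 64 x 784, one image per requested class
```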
3. DCGAN (2016)
- Introduction: a family of convolutional GANs with architectural constraints that make them stable to train in most settings.
- Method: the following techniques were used for stable deep convolutional GANs (a minimal sketch follows the list).
  - Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
  - Use batchnorm in both the generator and the discriminator.
  - Remove fully connected hidden layers for deeper architectures.
  - Use ReLU in the generator for all layers except the output, which uses Tanh.
  - Use LeakyReLU in the discriminator for all layers.
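A minimal PyTorch sketch of these guidelines for 32x32 RGB images (the channel widths, kernel sizes, and image resolution are assumptions, not the paper's exact architecture):

```python
import torch.nn as nn

# Generator: fractional-strided (transposed) convolutions, batchnorm, ReLU, Tanh output, no pooling/FC layers
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),  # z (100x1x1) -> 4x4
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),  # 4x4 -> 8x8
    nn.ConvTranspose2d(128, 64, 4, 2, 1),  nn.BatchNorm2d(64),  nn.ReLU(),  # 8x8 -> 16x16
    nn.ConvTranspose2d(64, 3, 4, 2, 1),    nn.Tanh())                       # 16x16 -> 32x32 RGB

# Discriminator: strided convolutions, batchnorm, LeakyReLU, no pooling/FC layers
discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, 2, 1),   nn.LeakyReLU(0.2),                         # 32x32 -> 16x16
    nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),    # 16x16 -> 8x8
    nn.Conv2d(128, 1, 8, 1, 0),  nn.Sigmoid())                              # 8x8 -> 1x1 real/fake probability
```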
   
4. BEGAN (2017)
- Introduction: a new equilibrium-enforcing method, paired with a loss derived from the Wasserstein distance, for training auto-encoder based GANs.
- Method (a minimal sketch of the update follows the equations):
  - Use an auto-encoder as the discriminator, as was first proposed in EBGAN.
  - Aim to match auto-encoder loss distributions using a loss derived from the Wasserstein distance, while typical GANs try to match data distributions directly.
  - This is done using a typical GAN objective with the addition of an equilibrium term to balance the discriminator and the generator.
 
$$ L_D = L(x) - k_t \, L(G(z_D)) $$
$$ L_G = L(G(z_G)) $$
$$ k_{t+1} = k_t + \lambda \, (\gamma L(x) - L(G(z_G))) $$
where $L(v)$ is the pixel-wise auto-encoder reconstruction loss $|v - D(v)|$, $k_t \in [0, 1]$ controls how much weight is put on $L(G(z_D))$ during the discriminator update, and $\gamma$ is the target ratio between the generated and real reconstruction losses.
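A minimal PyTorch sketch of this update, assuming an auto-encoder discriminator `D`, a generator `G`, optimizers `opt_D`/`opt_G`, a `dataloader`, a noise dimension `z_dim`, and hyperparameters `gamma` (diversity ratio) and `lam` (the learning rate for $k_t$):

```python
import torch

def ae_loss(v):
    return (v - D(v)).abs().mean()          # L(v): L1 auto-encoder reconstruction loss

k = 0.0                                     # k_0 = 0
for real in dataloader:
    z_D = torch.randn(real.size(0), z_dim)
    z_G = torch.randn(real.size(0), z_dim)

    # Discriminator step: L_D = L(x) - k_t * L(G(z_D))
    L_real, L_fake_D = ae_loss(real), ae_loss(G(z_D).detach())
    D_loss = L_real - k * L_fake_D
    opt_D.zero_grad(); D_loss.backward(); opt_D.step()

    # Generator step: L_G = L(G(z_G))
    G_loss = ae_loss(G(z_G))
    opt_G.zero_grad(); G_loss.backward(); opt_G.step()

    # Equilibrium term: k_{t+1} = clip(k_t + lam * (gamma * L(x) - L(G(z_G))), 0, 1)
    k = min(max(k + lam * (gamma * L_real.item() - G_loss.item()), 0.0), 1.0)
```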
5. GIRAFFE (2021)
- Introduction: deep generative models allow for photorealistic image synthesis at high resolutions, but this is not enough: content creation also needs to be controllable, which calls for a 3D representation.
  - Key hypothesis: incorporating a compositional 3D scene representation into the generative model leads to more controllable image synthesis.
 
- GIRAFFE: generates scenes in a controllable and photorealistic manner without additional supervision.
- (3.1) Model objects as neural feature fields:
  - NeRF (Neural Radiance Fields): a function $f$ that maps a 3D point and a viewing direction to a volume density and an RGB color value.
    - Input: a single continuous 5D coordinate (spatial location $(x, y, z)$ and viewing direction $(\theta, \phi)$)
    - Output: the volume density and the view-dependent emitted radiance at that spatial location
    - Trained on 2D images of a scene captured from different views, the network can then render the scene from novel viewpoints.
 
- GRAF (Generative Radiance Fields): a generative (unsupervised) version of NeRF, trained from unposed image collections and conditioned on additional latent codes.
  - Input: encoded spatial location $\gamma(x)$, encoded viewing direction $\gamma(d)$, shape code $z_s$, appearance code $z_a$
 
- GIRAFFE: replaces GRAF's 3-dimensional color output with an $M$-dimensional feature vector (a minimal sketch of such a feature field follows).
- Represents each object with a separate feature field, combined with an affine transformation (scale, rotation, translation) that poses the object in the scene.
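A minimal sketch of such an object feature field: NeRF/GRAF-style inputs, but the output is a density plus an $M$-dimensional feature instead of an RGB color (the positional-encoding size, layer widths, latent dimensions, and conditioning-by-concatenation are assumptions):

```python
import torch
import torch.nn as nn

def positional_encoding(p, num_freqs=10):
    # gamma(p): per-coordinate sin/cos features at increasing frequencies
    feats = [fn(2.0 ** k * torch.pi * p) for k in range(num_freqs) for fn in (torch.sin, torch.cos)]
    return torch.cat(feats, dim=-1)

class FeatureField(nn.Module):
    def __init__(self, num_freqs=10, z_dim=64, feat_dim=128, hidden=128):
        super().__init__()
        in_dim = 2 * (3 * 2 * num_freqs) + 2 * z_dim     # gamma(x), gamma(d), z_shape, z_appearance
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1 + feat_dim))

    def forward(self, x, d, z_shape, z_app):             # x, d: (..., 3) location and viewing direction
        h = self.mlp(torch.cat([positional_encoding(x), positional_encoding(d), z_shape, z_app], dim=-1))
        sigma = torch.relu(h[..., :1])                   # volume density >= 0
        feat = h[..., 1:]                                # M-dimensional feature replacing NeRF's RGB
        return sigma, feat
```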
 
 
- (3.2) Scene compositions:
  - We describe scenes as compositions of $N$ entities, where the first $N-1$ are the objects in the scene and the last represents the background.
  - The individual fields are combined by summing their densities and taking the density-weighted mean of their features at each point (see the sketch below).
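A minimal sketch of this composition, assuming the $N$ fields have already been evaluated at the same sample points (shapes are assumptions):

```python
import torch

def compose(sigmas, feats, eps=1e-8):
    # sigmas: (N, P) densities, feats: (N, P, M) features for N entities at P points
    sigma = sigmas.sum(dim=0)                                   # total density: sum over entities
    feat = (sigmas.unsqueeze(-1) * feats).sum(dim=0) / (sigma.unsqueeze(-1) + eps)  # density-weighted mean
    return sigma, feat
```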
 
- (3.3) Scene rendering:
  - 3D volume rendering: for given camera extrinsics, maps the composited field evaluations along each camera ray to the pixel's final feature vector, producing a feature image (see the sketch after this list).
  - 2D neural rendering: a 2D CNN maps the feature image to the final synthesized image.
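A minimal sketch of the volume-rendering step for a single camera ray, using standard NeRF-style numerical quadrature over samples along the ray (shapes and sample spacing are assumptions):

```python
import torch

def render_ray(sigmas, feats, ts):
    # sigmas: (S,) densities, feats: (S, M) features, ts: (S,) sample depths along the ray
    deltas = torch.cat([ts[1:] - ts[:-1], torch.tensor([1e10])])                  # distances between samples
    alphas = 1.0 - torch.exp(-sigmas * deltas)                                    # per-segment opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)   # transmittance to each sample
    weights = trans * alphas
    return (weights.unsqueeze(-1) * feats).sum(dim=0)                             # pixel's final feature vector
```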
 
- Result: by representing scenes as compositional generative neural feature fields, GIRAFFE disentangles individual objects from the background, as well as their shapes and appearances, without explicit supervision. Combining this with a neural renderer yields fast and controllable image synthesis.
  - *Disentangle: commonly refers to being able to control an attribute of interest, e.g. object shape, size, or pose, without changing other attributes.
 
