#### Getting the best of both GANs and VAEs

This article presents our research on high-resolution image generation using a Generative Variational Autoencoder.

### Important Points

- Our work addresses the mode collapse issue of GANs and blurred images generated using VAEs in a single model architecture.
- We use the encoder of VAE as it is while replacing the decoder with a discriminator.
- The encoder is fed data from a normal distribution while the generator is fed that from a gaussian distribution.
- The outputs of both are then fed to a discriminator, which judges whether the generated images are real or not.
- We evaluate our network on 3 different datasets: MNIST, Fashion MNIST and TCIA Pancreas CT.
- On these datasets, we outperform the previous state-of-the-art methods we compare against, as measured by MMD, SSIM, log likelihood, reconstruction error, ELBO and KL divergence.

### Introduction

Training deep neural networks requires hundreds or even thousands of images. The lack of labelled datasets, especially for medical images, often hinders progress, so it becomes imperative to create additional training data. One actively researched way of doing this is image generation with generative adversarial networks (GANs). With this technique, new images are generated by training on the images already present in the dataset. The new images are realistic but different from the original data. There are two main approaches to data augmentation with GANs: image-to-image translation and sampling from a random distribution. The main challenge with GANs is mode collapse, i.e. the generated images are very similar to each other and lack variety.

Another approach to image generation uses Variational Autoencoders (VAEs). This architecture contains an encoder, also known as the inference network, which takes an observation as input and outputs the parameters of a conditional distribution over the latent representation. The decoder, also known as the generative network, takes a latent encoding as input and outputs the parameters of a conditional distribution over the observation. During training, VAEs use the reparameterization trick: the latent sample is written as a deterministic function of the encoder outputs and noise drawn from a Gaussian distribution, so that gradients can flow through the sampling step. The main challenge with VAEs is that they tend to generate blurry rather than sharp images.
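To make the reparameterization trick concrete, here is a minimal, dependency-free sketch. It assumes the encoder outputs a mean and a log-variance per latent dimension, which is the usual VAE convention rather than something specific to our model; the names `reparameterize`, `mu` and `log_var` are illustrative.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)  # log-variance -> standard deviation
        eps = rng.gauss(0.0, 1.0)   # noise that does not depend on mu or sigma
        z.append(m + sigma * eps)
    return z


# Hypothetical encoder outputs for a 2-dimensional latent space.
rng = random.Random(0)
mu = [0.5, -1.0]
log_var = [0.0, -2.0]
samples = [reparameterize(mu, log_var, rng) for _ in range(20000)]
mean_0 = sum(s[0] for s in samples) / len(samples)  # should be close to mu[0]
```

Because the randomness lives entirely in `eps`, the sample is a differentiable function of `mu` and `log_var`, which is what lets the encoder be trained by backpropagation.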

### Dataset

The following datasets are used for training and evaluation:

- MNIST — This is a large dataset of handwritten digits which has been used successfully for training image classification and image processing algorithms. It contains 60,000 training images and 10,000 test images.
- Fashion MNIST — This dataset is also similar to MNIST with 60,000 training images and 10,000 test images. Each example is a 28x28 grayscale image which is labelled into one of the 10 classes of fashion wear like trouser, top, sandal etc.
- TCIA Pancreas CT — The National Institutes of Health Clinical Center performed 82 abdominal contrast-enhanced 3D CT scans. The CT scans have a resolution of 512×512 pixels, with varying pixel sizes and a slice thickness between 1.5 and 2.5 mm.

### VAEs vs GANs vs Ours

We show that, instead of performing inference as in the original VAE architecture, we can add the error vector to the original data and multiply by the standard deviation. The resulting term goes to the encoder and is converted to the latent space. Similarly, in the decoder, the error vector is added to the latent vector and multiplied by the standard deviation. In this way, we use the encoder as in the original VAE, while we replace the decoder with a discriminator and change the loss function accordingly. The comparison between the VAE architecture and ours is shown in Fig 1.

Our architecture can be seen as an extension of both the VAE and the GAN. The former view is straightforward, since it only requires changing the loss function for the decoder. The latter follows from the fact that a GAN is essentially a zero-sum game in which the generator and the discriminator seek a Nash equilibrium. In our case, the encoder from the VAE and the discriminator from the GAN play this zero-sum game and compete with each other. As training proceeds, both losses decrease until they stabilize.

### Network Architecture

The network architecture used in this work is summarized in the points below:

- The discriminator and encoder networks have four convolution layers, each of which uses 3×3 filters.
- We use Batch Normalization and Leaky Rectified Linear Unit (LeakyReLU) layers after each layer.
- We found that our architecture suffers from instability during training. This was solved using the WGAN loss function, which measures the Wasserstein distance between two distributions.
- We used the gradient penalty term to stabilize the training.
- Our loss function has a total of 3 terms. During training, the encoder and the generator are treated as one network. Thus, we sum the loss functions of the two networks (the combined encoder-generator, then the discriminator) into one and train the networks.
- Two latent vectors are sampled, one from a normal distribution and the other from a Gaussian distribution. The former is fed to the encoder and the latter to the generator.
- The outputs produced from both vectors are in turn fed to the discriminator, which tells whether the generated image is real or not.

Our network architecture is shown in Fig 2.
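The WGAN loss with gradient penalty mentioned in the points above can be sketched as follows. This is a toy illustration and not our training code: it follows the standard formulation from Gulrajani et al., estimates the gradient by finite differences instead of automatic differentiation, and the penalty weight `lam=10.0` as well as the names `grad_norm`, `critic_loss`, `generator_loss` and `linear_critic` are assumptions made for the example.

```python
import math
import random


def grad_norm(critic, x, h=1e-5):
    """L2 norm of d critic / d x, estimated by central differences."""
    total = 0.0
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        total += ((critic(up) - critic(down)) / (2 * h)) ** 2
    return math.sqrt(total)


def critic_loss(critic, real, fake, lam=10.0, rng=random):
    """Wasserstein critic loss plus gradient penalty (equal-sized batches)."""
    n = len(real)
    wasserstein = (sum(critic(f) for f in fake) - sum(critic(r) for r in real)) / n
    penalty = 0.0
    for r, f in zip(real, fake):
        t = rng.random()  # random point on the line between a real and a fake sample
        x_hat = [t * a + (1 - t) * b for a, b in zip(r, f)]
        penalty += (grad_norm(critic, x_hat) - 1.0) ** 2
    return wasserstein + lam * penalty / n


def generator_loss(critic, fake):
    """The generator side tries to raise the critic score of its samples."""
    return -sum(critic(f) for f in fake) / len(fake)


def linear_critic(x):
    # Hypothetical critic whose gradient has norm exactly 1, so the penalty vanishes.
    return 0.6 * x[0] + 0.8 * x[1]


real = [[1.0, 1.0], [2.0, 0.0]]
fake = [[0.0, 0.0], [0.0, 1.0]]
d_loss = critic_loss(linear_critic, real, fake, rng=random.Random(0))
g_loss = generator_loss(linear_critic, fake)
```

The penalty pushes the critic's gradient norm towards 1 at points between real and generated samples, which is what keeps the Wasserstein estimate well behaved during training.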

### Architecture Details

The layerwise architecture details of the generator and discriminator are shown in Table 1 and Table 2 respectively. We define a ResNet block as the following layers: a convolutional layer, a max pooling layer, 30 percent dropout between the layers, and a batch normalization layer.

### Algorithm

The algorithm used in this work is trained using Stochastic Gradient Descent (SGD) as shown below:

### Experiments

All the generated samples are generator outputs from random latent vectors. We normalize all data into the range [-1, 1] and use two evaluation metrics to measure the performance of our network. The first measures the distance between the distributions of real and generated samples using the maximum mean discrepancy (MMD) score. The second evaluates generation diversity using the multi-scale structural similarity metric (MS-SSIM). Table 4 compares our MMD and MS-SSIM scores with previous state-of-the-art architectures.
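As a rough illustration of the preprocessing and the first metric, the sketch below scales pixel values to [-1, 1] and computes a biased estimate of squared MMD with a Gaussian kernel. The kernel choice, the bandwidth `sigma`, the assumed input range of [0, 255] and all function names are assumptions for the example; they are not taken from our evaluation code.

```python
import math


def scale_to_unit_range(pixels, lo=0.0, hi=255.0):
    """Map values from [lo, hi] linearly onto [-1, 1]."""
    return [2.0 * (p - lo) / (hi - lo) - 1.0 for p in pixels]


def gaussian_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two sets of samples."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, sigma) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy "images" with two pixels each, standing in for real and generated samples.
real = [scale_to_unit_range(img) for img in ([0, 255], [10, 240], [5, 250])]
generated = [scale_to_unit_range(img) for img in ([128, 128], [120, 130], [140, 125])]
score_same = mmd_squared(real, real)
score_diff = mmd_squared(real, generated)
```

A set compared with itself gives a score of zero, and the score grows as the two sets of samples drift apart, so lower MMD means generated samples that are distributed more like the real data.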

We noticed that the model with a small latent vector size of 100 suffers from severe mode collapse. The best results are obtained with a moderately large latent vector size. Table 5 compares the effect of different latent variable sizes on the MMD and MS-SSIM scores.

As can be seen, a latent variable size of 1000 produces the best results among those compared. Mode collapse, one of the main challenges in training GANs, appears at both low and high latent variable sizes.

Four common evaluation metrics have been used in the literature for testing the performance of generative models. These are log-likelihood, reconstruction error, ELBO and KL divergence.

The log-likelihood measures how probable the observed samples are under the model parameters; higher values indicate a better fit. The reconstruction error is the distance between an original data point and its reconstruction after projection onto a lower-dimensional latent space. The optimization problem in our model involves a KL divergence term that is intractable, so we maximize the ELBO instead of minimizing the KL divergence directly. KL divergence measures how far the generated probability distribution is from the true probability distribution. The comparison of our model with the original VAE architecture on the MNIST dataset using these evaluation metrics is shown in Table 6.
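For reference, the relation between ELBO and KL divergence in a standard VAE can be written in a few lines. The sketch assumes a diagonal Gaussian posterior and a standard normal prior, for which the KL term has a closed form; it illustrates the textbook VAE objective rather than the modified loss of our model, and the function names are illustrative.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def elbo(recon_log_likelihood, mu, log_var):
    """Single-sample ELBO estimate: log p(x|z) minus the KL term."""
    return recon_log_likelihood - kl_to_standard_normal(mu, log_var)


# Hypothetical values: a reconstruction log-likelihood and encoder outputs.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # posterior equals the prior
kl_shifted = kl_to_standard_normal([1.0], [0.0])         # mean shifted by 1
example_elbo = elbo(-10.0, [1.0], [0.0])
```

The ELBO is a lower bound on the log-likelihood, so raising it improves reconstruction while keeping the posterior close to the prior.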

We compare our log probability values with those obtained by previous state-of-the-art methods in Table 7. The log probability is an important evaluation metric because it reflects the diversity of the generated samples.

### Results

We present generated images for all 3 datasets used for testing. The model was trained for 1000 iterations on both MNIST and Fashion MNIST, and for 300 iterations on the TCIA Pancreas CT dataset. The generated images are shown in Fig 3.

We compare our results with previous state of the art networks on MNIST dataset in Fig 4.

### Conclusions

In this post, we presented a new training procedure for Variational Autoencoders based on generative models. It makes the inference model much more flexible, allowing it to represent almost any posterior distribution over the latent variables. Our network was trained and tested on 3 publicly available datasets. Evaluated with MMD, SSIM, log likelihood, reconstruction error, ELBO and KL divergence, our network beats the previous state-of-the-art algorithms. Using generative models to create additional training data could be especially valuable in fields like medical imaging, where there is a shortage of data for training deep convolutional neural network architectures.

### References

S. U. Dar, M. Yurt, L. Karacan, A. Erdem, E. Erdem, and T. Çukur. Image synthesis in multi-contrast MRI with conditional generative adversarial networks. IEEE Transactions on Medical Imaging, 38(10):2375–2388, 2019.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

### Before You Go

Research Paper: https://abhinavsagar.github.io/files/gvae.pdf