Drawing Architecture: Building Deep Convolutional GANs in PyTorch

What I cannot create, I do not understand.
Richard Feynman

Feynman did not create GANs, unsupervised learning, or adversarial training, but this quote captures the idea that intelligence and understanding are not merely supervised, discriminative tasks. To understand something you must do more than give it a label based on something similar that you have seen a million times; to understand what you are looking at, you must be able to recreate it. The ability to create is what sets generative adversarial networks apart from their predecessors in deep learning. GANs are generative models that produce output, a departure from discriminative models that label input. This makes them a paradigm-shifting force in deep learning and artificial intelligence, worthy of the praise that Yann LeCun and other pioneers of deep learning have given them. The potential of GANs surpasses that of discriminative networks because GANs use deep learning to synthesize information and create something novel from it. As Feynman suggested, this is the deepest form of understanding there is.

In this post I will go over how to use a deep convolutional GAN to create architecture (exteriors of buildings). I wanted the experience of using and creating a dataset that was not automatically built in, like ImageNet or MNIST, and fine-tuning a model on it to create realistic-looking facades. The dataset I worked with is here. I also augmented the dataset by scraping additional images off of the web.

What is a GAN:

Generative adversarial networks are two neural networks competing against each other to create output that closely resembles the input. These two networks, the generator and the discriminator, play adversarial roles. The generator creates a new image from random noise; the noise evolves from incoherent pixels into a coherent image with discernible forms because of the feedback the discriminator provides. The discriminator determines whether an image is real or fake. The goal of the GAN is for the generated images to be so close to the real ones that they trick the discriminator into thinking they are real. One of the most important features of GANs is that their networks use a number of parameters significantly smaller than the amount of data used to train them. This forces the model to learn and internalize the most important features of the data so that it can reproduce them.

Vanilla GAN Architecture

To better understand this, let's look at the analogy that Ian Goodfellow and his colleagues used when they published the original paper in 2014. The generator is like a team of forgers trying to produce output that matches the real paintings (the input), while the discriminator is like a team of detectives trying to tell the real images from the fakes. On each iteration of the algorithm, the generator never sees the original input; instead, it sees a latent random variable (visual noise) and the judgments of the discriminator. The generator is like a blind forger trying to recreate the Mona Lisa: given only the paint and the detective's feedback on how to use it, the forger's painting looks more and more like the Mona Lisa after each iteration.

Deep Convolutional GAN (DCGAN)

Vanilla GAN architectures are powerful, but they do not allow for real spatial reasoning, since they rely on feedforward neural networks rather than convolutional neural networks to extract features from an image. Feedforward networks cannot reason about features such as sharp edges and accentuated curves because they do not preserve the spatial structure of the image. Convolutional layers preserve that spatial structure, which means more accurate, detailed features can be extracted. This gives both networks stronger spatial reasoning: the generator about the output it will produce, and the discriminator about how to distinguish the features of real images from those of fakes. This enhanced feature quality is why DCGANs are typically used when working with images.


Both the generator and the discriminator will be convolutional neural networks. The discriminator has a fairly standard CNN architecture, since it performs the discriminative, supervised task of classifying images. The generator has a modified convolutional architecture like this:

  • Pooling is replaced with strided convolutions. This allows the network to learn its own spatial downsampling (changing the size of the input) rather than having it fixed beforehand.
  • No fully connected layers at the end of the network. The generator is not a classifier, so they are not needed.
  • Batch normalization is used in every layer except the output layer of the generator and the input layer of the discriminator. This stabilizes training by standardizing the activations (giving each layer's inputs zero mean and unit variance) and the flow of the gradient. In the paper, applying batch norm to the generator's output layer and the discriminator's input layer led to model instability, which is why it is omitted in those layers.
  • ReLU is used throughout the generator except for the output layer, which uses tanh. The symmetry of the tanh function allows the model to learn more quickly to saturate and cover the color space of the training distribution. (The discriminator uses LeakyReLU instead.)


The generator takes random noise as input and, guided by the discriminator's feedback, produces an image. This process can be thought of as mapping the noise to the input data distribution. The generator architecture is a modified convolutional neural network built from ConvTranspose2d layers with learnable parameters, following the guidelines described above.

nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),
nn.BatchNorm2d(ngf * 8),
nn.ReLU(True),
nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
nn.BatchNorm2d(ngf * 4),
nn.ReLU(True),
nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
nn.BatchNorm2d(ngf * 2),
nn.ReLU(True),
nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
nn.BatchNorm2d(ngf),
nn.ReLU(True),
nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
nn.Tanh(),
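A runnable version of this stack (with the ReLU/tanh activations the DCGAN paper prescribes, and assuming the common defaults nz=100, ngf=64, nc=3) can be sanity-checked by shape:

```python
import torch
import torch.nn as nn

nz, ngf, nc = 100, 64, 3  # noise size, generator feature maps, image channels

# Each 4x4, stride-2 transposed conv doubles the spatial size:
# 1x1 -> 4x4 -> 8x8 -> 16x16 -> 32x32 -> 64x64.
netG = nn.Sequential(
    nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),
    nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ngf), nn.ReLU(True),
    nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
    nn.Tanh(),  # output pixel values in [-1, 1]
)

z = torch.randn(16, nz, 1, 1)  # a batch of 16 noise vectors
fake = netG(z)
print(fake.shape)  # torch.Size([16, 3, 64, 64])
```

The tanh output matches training images normalized to the [-1, 1] range.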

Random Noise:


In generative learning, the machine attempts to generate new outputs from a complex probability distribution (the input). In deep learning this idea is modeled as a neural network, in our case a convolutional one, that takes a simple random variable as input and returns a random variable that follows the target distribution as output. The network learns hierarchical relationships in the input distribution; here, it learns to turn a set of random pixels into something that resembles the real images.

On each iteration of the training loop, the generator's output is sent to the discriminator, which judges how much it looks like the real images. Once the discriminator signals how far off the output is, the generator takes that information and adjusts until the noise turns into an image that resembles the input. The generator therefore never works directly with the input; it indirectly learns how to transform the noise into something that looks like it.

fixed_noise = torch.randn(64, nz, 1, 1, device=device)

2D Transpose Layers

The generator is composed of ConvTranspose2d layers. These layers upsample the noise vector, transforming the noise into an image. This is not the same thing as a deconvolution or an ordinary convolutional layer: convolutional layers extract smaller and smaller features that are later classified, while a true deconvolution would mathematically reverse the operations of a convolutional layer.

2D Transpose Layer

In the simplest sense, a transpose makes two things switch places. Here, the forward and backward passes of a convolution switch places: instead of mapping an image down to a small feature map, the layer maps a small feature map up to a larger image, which amounts to multiplying by the transpose of the convolution's matrix. This process upsamples the image, enlarging it and filling in the details of the final output. This is the part of the generator that "draws" the actual image.

At a high level, this makes more sense if you look at the architecture. We start with a noise vector. This vector is not the image; it is a compressed version of what the image will become. The 2D transpose layers in the generator decompress the noise, step by step, until it becomes a 64x64 image with all of the details in the right place. This is the final product of the generator.
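The upsampling arithmetic can be checked directly. For a ConvTranspose2d with kernel size k, stride s, and padding p (no output padding or dilation), the output size is (H_in − 1)·s − 2p + k. A quick sketch:

```python
import torch
import torch.nn as nn

def convtranspose2d_out(h_in, kernel, stride, padding):
    """Output size of ConvTranspose2d (no output_padding, no dilation)."""
    return (h_in - 1) * stride - 2 * padding + kernel

# A 4x4, stride-2, padding-1 transposed conv doubles the spatial size:
print(convtranspose2d_out(8, 4, 2, 1))  # 16

# Confirm against PyTorch itself:
up = nn.ConvTranspose2d(3, 3, 4, stride=2, padding=1)
x = torch.randn(1, 3, 8, 8)
print(up(x).shape)  # torch.Size([1, 3, 16, 16])
```

Chaining these doublings is what carries the generator from a 1x1 noise vector up to the full 64x64 output.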



The discriminator receives both real and fake images, and its job is to classify them as real or fake.

nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
nn.LeakyReLU(0.2, inplace=True),
nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf * 2),
nn.LeakyReLU(0.2, inplace=True),
nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf * 4),
nn.LeakyReLU(0.2, inplace=True),
nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf * 8),
nn.LeakyReLU(0.2, inplace=True),
nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),
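As a sketch (assuming nc=3, ndf=64, and a final Sigmoid appended so the output can be read as a probability, as in the PyTorch DCGAN tutorial), this stack collapses a 64x64 image down to a single real/fake score:

```python
import torch
import torch.nn as nn

nc, ndf = 3, 64  # image channels, discriminator feature maps

netD = nn.Sequential(
    nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),           # 64x64 -> 32x32
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),      # -> 16x16
    nn.BatchNorm2d(ndf * 2),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),  # -> 8x8
    nn.BatchNorm2d(ndf * 4),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),  # -> 4x4
    nn.BatchNorm2d(ndf * 8),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),        # -> 1x1 score
    nn.Sigmoid(),  # squash the score into a [0, 1] probability
)

imgs = torch.randn(16, nc, 64, 64)
scores = netD(imgs).view(-1)  # one probability per image
print(scores.shape)  # torch.Size([16])
```

Each stride-2 convolution here is the mirror image of a stride-2 transposed convolution in the generator.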


The discriminator is a convolutional neural network with the architecture above. This gives it the spatial reasoning it needs to learn exactly which spatially preserved features make an image real, and then to use those features to classify an image as real or fake. The discriminator does not use the 2D transpose layers that the generator does, because the discriminator performs a supervised task (isolating and extracting features), not a generative one (upsampling features to create an image).

Indirect Training

The discriminator provides feedback to the generator, which learns to reproduce the features the discriminator found in the real images. Ultimately, we want the generated images to be so good that the discriminator cannot tell the difference between real and fake. This means we train the DCGAN indirectly: the generator is not trying to match the input's probability distribution exactly, but rather to produce a distribution that fools the discriminator into thinking its samples came from the same distribution as the real images.

The discriminator assigns each image a score indicating how real or fake it looks. In training, the generator tries to make the discriminator give its generated images a high score, so the discriminator will think the fake images are real. This score drives the learning process, which lets the generator get better at generating images and the discriminator get better at classifying them.

Loss Function and Learning Process

Training Process


The discriminator outputs a value D that indicates how close the generated image is to a real image. The goal is to maximize the chance that the discriminator recognizes real images as real and generated images as fake. The cross-entropy loss, −p log(q), measures this process: it is the standard way to score a classification model, and it works particularly well here because the loss grows as the predicted probability diverges from the label.
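This behavior is easy to see numerically. A minimal sketch of binary cross entropy, −[p log(q) + (1 − p) log(1 − q)], shows that a confident wrong prediction is punished far more than a near-correct one (a standalone illustration, not tied to the GAN code):

```python
import math

def bce(p, q):
    """Binary cross entropy for true label p and predicted probability q."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

# Real image (label 1): the loss shrinks as the prediction approaches 1
# and blows up as the prediction diverges from the label.
print(round(bce(1, 0.9), 3))  # 0.105
print(round(bce(1, 0.5), 3))  # 0.693
print(round(bce(1, 0.1), 3))  # 2.303
```

PyTorch's `nn.BCELoss` computes the batch-averaged version of this same quantity.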

Discriminator Loss

For a real image, the label p equals 1, the maximum chance that it is real. For generated images, we reverse the label (use 1 − D) so that the discriminator is pushed to score them near zero. Combining the two terms gives the objective function for the discriminator, which it tries to maximize.


The goal of the generator is to create images that fool the discriminator. This means its objective function encourages the model to create images that receive the highest possible value of D from the discriminator.

Generator Loss
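Put together, one training iteration alternates the two objectives above. Here is a minimal, self-contained sketch using BCELoss, with tiny linear stand-ins for the two networks so it runs on its own (in the real project these would be the DCGAN generator and discriminator; real labels are 1, fake labels are 0):

```python
import torch
import torch.nn as nn

# Toy stand-ins so the sketch is runnable; substitute the DCGAN nets.
netG = nn.Sequential(nn.Linear(8, 16), nn.Tanh())
netD = nn.Sequential(nn.Linear(16, 1), nn.Sigmoid())
optD = torch.optim.Adam(netD.parameters(), lr=2e-4, betas=(0.5, 0.999))
optG = torch.optim.Adam(netG.parameters(), lr=2e-4, betas=(0.5, 0.999))
criterion = nn.BCELoss()

real = torch.randn(4, 16)   # a batch of "real" data
noise = torch.randn(4, 8)   # latent noise

# 1) Train D: push D(real) toward 1 and D(G(z)) toward 0.
optD.zero_grad()
loss_real = criterion(netD(real), torch.ones(4, 1))
loss_fake = criterion(netD(netG(noise).detach()), torch.zeros(4, 1))
(loss_real + loss_fake).backward()
optD.step()

# 2) Train G: push D(G(z)) toward 1, i.e. fool the discriminator.
optG.zero_grad()
loss_G = criterion(netD(netG(noise)), torch.ones(4, 1))
loss_G.backward()
optG.step()
```

Note the `.detach()` in the discriminator step: it stops the discriminator's loss from updating the generator's weights.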

Minimax Game

A GAN is a type of minimax game where G wants to minimize the value function V while D wants to maximize it. This is a zero-sum, non-cooperative game: one player's gain is the other player's loss, and each player maximizes their own payoff. Following the minimax strategy means minimizing the maximum loss (minimizing the worst-case scenario).
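In symbols, the value function being fought over, as written in Goodfellow et al. (2014), is:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] +
  \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The first term is the discriminator's reward for recognizing real data; the second is its reward for rejecting the generator's samples, which the generator in turn tries to drive down.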

Nash Equilibrium

Minimax games come from game theory. GANs are designed to reach a Nash equilibrium: a point at which neither player can reduce their cost by changing only their own parameters while the other player's parameters stay fixed. A GAN converges when the discriminator and the generator reach this equilibrium, which is the optimal point of the minimax objective above. At the Nash equilibrium the costs are as low as they can go, the generator creates images that fool the discriminator, and both objectives are met.

Not every training run converges, and this is most common with non-convex loss surfaces. In game-theoretic terms, it is hard to make your model converge when your opponent is always countering your moves. This is why GANs have a reputation for being notoriously hard to train.


Now for the results! After 500 training epochs on a GPU, the results came in. This was the training process as a GIF:

You can see the process of going from random pixels to pixelated images to images of architecture.

These were some of my favorite (cherry-picked) images:

I would live there!

Not all of the generated images looked perfect. Some had a distorted, discolored, and pixelated facade that only a mother could love…

distorted, discolored pixelated images


Generative adversarial networks tackle all of these areas at once: unsupervised learning, game theory, and supervised learning. This demonstrates what Feynman said: to understand something, you have to be able to create it. Creation is inherently multidimensional; you have to understand the interaction of more than one idea or feature at once to make something new. When deep learning has this kind of power, its capabilities could be limitless. That power obviously comes with a remarkable and unprecedented amount of responsibility, but going forward, unsupervised learning and generative models like GANs have enormous potential to redefine deep learning and artificial intelligence.

Check out the entire project here!


Drawing Architecture: Building Deep Convolutional GANs in PyTorch was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.