When I first began studying neural networks, I was immediately confronted with formulas for backpropagating gradients, starting from the loss function computed at the end of the network and working back layer by layer. These formulas already looked quite complex, even with several simplifying assumptions in place, such as a fully-connected network and sigmoid activation functions after each neuron. I wondered how popular packages like TensorFlow and PyTorch perform the same operations for arbitrary mathematical functions.

Modern machine learning packages use “automatic differentiation” (or “autograd”) to handle this, and with that name it sounds like everything just happens, as if you write the operations and the computer figures out all the derivatives on its own. But it can’t be that simple. In an operation such as convolution, a multiplicative sum is computed using a weight filter of some chosen size positioned at various locations over an input image, and the result is a new image in which each value corresponds to one such sum. How does “autograd” handle this?

A 2x2 convolution: How do you take the derivative of that? (Figure by author)

Here we will:

  1. Show how backpropagation emerges from the “chain rule” of differentiation.
  2. Work through the computation of derivatives in some specific examples. We’ll use TensorFlow and aim to understand exactly what is computed when the gradient function is called on a tensor.

The chain rule

In training neural networks, our goal is to minimize some loss function L by adjusting the trainable parameters of the network. This means computing, at each training step, the derivative of the loss function with respect to each parameter W, and adjusting these parameters in a way that decreases the loss: pushing them a tiny step in the opposite (-) direction of their derivatives,

W_i → W_i − η ∂L/∂W_i,

where η is a small learning rate.
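
As a concrete illustration, here is a minimal single-step sketch of this update in TensorFlow, using a toy one-parameter loss of our own choosing (W, x, target, and the learning rate eta are all made up for this example):

import tensorflow as tf
# Toy loss L(W) = (W*x - target)^2 with a single trainable parameter W.
W = tf.Variable(2.0)
x, target = 3.0, 15.0
eta = 0.01  # learning rate, chosen arbitrarily for this sketch
with tf.GradientTape() as g:
    L = (W * x - target) ** 2   # the tape records operations on W
dL_dW = g.gradient(L, W)        # dL/dW
W.assign_sub(eta * dL_dW)       # W <- W - eta * dL/dW

A tf.Variable is watched by the tape automatically, so a single gradient call gives the derivative needed for the update.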

The task is then to find ∂L/∂W for every parameter. Modern machine learning packages approach this problem using computational graphs, and we will see how this allows us to break the problem down into manageable pieces. The loss function is a multivariate function of the parameters of the graph, and so to find all of the derivatives, we can apply the chain rule.

The chain rule states, for example, that for a function f of two variables x1 and x2, which are both functions of a third variable t,

df/dt = (∂f/∂x1)(dx1/dt) + (∂f/∂x2)(dx2/dt).
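
To see this agree with what autograd computes, here is a quick numeric check on a toy example of our own (f = x1·x2 with x1 = t² and x2 = sin t); both printed values should match:

import tensorflow as tf
t = tf.constant(1.3)
with tf.GradientTape() as g:
    g.watch(t)
    x1, x2 = t ** 2, tf.sin(t)   # both are functions of t
    f = x1 * x2
print(g.gradient(f, t))                       # df/dt from autograd
print(2*t*tf.sin(t) + t**2 * tf.cos(t))       # df/dt from the chain rule by hand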

Let’s consider the following graph:

(Figure by author)

Here we assume W, V, and U each represent several parameters (they could be vectors, matrices, or some higher-order tensors), but for now we will just denote them by single subscripts. We’ll also assume all inputs and outputs (x1, x2, y1, y2, and z) are vectors, and for simplicity we will use a single subscripted x to run over all elements of x1 and x2, and similarly for y. Now, using the chain rule to find the derivative of the loss L with respect to one of the W parameters,

∂L/∂W_i = Σ_k Σ_j (∂L/∂z_k) (∂z_k/∂y_j) (∂y_j/∂W_i).

It’s starting to look like we’re going to need to evaluate a long equation like this one for every parameter, but let’s work on it a bit more.

We can define the quantity

dz_k ≡ ∂L/∂z_k

as a small displacement in z that is then “sent backward” along the graph. In terms of it, the derivative above becomes

∂L/∂W_i = Σ_j dy_j ∂y_j/∂W_i,

where

dy_j ≡ Σ_k dz_k ∂z_k/∂y_j,

and so the small displacement in W_i is then, following the same pattern,

dW_i ≡ Σ_j dy_j ∂y_j/∂W_i = ∂L/∂W_i.

Two important things to note:

  • Values are only backpropagated along paths for which there is some dependency. For example, none of the values dy2 in the graph will influence any of the dWi because all partial derivatives of components of y2 with respect to Wi are zero.
To fulfill the chain rule, gradients are sent backward along the same paths as the information was originally sent forward. (Figure by author)
  • One does not need to worry about the entire network at once when performing backpropagation. For every node, we only need to consider the gradients sent through the output channels, use them to compute the derivatives of the parameters at that node, and then send back through the input channels the correct gradients to be used at earlier nodes. If we do this correctly for every node, the gradients for the entire network can be computed.
Our task for each node of the network: use the incoming gradients dy1 and dy2 to compute displacements in the parameters dW, and to compute the outgoing gradients dx1 and dx2. (Figure by author)

Autograd handles the systematics of this, but it’s not magic — given the derivatives of the functions applied at each node, which must be specifically defined, it can backpropagate the gradients through the entire network using this procedure.
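
TensorFlow exposes this node-level contract through tf.custom_gradient. The sketch below is a toy example of our own (built-in ops like tf.sigmoid register their gradients internally, not this way): it declares a squaring node together with the rule that turns the incoming gradient dz into the outgoing gradient dx.

import tensorflow as tf

@tf.custom_gradient
def my_square(x):
    z = x * x
    def grad(dz):             # dz: the gradient arriving from later nodes
        return dz * 2.0 * x   # dx = dz * dz/dx, sent back to earlier nodes
    return z, grad

x = tf.constant([3.0, 4.0, 5.0])
with tf.GradientTape() as g:
    g.watch(x)
    z = my_square(x)
print(g.gradient(z, x))       # [6. 8. 10.], i.e. an incoming gradient of ones times 2*x

Every operation comes with such a backward rule; autograd’s job is to chain them together in the right order.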

Let’s work through some specific examples in TensorFlow that highlight the details on how to take derivatives with tensor inputs and outputs.

You can work through these examples interactively in Colab using this Jupyter notebook.

Example 1: the sigmoid function

(Figure by author)

The sigmoid function produces an output z_i for each element x_i, given by

z_i = σ(x_i) = 1 / (1 + e^(−x_i)).

Given

x = (3, 4, 5) and dz = (1, 2, 3),

compute the gradient dx. Remember that, as derived above, this means computing the vector with components

dx_i = Σ_j dz_j ∂z_j/∂x_i.

TensorFlow Code

Here’s the problem setup:

import tensorflow as tf

# Define inputs and output gradients.
x = tf.constant([3.0, 4.0, 5.0])
dz = tf.constant([1.0, 2.0, 3.0])

# Define the gradient.
def grad_sigmoid(x, dz):
    # (Add implementation here)
    pass

dx = grad_sigmoid(x, dz)

# Compute the gradient with TensorFlow.
with tf.GradientTape() as g:
    g.watch(x)
    z = tf.sigmoid(x)
dx_tf = g.gradient(z, x, output_gradients=dz)

# Check the answer.
print(dx == dx_tf)

Solution

Note that we will make use of the Kronecker delta, which is equal to 1 if its two indices are the same and equal to 0 otherwise,

δ_ij = 1 if i = j, and δ_ij = 0 if i ≠ j.

Taking the derivative,

∂z_j/∂x_i = σ(x_j) (1 − σ(x_j)) δ_ij,

and computing the gradient,

dx_i = Σ_j dz_j ∂z_j/∂x_i = dz_i σ(x_i) (1 − σ(x_i)).

We can write this using element-wise product notation as (rounding the numerical result to 3 significant figures)

dx = dz ⊙ σ(x) ⊙ (1 − σ(x)) ≈ (0.0452, 0.0353, 0.0199),

where σ(x) denotes the vector with components σ(x_i) and ⊙ denotes element-wise multiplication.

And the above implemented in TensorFlow is as follows:

def grad_sigmoid(x, dz):
    return dz * tf.sigmoid(x) * (1 - tf.sigmoid(x))
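
As an extra sanity check of our own (not part of the original exercise), the first component of dx can be approximated with a central finite difference of the scalar Σ_j dz_j σ(x_j), using the x and dz defined in the setup above:

eps = 1e-3
e0 = tf.constant([eps, 0.0, 0.0])   # perturb only the first component of x
num = (tf.reduce_sum(dz * tf.sigmoid(x + e0))
       - tf.reduce_sum(dz * tf.sigmoid(x - e0))) / (2 * eps)
print(num)   # should be close to dx[0] ≈ 0.0452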

Example 2: the softmax function

(Figure by author)

The softmax function produces, for an input vector x, a vector z with elements

z_i = e^(x_i) / Σ_j e^(x_j).

Given

x = (3, 4, 5) and dz = (1, 2, 3),

compute the gradient dx.

TensorFlow Code

Here’s the problem setup:

import tensorflow as tf

# Define inputs
x = tf.constant([3.0, 4.0, 5.0])
dz = tf.constant([1.0, 2.0, 3.0])

# Define the gradient.
def grad_softmax(x, dz):
    # (Add implementation here)
    pass

dx = grad_softmax(x, dz)

# Compute the gradient with TensorFlow.
with tf.GradientTape() as g:
    g.watch(x)
    z = tf.nn.softmax(x)
dx_tf = g.gradient(z, x, output_gradients=dz)

# Check the answer.
print(dx == dx_tf)

Solution

Taking the derivative (using the Kronecker delta again),

∂z_j/∂x_i = z_j (δ_ij − z_i),

and computing the gradient,

dx_i = Σ_j dz_j ∂z_j/∂x_i = z_i (dz_i − dz·z) ≈ (−0.142, −0.141, 0.283).

Here again we have rounded to 3 significant figures and used the dot-product notation

dz·z = Σ_j dz_j z_j.

And the above implemented in TensorFlow is as follows:

def grad_softmax(x, dz):
    return tf.nn.softmax(x) * (dz - tf.tensordot(tf.nn.softmax(x), dz, 1))
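
As a cross-check of our own, the same result can be obtained the slow way by building the full softmax Jacobian with GradientTape.jacobian and contracting it with dz, which spells out the sum dx_i = Σ_j dz_j ∂z_j/∂x_i explicitly (x and dz as defined in the setup above):

with tf.GradientTape() as g:
    g.watch(x)
    z = tf.nn.softmax(x)
J = g.jacobian(z, x)                # J[j, i] = ∂z_j/∂x_i, a 3x3 matrix here
dx_check = tf.tensordot(dz, J, 1)   # dx_i = Σ_j dz_j J[j, i]
print(dx_check)                     # ≈ (-0.142, -0.141, 0.283)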

Example 3: matrix multiplication

(Figure by author)

Now let’s consider a case in which the inputs are not vectors (rank 1 tensors) but matrices (rank 2 tensors). The computations will be similar, but the elements will be identified by two indices rather than one.

We multiply two matrices x and y to produce a matrix z with elements

z_ij = Σ_k x_ik y_kj.

Given

x = [[3, 4], [5, 6]], y = [[4, 5], [6, 7]], and dz = [[1, 2], [3, 4]],

compute the gradient dx. Note that in computing the elements of the gradient dx, all elements of dz must be included in the sum. Therefore we must now sum over both indices, as

dx_ij = Σ_m Σ_n dz_mn ∂z_mn/∂x_ij.

TensorFlow Code

Here’s the problem setup:

import tensorflow as tf

# Define inputs
x = tf.constant([[3.0, 4.0], [5.0, 6.0]])
y = tf.constant([[4.0, 5.0], [6.0, 7.0]])
dz = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# Define the gradient.
def grad_matmul(x, y, dz):
    # (Add implementation here)
    pass

dx = grad_matmul(x, y, dz)

# Compute the gradient with TensorFlow.
with tf.GradientTape() as g:
    g.watch(x)
    z = tf.matmul(x, y)
dx_tf = g.gradient(z, x, output_gradients=dz)

# Check the answer.
print(dx == dx_tf)

Solution

Because we are now dealing with matrices, a partial derivative with respect to some matrix element will produce two Kronecker deltas, as

∂x_mk/∂x_ij = δ_mi δ_kj.

Using this to take the derivative,

∂z_mn/∂x_ij = Σ_k (∂x_mk/∂x_ij) y_kn = δ_mi y_jn,

and then computing the gradient,

dx_ij = Σ_m Σ_n dz_mn δ_mi y_jn = Σ_n dz_in y_jn.

This is in fact just another matrix multiplication, of dz with the transposed y matrix,

dx = dz · yᵀ = [[14, 20], [32, 46]].

This can be implemented in TensorFlow as follows:

def grad_matmul(x, y, dz):
    return tf.matmul(dz, tf.transpose(y))
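
Following exactly the same steps for the other input gives the gradient with respect to y, namely dy = xᵀ · dz. A sketch of our own (the exercise above only asks for dx, and grad_matmul_y is a name we made up):

def grad_matmul_y(x, y, dz):
    # dy_ij = sum over m, n of dz_mn * (d z_mn / d y_ij) = sum over m of x_mi * dz_mj
    return tf.matmul(tf.transpose(x), dz)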

Example 4: convolution

(Figure by author)

Now let’s try a more complicated example with matrix inputs and outputs. We write the convolution z of matrix x with the filter w as

z_mn = Σ_i Σ_j w_ij x_(m+i)(n+j).

Given a 4x4 input x, a 2x2 filter w (and therefore a 3x3 output z), and the output gradient dz,

x = [[3, 4, 5, 6], [4, 5, 6, 7], [5, 6, 7, 8], [6, 7, 8, 9]],
w = [[1, 2], [3, 4]],
dz = [[1, 1, 1], [2, 2, 2], [3, 3, 3]],

find the gradient dw. (Note: we’re looking for dw this time, not dx!)

TensorFlow Code

Here’s the problem setup:

import tensorflow as tf

# Use this method to perform the convolution. tf.nn.conv2d expects a
# [batch, height, width, channels] input and a [height, width, in_channels,
# out_channels] filter, so we reshape to add batch and channel dimensions of size 1.
def conv2d(x, w):
    return tf.nn.conv2d(tf.reshape(x, [1, x.shape[0], x.shape[1], 1]),
                        tf.reshape(w, [w.shape[0], w.shape[1], 1, 1]),
                        strides=[1, 1],
                        padding="VALID")

# Define inputs
x = tf.constant([[3.0, 4.0, 5.0, 6.0],
                 [4.0, 5.0, 6.0, 7.0],
                 [5.0, 6.0, 7.0, 8.0],
                 [6.0, 7.0, 8.0, 9.0]])
w = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])
dz = tf.constant([[1.0, 1.0, 1.0],
                  [2.0, 2.0, 2.0],
                  [3.0, 3.0, 3.0]])

# Define the gradient.
def grad_conv2d(x, w, dz):
    # (Add implementation here)
    pass

dx = grad_conv2d(x, w, dz)
# Reshape to remove the batch and channel dimensions.
dx = tf.reshape(dx, [w.shape[0], w.shape[1]])

# Compute the gradient with TensorFlow.
with tf.GradientTape() as g:
    g.watch(w)
    z = conv2d(x, w)
dx_tf = g.gradient(z, w,
                   output_gradients=tf.reshape(dz, [1, dz.shape[0], dz.shape[1], 1]))

# Check the answer.
print(dx == dx_tf)

Solution

Computing the derivative,

∂z_mn/∂w_ij = x_(m+i)(n+j),

we then have the gradient

dw_ij = Σ_m Σ_n dz_mn ∂z_mn/∂w_ij = Σ_m Σ_n dz_mn x_(m+i)(n+j) = [[96, 114], [114, 132]].

We note that this is just the convolution of x with dz used as the filter,

dw = conv2d(x, dz).

This can be implemented in TensorFlow using our function conv2d from above as:

def grad_conv2d(x, w, dz):
    return conv2d(x, dz)
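
To make the index formula dw_ij = Σ_m Σ_n dz_mn x_(m+i)(n+j) concrete, here is a slow, explicit-loop version of our own (using the x, w, and dz tensors from the setup above):

def grad_conv2d_loops(x, w, dz):
    # dw[i][j] accumulates dz[m, n] * x[m + i, n + j] over all output positions (m, n).
    dw = [[0.0] * int(w.shape[1]) for _ in range(int(w.shape[0]))]
    for i in range(int(w.shape[0])):
        for j in range(int(w.shape[1])):
            for m in range(int(dz.shape[0])):
                for n in range(int(dz.shape[1])):
                    dw[i][j] += float(dz[m, n]) * float(x[m + i, n + j])
    return tf.constant(dw)   # [[96., 114.], [114., 132.]] for the values above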

Conclusions

Not all gradients will end up being neat one-line expressions like those in the examples above, but hopefully this has helped clarify some of the core concepts behind the “backward pass” in computational graphs.

