In this assignment you will learn how to implement and train basic neural architectures like MLPs and CNNs for classification tasks. Modern deep learning libraries come with sophisticated functionalities like abstracted layer classes, automatic differentiation, optimizers, etc.
- To gain an in-depth understanding we will, however, first focus on a basic implementation of an MLP in NumPy in exercise 1. This will require you to understand backpropagation in detail and to derive the necessary equations first.
- In exercise 2 you will implement an MLP in PyTorch and tune its performance by adding further layers provided by the library.
- In order to learn how to implement custom operations in PyTorch you will reimplement a batch-normalization layer in exercise 3.
- Exercise 4 aims at implementing a simple CNN in PyTorch.
Python and PyTorch have a large community of people eager to help each other. If you have coding-related questions: (1) read the documentation, (2) search on Google and StackOverflow, (3) ask your question on StackOverflow or Piazza and finally (4) ask the teaching assistants.
1 MLP backprop and NumPy implementation (50 points)
Consider a generic MLP for classification tasks consisting of N layers. We want to train the MLP to learn a mapping from the input space $\mathbb{R}^{d_0}$ to a probability mass function (PMF) over $d_N$ classes, given a dataset consisting of S tuples of input vectors and targets. The superscript (0) is added to the input vectors $x^{(0)} \in \mathbb{R}^{d_0}$ as a notational convenience, identifying the input as the activation of a 0-th layer. The targets could be any PMF, but in the following we assume them to be one-hot encoded. Each layer l first applies an affine mapping
LinearModule.forward: $x^{(l-1)} \mapsto \tilde{x}^{(l)} := W^{(l)} x^{(l-1)} + b^{(l)}$,
which is parameterized by weights $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ and biases $b^{(l)} \in \mathbb{R}^{d_l}$. Subsequently, nonlinearities are applied to compute the activations $x^{(l)}$ from the pre-nonlinearity activations $\tilde{x}^{(l)}$. For all hidden layers we choose leaky rectified linear units (leaky ReLU), that is,
LeakyReLUModule.forward: $\tilde{x}^{(l)} \mapsto x^{(l)} := \mathrm{LeakyReLU}(\tilde{x}^{(l)}, a) := \max(0, \tilde{x}^{(l)}) + a \cdot \min(0, \tilde{x}^{(l)})$,
where a defines the slope in the negative part of the domain.
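As a minimal single-sample sketch of one hidden layer, assuming the common leaky ReLU form max(0, x) + a * min(0, x): the function names, the layer widths and the default slope below are illustrative choices, not taken from the provided files.

```python
import numpy as np

def linear_forward(x_prev, W, b):
    """Affine map W x + b for one sample; W has shape (d_l, d_{l-1})."""
    return W @ x_prev + b

def leaky_relu(x_tilde, a=0.02):
    """Elementwise max(0, x) + a * min(0, x); the default slope is illustrative."""
    return np.maximum(0.0, x_tilde) + a * np.minimum(0.0, x_tilde)

rng = np.random.default_rng(0)
d_prev, d_l = 3, 5                        # hypothetical layer widths
x_prev = rng.normal(size=d_prev)          # activation of the previous layer
W = 0.1 * rng.normal(size=(d_l, d_prev))  # weights of layer l
b = np.zeros(d_l)                         # biases of layer l

x_tilde = linear_forward(x_prev, W, b)    # pre-nonlinearity activation
x_l = leaky_relu(x_tilde)                 # activation of layer l
print(x_l.shape)                          # (5,)
```

Positive entries pass through unchanged, while negative entries are scaled by a.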
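Since the desired output should be a valid PMF, a softmax activation is applied in the output layer:
SoftmaxModule.forward: $\tilde{x}^{(N)} \mapsto x^{(N)} := \mathrm{softmax}(\tilde{x}^{(N)}) := \frac{\exp(\tilde{x}^{(N)})}{\sum_{i=1}^{d_N} \exp(\tilde{x}^{(N)}_i)}$.
Note that both the maximum operation and the exponential function are applied elementwise when acting on a vector. A categorical cross-entropy loss between the predicted and target distributions is applied,
CrossEntropyModule.forward: $x^{(N)}, t \mapsto L(x^{(N)}, t) := -\sum_{i=1}^{d_N} t_i \log x^{(N)}_i = -\log x^{(N)}_{\operatorname{argmax}(t)}$,
where the last step holds for t being one-hot. Here $x^{(N)}_{\operatorname{argmax}(t)}$ denotes the component of the softmax output at the index of the target class.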
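The output layer and the loss can be sketched in NumPy as follows. This is an illustrative batched version: subtracting the row maximum before exponentiating and adding a small eps inside the logarithm are numerical-stability choices assumed here, not requirements stated in the assignment.

```python
import numpy as np

def softmax(x_tilde):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = x_tilde - x_tilde.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, targets, eps=1e-12):
    """Per-sample loss -sum_i t_i log x_i for one-hot (or general PMF) targets."""
    return -np.sum(targets * np.log(probs + eps), axis=1)

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
targets = np.array([[1.0, 0.0, 0.0],     # one-hot targets
                    [0.0, 0.0, 1.0]])

probs = softmax(logits)
losses = cross_entropy(probs, targets)
print(probs.sum(axis=1))                 # each row sums to one
print(losses)                            # second entry equals log(3) for uniform output
```

For one-hot targets the sum collapses to minus the log-probability of the target class, which is why the second sample (uniform prediction over three classes) has loss log(3).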
1.1 Analytical derivation of gradients
For optimizing the network via gradient descent it is necessary to compute the gradients w.r.t. all weights and biases. In the definition of the MLP above we split the forward computation into several modules. It turns out that, by making use of the chain rule, the gradient calculations can be split in a similar way into gradients of modules. For each operation performed in the forward pass, a corresponding gradient computation can be performed to propagate the gradients back through the network. You will first compute the partial derivatives of each module w.r.t. its inputs and subsequently chain these together to obtain the final backpropagation computations. Your answers will be the cornerstone of the MLP NumPy implementation that follows.
Note that $\frac{\partial L}{\partial x^{(N)}}$ is the backpropagation equation of the last layer, that is, it corresponds to
CrossEntropyModule.backward: $x^{(N)}, t \mapsto \frac{\partial L}{\partial x^{(N)}}$.
Question 1.1 b) (15 points)
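Using the gradients of the modules, calculate the gradients
$\frac{\partial L}{\partial \tilde{x}^{(N)}}$, $\frac{\partial L}{\partial x^{(l)}}$ and $\frac{\partial L}{\partial \tilde{x}^{(l)}}$ for $l < N$, $\frac{\partial L}{\partial W^{(l)}}$, $\frac{\partial L}{\partial b^{(l)}}$
by propagating back the gradients of the output of each module to the parameters and inputs of the module. The errors on the outputs of each operation occurring on the right-hand sides do not have to be expanded in the result since they were computed in the previous step of the backpropagation algorithm. Please give the final result in the form of matrix (or, in general, tensor) multiplications.
Hint: In index notation the products on the right-hand side can be written in components, e.g. $\left(\frac{\partial L}{\partial W^{(l)}}\right)_{ij} = \sum_k \frac{\partial L}{\partial \tilde{x}^{(l)}_k} \frac{\partial \tilde{x}^{(l)}_k}{\partial W^{(l)}_{ij}}$. Make sure not to confuse the indices, which might occur in a transposed form.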
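A derived matrix expression can be sanity-checked numerically before it is used in an implementation. The sketch below compares a hand-derived gradient with central finite differences. The helper numerical_grad and the quadratic toy function are illustrative and deliberately not one of the modules of this assignment.

```python
import numpy as np

def numerical_grad(f, x, h=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    x = x.astype(float).copy()
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old                      # restore the perturbed entry
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)

f = lambda v: 0.5 * v @ A @ v             # toy scalar function of a vector
analytic = 0.5 * (A + A.T) @ x            # derived by hand; note the transposed term
numeric = numerical_grad(f, x)
print(np.max(np.abs(analytic - numeric))) # should be close to zero
```

Dropping the transposed term (using A @ x alone) makes the check fail for a non-symmetric A, which is the kind of index mistake the hint warns about.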
The gradients calculated in the last exercise are the gradients occurring in the backpropagation equations:
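LeakyReLUModule.backward: $\frac{\partial L}{\partial x^{(l)}} \mapsto \frac{\partial L}{\partial \tilde{x}^{(l)}}$ for $l < N$
LinearModule.backward: $\frac{\partial L}{\partial \tilde{x}^{(l)}} \mapsto \left( \frac{\partial L}{\partial x^{(l-1)}}, \frac{\partial L}{\partial W^{(l)}}, \frac{\partial L}{\partial b^{(l)}} \right)$
The backpropagation algorithm can be seen as a form of dynamic programming since it makes use of previously computed gradients to compute the current gradient. Note that it requires all activations $x^{(l)}$ to be stored in order to propagate the gradients back from $\frac{\partial L}{\partial x^{(l)}}$ to $\frac{\partial L}{\partial x^{(l-1)}}$. In the case of an MLP, the memory cost of storing the weights exceeds the cost of storing the activations, but for CNNs the latter typically make up the largest part of the memory consumption.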
So far we only considered single samples being fed into the network. In practice we typically use batches of input samples which are processed by the network in parallel. The total loss
$L_{\mathrm{total}}\left(\{x^{(0),s}, t^s\}_{s=1}^{B}\right) := \frac{1}{B} \sum_{s=1}^{B} L_{\mathrm{individual}}\left(x^{(0),s}, t^s\right)$
over a batch of B samples is then defined as the mean value of the individual samples’ losses. Here $L_{\mathrm{individual}}$ is the cross-entropy loss as used before, which depends on $x^{(0),s}$ via $x^{(N),s}$. In addition to major computational benefits when running on a GPU, performing gradient descent in batches helps to reduce the variance of the gradients.
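The variance argument can be illustrated with a small simulation. The sketch below assumes per-sample gradient values that are independent draws around a true value. The distribution, the batch size and the number of trials are arbitrary choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
B = 64                                    # hypothetical batch size
trials = 10000

# Hypothetical per-sample gradient values: true gradient 1.0 plus noise.
per_sample = rng.normal(loc=1.0, scale=2.0, size=(trials, B))

single_var = per_sample[:, 0].var()       # variance of a single-sample estimate
batch_var = per_sample.mean(axis=1).var() # variance of the batch-mean estimate

print(single_var, batch_var)              # batch variance is roughly single_var / B
```

For independent samples the variance of the mean shrinks by a factor of B, which is what the two printed numbers approximately show.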
1.2 NumPy implementation
To simplify implementation and testing we have provided an interface for working with CIFAR-10 data in cifar10_utils.py. The CIFAR-10 dataset consists of 60000 32×32 color images in 10 classes, with 6000 images per class. The file cifar10_utils.py contains utility functions that you can use to read CIFAR-10 data. Read through this file to get familiar with the interface of the Dataset class. The main goal of this class is to sample new batches, so you don’t need to worry about it. Labels are one-hot encoded.
Please do not change anything in this file. Usage examples:
- Prepare CIFAR-10 data:
    import cifar10_utils
    cifar10 = cifar10_utils.get_cifar10('cifar10/cifar-10-batches-py')
- Get a new batch with the size of batch_size from the train set:
    x, y = cifar10['train'].next_batch(batch_size)
Variables x and y are NumPy arrays. The shape of x is [batch_size, 3, 32, 32], the shape of y is [batch_size, 10].
- Get test images and labels:
    x, y = cifar10['test'].images, cifar10['test'].labels
Note: For the multi-layer perceptron you will need to reshape x so that each sample is represented by a vector.
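The reshape can be done in a single call. The sketch below uses a placeholder array with the batch shape described above instead of real CIFAR-10 data.

```python
import numpy as np

batch_size = 8
# Placeholder batch with the shape [batch_size, 3, 32, 32] described above.
x = np.zeros((batch_size, 3, 32, 32), dtype=np.float32)

# Keep the batch dimension and flatten channels, height and width.
x_flat = x.reshape(x.shape[0], -1)
print(x_flat.shape)                       # (8, 3072)
```

Each row of x_flat is then one input vector of dimension 3 · 32 · 32 = 3072.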
Question 1.2 (15 points)
Implement a multi-layer perceptron using purely NumPy routines. The network should consist of N linear layers with leaky ReLU activation functions followed by a final linear layer. The number of hidden layers and hidden units in each layer are specified through the command line argument dnn_hidden_units. As loss function, use the common cross-entropy loss for classification tasks. To optimize your network you will use the mini-batch stochastic gradient descent algorithm. Implement all modules in the files modules.py and mlp_numpy.py.
Part of the success of neural networks is the high efficiency on graphical processing units (GPUs) through matrix multiplications. Therefore, all of your code should make use of matrix multiplications rather than iterating over samples in the batch or weight rows/columns. Implementing multiplications by iteration will result in a penalty.
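To make the difference concrete, the sketch below computes the same affine map for a whole batch twice, once by iterating over samples and once with a single matrix multiplication. The shapes are illustrative (16 flattened CIFAR-10 images, 100 hidden units).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 3072))           # batch of 16 flattened inputs
W = 0.01 * rng.normal(size=(100, 3072))   # weights with shape (d_out, d_in)
b = rng.normal(size=100)

# Discouraged: one matrix-vector product per sample in a Python loop.
out_loop = np.stack([W @ x_s + b for x_s in x])

# Preferred: one matrix multiplication for the whole batch; b is broadcast over rows.
out_vec = x @ W.T + b

print(out_vec.shape)                      # (16, 100)
```

Both variants give the same result, but only the second maps onto a single optimized matrix-multiplication call.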
Implement training and testing scripts for the MLP inside train_mlp_numpy.py. Using the default parameters provided in this file you should get an accuracy of around 0.46 on the entire test set for an MLP with one hidden layer of 100 units. Carefully go through all possible command line parameters and their possible values for running train_mlp_numpy.py. You will need to implement each of these in your code. Otherwise we cannot test your code. Provide accuracy and loss curves in your report for the default values of parameters.
2 PyTorch MLP (20 points)
PyTorch is a Python package that provides two high-level features:
- Tensor computation (like NumPy) with strong GPU acceleration
- Deep neural networks built on a tape-based autodiff system
You can also reuse your favorite Python packages such as NumPy, SciPy and Cython to extend PyTorch when needed.
There are several tutorials available for PyTorch:
- Deep Learning with PyTorch: A 60 Minute Blitz
- Learning PyTorch with Examples
- PyTorch for former Torch users
Question 2 (20 points)
Implement the MLP in the file mlp_pytorch.py by following the instructions inside the file. The interface is similar to mlp_numpy.py. Implement training and testing procedures for your model in train_mlp_pytorch.py by following the instructions inside the file. Using the same parameters as in Question 1.2, you should get similar accuracy on the test set.
Before proceeding with this question, convince yourself that your MLP implementation is correct. For this question you need to perform a number of experiments on your MLP to get familiar with several parameters and their effect on training and performance. For example you may want to try different regularization types, run your network for more iterations, add more layers, change the learning rate and other parameters as you like. Your goal is to get the best test accuracy you can. You should be able to get at least 0.52 accuracy on the test set but we challenge you to improve this. List modifications that you have tried in the report with the results that you got using them. Explain in the report how you are choosing new modifications to test. Study your best model by plotting accuracy and loss curves.
3 Custom Module: Batch Normalization (20 points)
Deep learning frameworks come with a large palette of preimplemented operations. In research it is, however, often necessary to experiment with new custom operations. As an example you will reimplement the Batch Normalization module as a custom operation in PyTorch. This can be done either by relying on automatic differentiation (Sec. 3.1) or by a manual implementation of the backward pass of the operation (Sec. 3.2).
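The batch normalization operation takes as input a minibatch consisting of B samples in $\mathbb{R}^C$, where C denotes the number of channels. It first normalizes each neuron’s value over the batch dimension to zero mean and unit variance. In order to allow for different values, it subsequently rescales and shifts the normalized values by learnable parameters $\gamma \in \mathbb{R}^C$ and $\beta \in \mathbb{R}^C$. Writing the neuron index out explicitly by a subscript, e.g. $x_i^s$, $i = 1, \dots, C$, Batch Normalization can be defined by:
- compute mean: $\mu_i = \frac{1}{B} \sum_{s=1}^{B} x_i^s$
- compute variance: $\sigma_i^2 = \frac{1}{B} \sum_{s=1}^{B} \left( x_i^s - \mu_i \right)^2$
- normalize: $\hat{x}_i^s = \frac{x_i^s - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}$, with a constant $\epsilon \ll 1$ to avoid numerical instability.
- scale and shift: $y_i^s = \gamma_i \hat{x}_i^s + \beta_i$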
Note that the notation differs from the one chosen in the original paper where the authors chose to not write out the channel index explicitly.
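The four steps can be traced in a few lines of NumPy. This is only an illustration of the definition, assuming the standard form with the biased (1/B) variance. The function name and the value of eps are hypothetical, and the assignment itself asks for a PyTorch implementation.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization for x of shape (B, C), normalizing over the batch axis."""
    mu = x.mean(axis=0)                    # per-channel mean
    var = x.var(axis=0)                    # per-channel variance, 1/B normalization
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per channel
    return gamma * x_hat + beta            # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # B = 32 samples, C = 4 channels
gamma = np.ones(4)
beta = np.zeros(4)

y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0))                      # approximately zero per channel
print(y.var(axis=0))                       # approximately one per channel
```

With gamma = 1 and beta = 0 the output is just the normalized input; other values move each channel to mean beta and standard deviation close to gamma.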
3.1 Automatic differentiation
The suggested way of joining a series of elementary operations to form a more complex computation in PyTorch is via nn.Modules. Modules implement a method forward which, when called, simply executes the elementary operations as specified in this function. The autograd functionality of PyTorch records these operations as usual such that the backpropagation works as expected. The advantage of using modules over standard objects or functions packing together these operations lies in the additional functionality which they provide. For example, modules can be associated with nn.Parameters. All parameters of a module or whole network can be easily accessed via model.parameters(), types can be changed via e.g. model.float() or parameters can be pushed to the GPU via model.cuda(), see the documentation for more information.
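A minimal sketch of these mechanics, using a hypothetical toy module (ScaleShift is not part of the assignment code) with two registered parameters:

```python
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    """Hypothetical toy module computing y = scale * x + shift."""

    def __init__(self, n_features):
        super().__init__()
        # nn.Parameter registers the tensors so that model.parameters() finds them.
        self.scale = nn.Parameter(torch.ones(n_features))
        self.shift = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        # Plain tensor operations; autograd records them as usual.
        return self.scale * x + self.shift

model = ScaleShift(4)
x = torch.randn(8, 4)
loss = model(x).sum()
loss.backward()                            # gradients are populated by autograd

names = [name for name, _ in model.named_parameters()]
print(names)                               # ['scale', 'shift']
print(model.shift.grad)                    # each entry equals the batch size, 8
```

Calling model.double() afterwards would convert both parameters to double precision in one step.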
Question 3.1 (10 points)
Implement the Batch Normalization operation as an nn.Module at the designated position in the file custom_batchnorm.py. To do this, register γ and β as nn.Parameters in the __init__ method. In the forward method, implement a check of the correctness of the input’s shape and perform the forward pass.
3.2 Manual implementation of backward pass
In some cases it is useful or even necessary to implement the backward pass of a custom operation manually. This is done in terms of torch.autograd.Functions. Autograd function objects necessarily implement a forward and a backward method. A call of a function instance records its usage in the computational graph such that the corresponding gradient computation can be performed during backpropagation. Tensors which are passed as inputs to the function will automatically get their attribute requires_grad set to False inside the scope of forward. This guarantees that the operations performed inside the forward method are not recorded by the autograd system which is necessary to ensure that the gradient computation is not done twice. Autograd functions are automatically passed a context object in the forward and backward method which can
- store tensors via ctx.save_for_backward in the forward method
- access stored tensors via ctx.saved_tensors in the backward method
- store non-tensorial constants as attributes, e.g. ctx.foo = bar
- keep track of which inputs require gradients via ctx.needs_input_grad
The forward and backward methods of a torch.autograd.Function object are typically not called manually but via the apply method which keeps track of registering the use of the function and creates and passes the context object. For more information you can read Extending PyTorch and Defining new autograd functions.
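The context-object features listed above can be seen together in a small example. The sketch below defines a hypothetical elementwise power function (Power is an illustrative name, unrelated to the Batch Norm operation asked for in this section) and calls it via apply.

```python
import torch

class Power(torch.autograd.Function):
    """Hypothetical example: y = x ** p for a fixed Python number p."""

    @staticmethod
    def forward(ctx, x, p):
        ctx.save_for_backward(x)           # tensor needed in the backward pass
        ctx.p = p                          # non-tensor constant stored as attribute
        return x ** p

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        grad_x = None
        if ctx.needs_input_grad[0]:        # skip the work if no gradient is required
            grad_x = grad_output * ctx.p * x ** (ctx.p - 1)
        # One return value per input of forward; p is not differentiable.
        return grad_x, None

x = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
y = Power.apply(x, 3)                      # call via apply, not forward
y.sum().backward()
print(x.grad)                              # 3 * x**2 -> tensor([ 3., 12., 27.])
```

The backward method returns one entry per argument of forward, with None for arguments that do not receive a gradient.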
Since we want to implement the backward pass of the Batch Norm operation manually we first need to compute its gradients.
Having calculated all necessary equations we can implement the forward and backward passes as a torch.autograd.Function. It is very important to validate the correctness of the manually implemented backward computation. This can be easily done via torch.autograd.gradcheck, which compares the analytic solution with a finite-difference approximation. It is recommended to run these checks in double precision.
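A gradient check might look as follows. The sketch uses a hypothetical custom exponential function as a stand-in for the Batch Norm function. The tolerances passed to gradcheck are illustrative values.

```python
import torch

class Exp(torch.autograd.Function):
    """Hypothetical stand-in for a custom operation with a manual backward pass."""

    @staticmethod
    def forward(ctx, x):
        y = torch.exp(x)
        ctx.save_for_backward(y)           # the output is all the backward pass needs
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        return grad_output * y             # d/dx exp(x) = exp(x)

# Double precision keeps the finite-difference error small.
x = torch.randn(5, 3, dtype=torch.float64, requires_grad=True)
ok = torch.autograd.gradcheck(Exp.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)                                  # True if analytic and numeric gradients agree
```

If the backward formula were wrong, gradcheck would raise an error describing the mismatch instead of returning True.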
Question 3.2 b) (3 points)
Implement the Batch Norm operation as a torch.autograd.Function. Make use of the context object described above. To save memory do not store tensors which are not needed in the backward operation. Do not perform unnecessary computations, that is, if the gradient w.r.t. an input of the autograd function is not required, return None for it.
Hint: If you choose to use torch.var for computing the variance be aware that this function uses Bessel’s correction by default. Since the variance of the Batch Norm operation is defined without this correction you have to set the option unbiased=False as otherwise your gradient check will fail.
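The difference between the two normalizations is easy to see on a tiny input (the numbers below are arbitrary):

```python
import torch

x = torch.tensor([[1.0, 2.0],
                  [3.0, 6.0]])             # B = 2 samples, C = 2 channels

biased = torch.var(x, dim=0, unbiased=False)   # divides by B
default = torch.var(x, dim=0)                  # Bessel's correction, divides by B - 1

print(biased)                              # tensor([1., 4.])
print(default)                             # tensor([2., 8.])
```

For small batches the two results differ noticeably, which is enough to make a gradient check against the analytic backward pass fail.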
Since the Batch Norm operation involves learnable parameters, we need to create an nn.Module which registers these as nn.Parameters and calls the autograd function in its forward method.
4 PyTorch CNN (10 points)
At this point you should have noticed that the accuracy of MLP networks is far from perfect. A more suitable type of architecture for processing image data is the CNN. Its main advantage is that it applies convolutional filters to the input images. In this part of the assignment you are going to implement a small version of the popular VGG network.
Table 1. Specification of the ConvNet architecture. All conv blocks consist of a 2D convolutional layer, followed by a Batch Normalization layer and a ReLU layer.
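One conv block as described in the caption could be assembled as sketched below. The helper name conv_block, the channel counts, and the kernel size, stride and padding are assumptions for illustration. The actual values are those given in Table 1.

```python
import torch
import torch.nn as nn

def conv_block(in_channels, out_channels):
    """Conv2d -> BatchNorm2d -> ReLU; kernel size, stride and padding are assumed."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(),
    )

block = conv_block(3, 64)                  # hypothetical channel counts
x = torch.randn(2, 3, 32, 32)              # two CIFAR-10-sized images
y = block(x)
print(y.shape)                             # torch.Size([2, 64, 32, 32])
```

With a 3×3 kernel, stride 1 and padding 1 the spatial size is preserved, so only the number of channels changes.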
Question 4 (10 points)
Implement the ConvNet specified in Table 1 inside the file convnet_pytorch.py by following the instructions inside the file. Implement training and testing procedures for your model in train_convnet_pytorch.py by following the instructions inside the file. Use the Adam optimizer with the default learning rate. Use the default PyTorch parameters to initialize convolutional and linear layers. With default parameters you should get around 0.75 accuracy on the test set. Study the model by plotting accuracy and loss curves.
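A single optimization step with Adam at its default learning rate can be sketched as follows. The linear model and the random data are placeholders standing in for the ConvNet and a CIFAR-10 batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 3)                   # placeholder for the ConvNet
optimizer = torch.optim.Adam(model.parameters())   # default learning rate 1e-3
loss_fn = nn.CrossEntropyLoss()            # expects raw scores and class indices

x = torch.randn(16, 10)                    # placeholder batch
y = torch.randint(0, 3, (16,))             # placeholder integer class labels

before = model.weight.detach().clone()
optimizer.zero_grad()                      # clear gradients from a previous step
loss = loss_fn(model(x), y)
loss.backward()                            # compute gradients
optimizer.step()                           # update the parameters
print(loss.item())
```

Note that nn.CrossEntropyLoss applies the log-softmax internally and takes class indices, so one-hot labels would first have to be converted, e.g. with argmax.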
Create a ZIP archive containing your report and all Python code. Please preserve the directory structure as provided in the GitHub repository for this assignment. Give the ZIP file the following name: lastname_assignment1.zip, where you insert your last name. Please submit your deliverable through Canvas. We cannot guarantee a grade for the assignment if the deliverables are not handed in according to these instructions.
 We are counting the output as a layer but not the input.
 In the case of CNNs, normalization is done for each channel individually with statistics computed over the batch- and spatial dimensions.