CS4803-7643: Deep Learning – HW1


Instructions
1. We will be using Gradescope to collect your assignments. Please read the following instructions for submitting to Gradescope carefully!
• For the HW1 component on Gradescope, upload a single PDF containing the answers to all the theory questions along with the completed Jupyter notebooks for the coding problems. However, the solution to each problem or subproblem must be on a separate page. When submitting to Gradescope, please make sure to mark the page(s) corresponding to each problem/sub-problem. Likewise, the pages of the Jupyter notebooks must be matched to their corresponding subproblems.
• For the HW1 Code component on Gradescope, please use the collect_submission.sh script provided and upload the resulting hw1_code.zip here. Please make sure you have saved the most recent version of your Jupyter notebook before running this script.
3. LaTeX’d solutions are strongly encouraged (solution template available at cc.gatech.edu/classes/AY2021/cs7643_fall/assets/sol1.tex), but scanned handwritten copies are acceptable. Hard copies are not accepted.
3. We generally encourage you to collaborate with other students.
1 Optimization
1. [2 points] In homework 0, we derived the gradient of the log-sum-exp function. Now we will consider a similar function – the softmax function s(z), which takes a vector input z and outputs a vector whose ith entry s_i is

s_i(z) = exp(z_i) / Σ_j exp(z_j)    (1)

The input vector z to s(·) is sometimes called the “logits”, which just means the unscaled output of previous layers. Derive the gradient of s with respect to the logits, i.e. derive ∂s_i/∂z_j. Consider re-using your work from HW0.
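For intuition (this is not part of the required derivation), the short NumPy sketch below estimates the Jacobian ∂s/∂z by central finite differences, so you can sanity-check whatever closed form you derive. The function names are illustrative, not part of the assignment’s starter code.

    import numpy as np

    def softmax(z):
        # Subtract the max for numerical stability; the output is unchanged.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def numerical_jacobian(f, z, eps=1e-6):
        # Central differences: J[i, j] ~= d f_i / d z_j.
        d = z.shape[0]
        J = np.zeros((d, d))
        for j in range(d):
            dz = np.zeros(d)
            dz[j] = eps
            J[:, j] = (f(z + dz) - f(z - dz)) / (2 * eps)
        return J

    z = np.array([1.0, -0.5, 2.0])
    print(numerical_jacobian(softmax, z))  # compare entry by entry with your analytical result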
2. [2 points] Consider a (not necessarily convex) differentiable function g : R^n → R. g has a local minimum at some w_t if there exists some γ > 0 such that for all w with ‖w − w_t‖ ≤ γ, we have g(w_t) ≤ g(w).
Prove that if g has a local minimum at some w_t, then the gradient ∇g(w_t) = 0, and that the converse is not necessarily true.
3. [3 points] Prove that if a differentiable function g : Rn → R is convex and the gradient at some w∗ is 0, then w∗ is the global minimum of g.
4. [2 points] Consider an objective function comprised of N = 2 terms:

L(w) = Σ_{i=1}^{N} f_i(w) = f_1(w) + f_2(w)    (2)
Now consider using SGD (with a batch-size B = 1) to minimize this objective. Specifically, in each iteration, we will pick one of the two terms (uniformly at random), and take a step in the direction of the negative gradient, with a constant step-size of η. You can assume η is small enough that every update does result in improvement (aka descent) on the sampled term.
Is SGD guaranteed to decrease the overall loss function in every iteration? If yes, provide a proof. If no, provide a counter-example.
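To make the sampling procedure concrete (this is not an answer to the question), here is a small Python sketch of the B = 1 update rule described above. The two component functions f_1 and f_2 are hypothetical placeholders chosen only for illustration; equation (2) does not pin them down.

    import numpy as np

    # Hypothetical component terms and their gradients, for illustration only.
    f = [lambda w: 0.5 * (w - 1.0) ** 2,   # f_1
         lambda w: 0.5 * (w + 1.0) ** 2]   # f_2
    grad = [lambda w: w - 1.0,
            lambda w: w + 1.0]

    def sgd_b1(w, eta=0.1, iters=20, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(iters):
            i = rng.integers(2)         # pick one of the two terms uniformly at random
            w = w - eta * grad[i](w)    # descend on the sampled term only
            total = f[0](w) + f[1](w)   # overall objective after the step
            print(f"sampled term {i}, w = {w:.4f}, L(w) = {total:.4f}")
        return w

    sgd_b1(w=3.0)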
5. [3 points: Extra credit for both 4803 and 7643] Recall that a (d − 1)-dimensional simplex is defined as:

Δ^{d−1} = {p ∈ R^d | 1^T p = 1 and p ≥ 0}    (3)
In this question, you will develop an interpretation of softmax as a projection operator – that it projects an arbitrary point x ∈ R^d onto the interior of the (d − 1)-simplex. Specifically, let s(x) denote the softmax function (as defined above). Now prove that,
s(x) = argmin_{y ∈ R^d}  −x^T y − H(y)    (4)
       s.t.  1^T y = 1,  0 ≤ y ≤ 1        (5)
where H is the entropy function:
H(y) = −Σ_{i=1}^{d} y_i log y_i    (6)
Now, what does this formal interpretation tell you about the softmax layer in a neural network? Hint: Look at the KKT conditions.
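As a numerical sanity check on the claim (not a proof), the sketch below solves the constrained problem in (4)–(5) with SciPy and compares the minimizer against s(x). The small floor on y inside the objective avoids log(0); it is an implementation convenience, not part of the problem statement.

    import numpy as np
    from scipy.optimize import minimize

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def objective(y, x):
        # -x^T y - H(y), with H(y) = -sum_i y_i log y_i.
        y = np.clip(y, 1e-12, 1.0)  # avoid log(0)
        return -x @ y + (y * np.log(y)).sum()

    d = 4
    x = np.array([0.5, -1.0, 2.0, 0.0])
    res = minimize(objective, np.full(d, 1.0 / d), args=(x,),
                   method="SLSQP",
                   bounds=[(0.0, 1.0)] * d,
                   constraints=[{"type": "eq", "fun": lambda y: y.sum() - 1.0}])

    print(res.x)       # numerical minimizer of (4)-(5)
    print(softmax(x))  # should agree up to solver tolerance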
2 Directed Acyclic Graphs (DAG)
One important property for a feed-forward network, as discussed in the lectures, is that it must be a directed acyclic graph (DAG). Recall that a DAG is a directed graph that contains no directed cycles. We will study some of its properties in this question.
Let’s define a graph G = (V, E), in which V = {v1, v2, …, vn} is the set of nodes and E is the set of directed edges.
A topological order of a directed graph G = (V, E) is an ordering of its nodes as v1, v2, …, vn so that for every edge (vi, vj) we have i < j.
Several lemmas can be inferred from the definition of a DAG. One such lemma is: if G is a DAG, then G has a node with no incoming edges.
Given the above lemma, prove the following two lemmas:
6. [3 points] If the graph G is a DAG, then G has a topological ordering.
7. [3 points] If the graph G has a topological order, then G is a DAG.
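These two lemmas are what make a topological-sort routine well defined on DAGs. The sketch below is a standard Kahn-style ordering over a small adjacency-list graph chosen here purely for illustration; it is meant to make the definitions concrete, while your proofs should argue from the lemma stated above.

    from collections import deque

    def topological_order(adj):
        # adj maps each node to the list of nodes it has an edge to.
        indegree = {v: 0 for v in adj}
        for v in adj:
            for u in adj[v]:
                indegree[u] += 1
        queue = deque(v for v in adj if indegree[v] == 0)  # nodes with no incoming edges
        order = []
        while queue:
            v = queue.popleft()
            order.append(v)
            for u in adj[v]:
                indegree[u] -= 1
                if indegree[u] == 0:
                    queue.append(u)
        if len(order) != len(adj):
            raise ValueError("graph has a directed cycle; no topological order exists")
        return order

    # Example DAG: v1 -> v2 -> v4 and v1 -> v3 -> v4
    print(topological_order({"v1": ["v2", "v3"], "v2": ["v4"], "v3": ["v4"], "v4": []}))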
3 Paper Review [Extra credit for 4803, regular credit for 7643]
The paper presents an interesting proposition that, through a series of experiments, re-examines some fundamental notions about neural networks – in particular, the comparative importance of architectures and weights in a network’s predictive performance.
The paper can be viewed here. There’s also a helpful interactive webpage with intuitive visualizations to help you understand its key concepts better. The evaluation rubric for this section is as follows:
8. [2 points] Briefly summarize the key contributions, strengths and weaknesses of this paper.
9. [2 points] What is your personal takeaway from this paper? This could be expressed either in terms of relating the approaches adopted in this paper to your traditional understanding of learning parameterized models, or potential future directions of research in the area which the authors haven’t addressed, or anything else that struck you as being noteworthy.
Guidelines: Please restrict your reviews to no more than 350 words (total length for answers to both the above questions).
4 Implement and train a network on CIFAR-10
Setup Instructions: Before attempting this question, look at the setup instructions linked here.
10. [Up to 20 points] Now, we will learn how to implement a softmax classifier and vanilla neural networks (or Multi-Layer Perceptrons). You will begin by writing the forward and backward passes for different types of layers, and then go on to train a shallow neural network on the CIFAR-10 dataset in Python. Next you will learn to use PyTorch, a popular open-source deep learning framework, and use it to replicate the experiments from before.
Refer to the instructions linked here to get started.
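Independently of the provided starter code (whose exact structure the linked instructions define), a minimal PyTorch sketch of the kind of two-layer network you will end up training on CIFAR-10 looks roughly like the following. The layer sizes and hyperparameters are illustrative placeholders, not the assignment’s required values.

    import torch
    import torch.nn as nn

    # Illustrative two-layer MLP for 32x32x3 CIFAR-10 images (10 classes).
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(3 * 32 * 32, 256),  # hidden width chosen arbitrarily for this sketch
        nn.ReLU(),
        nn.Linear(256, 10),
    )
    criterion = nn.CrossEntropyLoss()  # softmax + negative log-likelihood
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    # One training step on a dummy batch; in the assignment the batch would come
    # from a CIFAR-10 DataLoader instead.
    images = torch.randn(8, 3, 32, 32)
    labels = torch.randint(0, 10, (8,))
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    print(loss.item())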
