Description
Visualizing Gradients
- On the left is a 3D plot of f(x, y) = … + (y − 3)². On the right is a plot of its gradient field. Note that the arrows show the relative magnitudes of the gradient vectors.
- From the visualization, what do you think is the minimal value of this function and where does it occur?
- Calculate the gradient ∇f.
- When ∇f = 0, what are the values of x and y?
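The gradient and its critical point can be checked symbolically. The function below is an assumption for illustration only: the first term of f is unreadable in the original, so we take a paraboloid of the form f(x, y) = (x − 2)² + (y − 3)² as a stand-in.

```python
import sympy as sp

# Assumed example function (the worksheet's exact f is partially illegible).
x, y = sp.symbols('x y')
f = (x - 2)**2 + (y - 3)**2

grad = [sp.diff(f, x), sp.diff(f, y)]   # the gradient, one partial per variable
critical = sp.solve(grad, [x, y])       # set every partial to 0 and solve
print(grad)                             # [2*x - 4, 2*y - 6]
print(critical, f.subs(critical))       # {x: 2, y: 3} 0
```

For this assumed f, the minimum value 0 occurs where both partials vanish, which matches reading the "flat spot" off the 3D plot.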
Gradient Descent Algorithm
- Given the following loss function, the data x = [x_1, …, x_n] and y = [y_1, …, y_n], and the current parameter θ_t, explicitly write out the update equation for θ_{t+1} in terms of x_i, y_i, θ_t, and α, where α is the constant learning rate.
L(θ, x_i, y_i) = −(… − log(y_i))
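Whatever the loss is, the update has the same shape: step against the gradient, scaled by α. A minimal sketch, using an illustrative average squared-error loss (an assumption, not the worksheet's loss):

```python
import numpy as np

# Generic gradient descent: theta_{t+1} = theta_t - alpha * grad L(theta_t).
def gradient_descent(grad_fn, theta0, alpha=0.1, steps=100):
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad_fn(theta)
    return theta

# Illustrative loss L(theta) = mean((y_i - theta * x_i)**2); its gradient
# with respect to theta is mean(-2 * x_i * (y_i - theta * x_i)).
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
grad = lambda theta: np.mean(-2 * x * (y - theta * x))

print(gradient_descent(grad, theta0=0.0))  # converges to ~2.0
```

The worksheet's answer should have exactly this form, with the gradient of the given L substituted for `grad_fn`.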
Convexity
- Convexity allows optimization problems to be solved more efficiently and guarantees that a global optimum can be found. Mainly, it gives us a nice way to minimize loss (i.e., gradient descent). There are three ways to informally define convexity.
- Walking in a straight line between two points on the function keeps you at or above the function. This definition applies to any function, even one that is not differentiable.
- The tangent line at any point lies at or below the function, globally. To use this definition, the function must be differentiable.
- The second derivative is non-negative everywhere (in other words, the function is “concave up” everywhere). To use this definition, the function must be twice differentiable.
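The first definition can be spot-checked numerically; a minimal sketch for f(x) = x², which is convex (the choice of f and the sampled points are illustrative, not from the worksheet):

```python
import numpy as np

# For many random pairs (a, b) and mixing weights c in [0, 1], verify that
# the chord c*f(a) + (1-c)*f(b) lies at or above the curve f(c*a + (1-c)*b).
f = lambda x: x**2
rng = np.random.default_rng(0)
a = rng.uniform(-10, 10, 1000)
b = rng.uniform(-10, 10, 1000)
c = rng.uniform(0, 1, 1000)   # position along each chord

chord = c * f(a) + (1 - c) * f(b)   # height of the straight line
curve = f(c * a + (1 - c) * b)      # height of the function
print(bool(np.all(curve <= chord + 1e-9)))  # True for a convex f
```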
Is the function described in Question 1 convex? Make an argument visually.
GPA Descent
- Consider the following non-linear model with two parameters:
f_θ(x) = θ_0 · 0.5 + θ_0 · θ_1 · x_1 + sin(θ_1) · x_2
For some nonsensical reason, we decide to use the residuals of our model as the loss function.
That is, the loss for a single observation is
L(θ_0, θ_1, x, y) = y − f_θ(x)
- We want to use gradient descent to determine the optimal model parameters, θ_0 and θ_1.
- Suppose we have just one observation in our training data, (x_1 = 1, x_2 = …). Assume that we set the learning rate α to 1. An incomplete version of the gradient descent update equations for θ_0 and θ_1 is shown below. θ_0^(t) and θ_1^(t) denote the guesses for θ_0 and θ_1 at timestep t, respectively.
θ_0^(t+1) = θ_0^(t) − A
θ_1^(t+1) = θ_1^(t) − B
Express both A and B in terms of θ_0^(t), θ_1^(t), and any necessary constants.
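The partial derivatives behind A and B can be checked symbolically. The sketch below keeps x_1, x_2, and y symbolic, since the observation's values are only partially legible; it assumes the single-observation loss L = y − f_θ(x) stated above.

```python
import sympy as sp

# Model f_theta(x) = theta0*0.5 + theta0*theta1*x1 + sin(theta1)*x2,
# loss L = y - f_theta(x); take the partial derivative in each parameter.
theta0, theta1, x1, x2, y = sp.symbols('theta0 theta1 x1 x2 y')
f = theta0 * sp.Rational(1, 2) + theta0 * theta1 * x1 + sp.sin(theta1) * x2
L = y - f

print(sp.diff(L, theta0))  # equals -(1/2 + theta1*x1)
print(sp.diff(L, theta1))  # equals -(theta0*x1 + cos(theta1)*x2)
```

With α = 1, A and B are exactly these partials evaluated at (θ_0^(t), θ_1^(t)) and the observed data.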
- Assume we initialize both θ_0^(0) and θ_1^(0) to 0. Determine θ_0^(1) and θ_1^(1) (i.e. the guesses for θ_0 and θ_1 after one iteration of gradient descent).
- What happens to θ_0 as t → ∞ (i.e. as we run more and more iterations of gradient descent)?
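One way to build intuition here is to simulate the updates numerically. The observation values below are an assumption (x_1 = 1, x_2 = 1, y = 0), since the worksheet's exact numbers are not fully legible; the qualitative behavior is the point.

```python
import math

# Run the alpha = 1 updates from (0, 0) on the residual loss L = y - f_theta(x).
x1, x2, y = 1.0, 1.0, 0.0   # assumed observation, for illustration
theta0, theta1 = 0.0, 0.0
for t in range(10):
    g0 = -(0.5 + theta1 * x1)                    # dL/d(theta0)
    g1 = -(theta0 * x1 + math.cos(theta1) * x2)  # dL/d(theta1)
    theta0, theta1 = theta0 - g0, theta1 - g1
    print(t, round(theta0, 3), round(theta1, 3))
# theta0 keeps growing without bound: a raw residual can be driven to -infinity,
# so this "loss" has no minimum and gradient descent never settles.
```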