CS 189 Introduction to Machine Learning: HW4


Deliverables:

  1. Submit your predictions for the test sets to Kaggle as early as possible. Include your Kaggle scores in your write-up (see below). The Kaggle competition for this assignment can be found at:
  2. Submit a PDF of your homework, with an appendix listing all your code, to the Gradescope assignment entitled “Homework 4 Write-Up”. In addition, please include, as your solutions to each coding problem, the specific subset of code relevant to that part of the problem. You may typeset your homework in LaTeX or Word (submit PDF format, not .doc/.docx format) or submit neatly handwritten and scanned solutions. Please start each question on a new page. If there are graphs, include those graphs in the correct sections. Do not put them in an appendix. We need each solution to be self-contained on pages of its own.
    • In your write-up, please state with whom you worked on the homework.
    • In your write-up, please copy the following statement and sign your signature next to it. (Mac Preview and FoxIt PDF Reader, among others, have tools to let you sign a PDF file.) We want to make it extra clear so that no one inadvertently cheats. “I certify that all solutions are entirely in my own words and that I have not looked at another student’s solutions. I have given credit to all external sources I consulted.”
  3. Submit all the code needed to reproduce your results to the Gradescope assignment entitled “Homework 4 Code”. Yes, you must submit your code twice: once in your PDF write-up, following the directions described above, so the readers can easily read it, and once in compilable/interpretable form so the readers can easily run it. Do NOT include any data files we provided. Please include a short file named README listing your name, student ID, and instructions on how to reproduce your results. Please take care that your code doesn’t take up inordinate amounts of time or memory. If your code cannot be executed, your solution cannot be verified.

1          Logistic Regression with Newton’s Method

Consider sample points $X_1, X_2, \ldots, X_n \in \mathbb{R}^d$ and associated values $y_1, y_2, \ldots, y_n \in \{0, 1\}$, an $n \times d$ design matrix $X = [X_1 \; \cdots \; X_n]^\top$, and an $n$-vector $y = [y_1 \; \cdots \; y_n]^\top$.

If we add $\ell_2$-regularization to logistic regression, the cost function is

$$J(w) = \lambda \, \|w\|^2 - \sum_{i=1}^{n} \Big( y_i \ln s_i + (1 - y_i) \ln(1 - s_i) \Big)$$

where $s_i = s(X_i \cdot w)$, $s(\gamma) = 1/(1 + e^{-\gamma})$, and $\lambda > 0$ is the regularization parameter. As in lecture, the vector $s = [s_1 \; \cdots \; s_n]^\top$ is a useful shorthand.

In this problem, you will use Newton’s method to minimize this cost function on the four-point, two-dimensional training set

                                         

X = 1010 1331,            y = 1001.

You may want to draw these points on paper to see what they look like. The y-vector implies that the first two sample points are in class 1, and the last two are in class 0.

These sample points cannot be separated by a linear decision boundary that passes through the origin. As described in lecture, append a 1 to each Xi vector and use a weight vector w ∈ R3 whose last component is the bias term (the term we call α in lecture).

  1. Derive the gradient of the cost function $J(w)$. Your final answer should be a simple matrix-vector expression. While you may derive the matrix-vector form by first deriving the components of the gradient vector, do NOT write your answer in terms of these individual components.
  2. Derive the Hessian of J(w). Again, your answer should be a simple matrix-vector expression.
  3. State the update equation for one iteration of Newton’s method for this problem.

  4. We are given a regularization parameter of $\lambda = 0.07$ and a starting point of $w^{(0)} = [-2 \;\; 1 \;\; 0]^\top$. For the following four parts, you need only state the final solution. Thus you may derive the solution by hand or implement Newton’s algorithm and report the final result. If you do the latter, you do not need to submit code for this part.
    • State the value of $s^{(0)}$ (the value of $s$ before any iterations).
    • State the value of $w^{(1)}$ (the value of $w$ after one iteration).
    • State the value of $s^{(1)}$.
    • State the value of $w^{(2)}$ (the value of $w$ after two iterations).
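If you choose to implement rather than derive by hand, a minimal sketch of the iteration is below. It assumes the standard forms of the gradient and Hessian for this cost, $\nabla J(w) = 2\lambda w - X^\top (y - s)$ and $\nabla^2 J(w) = X^\top \Omega X + 2\lambda I$ with $\Omega = \operatorname{diag}(s_i(1 - s_i))$, which are exactly what parts 1 and 2 ask you to derive, so verify them yourself. The data matrix follows the reconstruction above; check it against your copy of the handout.

```python
import numpy as np

# Training set from the problem, with a fictitious 1 appended to each X_i
# (the last weight component is the bias term alpha).
X = np.array([[0., 3., 1.],
              [1., 3., 1.],
              [0., 1., 1.],
              [1., 1., 1.]])
y = np.array([1., 1., 0., 0.])
lam = 0.07                        # regularization parameter lambda
w = np.array([-2., 1., 0.])       # starting point w^(0)

def sigmoid(g):
    return 1.0 / (1.0 + np.exp(-g))

for it in range(2):
    s = sigmoid(X @ w)                                   # s^(it)
    grad = 2 * lam * w - X.T @ (y - s)                   # gradient of J
    hess = X.T @ np.diag(s * (1 - s)) @ X + 2 * lam * np.eye(len(w))
    w = w - np.linalg.solve(hess, grad)                  # one Newton step
    print(f"s({it}) = {s},  w({it + 1}) = {w}")
```

Two passes through the loop print $s^{(0)}$, $w^{(1)}$, $s^{(1)}$, and $w^{(2)}$. Solving the linear system with `np.linalg.solve` avoids forming the Hessian inverse explicitly.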

2          $\ell_1$- and $\ell_2$-Regularization

Consider sample points $X_1, X_2, \ldots, X_n \in \mathbb{R}^d$ and associated values $y_1, y_2, \ldots, y_n \in \mathbb{R}$, an $n \times d$ design matrix $X = [X_1 \; \cdots \; X_n]^\top$ and an $n$-vector $y = [y_1 \; \cdots \; y_n]^\top$. For the sake of simplicity, assume that the sample data has been centered and whitened so that each feature has mean 0 and variance 1 and the features are uncorrelated; i.e., $X^\top X = nI$. For this question, we will not use a fictitious dimension nor a bias term; our linear regression function will output zero for $x = 0$.

Consider linear least-squares regression with regularization in the $\ell_1$-norm, also known as Lasso. The Lasso cost function is

$$J(w) = \|Xw - y\|^2 + \lambda \, \|w\|_1$$

where $w \in \mathbb{R}^d$ and $\lambda > 0$ is the regularization parameter. Let $w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d} J(w)$ denote the weights that minimize the cost function.

In the following steps, we will show that whitened training data decouples the features, so that $w^*_i$ is determined by the $i$th feature alone (i.e., column $i$ of the design matrix $X$), regardless of the other features. This is true for both Lasso and ridge regression.

  1. We use the notation $X_i$ to denote column $i$ of the design matrix $X$, which represents the $i$th feature. Write $J(w)$ in the following form for appropriate functions $g$ and $f$:

$$J(w) = g(y) + \sum_{i=1}^{d} f(X_i, w_i, y, \lambda)$$
  2. If $w^*_i > 0$, what is the value of $w^*_i$?
  3. If $w^*_i < 0$, what is the value of $w^*_i$?
  4. Considering parts 2 and 3, what is the condition for $w^*_i$ to be zero?
  5. Now consider ridge regression, which uses the $\ell_2$ regularization term $\lambda \, \|w\|^2$. How does this change the function $f(\cdot)$ from part 1? What is the new condition in which $w^*_i = 0$? How does it differ from the condition you obtained in part 4?
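A hint for part 1 (a sketch under the whitening assumption, not a full solution): expanding the squared norm and substituting $X^\top X = nI$ eliminates all cross terms between features:

$$\|Xw - y\|^2 = w^\top X^\top X w - 2\, w^\top X^\top y + \|y\|^2 = \|y\|^2 + \sum_{i=1}^{d} \big( n w_i^2 - 2 w_i X_i^\top y \big),$$

and the penalty $\lambda \|w\|_1 = \lambda \sum_{i=1}^{d} |w_i|$ is already a sum over coordinates, so $J(w)$ splits into a per-feature sum plus a term that depends only on $y$.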

3          Regression and Dual Solutions

  1. For a vector $w$, derive $\nabla_w \|w\|^4$. Then derive $\nabla_w \|Xw - y\|^4$.
  2. Consider sample points $X_1, X_2, \ldots, X_n \in \mathbb{R}^d$ and associated values $y_1, y_2, \ldots, y_n \in \mathbb{R}$, an $n \times d$ design matrix $X = [X_1 \; \cdots \; X_n]^\top$, an $n$-vector $y = [y_1 \; \cdots \; y_n]^\top$, and the regularized regression problem

$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d} \; \|Xw - y\|^4 + \lambda \, \|w\|^2,$$

which is similar to ridge regression, but we take the fourth power of the error instead of the squared error. (It is not possible to write the optimal solution $w^*$ as the solution of a system of linear equations, but it can be found by gradient descent or Newton’s method.)

Show that the optimum $w^*$ is unique. By setting the gradient of the objective function to zero, show that $w^*$ can be written as a linear combination $w^* = \sum_{i=1}^{n} a_i X_i$ for some scalars $a_1, \ldots, a_n$. Write the vector $a$ of dual coefficients in terms of $X$, $y$, $\lambda$, and the optimal solution $w^*$.

  3. Consider the regularized regression problem

$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} L(w^\top X_i, y_i) + \lambda \, \|w\|^2$$

where the cost function $L$ is convex in its first argument. Prove that the optimal solution has the form $w^* = \sum_{i=1}^{n} a_i X_i$. If the cost function is not convex, does the optimal solution always have the form $w^* = \sum_{i=1}^{n} a_i X_i$? Justify your answer.
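For part 1, the chain rule gives $\nabla_w \|w\|^4 = 4\|w\|^2 w$, and likewise $\nabla_w \|Xw - y\|^4 = 4\|Xw - y\|^2 \, X^\top (Xw - y)$. The snippet below is a finite-difference sanity check of the second identity on random data (an optional sketch, not part of the required submission; the test data and names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))      # arbitrary test data
y = rng.normal(size=n)
w = rng.normal(size=d)

def f(w):
    r = X @ w - y
    return (r @ r) ** 2          # |Xw - y|^4

r = X @ w - y
analytic = 4 * (r @ r) * (X.T @ r)

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
print(np.allclose(analytic, numeric, atol=1e-4))   # expect True
```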

4          Wine Classification with Logistic Regression

The wine dataset given to you as part of the homework in data.mat consists of 6,497 data points, each having 12 features. The description of these features is provided in data.mat. The dataset includes a training set of 6,000 data points and a test set of 497 data points. Your classifier needs to predict whether a wine is white (class label 0) or red (class label 1).

For this homework, you need to do the following.

  1. Derive and write down the batch gradient descent update equation for logistic regression with $\ell_2$ regularization. (Not Newton’s method, but the slower batch gradient descent. A code sketch for this and the following parts appears after this problem’s instructions.)

Choose a reasonable regularization parameter value and a reasonable learning rate. Run your algorithm and plot the cost function as a function of the number of iterations. (As this is batch descent, one “iteration” should use every sample point once.)

  2. Derive and write down the stochastic gradient descent update equation for logistic regression with $\ell_2$ regularization. Choose a suitable learning rate. Run your algorithm and plot the cost function as a function of the number of iterations, where now each “iteration” uses just one sample point.

Comment on the differences between the convergence of batch and stochastic gradient descent.

  3. Instead of a constant learning rate $\epsilon$, repeat part 2 with a learning rate that decreases as $\epsilon \propto 1/t$ for the $t$th iteration. Plot the cost function vs. the number of iterations. Is this strategy better than having a constant $\epsilon$?
  4. Finally, train your classifier on the entire training set. Submit your results to Kaggle. Your classifier, when given the test points, should output a CSV file. (There is a sample one on Kaggle.) You’ll upload this CSV file to Kaggle, where it’ll be scored with both a public test set and a private test set. You will be able to see only your public score. You can only submit twice per day, so get started early! In your write-up, for this problem, report your Kaggle username, the best score you achieved on Kaggle, and a short description of what you did to achieve that score.

IMPORTANT: Do NOT use any software package for logistic regression that you didn’t write yourself!
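In that spirit, here is a from-scratch sketch of both update rules. This is a minimal sketch only: the gradient $\nabla J(w) = 2\lambda w - X^\top (y - s)$ is the one you are asked to derive in part 1, and the function names, default hyperparameters, and the choice to apply the full regularization gradient at each stochastic step are assumptions of mine, not the official solution.

```python
import numpy as np

def sigmoid(g):
    return 1.0 / (1.0 + np.exp(-g))

def batch_gd(X, y, lam=0.1, eps=1e-4, iters=1000):
    """Batch gradient descent for l2-regularized logistic regression.

    One "iteration" uses every sample point once (the full gradient).
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        s = sigmoid(X @ w)
        w -= eps * (2 * lam * w - X.T @ (y - s))
    return w

def sgd(X, y, lam=0.1, eps0=1e-3, iters=10000, decay=False):
    """Stochastic gradient descent; each step uses one random sample point.

    With decay=True the learning rate follows the 1/t schedule of part 3.
    """
    rng = np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, iters + 1):
        i = rng.integers(n)
        s_i = sigmoid(X[i] @ w)
        eps = eps0 / t if decay else eps0
        w -= eps * (2 * lam * w - (y[i] - s_i) * X[i])
    return w
```

For part 4, predictions can then be written out with `np.savetxt`, matching the column layout of the sample CSV on Kaggle.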

5          Real World Spam Classification

Motivation: After taking CS 189 or CS 289A, students should be able to wrestle with “real-world” data and problems. These issues might be deeply technical and require a theoretical background, or might demand specific domain knowledge. Here is an example that a past TA encountered.

Daniel (a past CS 189 TA) interned as an anti-spam product manager for an email service provider. His company uses a linear SVM to predict whether an incoming email message is spam or ham. He notices that the number of spam messages received tends to spike upwards a few minutes before and after midnight. Eager to obtain a return offer, he adds the timestamp of the received message, stored as the number of milliseconds since the previous midnight, to each feature vector for the SVM to train on, in hopes that the ML model will identify the abnormal spike in spam volume at night. To his dismay, after testing with the new feature, Daniel discovers that the linear SVM’s success rate barely improves.

Why can’t the linear SVM utilize the new feature well, and what can Daniel do to improve his results? Daniel is unfortunately limited to only a quadratic kernel. This is an actual interview question Daniel received for a machine learning engineering position!

Write a short explanation. This question is open ended and there can be many correct answers.
