CptS 570 – Machine Learning – Homework #2

NOTE 1: Please use word processing software (e.g., Microsoft Word or LaTeX) to write your answers. The rationale is that hand-written answers are sometimes hard to read and understand. Thanks for your understanding.

NOTE 2: Please ensure that all the graphs are appropriately labeled (x-axis, y-axis, and each curve). The caption or heading of each graph should be informative and self-contained.

1           Analytical Part (3 percent grade)

This part will be graded as a PASS or FAIL.

  1. Suppose we have n+ positive training examples and n− negative training examples. Let C+ be the center of the positive examples and C− be the center of the negative examples, i.e., C+ = (1/n+) Σ_{i: yi = +1} xi and C− = (1/n−) Σ_{i: yi = −1} xi. Consider a simple classifier called CLOSE that classifies a test example x by assigning it to the class whose center is closest. (A small illustrative sketch of this rule appears after the bullets below.)

  • Show that the decision boundary of the CLOSE classifier is a linear hyperplane of the form sign(w · x + b). Compute the values of w and b in terms of C+ and C−.
  • Recall that the weight vector can be written as a linear combination of all the training examples: w = Σ_i αi yi xi. Compute the dual weights (α's). How many of the training examples are support vectors?
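The following is a minimal NumPy sketch of the CLOSE rule defined above, purely for illustration (the arrays `X_pos` and `X_neg` are hypothetical placeholders for the positive and negative training examples); it is not the derivation the bullets ask for.

```python
import numpy as np

def close_predict(X_pos, X_neg, x):
    """Classify x by the nearer class center (the CLOSE rule)."""
    c_pos = X_pos.mean(axis=0)  # C+ : center of the positive examples
    c_neg = X_neg.mean(axis=0)  # C- : center of the negative examples
    # Assign x to the class whose center is closest.
    return +1 if np.linalg.norm(x - c_pos) <= np.linalg.norm(x - c_neg) else -1
```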
  2. Suppose we use the following radial basis function (RBF) kernel: K(xi, xj) = exp(−γ‖xi − xj‖²), which has some implicit unknown mapping φ(x).
    • Prove that the mapping φ(x) corresponding to the RBF kernel has infinite dimensions.
    • Prove that for any two input examples xi and xj, the squared Euclidean distance between their images in the higher-dimensional space defined by φ is at most 2, i.e., ‖φ(xi) − φ(xj)‖² ≤ 2.
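As a quick numerical sanity check of the second bullet (assuming the RBF kernel form K(xi, xj) = exp(−γ‖xi − xj‖²) given above, with γ a free parameter), the squared feature-space distance can be computed from kernel values alone via ‖φ(xi) − φ(xj)‖² = K(xi, xi) + K(xj, xj) − 2 K(xi, xj). The snippet below only illustrates the bound; it is not the requested proof.

```python
import numpy as np

def rbf(x_i, x_j, gamma=1.0):
    """RBF kernel value K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

def feature_space_sq_dist(x_i, x_j, gamma=1.0):
    """||phi(x_i) - phi(x_j)||^2 expressed purely through kernel evaluations."""
    return rbf(x_i, x_i, gamma) + rbf(x_j, x_j, gamma) - 2.0 * rbf(x_i, x_j, gamma)

rng = np.random.default_rng(0)
for _ in range(5):
    a, b = rng.normal(size=3), rng.normal(size=3)
    print(feature_space_sq_dist(a, b))  # always falls in [0, 2)
```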
  3. The decision boundary of an SVM with a kernel function (via the implicit feature mapping φ(·)) is defined as follows:

w · φ(x) + b = Σ_{i ∈ SV} yi αi K(xi, x) + b = f(x; α, b),

where w and b are the parameters of the decision boundary in the feature space φ defined by the kernel function K, SV is the set of support vectors, and αi is the dual weight of the i-th support vector.

Let us assume that we use the radial basis function (RBF) kernel K(xi, xj) = exp(−γ‖xi − xj‖²); also assume that the training examples are linearly separable in the feature space φ and that the SVM finds a decision boundary that perfectly separates the training examples.

Suppose we choose a testing example xfar that is far away from every training example xi (distance measured in the original feature space ℝ^d). Prove that f(xfar; α, b) ≈ b.
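To see this claim numerically under the same assumptions, one can evaluate f(x; α, b) at a point far from every support vector: each RBF kernel term exp(−γ‖xi − x‖²) shrinks toward zero, leaving only b. The snippet below uses made-up (hypothetical) support vectors, labels, and dual weights purely for illustration; it is not a proof.

```python
import numpy as np

def rbf(x_i, x_j, gamma=1.0):
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

def f(x, support_vectors, labels, alphas, b, gamma=1.0):
    """f(x; alpha, b) = sum over support vectors of y_i * alpha_i * K(x_i, x), plus b."""
    return sum(y_i * a_i * rbf(x_i, x, gamma)
               for x_i, y_i, a_i in zip(support_vectors, labels, alphas)) + b

sv = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]   # hypothetical support vectors
labels, alphas, b = [+1, -1], [0.7, 0.7], 0.3        # hypothetical dual weights and bias

x_far = np.array([1e3, 1e3])            # far from every support vector
print(f(x_far, sv, labels, alphas, b))  # prints approximately b = 0.3
```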

  4. The function K(xi, xj) = −⟨xi, xj⟩ is claimed to be a valid kernel. Prove or disprove this claim.
  5. You are provided with n training examples: (x1, y1), (x2, y2), …, (xn, yn), where xi is the input example and yi is the class label (+1 or −1). The teacher gave you some additional information by specifying the costs of different mistakes, C+ and C−, where C+ and C− stand for the cost of misclassifying a positive and a negative example, respectively. How will you modify the soft-margin SVM formulation to leverage this extra information? Please justify your answer.
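For context only: one practical way to encode asymmetric misclassification costs is to scale the soft-margin penalty per class, which scikit-learn exposes through the `class_weight` argument of `SVC`. This is shown below as a hedged illustration with placeholder cost values; the question itself asks for the modified analytical formulation, which you should write out explicitly.

```python
from sklearn.svm import SVC

C_plus, C_minus = 5.0, 1.0  # hypothetical costs of misclassifying +1 / -1 examples

# class_weight multiplies the soft-margin parameter C for each class, so slack on
# positive examples is penalized C_plus/C_minus times more heavily than on negatives.
clf = SVC(kernel="linear", C=1.0, class_weight={+1: C_plus, -1: C_minus})
# clf.fit(X_train, y_train)  # X_train, y_train assumed to be loaded elsewhere
```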
  6. You are provided with a set of n training examples: (x1, y1), (x2, y2), …, (xn, yn), where xi is the input example and yi is the class label (+1 or −1). Suppose n is very large (say, on the order of millions). In this case, standard SVM training algorithms will not scale due to the large training set.

Tom wants to devise a solution based on the "Coarse-to-Fine" framework of problem solving. The basic idea is to cluster the training data; train an SVM classifier on the clusters (coarse problem); refine the clusters as needed (fine problem); perform training on the finer problem; and repeat until convergence. Suppose we start with k+ positive clusters and k− negative clusters (a cluster is defined as a set of examples). Please specify the mathematical formulation (define all the variables used in your formulation) and a concrete algorithm for each of the following steps to instantiate this idea (a hypothetical sketch of the coarse step appears after the optional question below):

    1. How to define the SVM training formulation for a given level of coarseness: a set of k+ positive clusters and a set of k− negative clusters?
    2. How to refine the clusters based on the resulting SVM classifier?
    3. What is the stopping criterion?

Optional question: For what kind of problems will this solution fail?
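Purely as a hypothetical illustration of the coarse step referenced in the question (clustering each class and training an SVM on the cluster representatives only), and not the formulation or algorithm you are asked to derive:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def coarse_svm(X_pos, X_neg, k_pos, k_neg):
    """Train an SVM on k_pos positive and k_neg negative cluster centers only."""
    centers_pos = KMeans(n_clusters=k_pos, n_init=10).fit(X_pos).cluster_centers_
    centers_neg = KMeans(n_clusters=k_neg, n_init=10).fit(X_neg).cluster_centers_
    X_coarse = np.vstack([centers_pos, centers_neg])
    y_coarse = np.array([+1] * k_pos + [-1] * k_neg)
    return SVC(kernel="linear").fit(X_coarse, y_coarse)
```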

  7. You are provided with a set of n training examples: (x1, y1), (x2, y2), …, (xn, yn), where xi is the input example and yi is the class label (+1 or −1). Suppose n is very large (say, on the order of millions). In this case, online kernelized Perceptron algorithms will not scale if the number of allowed support vectors is unbounded.
    1. Suppose you have trained using the kernelized Perceptron algorithm (without any bound on the number of support vectors) and obtained a set of support vectors SV. Tom wants to use this classifier for real-time prediction and cannot afford more than B kernel evaluations for each classification decision. Please give an algorithm to select B support vectors from SV. You need to motivate your design choices in order to convince Tom to use your solution.
    2. Tom wants to train using the kernelized Perceptron algorithm, but wants to use at most B support vectors during the training process. Please modify the standard kernelized Perceptron training algorithm (from the class slides) for this new setting. You need to motivate your design choices in order to convince Tom to use your solution.
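For orientation, here is a bare-bones sketch of a kernelized Perceptron that never keeps more than B support vectors, using a polynomial kernel as an example. The eviction rule (which support vector to drop once the budget is exceeded) is exactly what part 2 asks you to design and justify, so the `sv.pop(0)` line below is only a placeholder, not a recommended answer.

```python
import numpy as np

def poly_kernel(x_i, x_j, degree=3):
    return (1.0 + np.dot(x_i, x_j)) ** degree

def budget_kernel_perceptron(X, y, B, epochs=1, kernel=poly_kernel):
    """Kernelized Perceptron keeping at most B support vectors (eviction rule is a placeholder)."""
    sv = []  # list of (x_i, y_i) pairs; each mistake adds one support vector
    for _ in range(epochs):
        for x, label in zip(X, y):
            score = sum(y_i * kernel(x_i, x) for x_i, y_i in sv)
            if label * score <= 0:   # mistake: add the example as a support vector
                sv.append((x, label))
                if len(sv) > B:
                    sv.pop(0)        # placeholder eviction: drop the oldest support vector
    return sv
```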

2           Programming and Empirical Analysis Part (5 percent grade)

  1. Empirical analysis question. You can use a publicly available SVM classifier implementation for the SVM-related experiments, e.g., scikit-learn (http://scikit-learn.org/stable/modules/svm.html).

You will use the Fashion MNIST data (https://github.com/zalandoresearch/fashion-mnist). There is a fixed training and testing set. From the training data, use the first 80 percent for "training" and the last 20 percent as "validation" data.

Each example is a 28×28 grayscale image, associated with a label from 10 classes: 0 T-shirt/top, 1 Trouser, 2 Pullover, 3 Dress, 4 Coat, 5 Sandal, 6 Shirt, 7 Sneaker, 8 Bag, 9 Ankle boot.

You will use ONLY training data for training and testing data for evaluation.

  • Using a linear kernel, train the SVM on the training data for different values of the C parameter: 10^−4, 10^−3, 10^−2, 10^−1, 10^0, 10^1, 10^2, 10^3, 10^4. Compute the training accuracy, validation accuracy, and testing accuracy for the SVM obtained with each value of the C parameter. Plot the training accuracy, validation accuracy, and testing accuracy as a function of C (C value on the x-axis and accuracy on the y-axis), one curve each for the training, validation, and testing data. Also plot the number of support vectors (if applicable for the SVM toolkit you are using) as a function of C. List your observations. (A minimal code sketch for this sweep follows these bullets.)
  • Select the best value of hyper-parameter C based on the accuracy on validation set and train a linear SVM on the combined set of training and validation examples. Compute the testing accuracy and the corresponding confusion matrix: a 10 × 10 matrix.
  • Repeat experiment (a) with the best C value from (a) using a polynomial kernel of degree 2, 3, and 4. Compare the training, validation, and testing accuracies, and the number of support vectors, across the different kernels (linear, and polynomial kernels of degree 2, 3, and 4). List your observations.
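A minimal scikit-learn sketch of the C sweep described in the first bullet above, assuming the splits X_train/y_train, X_val/y_val, and X_test/y_test have already been prepared as described (the function name `sweep_C` is just a placeholder; training a kernel SVM on the full Fashion MNIST set can be slow, so prototyping on a subset first is reasonable):

```python
from sklearn.svm import SVC

def sweep_C(X_train, y_train, X_val, y_val, X_test, y_test):
    """Train a linear-kernel SVM for each C and record accuracies and support-vector counts."""
    results = []
    for C in [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4]:
        clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
        results.append({
            "C": C,
            "train_acc": clf.score(X_train, y_train),
            "val_acc": clf.score(X_val, y_val),
            "test_acc": clf.score(X_test, y_test),
            "n_support_vectors": int(clf.n_support_.sum()),
        })
    return results
```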
  2. Programming question. You will implement the kernelized Perceptron training algorithm (discussed in class) for multi-class classification.
    • You will use the Fashion MNIST data. Train the kernelized Perceptron classifier for 5 iterations with polynomial kernel (pick the best degree out of 2, 3, and 4 from the above experiment). Plot the number of mistakes as a function of training iterations. Compute the training, validation, and testing accuracy at the end of 5 iterations.
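One common way to make the kernelized Perceptron multi-class keeps a vector of dual weights per class and predicts the highest-scoring class. The sketch below is a hedged outline under that assumption (your implementation should follow the version discussed in class); it precomputes the Gram matrix, which is simple but memory-heavy for large n.

```python
import numpy as np

def poly_kernel(x_i, x_j, degree=3):
    return (1.0 + np.dot(x_i, x_j)) ** degree

def multiclass_kernel_perceptron(X, y, num_classes, iters=5, kernel=poly_kernel):
    """Multi-class kernelized Perceptron; alpha[c, i] is the dual weight of example i for class c."""
    n = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix (O(n^2) memory)
    alpha = np.zeros((num_classes, n))
    mistakes_per_iter = []
    for _ in range(iters):
        mistakes = 0
        for i in range(n):
            y_hat = int(np.argmax(alpha @ K[i]))  # predicted class for example i
            if y_hat != y[i]:                     # mistake: promote true class, demote predicted
                alpha[y[i], i] += 1.0
                alpha[y_hat, i] -= 1.0
                mistakes += 1
        mistakes_per_iter.append(mistakes)
    return alpha, mistakes_per_iter
```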
  3. Programming question. "Breast Cancer" classifier using decision trees. You will use the following dataset for this question: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. You will use the first 70 percent of the examples for training, the next 10 percent for validation, and the last 20 percent for testing.
    • Implement the ID3 decision tree learning algorithm that we discussed in class. The key step in decision tree learning is choosing the next feature to split on. Implement the information gain heuristic for selecting the next feature. Please see the lecture notes or https://en.wikipedia.org/wiki/ID3_algorithm for more details. Do the following to select candidate thresholds for continuous features: sort all candidate values for feature f from the training data. Suppose f1, f2, …, fn is the sorted list. The candidate thresholds are chosen as fi + (fi+1 − fi)/2 for i = 1 to n−1. (A small sketch of these computations appears at the end of this question.)
    • Run the decision tree construction algorithm on the training examples. Compute the accuracy on the validation examples and the testing examples.
    • Implement the decision tree pruning algorithm discussed in the class (via validation data).
    • Run the pruning algorithm on the decision tree constructed using the training examples. Compute the accuracy on the validation examples and the testing examples. List your observations by comparing the performance of the decision tree with and without pruning.

To debug and test your implementation, you can employ scikit-learn (http://scikit-learn.org/stable/modules/tree.html).
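Below is a small sketch of the candidate-threshold and information-gain computations described in the first bullet of this question, assuming binary (+1/−1) labels stored in NumPy arrays; variable names are placeholders.

```python
import numpy as np

def candidate_thresholds(values):
    """Midpoints between consecutive sorted feature values: f_i + (f_{i+1} - f_i)/2."""
    v = np.sort(np.unique(values))
    return (v[:-1] + v[1:]) / 2.0

def entropy(labels):
    """Entropy (in bits) of an array of +1/-1 labels."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels == 1)
    return -sum(q * np.log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(feature_values, labels, threshold):
    """Information gain of splitting on feature <= threshold vs. > threshold."""
    left, right = labels[feature_values <= threshold], labels[feature_values > threshold]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted
```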

3           Instructions for Submission

Please follow the instructions below. If you do not follow them, your homework will not be graded. We will provide a Dropbox folder link for the homework submission.

PDF submission. Submit one PDF file containing your answers for the analytical part (Part I) and the empirical analysis questions with results/analysis (Part II). Please label the x-axis, y-axis, and name of each graph appropriately. Please name this file WSUID-LASTNAME.pdf (e.g., 111222-Fern.pdf).

Code submission. You will submit one zip file for your code as per the instructions below. If your script and/or code does not execute, we will try to give some partial credit by looking at the overall code contents.

  • Mention the programming language and version (e.g., Python 2.5) that you used.
  • Submit one folder with name WSUID-LASTNAME.zip (e.g., 111222-Fern.zip) and include a README file.
  • Include a script to run the code, and reference it in the README file. Please make sure that your script runs on a standard Linux machine.
  • Don’t submit the data folder. Assume there is a folder “data” with all the files.
  • Output of your programs should be well-formatted in order to answer the empirical analysis questions.
  • Structure your code meaningfully and add comments to make it readable.

If you have collaborated on or discussed the homework with another student, please provide this information with all the relevant details. If we find that the code of two different students has traces of plagiarism, both students will be penalized and we will report the academic dishonesty case to the graduate school (see https://communitystandards.wsu.edu/policies-and-reporting/academic-integritypolicy/). The bottom line is: please DO NOT even think of going this route. Dealing with these cases is very unpleasant for the faculty, TAs, and students involved.

 
