ISYE 6740 Homework 5


For the SVM, the optimality conditions imply that

$$w = \sum_{i=1}^m \alpha_i y_i x_i,$$

where αi ≥ 0 are the dual variables. What does this imply in terms of how to relate the data to w?

Explain why only the data points on the “margin” will contribute to the sum above, i.e., play a role in defining w.
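For reference, a short LaTeX sketch of the complementary slackness condition this question turns on, written in standard hard-margin SVM notation (with bias term b), which may differ slightly from the lecture slides:

```latex
% Complementary slackness for the hard-margin SVM (standard notation,
% not necessarily the exact form used in the lecture slides).
\[
  \alpha_i \bigl( y_i (w^\top x_i + b) - 1 \bigr) = 0, \qquad i = 1, \dots, m.
\]
% So alpha_i > 0 forces y_i (w^T x_i + b) = 1: the point x_i lies exactly
% on the margin. Points strictly outside the margin satisfy
% y_i (w^T x_i + b) > 1 and therefore have alpha_i = 0, dropping out of
% the sum w = sum_i alpha_i y_i x_i.
```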

Suppose we only have four training examples in two dimensions, as shown in the figure: positive samples at x1 = (0,0) and x2 = (2,2), and negative samples at x3 = (h,1) and x4 = (0,3).

i. For what range of the parameter h > 0 are the training points still linearly separable?

ii. (10 points) Does the orientation of the maximum-margin decision boundary change as h changes, while the points remain separable?
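Not part of the assignment, but a quick numerical sketch for building intuition: a linear SVM with a very large C approximates the hard-margin classifier, so fitting it for several illustrative values of h shows when the four points stop being separable and how w (the boundary orientation) behaves. The h values below are arbitrary examples.

```python
# Hypothetical check of linear separability as h varies, using the four
# points from the problem statement. A very large C approximates the
# hard-margin SVM; if the data are inseparable, some training points
# will be misclassified.
import numpy as np
from sklearn.svm import SVC

for h in [0.5, 1.0, 2.0, 3.0, 5.0]:
    X = np.array([[0, 0], [2, 2], [h, 1], [0, 3]], dtype=float)
    y = np.array([1, 1, -1, -1])
    clf = SVC(kernel="linear", C=1e10)  # large C ~ hard margin
    clf.fit(X, y)
    separable = bool(np.all(clf.predict(X) == y))
    w, b = clf.coef_[0], clf.intercept_[0]
    print(f"h={h}: separable={separable}, w={w}, b={b:.3f}")
```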

  1. Multi-class classification for the MNIST data set: comparison of classifiers.

This question compares different classifiers and their performance for multi-class classification on the complete MNIST dataset at http://yann.lecun.com/exdb/mnist/. You can find the data file mnist_10digits.mat in the homework folder. The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. We will compare KNN, logistic regression, SVM, kernel SVM, and neural networks. We suggest using Scikit-learn, a commonly used and powerful Python library with various machine learning tools, but you can also use similar libraries in other programming languages of your choice to perform the tasks. Below are some tips.

  • We suggest that you “standardize” the features before training the classifiers by dividing the feature values by 255 (thus mapping the range of the features from [0, 255] to [0, 1]).
  • You may adjust the number of neighbors K used in KNN to obtain a reasonable result (you may use cross-validation, but it is not required; any reasonable tuning that gives a good result is acceptable).
  • You may use a neural network (e.g., scikit-learn’s neural_network module) with hidden_layer_sizes = (20, 10).
  • For kernel SVM, you may use the radial basis function kernel

$$K(x, x') = \exp\{-\|x - x'\|^2 / (2\sigma^2)\},$$

together with a heuristic called the “median trick” to choose the kernel parameter: set the bandwidth to $\sigma = \sqrt{M/2}$, where M is the median of $\{\|x_i - x_j\|^2,\ 1 \le i, j \le m',\ i \ne j\}$ over pairs of training samples. Here you can randomly choose m' = 1000 samples from the training data to use for the “median trick” [1]. A sketch of this computation appears after this list.

  • For KNN and SVM, you can randomly downsample the training data to size m = 5000 to improve computational efficiency.
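Sketched below is one way to implement the “median trick” described in the list above. The placeholder Xtrain is random data standing in only so the snippet runs; in the assignment it would be the scaled MNIST training matrix. The translation to scikit-learn’s gamma parameter is the one extra step.

```python
# A minimal sketch of the "median trick" for the RBF kernel bandwidth.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Placeholder for the real training matrix; in the assignment this would
# come from mnist_10digits.mat, with features already scaled to [0, 1].
Xtrain = rng.random((5000, 784))

idx = rng.choice(len(Xtrain), size=1000, replace=False)  # m' = 1000 samples
M = np.median(pdist(Xtrain[idx]) ** 2)  # median of {||xi - xj||^2, i != j}
sigma = np.sqrt(M / 2.0)                # bandwidth from the median trick

# sklearn parameterizes the RBF kernel as exp(-gamma * ||x - x'||^2),
# so the sigma above corresponds to gamma = 1 / (2 * sigma^2).
kernel_svm = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma**2))
```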

Train the classifiers on the training dataset and evaluate them on the test dataset.

  • Report the confusion matrix, precision, recall, and F-1 score for each of the classifiers. Precision, recall, and F-1 score should be reported for each of the digits, so you can create a table. Do this for each of the five classifiers: KNN, logistic regression, SVM, kernel SVM, and neural networks (see the sketch after this list).
  • Comment on the performance of the classifiers and give your explanation of why some of them perform better than the others.
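A minimal sketch of the training-and-evaluation loop referenced above, assuming Xtrain, ytrain, Xtest, ytest hold the (scaled) MNIST arrays. Random placeholder data is used here only so the snippet runs, and the hyperparameters are illustrative, not prescribed by the assignment.

```python
# Fit the five classifiers and print per-digit precision/recall/F-1
# via sklearn's classification_report, plus the confusion matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import classification_report, confusion_matrix

rng = np.random.default_rng(0)
# Placeholder data; replace with the real MNIST arrays from the .mat file.
Xtrain, ytrain = rng.random((1000, 784)), rng.integers(0, 10, 1000)
Xtest, ytest = rng.random((200, 784)), rng.integers(0, 10, 200)

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
    "kernel SVM": SVC(kernel="rbf"),
    "neural network": MLPClassifier(hidden_layer_sizes=(20, 10), max_iter=500),
}

for name, clf in classifiers.items():
    clf.fit(Xtrain, ytrain)
    pred = clf.predict(Xtest)
    print(name)
    print(confusion_matrix(ytest, pred))
    print(classification_report(ytest, pred))  # per-digit P/R/F-1 table
```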
  2. Neural networks.

    • Consider a neural network for binary classification using the sigmoid function for each unit. If the network has no hidden layer, explain why the model is equivalent to logistic regression.
    • Consider the simple two-layer network in the lecture slides. Given m training data (xi, yi), i = 1, …, m, the cost function used to train the neural network is

$$\ell(w, \alpha, \beta) = \sum_{i=1}^m \bigl(y_i - \sigma(w^\top z_i)\bigr)^2,$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function and $z_i$ is a two-dimensional vector such that $z_i^1 = \sigma(\alpha^\top x_i)$ and $z_i^2 = \sigma(\beta^\top x_i)$. Show that the gradient is given by

$$\frac{\partial \ell(w, \alpha, \beta)}{\partial w} = -\sum_{i=1}^m 2\bigl(y_i - \sigma(u_i)\bigr)\,\sigma(u_i)\bigl(1 - \sigma(u_i)\bigr)\, z_i,$$

where $u_i = w^\top z_i$. Also find the gradients of $\ell(w, \alpha, \beta)$ with respect to α and β and write down their expressions.
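To double-check a derivation like this, it helps to compare the analytic gradient against finite differences. Below is a minimal numpy sketch under the assumptions above (squared loss, z_i = (σ(αᵀx_i), σ(βᵀx_i))); the names a and b stand in for α and β and are illustrative, not from the assignment.

```python
# Finite-difference check of the gradient with respect to w, assuming the
# squared-loss cost written above. Names (a, b for alpha, beta) are
# illustrative only.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(w, a, b, X, y):
    # Rows of Z are z_i = (sigmoid(a^T x_i), sigmoid(b^T x_i)).
    Z = np.column_stack([sigmoid(X @ a), sigmoid(X @ b)])
    return np.sum((y - sigmoid(Z @ w)) ** 2)

def grad_w(w, a, b, X, y):
    Z = np.column_stack([sigmoid(X @ a), sigmoid(X @ b)])
    u = Z @ w          # u_i = w^T z_i
    s = sigmoid(u)
    # -sum_i 2 (y_i - sigma(u_i)) sigma(u_i) (1 - sigma(u_i)) z_i
    return -(2.0 * (y - s) * s * (1.0 - s)) @ Z

rng = np.random.default_rng(0)
m, d = 5, 3
X = rng.normal(size=(m, d))
y = rng.integers(0, 2, size=m).astype(float)
w, a, b = rng.normal(size=2), rng.normal(size=d), rng.normal(size=d)

# Central finite differences over the two components of w.
eps = 1e-6
num = np.zeros(2)
for k in range(2):
    e = np.zeros(2)
    e[k] = eps
    num[k] = (loss(w + e, a, b, X, y) - loss(w - e, a, b, X, y)) / (2 * eps)

print(np.allclose(grad_w(w, a, b, X, y), num, atol=1e-5))  # expect True
```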
