Homework #3 CS 260: Machine Learning Algorithms


Gradient Descent and Newton’s Method

In this problem, you will implement (unregularized and regularized) logistic regression for binary classification problems using two different optimization approaches, namely, batch gradient descent and Newton's method. Two data sets are given; one of them is text data, from which you will learn to construct features. For each of the problems, you need to report your results on both datasets. Please note that there are 9 problems in total. Also, cross-validation is NOT needed for this programming assignment. Every plot you make MUST at least have a title, x-label, y-label, and a legend (if there is more than one curve). Please read the submission instructions carefully before submitting your report and source code.

1           Data

Ionosphere This dataset contains 351 instances, and each instance has 34 attributes (features). All feature values are continuous, and the last column is the class label (“bad” = b = 1, “good” = g = 0). Your goal is to predict the correct label, which is either “b” or “g”. We have already divided the dataset into training and test sets (iono_train.dat and iono_test.dat). Please use the training set to build your model and the test set to evaluate it. For more details about the Ionosphere dataset, please refer to the UCI website: https://archive.ics.uci.edu/ml/datasets/Ionosphere.

EmailSpam This dataset contains 943 instances, each labeled as either a spam or a ham (not spam) email. Each data instance is the text content of an email (subject and body), and your goal is to classify each email as either spam or ham. We have already divided this dataset into training and test sets, and each set is stored in two distinct folders (one for ham, the other for spam). In other words, to build your model you need to iterate through all the text files within /train/spam and /train/ham, and to test your model you need to iterate through /test/spam and /test/ham.
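
For concreteness, here is a minimal Python loading sketch under the folder layout above; the function name and the label encoding (spam = 1, ham = 0) are illustrative choices, not part of the assignment:

    from pathlib import Path

    def load_email_folder(folder, label):
        """Read every email file in a folder and pair it with a label.

        A sketch assuming the directory layout described above; the
        label encoding is an illustrative choice.
        """
        data = []
        for path in sorted(Path(folder).iterdir()):
            if path.is_file():
                # errors="ignore" guards against odd encodings in raw emails
                data.append((path.read_text(errors="ignore"), label))
        return data

    train = load_email_folder("train/spam", 1) + load_email_folder("train/ham", 0)
    test = load_email_folder("test/spam", 1) + load_email_folder("test/ham", 0)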

2           Feature Representation

An essential part of machine learning is to build the feature representation for raw input data, which is often unstructured. Once we construct the features, each data instance xi can be represented as:

xi = (xi1,…,xid)

where xij denotes the jth feature of the ith instance.

Algorithm 1 Pseudocode for generating bag-of-words features from text

1: Initialize feature vector bg_feature = [0, 0, …, 0]
2: for token in text.tokenize() do
3:     if token in dict then
4:         token_idx = getIndex(dict, token)
5:         bg_feature[token_idx]++
6:     else
7:         continue
8:     end if
9: end for
10: return bg_feature

Ionosphere The dataset is well-formatted, so you can directly use the raw feature values to build your model. Please do not do any normalization. Also, since there is no categorical attribute, converting to binary features (which you did for HW #1) is not needed.

EmailSpam Since each data instance is text, you need a way to convert it into a feature vector. In this homework, you will use the “Bag-of-Words” representation. More specifically, you will convert the text into a feature vector in which each entry is the number of times the corresponding dictionary word occurs in that text. You are provided with a predefined dictionary (dic.dat), and you should only consider words that appear in that dictionary and ignore all others. Algorithm 1 above gives the pseudocode for generating bag-of-words features from text. For tokenization, please split the string only on whitespace and these three delimiters: ’.,?’. See below for an example:

Email: hey, i have a better offer for you, offer. better than all other spam filters. Do you like accepting offer?

Pre-defined Dictionary: [add, better, email, filters, hey, offer, like, spam, special]

Bag-of-words feature vector: [0,2,0,1,1,3,1,1,0]
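
As a sanity check on Algorithm 1 and the tokenization rule, here is a minimal Python sketch (function and variable names are illustrative); it reproduces the example vector above:

    import re

    def bag_of_words(text, vocab):
        """Algorithm 1: count occurrences of dictionary words in a text.

        Tokenizes only on whitespace and the delimiters '.', ',', '?',
        as specified above; tokens outside the dictionary are skipped.
        """
        index = {word: i for i, word in enumerate(vocab)}  # word -> position
        feature = [0] * len(vocab)
        for token in re.split(r"[\s.,?]+", text):
            if token in index:
                feature[index[token]] += 1
        return feature

    vocab = ["add", "better", "email", "filters", "hey", "offer", "like", "spam", "special"]
    email = ("hey, i have a better offer for you, offer. better than all "
             "other spam filters. Do you like accepting offer?")
    print(bag_of_words(email, vocab))  # -> [0, 2, 0, 1, 1, 3, 1, 1, 0]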

(Q1) After converting all training data into bag-of-words feature vectors, what are the 3 words that occur most frequently? Report the results using this format:

{(word1: # of occurrences), (word2: # of occurrences), (word3: # of occurrences)}
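
One possible way to tally the counts for (Q1), assuming the training emails have already been converted to count vectors (all names are illustrative):

    import numpy as np

    def top_k_words(features, vocab, k=3):
        """Total the bag-of-words counts over all training emails and
        return the k most frequent dictionary words, as (Q1) asks.

        features: one count vector per email (e.g. from bag_of_words above).
        """
        totals = np.asarray(features).sum(axis=0)   # total count per word
        top = np.argsort(totals)[::-1][:k]          # indices of the k largest totals
        return {vocab[i]: int(totals[i]) for i in top}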

3           Implementation

The regularized cross-entropy function can be written as:

E(w, b) = − Σ_{i=1}^{n} [ yi log σ(b + wT xi) + (1 − yi) log(1 − σ(b + wT xi)) ] + λ‖w‖^2

where xi ∈ R^d, w ∈ R^d, and σ(·) is the sigmoid function. λ is the regularization coefficient and b is the bias parameter. Note that we do not regularize the bias term b.
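
A minimal numpy sketch of this objective follows; the upper clip is an extra safeguard beyond the assignment's stated lower bound (see the Extreme Condition below):

    import numpy as np

    def cross_entropy(X, y, w, b, lam=0.0):
        """Regularized cross-entropy E(w, b) as written above (a sketch).

        The lower clip at 1e-16 is the assignment's "Extreme Condition";
        the matching upper clip is a symmetric safeguard, not required
        by the text.
        """
        p = 1.0 / (1.0 + np.exp(-(b + X @ w)))   # sigma(b + w^T x_i) for each i
        p = np.clip(p, 1e-16, 1.0 - 1e-16)       # keep log() finite
        nll = -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
        return nll + lam * np.dot(w, w)          # + lambda * ||w||^2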

Stopping Criteria. For both algorithms, run for 50 iterations.

Step size. For the gradient method, you will use a fixed step size η. Recall that Newton's method does not require a step size.

Initialization. For batch gradient descent, initialize the weights w to 0 and b to 0.1. For Newton's method, set the initial weights to the ones obtained from batch gradient descent after 5 iterations (with λ = 0.05, η = 0.01). (Please note that Newton's method may diverge if the initialization is not proper.)

Extreme Condition. It is possible that when σ(b + wTx) approaches 0, log(σ(b + wTx)) goes to negative infinity. To prevent this, bound the value of σ(b + wTx) below by the small constant 1e-16. You can use the following logic in your code:

tmp = σ(b + wTx)
if tmp < 1e-16 then
    tmp = 1e-16
end if

3.1         Batch Gradient Descent

(Q2) Please write down the update equations for w and b, for both unregularized and regularized logistic regression. In particular, at iteration t, using data points X = [x1, x2, …, xn], where xi = [xi1, …, xid] and yi ∈ {0,1} is the label, how do we compute wt+1 and bt+1 from wt and bt?

(Q3) For step sizes η ∈ {0.001, 0.01, 0.05, 0.1, 0.5} and without regularization, implement batch gradient descent (without cross-validation; use the whole training data for the gradient calculation). A reference sketch appears after the table below.

  • Plot the cross-entropy function value with respect to the number of steps (T = [1,…,50]) for the training data for each step size. Note: you need to make two plots, one for each dataset.
  • Report the L2 norm of the vector w after 50 iterations for each step size ηi (fill in the table below).
L2 norm (without regularization) | η = 0.001 | η = 0.01 | η = 0.05 | η = 0.1 | η = 0.5
Ionosphere                       |           |          |          |         |
EmailSpam                        |           |          |          |         |
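
The following minimal numpy sketch of the loop reflects the setup in Section 3 (w = 0, b = 0.1, fixed step size, 50 iterations); set lam = 0 for the unregularized runs of (Q3), and record the cross-entropy inside the loop to produce the plots:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def batch_gradient_descent(X, y, eta, lam=0.0, iters=50):
        """Batch gradient descent for (optionally regularized) logistic
        regression; a sketch under the assignment's setup, not the
        prescribed solution.
        """
        n, d = X.shape
        w, b = np.zeros(d), 0.1
        for _ in range(iters):
            p = sigmoid(b + X @ w)                      # predictions on all n points
            w -= eta * (X.T @ (p - y) + 2.0 * lam * w)  # gradient step on w
            b -= eta * np.sum(p - y)                    # bias is not regularized
        return w, b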

(Q4) For step sizes η ∈ {0.001, 0.01, 0.05, 0.1, 0.5} and regularization coefficients λ ∈ {0, 0.05, 0.1, 0.15, …, 0.5}, do the following:

  • Given λ = 0.1, plot the cross-entropy function value with respect to the number of steps (T = [1,…,50]) on the training data, with one curve per step size. Note: you need to make two plots, one for each dataset.
  • Given η = 0.01, report the L2 norm of the vector w after 50 iterations for each regularization coefficient λi (fill in the table below).
  • Plot the cross-entropy function value at T = 50 for the different regularization coefficients, for both the training and test data. The x-axis is the regularization coefficient and the y-axis is the cross-entropy function value after 50 iterations. Each plot should contain two curves, and you should make 2 (two data sets) × 5 (five different step sizes) = 10 plots.
L2 norm (with regularization, η = 0.01) | λ = 0 | 0.05 | 0.1 | 0.15 | 0.2 | 0.25 | 0.3 | 0.35 | 0.4 | 0.45 | 0.5
Ionosphere                              |       |      |     |      |     |      |     |      |     |      |
EmailSpam                               |       |      |     |      |     |      |     |      |     |      |

3.2         Newton’s method

Optimization can also be done with a second-order technique called “Newton's method”. We define H as the Hessian matrix and ∇t as the gradient of the objective function at iteration t.
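
For concreteness, here is a minimal numpy sketch of one Newton step for this objective, with θ stacking the bias and the weights; the λ terms give the regularized variant (λ = 0 recovers the unregularized case), and all names are illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def newton_step(X, y, theta, lam=0.0):
        """One Newton update theta_{t+1} = theta_t - H^{-1} grad_t.

        theta = [b, w1, ..., wd]; the bias entry theta[0] is not
        regularized, matching the objective above. A sketch, not the
        prescribed solution.
        """
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a column of 1s for b
        p = sigmoid(Xb @ theta)                        # sigma(b + w^T x_i) for each i
        grad = Xb.T @ (p - y)                          # gradient of the cross-entropy
        grad[1:] += 2.0 * lam * theta[1:]              # + d/dw of lambda * ||w||^2
        s = p * (1.0 - p)                              # diagonal of sigma'(z)
        H = Xb.T @ (Xb * s[:, None])                   # Hessian = Xb^T diag(s) Xb
        H[1:, 1:] += 2.0 * lam * np.eye(X.shape[1])    # curvature of the penalty term
        return theta - np.linalg.solve(H, grad)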

(Q5) Using the notation above, write down the update equations for w and b at iteration t, for both unregularized and regularized logistic regression.

(Q6) Implement Newton's method for logistic regression without regularization, and run it for 50 iterations on both datasets.

  • Plot the cross-entropy function value with respect to the number of steps (T = [1,…,50]) for the training data. Note: you need to make two plots, one for each dataset.
  • Report the L2 norm of vector w after 50 iterations.
  • Report the cross-entropy function value for the test data.

(Q7) Repeat (Q6) for the regularized case using λ ∈ {0, 0.05, 0.1, 0.15, …, 0.5} (report values for each setting).

3.3         Analysis and Comparison of Gradient Descent and Newton’s Method

(Q8) Briefly (in no more than 4 sentences) explain your results from (Q3) and (Q4). You should discuss the rate of convergence as a function of η, the change in the magnitude of w as a function of λ, the value of the cross-entropy function for different values of η and λ, and any other interesting trends you observed.

(Q9) Briefly (in no more than 4 sentences) discuss the differences between gradient descent and Newton's method based on the results from (Q4) and (Q7), particularly in terms of convergence and computation time.

 
