## Description

# Gradient Descent and Newton’s Method

In this problem, you will implement (unregularized/regularized) logistic regression for binary classification problems using two di↵erent types of optimization approaches, namely, **batch gradient descent **and **Newton’s method**. Two data sets are given; one of which is text data from which you will learn to construct features. For each of the problems, you need to *report your results on both of the datasets*. Please note that there are 9 problems in total. Also, *cross validation is NOT needed *for this programming assignment. *For any plot you make, you MUST at least provide title, x-label, y-label and legend (if there is more than one curves)*. Please read submission instructions carefully before submitting your report and source codes.

# 1 Data

**Ionosphere **This dataset contains 351 instances and each instance has 34 attributes (features). All feature values are continuous and the last column is the class label (“bad” = b = 1, “good” = g = 0). Your goal is to predict the correct label, which is either “b” or “g”. We already divided the dataset into training and test sets (iono train.dat and iono test.dat). Please use the training set to build your model and use the test set to evaluate your model. For more details about Ionosphere dataset, please refer to UCI website: https://archive.ics.uci.edu/ml/datasets/Ionosphere.

**EmailSpam **This dataset contains 943 instances and each of them is labeled as either a spam or ham (not spam) email. Each data instance is the text which is the content of the email (subject and body), and your goal is to classify each email as either a spam or a ham. We have already divided this dataset into training and test datasets, and each of these datasets is stored in two distinct folders (one corresponding to ham and the other corresponding the spam). In other words, to build your model, you need to iterate through all the text files within /train/spam and /train/ham, and to test your model you need to iterate through /test/spam and /test/ham.

# 2 Feature Representation

An essential part of machine learning is to build the feature representation for raw input data, which is often unstructured. Once we construct the features, each data instance **x*** _{i }*can be represented as:

**x*** _{i }*= (

*x*

_{i}_{1}

*,…,x*)

_{id}where *x _{ij }*denotes the

*j*th feature for

*i*th instance.

**Algorithm 1 **Pseudocode for generating bag-of-word features from text

1: Initialize feature vector bg feature = [0,0,…,0]

2: **for **token in text.tokenize() **do**

3: **if **token in dict **then**

4: token idx = getIndex(dict, token)

5: bg feature[token idx]++

6: **else**

7: continue

8: **end if**

9: **end for**

10: return bg feature

**Ionosphere **The dataset is well-formatted, so you can directly use the raw feature values to build your model. Please do not do any normalization. Also, since there is no categorical attribute, converting to binary features (which you did for HW #1) is not needed.

**EmailSpam **Since each data instance is text, you need to find a way converting it into a feature vector. In this homework, you will use the “Bag-of-Words” representation. More specifically, you will convert the text into a feature vector in which each entry of the vector is the count of words occur in that text. You will be provided with the predefined dictionary (dic.dat), and you should only consider words that appear in that dictionary and ignore all others.^{[1]} Below is the pseudocode for generating bag-of-word features from text. For tokenization, please tokenize the string only using **whitespace and these three delimiters: ’.,?’**. See below for an example:^{2}

__Email__: *hey, i have a better o↵er for you, o↵er. better than all other spam filters. Do you like accepting o↵er?*

Pre-defined Dictionary: *[add, better, email, filters, hey, o↵er,like, spam,special]*

Bag-of-words feature vector: [0*,*2*,*0*,*1*,*1*,*3*,*1*,*1*,*0]

**(Q1)**. After converting all training data into bag-of-words feature vectors, what are the 3 words that occur most frequently? Report the results using this format:

{(word1: # of occurrences), (word2: # of occurrences), (word3: # of occurrences)}

# 3 Implementation

The regularized cross-entropy function can be written as:

where **x*** _{i }*2 R

*,*

^{d}**w**2 R

*, and (·) is the sigmoid function. is regularization coe cient and*

^{d}*w*

_{0 }is bias parameter. Note that we don’t regularize bias term

*b*.

*Stopping Criteria*. For both algorithms, run for 50 iterations.

*Step size*. For gradient method, you will use a fixed step size. Recall that Newton’s method does not require a step size.

*Initialization*. For batch gradient descent, initialize the weight **w **to 0, and *b *to 0.1. For Newton’s method, set initial weights to the ones we got from batch gradient descent after 5 iterations (when = 0*.*05*,⌘ *= 0*.*01). (Please note that Newton’s method may diverge if initialization is not proper).

*Extreme Condition*. It is possible that when (*b*+ **w**^{T}**x**) approaches 0, log( (*b*+ **w**^{T}**x**)) goes to -infinity. In order to prevent such case, please bound the value (*b*+**w**^{T}**x**) using small constant value, 1*e *16. You can use the following logic in your code:

*tmp *= (*b *+ **w**^{T}**x**) if *tmp < *1*e *16 then *tmp *= 1*e *16

## 3.1 Batch Gradient Descent

**(Q2) **Please write down the updating equation for **w **and *b*, for both unregularized logistic regression and regularized logistic regression. In particular, at iteration *t *using data points *X *= [**x**_{1}*,***x**_{2}*,…,***x*** _{n}*], where

**x**

*= [*

_{i }*x*

_{i}_{1}

*,…,x*], and

_{id}*y*2 {0

_{i }*,*1} is the label, how do we compute

**w**

^{t}^{+1 }and

*b*

^{t}^{+1 }from

**w**

*and*

^{t }*b*?

^{t}**(Q3) **For step sizes *⌘ *= {0*.*001*,*0*.*01*,*0*.*05*,*0*.*1*,*0*.*5} and **without regularization**, implement Batch gradient descent (without cross-validation, use the whole training data for the gradient calculation).

- Plot the cross-entropy function value with respect to the number of steps (
*T*= [1*,…,*50]) for the training data for each step size. Note: you need to make two plots, one for each dataset. - Report the
*L*_{2 }norm of vector**w**after 50 iterations for each step size*⌘*(fill in the table below)_{i }

L_{2 }norm (without regularization) |
0.001 | 0.01 | 0.05 | 0.1 | 0.5 |

Ionosphere EmailSpam |

**(Q4) **For step sizes *⌘ *= {0*.*001*,*0*.*01*,*0*.*05*,*0*.*1*,*0*.*5} and with regularization coe cients = {0*,*0*.*05*,*0*.*1*,*0*.*15*,…,*0*.*5}, do the following:

- Given = 0
*.*1, plot the cross-entropy function value with respect to the number of steps (*T*= [1*,…,*50]) for the training data for each step size using di↵erent step sizes. Note: you need to make two plots, one for each dataset. - Given
*⌘*= 0*.*01, report the*L*_{2 }norm of vector**w**after 50 iterations for each regularization coe cient*i*(fill in the table below). - Plot the cross entropy function value at
*T*= 50 for di↵erent regularization coe cients, for both the training and test data. The x-axis will be the regularization coe cient and y-axis will be the cross entropy function value after 50 iterations. Each plot should contain two curves, and you should make 2 (two data sets) ⇤⇥ 5 (five di↵erent step sizes) = 10 plots.

L_{2 }norm (with regularization, ⌘ = 0.01) |
0 | 0.05 | 0.1 | 0.15 | 0.2 | 0.25 | 0.3 | 0.35 | 0.4 | 0.45 | 0.5 |

Ionosphere EmailSpam |

## 3.2 Newton’s method

Optimization also can be done using the 2nd order technique called “Newton’s method”. We define H as the Hessian matrix and r*“ _{t }*as the gradient of objective function at iteration

*t*.

**(Q5) **Using the notation above, write down the updating equation for **w **and *b *at time *t*, for both unregularized logistic regression and regularized logistic regression.

**(Q6) **Implement Newton’s method for logistic regression without regularization, and run it for 50 iterations on both of datasets.

- Plot the cross-entropy function value with respect to the number of steps (
*T*= [1*,…,*50]) for the training data. Note: you need to make two plots, one for each dataset. - Report the
*L*_{2 }norm of vector**w**after 50 iterations. - Report the cross-entropy function value for the test data.

**(Q7) **Repeat (Q6) for the regularized case using = {0*,*0*.*05*,*0*.*1*,*0*.*15*,…,*0*.*5} (report values for each setting).

## 3.3 Analysis and Comparison of Gradient Descent and Newton’s Method

**(Q8) **Briefly (in no more than 4 sentences) explain your results from (Q3) and (Q4). You should discuss the rate of convergence as a function *⌘*, the change in magnitude of **w **as a function of , the value of the cross entropy function for di↵erent values of and *⌘*, and any other interesting trends you have observed.

**(Q9) **Briefly (in no more than 4 sentences) discuss the di↵erences between gradient descent and Newton’s method based on the results from (Q4) and (Q7). In particular, discuss the di↵erence between gradient descent and Newton’s method in terms of convergence and computation time.