CS 7641-A Machine Learning - Homework 3 Solved


Discussion is encouraged on Piazza as part of the Q/A. However, all assignments should be completed individually.
Instructions for the assignment
In this assignment, we have programming and writing questions.
To switch a cell between code and markdown, see the menu -> Cell -> Cell Type
Typing with LaTeX is required for all the written questions, and can be done in markdown cell types.
Handwritten answers will not be accepted.
If a question requires a picture, you can use the syntax `<img src="" style="width:300px;" />` to include it within your ipython notebook.
Questions marked with **[P]** are programming only and should be submitted to the autograder. Questions marked with **[W]** may require that you code a small function or generate plots, but should NOT be submitted to the autograder; they should be submitted with the written portion of the assignment on Gradescope.
The outline of the assignment is as follows:
Q1 [30 pts] > Image compression with SVD **[W]** 1.2 and 1.3 | **[P]** items 1.1
Q2 [15 pts] > Understanding PCA **[W]** items 2.2 | **[P]** 2.1
Q3 [60 pts + 20 bonus pts for undergrads] > Regression and regularization **[W]** items 3.2, 3.3, 3.4, 3.5 and 3.6 | **[P]** items 3.1
Q4 [25 pts] > Naive Bayes classification. **[W]** items 4.1 | **[P]** items 4.2
Q5 [15 pts] > Noise in PCA and Linear Regression. **[W]** item 5.3 | **[P]** items 5.1 and 5.2
Q6 [Bonus for all][25 pts] > Feature Selection. **[W]** items 6.2 | **[P]** items 6.1
Bonus for undergrads in Q3: For undergraduate students, you are required to implement the closed form for linear regression and for ridge regression, the other 4 methods are bonus questions. For graduate students, you are required to implement all of them.
Using the autograder
Undergrad students will find four assignments on Gradescope that correspond to HW3: “HW3 – Programming”, “HW3 – Programming (Bonus)”, “HW3 – Programming (Bonus for all)”, and “HW3 – Non-programming”.
Graduate students will find three assignments on Gradescope that correspond to HW3: “HW3 – Programming”, “HW3 – Programming (Bonus for all)”, and “HW3 – Non-programming”.
You will submit your code for the autograder on “HW3 – Programming” in the following format:
imgcompression.py, pca.py, regression.py, nb.py, slope.py, feature_selection.py
All you will have to do is implement the classes “ImgCompression”, “PCA”, “Regression”, “NaiveBayes”, “Slope”, and “FeatureSelection” in the respective files. We have provided you with different .py files and added libraries in those files. Please DO NOT remove those lines; add your code after them. Note that these are the only allowed libraries that you can use for the homework.
You are allowed to make as many submissions as you like until the deadline. Additionally, note that the autograder tests each function separately, so it can serve as a useful tool to help you debug your code if you are not sure which part of your implementation might have an issue.
For the “HW3 – Non-programming” part, you will download your jupyter notebook as HTML, print it as a PDF from your browser and submit it on Gradescope. To download the notebook as HTML, click on “File” on the top left corner of this page and select “Download as > HTML”. The non-programming part corresponds to Q1.2 – 1.3, Q2.2, Q3.2 – 3.6, Q4.1, Q5.3 and Q6.2. For questions that include images, include both your response and the generated images in your submission.
In [51]: # HELPER CELL, DO NOT MODIFY
# This cell sets up some of the modules you might need.
# Please do not change the cell or import any additional packages.

import numpy as np
import json
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.feature_extraction import text
from sklearn.datasets import load_boston, load_diabetes, load_digits, load_breast_cancer, load_iris, load_wine
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score
import warnings

warnings.filterwarnings('ignore')

%matplotlib inline
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
1. Image compression with SVD [30 pts] **[P]** **[W]**
Load images data and plot

Out[56]: <matplotlib.image.AxesImage at 0x7fb4ad9de9a0>

In [57]: # HELPER CELL, DO NOT MODIFY
def rgb2gray(rgb):
    return np.dot(rgb[..., :3], [0.299, 0.587, 0.114])

fig = plt.figure(figsize=(10, 10))
# plot several images
plt.imshow(rgb2gray(image), cmap=plt.cm.bone)
Out[57]: <matplotlib.image.AxesImage at 0x7fb4ad99ec10>

1.1 Image compression [20pts] **[P]**
SVD is a dimensionality reduction technique that allows us to compress images by throwing away the least important information.
Higher singular values capture greater variance and thus capture more of the information in the corresponding singular vectors. To perform image compression, apply SVD on each matrix and discard the small singular values. The loss of information through this process is negligible, and the difference between the compressed and original images can hardly be spotted. For example, the variance captured by the first component is

$$\frac{\sigma_1}{\sum_{i=1}^{n} \sigma_i}$$

where $\sigma_i$ is the $i^{th}$ singular value.
In the imgcompression.py file, complete the svd, rebuild_svd, compression_ratio, and recovered_variance_proportion functions.
Hint 1: http://timbaumann.info/svd-image-compression-demo/ is a useful article on image compression.
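As a rough illustration of the idea (not the graded ImgCompression implementation, whose exact signatures are defined in imgcompression.py), a rank-k reconstruction and the recovered-variance formula above can be sketched with NumPy as follows. For a color image the same operations would be applied to each of the R, G, B channels separately.

```python
import numpy as np

def svd_compress_sketch(X, k):
    """Rank-k reconstruction of a 2-D (grayscale) image via truncated SVD."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

def recovered_variance_sketch(S, k):
    """Proportion of variance captured by the first k singular values,
    following the sigma_1 / sum(sigma_i) convention stated above."""
    return np.sum(S[:k]) / np.sum(S)
```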
1.2 Black and white [5 pts] **[W]**
Use your implementation to generate a set of images compressed to different degrees. Include the images in your non-programming submission of the assignment.

1.3 Color image [5 pts] **[W]**
Use your implementation to generate a set of images compressed to different degrees. Include the images in your non-programming submission of the assignment.
Note: You might get the warning “Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).” This warning is acceptable, since some of the rebuilt pixel values may go slightly above 1.0.
You should see an image similar to the original even with such clipping.
In [59]: # HELPER CELL, DO NOT MODIFY
from imgcompression import ImgCompression

imcompression = ImgCompression()
U, S, V = imcompression.svd(image)

component_num = [1, 2, 5, 10, 20, 40, 80, 160, 256]

fig = plt.figure(figsize=(18, 18))

# plot several images
i = 0
for k in component_num:
    img_rebuild = imcompression.rebuild_svd(U, S, V, k)
    c = np.around(imcompression.compression_ratio(image, k), 4)
    r = np.around(imcompression.recovered_variance_proportion(S, k), 3)
    ax = fig.add_subplot(3, 3, i + 1, xticks=[], yticks=[])
    ax.imshow(img_rebuild)
    ax.set_title(f"{k} Components")
    ax.set_xlabel(f"Compression: {np.around(c, 4)},\nRecovered Variance: R: {r[0]} G: {r[1]} B: {r[2]}")
    i = i + 1
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers). (This warning is printed once per reconstructed image.)

2 Understanding PCA [15 pts] **[P]** | **[W]**
2.1 Implementation [10 pts] **[P]**
Principal Component Analysis (https://en.wikipedia.org/wiki/Principal_component_analysis) (PCA) is another dimensionality reduction technique that reduces dimensions by eliminating small variance eigenvalues and their vectors. With PCA, we center the data first by subtracting the mean. Each singular value tells us how much of the variance of a matrix (e.g. image) is captured in each component. In this problem, we will investigate how PCA can be used to improve features for regression and classification tasks and how the data itself affects the behavior of PCA.
Implement PCA in the pca.py file.
Assume a dataset is composed of N datapoints, each of which has D features with D < N. The dimension of our data would be D. It is possible, however, that many of these dimensions contain redundant information. Each feature explains part of the variance in our dataset. Some features may explain more variance than others.
In the pca.py file, complete the PCA class by completing functions fit, transform and transform_rv.
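A minimal sketch of the fit/transform logic described above is shown below. It assumes an SVD-based implementation and uses hypothetical attribute names, so the graded class in pca.py may differ in its exact interface (for example, in how transform_rv selects the number of components for a target retained variance).

```python
import numpy as np

class PCASketch:
    def fit(self, X):
        # Center the data, then take the SVD of the centered matrix.
        self.mean_ = X.mean(axis=0)
        _, self.S_, self.V_ = np.linalg.svd(X - self.mean_, full_matrices=False)
        return self

    def transform(self, X, K=2):
        # Project centered data onto the top-K right singular vectors.
        return (X - self.mean_) @ self.V_[:K].T

    def transform_rv(self, X, retained_variance=0.99):
        # Keep enough components to retain the requested fraction of variance.
        var = self.S_ ** 2
        K = int(np.searchsorted(np.cumsum(var) / var.sum(), retained_variance) + 1)
        return self.transform(X, K)
```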
2.2 Visualize [5 pts] **[W]**
PCA is used to transform multivariate data tables into smaller sets so as to observe the hidden trends and variations in the data. It can also be used as a feature extractor for images. Here you will visualize two datasets using PCA, first a breast cancer dataset and then a dataset of masked and unmasked images.
The masked and unmasked dataset is made up of grayscale images of human faces facing forward. Half of these images are faces that are completely unmasked, and the remaining images show half of the face covered with an artificially generated face mask. The images have been reduced to a very small size and reshaped into a feature vector of pixels.
Use the above implementation of PCA and reduce the datasets such that they contain only two features. Replicate the 2-D scatter plots shown below of the data points using these features. Make sure to differentiate the data points according to their true labels using color. The datasets have already been loaded for you.
In the pca.py file, implement the visualize function.

data shape after PCA (569, 2)

*In this plot, the 0 points are malignant and the 1 points are benign.
In [62]: # HELPER CELL, DO NOT MODIFY
# Use PCA for visualization of masked and unmasked images

X = np.load('../data/smallflat.npy')
y = np.load('../data/masked_labels.npy')

plt.title('Facemask Dataset Visualization with Dimensionality Reduction')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
PCA().visualize(X, y)
print('*In this plot, the 0 points are unmasked images and the 1 points are masked images.')

data shape before PCA (300, 1024)
data shape after PCA (300, 2)

*In this plot, the 0 points are unmasked images and the 1 points are masked images.
Notice the distinct separation between the data points with different labels in both plots above.
Now you will use PCA on an actual real-world dataset. We will use your implementation of PCA function to reduce the dataset with 99% retained variance and use it to obtain the reduced features. On the reduced dataset, we will use logistic and linear regression to compare results between PCA and non-PCA datasets. Run the following cells to see how PCA works on regression and classification tasks.

In [64]: # HELPER CELL, DO NOT MODIFY
# Train, test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, stratify=y, random_state=42)
# Use logistic regression to predict classes for test set
clf = LogisticRegression()
clf.fit(X_train, y_train)
preds = clf.predict_proba(X_test)
print('Accuracy Using Logistic Regression before PCA: {:.5f}'.format(accuracy_score(y_test, preds.argmax(axis=1))))
Accuracy Using Logistic Regression before PCA: 0.93333
In [65]: # HELPER CELL, DO NOT MODIFY
# Train, test splits
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=.3, stratify=y, random_state=42)
# Use logistic regression to predict classes for test set
clf = LogisticRegression()
clf.fit(X_train, y_train)
preds = clf.predict_proba(X_test)
print('Accuracy using Logistic Regression after PCA: {:.5f}'.format(accuracy_score(y_test, preds.argmax(axis=1))))
Accuracy using Logistic Regression after PCA: 0.95556

In [68]: # HELPER CELL, DO NOT MODIFY
# Train, test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

# Ridge regression without PCA
y_pred = apply_regression(X_train, y_train, X_test)
# calculate RMSE
rmse_score = np.sqrt(mean_squared_error(y_pred, y_test))
print('RMSE score using Ridge Regression before PCA: {:.5}'.format(rmse_score))
RMSE score using Ridge Regression before PCA: 55.794
In [69]: # HELPER CELL, DO NOT MODIFY
# Ridge regression with PCA
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=.3, random_state=42)

# use Ridge Regression for getting predicted labels
y_pred = apply_regression(X_train, y_train, X_test)

# calculate RMSE
rmse_score = np.sqrt(mean_squared_error(y_pred, y_test))
print('RMSE score using Ridge Regression after PCA: {:.5}'.format(rmse_score))
RMSE score using Ridge Regression after PCA: 55.725
For both tasks above, we see an improvement in performance after reducing our dataset with PCA.
Feel free to add other datasets in cell below and play around with what kind of improvement you get with using PCA. There are no points for playing around with other datasets.

3 Polynomial regression and regularization [60 pts + 20 pts bonus for CS 4641] **[P]** | **[W]**
3.1 Regression and regularization implementations [30 pts + 20 pts bonus for CS
4641] **[P]**
We have three methods to fit linear and ridge regression models: 1) closed form; 2) gradient descent (GD); 3) stochastic gradient descent (SGD). For undergraduate students, you are required to implement the closed form for linear regression and for ridge regression; the other 4 methods are bonus parts. For graduate students, you are required to implement all of them. We use the term weight in the following code. Weights and parameters ($\theta$) have the same meaning here; we used parameters ($\theta$) in the lecture slides.
In the regression.py file, complete the Regression class by completing the functions rmse, construct_polynomial_feats, and predict first. Then, construct linear_fit_closed, linear_fit_GD, and linear_fit_SGD for linear regression, and ridge_fit_closed, ridge_fit_GD, and ridge_fit_SGD for ridge regression. For undergraduate students, you are required to implement the closed form for linear regression and for ridge regression; the other 4 methods are bonus questions. For graduate students, you are required to implement all of them. The points for each function are listed in regression.py.
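For reference, here is a hedged sketch of the two closed-form fits (the only required ones for undergraduates). It uses the pseudo-inverse for linear regression and the standard ridge normal equations; the graded functions in regression.py may additionally exclude the bias column from the ridge penalty, which is omitted here for brevity.

```python
import numpy as np

def linear_fit_closed_sketch(X, y):
    # Closed-form least squares: theta = pinv(X) @ y
    return np.linalg.pinv(X) @ y

def ridge_fit_closed_sketch(X, y, c_lambda):
    # Closed-form ridge: theta = (X^T X + lambda * I)^(-1) X^T y
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + c_lambda * np.eye(D), X.T @ y)
```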

3.2 About RMSE [3 pts] **[W]**
What is a good RMSE value? If we normalize our labels between 0 and 1, what does it mean when normalized RMSE = 1? Please provide an example with your explanation.
Hint: Think of the way that you can enforce your RMSE = 1. Note that you can not change the actual labels to make RMSE = 1.
A: The closer the RMSE is to 0, the better. A normalized RMSE of 1 indicates worst-case performance. For example, suppose all the training labels are 0, so the model always predicts 0, but all the test labels are 1; then the test RMSE equals 1, the worst possible prediction.
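A tiny numerical check of the scenario above (labels normalized to [0, 1], model always predicting 0 while every test label is 1):

```python
import numpy as np

y_true = np.ones(5)    # all test labels are 1 (normalized)
y_pred = np.zeros(5)   # model trained on all-zero labels always predicts 0
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
print(rmse)            # 1.0, the worst possible normalized RMSE
```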
3.3 Testing: general functions and linear regression [5 pts] **[W]**
In this section, we will test the performance of linear regression. As long as your test RMSE score is close to the TA’s answer (TA’s answer ±0.5), you get full points. Let’s first construct a dataset for polynomial regression.
In this case, we construct the polynomial features up to degree 5. Each data sample consists of two features $[a, b]$. We compute the polynomial features of both $a$ and $b$ in order to yield the vectors $[1, a, a^2, a^3, \dots, a^{degree}]$ and $[1, b, b^2, b^3, \dots, b^{degree}]$. We train our model with the cartesian product of these polynomial features. The cartesian product generates a new feature vector consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.

For example, for degree = 2, we will have the polynomial features $[1, a, a^2]$ and $[1, b, b^2]$ for the datapoint $[a, b]$. The cartesian product of these two vectors will be $[1, a, b, ab, a^2, b^2]$. We do not generate $a^3$ and $b^3$ since their degree is greater than 2 (the specified degree).
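A quick way to sanity-check the degree-2 example above (a hypothetical helper, not the construct_polynomial_feats signature used by the autograder) is to enumerate all monomials $a^i b^j$ with total degree at most 2:

```python
import itertools

a, b = 2.0, 3.0
degree = 2
feats = sorted(a ** i * b ** j
               for i, j in itertools.product(range(degree + 1), repeat=2)
               if i + j <= degree)
print(feats)  # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0] -> the terms 1, a, b, a^2, ab, b^2 at (a, b) = (2, 3)
```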
In [72]: # HELPER CELL, DO NOT MODIFY
POLY_DEGREE = 5
NUM_OBS = 1500

rng = np.random.RandomState(seed=5)

true_weight = -rng.rand((POLY_DEGREE)**2 + 2, 1)
true_weight[2:, :] = 0
x_all1 = np.linspace(-5, 5, NUM_OBS)
x_all2 = np.linspace(-3, 3, NUM_OBS)
x_all = np.stack((x_all1, x_all2), axis=1)

reg = Regression()
x_all_feat = reg.construct_polynomial_feats(x_all, POLY_DEGREE)
x_cart_flat = []
for i in range(x_all_feat.shape[0]):
    point = x_all_feat[i]
    x1 = point[:, 0]
    x2 = point[:, 1]
    x1_end = x1[-1]
    x2_end = x2[-1]
    x1 = x1[:-1]
    x2 = x2[:-1]
    x3 = np.asarray([[m*n for m in x1] for n in x2])

    x3_flat = np.reshape(x3, (x3.shape[0]**2))
    x3_flat = list(x3_flat)
    x3_flat.append(x1_end)
    x3_flat.append(x2_end)
    x3_flat = np.asarray(x3_flat)
    x_cart_flat.append(x3_flat)

x_cart_flat = np.asarray(x_cart_flat)
x_all_feat = np.copy(x_cart_flat)

y_all = np.dot(x_cart_flat, true_weight) + rng.randn(x_all_feat.shape[0], 1)  # in the second term, we add noise to the data
print(x_all.shape, y_all.shape)

# Note that here we try to produce y_all as our training data
# plot_curve(x_all, y_all)  # data with noise that we are going to predict
# plot_curve(x_all, np.dot(x_cart_flat, true_weight), curve_type='-', color='r', lw=4)  # the ground truth information

indices = rng.permutation(NUM_OBS)
(1500, 2) (1500, 1)
In [73]: # HELPER CELL, DO NOT MODIFY
fig = plt.figure(figsize=(8, 5))
ax = fig.add_subplot(111, projection='3d')

p = np.reshape(np.dot(x_cart_flat, true_weight), (1500,))
print(x_all[:, 0].shape, x_all[:, 1].shape, p.shape)
ax.plot(x_all[:, 0], x_all[:, 1], p, c="red", linewidth=4)
ax.scatter(x_all[:, 0], x_all[:, 1], y_all, s=4)
ax.set_xlabel("feature_1")
ax.set_ylabel("feature_2")
ax.set_zlabel("y")
ax.text2D(0.05, 0.95, "Data point & True regression", transform=ax.transAxes)
plt.show()
(1500,) (1500,) (1500,)

In the figure above, the red curve is the true function we want to learn, while the blue dots are the noisy data points. The data points are generated by $Y = X\theta + \sigma$, where $\sigma \sim N(0, 1)$ is i.i.d. noise.
Now let’s split the data into two parts, namely the training set and test set. The red dots are for training, while the blue dots are for testing.
In [74]: # HELPER CELL, DO NOT MODIFY
train_indices = indices[:NUM_OBS//2]
test_indices = indices[NUM_OBS//2:]

fig = plt.figure(figsize=(8, 5))
ax = fig.add_subplot(111, projection='3d')

xtrain = x_all[train_indices]
ytrain = y_all[train_indices]
xtest = x_all[test_indices]
ytest = y_all[test_indices]

print(xtrain.shape, xtest.shape, y_all.shape)
ax.scatter(xtrain[:, 0], xtrain[:, 1], ytrain, c='r', s=4)
ax.scatter(xtest[:, 1], xtest[:, 1], ytest, c='b', s=4)
ax.set_xlabel("feature_1")
ax.set_ylabel("feature_2")
ax.set_zlabel("y")
ax.text2D(0.05, 0.95, "Test set & Training set", transform=ax.transAxes)
(750, 2) (750, 2) (1500, 1)
Out[74]: Text(0.05, 0.95, ‘Test set & Training set’)

Now let’s first train using the entire training set, and see how we perform on the test set and what the learned function looks like. (No need to answer the following questions; just run the helper cells.)
In [75]: # HELPER CELL, DO NOT MODIFY
weight = reg.linear_fit_closed(x_all_feat[train_indices], y_all[train_indices])
y_test_pred = reg.predict(x_all_feat[test_indices], weight)
test_rmse = reg.rmse(y_test_pred, y_all[test_indices])
print('test rmse: %.4f' % test_rmse)

test rmse: 1.0026
In [76]: # HELPER CELL, DO NOT MODIFY
# This cell may take more than 1 minute
weight = reg.linear_fit_GD(x_all_feat[train_indices], y_all[train_indices], epochs=50000, learning_rate=1e-9)
y_test_pred = reg.predict(x_all_feat[test_indices], weight)
test_rmse = reg.rmse(y_test_pred, y_all[test_indices])
print('test rmse: %.4f' % test_rmse)

test rmse: 1.3547
And what if we just use the first 10 data points to train?
In [77]: # HELPER CELL, DO NOT MODIFY
sub_train = train_indices[:10]
weight = reg.linear_fit_closed(x_all_feat[sub_train], y_all[sub_train])
y_test_pred = reg.predict(x_all_feat[test_indices], weight)
test_rmse = reg.rmse(y_test_pred, y_all[test_indices])
print('test rmse: %.4f' % test_rmse)

test rmse: 8.7198
Did you see a worse performance? Let’s take a closer look at what we have learned.

Out[78]: Text(0.05, 0.95, ‘Linear Regression Result’)

3.4 Testing: Testing ridge regression [5 pts] **[W]**
Now let’s try ridge regression. Similarly, undergraduate students need to implement the closed form, and graduate students need to implement all the three methods. We will call the prediction function from linear regression part. As long as your test rmse score is close to the TA’s answer (TA’s answer ±0.5), you can get full points.
Again, let’s see what we have learned. You only need to run the cell corresponding to your specific implementation.
In [79]: # HELPER CELL, DO NOT MODIFY
sub_train = train_indices[:10]
print(x_all_feat[sub_train].shape)
print(y_all[sub_train].shape)
weight = reg.ridge_fit_closed(x_all_feat[sub_train], y_all[sub_train], c_lambda=1000)

y_pred = reg.predict(x_all_feat, weight)

y_test_pred = reg.predict(x_all_feat[test_indices], weight)
test_rmse = reg.rmse(y_test_pred, y_all[test_indices])
print('test rmse: %.4f' % test_rmse)

fig = plt.figure(figsize=(8, 5))
ax = fig.add_subplot(111, projection='3d')

x1 = x_all[:, 0]
x2 = x_all[:, 0]
y_pred = np.reshape(y_pred, (1500,))
ax.plot(x1, x2, y_pred, color='b', lw=4)

x3 = x_all[sub_train, 0]
x4 = x_all[sub_train, 1]
ax.scatter(x3, x4, y_all[sub_train], s=100, c='r', marker='x')
ax.set_xlabel("feature_1")
ax.set_ylabel("feature_2")
ax.set_zlabel("y")
ax.text2D(0.05, 0.95, "Ridge Regression(closed) Result", transform=ax.transAxes)
y_test_pred = reg.predict(x_all_feat[test_indices], weight)

(10, 27)
(10, 1)
test rmse: 1.5695

In [80]: # HELPER CELL, DO NOT MODIFY
sub_train = train_indices[:10]
weight = reg.ridge_fit_GD(x_all_feat[sub_train], y_all[sub_train], c_lambda=1000, learning_rate=1e-9)

y_pred = reg.predict(x_all_feat, weight)

y_test_pred = reg.predict(x_all_feat[test_indices], weight)
test_rmse = reg.rmse(y_test_pred, y_all[test_indices])
print('test rmse: %.4f' % test_rmse)

fig = plt.figure(figsize=(8, 5))
ax = fig.add_subplot(111, projection='3d')

x1 = x_all[:, 0]
x2 = x_all[:, 0]
y_pred = np.reshape(y_pred, (1500,))
ax.plot(x1, x2, y_pred, color='b', lw=4)

x3 = x_all[sub_train, 0]
x4 = x_all[sub_train, 1]
ax.scatter(x3, x4, y_all[sub_train], s=100, c='r', marker='x')
ax.set_xlabel("feature_1")
ax.set_ylabel("feature_2")
ax.set_zlabel("y")
ax.text2D(0.05, 0.95, "Ridge Regression(GD) Result", transform=ax.transAxes)
y_test_pred = reg.predict(x_all_feat[test_indices], weight)

test rmse: 1.9763

In [81]: # HELPER CELL, DO NOT MODIFY
sub_train = train_indices[:10]
weight = reg.ridge_fit_SGD(x_all_feat[sub_train], y_all[sub_train], c_lambda=1000, learning_rate=1e-9)

y_pred = reg.predict(x_all_feat, weight)

y_test_pred = reg.predict(x_all_feat[test_indices], weight)
test_rmse = reg.rmse(y_test_pred, y_all[test_indices])
print('test rmse: %.4f' % test_rmse)

fig = plt.figure(figsize=(8, 5))
ax = fig.add_subplot(111, projection='3d')

x1 = x_all[:, 0]
x2 = x_all[:, 0]
y_pred = np.reshape(y_pred, (1500,))
ax.plot(x1, x2, y_pred, color='b', lw=4)

x3 = x_all[sub_train, 0]
x4 = x_all[sub_train, 1]
ax.scatter(x3, x4, y_all[sub_train], s=100, c='r', marker='x')
ax.set_xlabel("feature_1")
ax.set_ylabel("feature_2")
ax.set_zlabel("y")
ax.text2D(0.05, 0.95, "Ridge Regression(SGD) Result", transform=ax.transAxes)
y_test_pred = reg.predict(x_all_feat[test_indices], weight)

test rmse: 1.9862

3.5 Cross validation [7 pts] **[W]**
Let’s use Cross Validation to find the best value for c_lambda in ridge regression.
In [82]: # HELPER CELL, DO NOT MODIFY
# We provided 6 possible values for lambda, and you will use them in cross validation.
# For cross validation, use the 10-fold method and only use it on your training data (you already have the train_indices to get the training data).
# Split the training data into 10 folds: use 10 percent of the training data for testing and 90 percent for training.
# At the end, for each lambda you will have calculated 10 RMSE values; take their mean.
# That's it. Pick the lambda with the lowest mean RMSE.
# Hint: np.concatenate is your friend.
best_lambda = None
best_error = None
kfold = 10
lambda_list = [0.1, 1, 5, 10, 100, 1000]
for lm in lambda_list:
    err = reg.ridge_cross_validation(x_all_feat[train_indices], y_all[train_indices], kfold, lm)
    print('lambda: %.2f' % lm, 'error: %.6f' % err)
    if best_error is None or err < best_error:
        best_error = err
        best_lambda = lm

print('best_lambda: %.2f' % best_lambda)
weight = reg.ridge_fit_closed(x_all_feat[train_indices], y_all[train_indices], c_lambda=10)
y_test_pred = reg.predict(x_all_feat[test_indices], weight)
test_rmse = reg.rmse(y_test_pred, y_all[test_indices])
print('test rmse: %.4f' % test_rmse)

lambda: 0.10 error: 1.004417
lambda: 1.00 error: 1.002867
lambda: 5.00 error: 1.005909
lambda: 10.00 error: 1.002961
lambda: 100.00 error: 1.010178
lambda: 1000.00 error: 1.039508
best_lambda: 1.00
test rmse: 1.0030
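The cross-validation routine used above can be sketched roughly as follows. This is an assumed implementation outline (contiguous fold slicing, averaging of per-fold RMSE), and the graded ridge_cross_validation in regression.py may handle the folds or the return value differently.

```python
import numpy as np

def ridge_cross_validation_sketch(X, y, kfold, c_lambda, reg):
    """Mean RMSE of ridge regression over kfold contiguous validation folds."""
    fold_size = X.shape[0] // kfold
    errors = []
    for i in range(kfold):
        lo, hi = i * fold_size, (i + 1) * fold_size
        X_val, y_val = X[lo:hi], y[lo:hi]
        X_tr = np.concatenate([X[:lo], X[hi:]], axis=0)
        y_tr = np.concatenate([y[:lo], y[hi:]], axis=0)
        w = reg.ridge_fit_closed(X_tr, y_tr, c_lambda=c_lambda)
        errors.append(reg.rmse(reg.predict(X_val, w), y_val))
    return np.mean(errors)
```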
3.6 Noisy Input Samples in Linear Regression [10 pts] **[W]**
Consider a linear model of the form:

$$y(x_n, \theta) = \theta_0 + \sum_{d=1}^{D} \theta_d x_{nd}$$

where $x_n = (x_{n1}, \dots, x_{nD})$ and the weights are $\theta = (\theta_0, \dots, \theta_D)$. Given the D-dimensional input sample set $x = \{x_1, \dots, x_n\}$ with corresponding target values $y = \{y_1, \dots, y_n\}$, the sum-of-squares error function is:

$$E_D(\theta) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, \theta) - y_n\}^2$$

Now, suppose that Gaussian noise $\epsilon_n$ with zero mean and variance $\sigma^2$ is added independently to each input sample $x_n$ to generate a new sample set $x' = \{x_1 + \epsilon_1, \dots, x_n + \epsilon_n\}$. For each sample $x_n$, $x'_n = (x_{n1} + \epsilon_{n1}, \dots, x_{nD} + \epsilon_{nD})$, where $\epsilon_{nd}$ is independent across both $n$ and $d$ indices.
1. (3pts) Show that $y(x'_n, \theta) = y(x_n, \theta) + \sum_{d=1}^{D} \theta_d \epsilon_{nd}$
2. (7pts) Assume the sum-of-squares error function of the noisy sample set $x' = \{x_1 + \epsilon_1, \dots, x_n + \epsilon_n\}$ is $E_D(\theta)'$. Prove that the expectation of $E_D(\theta)'$ is equivalent to the sum-of-squares error $E_D(\theta)$ for noise-free input samples with the addition of a weight-decay regularization term (e.g. $L_2$ norm), in which the bias parameter $\theta_0$ is omitted from the regularizer. In other words, show that

$$E[E_D(\theta)'] = E_D(\theta) + regularizer$$
Hint:
During the class, we discussed how to solve for the weight $\theta$ for ridge regression; the objective function looks like this:

$$E(\theta) = \frac{1}{N} \sum_{i=1}^{N} \{y(x_i, \theta) - y_i\}^2 + \frac{\lambda}{N} \sum_{i=1}^{d} ||\theta_i||^2$$

where the first term is the sum-of-squares error and the second term is the regularization term. $N$ is the number of samples. In this question, we use another form of the ridge regression objective, which is:

$$E(\theta) = \frac{1}{2} \sum_{i=1}^{N} \{y(x_i, \theta) - y_i\}^2 + \frac{\lambda}{2} \sum_{i=1}^{d} ||\theta_i||^2$$

For the Gaussian noise $\epsilon_n$, we have $E[\epsilon_n] = 0$.

Assuming the noise terms $\epsilon = (\epsilon_1, \dots, \epsilon_n)$ are independent of each other, we have

$$E[\epsilon_n \epsilon_m] = \begin{cases} \sigma^2 & m = n \\ 0 & m \neq n \end{cases}$$
Answer to Q1

$$y(x'_n, \theta) = \theta_0 + \sum_{d=1}^{D} \theta_d (x_{nd} + \epsilon_{nd}) = \theta_0 + \sum_{d=1}^{D} \theta_d x_{nd} + \sum_{d=1}^{D} \theta_d \epsilon_{nd} = y(x_n, \theta) + \sum_{d=1}^{D} \theta_d \epsilon_{nd}$$
Answer to Q2

The sum-of-squares error on the noisy sample set is

$$E_D(\theta)' = \frac{1}{2} \sum_{n=1}^{N} \{y(x'_n, \theta) - y_n\}^2 = \frac{1}{2} \sum_{n=1}^{N} \left\{y(x_n, \theta) + \sum_{d=1}^{D} \theta_d \epsilon_{nd} - y_n\right\}^2$$

$$= \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, \theta) - y_n\}^2 + \sum_{n=1}^{N} \{y(x_n, \theta) - y_n\} \sum_{d=1}^{D} \theta_d \epsilon_{nd} + \frac{1}{2} \sum_{n=1}^{N} \left(\sum_{d=1}^{D} \theta_d \epsilon_{nd}\right)^2$$

Since $\epsilon_n = (\epsilon_{n1}, \dots, \epsilon_{nD})$ is independent across both $n$ and $d$ indices, we have

$$E[\epsilon_{nd}] = 0, \qquad E[\epsilon_{nd}\epsilon_{md}] = \begin{cases} \sigma^2 & m = n \\ 0 & m \neq n \end{cases}$$

Then the expectation of the cross term is

$$E\left[\sum_{n=1}^{N} \{y(x_n, \theta) - y_n\} \sum_{d=1}^{D} \theta_d \epsilon_{nd}\right] = \sum_{n=1}^{N} \{y(x_n, \theta) - y_n\} \sum_{d=1}^{D} \theta_d E[\epsilon_{nd}] = 0$$

and the expectation of the quadratic term is

$$E\left[\frac{1}{2} \sum_{n=1}^{N} \left(\sum_{d=1}^{D} \theta_d \epsilon_{nd}\right)^2\right] = \frac{1}{2} \sum_{n=1}^{N} \sum_{d=1}^{D} \theta_d^2 \, E[\epsilon_{nd}^2] = \frac{N\sigma^2}{2} \sum_{d=1}^{D} \theta_d^2$$

Thus,

$$E[E_D(\theta)'] = E_D(\theta) + \frac{N\sigma^2}{2} \sum_{d=1}^{D} \theta_d^2 = E_D(\theta) + regularizer$$
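As an optional sanity check of the identity above (not part of the graded answer), one can verify numerically that the average noisy-input error approaches $E_D(\theta) + \frac{N\sigma^2}{2}\sum_d \theta_d^2$. The setup below assumes noise-free targets, so $E_D(\theta) = 0$ and the average should approach the regularizer alone; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma = 2000, 3, 0.5
theta0, theta = 0.7, rng.normal(size=D)
X = rng.normal(size=(N, D))
y = theta0 + X @ theta                      # noise-free targets, so E_D(theta) = 0

def sse(X_in):
    # Sum-of-squares error (1/2) * sum_n (y(x_n, theta) - y_n)^2
    return 0.5 * np.sum((theta0 + X_in @ theta - y) ** 2)

# Average E_D(theta)' over many noisy copies of the inputs.
avg_noisy = np.mean([sse(X + rng.normal(scale=sigma, size=X.shape)) for _ in range(200)])
print(avg_noisy, sse(X) + N * sigma**2 / 2 * np.sum(theta**2))  # the two values should be close
```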
4. Naive Bayes Classification [25pts]**[P]** | **[W]**
4.1 Bayes in Advertisements [5pts] **[W]**
A doctor wants to evaluate her patients’ health conditions and their relations to lifestyle. She sampled 12 patients randomly and conducted a survey to learn about their lifestyles. The table below shows the current health risk a patient is at and his/her lifestyle.
| Health Risk | Smoker? | Exercise Frequency (days/wk) | Average Sleeping Hours per Day |
|---|---|---|---|
| High | Yes | 0-3 | <6 |
| Medium | Yes | 0-3 | 6-9 |
| Low | Yes | >3 | 6-9 |
| Medium | No | >3 | 6-9 |
| Low | No | 0-3 | 6-9 |
| Low | No | >3 | <6 |
| High | Yes | >3 | <6 |
| Medium | No | >3 | <6 |
| High | No | 0-3 | 6-9 |
| Low | No | 0-3 | 6-9 |
| Medium | Yes | 0-3 | 6-9 |
| Low | Yes | >3 | 6-9 |
Given a smoker who exercises >3 days/wk and sleeps 6-9 hours per day on average, use Naive Bayes to assess the health condition this person is most likely to be in.
Note: You can assume that each habit of a person is independent from other habits i.e. A person who exercises regularly does not tell any information about his sleeping pattern or whether he is a smoker.
Prior probabilities (from the table):

$$P(high) = \frac{3}{12}, \qquad P(medium) = \frac{4}{12}, \qquad P(low) = \frac{5}{12}$$

Likelihoods (using Add-1 smoothing): $P(smoke|high)$, $P(smoke|medium)$, $P(smoke|low)$, $P(>3|high)$, $P(>3|medium)$, $P(>3|low)$, $P(6\text{-}9|high)$, $P(6\text{-}9|medium)$, and $P(6\text{-}9|low)$ are computed from the table, adding 1 to each count.

Posterior probabilities:

$$P(high \mid smoke, >3, 6\text{-}9) = \frac{P(smoke|high)P(>3|high)P(6\text{-}9|high)P(high)}{\sum_{c \in \{high,\, medium,\, low\}} P(smoke|c)P(>3|c)P(6\text{-}9|c)P(c)} = 0.2149$$

$$P(medium \mid smoke, >3, 6\text{-}9) = 0.2546$$

$$P(low \mid smoke, >3, 6\text{-}9) = 0.5305$$

Thus, the health condition of the person is most likely to be Low.
4.2 Sentiment Analysis for News [15pts] **[P]**
This dataset contains the sentiments of financial news headlines from the perspective of a retail investor. The sentiment of a news item has 3 classes: negative (class label = 0), neutral (class label = 1) and positive (class label = 2). There are 4846 news items in total, with 9 duplicates. We remove those duplicates to obtain 4837 unique news items and then randomly split them into a training set and an evaluation set with an 8:2 ratio. We use the training set to fit the Naive Bayes model and use the evaluation set to evaluate the accuracy of our model.
The code which is provided loads the documents and builds a “bag of words” representation (https://en.wikipedia.org/wiki/Bag-of-words_model) of each document. Your task is to complete the missing portions of the code and to determine whether a news item is negative, neutral or positive. (Hint: Label 0 denotes that the news is negative, label 1 denotes that it is neutral and 2 means it is positive. Our job here is to determine the sentiment of each news item using Naive Bayes.)
priors_prob function calculates the ratio of class probabilities of negative, neutral or positive. We do this based on word counts rather than document counts.
likelihood_ratio function calculates the ratio of word probabilities given the label of whether the news is negative, neutral or positive.
analyze_sentiment function takes in the likelihood ratio, the prior probabilities for each class, and a number of test news items in the Bag-of-Words representation, and analyzes the sentiment of each news item.
For example, suppose we have a matrix like the following (the first column denotes the class label, and the entries in the remaining columns denote the number of occurrences of each word):

| label | happy | useless |
|---|---|---|
| 0 (negative) | 1 | 4 |
| 0 (negative) | 0 | 6 |
| 1 (neutral) | 3 | 2 |
| 2 (positive) | 3 | 1 |
| 2 (positive) | 4 | 0 |
Then we have

$$prior(negative) = \frac{1+4+0+6}{1+4+0+6+3+2+3+1+4+0} = \frac{11}{24}$$

$$prior(neutral) = \frac{3+2}{1+4+0+6+3+2+3+1+4+0} = \frac{5}{24}$$

$$prior(positive) = \frac{3+1+4+0}{1+4+0+6+3+2+3+1+4+0} = \frac{8}{24}$$
Note 1: In likelihood_ratio() add one to each word count so as to avoid issues with zero word count. This is known as Add-1 smoothing. It is a type of additive smoothing.
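A hedged sketch of the three functions is shown below, assuming a document-by-vocabulary count matrix X and integer labels y; the graded versions in nb.py may differ in signatures and in how the test matrix is passed.

```python
import numpy as np

def priors_prob_sketch(X, y, n_classes=3):
    # Class priors based on total word counts per class, as in the example above.
    totals = np.array([X[y == c].sum() for c in range(n_classes)], dtype=float)
    return totals / totals.sum()

def likelihood_ratio_sketch(X, y, n_classes=3):
    # Per-class word probabilities with Add-1 smoothing.
    counts = np.array([X[y == c].sum(axis=0) for c in range(n_classes)], dtype=float) + 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def analyze_sentiment_sketch(likelihood, priors, X_test):
    # Pick the class with the highest (log) posterior for each bag-of-words row.
    log_posterior = X_test @ np.log(likelihood).T + np.log(priors)
    return np.argmax(log_posterior, axis=1)
```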

4.3 Accuracy result analysis [5pts] **[W]**
Do you think this is a good accuracy? What assumptions can you make that limit the accuracy? (This is an open question, any reasonable assumptions will be acceptable).
A: No, I don't think it is good enough. The features are too sparse; performing feature selection to reduce the dimensionality of the data and extract useful features may improve the accuracy.
5 Noise in PCA and Linear Regression (15 Pts) **[P]****[W]**
Both PCA and least squares regression can be viewed as algorithms for inferring (linear) relationships among data variables. In this part of the assignment, you will develop some intuition for the differences between these two approaches, and an understanding of the settings that are better suited to using PCA or better suited to using the least squares fit.
The high-level idea is that PCA is useful when there is a set of latent (hidden/underlying) variables, and all the coordinates of your data are linear combinations (plus noise) of those variables. The least squares fit is useful when you have direct access to the independent variables, so any noisy coordinates are linear combinations (plus noise) of known variables.
5.1 Slope Functions (5 Pts) **[P]**
In slope.py complete the following:
1. For this function, assume that X is the first feature and Y is the second feature for the data. Write a function, pca_slope, that takes in the first feature vector X and the second feature vector Y. Concatenate these two feature vectors into a single Nx2 matrix and use this to determine the first principal component vector of this dataset. Finally, return the slope of this first component. You should use the PCA implementation from Q2.

2. Write a function lr_slope that takes X and y and returns the slope of the least squares fit. You should use the Linear Regression implementation from Q3 but do not use any kind of regularization. Think about how weight could relate to slope.
In later subparts, we consider the case where our data consists of noisy measurements of x and y. For each part, we will evaluate the quality of the relationship recovered by PCA, and that recovered by standard least squares regression.
As a reminder, least squares regression minimizes the squared error of the dependent variable from its prediction. Namely, given $(x_i, y_i)$ pairs, least squares returns the line $l(x)$ that minimizes $\sum_i (y_i - l(x_i))^2$.

We will consider a simple example with two variables, x and y, where the true relationship between the variables is y = 4x. Our goal is to recover this relationship—namely, recover the coefficient “4”. We set X = [0, .02, .04, .06, . . . , 1] and y = 4x. Make sure both functions return 4.

Slope of best linear fit 4.0
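A rough sketch of the two slope functions is given below. It uses NumPy's SVD and pseudo-inverse directly for brevity, whereas the graded slope.py should reuse the PCA implementation from Q2 and the (unregularized) Regression implementation from Q3; the function names here are illustrative only.

```python
import numpy as np

def pca_slope_sketch(X, Y):
    # Slope of the first principal component of the centered N x 2 cloud [X, Y].
    data = np.stack([X, Y], axis=1)
    _, _, Vt = np.linalg.svd(data - data.mean(axis=0), full_matrices=False)
    first_pc = Vt[0]
    return first_pc[1] / first_pc[0]

def lr_slope_sketch(X, Y):
    # Slope of the unregularized least-squares fit Y ~ intercept + slope * X.
    A = np.stack([np.ones_like(X), X], axis=1)
    theta = np.linalg.pinv(A) @ Y
    return theta[1]

X = np.arange(0, 1.02, 0.02)   # [0, .02, .04, ..., 1]
print(pca_slope_sketch(X, 4 * X), lr_slope_sketch(X, 4 * X))  # both should be ~4.0
```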

5.2 Analysis Setup (5 Pts) **[P]**
Error in y
In this subpart, we consider the setting where our data consists of the actual values of 𝑥, and noisy estimates of 𝑦. Run the following cell to see how the data looks when there is error in 𝑦.

In slope.py, you will implement the addNoise function:
1. Create a vector $X$ where $X = [x_1, x_2, \dots, x_{1000}] = [.001, .002, .003, \dots, 1]$.

2. For a given noise level $c$, set $\hat{y}_i \sim 2x_i + \mathcal{N}(0, c) = 2i/1000 + \mathcal{N}(0, c)$, and $\hat{Y} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_{1000}]$. You can use the np.random.normal function, where scale is equal to the noise level, to add noise to your points.

3. Return the pca_slope and lr_slope values of this $X$ and $\hat{Y}$ dataset you have created, where only $\hat{Y}$ has noise.

A scatter plot with $c$ on the horizontal axis, and the output of pca_slope and lr_slope on the vertical axis, has already been implemented for you.
A sample $\hat{Y}$ has been taken for each $c$ in $[0, 0.05, 0.1, \dots, 0.95, 1.0]$. The output of pca_slope is plotted as a red dot, and the output of lr_slope as a blue dot. This has been repeated 30 times, so we end up with a plot of 1260 dots, in 21 columns of 60, half red and half blue.

Error in x and y
We will now examine the case where our data consists of noisy estimates of both 𝑥 and 𝑦. Run the following cell to see how the data looks when there is error in both.

In slope.py you will modify the addNoise function you created in the previous step:
1. Notice the parameter x_noise in the addNoise function. When this parameter is set to True, you will have to add noise to $X$. For a given noise level $c$, let $\hat{x}_i \sim x_i + \mathcal{N}(0, c) = i/1000 + \mathcal{N}(0, c)$, and $\hat{X} = [\hat{x}_1, \hat{x}_2, \dots, \hat{x}_{1000}]$.

2. For the same noise level $c$, set $\hat{y}_i \sim 2x_i + \mathcal{N}(0, c) = 2i/1000 + \mathcal{N}(0, c)$, and $\hat{Y} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_{1000}]$. Again, you can use the np.random.normal function, where scale is equal to the noise level, to add noise to your points.

3. Return the pca_slope and lr_slope values of this $\hat{X}$ and $\hat{Y}$ dataset you have created, where both $\hat{X}$ and $\hat{Y}$ have noise.

[HINT: Make sure to apply np.random.normal to the $\hat{X}$ values first and then to the $\hat{Y}$ values second, so that the random seed values match in the autograder tests.]

A scatter plot with $c$ on the horizontal axis, and the output of pca_slope and lr_slope on the vertical axis, has already been implemented for you. A sample $\hat{X}$ and $\hat{Y}$ has been taken for each $c$ in $[0, 0.05, 0.1, \dots, 0.95, 1.0]$. The output of pca_slope is plotted as a red dot, and the output of lr_slope as a blue dot. This has been repeated 30 times, so we end up with a plot of 1260 dots, in 21 columns of 60, half red and half blue.
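A hedged sketch of the addNoise behaviour described in the two subsections above is given below, reusing the slope sketches from 5.1; the graded addNoise in slope.py should call your actual pca_slope and lr_slope and may differ in how randomness is seeded.

```python
import numpy as np

def add_noise_sketch(c, x_noise=False):
    """Return (pca_slope, lr_slope) for one noisy draw at noise level c."""
    X = np.arange(1, 1001) / 1000.0                          # [.001, .002, ..., 1]
    # Noise is added to X first (when requested), then to Y.
    X_used = X + np.random.normal(scale=c, size=X.shape) if x_noise else X
    Y_hat = 2 * X + np.random.normal(scale=c, size=X.shape)  # noisy samples of y = 2x
    return pca_slope_sketch(X_used, Y_hat), lr_slope_sketch(X_used, Y_hat)
```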

5.3. Analysis (5 Pts) **[W]**
Based on your observations from the previous subsections, answer the following questions about the two cases (error only in $Y$, and error in both $X$ and $Y$) in 2-3 lines.
Note:
1. The closer the value of the slope is to the actual slope ("2" here), the better the algorithm is performing.
2. You don’t need to provide a mathematical proof for this question.
Questions:
1. Which case does PCA perform worse in? Why does PCA perform worse in this case? (2 Pts)
2. Why does PCA perform better in the other case? (1 Pt)
3. Which case does Linear Regression perform well? Why does Linear Regression perform well in this case? (2 Pts)
A:
1. PCA performs worse in the case with error only in $Y$. PCA extracts the axes along which the data shows the highest variability; noise in $Y$ inflates the variance along the y-axis, which misleads PCA.
2. Adding noise to both $X$ and $Y$ roughly preserves the relationship between $X$ and $Y$, so PCA performs better in that case.
3. Linear Regression performs well in the case with error only in $Y$. Noise in $Y$ does not change the underlying linear relationship between $X$ and $Y$ (it only adds zero-mean error to the dependent variable), so Linear Regression still recovers the slope well.
6 Feature Selection [Bonus for everyone] [25 Points] **[P]** | ** [W]**
6.1 Implementation [18 Points] **[P]**
Feature selection is an integral aspect of machine learning. It is the process of selecting a subset of relevant features that are to be used as the input for the machine learning task. Feature selection may lead to simpler models for easier interpretation, shorter training times, avoidance of the curse of dimensionality, and better generalization by reducing overfitting.
Implement a method to find the final list of significant features due to forward selection and backward elimination.
Forward Selection:
In forward selection, we start with a null model, fit the model with one individual feature at a time, and select the feature with the minimum p-value. We continue adding features in this way as long as the best new feature's p-value is less than the significance level.
Steps to implement it:
1: Choose a significance level (given to you).
2: Fit all possible simple regression models by considering one feature at a time.
3: Select the feature with the lowest p-value.
4: Fit all possible models with one extra feature added to the previously selected feature(s).
5: Select the feature with the minimum p-value again. if p_value < significance, go to Step 4. Otherwise, terminate.
Backward Elimination:
In backward elimination, we start with a full model, and then remove the insignificant feature with the highest p-value (that is greater than the significance level). We continue to do this until we have a final set of significant features.
Steps to implement it:
1: Choose a significance level (given to you).
2: Fit a full model including all the features.
3: Select the feature with the highest p-value. If (p_value > significance level), go to Step 4, otherwise terminate.
4: Remove the feature under consideration.
5: Fit a model without this feature. Repeat entire process from Step 3 onwards.
TIP 1: The p-value is known as the observed significance value for a test hypothesis. It tests all the assumptions about how the data was generated in the model, not just the target hypothesis it was supposed to test. Some more information about p-values can be found here: https://towardsdatascience.com/what-is-a-p-value-b9e6c207247f
TIP 2: For this function, you will have to install statsmodels if it is not installed already. Run ‘pip install statsmodels’ in the command line/terminal. If you are using an Anaconda environment, run ‘conda install -c conda-forge statsmodels’ instead. For more information about installation, refer to https://www.statsmodels.org/stable/install.html. The statsmodels library is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration. You will have to use this library to choose a regression model to fit your data against. Some more information about this module can be found here: https://www.statsmodels.org/stable/index.html
TIP 3: For step 2 in each of the forward and backward selection functions, you can use the ‘sm.OLS’ function as your regression model. Also, do not forget to add a bias to your regression model. A function that may help you is ‘sm.add_constant’.
TIP 4: You should be able to implement the functions using only the libraries provided in the cell below.
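A hedged sketch of forward selection with sm.OLS is shown below (backward elimination follows the mirrored steps). The class name and return format expected by the autograder are defined in feature_selection.py and may differ from this outline.

```python
import pandas as pd
import statsmodels.api as sm

def forward_selection_sketch(X, y, significance_level=0.05):
    selected = []                       # start from the null model
    remaining = list(X.columns)
    while remaining:
        pvalues = pd.Series(index=remaining, dtype=float)
        for col in remaining:
            # Fit a model with the already-selected features plus one candidate.
            model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            pvalues[col] = model.pvalues[col]
        best = pvalues.idxmin()
        if pvalues[best] >= significance_level:
            break                       # no remaining feature is significant enough
        selected.append(best)
        remaining.remove(best)
    return selected
```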
In [ ]: # HELPER CELL, DO NOT MODIFY

import pandas as pd
import statsmodels.api as sm

from feature_selection import FeatureSelection

boston = load_boston()
bos = pd.DataFrame(boston.data, columns=boston.feature_names)
bos['Price'] = boston.target
X = bos.drop("Price", 1)  # feature matrix
y = bos['Price']  # target feature
featureselection = FeatureSelection()

# Run the functions to make sure two lists are generated, one for each method
print("Features selected by forward selection:", featureselection.forward_selection(X, y))
print("Features selected by backward elimination:", featureselection.backward_elimination(X, y)[0])

Features selected by forward selection: ['LSTAT', 'RM', 'PTRATIO', 'DIS', 'NOX', 'CHAS', 'B', 'ZN', 'CRIM', 'RAD', 'TAX']
Features selected by backward elimination: ['CRIM', 'ZN', 'CHAS', 'NOX', 'RM', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
6.2 Feature Selection – Discussion [7pts] **[W]**
Question 6.2.1:
We have seen two regression methods namely Lasso and Ridge regression earlier in this assignment. Another extremely important and common use-case of these methods is to perform feature selection. According to you, which of these two methods are more appropriate for feature selection? Why? (3 pts)
A: Lasso is more appropriate for feature selection. The L1 penalty induces sparsity and drives the weights of useless features to exactly zero, which provides a clear selection. Ridge regression's L2 penalty only gives small weights to useless features, so an additional threshold cut would be required to perform feature selection.
Question 6.2.2:
We have seen that we use different subsets of features to get different regression models. These models depend on the relevant features that we have selected. Using forward and backward selection, what fraction of the total possible models are we exploring? Assume that the total number of features that we have at our disposal is N. (4 pts)
A:
The number of total possible models is $2^N$.

The maximum number of models explored by forward selection is $N + (N-1) + \dots + 1 = \frac{N(N+1)}{2}$; the corresponding fraction is $\frac{N(N+1)}{2^{N+1}}$.

The maximum number of models explored by backward elimination is $1 + (N-1) + (N-2) + \dots + 1 = \frac{N(N-1)}{2} + 1$; the corresponding fraction is $\frac{N(N-1)+2}{2^{N+1}}$.
