## Description

Introduction:

Regression analysis is a statistical procedure for estimating the relationship between a target variable and a set of potentially relevant variables. Usually, in a stochastic setting, regression analysis estimates the conditional expectation of the response variable given the other variables; roughly speaking, the average value of the response variable for a realization of the other variables.

Such an analysis depends heavily on the underlying data-generating process: the assumptions we make about it guide the choice of the regression function and the constraints we impose on the relationship we want to estimate. If the assumed model is excessively complex, over-fitting occurs, which diminishes the predictive performance of the model.

In this project, we explore basic regression models on various datasets, along with two basic techniques for handling over-fitting: cross-validation and regularization. With cross-validation, we test how the model generalizes to unseen data by evaluating its performance on a set of data not used for training, while with regularization we penalize overly complex models.

Network backup Dataset

1) Load the dataset. You can download the dataset from this link. The dataset consists of simulated traffic data on a backup system in a network. The system monitors the files residing on a destination machine and copies their changes in four-hour cycles. At the end of each backup process, the size of the data moved to the destination and the duration of the backup are logged, to be used for developing prediction models. We define a workflow as a task that backs up data from a group of files that have similar patterns of change in file size over time. In other words, how the files change varies among workflows and depends on factors such as the day of the week and the time of the day. The dataset has around 18000 data points with the following columns:

• Week index

• Day of the week: the day on which the backup starts

• Backup start time (hour of the day): the hour at which the backup process starts

• Workflow ID

• File name

• Backup size: the size of the file that is backed up in that cycle in GB

• Backup time: the duration of the backup procedure in hours

Given this dataset, we want to develop models that predict the size of the data being backed up as well as the time a backup process may take (refer to the “Size of Backup” and “Backup Time” columns in the dataset). To get an idea of the types of relationships in your dataset, for each workflow, plot the actual copy sizes of all the files over a period of 20 days. Can you identify any repeating patterns?
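As a hint for preparing this plot, one possible approach is to sum the backup sizes per day for each workflow over the first 20 days. The sketch below uses only the standard library and a few made-up rows; the tuple layout, the day-name encoding, and the helper name `daily_totals` are assumptions for illustration, not the dataset's actual schema.

```python
from collections import defaultdict

# Assumed day-of-week encoding; the real file may encode days differently.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def daily_totals(rows, n_days=20):
    """Sum backup size per workflow and day over the first n_days days.

    Each row is (week_index, day_of_week, workflow_id, backup_size_gb),
    with week_index assumed to start at 1.
    """
    totals = defaultdict(lambda: [0.0] * n_days)
    for week, day, workflow, size_gb in rows:
        day_number = (week - 1) * 7 + DAYS.index(day)  # 0-based day since start
        if day_number < n_days:
            totals[workflow][day_number] += size_gb
    return dict(totals)


# Made-up rows, for illustration only.
rows = [
    (1, "Monday", "work_flow_0", 0.5),
    (1, "Monday", "work_flow_0", 0.25),
    (1, "Tuesday", "work_flow_1", 0.125),
    (2, "Monday", "work_flow_0", 0.5),
    (3, "Sunday", "work_flow_1", 1.0),  # 0-based day 20, outside the window
]

series = daily_totals(rows)
for workflow, sizes in sorted(series.items()):
    print(workflow, sizes[:8])
```

Each list in `series` can then be passed to a plotting library, one curve per workflow, to look for weekly or daily periodicity.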

2) Let us now predict the copy size of a file given the other attributes.

a) Fit a linear regression model with copy size as the target variable and the other attributes as the features. We use ordinary least squares as the penalty function. That is,

$$\min_{\beta} \; \| Y - X\beta \|^2,$$

where the minimization is over the coefficient vector $\beta$.

Perform a 10-fold cross-validation. That is, split the data randomly into 10 equally sized parts. Each time, train on 9 of the parts (90% of the data) and treat the response variable of the remaining part (10% of the data) as unknown, for testing. After training the model, compare the predicted values for the test part with their actual values. Repeating this 10 times, so that each part serves as the test set exactly once, gives “10-fold cross-validation”.
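The procedure just described can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not a prescribed implementation; the function name `kfold_rmse_ols`, the added intercept column, and the synthetic coefficients are all assumptions.

```python
import numpy as np


def kfold_rmse_ols(X, y, k=10, seed=0):
    """Return the per-fold test RMSE of an OLS fit under k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))           # random ordering of the rows
    folds = np.array_split(indices, k)          # k nearly equal parts
    X1 = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    rmses = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Least-squares coefficients estimated from the training part only.
        beta, *_ = np.linalg.lstsq(X1[train_idx], y[train_idx], rcond=None)
        residuals = y[test_idx] - X1[test_idx] @ beta
        rmses.append(np.sqrt(np.mean(residuals ** 2)))
    return np.array(rmses)


# Synthetic data (assumption): y = 2*x0 - x1 + 3 + noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 3.0 + rng.normal(scale=0.1, size=500)

fold_rmse = kfold_rmse_ols(X, y, k=10)
print("mean RMSE over folds:", fold_rmse.mean())
```

For the backup dataset, categorical columns such as the day of the week or the workflow ID would first need a numeric encoding before they can enter `X`.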

Analyze the significance of the different variables using the statistics obtained from the model you have trained, and report the Root Mean Squared Error (RMSE) you obtain. Evaluate how well your model fits the data by providing a scatter plot of fitted values and actual values over time, and a plot of residuals versus fitted values.

b) Use a random forest regression model for this same task. Set the parameters of your model with the following initial values

• Number of trees: 20

• Depth of each tree: 4

You can initialize the maximum number of features considered at each node to the number of features you have. By tuning the parameters you can improve the performance of the model: deeper trees reduce the bias, and more trees reduce the variance. Tune the parameters of your model and report the best RMSE you can get. Compare its RMSE with that of the linear regression model developed earlier.

Interpret the output of your random forest model. Can you identify the patterns you observed in part 1 in your fitted model?
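A possible starting point with the initial values above is sketched below, assuming scikit-learn is the library in use (the assignment does not prescribe one). The synthetic step-function data is made up for illustration; `max_features=None` tells scikit-learn to consider all features at each split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption): the target depends only on feature 0.
rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 3))
y = np.where(X[:, 0] > 0.5, 1.0, 0.0) + 0.05 * rng.normal(size=400)

forest = RandomForestRegressor(
    n_estimators=20,    # number of trees
    max_depth=4,        # depth of each tree
    max_features=None,  # consider all features at every split
    random_state=0,
)

# scikit-learn reports negated MSE per fold; average, flip the sign, take the root.
scores = cross_val_score(forest, X, y, cv=10, scoring="neg_mean_squared_error")
rmse = float(np.sqrt(-scores.mean()))

# Feature importances give one way to interpret the fitted forest.
forest.fit(X, y)
print("10-fold RMSE:", rmse)
print("feature importances:", forest.feature_importances_)
```

Looping over a small grid of `n_estimators` and `max_depth` values and recording the RMSE for each pair is one simple way to carry out the tuning.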

c) Now use a neural network regression model. Explain the major parameters of your model and how they affect your performance in RMSE.
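To make the discussion of parameters concrete, here is a minimal sketch that assumes scikit-learn's `MLPRegressor`; other libraries expose similar settings under different names. The data is synthetic and the specific parameter values are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic one-dimensional regression problem (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 1))
y = np.sin(3.0 * X[:, 0]) + 0.05 * rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

net = MLPRegressor(
    hidden_layer_sizes=(32, 32),  # number and width of the hidden layers
    activation="tanh",            # nonlinearity applied in each hidden unit
    alpha=1e-4,                   # L2 penalty on the weights
    solver="lbfgs",               # optimizer; often adequate for small datasets
    max_iter=2000,                # upper bound on optimizer iterations
    random_state=0,
)
net.fit(X_train, y_train)

rmse = float(np.sqrt(np.mean((y_test - net.predict(X_test)) ** 2)))
print("held-out RMSE:", rmse)
```

Varying one setting at a time, for example the layer widths or `alpha`, and recording the resulting RMSE is one way to see how each parameter affects performance.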

3) Predict the backup size for each of the workflows separately. Does the fit improve? Explain. Note that in this case, you are fitting a piece-wise linear regression model.
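Fitting one linear model per workflow can be sketched as follows. The helper names `fit_per_group` and `predict_per_group` and the two-group synthetic data are assumptions for illustration; the RMSE values computed here are training errors, so cross-validation is still needed for a fair comparison.

```python
import numpy as np


def fit_per_group(X, y, groups):
    """Fit one OLS model (with intercept) per distinct group label."""
    models = {}
    for g in np.unique(groups):
        mask = groups == g
        A = np.column_stack([np.ones(mask.sum()), X[mask]])
        beta, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        models[g] = beta
    return models


def predict_per_group(models, X, groups):
    """Predict each row with the model that belongs to that row's group."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.array([A[i] @ models[g] for i, g in enumerate(groups)])


# Synthetic data (assumption): two groups with opposite slopes.
rng = np.random.default_rng(1)
x = rng.uniform(size=200)
groups = np.repeat([0, 1], 100)
y = np.where(groups == 0, 2.0 * x, 1.0 - 2.0 * x) + 0.01 * rng.normal(size=200)
X = x.reshape(-1, 1)

# One global line for all rows versus one line per group.
A = np.column_stack([np.ones(len(X)), X])
beta_global, *_ = np.linalg.lstsq(A, y, rcond=None)
rmse_global = float(np.sqrt(np.mean((y - A @ beta_global) ** 2)))

models = fit_per_group(X, y, groups)
pred = predict_per_group(models, X, groups)
rmse_grouped = float(np.sqrt(np.mean((y - pred) ** 2)))

print("global RMSE:", rmse_global, "per-group RMSE:", rmse_grouped)
```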

Now, try fitting a more complex regression function to your data. You can try a polynomial function of your variables. Try increasing the degree of the polynomial to improve your fit. Again, use 10-fold cross-validation to evaluate your results. Plot the RMSE of the trained model against the degree of the polynomial, first for a fixed training and test set, and then using the average RMSE from cross-validation. Can you find a threshold on the degree of the fitted polynomial beyond which the generalization error of your model gets worse?

Can you explain how cross-validation helps control the complexity of your model?
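The degree sweep can be illustrated on a single feature with NumPy's `polyfit`. This is a simplified sketch under stated assumptions: the data is synthetic with a cubic ground truth, and the helper name `cv_rmse_for_degree` is hypothetical. With several features, the polynomial terms would have to be built from all of them.

```python
import numpy as np


def cv_rmse_for_degree(x, y, degree, k=10, seed=0):
    """Average test RMSE of a degree-`degree` polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    rmses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)  # fit on training part
        pred = np.polyval(coeffs, x[test])               # evaluate on held-out part
        rmses.append(np.sqrt(np.mean((y[test] - pred) ** 2)))
    return float(np.mean(rmses))


# Synthetic data (assumption): cubic ground truth plus noise.
rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=60)
y = 1.0 - 2.0 * x + 3.0 * x ** 3 + 0.2 * rng.normal(size=60)

degrees = list(range(1, 11))
cv_curve = [cv_rmse_for_degree(x, y, d) for d in degrees]
best_degree = degrees[int(np.argmin(cv_curve))]

for d, score in zip(degrees, cv_curve):
    print(f"degree {d:2d}: CV RMSE = {score:.3f}")
print("degree with lowest CV RMSE:", best_degree)
```

Plotting `cv_curve` against `degrees` shows where the cross-validated error stops improving, which is the threshold the question asks about.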

Boston Housing Dataset

Load the dataset. You can download the dataset from this link. This dataset concerns housing values in the suburbs of the greater Boston area and is taken from the StatLib library, which is maintained at Carnegie Mellon University. There are around 500 data points with the following features:

• CRIM: per capita crime rate by town

• ZN: proportion of residential land zoned for lots over 25,000 sq. ft.

• INDUS: proportion of non-retail business acres per town

• CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

• NOX: nitric oxides concentration (parts per 10 million)

• RM: average number of rooms per dwelling

• AGE: proportion of owner-occupied units built prior to 1940

• DIS: weighted distances to five Boston employment centers

• RAD: index of accessibility to radial highways

• TAX: full-value property-tax rate per $10,000

• PTRATIO: pupil-teacher ratio by town

• B: 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town

• LSTAT: % lower status of the population

• MEDV: Median value of owner-occupied homes in $1000’s

4) Fit a linear regression model with MEDV as the target variable, the other attributes as the features, and ordinary least squares as the penalty function. Perform a 10-fold cross-validation, analyze the significance of the different variables using the statistics obtained from the model you have trained, report the averaged Root Mean Squared Error (RMSE), and plot the same curves as in part 2. Repeat the same steps for a polynomial regression function and find the optimal degree of fit as in part 3.

5) In this part, we try to control over-fitting via regularization of the parameters. The idea behind regularization is to constrain the coefficient vector to lie in a less complex set than all of $\mathbb{R}^p$, with $p$ being the number of features. We explore common regularization techniques that impose a further penalty on the size of the regression coefficients, in addition to the sum of squared residuals. Namely, we consider ridge and lasso regression, which correspond to $\ell_2$ and $\ell_1$ regularization, respectively.

a) Tune the complexity parameter $\alpha$ of the ridge regression below over the values {0.1, 0.01, 0.001} and report the best RMSE obtained via 10-fold cross-validation.

$$\min_{\beta} \; \| Y - X\beta \|_2^2 + \alpha \| \beta \|_2^2$$
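For reference, the ridge objective has the closed-form minimizer $\beta = (X^\top X + \alpha I)^{-1} X^\top Y$. The NumPy sketch below uses it to compare the three values of the complexity parameter by 10-fold cross-validation. The function names, the synthetic data, and the choice to leave the intercept unpenalized by centering the data are assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Minimize ||y - b0 - X b||^2 + alpha * ||b||^2; the intercept b0 is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering removes the intercept from the problem
    b = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ b, b


def cv_rmse_ridge(X, y, alpha, k=10, seed=0):
    """Average test RMSE of the ridge fit over k folds."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    rmses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        b0, b = ridge_fit(X[train], y[train], alpha)
        rmses.append(np.sqrt(np.mean((y[test] - b0 - X[test] @ b) ** 2)))
    return float(np.mean(rmses))


# Synthetic data (assumption): five features, two of them irrelevant.
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.5, 0.0]) + 4.0 + 0.3 * rng.normal(size=100)

alphas = [0.1, 0.01, 0.001]
cv_scores = {a: cv_rmse_ridge(X, y, a) for a in alphas}
best_alpha = min(cv_scores, key=cv_scores.get)
print(cv_scores, "best:", best_alpha)
```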

b) Repeat the previous part for lasso regularization as formulated below,

$$\min_{\beta} \; \frac{1}{2n} \| Y - X\beta \|_2^2 + \alpha \| \beta \|_1,$$

where $n$ is the number of samples.
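Unlike ridge, the lasso has no closed-form solution in general. One standard way to minimize this objective is cyclic coordinate descent with soft-thresholding, sketched below in NumPy. The function names and the synthetic data are assumptions, the sketch omits the intercept, and a library solver is likely preferable in practice.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if it does not exceed it."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + alpha * ||b||_1 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n  # (1/n) * x_j^T x_j for each column
    residual = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            residual += X[:, j] * beta[j]  # remove feature j's current contribution
            rho = X[:, j] @ residual / n   # correlation of feature j with the rest
            beta[j] = soft_threshold(rho, alpha) / col_sq[j]
            residual -= X[:, j] * beta[j]  # add the updated contribution back
    return beta


# Synthetic data (assumption): only the first two features matter.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

beta_sparse = lasso_coordinate_descent(X, y, alpha=0.1)
print("lasso coefficients:", beta_sparse)
```

The $\ell_1$ penalty sets the coefficients of the irrelevant features exactly to zero, which is the main qualitative difference from ridge regression.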

Submission: Please submit a zip file containing your report and your code, with a readme file on how to run your code, to [email protected]. The zip file should be named “Project1_UID1_UID2_…_UIDn.zip”, where UIDx are the student ID numbers of the team members. If you have any questions, you can send an email to the same address.