DSGA1003 Homework 2: Lasso Regression

In this homework you will investigate regression with $\ell_1$ regularization, covering both implementation techniques and theoretical properties. On the methods side, you'll work on coordinate descent (the "shooting algorithm"), homotopy methods, and [optionally] projected SGD. On the theory side, you'll derive the largest $\ell_1$ regularization parameter you'll ever need to try; optionally, you'll also derive the explicit solution to the coordinate minimizers used in coordinate descent, investigate what happens with ridge and lasso regression when you have two copies of the same feature, and work out the details of the classic picture that "explains" why $\ell_1$ regularization leads to sparsity.

1.1           Data Set and Programming Problem Overview

For the experiments, we generate some artificial data using code in the file setup_problem.py. We consider the regression setting with the one-dimensional input space $\mathbf{R}$. An image of the training data, along with the target function (i.e. the Bayes prediction function for the square loss), is shown in Figure 1 below.

You can examine how the target function and the data were generated by looking at setup_problem.py.

The figure can be reproduced by running the LOAD_PROBLEM branch of the main function.

As you can see, the target function is a highly nonlinear function of the input. To handle this sort of problem with linear hypothesis spaces, we will need to create a set of features that perform nonlinear transforms of the input. A detailed description of the technique we will use can be found in the Jupyter notebook basis-fns.ipynb, included in the zip file.

In this assignment, we are providing you with a function that takes care of the featurization. This is the “featurize” function, returned by the generate_problem function in setup_problem.py. The generate_problem function also gives the true target function, which has been constructed to be a sparse linear combination of our features. The coefficients of this linear combination are also provided by generate_problem, so you can compare the coefficients of the linear functions you find to the target function coefficients. The generate_problem function also gives you the train and validation sets that you should use.

Figure 1: Training data and target function we will be considering in this assignment.

To get familiar with using the data, and perhaps to learn some techniques, it's recommended that you work through the main() function of the included file ridge_regression.py. You'll go through the following steps (on your own – no need to submit):

  1. Load the problem from disk into memory with load_problem.
  2. Use the featurize function to map from a one-dimensional input space to a d-dimensional feature space.
  3. Visualize the design matrix of the featurized data. (All entries are binary, so we will not do any data normalization or standardization in this problem, though you may experiment with that on your own.) A short code sketch of steps 1–3 appears after this list.
  4. Take a look at the class RidgeRegression. Here we’ve implemented our own RidgeRegression using the general purpose optimizer provided by scipy.optimize. This is primarily to introduce you to the sklearn framework, if you are not already familiar with it. It can help with hyperparameter tuning, as we will see shortly.
  5. Take a look at compare_our_ridge_with_sklearn. In this function, we want to get some evidence that our implementation is correct, so we compare to sklearn’s ridge regression. Comparing the outputs of two implementations is not always trivial – often the objective functions are slightly different, so you may need to think a bit about how to compare the results. In this case, sklearn has total square loss rather than average square loss, so we needed to account for that. In this case, we get an almost exact match with sklearn. This is because ridge regression is a rather easy objective function to optimize. You may not get as exact a match for other objective functions, even if both methods are “correct.”
  6. Next take a look at do_grid_search, in which we demonstrate how to take advantage of the fact that we've wrapped our ridge regression in an sklearn "Estimator" to do hyperparameter tuning. It's a little tricky to get GridSearchCV to use the train/test split that you want, but an approach is demonstrated in this function. In the line assigning the param_grid variable, you can see my attempts at hyperparameter search on a different problem. Below, you will modify this (or use some other method, if you prefer) to find the optimal $\ell_2$ regularization parameter for the data provided.

  7. Next is some code to plot the results of the hyperparameter search.
  8. Next we want to visualize some prediction functions. We plot the target function, along with several prediction functions corresponding to different regularization parameters, as functions of the original input space $\mathbf{R}$, together with the training data. We then visualize the coefficients of each feature with bar charts. Take note of the scale of the y-axes, as they may vary substantially by default.
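As a rough illustration of steps 1–3 above, here is a minimal sketch. The file name and the exact return signature of load_problem are assumptions; consult setup_problem.py for the real interface.

```python
# Minimal sketch of steps 1-3 (assumed load_problem return signature;
# see setup_problem.py for the actual interface).
import matplotlib.pyplot as plt
from setup_problem import load_problem

x_train, y_train, x_val, y_val, target_fn, coefs_true, featurize = \
    load_problem("lasso_data.pickle")  # hypothetical file name

# Step 2: map the 1-d inputs into the d-dimensional feature space.
X_train = featurize(x_train)
X_val = featurize(x_val)

# Step 3: visualize the (binary) design matrix.
plt.imshow(X_train, aspect="auto", cmap="gray", interpolation="nearest")
plt.xlabel("feature index")
plt.ylabel("training example")
plt.title("Design matrix of the featurized training data")
plt.show()
```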

2           Ridge Regression

In the problems below, you do not need to implement ridge regression. You may use any of the code provided in the assignment, or you may use other packages. However, your results must correspond to the ridge regression objective function that we use, namely

$$J(w)=\frac{1}{n}\sum_{i=1}^{n}\left(w^{T}x_{i}-y_{i}\right)^{2}+\lambda\|w\|_{2}^{2}.$$

  1. Run ridge regression on the provided training dataset. Choose the λ that minimizes the empirical risk (i.e. the average square loss) on the validation set. Include a table of the parameter values you tried and the validation performance for each. Also include a plot of the results.
  2. Now we want to visualize the prediction functions. On the same axes, plot the following: the training data, the target function, an unregularized least squares fit (still using the featurized data), and the prediction function chosen in the previous problem. Next, along the lines of the bar charts produced by the code in compare_parameter_vectors, visualize the coefficients for each of the prediction functions plotted, including the target function. Describe the patterns, including the scale of the coefficients, as well as which coefficients have the most weight.
  3. For the chosen λ, examine the model coefficients. For ridge regression we don't expect any parameters to be exactly 0. However, let's investigate whether we can predict the sparsity pattern of the true parameters (i.e. which parameters are 0 and which are nonzero) by thresholding the parameter estimates we get from ridge regression. We'll predict that $w_i = 0$ if $|\hat{w}_i| < \varepsilon$, and that $w_i \neq 0$ otherwise. Give the confusion matrix for $\varepsilon \in \{10^{-6}, 10^{-3}, 10^{-1}\}$, and any other thresholds you would like to try (a code sketch follows this list).
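For problem 3, the thresholding and confusion matrices could be computed along the following lines. This is a sketch: coefs_true (the target coefficients from generate_problem) and w_ridge (the fitted ridge coefficients) are assumed variable names.

```python
# Sketch: confusion matrices for the sparsity pattern predicted by
# thresholding ridge coefficients (coefs_true and w_ridge are assumed names).
import numpy as np
from sklearn.metrics import confusion_matrix

def sparsity_confusion(coefs_true, w_hat, eps):
    actual_nonzero = coefs_true != 0           # true sparsity pattern
    predicted_nonzero = np.abs(w_hat) >= eps   # predict w_i != 0 iff |w_hat_i| >= eps
    return confusion_matrix(actual_nonzero, predicted_nonzero)

for eps in [1e-6, 1e-3, 1e-1]:
    print(f"eps = {eps:g}:")
    print(sparsity_confusion(coefs_true, w_ridge, eps))
```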

3           Coordinate Descent for Lasso (a.k.a. The Shooting algorithm)

The Lasso optimization problem can be formulated as

$$\hat{w}\in\operatorname*{argmin}_{w\in\mathbf{R}^{d}}\sum_{i=1}^{m}\left(h_{w}(x_{i})-y_{i}\right)^{2}+\lambda\|w\|_{1},$$

where $h_{w}(x)=w^{T}x$ and $\|w\|_{1}=\sum_{i=1}^{d}|w_{i}|$. Note that, to align with Murphy's formulation below and for historical reasons, we are using the total square loss, rather than the average square loss, in the objective function.

Since the $\ell_1$-regularization term in the objective function is non-differentiable, it's not immediately clear how gradient descent or SGD could be used to solve this optimization problem directly. (In fact, as we'll see in the next homework on SVMs, we can use "subgradient" methods when the objective function is not differentiable, in addition to the two methods discussed in this homework assignment.)

Another approach to solving optimization problems is coordinate descent, in which at each step we optimize over one component of the unknown parameter vector, fixing all other components. The descent path so obtained is a sequence of steps, each of which is parallel to a coordinate axis in Rd, hence the name. It turns out that for the Lasso optimization problem, we can find a closed form solution for optimization over a single component fixing all other components. This gives us the following algorithm, known as the shooting algorithm:

(Source: Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.) The "soft thresholding" function is defined as

$$\operatorname{soft}(a,\delta)=\operatorname{sign}(a)\left(|a|-\delta\right)_{+},$$

for any $a,\delta\in\mathbf{R}$.

NOTE: Algorithm 13.1 does not account for the case $a_j = c_j = 0$, which occurs when the jth column of X is identically 0. One can either eliminate the column (as it cannot possibly help the solution), or set $w_j = 0$ in that case, since it is, as you can easily verify, the coordinate minimizer. Note also that Murphy suggests initializing the optimization with the ridge regression solution. Although this is not theoretically necessary (with exact computations and enough time, coordinate descent will converge for lasso from any starting point), in practice it's helpful to start as close to the solution as we're able.
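To make the algorithm concrete, here is a minimal sketch of cyclic coordinate descent with soft thresholding, using the $a_j$ and $c_j$ definitions from the optional derivation in Section 3.2 and the stopping rule suggested in Section 3.1; treat it as an illustration, not a reference implementation.

```python
import numpy as np

def soft(a, delta):
    """Soft thresholding: sign(a) * (|a| - delta)_+."""
    return np.sign(a) * max(abs(a) - delta, 0.0)

def shooting_algorithm(X, y, lam, w=None, max_passes=1000, tol=1e-8):
    """Cyclic coordinate descent for the (total-square-loss) lasso objective."""
    n, d = X.shape
    if w is None:
        w = np.zeros(d)  # could instead warm start at the ridge solution

    def objective(w):
        r = X @ w - y
        return r @ r + lam * np.abs(w).sum()

    prev_obj = objective(w)
    for _ in range(max_passes):
        for j in range(d):
            a_j = 2.0 * X[:, j] @ X[:, j]
            c_j = 2.0 * X[:, j] @ (y - X @ w + w[j] * X[:, j])
            # Handle a_j = c_j = 0 (column j identically zero): set w_j = 0.
            w[j] = 0.0 if a_j == 0.0 else soft(c_j / a_j, lam / a_j)
        obj = objective(w)
        if prev_obj - obj < tol:  # pass improved the objective by less than tol
            break
        prev_obj = obj
    return w
```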

There are a few tricks that can make selecting the hyperparameter λ easier and faster. First, as we'll see in a later problem, you can show that for any $\lambda \ge 2\|X^{T}(y-\bar{y})\|_{\infty}$, the estimated weight vector $\hat{w}$ is entirely zero, where $\bar{y}$ is the mean of the values in the vector y, and $\|\cdot\|_{\infty}$ is the infinity norm (or supremum norm), the maximum over the absolute values of the components of a vector. Thus we need to search for an optimal λ only in $[0,\lambda_{\max}]$, where $\lambda_{\max} = 2\|X^{T}(y-\bar{y})\|_{\infty}$. (Note: this expression for $\lambda_{\max}$ assumes we have an unregularized bias term in our model; that is, our decision functions are of the form $h_{w,b}(x)=w^{T}x+b$. In our experiments, we do not have an unregularized bias term, so we should use $\lambda_{\max}=2\|X^{T}y\|_{\infty}$.)

The second trick uses the fact that when λ and λ′ are close, the corresponding solutions $\hat{w}(\lambda)$ and $\hat{w}(\lambda')$ are also close. Start with $\lambda=\lambda_{\max}$, for which we know $\hat{w}(\lambda_{\max})=0$. (You can run the optimization anyway, initializing it at $w = 0$.) Then reduce λ (e.g. by a constant factor close to 1) and solve the optimization problem using the previous optimal point as the starting point. This is called warm starting the optimization. The technique of computing a set of solutions for a chain of nearby λ's is called a continuation or homotopy method, and the resulting set of parameter values $\hat{w}(\lambda)$, as λ ranges over $[0,\lambda_{\max}]$, is known as a regularization path.
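Here is a sketch of the homotopy method just described, reusing the shooting_algorithm sketch above; the shrink factor 0.8 and the number of grid points are illustrative choices, not values mandated by the assignment.

```python
import numpy as np

def lasso_regularization_path(X, y, num_lambdas=30, shrink=0.8):
    """Warm-started lasso solutions along a decreasing chain of lambdas."""
    lam = 2.0 * np.max(np.abs(X.T @ y))  # lambda_max (no unregularized bias term)
    w = np.zeros(X.shape[1])             # we know w_hat(lambda_max) = 0
    path = []
    for _ in range(num_lambdas):
        w = shooting_algorithm(X, y, lam, w=w.copy())  # warm start
        path.append((lam, w.copy()))
        lam *= shrink                    # reduce lambda by a constant factor
    return path
```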

3.1           Experiments with the Shooting Algorithm

  1. The algorithm as described above is not ready for a large dataset (at least if it is implemented in Python) because of the implied loops in the summation signs in the expressions for $a_j$ and $c_j$. Give expressions for computing $a_j$ and $c_j$ using matrix and vector operations, without explicit loops. This is called "vectorization" and can lead to dramatic speedups when implemented in languages such as Python, Matlab, and R. Write your expressions in terms of X, w, $y=(y_{1},\ldots,y_{n})^{T}$ (the column vector of responses), $X_{j}$ (the jth column of X, represented as a column matrix), and $w_j$ (the jth coordinate of w – a scalar).
  2. Write a function that computes the Lasso solution for a given λ using the shooting algorithm described above. For convergence criteria, continue coordinate descent until a pass through the coordinates reduces the objective function by less than $10^{-8}$, or until you have taken 1000 passes through the coordinates. Compare the performance of cyclic coordinate descent to randomized coordinate descent, where in each round we pass through the coordinates in a different random order (for your choices of λ). Compare also the solutions attained (under the convergence criteria above) when starting at 0 versus starting at the ridge regression solution suggested by Murphy (again, for your choices of λ). If you like, you may adjust the convergence criteria to attain better results (or the same results faster).
  3. Run your best Lasso configuration on the training dataset provided, and select the λ that minimizes the square error on the validation set. Include a table of the parameter values you tried and the validation performance of each, along with a plot of these results. Include also a plot of the prediction functions, just as in the ridge regression section, but this time add the best-performing Lasso prediction function and remove the unregularized least squares fit. Similarly, add the lasso coefficients to the bar charts of coefficients generated in the ridge regression setting. Comment on the results, with particular attention to parameter sparsity and how the ridge and lasso solutions compare. What's the best model you found, and what's its validation performance?

  4. Implement the homotopy method described above. Compute the Lasso solution for (at least) the regularization parameters in the specified set. Plot the results (average validation loss vs. λ).
  5. [Optional] Note that the data in Figure 1 is almost entirely nonnegative. Since we don't have an unregularized bias term, we have to "pay for" this offset using our penalized parameters. Note also that $\lambda_{\max}$ would decrease significantly if the y values were centered at 0 (using the training data, of course), or if we included an unregularized bias term. Experiment with one or both of these approaches, for both lasso and ridge regression, and report your findings.

3.2           [Optional] Deriving the Coordinate Minimizer for Lasso

This problem is to derive the expressions for the coordinate minimizers used in the Shooting algorithm. This is often derived using subgradients (slide 15), but here we will take a bare hands approach (which is essentially equivalent).

In each step of the shooting algorithm, we would like to find the $w_j$ minimizing

$$f(w_{j})=\sum_{i=1}^{n}\Big(w_{j}x_{ij}+\sum_{k\neq j}w_{k}x_{ik}-y_{i}\Big)^{2}+\lambda|w_{j}|+\lambda\sum_{k\neq j}|w_{k}|,$$

where we've written $x_{ij}$ for the jth entry of the vector $x_i$. This function is convex in $w_j$, and the only thing keeping f from being differentiable is the term with $|w_j|$; so f is differentiable everywhere except $w_j = 0$. We'll break this problem into three cases: $w_j > 0$, $w_j < 0$, and $w_j = 0$. In the first two cases, we can simply differentiate f w.r.t. $w_j$ to get optimality conditions. For the last case, we'll use the fact that since $f:\mathbf{R}\to\mathbf{R}$ is convex, 0 is a minimizer of f iff

$$\lim_{\varepsilon\downarrow 0}\frac{f(0+\varepsilon)-f(0)}{\varepsilon}\ge 0\quad\text{and}\quad\lim_{\varepsilon\downarrow 0}\frac{f(0-\varepsilon)-f(0)}{\varepsilon}\ge 0.$$

This is a special case of the optimality conditions described in slide 6 here, where now the “direction” v is simply taken to be the scalars 1 and −1, respectively.

  1. First let’s get a trivial case out of the way. If xij = 0 for i = 1,…,n, what is the coordinate minimizer wj? In the remaining questions below, you may assume that.
  2. Give an expression for the derivative $f'(w_{j})$ for $w_j \neq 0$. It will be convenient to write your expression in terms of the following definitions:

$$\operatorname{sign}(w_{j}):=\begin{cases}1 & w_{j}>0\\[2pt] 0 & w_{j}=0\\[2pt] -1 & w_{j}<0\end{cases}\qquad a_{j}:=2\sum_{i=1}^{n}x_{ij}^{2}\qquad c_{j}:=2\sum_{i=1}^{n}x_{ij}\Big(y_{i}-\sum_{k\neq j}w_{k}x_{ik}\Big).$$

  3. If $w_j > 0$ and minimizes f, show that $w_{j}=\frac{1}{a_{j}}\left(c_{j}-\lambda\right)$. Similarly, if $w_j < 0$ and minimizes f, show that $w_{j}=\frac{1}{a_{j}}\left(c_{j}+\lambda\right)$. Give conditions on $c_j$ that imply a minimizer $w_j$ is positive, and conditions that imply a minimizer $w_j$ is negative.
  4. Derive expressions for the two one-sided derivatives of f at 0, and show that $c_{j}\in[-\lambda,\lambda]$ implies that $w_j = 0$ is a minimizer.
  5. Putting together the preceding results, we conclude that

$$w_{j}=\begin{cases}\frac{1}{a_{j}}\left(c_{j}+\lambda\right) & c_{j}<-\lambda\\[2pt] 0 & c_{j}\in[-\lambda,\lambda]\\[2pt] \frac{1}{a_{j}}\left(c_{j}-\lambda\right) & c_{j}>\lambda\end{cases}$$

Show that this is equivalent to the soft-thresholding expression $w_{j}=\operatorname{soft}\!\left(c_{j}/a_{j},\,\lambda/a_{j}\right)$ used in the shooting algorithm of Section 3.

4           Lasso Properties

4.1           Deriving $\lambda_{\max}$

In this problem we will derive an expression for $\lambda_{\max}$. For the first three parts, use the Lasso objective function excluding the bias term, i.e. $J(w)=\|Xw-y\|_{2}^{2}+\lambda\|w\|_{1}$. We will show that for any $\lambda\ge 2\|X^{T}y\|_{\infty}$, the estimated weight vector $\hat{w}$ is entirely zero, where $\|\cdot\|_{\infty}$ is the infinity norm (or supremum norm), the maximum absolute value of any component of the vector.

  1. The one-sided directional derivative of f at x in the direction v is defined as

$$f'(x;v)=\lim_{h\downarrow 0}\frac{f(x+hv)-f(x)}{h}.$$

Compute $J'(0;v)$. That is, compute the one-sided directional derivative of $J(w)$ at $w=0$ in the direction v. [Hint: the result should be in terms of X, y, λ, and v.]

  2. Since the Lasso objective is convex, w is a minimizer of J(w) if and only if the directional derivative $J'(w;v)\ge 0$ for all $v\neq 0$. Show that for any $v\neq 0$, we have $J'(0;v)\ge 0$ if and only if $\lambda\ge C$, for some C that depends on X, y, and v. You should have an explicit expression for C.
  3. The previous problem gives a different lower bound on λ for each choice of v. Show that the maximum of these lower bounds is $\lambda_{\max}=2\|X^{T}y\|_{\infty}$. Conclude that $w=0$ is a minimizer of J(w) if and only if $\lambda\ge 2\|X^{T}y\|_{\infty}$.
  4. [Optional] Let $J(w,b)=\|Xw+b\mathbf{1}-y\|_{2}^{2}+\lambda\|w\|_{1}$, where $\mathbf{1}\in\mathbf{R}^{n}$ is a column vector of 1's, and let $\bar{y}$ be the mean of the values in the vector y. Show that $(w,b)=(0,\bar{y})$ is a minimizer of J(w,b) if and only if $\lambda\ge\lambda_{\max}=2\|X^{T}(y-\bar{y})\|_{\infty}$.

4.2           Feature Correlation

In this problem, we will examine and compare the behavior of Lasso and ridge regression in the case of an exactly repeated feature. That is, consider the design matrix $X\in\mathbf{R}^{m\times d}$, where $X_{\cdot i}=X_{\cdot j}$ for some i and j, and $X_{\cdot i}$ denotes the ith column of X. We will see that ridge regression divides the weight equally among identical features, while Lasso divides the weight arbitrarily. In an optional part to this problem, we will consider what changes when $X_{\cdot i}$ and $X_{\cdot j}$ are highly correlated (e.g. exactly the same except for some small random noise) rather than exactly the same.

  1. Without loss of generality, assume the first two columns of X are our repeated features. Partition X and θ as

$$X=\begin{pmatrix}x_{1} & x_{2} & X_{r}\end{pmatrix},\qquad \theta=\begin{pmatrix}\theta_{1}\\ \theta_{2}\\ \theta_{r}\end{pmatrix},$$

where $x_{1}=x_{2}\in\mathbf{R}^{m}$ are the repeated columns. We can write the Lasso objective function as

$$J(\theta)=\|X\theta-y\|_{2}^{2}+\lambda\|\theta\|_{1}=\|x_{1}\theta_{1}+x_{2}\theta_{2}+X_{r}\theta_{r}-y\|_{2}^{2}+\lambda|\theta_{1}|+\lambda|\theta_{2}|+\lambda\|\theta_{r}\|_{1}.$$

With repeated features, there will be multiple minimizers of J(θ). Suppose that

$$\hat{\theta}=\begin{pmatrix}a\\ b\\ \theta_{r}\end{pmatrix}$$

is a minimizer of J(θ). Give conditions on c and d such that $(c,\,d,\,\theta_{r}^{T})^{T}$ is also a minimizer of J(θ). [Hint: First show that a and b must have the same sign, or at least one of them must be zero. Then, using this result, rewrite the optimization problem to derive a relation between a and b.]

  2. Using the same notation as the previous problem, suppose

$$\hat{\theta}=\begin{pmatrix}a\\ b\\ \theta_{r}\end{pmatrix}$$

minimizes the ridge regression objective function. What is the relationship between a and b, and why?

  3. [Optional] What do you think would happen with Lasso and ridge regression when $X_{\cdot i}$ and $X_{\cdot j}$ are highly correlated, but not exactly the same? You may investigate this experimentally or theoretically.

5           [Optional] The Ellipsoids in the $\ell_1/\ell_2$ regularization picture

Recall the famous picture purporting to explain why $\ell_1$ regularization leads to sparsity, while $\ell_2$ regularization does not. Here's the instance from Hastie et al.'s The Elements of Statistical Learning:

(While Hastie et al. use β for the parameters, we’ll continue to use w.)

In this problem we'll show that the level sets of the empirical risk are indeed ellipsoids centered at the empirical risk minimizer $\hat{w}$.

Consider linear prediction functions of the form $x\mapsto w^{T}x$. Then the empirical risk of $f(x)=w^{T}x$ under the square loss is

$$\hat{R}_{n}(w)=\frac{1}{n}\sum_{i=1}^{n}\left(w^{T}x_{i}-y_{i}\right)^{2}=\frac{1}{n}\left(Xw-y\right)^{T}\left(Xw-y\right).$$

  1. [Optional] Let $\hat{w}=(X^{T}X)^{-1}X^{T}y$ (assume $X^{T}X$ is invertible). Show that $\hat{w}$ has empirical risk given by

$$\hat{R}_{n}(\hat{w})=\frac{1}{n}\left(y^{T}y-y^{T}X\hat{w}\right).$$

  2. [Optional] Show that for any w we have

$$\hat{R}_{n}(w)=\frac{1}{n}\left(w-\hat{w}\right)^{T}X^{T}X\left(w-\hat{w}\right)+\hat{R}_{n}(\hat{w}).$$

Note that the RHS (i.e. "right-hand side") has one term that's quadratic in w and one term that's independent of w; in particular, the RHS does not have any term that's linear in w. On the LHS (i.e. "left-hand side"), we have $\hat{R}_{n}(w)=\frac{1}{n}(Xw-y)^{T}(Xw-y)$. After expanding this out, you'll have terms that are quadratic, linear, and constant in w. Completing the square is the tool for rearranging an expression to get rid of the linear terms. The following "completing the square" identity is easy to verify just by multiplying out the expressions on the RHS:

$$x^{T}Mx-2b^{T}x=\left(x-M^{-1}b\right)^{T}M\left(x-M^{-1}b\right)-b^{T}M^{-1}b,$$

for symmetric invertible M.

  3. [Optional] Using the expression derived for $\hat{R}_{n}(w)$ in part 2, give a very short proof that $\hat{w}=(X^{T}X)^{-1}X^{T}y$ is the empirical risk minimizer. That is:

$$\hat{w}=\operatorname*{argmin}_{w}\hat{R}_{n}(w).$$

Hint: Note that $X^{T}X$ is positive semidefinite and, by definition, a symmetric matrix M is positive semidefinite iff $x^{T}Mx\ge 0$ for all $x\in\mathbf{R}^{d}$.

  4. [Optional] Give an expression for the set of w for which the empirical risk exceeds the minimum empirical risk $\hat{R}_{n}(\hat{w})$ by an amount $c > 0$. If X has full rank, then $X^{T}X$ is positive definite and this set is an ellipsoid – what is its center?

6           [Optional] Projected SGD via Variable Splitting

In this question, we consider another general technique that can be applied to the Lasso problem. We first use the variable-splitting method to transform the Lasso problem into a differentiable problem with linear inequality constraints, and then we apply a variant of SGD.

Representing the unknown vector θ as the difference of two nonnegative vectors $\theta^{+}$ and $\theta^{-}$, the $\ell_1$-norm of θ is given by $\sum_{i=1}^{d}\theta_{i}^{+}+\sum_{i=1}^{d}\theta_{i}^{-}$. Thus, the optimization problem can be written as

$$(\hat{\theta}^{+},\hat{\theta}^{-})=\operatorname*{argmin}_{\theta^{+},\theta^{-}\in\mathbf{R}^{d}}\sum_{i=1}^{m}\left(h_{\theta^{+},\theta^{-}}(x_{i})-y_{i}\right)^{2}+\lambda\sum_{i=1}^{d}\theta_{i}^{+}+\lambda\sum_{i=1}^{d}\theta_{i}^{-}\quad\text{such that }\theta^{+}\ge 0\text{ and }\theta^{-}\ge 0,$$

where $h_{\theta^{+},\theta^{-}}(x)=(\theta^{+}-\theta^{-})^{T}x$. The original parameter θ can then be estimated as $\hat{\theta}=\hat{\theta}^{+}-\hat{\theta}^{-}$.

This is a convex optimization problem with a differentiable objective and linear inequality constraints. We can approach this problem using projected stochastic gradient descent, as discussed in lecture: after taking each stochastic gradient step, we project the result back into the feasible set by setting any negative components of $\theta^{+}$ and $\theta^{-}$ to zero. A minimal sketch follows.
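Here is one way the projected SGD step could look. This sketch makes several assumptions not fixed by the assignment: the step size, the number of epochs, and applying the full λ penalty at every stochastic step (one could instead scale it by 1/m).

```python
import numpy as np

def projected_sgd_lasso(X, y, lam, step=1e-3, epochs=100, seed=0):
    """Projected SGD for the variable-splitting formulation of the lasso."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta_pos = np.zeros(d)
    theta_neg = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            residual = (theta_pos - theta_neg) @ X[i] - y[i]
            grad = 2.0 * residual * X[i]       # gradient of the i-th square loss
            theta_pos -= step * (grad + lam)   # +lam from the penalty on theta_pos
            theta_neg -= step * (-grad + lam)  # -grad since d/d(theta_neg) = -x
            # Projection: clip negative components back into the feasible set.
            np.maximum(theta_pos, 0.0, out=theta_pos)
            np.maximum(theta_neg, 0.0, out=theta_neg)
    return theta_pos - theta_neg  # estimate of the original theta
```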

  1. [Optional] Implement projected SGD to solve the above optimization problem for the same λ’s as used with the shooting algorithm. Since the two optimization algorithms should find essentially the same solutions, you can check the algorithms against each other. Report the differences in validation loss for each λ between the two optimization methods. (You can make a table or plot the differences.)
  2. [Optional] Choose the λ that gives the best performance on the validation set. Describe the solution $\hat{w}$ in terms of its sparsity. How does the sparsity compare to the solution from the shooting algorithm?