FMAN45 Assignment 1: Penalized regression via the LASSO


1           Penalized regression via the LASSO

In linear regression, the purpose is to estimate the explanatory variable, or weight vector, w ∈ R^M, given a linear relationship to the response variable (data) t ∈ R^N, defined by the regression matrix X ∈ R^{N×M}, i.e., the hypothesis space. In cases where, for example, the regression matrix is underdetermined and the model tends to overfit the data, or when a certain type of solution is required, a penalty term is added to the regression problem. In this assignment, you shall work with the so-called Least Absolute Shrinkage and Selection Operator (LASSO), a type of penalized regression used when the sought explanatory variable is sparse, i.e., has few non-zero coordinates. The LASSO solves the minimization problem

    minimize_w   (1/2) ||t − Xw||_2^2 + λ ||w||_1                                                  (1)

where λ ≥ 0 is a capacity hyperparameter. For a general regression matrix, there exists no closed-form solution to (1). Instead, consider a coordinate-wise approach, iteratively solving a sequence of optimization problems, wherein only one coordinate of w is optimized at a time, keeping the others fixed at their most recent values. Thus, for the i:th coordinate w_i, the objective is to

    minimize_{w_i}   (1/2) ||r_i − x_i w_i||_2^2 + λ |w_i|                                          (2)

where x_i and r_i = t − Σ_{ℓ≠i} x_ℓ w_ℓ denote the i:th column of the regression matrix and the residual vector without the effect of the i:th regressor, respectively. Then, at iteration j, the coordinate descent minimizer for w_i has the closed-form solution

    ŵ_i^(j) = ( x_i^T r_i^(j−1) − λ sign(x_i^T r_i^(j−1)) ) / (x_i^T x_i),   if |x_i^T r_i^(j−1)| > λ,
    ŵ_i^(j) = 0,                                                             otherwise,             (3)

where

    r_i^(j−1) = t − Σ_{ℓ<i} x_ℓ ŵ_ℓ^(j) − Σ_{ℓ>i} x_ℓ ŵ_ℓ^(j−1)                                     (4)
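
As a plain MATLAB illustration of (3), the update for a single coordinate can be written as a scalar soft-thresholding operation. The helper name soft_threshold below is not part of the provided code; it is only meant to make the two cases in (3) concrete.

    % Soft-thresholding update for one coordinate, cf. (3):
    % rho = x_i'*r_i, xtx = x_i'*x_i.
    soft_threshold = @(rho, xtx, lambda) (abs(rho) > lambda) .* (rho - lambda*sign(rho)) ./ xtx;
    % Example: soft_threshold(2.5, 1, 1) returns 1.5, while soft_threshold(0.8, 1, 1) returns 0.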

Task 1 (20 p): Verify the first line of (3) by solving (2) for w_i ≠ 0. You do not need to show the condition for which it holds, i.e., |x_i^T r_i^(j−1)| > λ.

Hints: When solving for w_i after differentiating, you may want to eliminate w_i and solve for |w_i| first. Also, the absolute value |w_i| is not differentiable at w_i = 0. However, for w_i ≠ 0, its derivative is the sign of w_i, i.e.,

    d|w_i| / dw_i = sign(w_i),   w_i ≠ 0                                                           (5)

Now consider the simplified case where the regression matrix is an orthonormal basis, i.e., M = N and X^T X = I_N, where I_N is the N × N identity matrix.

Task 2 (10 p): Given the orthogonal regression matrix, show that the coordinate descent solver in (3) will converge in at most one full pass over the coordinates in w, i.e., show that ŵ_i^(2) − ŵ_i^(1) = 0, ∀i.

Hint: Starting from (3), you may use (4) to show that ŵ_i^(j), for an orthogonal regression matrix, does not depend on previous estimates ŵ, but only on t, x_i, and λ.

When w is estimated from data using the LASSO, a bias will be induced into the estimate, i.e., E(ŵ − w*) ≠ 0. Assume that the data t is truly a noisy (stochastic) example from the hypothesis space defined by the regression matrix, that is

    t = Xw* + e,   e ∼ N(0_N, σI_N)                                                                (6)

where w* denotes the specific w defining the model used to generate the data, and where N(·), 0_N, and σ denote the Gaussian distribution, a length-N column vector of zeroes, and the noise standard deviation, respectively.

Task 3 (10 p): Show that the LASSO estimate’s bias, for an orthogonal regression matrix and data generated by (6), is given by (7) when σ → 0, and discuss how this result relates to the method’s acronym, LASSO.

    E( ŵ_i − w_i* ) = −λ sign(w_i*),   for |w_i*| > λ                                              (7)

Hint: Starting from (3), you may consider the first line as two separate cases, x_i^T r_i^(j−1) > λ and x_i^T r_i^(j−1) < −λ, respectively.

2           Hyperparameter-learning via K-fold cross-validation

In this section, you shall consider LASSO regression for the noisy data t in A1_data.mat, consisting of N = 50 data points generated by

    t(n) = f(n) + e(n),                                                                            (8)
    f(n) = Re( a_1 e^{i2πn/20} + a_2 e^{i2πn/5} ),                                                 (9)
    e(n) ∼ N(0, σ²),                                                                              (10)

for n = 0,…,N − 1 and complex amplitudes a_1, a_2, i.e., the real part of a sum of two complex-valued sinusoids with frequencies 1/20 and 1/5 revolutions per sample, respectively, plus an additive Gaussian noise with standard deviation σ. The regression matrix you'll be using is given by X in A1_data.mat, which consists of 500 candidate sine and cosine pairs, i.e.,

    X = [ X_1  X_2  ⋯  X_500 ],                                                                   (11)
    X_i = [ u_i  v_i ],                                                                           (12)
    u_i = [ cos(2π f_i·0), cos(2π f_i·1), …, cos(2π f_i(N−1)) ]^T,                                (13)
    v_i = [ sin(2π f_i·0), sin(2π f_i·1), …, sin(2π f_i(N−1)) ]^T,                                (14)

and where the f_i are the 500 frequency candidates, equally spaced on the interval [0.02, 0.48].

Task 4 (20 p): Implement a (cyclic) coordinate descent solver for the coordinate-wise LASSO solution in (3), using the data t and regression matrix X in A1_data.mat. Do not use Xinterp for estimation; it is provided for visualization only. Then solve the following two subtasks and discuss your results.

  1. Produce reconstruction plots with the following formatting (a minimal plotting sketch is given after this list): Plot the original N = 50 data points with disconnected (and clearly visible) markers (no connecting lines) vs. the time axis given by n. Overlay these with the N = 50 reconstructed data points, i.e., y = Xŵ(λ), also using disconnected markers. In addition, add a solid line without markers for an interpolated reconstruction of the data vs. an interpolated time axis. The interpolation can be calculated using a highly sampled regression matrix, given as Xinterp. The interpolated time axis is given by ninterp. Specify legends for all three lines, label the x-axis, and adjust line widths, font sizes, and so on to make the plot clearly interpretable. Make three variants of this plot, one for each of the following values of the hyperparameter:
    • λ = 0.1.
    • λ = 10.
    • Try other values of λ to see the effect of the hyperparameter, and let λ = λuser be the one among these you find most suitable, and produce a reconstruction plot for it.
  2. Count the number of non-zero coordinates for the values of the hyperparameter used above, i.e., λ ∈ {0.1,10,λuser}, and compare these to the actual number of nonzero coordinates needed to model the data given the true frequencies, which amounts to two pairs, or four coordinates.
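
A minimal MATLAB plotting sketch for one reconstruction plot is given below. It only illustrates the requested formatting; the variables t, n, X, Xinterp, and ninterp are from A1_data.mat, and what denotes a LASSO estimate ŵ(λ) obtained in this task.

    figure; hold on;
    plot(n, t, 'o', 'LineWidth', 1.5);                     % original N = 50 data points, markers only
    plot(n, X*what, '*', 'LineWidth', 1.5);                % reconstructed data points, y = X*what
    plot(ninterp, Xinterp*what, '-', 'LineWidth', 1.5);    % interpolated reconstruction, solid line
    legend('data', 'reconstruction', 'interpolated reconstruction');
    xlabel('time index n');
    set(gca, 'FontSize', 14);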

Hint: To simplify the implementation, you may fill in and use the function sketch lasso_ccd().
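
For reference, a minimal sketch of a cyclic coordinate descent solver implementing (3)-(4) is shown below. Its structure (zero initialization, a fixed number of full passes, residual bookkeeping) is illustrative and need not match the provided lasso_ccd() sketch; a convergence tolerance can be used instead of a fixed iteration count.

    function what = lasso_ccd(t, X, lambda)
        [~, M] = size(X);
        what = zeros(M, 1);              % start from the all-zero estimate
        r = t - X*what;                  % full residual
        for iter = 1:50                  % fixed number of full passes over the coordinates
            for i = 1:M
                xi = X(:, i);
                ri = r + xi*what(i);     % residual without the i:th regressor, cf. (4)
                rho = xi' * ri;
                if abs(rho) > lambda     % soft thresholding, cf. (3)
                    what(i) = (rho - lambda*sign(rho)) / (xi'*xi);
                else
                    what(i) = 0;
                end
                r = ri - xi*what(i);     % update the full residual
            end
        end
    end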

When λ → 0, the optimization problem (1) reduces to an ordinary least squares problem, which, if underdetermined, i.e., M ≥ N, becomes ill-posed and/or tends to overfit the data. The LASSO's penalty term promotes sparsity in w by setting coordinates with small magnitude to zero, thereby reducing the degrees of freedom of the problem. The choice of the hyperparameter λ thus becomes critical; if set too low, too many coordinates become non-zero, although small, overfitting the data, and if set too high, too many coordinates are set to zero, underfitting the data. Let λ be a grid of J potential values of the hyperparameter, i.e.,

    λ = [ λ_1, λ_2, …, λ_J ],   λ_1 < λ_2 < ⋯ < λ_J,                                              (15)

where typically λ_1 is selected very small, and λ_J is selected as λ_max, i.e.,

    λ_max = max_i | x_i^T t |                                                                     (16)
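
In MATLAB, this largest candidate can be computed in one line; it is the smallest λ for which the LASSO estimate in (1) becomes the all-zero vector, so larger values are not useful candidates:

    lambda_max = max(abs(X' * t));   % cf. (16)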

To select λ optimally among these candidates, one may for instance use a K-fold cross-validation scheme. In this approach, the training data is randomly split into K non-overlapping and equally sized folds (i.e., subsets), where K−1 folds are used for estimation and the remaining fold is used for validation. More specifically, let the set I_k ⊂ {1,…,N} denote the N_val = |I_k| data indices selected for validation in the k:th fold. Thus, the complement of the validation index set, I_k^c ⊂ {1,…,N}, denotes the set of N_est = |I_k^c| = N − N_val data indices used for estimation, such that, by definition, I_k ∩ I_k^c = ∅. This partitioning is repeated for each potential value of λ, cycling through the K non-overlapping folds, using each in turn for validation while using the rest for estimation. To measure the performance of the estimated model when used to predict the validation data, the root mean squared error on validation (RMSE_val) term

    RMSE_val(λ_j) = sqrt( (1/(K·N_val)) Σ_{k=1}^{K} || t(I_k) − X(I_k) ŵ^[k](λ_j) ||_2^2 )         (17)

is used, where ŵ^[k](λ_j) denotes the weight estimate (herein the LASSO estimate) for the k:th fold using hyperparameter λ_j. Note that t(I_k) and X(I_k) refer to the elements of t and the (concatenation of) rows of X indexed by I_k, respectively. The hyperparameter is then selected as

    λ̂ = λ_p,       p = argmin_j RMSE_val(λ_j)                                                      (18)

The following algorithm outlines a suggested implementation of a K-fold cross-validation scheme for the LASSO estimator:

Algorithm 1: K-fold cross-validation for the LASSO
  Select a grid of hyperparameters λ_j ≥ 0, ∀j, and K > 1.
  Initialize SE_val(k, λ_j) = 0, ∀k, j.
  Randomize validation indices I_k, ∀k.
  for k = 1, …, K do
    for j = 1, …, J do
      Calculate the LASSO estimate ŵ^[k](λ_j) on the k:th fold's estimation data.
      Set SE_val(k, λ_j) = || t(I_k) − X(I_k) ŵ^[k](λ_j) ||_2^2.
    end for
  end for
  Calculate RMSE_val(λ_j) = sqrt( (1/(K·N_val)) Σ_k SE_val(k, λ_j) ), ∀j.
  Select λ̂ as the λ_j which minimizes RMSE_val.

To display the generalization gap, the corresponding root mean squared error on estimation data (RMSEest) becomes

    RMSE_est(λ_j) = sqrt( (1/(K·N_est)) Σ_{k=1}^{K} || t(I_k^c) − X(I_k^c) ŵ^[k](λ_j) ||_2^2 )      (19)

Task 5 (20 p): Implement a K-fold cross-validation scheme for the LASSO solver implemented in Task 4. Illustrate your results using the following plots (and make detailed comments for them):

  • A plot illustrating RMSE_val(λ_j) vs. λ_j, ∀j, using a solid line with markers. Overlaid in the same plot, also plot the corresponding RMSE_est(λ_j) using a solid line with different markers. Also add a dashed vertical line at the location of the selected λ̂. Specify legends for all three lines, label the x-axis, and adjust line widths, font sizes, and so on to make the plot clearly interpretable.
  • A reconstruction plot for the selected λˆ. Use the same layout and formatting as described for the reconstruction plots in Task 4.

Hint: For simpler implementation, you may fill in and use the function sketch lasso_cv(). Also, for improved visualization and efficiency, select your grid of candidate λ values to be evenly spaced on a log scale, e.g., lambda_grid = exp( linspace( log(lambda_min), log(lambda_max), N_lambda)).
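
A minimal sketch of such a cross-validation routine, wrapped around the lasso_ccd() solver from Task 4 and following Algorithm 1, is shown below. The fold handling (randperm, equal fold sizes, refitting on all data at the end) is illustrative and assumes N is divisible by K; the provided lasso_cv() sketch may be organized differently.

    function [wopt, lambdaopt, RMSEval, RMSEest] = lasso_cv(t, X, lambdavec, K)
        N = length(t);
        J = length(lambdavec);
        SEval = zeros(K, J);
        SEest = zeros(K, J);
        idx  = randperm(N);                          % randomized fold assignment
        Nval = N/K;                                  % validation samples per fold
        for k = 1:K
            Ik  = idx((k-1)*Nval+1 : k*Nval);        % validation indices, I_k
            Ikc = setdiff(1:N, Ik);                  % estimation indices, I_k^c
            for j = 1:J
                w = lasso_ccd(t(Ikc), X(Ikc,:), lambdavec(j));
                SEval(k,j) = norm(t(Ik)  - X(Ik,:)*w)^2;
                SEest(k,j) = norm(t(Ikc) - X(Ikc,:)*w)^2;
            end
        end
        RMSEval = sqrt(mean(SEval, 1) / Nval);       % cf. (17)
        RMSEest = sqrt(mean(SEest, 1) / (N - Nval)); % cf. (19)
        [~, p] = min(RMSEval);
        lambdaopt = lambdavec(p);                    % cf. (18)
        wopt = lasso_ccd(t, X, lambdaopt);           % refit on all data with the selected lambda
    end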

3           Denoising of an audio excerpt

In general, the sounds of importance that we perceive in our daily lives are narrowband, meaning that for any short (< 100 ms) sequence, the spectral representation is sparse, i.e., it may be modeled using only a small number of pure frequencies. This is typically true for, e.g., speech and music. As a contrast, sounds of small importance, or even nuisance sounds, are often broadband, such as, e.g., noise and interference, containing many or most frequencies across the spectrum.

In this section, you shall attempt to perform denoising of a noisy recording of piano music using the LASSO and a regression matrix of sine and cosine pairs. The idea is thus that, using a hypothesis space with frequency functions, and by virtue of the sparse estimates obtained using the LASSO, only the strong narrowband components in the music will be modeled, while the broadband noise is cancelled.

The data archive A1_data.mat contains an excerpt of piano music, 5 seconds in total, divided into two data sets, Ttrain and Ttest, where the former is for estimation and validation, while the latter is used only for testing (not validation). You can listen to the data sets using the Matlab function soundsc(Ttrain,fs), where fs is the sampling rate, also provided in the data archive. The training data is much too large (and non-stationary) to be modeled in one regression problem. Instead, the training data is to be divided into frames 40 ms long. For each frame, a (possibly unique) LASSO solution may be obtained, which, if used to reconstruct the data, can be listened to. The same regression matrix, Xaudio, is used for all frames. Do not use the regression matrix X from the previous section.

Task 6 (10 p): Implement a K-fold cross-validation scheme for the multi-frame audio excerpt Ttrain, similar to Task 5, but modified in order to find one optimal λˆ for all frames. Illustrate the results with a plot of RMSEval, RMSEest, and λˆ, using the exact same formatting as in the corresponding plot in Task 5, and discuss your findings.

Hint: The provided function sketch multiframe_lasso_cv() outlines the algorithm, which may be filled in and used. For the data provided, complete training may take 15 minutes or longer, depending on the implementation.
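
As a rough MATLAB sketch of the multi-frame idea (one λ̂ for all frames), the squared validation errors can be accumulated over both folds and frames before the minimizing λ is selected. All variable names below are illustrative, lasso_ccd() is the solver from Task 4, and the provided multiframe_lasso_cv() sketch may be structured differently.

    framelen = size(Xaudio, 1);                       % samples per frame (about 40 ms)
    nframes  = floor(length(Ttrain) / framelen);
    J = length(lambdavec);
    SEval = zeros(K, J);                              % squared validation errors, summed over frames
    for f = 1:nframes
        tf = Ttrain((f-1)*framelen+1 : f*framelen);   % the f:th frame of training data
        idx  = randperm(framelen);
        Nval = floor(framelen/K);
        for k = 1:K
            Ik  = idx((k-1)*Nval+1 : k*Nval);
            Ikc = setdiff(1:framelen, Ik);
            for j = 1:J
                w = lasso_ccd(tf(Ikc), Xaudio(Ikc,:), lambdavec(j));
                SEval(k,j) = SEval(k,j) + norm(tf(Ik) - Xaudio(Ik,:)*w)^2;
            end
        end
    end
    RMSEval = sqrt(sum(SEval, 1) / (K*Nval*nframes)); % averaged over folds and frames
    [~, p]  = min(RMSEval);
    lambdahat = lambdavec(p);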

Task 7 (10 p): Using the λ̂ you trained in Task 6, use the provided function lasso_denoise() to denoise the test data Ttest. Save your results as save('denoised_audio','Ytest','fs') and include the file in your submission. In addition, listen to your output and discuss your results. Try whether another choice of λ achieves better noise-cancelling results and detail your findings in the report.
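
A minimal usage sketch of this final step, with lambdahat denoting the hyperparameter selected in Task 6 and the remaining names taken from Appendix A:

    Ytest = lasso_denoise(Ttest, Xaudio, lambdahat);   % denoise the test data
    save('denoised_audio', 'Ytest', 'fs');             % file to include in the submission
    soundsc(Ytest, fs);                                % listen to the denoised result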

A            Provided data and functions

Contents of data archive A1_data.mat used in sections 2 and 3:

t         Single realization of the data t ∈ R^50 in (8)
X         Regression matrix for Tasks 4 and 5, X ∈ R^{50×1000}, in (11)
Xinterp   Like X but with an interpolated time axis, X ∈ R^{1000×1000}
n         Time axis, n ∈ Z^50, n ∈ [0, N−1]
ninterp   Interpolated time axis, n ∈ R^1000, n ∈ [0, N−1]
Ttrain    Mono audio data for estimation and validation, T ∈ R^19404
Ttest     Mono audio data for testing, T ∈ R^24697
Xaudio    Regression matrix for Tasks 6 and 7, X ∈ R^{364×2000}, in (11)
fs        Sampling frequency, fs = 8820 Hz

Function sketches (partly containing pseudocode) provided:

  • what = lasso_ccd(t,X,lambda):

Returns LASSO estimate what (i.e., wˆ(λ)) given an input of data t, regression matrix X, and hyperparameter lambda.

  • [wopt,lambdaopt,RMSEval,RMSEest] = lasso_cv(t,X,lambdavec,K):

Returns wopt, the LASSO estimate for the optimal λ̂; lambdaopt, the optimal λ; and RMSEval and RMSEest, the RMSE on validation and estimation data, respectively. Takes data t, regression matrix X, a grid of candidate hyperparameters lambdavec, and the number of folds K as inputs.

  • [Wopt, lambdahat, RMSEval, RMSEest] = multiframe_lasso_cv(T,X,lambdavec,K)

Returns Wopt, an M by N_frames matrix of LASSO estimates, one for each size-N frame of data in T, calculated for the optimal λ̂; lambdahat, the optimal λ; and RMSEval and RMSEest, the RMSE on validation and estimation data, respectively. Takes multi-frame data T, regression matrix X, a grid of candidate hyperparameters lambdavec, and the number of folds K as inputs.

  • Yclean = lasso_denoise(T,X,lambdaopt):

Returns the denoised audio vector Yclean given an input of noisy audio T, regression matrix X, and hyperparameter lambdaopt.

B              Illustration of data in Tasks 4 and 5

Figure 1: Plot of the provided noisy target values (data) t, together with the noise-free data-generating function f(n) in (9).
