FMAN45 Assignment 1: Penalized regression via the LASSO


1           Penalized regression via the LASSO

In linear regression, the purpose is to estimate the explanatory variable, or weight vector, w ∈ R^M, given a linear relationship to the response variable (data) t ∈ R^N, defined by the regression matrix X ∈ R^{N×M}, i.e., the hypothesis space. In cases where, for example, the regression matrix is underdetermined and the model tends to overfit the data, or when a certain type of solution is required, a penalty term is added to the regression problem. In this assignment, you shall work with the so-called Least Absolute Shrinkage and Selection Operator (LASSO), a type of penalized regression used when the sought explanatory variable is sparse, i.e., has few non-zero coordinates. The LASSO solves the minimization problem

    minimize_w   (1/2) ||t − Xw||_2^2 + λ ||w||_1                                                  (1)

where λ ≥ 0 is a capacity hyperparameter. For a general regression matrix, there exists no closed-form solution to (1). Instead, consider a coordinate-wise approach, iteratively solving a sequence of optimization problems, wherein only one coordinate of w is optimized at a time, keeping the others fixed at their most recent values. Thus, for the i:th coordinate w_i, the objective is to

    minimize_{w_i}   (1/2) ||r_i − x_i w_i||_2^2 + λ |w_i|                                          (2)

where x_i and r_i = t − Σ_{ℓ≠i} x_ℓ w_ℓ denote the i:th column of the regression matrix and the residual vector without the effect of the i:th regressor, respectively. Then, at iteration j, the coordinate descent minimizer for w_i has the closed-form solution

    ŵ_i^(j) = ( x_i^T r_i^(j−1) − λ sign(x_i^T r_i^(j−1)) ) / (x_i^T x_i),   if |x_i^T r_i^(j−1)| > λ,
    ŵ_i^(j) = 0,                                                             otherwise,             (3)

where

    r_i^(j−1) = t − Σ_{ℓ<i} x_ℓ ŵ_ℓ^(j) − Σ_{ℓ>i} x_ℓ ŵ_ℓ^(j−1)                                     (4)
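
As a plain MATLAB illustration of (3), the update for a single coordinate can be written as a scalar soft-thresholding operation. The helper name soft_threshold below is not part of the provided code; it is only meant to make the two cases in (3) concrete.

    % Soft-thresholding update for one coordinate, cf. (3):
    % rho = x_i'*r_i, xtx = x_i'*x_i.
    soft_threshold = @(rho, xtx, lambda) (abs(rho) > lambda) .* (rho - lambda*sign(rho)) ./ xtx;
    % Example: soft_threshold(2.5, 1, 1) returns 1.5, while soft_threshold(0.8, 1, 1) returns 0.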

Task 1 (20 p): Verify the first line of (3) by solving (2) for w_i ≠ 0. You do not need to show the condition for which it holds, i.e., |x_i^T r_i^(j−1)| > λ.

Hints: When solving for w_i after differentiating, you may want to eliminate w_i and solve for |w_i| first. Also, the absolute value |w_i| is not differentiable at w_i = 0. However, for w_i ≠ 0, its derivative is the sign of w_i, i.e.,

    d|w_i| / dw_i = sign(w_i),   w_i ≠ 0                                                           (5)

Now consider the simplified case where the regression matrix is an orthonormal basis, i.e., M = N and X^T X = I_N, where I_N is the N × N identity matrix.

Task 2 (10 p): Given the orthogonal regression matrix, show that the coordinate descent solver in (3) will converge in at most one full pass over the coordinates in w, i.e., show that ŵ_i^(2) − ŵ_i^(1) = 0, ∀i.

Hint: Starting from (3), you may use (4) to show that ŵ_i^(j), for an orthogonal regression matrix, does not depend on previous estimates ŵ, but only on t, x_i, and λ.

When w is estimated from data using the LASSO, a bias will be induced into the estimate, i.e., E(ŵ − w*) ≠ 0. Assume that the data t is truly a noisy (stochastic) example from the hypothesis space defined by the regression matrix, that is

    t = Xw* + e,   e ∼ N(0_N, σI_N)                                                                (6)

where w* denotes the specific w defining the model used to generate the data, and where N(·), 0_N, and σ denote the Gaussian distribution, a length-N column vector of zeroes, and the noise standard deviation, respectively.

Task 3 (10 p): Show that the LASSO estimate’s bias, for an orthogonal regression matrix and data generated by (6), is given by (7) when σ → 0, and discuss how this result relates to the method’s acronym, LASSO.

    E( ŵ_i − w_i* ) = −λ sign(w_i*),   for |w_i*| > λ                                              (7)

Hint: Starting from (3), you may consider the first line as two separate cases, x_i^T r_i^(j−1) > λ and x_i^T r_i^(j−1) < −λ, respectively.

2           Hyperparameter-learning via K-fold cross-validation

In this section, you shall consider LASSO regression for the noisy data t in A1_data.mat, consisting of N = 50 data points generated by

    t(n) = f(n) + e(n),                                                                            (8)
    f(n) = Re( a_1 e^{i2πn/20} + a_2 e^{i2πn/5} ),                                                 (9)
    e(n) ∼ N(0, σ²),                                                                              (10)

for n = 0,…,N − 1 and complex amplitudes a_1, a_2, i.e., the real part of a sum of two complex-valued sinusoids with frequencies 1/20 and 1/5 revolutions per sample, respectively, plus an additive Gaussian noise with standard deviation σ. The regression matrix you'll be using is given by X in A1_data.mat, which consists of 500 candidate sine and cosine pairs, i.e.,

    X = [ X_1  X_2  ⋯  X_500 ],                                                                   (11)
    X_i = [ u_i  v_i ],                                                                           (12)
    u_i = [ cos(2π f_i·0), cos(2π f_i·1), …, cos(2π f_i(N−1)) ]^T,                                (13)
    v_i = [ sin(2π f_i·0), sin(2π f_i·1), …, sin(2π f_i(N−1)) ]^T,                                (14)

and where the f_i are the 500 frequency candidates, equally spaced on the interval [0.02, 0.48].

Task 4 (20 p): Implement a (cyclic) coordinate descent solver for the coordinate-wise LASSO solution in (3), using the data t and regression matrix X in A1_data.mat. Do not use Xinterp for estimation; it is provided for visualization only. Then solve the following two subtasks and discuss your results.

  1. Produce reconstruction plots with the following formatting (a minimal plotting sketch is given after this list): Plot the original N = 50 data points with disconnected (and clearly visible) markers (no connecting lines) vs. the time axis given by n. Overlay these with the N = 50 reconstructed data points, i.e., y = Xŵ(λ), also using disconnected markers. In addition, add a solid line without markers for an interpolated reconstruction of the data vs. an interpolated time axis. The interpolation can be calculated using a highly sampled regression matrix, given as Xinterp. The interpolated time axis is given by ninterp. Specify legends for all three lines, label the x-axis, and adjust line widths, font sizes, and so on to make the plot clearly interpretable. Make three variants of this plot, one for each of the following values of the hyperparameter:
    • λ = 0.1.
    • λ = 10.
    • Try other values of λ to see the effect of the hyperparameter, and let λ = λuser be the one among these you find most suitable, and produce a reconstruction plot for it.
  2. Count the number of non-zero coordinates for the values of the hyperparameter used above, i.e., λ ∈ {0.1,10,λuser}, and compare these to the actual number of nonzero coordinates needed to model the data given the true frequencies, which amounts to two pairs, or four coordinates.
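
A minimal MATLAB plotting sketch for one reconstruction plot is given below. It only illustrates the requested formatting; the variables t, n, X, Xinterp, and ninterp are from A1_data.mat, and what denotes a LASSO estimate ŵ(λ) obtained in this task.

    figure; hold on;
    plot(n, t, 'o', 'LineWidth', 1.5);                     % original N = 50 data points, markers only
    plot(n, X*what, '*', 'LineWidth', 1.5);                % reconstructed data points, y = X*what
    plot(ninterp, Xinterp*what, '-', 'LineWidth', 1.5);    % interpolated reconstruction, solid line
    legend('data', 'reconstruction', 'interpolated reconstruction');
    xlabel('time index n');
    set(gca, 'FontSize', 14);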

Hint: To simplify the implementation, you may fill in and use the function sketch lasso_ccd().
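
For reference, a minimal sketch of a cyclic coordinate descent solver implementing (3)-(4) is shown below. Its structure (zero initialization, a fixed number of full passes, residual bookkeeping) is illustrative and need not match the provided lasso_ccd() sketch; a convergence tolerance can be used instead of a fixed iteration count.

    function what = lasso_ccd(t, X, lambda)
        [~, M] = size(X);
        what = zeros(M, 1);              % start from the all-zero estimate
        r = t - X*what;                  % full residual
        for iter = 1:50                  % fixed number of full passes over the coordinates
            for i = 1:M
                xi = X(:, i);
                ri = r + xi*what(i);     % residual without the i:th regressor, cf. (4)
                rho = xi' * ri;
                if abs(rho) > lambda     % soft thresholding, cf. (3)
                    what(i) = (rho - lambda*sign(rho)) / (xi'*xi);
                else
                    what(i) = 0;
                end
                r = ri - xi*what(i);     % update the full residual
            end
        end
    end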

When λ → 0, the optimization problem (1) reduces to an ordinary least squares problem, which, if underdetermined, i.e., M ≥ N, becomes ill-posed and/or tends to overfit the data. The LASSO's penalty term promotes sparsity in w by setting coordinates with small magnitude to zero, thereby reducing the degrees of freedom of the problem. The choice of the hyperparameter λ thus becomes critical; if set too low, too many coordinates become non-zero, although small, overfitting the data, and if set too high, too many coordinates are set to zero, underfitting the data. Let λ be a grid of J potential values of the hyperparameter, i.e.,

    λ = [ λ_1, λ_2, …, λ_J ],   λ_1 < λ_2 < ⋯ < λ_J,                                              (15)

where typically λ_1 is selected very small, and λ_J is selected as λ_max, i.e.,

    λ_max = max_i | x_i^T t |                                                                     (16)
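
In MATLAB, this largest candidate can be computed in one line; it is the smallest λ for which the LASSO estimate in (1) becomes the all-zero vector, so larger values are not useful candidates:

    lambda_max = max(abs(X' * t));   % cf. (16)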

To select λ optimally among these candidates, one may for instance use a K-fold cross-validation scheme. In this approach, the training data is randomly split into K non-overlapping and equally sized folds (i.e., subsets), where K−1 folds are used for estimation and the remaining fold is used for validation. More specifically, let the set I_k ⊂ {1,…,N} denote the N_val = |I_k| data indices selected for validation in the k:th fold. Thus, the complement of the validation index set, I_k^c ⊂ {1,…,N}, denotes the set of N_est = |I_k^c| = N − N_val data indices used for estimation, such that, by definition, I_k ∩ I_k^c = ∅. This partitioning is repeated for each potential value of λ, cycling through the K non-overlapping folds, using each in turn for validation while using the rest for estimation. To measure the performance of the estimated model when used to predict the validation data, the root mean squared error on validation (RMSE_val) term

    RMSE_val(λ_j) = sqrt( (1/(K·N_val)) Σ_{k=1}^{K} || t(I_k) − X(I_k) ŵ^[k](λ_j) ||_2^2 )         (17)

is used, where ŵ^[k](λ_j) denotes the weight estimate (herein the LASSO estimate) for the k:th fold using hyperparameter λ_j. Note that t(I_k) and X(I_k) refer to the elements of t and the (concatenation of) rows of X indexed by I_k, respectively. The hyperparameter is then selected as

    λ̂ = λ_p,       p = argmin_j RMSE_val(λ_j)                                                      (18)

The following algorithm outlines a suggested implementation of a K-fold cross-validation scheme for the LASSO estimator:

Algorithm 1: K-fold cross-validation for the LASSO
  Select a grid of hyperparameters λ_j ≥ 0, ∀j, and K > 1.
  Initialize SE_val(k, λ_j) = 0, ∀k, j.
  Randomize validation indices I_k, ∀k.
  for k = 1, …, K do
    for j = 1, …, J do
      Calculate the LASSO estimate ŵ^[k](λ_j) on the k:th fold's estimation data.
      Set SE_val(k, λ_j) = || t(I_k) − X(I_k) ŵ^[k](λ_j) ||_2^2.
    end for
  end for
  Calculate RMSE_val(λ_j) = sqrt( (1/(K·N_val)) Σ_k SE_val(k, λ_j) ), ∀j.
  Select λ̂ as the λ_j which minimizes RMSE_val.

To display the generalization gap, the corresponding root mean squared error on estimation data (RMSEest) becomes

    RMSE_est(λ_j) = sqrt( (1/(K·N_est)) Σ_{k=1}^{K} || t(I_k^c) − X(I_k^c) ŵ^[k](λ_j) ||_2^2 )      (19)

Task 5 (20 p): Implement a K-fold cross-validation scheme for the LASSO solver implemented in Task 4. Illustrate your results using the following plots (and make detailed comments for them):

  • A plot illustrating RMSE_val(λ_j) vs. λ_j, ∀j, using a solid line with markers. Overlaid in the same plot, also plot the corresponding RMSE_est(λ_j) using a solid line with different markers. Also add a dashed vertical line at the location of the selected λ̂. Specify legends for all three lines, label the x-axis, and adjust line widths, font sizes, and so on to make the plot clearly interpretable.
  • A reconstruction plot for the selected λˆ. Use the same layout and formatting as described for the reconstruction plots in Task 4.

Hint: For simpler implementation, you may fill in and use the function sketch lasso_cv(). Also, for improved visualization and efficiency, select your grid of candidate λ values to be evenly spaced on a log scale, e.g., lambda_grid = exp( linspace( log(lambda_min), log(lambda_max), N_lambda)).
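
A minimal sketch of such a cross-validation routine, wrapped around the lasso_ccd() solver from Task 4 and following Algorithm 1, is shown below. The fold handling (randperm, equal fold sizes, refitting on all data at the end) is illustrative and assumes N is divisible by K; the provided lasso_cv() sketch may be organized differently.

    function [wopt, lambdaopt, RMSEval, RMSEest] = lasso_cv(t, X, lambdavec, K)
        N = length(t);
        J = length(lambdavec);
        SEval = zeros(K, J);
        SEest = zeros(K, J);
        idx  = randperm(N);                          % randomized fold assignment
        Nval = N/K;                                  % validation samples per fold
        for k = 1:K
            Ik  = idx((k-1)*Nval+1 : k*Nval);        % validation indices, I_k
            Ikc = setdiff(1:N, Ik);                  % estimation indices, I_k^c
            for j = 1:J
                w = lasso_ccd(t(Ikc), X(Ikc,:), lambdavec(j));
                SEval(k,j) = norm(t(Ik)  - X(Ik,:)*w)^2;
                SEest(k,j) = norm(t(Ikc) - X(Ikc,:)*w)^2;
            end
        end
        RMSEval = sqrt(mean(SEval, 1) / Nval);       % cf. (17)
        RMSEest = sqrt(mean(SEest, 1) / (N - Nval)); % cf. (19)
        [~, p] = min(RMSEval);
        lambdaopt = lambdavec(p);                    % cf. (18)
        wopt = lasso_ccd(t, X, lambdaopt);           % refit on all data with the selected lambda
    end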

3           Denoising of an audio excerpt

In general, the sounds of importance that we perceive in our daily lives are narrowband, meaning that for any short (< 100 ms) sequence, the spectral representation is sparse, i.e., it may be modeled using only a small number of pure frequencies. This is typically true for, e.g., speech and music. As a contrast, sounds of small importance, or even nuisance sounds, are often broadband, such as, e.g., noise and interference, containing many or most frequencies across the spectrum.

In this section, you shall attempt to perform denoising of a noisy recording of piano music using the LASSO and a regression matrix of sine and cosine pairs. The idea is thus that, using a hypothesis space with frequency functions, and by virtue of the sparse estimates obtained using the LASSO, only the strong narrowband components in the music will be modeled, while the broadband noise is cancelled.

The data archive A1_data.mat contains an excerpt of piano music, 5 seconds in total, divided into two data sets, Ttrain and Ttest, where the former is for estimation and validation, while the latter is used only for testing (not validation). You can listen to the data sets using the Matlab function soundsc(Ttrain,fs), where fs is the sampling rate, also provided in the data archive. The training data is much too large (and non-stationary) to be modeled in one regression problem. Instead, the training data is to be divided into frames 40 ms long. For each frame, a (possibly unique) LASSO solution may be obtained, which, if used to reconstruct the data, can be listened to. The same regression matrix, Xaudio, is used for all frames. Do not use the regression matrix X from the previous section.

Task 6 (10 p): Implement a K-fold cross-validation scheme for the multi-frame audio excerpt Ttrain, similar to Task 5, but modified in order to find one optimal λˆ for all frames. Illustrate the results with a plot of RMSEval, RMSEest, and λˆ, using the exact same formatting as in the corresponding plot in Task 5, and discuss your findings.

Hint: The provided function sketch multiframe_lasso_cv() outlines the algorithm, which may be filled in and used. For the data provided, complete training may take 15 minutes or longer, depending on the implementation.
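
As a rough MATLAB sketch of the multi-frame idea (one λ̂ for all frames), the squared validation errors can be accumulated over both folds and frames before the minimizing λ is selected. All variable names below are illustrative, lasso_ccd() is the solver from Task 4, and the provided multiframe_lasso_cv() sketch may be structured differently.

    framelen = size(Xaudio, 1);                       % samples per frame (about 40 ms)
    nframes  = floor(length(Ttrain) / framelen);
    J = length(lambdavec);
    SEval = zeros(K, J);                              % squared validation errors, summed over frames
    for f = 1:nframes
        tf = Ttrain((f-1)*framelen+1 : f*framelen);   % the f:th frame of training data
        idx  = randperm(framelen);
        Nval = floor(framelen/K);
        for k = 1:K
            Ik  = idx((k-1)*Nval+1 : k*Nval);
            Ikc = setdiff(1:framelen, Ik);
            for j = 1:J
                w = lasso_ccd(tf(Ikc), Xaudio(Ikc,:), lambdavec(j));
                SEval(k,j) = SEval(k,j) + norm(tf(Ik) - Xaudio(Ik,:)*w)^2;
            end
        end
    end
    RMSEval = sqrt(sum(SEval, 1) / (K*Nval*nframes)); % averaged over folds and frames
    [~, p]  = min(RMSEval);
    lambdahat = lambdavec(p);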

Task 7 (10 p): Using the λ̂ you trained in Task 6, use the provided function lasso_denoise() to denoise the test data Ttest. Save your results as save('denoised_audio','Ytest','fs') and include the file in your submission. In addition, listen to your output and discuss your results. Try whether another choice of λ achieves better noise-cancelling results and detail your findings in the report.
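
A minimal usage sketch of this final step, with lambdahat denoting the hyperparameter selected in Task 6 and the remaining names taken from Appendix A:

    Ytest = lasso_denoise(Ttest, Xaudio, lambdahat);   % denoise the test data
    save('denoised_audio', 'Ytest', 'fs');             % file to include in the submission
    soundsc(Ytest, fs);                                % listen to the denoised result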

A            Provided data and functions

Contents of data archive A1_data.mat used in sections 2 and 3:

t         Single realization of the data t ∈ R^50 in (8)
X         Regression matrix for Tasks 4 and 5, X ∈ R^{50×1000}, in (11)
Xinterp   Like X but with an interpolated time axis, X ∈ R^{1000×1000}
n         Time axis, n ∈ Z^50, n ∈ [0, N−1]
ninterp   Interpolated time axis, n ∈ R^1000, n ∈ [0, N−1]
Ttrain    Mono audio data for estimation and validation, T ∈ R^19404
Ttest     Mono audio data for testing, T ∈ R^24697
Xaudio    Regression matrix for Tasks 6 and 7, X ∈ R^{364×2000}, in (11)
fs        Sampling frequency, fs = 8820 Hz

Function sketches (partly containing pseudocode) provided:

  • what = lasso_ccd(t,X,lambda):

Returns LASSO estimate what (i.e., wˆ(λ)) given an input of data t, regression matrix X, and hyperparameter lambda.

  • [wopt,lambdaopt,RMSEval,RMSEest] = lasso_cv(t,X,lambdavec,K):

Returns wopt, the LASSO estimate for the optimal λ̂; lambdaopt, the optimal λ; and RMSEval and RMSEest, the RMSE on validation and estimation data, respectively. Takes data t, regression matrix X, a grid of candidate hyperparameters lambdavec, and the number of folds K as inputs.

  • [Wopt, lambdahat, RMSEval, RMSEest] = multiframe_lasso_cv(T,X,lambdavec,K)

Returns Wopt, an M by N_frames matrix of LASSO estimates, one for each size-N frame of data in T, calculated for the optimal λ̂; lambdahat, the optimal λ; and RMSEval and RMSEest, the RMSE on validation and estimation data, respectively. Takes multi-frame data T, regression matrix X, a grid of candidate hyperparameters lambdavec, and the number of folds K as inputs.

  • Yclean = lasso_denoise(T,X,lambdaopt):

Returns the denoised audio vector Yclean given an input of noisy audio T, regression matrix X, and hyperparameter lambdaopt.

B              Illustration of data in Tasks 4 and 5

Figure 1: Plot of the provided noisy target values (data) t, together with the noise-free data-generating function f(n) in (9).
