EE270: Large Scale Matrix Computation, Optimization and Learning, HW #3

1.   Logistic Regression

In this question, we will study logistic regression, a popular method in machine learning. We let w ∈ R^p be our decision variable; w represents weights for the columns of the n × p data matrix X, whose i-th row is x_i^T. Let

    σ(a) = 1 / (1 + e^{−a})

be the sigmoid/logistic function. This function is non-convex. However, −log σ(a) is convex, and it arises in maximum likelihood estimation as we describe next.

We assume that we collect data x_i and a binary response variable y_i, which is the label, where

    y_i = +1 with probability σ(w^T x_i),
    y_i = −1 with probability 1 − σ(w^T x_i).                                 (1)
We can write the above in the compact form p(y_i | w, x_i) = σ(y_i w^T x_i), since σ(a) = 1 − σ(−a). If we collect n independent observations, then the probability of observing this outcome is ∏_{i=1}^n σ(y_i w^T x_i), and so the negative log-likelihood, which we will use as our objective function, is

    ℓ(w) = −∑_{i=1}^n log σ(y_i w^T x_i) = ∑_{i=1}^n log(1 + e^{−y_i w^T x_i}).

Note that minimizing the negative log-likelihood is the same as maximizing the likelihood; this corresponds to maximum likelihood estimation of the parameter w. Once w is identified, we can use (1) to infer the label of a test data point x.

  • Derive the gradient ∇ℓ(w).
  • Derive the Hessian ∇²ℓ(w).
  • Is the cost function ℓ(w) convex? (A numerical sanity check for these derivations is sketched below.)
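As a sanity check on these derivations, one can compare a candidate gradient against central finite differences. A minimal sketch, assuming NumPy and synthetic data (all names are illustrative):

```python
import numpy as np

def loss(w, X, y):
    # l(w) = sum_i log(1 + exp(-y_i * w^T x_i)), via the stable logaddexp form
    return np.logaddexp(0.0, -y * (X @ w)).sum()

def numerical_grad(f, w, eps=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (f(w + e) - f(w - e)) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = np.sign(rng.standard_normal(50))        # labels in {-1, +1}
w = rng.standard_normal(5)
print(numerical_grad(lambda v: loss(v, X, y), w))  # compare with your derived gradient
```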

2.   Logistic Regression for Spam E-mail Classification

Download the email spam data set, which is available at https://github.com/probml/pmtk3/tree/master/data/spamData in Matlab and plain-text format. This set consists of 4601 email messages, from which 57 features have been extracted. These are as follows:

  • 48 features, in [0, 100], giving the percentage of words in a given message which match a given word on the list. The list contains words such as “business”, “free”, “george”, etc.
  • 6 features, in [0, 100], giving the percentage of characters in the email that match a given character on the list. The characters are ; ( [ ! $ #
  • Feature 55: the average length of an uninterrupted sequence of capital letters (max is 40.3, mean is 4.9).
  • Feature 56: the length of the longest uninterrupted sequence of capital letters (max is 45, mean is 52.6).
  • Feature 57: the sum of the lengths of uninterrupted sequences of capital letters (max is 25.6, mean is 282.2).

Load the data set using the provided link. In the Matlab version, there is a training set of size 3065 and a test set of size 1536. In the plain-text version there is no predefined training/testing split; you can randomly shuffle the data to pick a training set of size 3065 and use the rest for testing (or you can use the Matlab data along with scipy.io.loadmat).

There are different ways to pre-process the data, e.g., standardizing the columns of X. For this problem, transform the features using log(x_ij + 0.1). One could also add some regularization to the loss function, which can help generalization, but this is not necessary. Also note that you need to transform the labels from {0, 1} to {−1, +1}.
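A minimal loading-and-preprocessing sketch, assuming the Matlab file spamData.mat with fields Xtrain, ytrain, Xtest, ytest (the field names are an assumption based on the pmtk3 repository):

```python
import numpy as np
from scipy.io import loadmat

data = loadmat("spamData.mat")   # assumed fields: Xtrain, ytrain, Xtest, ytest
Xtr = np.log(data["Xtrain"].astype(np.float64) + 0.1)   # transform log(x_ij + 0.1)
Xte = np.log(data["Xtest"].astype(np.float64) + 0.1)
ytr = 2.0 * data["ytrain"].ravel() - 1.0                # map labels {0,1} -> {-1,+1}
yte = 2.0 * data["ytest"].ravel() - 1.0
```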

  • Run gradient descent with a fixed step-size. Plot the value of the cost function at each iteration and find a reasonable step-size for fast convergence.
  • Repeat the previous part using gradient descent with momentum.
  • Implement gradient descent with Armijo line search. The procedure is as follows: assume that we are at the point x_k and have a search direction p_k (for gradient descent, p_k = −∇f(x_k)). Then the Armijo line search proceeds as:
    • Pick an initial step-size t.
    • Initialize the parameters 0 < ρ < 1 and 0 < c < 1 (typical values are c = 10^{−4} and ρ = 0.9).
    • While f(x_k + t p_k) > f(x_k) + c t ∇f(x_k)^T p_k, set t ← ρ t.
    • Terminate the procedure once f(x_k + t p_k) ≤ f(x_k) + c t ∇f(x_k)^T p_k.

The test statement in the while loop is the Armijo condition. If p_k = −∇f(x_k), then the test is accepted when f(x_k − t ∇f(x_k)) ≤ f(x_k) − c t ‖∇f(x_k)‖₂². In general, the second term is negative as long as p_k is a descent direction. One can prove that this line search procedure terminates.

Find a good estimate for the initial step-size by trial and error. A simple idea is to reuse the final step-size from the previous iteration, but this can be unnecessarily small; if you do so, increase the step-size by a factor of 2 before starting the next line search. A sketch of the full procedure follows.
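A generic sketch of gradient descent with Armijo backtracking and a heavy-ball momentum variant, demonstrated on a toy quadratic (swap in your logistic loss ℓ(w) and its gradient; all names and default parameters are illustrative):

```python
import numpy as np

def armijo_step(f, fx, gx, x, p, t0=1.0, rho=0.9, c=1e-4):
    """Backtrack from t0 until the Armijo condition holds (p must be a descent direction)."""
    t, slope = t0, gx @ p                 # slope = directional derivative, < 0 for descent
    while f(x + t * p) > fx + c * t * slope:
        t *= rho
    return t

def gradient_descent_armijo(f, grad, x0, iters=200):
    x, t, history = x0.copy(), 1.0, []
    for _ in range(iters):
        fx, gx = f(x), grad(x)
        history.append(fx)
        t = armijo_step(f, fx, gx, x, -gx, t0=2 * t)  # warm-start from 2x previous step
        x = x - t * gx
    return x, history

def gradient_descent_momentum(f, grad, x0, t=1e-2, beta=0.9, iters=200):
    # heavy-ball momentum: x_{k+1} = x_k - t grad(x_k) + beta (x_k - x_{k-1})
    x, x_prev, history = x0.copy(), x0.copy(), []
    for _ in range(iters):
        history.append(f(x))
        x, x_prev = x - t * grad(x) + beta * (x - x_prev), x
    return x, history

# toy quadratic demo: f(x) = 0.5 ||Ax - b||^2
rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 5)), rng.standard_normal(30)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
_, hist = gradient_descent_armijo(f, grad, np.zeros(5))
print(hist[-1])
```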

3.   Newton’s Method is Affine Invariant

In this question, we will prove the affine invariance of Newton’s method. Let f : R^n → R be a convex function. Consider an affine change of variables x = Ay + b, where A ∈ R^{n×n} is invertible and b ∈ R^n. Define the function g : R^n → R by g(y) = f(Ay + b). Denote by x^(k) the k-th iterate of Newton’s method performed on f, and by y^(k) the k-th iterate of Newton’s method performed on g.

  • Show that if x^(k) = Ay^(k) + b, then x^(k+1) = Ay^(k+1) + b.
  • Show that Newton’s decrement does not depend on the coordinates, i.e., show that λ_f(x^(k)) = λ_g(y^(k)), where λ_f(x) = (∇f(x)^T ∇²f(x)^{−1} ∇f(x))^{1/2}.

Together, these two facts imply that Newton’s method is affine invariant. As an important consequence, Newton’s method cannot be improved by a change of coordinates, unlike gradient descent.
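The proof only requires the chain-rule identities for the gradient and Hessian of g; a short LaTeX sketch of these identities and the resulting Newton step (a hint, assuming x^(k) = Ay^(k) + b) is:

```latex
% Chain rule for g(y) = f(Ay + b):
\nabla g(y) = A^{\top} \nabla f(Ay + b), \qquad
\nabla^{2} g(y) = A^{\top} \nabla^{2} f(Ay + b)\, A.
% Newton step on g, written in terms of x^{(k)} = A y^{(k)} + b:
y^{(k+1)} = y^{(k)} - \bigl(\nabla^{2} g(y^{(k)})\bigr)^{-1} \nabla g(y^{(k)})
          = y^{(k)} - A^{-1} \bigl(\nabla^{2} f(x^{(k)})\bigr)^{-1} \nabla f(x^{(k)}).
% Multiplying by A and adding b recovers the Newton step on f.
```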

4.   Newton’s Method for Convex Optimization

  • Implement Newton’s method for the logistic regression problem in Problem 1. Plot the value of the cost function at each iteration and find a reasonable step-size for fast convergence.
  • Implement randomized Newton’s method with a uniform sampling sketch, i.e., sample rows of H^{1/2} uniformly at random, where H and H^{1/2} denote the Hessian and its square-root, respectively. Plot the value of the cost function at each iteration and find a reasonable step-size and sketch-size for fast convergence. (A sketch of one such iteration follows.)
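For the logistic loss, the Hessian has the form H = X^T D X with D = diag(σ(x_i^T w)(1 − σ(x_i^T w))), so H^{1/2} = D^{1/2} X and sampling rows of H^{1/2} means sampling rows of D^{1/2} X. A minimal sketch of one randomized-Newton iteration (the ridge term reg and all names are illustrative choices, not part of the assignment):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sketched_newton_step(w, X, y, m, step=1.0, reg=1e-8, rng=np.random.default_rng()):
    """One randomized-Newton step using a uniform row-sampling sketch of H^{1/2}."""
    n, p = X.shape
    g = -X.T @ (y * sigmoid(-y * (X @ w)))          # gradient of the logistic loss
    d = sigmoid(X @ w) * (1.0 - sigmoid(X @ w))     # diagonal of D
    idx = rng.choice(n, size=m, replace=False)      # uniform row sampling
    Hs = np.sqrt(n / m) * np.sqrt(d[idx])[:, None] * X[idx]   # sketch of H^{1/2}
    H_approx = Hs.T @ Hs + reg * np.eye(p)          # approximate Hessian
    return w - step * np.linalg.solve(H_approx, g)
```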

5.   Fast Johnson-Lindenstrauss Transform (FJLT) using Hadamard Matrices

  • Construct a 128 × 1024 FJLT matrix as follows (a NumPy sketch of these steps appears below):

Set m = 128 and n = 1024.

Define

    H_1 = [ 1   1
            1  −1 ].

Construct H_10 ∈ R^{1024×1024} recursively via

    H_k = [ H_{k−1}   H_{k−1}
            H_{k−1}  −H_{k−1} ],   k = 2, ..., 10.

Generate D as an n × n diagonal matrix of uniformly random ±1 variables (Rademacher distribution).

Generate an m × n uniform sub-sampling matrix P scaled by √(n/m) (uniform sampling sketch).

Form the FJLT matrix S = P H_10 D.
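A minimal NumPy sketch of this construction (sampling without replacement; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 128, 1024

# Hadamard matrix H_10 via the recursive doubling construction
H = np.array([[1.0, 1.0], [1.0, -1.0]])
while H.shape[0] < n:
    H = np.block([[H, H], [H, -H]])

D = np.diag(rng.choice([-1.0, 1.0], size=n))   # Rademacher diagonal
rows = rng.choice(n, size=m, replace=False)    # uniform row selection
P = np.sqrt(n / m) * np.eye(n)[rows]           # scaled sub-sampling matrix
S = P @ H @ D                                  # the FJLT matrix
```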

  • Verify that S S^T is a multiple of the identity (since m < n, S^T S cannot equal the identity, but S S^T can). Scale S appropriately if needed to obtain S S^T = I.
  • Generate a data matrix A of size 1024 × 10 using i.i.d. standard Gaussian variables. Plot the singular values of A and the singular values of SA.
  • In part (c), SA is a Johnson–Lindenstrauss embedding of the d = 10 vectors (the 1024-dimensional columns of A) into dimension 128. Verify that the pairwise distances are approximately preserved, i.e., that there exists an ε > 0 such that

    (1 − ε) ‖A_i − A_j‖₂² ≤ ‖S A_i − S A_j‖₂² ≤ (1 + ε) ‖A_i − A_j‖₂²   for all i, j,    (2)

where A_i is the i-th column of A. Find ε*, the smallest value of ε that satisfies (2), for a single realization of the random construction. Note that the matrices D and P are constructed randomly, while H is deterministic. What are the minimum, maximum, and mean values of ε* over 100 random realizations of the construction?

Hint: The JL embedding property in (2) specifies 2·(d choose 2) linear inequalities in ε, one pair of the form

    ‖S A_i − S A_j‖₂² / ‖A_i − A_j‖₂² − 1 ≤ ε   and   1 − ‖S A_i − S A_j‖₂² / ‖A_i − A_j‖₂² ≤ ε

for each pair i < j; hence the smallest feasible value is ε* = max_{i<j} |‖S A_i − S A_j‖₂² / ‖A_i − A_j‖₂² − 1|. (A helper computing ε* is sketched below.)
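A small helper, assuming S and A from the construction above, computing ε* exactly as the hint describes:

```python
import numpy as np
from itertools import combinations

def eps_star(S, A):
    """Smallest eps satisfying the JL property (2) for the columns of A."""
    worst = 0.0
    for i, j in combinations(range(A.shape[1]), 2):
        diff = A[:, i] - A[:, j]
        ratio = np.sum((S @ diff) ** 2) / np.sum(diff ** 2)
        worst = max(worst, abs(ratio - 1.0))
    return worst
```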

  • Generate a data matrix A of size 1024 × 10 and a vector b of size 1024 × 1 using i.i.d. standard Gaussian variables. Solve the least squares problem

    x* = argmin_x ‖Ax − b‖₂.

Apply the FJLT to A and b as SA and Sb. Solve the sketched least squares problem

    x̃ := argmin_x ‖SAx − Sb‖₂.

Find the Euclidean distance between the solutions, i.e., ‖x̃ − x*‖₂, the distance between their predictions, i.e., ‖Ax̃ − Ax*‖₂, and the approximation ratio of the objective values, i.e., ‖Ax̃ − b‖₂ / ‖Ax* − b‖₂. (A sketch of this comparison follows.)
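A short sketch of this comparison, assuming S from the FJLT construction above; np.linalg.lstsq solves each least squares problem:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((1024, 10))
b = rng.standard_normal(1024)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)            # full least squares
x_tilde, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)   # sketched least squares

print(np.linalg.norm(x_tilde - x_star))                                # ||x~ - x*||
print(np.linalg.norm(A @ (x_tilde - x_star)))                          # ||A x~ - A x*||
print(np.linalg.norm(A @ x_tilde - b) / np.linalg.norm(A @ x_star - b))  # objective ratio
```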
