ML Homework 3: Linear Regression

  1. Consider a noisy target y = w_f^T x + ε, where x ∈ R^{d+1} (including the added coordinate x_0 = 1), y ∈ R, w_f ∈ R^{d+1} is an unknown vector, and ε is an i.i.d. noise term with zero mean and variance σ². Assume that we run linear regression on a training data set D = {(x_1,y_1),…,(x_N,y_N)} generated i.i.d. from some P(x) and the noise process above, and obtain the weight vector w_lin. As briefly discussed in Lecture 9, it can be shown that the expected in-sample error E_in(w_lin) with respect to D is given by

E_D[E_in(w_lin)] = σ² (1 − (d+1)/N).

For σ = 0.1 and d = 11, what is the smallest number of examples N such that E_D[E_in(w_lin)] is no less than 0.006? Choose the correct answer; explain your answer. (A quick plug-in check appears after the choices below.)

  • 25
  • 30
  • 35
  • 40
  • 45
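One way to check, assuming the expression above: with σ = 0.1 and d = 11,

σ² (1 − (d+1)/N) ≥ 0.006  ⇔  0.01 (1 − 12/N) ≥ 0.006  ⇔  12/N ≤ 0.4  ⇔  N ≥ 30,

so the smallest qualifying N can be read off among the listed choices.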
  2. As shown in Lecture 9, minimizing E_in(w) for linear regression means solving ∇E_in(w) = 0, which in turn means solving the so-called normal equation

X^T X w = X^T y.

Which of the following statements about the normal equation is correct for any features X and labels y? Choose the correct answer; explain your answer.

  • There exists at least one solution for the normal equation.
  • If there exists a solution for the normal equation, Ein(w) = 0 at such a solution.
  • If there exists a unique solution for the normal equation, Ein(w) = 0 at the solution.
  • If Ein(w) = 0 at some w, there exists a unique solution for the normal equation.
  • none of the other choices
  3. In Lecture 9, we introduced the hat matrix H = XX† for linear regression. The matrix projects the label vector y to the "predicted" vector ŷ = Hy and helps us analyze the error of linear regression. Assume that X^T X is invertible, which makes H = X(X^T X)^{−1} X^T. Now, consider the following operations on X. Which operation can possibly change H? Choose the correct answer; explain your answer. (A small numerical sketch follows the choices below.)
    • multiplying the whole matrix X by 2 (which is equivalent to scaling all input vectors by 2)
    • multiplying the i-th column of X by i, for every i (which is equivalent to scaling the i-th feature by i)
    • multiplying the n-th row of X by n, for every n (which is equivalent to scaling the n-th example by n)
    • adding three randomly-chosen columns i,j,k to column 1 of X

(i.e., xn,1 ← xn,1 + xn,i + xn,j + xn,k)

  • none of the other choices (i.e. all other choices are guaranteed to keep H unchanged.)
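The following numpy sketch compares H before and after some of the operations above; it is only an illustration, and the sizes (N = 8, d = 3) and random data are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])  # add x_0 = 1

def hat(X):
    """Hat matrix H = X (X^T X)^{-1} X^T, assuming X^T X is invertible."""
    return X @ np.linalg.inv(X.T @ X) @ X.T

H = hat(X)
H_scaled_all = hat(2 * X)                              # scale the whole matrix by 2
H_scaled_col = hat(X * np.arange(1, d + 2))            # scale the i-th column by i
H_scaled_row = hat(X * np.arange(1, N + 1)[:, None])   # scale the n-th row by n

print(np.allclose(H, H_scaled_all))   # True: H unchanged
print(np.allclose(H, H_scaled_col))   # True: H unchanged
print(np.allclose(H, H_scaled_row))   # generally False: H can change
```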

Likelihood and Maximum Likelihood

  4. Consider a coin with an unknown head probability θ. Independently flip this coin N times to get y_1, y_2, …, y_N, where y_n = 1 if the n-th flip results in heads, and 0 otherwise. Define

ν = (1/N) Σ_{n=1}^{N} y_n.

How many of the following statements about ν are true? Choose the correct answer; explain your answer by illustrating why those statements are true.

  • Pr(|ν − θ| > ε) ≤ 2 exp(−2ε²N) for all N ∈ ℕ and ε > 0.
  • ν maximizes likelihood(θ̂) over all θ̂ ∈ [0,1].
  • ν minimizes E_in(ŷ) = (1/N) Σ_{n=1}^{N} (ŷ − y_n)² over all ŷ ∈ R.
  • 2·ν is the negative gradient direction −∇E_in(ŷ) at ŷ = 0.

(Note: θ is similar to the role of the “target function” and θˆ is similar to the role of the “hypothesis” in our machine learning framework.)

  • 0
  • 1
  • 2
  • 3
  • 4
  5. Let y_1, y_2, …, y_N be N values generated i.i.d. from a uniform distribution on [0, θ] with some unknown θ. For any θ̂ ≥ max(y_1, y_2, …, y_N), what is its likelihood? Choose the correct answer; explain your answer.

(Hint: Those who are interested in more math [who isn’t? :-)] are encouraged to try to derive the maximum-likelihood estimator.)

Gradient and Stochastic Gradient Descent

  6. In the perceptron learning algorithm, we find one example (x_{n(t)}, y_{n(t)}) that the current weight vector w_t mis-classifies, and then update w_t by

w_{t+1} ← w_t + y_{n(t)} x_{n(t)}.

A variant of the algorithm finds all examples (x_n, y_n) that the weight vector w_t mis-classifies (i.e., y_n ≠ sign(w_t^T x_n)), and then updates w_t by

w_{t+1} ← w_t + Σ_{n: y_n ≠ sign(w_t^T x_n)} y_n x_n.

The variant can be viewed as optimizing some E_in(w) that is composed of one of the following pointwise error functions with fixed-learning-rate gradient descent (neglecting any non-differentiable spots of E_in). What is the error function? Choose the correct answer; explain your answer.

  • err(w,x,y) = |1 − ywTx|
  • err(w,x,y) = max(0,ywTx)
  • err(w,x,y) = −ywTx
  • err(w,x,y) = min(0,ywTx)
  • err(w,x,y) = max(0,1 − ywTx)
  7. Besides the error functions introduced in the lectures so far, the following error function, exponential error, is also widely used by some learning models. The exponential error is defined by err_exp(w,x,y) = exp(−y w^T x). If we want to use stochastic gradient descent to minimize an E_in(w) that is composed of this error function, which of the following is the update direction −∇err_exp(w, x_n, y_n) for the chosen (x_n, y_n) with respect to w_t? Choose the correct answer; explain your answer. (A small numerical gradient check follows the choices below.)
    • +ynxn exp(−ynwTxn)
    • −ynxn exp(−ynwTxn)
    • +xn exp(−ynwTxn)
    • −xn exp(−ynwTxn)
    • none of the other choices
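As a small numerical check, assuming a random w, x and a label y ∈ {−1,+1} (all made up for illustration), the analytic gradient of err_exp can be compared against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(4)
x = rng.standard_normal(4)
y = -1.0  # a label in {-1, +1}

def err_exp(w, x, y):
    """Exponential error err_exp(w, x, y) = exp(-y w^T x)."""
    return np.exp(-y * (w @ x))

# Analytic gradient of err_exp with respect to w.
grad = -y * x * np.exp(-y * (w @ x))

# Finite-difference estimate, one coordinate at a time.
eps = 1e-6
fd = np.array([
    (err_exp(w + eps * e, x, y) - err_exp(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(len(w))
])

print(np.allclose(grad, fd, atol=1e-5))  # True: the analytic gradient matches
```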

Hessian and Newton Method

  8. Let E(w): R^d → R be a function. Denote the gradient of E by b_E(w) = ∇E(w) and the Hessian of E by A_E(w) = ∇²E(w).

Then, the second-order Taylor's expansion of E(w) around u is

E(w) ≈ E(u) + b_E(u)^T (w − u) + (1/2) (w − u)^T A_E(u) (w − u).

Suppose A_E(u) is positive definite. What is the optimal direction v such that w ← u + v minimizes the right-hand side of the Taylor's expansion above? Choose the correct answer; explain your answer. (Note that iterative optimization with v is generally called Newton's method. A minimal numerical sketch follows the choices below.)

  • +(A_E(u))^{−1} b_E(u)
  • −(A_E(u))^{−1} b_E(u)
  • +(A_E(u))^{+1} b_E(u)
  • −(A_E(u))^{+1} b_E(u)
  • none of the other choices
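A minimal numerical sketch, assuming a simple quadratic E(w) = (1/2) w^T Q w − p^T w (so b_E(w) = Qw − p and A_E(w) = Q; the matrix Q and vector p below are arbitrary illustrations):

```python
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite by construction
p = np.array([1.0, -1.0])

u = np.zeros(2)                # current point
b = Q @ u - p                  # gradient b_E(u)
A = Q                          # Hessian A_E(u)

v = -np.linalg.solve(A, b)     # Newton direction: solve A v = -b
w = u + v

print(w, np.linalg.solve(Q, p))  # for a quadratic, one step reaches the minimizer
```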
  9. Following the previous problem, consider minimizing E_in(w) in the linear regression problem with Newton's method. For any given w_t, what is the Hessian A_E(w_t) with E = E_in? Choose the correct answer; explain your answer.
    • (2/N) X^T X
    • none of the other choices

Multinomial Logistic Regression

  10. In Lecture 11, we solve multiclass classification by OVA or OVO decompositions. One alternative to deal with multiclass classification is to extend the original logistic regression model to Multinomial Logistic Regression (MLR). For a K-class classification problem, we will denote the output space Y = {1, 2, ··· , K}. The hypotheses considered by MLR can be indexed by a matrix

W = [w_1 w_2 ··· w_K],

that contains K weight vectors (w_1, ··· , w_K), each of length d+1. The matrix represents a hypothesis

h_y(x) = exp(w_y^T x) / Σ_{k=1}^{K} exp(w_k^T x)

that can be used to approximate the target distribution P(y|x) for any (x, y). MLR then seeks the maximum-likelihood solution over all such hypotheses. For a given data set {(x_1,y_1),…,(x_N,y_N)} generated i.i.d. from some P(x) and target distribution P(y|x), the likelihood of h is proportional to ∏_{n=1}^{N} h_{y_n}(x_n). That is, minimizing the negative log-likelihood is equivalent to minimizing an E_in(W) that is composed of the following error function

err(W, x, y) = −ln h_y(x) = −Σ_{k=1}^{K} ⟦y = k⟧ ln h_k(x).

When minimizing E_in(W) with SGD, we need to compute the partial derivative ∂ err(W, x, y) / ∂ W_{ik}. What is the value of the partial derivative? Choose the correct answer; explain your answer.

    • none of the other choices

  11. Following the previous problem, consider a data set with K = 2 and obtain the optimal solution from MLR as (w_1^*, w_2^*). Now, relabel the same data set by replacing y_n with 3 − 2y_n to form a binary classification data set. Which of the following is an optimal solution for logistic regression on the binary classification data set? Choose the correct answer; explain your answer.

Nonlinear Transformation

  12. Given the following training data set:

x1 = (0,1),y1 = −1            x2 = (1,−0.5),y2 = −1                 x3 = (−1,0),y3 = −1

x4 = (−1,2),y4 = +1                   x5 = (2,0),y5 = +1            x6 = (1,−1.5),y6 = +1             x7 = (0,−2),y7 = +1

Using the quadratic transform Φ_2(x) = (1, x_1, x_2, x_1², x_1 x_2, x_2²), which of the following weights w̃^T in the Z-space can separate all of the training data correctly? Choose the correct answer; (no, you don't need to explain your answer 🙂).

  • [−9,−1,0,2,−2,3]
  • [−5,−1,2,3,−7,2]
  • [9,−1,4,2,−2,3]
  • [2,1,−4,−2,7,−4]
  • [−7,0,0,2,−2,3]
  13. Consider the following feature transform, which maps x ∈ R^d to z ∈ R^{1+1}, keeping only the k-th coordinate of x: Φ^{(k)}(x) = (1, x_k). Let H_k be the hypothesis set that couples Φ^{(k)} with perceptrons. Among the following choices, which is the tightest upper bound of d_vc(⋃_{k=1}^{d} H_k) for d ≥ 4? Choose the correct answer; explain your answer. (Hint: You can use the fact that … for d ≥ 4 if needed.)
    • 2((log2 log2 d) + 1)
    • 2((log2 d) + 1)
    • 2((dlog2 d) + 1)
    • 2(d + 1)
    • 2(d2 + 1)

Experiments with Linear and Nonlinear Models

Next, we will play with linear regression, logistic regression, non-linear transform, and their use for binary classification. Please use the following set for training:

https://www.csie.ntu.edu.tw/~htlin/course/ml20fall/hw3/hw3_train.dat

and the following set for testing (estimating Eout):

https://www.csie.ntu.edu.tw/~htlin/course/ml20fall/hw3/hw3_test.dat

Each line of the data set contains one (x_n, y_n) with x_n ∈ R^10. The first 10 numbers of the line contain the components of x_n in order; the last number is y_n, which belongs to {−1,+1} ⊆ R. That is, we can use those y_n for either binary classification or regression.

  14. (*) Add x_{n,0} = 1 to each x_n. Then, implement the linear regression algorithm on page 11 of Lecture 9. What is E_in^{sqr}(w_lin), where E_in^{sqr} denotes the averaged squared error over N examples? Choose the closest answer; provide your code. (A minimal implementation sketch follows the choices below.)
    • 0.00
    • 0.20
    • 0.40
    • 0.60
    • 0.80
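A minimal sketch of one possible implementation, assuming the training file has been downloaded locally as hw3_train.dat and using the pseudo-inverse form w_lin = X†y:

```python
import numpy as np

def load_data(path):
    """Load one (x_n, y_n) per line: 10 features followed by a {-1,+1} label."""
    data = np.loadtxt(path)
    X, y = data[:, :-1], data[:, -1]
    X = np.column_stack([np.ones(len(X)), X])   # add x_{n,0} = 1
    return X, y

X, y = load_data("hw3_train.dat")               # assumed local copy of the training set
w_lin = np.linalg.pinv(X) @ y                   # w_lin = X^dagger y
E_in_sqr = np.mean((X @ w_lin - y) ** 2)        # averaged squared error over N examples
print(E_in_sqr)
```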
  15. (*) Add x_{n,0} = 1 to each x_n. Then, implement the SGD algorithm for linear regression using the results on pages 10 and 12 of Lecture 11. Pick one example uniformly at random in each iteration, take η = 0.001 and initialize w with w_0 = 0. Run the algorithm until E_in^{sqr}(w_t) ≤ 1.01 E_in^{sqr}(w_lin), and record the total number of iterations taken. Repeat the experiment 1000 times, each with a different random seed. What is the average number of iterations over the 1000 experiments? Choose the closest answer; provide your code. (A sketch of one possible implementation follows the choices below.)
    • 600
    • 1200
    • 1800
    • 2400
    • 3000
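A hedged sketch of one way to run the experiment, assuming X, y, and w_lin from the Problem 14 sketch above; the squared-error SGD step w ← w + η·2(y_n − w^T x_n) x_n is the usual derivative-based form, which may differ in constant factors from the cited lecture pages:

```python
import numpy as np

def sgd_linear_iters(X, y, w_lin, eta=0.001, seed=0):
    """Run squared-error SGD until E_in^sqr(w_t) <= 1.01 * E_in^sqr(w_lin)."""
    rng = np.random.default_rng(seed)
    target = 1.01 * np.mean((X @ w_lin - y) ** 2)
    w = np.zeros(X.shape[1])                        # w_0 = 0
    iters = 0
    while np.mean((X @ w - y) ** 2) > target:
        n = rng.integers(len(y))                    # pick one example uniformly at random
        w += eta * 2 * (y[n] - w @ X[n]) * X[n]     # negative stochastic gradient step
        iters += 1
    return iters

# Assuming X, y, w_lin from the Problem 14 sketch:
counts = [sgd_linear_iters(X, y, w_lin, seed=s) for s in range(1000)]
print(np.mean(counts))
```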
  16. (*) Add x_{n,0} = 1 to each x_n. Then, implement the SGD algorithm for logistic regression by replacing the SGD update step in the previous problem with the one on page 10 of Lecture 11. Pick one example uniformly at random in each iteration, take η = 0.001 and initialize w with w_0 = 0. Run the algorithm for 500 iterations. Repeat the experiment 1000 times, each with a different random seed. What is the average E_in^{ce}(w_{500}) over the 1000 experiments, where E_in^{ce} denotes the averaged cross-entropy error over N examples? Choose the closest answer; provide your code. (A sketch follows the choices below.)
    • 0.44
    • 0.50
    • 0.56
    • 0.62
    • 0.68
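A sketch under the same assumed data loading as before; the logistic-regression SGD step w ← w + η·σ(−y_n w^T x_n)·y_n x_n and the cross-entropy error (1/N) Σ ln(1 + exp(−y_n w^T x_n)) are the standard forms, which may differ cosmetically from the lecture's notation:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def ce_error(w, X, y):
    """Averaged cross-entropy error (1/N) sum ln(1 + exp(-y_n w^T x_n))."""
    return np.mean(np.log(1.0 + np.exp(-y * (X @ w))))

def sgd_logistic(X, y, w0, eta=0.001, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(iters):
        n = rng.integers(len(y))                                # uniform random example
        w += eta * sigmoid(-y[n] * (w @ X[n])) * y[n] * X[n]    # logistic SGD step
    return w

# Assuming X, y from the earlier sketches (for Problem 17, pass w_lin as w0 instead):
errs = [ce_error(sgd_logistic(X, y, np.zeros(X.shape[1]), seed=s), X, y)
        for s in range(1000)]
print(np.mean(errs))
```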
  17. (*) Repeat the previous problem, but with w initialized by w_0 = w_lin of Problem 14 instead. Repeat the experiment 1000 times, each with a different random seed. What is the average E_in^{ce}(w_{500}) over the 1000 experiments? Choose the closest answer; provide your code.
    • 0.44
    • 0.50
    • 0.56
    • 0.62
    • 0.68
  18. (*) Following Problem 14, what is |E_in^{0/1}(w_lin) − E_out^{0/1}(w_lin)|, where the superscript 0/1 denotes the 0/1 error (i.e., using sign(w_lin^T x) for binary classification), and E_out is estimated using the test set provided above? Choose the closest answer; provide your code. (A sketch of the computation follows the choices below.)

  • 0.32
  • 0.36
  • 0.40
  • 0.44
  • 0.48
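One possible way to compute the quantity, assuming load_data, X, y, and w_lin from the Problem 14 sketch and a local copy of the test file as hw3_test.dat:

```python
import numpy as np

def zero_one_error(w, X, y):
    """Averaged 0/1 error of the binary classifier sign(w^T x); ties at 0 count as errors."""
    return np.mean(np.sign(X @ w) != y)

X_test, y_test = load_data("hw3_test.dat")       # assumed local copy of the test set
gap = abs(zero_one_error(w_lin, X, y) - zero_one_error(w_lin, X_test, y_test))
print(gap)
```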
  19. (*) Next, consider the following homogeneous order-Q polynomial transform

Φ(x) = (1, x_1, x_2, …, x_10, x_1², x_2², …, x_10², …, x_1^Q, x_2^Q, …, x_10^Q).

Transform the training and testing data according to Φ(x) with Q = 3, and again implement the linear regression algorithm on page 11 of Lecture 9. What is |E_in^{0/1}(g) − E_out^{0/1}(g)|, where g is the hypothesis returned by the transform + linear regression procedure? Choose the closest answer; provide your code. (A sketch of the transform follows the choices below.)

  • 0.32
  • 0.36
  • 0.40
  • 0.44
  • 0.48
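A sketch of the transform as reconstructed above (per-coordinate powers only, no cross terms), assuming a local hw3_train.dat; the same steps as the Problem 14 and Problem 18 sketches then apply to the transformed data:

```python
import numpy as np

def poly_transform(X_raw, Q):
    """Map each x in R^10 to (1, x_1..x_10, x_1^2..x_10^2, ..., x_1^Q..x_10^Q)."""
    cols = [np.ones(len(X_raw))]
    for q in range(1, Q + 1):
        cols.append(X_raw ** q)                   # all coordinates raised to power q
    return np.column_stack(cols)

raw = np.loadtxt("hw3_train.dat")                 # assumed local copy of the training set
X_raw, y = raw[:, :-1], raw[:, -1]
Z = poly_transform(X_raw, Q=3)
w_poly = np.linalg.pinv(Z) @ y                    # linear regression in the Z-space
print(np.mean(np.sign(Z @ w_poly) != y))          # E_in^{0/1}(g); repeat on the test set for E_out
```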
  20. (*) Repeat the previous problem, but with Q = 10 instead. What is |E_in^{0/1}(g) − E_out^{0/1}(g)|? Choose the closest answer; provide your code.
    • 0.32
    • 0.36
    • 0.40
    • 0.44
    • 0.48