CS229 Problem Set #3: Theory & Unsupervised Learning


1. Uniform convergence

You are hired by CNN to help design the sampling procedure for making their electoral predictions for the next presidential election in the (fictitious) country of Elbania.

The country of Elbania is organized into states, and there are only two candidates running in this election: one from the Elbanian Democratic Party, and another from the Labor Party of Elbania. The plan for making our electoral predictions is as follows: we'll sample m voters from each state, and ask whether they're voting democrat. We'll then publish, for each state, the estimated fraction of democrat voters. In this problem, we'll work out how many voters we need to sample in order to ensure that we get good predictions with high probability.

One reasonable goal might be to set m large enough that, with high probability, we obtain uniformly accurate estimates of the fraction of democrat voters in every state. But this might require surveying very many people, which would be prohibitively expensive. So, we’re instead going to demand only a slightly lower degree of accuracy.

Specifically, we’ll say that our prediction for a state is “highly inaccurate” if the estimated fraction of democrat voters differs from the actual fraction of democrat voters within that state by more than a tolerance factor γ. CNN knows that their viewers will tolerate some small number of states’ estimates being highly inaccurate; however, their credibility would be damaged if they reported highly inaccurate estimates for too many states. So, rather than trying to ensure that all states’ estimates are within γ of the true values (which would correspond to no state’s estimate being highly inaccurate), we will instead try only to ensure that the number of states with highly inaccurate estimates is small.

To formalize the problem, let there be n states, and let m voters be drawn IID from each state. Let the actual fraction of voters in state i that voted democrat be φi. Also let Xij (1 ≤ i ≤ n, 1 ≤ j ≤ m) be a binary random variable indicating whether the j-th randomly chosen voter from state i voted democrat:

Xij = 1 if the jth example from the ith state voted democrat, and Xij = 0 otherwise.

We assume that the voters correctly disclose their vote during the survey. Thus, for each value of i, we have that Xij are drawn IID from a Bernoulli(φi) distribution. Moreover, the Xij’s (for all i,j) are all mutually independent.

After the survey, the fraction of democrat votes in state i is estimated as

φ̂i = (1/m) Σ_{j=1}^m Xij.

Also, let Zi = 1{|φ̂i − φi| > γ} be a binary random variable that indicates whether the prediction in state i was highly inaccurate.

(a) Let ψi be the probability that Zi = 1. Using the Hoeffding inequality, find an upper bound on ψi. (The form of the Hoeffding inequality we have in mind is restated after this problem, for reference.)
(b) In this part, we prove a general result which will be useful for this problem. Let Vi and Wi (1 ≤ i ≤ k) be Bernoulli random variables, and suppose

E[Vi] = P(Vi = 1) ≤ P(Wi = 1) = E[Wi]                  ∀i ∈ {1,2,…,k}.

Let the Vi's be mutually independent, and similarly let the Wi's also be mutually independent. Prove that, for any value of t, the following holds:

P(Σ_{i=1}^k Vi > t) ≤ P(Σ_{i=1}^k Wi > t).

[Hint: One way to do this is via induction on k. If you use a proof by induction, for the base case (k = 1), you must show that the inequality holds for t < 0, 0 ≤ t < 1, and t ≥ 1.]

(c) The fraction of states on which our predictions are highly inaccurate is given by

Z = (1/n) Σ_{i=1}^n Zi.

Prove a reasonable closed-form upper bound on the probability P(Z > τ) of being highly inaccurate on more than a fraction τ of the states.

[Note: There are many possible answers, but to be considered reasonable, your bound must decrease to zero as m → ∞ (for fixed n and τ > 0). Also, your bound should either remain constant or decrease as n → ∞ (for fixed m and τ > 0). It is also fine if, for some values of τ, m, and n, your bound just tells us that P(Z > τ) ≤ 1 (the trivial bound).]
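For reference in part (a): one standard two-sided form of the Hoeffding inequality (the form used in the CS229 learning-theory lecture notes) says that if X1,…,Xm are drawn IID from a Bernoulli(φ) distribution and φ̂ = (1/m) Σ_{j=1}^m Xj, then for any fixed γ > 0,

P(|φ − φ̂| > γ) ≤ 2 exp(−2γ²m).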

2. More VC dimension

Let the domain of the inputs for a learning problem be X = R. Consider using hypotheses of the following form:

hθ(x) = 1{θ0 + θ1x + θ2x^2 + ··· + θdx^d ≥ 0},

and let H = {hθ : θ ∈ R^{d+1}} be the corresponding hypothesis class. What is the VC dimension of H? Justify your answer.

[Hint: You may use the fact that a polynomial of degree d has at most d real roots. When doing this problem, you should not assume any other non-trivial result (such as that the VC dimension of linear classifiers in d-dimensions is d + 1) that was not formally proved in class.]
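As a small illustration of how the hint can be used (our own example, not part of the required answer), take d = 1, so hθ(x) = 1{θ0 + θ1x ≥ 0}. Any two points x1 < x2 can be shattered: θ = (1, 0) labels both points 1, θ = (−1, 0) labels both 0, θ = ((x1 + x2)/2, −1) produces the labeling (1, 0), and θ = (−(x1 + x2)/2, 1) produces (0, 1). However, no three points x1 < x2 < x3 can receive the labeling (1, 0, 1): that would require the degree-1 polynomial θ0 + θ1x to change sign twice along the real line, while it has at most one real root. Hence, for d = 1, the VC dimension of H is 2.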

3.  LOOCV and SVM

(a) Linear Case. Consider training an SVM using a linear kernel K(x,z) = xᵀz on a training set {(x(i), y(i)) : i = 1,…,m} that is linearly separable, and suppose we do not use ℓ1 regularization. Let |SV| be the number of support vectors obtained when training on the entire training set. (Recall x(i) is a support vector if and only if αi > 0.) Let ε̂LOOCV denote the leave-one-out cross-validation error of our SVM (written out explicitly after part (b), for reference). Prove that

ε̂LOOCV ≤ |SV| / m.

(b) General Case. Consider a setting similar to that in part (a), except that we now run an SVM using a general (Mercer) kernel. Assume that the data is linearly separable in the high-dimensional feature space corresponding to the kernel. Does the bound in part (a) on ε̂LOOCV still hold? Justify your answer.
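For concreteness, the quantity ε̂LOOCV referred to in part (a) is the usual leave-one-out cross-validation error: writing h(−i) for the classifier obtained by training on all examples except the ith (this notation is ours),

ε̂LOOCV = (1/m) Σ_{i=1}^m 1{h(−i)(x(i)) ≠ y(i)},

i.e. the fraction of training examples misclassified when each one is held out in turn.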

4. [12 points] MAP estimates and weight decay

Consider using a logistic regression model hθ(x) = g(θTx) where g is the sigmoid function, and let a training set {(x(i),y(i));i = 1,…,m} be given as usual. The maximum likelihood estimate of the parameters θ is given by

θML = argmax_θ Π_{i=1}^m p(y(i)|x(i); θ).

If we wanted to regularize logistic regression, then we might put a Bayesian prior on the parameters. Suppose we chose the prior θ ∼ N(0, τ²I) (here, τ > 0, and I is the (n+1)-by-(n+1) identity matrix), and then found the MAP estimate of θ as:

θMAP = argmax_θ p(θ) Π_{i=1}^m p(y(i)|x(i), θ).

Prove that

||θMAP||2 ≤ ||θML||2

[Hint: Consider using a proof by contradiction.]

Remark. For this reason, this form of regularization is sometimes also called weight decay, since it encourages the weights (meaning parameters) to take on generally smaller values.
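To make the connection to weight decay explicit (this is only a restatement of the MAP objective above, not a proof of the claim), note that log p(θ) = −θᵀθ/(2τ²) + const for the N(0, τ²I) prior, so taking logs gives

θMAP = argmax_θ [ Σ_{i=1}^m log p(y(i)|x(i), θ) − θᵀθ/(2τ²) ],

i.e. the usual log-likelihood penalized by an ℓ2 term that shrinks the weights, with the strength of the penalty controlled by 1/(2τ²).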

5.  KL divergence and Maximum Likelihood

The Kullback-Leibler (KL) divergence between two discrete-valued distributions P(X), Q(X) is defined as follows:[1]

KL(P‖Q) = Σ_{x∈X} P(x) log( P(x) / Q(x) ).

For notational convenience, we assume P(x) > 0 for all x. (Otherwise, one standard thing to do is to adopt the convention that "0 log 0 = 0.") Sometimes, we also write the KL divergence as KL(P‖Q) = KL(P(X)‖Q(X)).

The KL divergence is an asymmetric measure of the distance between two probability distributions. In this problem we will prove some basic properties of KL divergence, and work out a relationship between minimizing KL divergence and the maximum likelihood estimation that we're familiar with.
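As a quick numerical illustration of both points (the numbers here are our own example, not part of the problem), take P = (1/2, 1/2) and Q = (1/4, 3/4) on a two-element set. Using natural logarithms,

KL(P‖Q) = (1/2) log(0.5/0.25) + (1/2) log(0.5/0.75) ≈ 0.144, whereas
KL(Q‖P) = (1/4) log(0.25/0.5) + (3/4) log(0.75/0.5) ≈ 0.131,

so the divergence is nonnegative in both directions but depends on the order of its arguments.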

(a) Nonnegativity. Prove the following:

∀P, Q : KL(P‖Q) ≥ 0

and

KL(P‖Q) = 0           if and only if P = Q.

[Hint: You may use the following result, called Jensen’s inequality. If f is a convex function, and X is a random variable, then E[f(X)] ≥ f(E[X]). Moreover, if f is strictly convex (f is convex if its Hessian satisfies H ≥ 0; it is strictly convex if H > 0; for instance f(x) = −logx is strictly convex), then E[f(X)] = f(E[X]) implies that X = E[X] with probability 1; i.e., X is actually a constant.]

(b) Chain rule for KL divergence. The KL divergence between two conditional distributions P(X|Y), Q(X|Y) is defined as follows:

KL(P(X|Y)‖Q(X|Y)) = Σ_y P(y) ( Σ_x P(x|y) log( P(x|y) / Q(x|y) ) ).

This can be thought of as the expected KL divergence between the corresponding conditional distributions on x (that is, between P(X|Y = y) and Q(X|Y = y)), where the expectation is taken over the random y. Prove the following chain rule for KL divergence:

KL(P(X,Y)‖Q(X,Y)) = KL(P(X)‖Q(X)) + KL(P(Y|X)‖Q(Y|X)).

(A small numerical check of this identity is sketched at the end of this problem.)

(c) KL and maximum likelihood.

Consider a density estimation problem, and suppose we are given a training set {x(i); i = 1,…,m}. Let the empirical distribution be

P̂(x) = (1/m) Σ_{i=1}^m 1{x(i) = x}.

(Pˆ is just the uniform distribution over the training set; i.e., sampling from the empirical distribution is the same as picking a random example from the training set.) Suppose we have some family of distributions Pθ parameterized by θ. (If you like, think of Pθ(x) as an alternative notation for P(x;θ).) Prove that finding the maximum likelihood estimate for the parameter θ is equivalent to finding Pθ with minimal KL divergence from Pˆ. I.e. prove:

argmin_θ KL(P̂‖Pθ) = argmax_θ Σ_{i=1}^m log Pθ(x(i)).

Remark. Consider the relationship between parts (b-c) and multi-variate Bernoulli Naive Bayes parameter estimation. In the Naive Bayes model we assumed Pθ is of the following form: Pθ(x, y) = p(y) Π_{i=1}^n p(xi|y). By the chain rule for KL divergence, we therefore have:

KL(P̂‖Pθ) = KL(P̂(y)‖p(y)) + Σ_{i=1}^n KL(P̂(xi|y)‖p(xi|y)).

This shows that finding the maximum likelihood/minimum KL-divergence estimate of the parameters decomposes into 2n + 1 independent optimization problems: One for the class priors p(y), and one for each of the conditional distributions p(xi|y) for each feature xi given each of the two possible labels for y. Specifically, finding the maximum likelihood estimates for each of these problems individually results in also maximizing the likelihood of the joint distribution. (If you know what Bayesian networks are, a similar remark applies to parameter estimation for them.)
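As a quick sanity check of the chain rule in part (b) (not required for the problem), the identity can be verified numerically for randomly chosen joint distributions over a pair of binary variables. The following MATLAB/Octave sketch is our own and uses only standard built-ins; P and Q are arbitrary 2×2 joint distributions, with rows indexing X and columns indexing Y.

P = rand(2, 2); P = P / sum(P(:));   % a random joint distribution P(X, Y)
Q = rand(2, 2); Q = Q / sum(Q(:));   % a random joint distribution Q(X, Y)

Px = sum(P, 2);  Qx = sum(Q, 2);     % marginals P(X), Q(X)
Pyx = P ./ repmat(Px, 1, 2);         % conditionals P(Y|X)
Qyx = Q ./ repmat(Qx, 1, 2);         % conditionals Q(Y|X)

KL_joint = sum(sum(P .* log(P ./ Q)));                 % KL(P(X,Y) || Q(X,Y))
KL_X     = sum(Px .* log(Px ./ Qx));                   % KL(P(X) || Q(X))
KL_cond  = sum(Px .* sum(Pyx .* log(Pyx ./ Qyx), 2));  % KL(P(Y|X) || Q(Y|X))

disp([KL_joint, KL_X + KL_cond]);    % the two columns should agree

The two displayed numbers agree up to floating-point error, matching the chain rule above.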

6.  K-means for compression

In this problem, we will apply the K-means algorithm to lossy image compression, by reducing the number of colors used in an image.

The directory /afs/ir.stanford.edu/class/cs229/ps/ps3/ contains a 512×512 image of a mandrill represented in 24-bit color. This means that, for each of the 262144 pixels in the image, there are three 8-bit numbers (each ranging from 0 to 255) that represent the red, green, and blue intensity values for that pixel. The straightforward representation of this image therefore takes about 262144 × 3 = 786432 bytes (a byte being 8 bits). To compress the image, we will use K-means to reduce the image to k = 16 colors. More specifically, each pixel in the image is considered a point in the three-dimensional (r, g, b)-space. To compress the image, we will cluster these points in color-space into 16 clusters, and replace each pixel with the closest cluster centroid.

Follow the instructions below. Be warned that some of these operations can take a while (several minutes even on a fast computer)![2]

(a) Copy mandrill-large.tiff from /afs/ir.stanford.edu/class/cs229/ps/ps3 on the leland system. Start up MATLAB, and type A = double(imread('mandrill-large.tiff')); to read in the image. Now, A is a "three dimensional matrix," and A(:,:,1), A(:,:,2) and A(:,:,3) are 512×512 arrays that respectively contain the red, green, and blue values for each pixel. Enter imshow(uint8(round(A))); to display the image.
(b) Since the large image has 262144 pixels and would take a while to cluster, we will instead run vector quantization on a smaller image. Repeat (a) with mandrill-small.tiff. Treating each pixel's (r, g, b) values as an element of R^3, run K-means[3] with 16 clusters on the pixel data from this smaller image, iterating (preferably) to convergence, but in no case for fewer than 30 iterations. For initialization, set each cluster centroid to the (r, g, b)-values of a randomly chosen pixel in the image. (A minimal sketch of such a clustering loop appears after part (d).)
(c) Take the matrix A from mandrill-large.tiff, and replace each pixel's (r, g, b) values with the value of the closest cluster centroid. Display the new image, and compare it visually to the original image. Hand in all your code and a printout of your compressed image (printing on a black-and-white printer is fine).

 


(d) If we represent the image with these reduced (16) colors, by (approximately) what factor have we compressed the image?
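The footnote below asks you to implement K-means yourself, so the following is only a minimal MATLAB/Octave sketch of the vector-quantization loop described in parts (b)-(c), for orientation; it assumes the small image has already been loaded into A as in part (a), and all variable names are our own.

k = 16;                               % number of colors after compression
[h, w, nchan] = size(A);
X = reshape(A, h * w, 3);             % one row per pixel: [r g b]

idx = randperm(h * w);                % initialize centroids to random pixels
mu = X(idx(1:k), :);

for iter = 1:30                       % at least 30 iterations, as required
    % Assignment step: squared distance from every pixel to every centroid
    dist = zeros(h * w, k);
    for j = 1:k
        dvec = X - repmat(mu(j, :), h * w, 1);
        dist(:, j) = sum(dvec .^ 2, 2);
    end
    [dmin, c] = min(dist, [], 2);     % c(p) = index of the closest centroid

    % Update step: move each centroid to the mean of its assigned pixels
    for j = 1:k
        if any(c == j)
            mu(j, :) = mean(X(c == j, :), 1);
        end
    end
end

Xq = mu(c, :);                        % replace each pixel by its centroid
imshow(uint8(round(reshape(Xq, h, w, 3))));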

[1] If P and Q are densities for continuous-valued random variables, then the sum is replaced by an integral, and everything stated in this problem works fine as well. But for the sake of simplicity, in this problem we’ll just work with this form of KL divergence for probability mass functions/discrete-valued distributions.

[2] In order to use the imread and imshow commands in Octave, you have to install the Image package from octave-forge. This package and installation instructions are available at: http://octave.sourceforge.net

[3] Please implement K-means yourself, rather than using built-in functions from, e.g., MATLAB or Octave.
