COMS4771 HW3 Solved

  • [Bayesian interpretation of ridge regression] Consider the following data generating process for a linear regression problem in Rd. Nature first selects d weight coefficients w1,…,wd as wi ∼ N(0, τ²) i.i.d. Given n examples x1,…,xn ∈ Rd, nature generates the output variable yi as

yi = w · xi + εi,

where εi ∼ N(0, σ²) i.i.d.

Show that finding the coefficients w1,…,wd that maximize P[w1,…,wd | (x1,y1),…,(xn,yn)] is equivalent to minimizing the ridge optimization criterion. (A Bayes-rule setup is sketched below.)
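
A minimal setup sketch for this part, assuming the prior variance is τ² and the noise variance is σ² (both symbols are assumptions; the ridge parameter comes out as their ratio):

```latex
% Sketch of the MAP objective, assuming w_j ~ N(0, tau^2) i.i.d. and eps_i ~ N(0, sigma^2) i.i.d.
\[
P[w \mid (x_1,y_1),\dots,(x_n,y_n)]
  \;\propto\; \prod_{i=1}^{n} \exp\!\Big(-\tfrac{(y_i - w\cdot x_i)^2}{2\sigma^2}\Big)
  \;\prod_{j=1}^{d} \exp\!\Big(-\tfrac{w_j^2}{2\tau^2}\Big),
\]
so taking the negative logarithm, the maximizer of the posterior solves
\[
\min_{w}\; \sum_{i=1}^{n} (y_i - w\cdot x_i)^2 \;+\; \frac{\sigma^2}{\tau^2}\,\|w\|^2,
\]
which is the ridge criterion with regularization parameter \(\lambda = \sigma^2/\tau^2\).
```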

  • [Combining multiple classifiers] The concept of “wisdom-of-the-crowd” posits that the collective knowledge of a group, as expressed through its aggregated actions or opinions, is superior to the decision of any one individual in the group. Here we will study a version of the “wisdom-of-the-crowd” for binary classifiers: how can one combine prediction outputs from multiple, possibly low-quality, binary classifiers to achieve an aggregate high-quality final output? Consider the following iterative procedure to combine classifier results.

Input:

  • S – a set of training samples: S = {(x1,y1),…,(xm,ym)}, where each yi ∈ {−1,+1}
  • T – number of iterations (also, number of classifiers to combine)
  • F – a set of (possibly low-quality) classifiers. Each f ∈ F is of the form f : X → {−1,+1}

Output:

  • F – a set of selected classifiers {f1,…,fT}, where each fi ∈ F
  • A – a set of combination weights {α1,…,αT}

Iterative Combination Procedure:

  • Initialize distribution weights D1(i) = 1/m for i = 1,…,m
  • for t = 1,…,T do
  • Define εj := Σi Dt(i) 1[fj(xi) ≠ yi] for each fj ∈ F // εj is the weighted error of the j-th classifier w.r.t. Dt
  • ft = argmin_{fj ∈ F} εj // select the classifier with the smallest (weighted) error, and let εt denote that error
  • // recompute weights w.r.t. performance of ft
  • Compute classifier weight αt = (1/2) ln((1 − εt)/εt)
  • Compute distribution weight Dt+1(i) = Dt(i) exp(−αt yi ft(xi))
  • Normalize distribution weights Dt+1(i) ← Dt+1(i) / Σi′ Dt+1(i′)
  • endfor
  • return weights αt and classifiers ft, for t = 1,…,T.

Final Combined Prediction:

  • For any test input x, define the aggregation function as: g(x) := Σt αt ft(x), and return the prediction as sign(g(x)). (A runnable sketch of the full procedure is given below.)
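
A minimal Python sketch of the combination procedure above. The names (e.g. `combine_classifiers`) are illustrative, and the base classifiers are assumed to be callables mapping an array of inputs to ±1 predictions:

```python
import numpy as np

def combine_classifiers(X, y, F, T):
    """Sketch of the iterative combination procedure.

    X : (m, d) array of inputs, y : (m,) array of labels in {-1, +1},
    F : list of callables f(X) -> array of predictions in {-1, +1},
    T : number of rounds.
    Returns the selected classifiers and their combination weights.
    """
    m = len(y)
    D = np.full(m, 1.0 / m)              # initialize distribution weights D_1(i) = 1/m
    preds = [f(X) for f in F]            # cache each classifier's predictions
    selected, alphas = [], []
    for t in range(T):
        # weighted error of each classifier w.r.t. the current distribution D_t
        errors = np.array([np.sum(D * (p != y)) for p in preds])
        j = int(np.argmin(errors))       # classifier with smallest weighted error
        eps = errors[j]
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * preds[j])   # reweight by performance of f_t
        D = D / D.sum()                         # normalize the distribution
        selected.append(F[j])
        alphas.append(alpha)
    return selected, alphas

def predict(x_new, selected, alphas):
    # final combined prediction: sign of the weighted vote g(x)
    g = sum(a * f(x_new) for a, f in zip(alphas, selected))
    return np.sign(g)
```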

We’ll prove the following statement: If for each iteration t there is some γt > 0 such that

εt ≤ 1/2 − γt

(that is, assuming that at each iteration the error of the classifier ft is just γt better than random guessing), then the error of the aggregate classifier satisfies

err(g) ≤ exp(−2 Σt γt²).

That is, the error of the aggregate classifier g decreases exponentially fast with the number of combinations T!

  • Let Zt := Σi Dt+1(i) (i.e., Zt denotes the normalization constant for the weighted distribution Dt+1). Show that

DT+1(i) = exp(−yi g(xi)) / (m · Πt Zt).

  • Show that the error of the aggregate classifier g is upper bounded by the product of the Zt:

err(g) ≤ Πt Zt.

(hint: use the fact that 0-1 loss is upper bounded by exponential loss)

  • Show that Zt = 2√(εt(1 − εt)).

(hint: noting Zt = Σi Dt(i) exp(−αt yi ft(xi)), separate the expression into the correctly and incorrectly classified cases and express it in terms of εt)

  • By combining results from (ii) and (iii), we have that err(g) ≤ Πt 2√(εt(1 − εt)). Now show that

Πt 2√(εt(1 − εt)) ≤ exp(−2 Σt γt²),

thus establishing that err(g) ≤ exp(−2 Σt γt²). (A short worked inequality is sketched below.)
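
One way to see the final inequality (a sketch, using the assumption εt ≤ 1/2 − γt and the elementary bound 1 + x ≤ e^x):

```latex
% Sketch of the last step, assuming eps_t <= 1/2 - gamma_t.
\[
2\sqrt{\varepsilon_t(1-\varepsilon_t)}
  \;=\; \sqrt{1 - 4\big(\tfrac{1}{2}-\varepsilon_t\big)^2}
  \;\le\; \sqrt{1 - 4\gamma_t^2}
  \;\le\; \exp(-2\gamma_t^2),
\]
using $1 + x \le e^{x}$ with $x = -4\gamma_t^2$, so that $\sqrt{1-4\gamma_t^2} \le \sqrt{e^{-4\gamma_t^2}} = e^{-2\gamma_t^2}$.
Taking the product over $t$ gives $\prod_t 2\sqrt{\varepsilon_t(1-\varepsilon_t)} \le \exp(-2\sum_t \gamma_t^2)$.
```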

  • [Low-dimensional information-preserving transformations] (hashing the cube) You have a collection of nonzero distinct binary vectors x1,…,xm ∈ {0,1}n. To facilitate later lookup, you decide to hash them to vectors of length p < n by means of a linear mapping xi ↦ Axi, where A is a p × n matrix with 0-1 entries, and all computations are performed modulo 2. Suppose the entries of the matrix are picked uniformly at random (i.e., each entry is an independent coin toss). (A small simulation sketch appears after this question.)
    • Pick any 1 ≤ i ≤ m, and any b ∈ {0,1}p. Show that the probability (over the choice of A) that xi hashes to b is exactly 1/2p. (Hint: focus on a coordinate 1 ≤ j ≤ n for which xij = 1.)
    • Pick any 1 ≤ i < j ≤ m. What is the probability that xi and xj hash to the same vector? This is called a collision.
    • Show that if p ≥ 2 log2 m, then with probability at least 1/2, there are no collisions among the xi. Thus: to avoid collisions, it is enough to linearly hash into O(log m) dimensions!

(question credit: Prof. Sanjoy Dasgupta)
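
A minimal simulation sketch of this hashing scheme (helper names such as `hash_mod2` are illustrative, not part of the assignment):

```python
import numpy as np

rng = np.random.default_rng(0)

def hash_mod2(A, x):
    # linear hash x -> Ax, with all arithmetic performed modulo 2
    return (A @ x) % 2

# illustrative setup: a collection of distinct nonzero binary vectors in {0,1}^n
n = 20
X = rng.integers(0, 2, size=(100, n))
X = np.unique(X[X.any(axis=1)], axis=0)   # enforce "nonzero" and "distinct"
m = len(X)
p = int(np.ceil(2 * np.log2(m)))          # hash length p >= 2 log2(m)

# estimate, over random choices of A, how often some pair of vectors collides
trials = 200
collisions = 0
for _ in range(trials):
    A = rng.integers(0, 2, size=(p, n))   # each entry an independent coin toss
    hashes = {tuple(hash_mod2(A, x)) for x in X}
    collisions += (len(hashes) < m)       # a repeated hash value means a collision
print(f"fraction of trials with at least one collision: {collisions / trials:.2f}")
```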

  • [Strange consequences of high dimensionality] As discussed in class, we often represent our data in high dimensions. Thus, to understand our data better and design effective prediction algorithms, it is good to understand how things behave in high dimensions. Obviously, since we cannot visualize or imagine high dimensional spaces, we often tend to rely on how data behave in one, two, or three dimensions and extrapolate how they may behave in hundreds of dimensions. It turns out that our low dimensional intuition can be very misleading about data and distributions in high dimensional spaces. In this problem we will explore this in more detail.

Consider the Gaussian distribution with mean µ and identity covariance Id in Rd. Recall that the density assigned to any point x ∈ Rd is

f(x) = (2π)^(−d/2) exp(−‖x − µ‖²/2).

  • Show that when x = µ, x gets assigned the highest density.

(This, of course, makes sense: the Gaussian density peaks at its mean and thus x = µ has the highest density.)

  • If the mean has the highest density, it stands to reason that if we draw a large i.i.d. sample from the distribution, then a large fraction of the points should lie close to the mean. Let’s try to verify this experimentally. For simplicity, let the mean µ = 0 (the covariance is still Id). Draw 10,000 points i.i.d. from a Gaussian N(0,Id).

To see how far away a sampled datapoint is from the mean, we can look at the distance ‖x − µ‖² = ‖x‖² (that is, the squared length of the sampled datapoint, when the mean is zero). Plot the histogram of the squared lengths of the samples, for dimensions d = 1, 2, 3, 5, 10, 50 and 100. You should plot all these histograms on the same figure for a better comparison. (A plotting sketch is given below.)

What interesting observations do you see from this plot? Do you notice anything strange about the samples that were drawn from the high dimensional Gaussian distribution? Do most of the samples lie close to the mean?
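
A minimal plotting sketch for this experiment, assuming numpy and matplotlib are available:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_samples = 10_000
dims = [1, 2, 3, 5, 10, 50, 100]

plt.figure()
for d in dims:
    x = rng.standard_normal((n_samples, d))   # i.i.d. draws from N(0, I_d)
    sq_lengths = np.sum(x**2, axis=1)         # squared distance from the mean (mean is 0)
    plt.hist(sq_lengths, bins=100, density=True, alpha=0.5, label=f"d = {d}")
plt.xlabel("squared length ||x||^2")
plt.ylabel("density")
plt.legend()
plt.title("Squared lengths of samples from N(0, I_d)")
plt.show()
```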

  • Let’s mathematically derive where we expect these samples to lie. That is, calculate

E_{x∼N(0,Id)} [ ‖x‖² ].

Is the empirical plot in part (ii) in agreement with the mathematical expression you derived here? (A short calculation is sketched below.)
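
One way to carry out this calculation (a sketch; it uses only that each coordinate of x is a standard Gaussian with unit variance):

```latex
% Sketch: the squared norm decomposes coordinate-wise.
\[
\mathbb{E}_{x \sim N(0, I_d)}\big[\|x\|^2\big]
  \;=\; \mathbb{E}\Big[\textstyle\sum_{j=1}^{d} x_j^2\Big]
  \;=\; \sum_{j=1}^{d} \mathbb{E}\big[x_j^2\big]
  \;=\; d,
\]
since each coordinate $x_j \sim N(0,1)$ has $\mathbb{E}[x_j^2] = \mathrm{Var}(x_j) = 1$.
```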

  • This “strangeness” is not specific to the Gaussian distribution; you can observe something similar even for the simplest of distributions in high dimensions. Consider the uniform distribution over the cube [−1,1]d. Just like in part (ii), draw 10,000 i.i.d. samples from this d-dimensional cube with uniform density, and plot the histogram of how far away from the origin the sample points lie. (Do this for d = 1, 2, 3, 5, 10, 50 and 100, again on the same plot; a sampling sketch is given below.)

Recall that the cube has side length 2, while most of the high-dimensional samples have length far more than 2! This means that even though you are drawing uniformly from the cube, most of your samples lie in the corners (and not the interior) of the cube!
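
A minimal sketch for the cube experiment, mirroring the Gaussian sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_samples = 10_000
dims = [1, 2, 3, 5, 10, 50, 100]

plt.figure()
for d in dims:
    x = rng.uniform(-1.0, 1.0, size=(n_samples, d))   # uniform draws from the cube [-1, 1]^d
    sq_lengths = np.sum(x**2, axis=1)                  # squared distance from the origin
    plt.hist(sq_lengths, bins=100, density=True, alpha=0.5, label=f"d = {d}")
plt.xlabel("squared length ||x||^2")
plt.ylabel("density")
plt.legend()
plt.title("Squared lengths of samples from unif([-1, 1]^d)")
plt.show()
```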

  • Again, calculate the expected (squared) length of the samples. That is, calculate

E_{x∼unif([−1,1]d)} [ ‖x‖² ].

Is the plot in part (iv) in agreement with the expression you derive here? (A short calculation is sketched below.)
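
The analogous calculation for the uniform cube (a sketch; each coordinate is uniform on [−1,1]):

```latex
% Sketch: decompose coordinate-wise again; each x_j ~ unif([-1,1]).
\[
\mathbb{E}_{x \sim \mathrm{unif}([-1,1]^d)}\big[\|x\|^2\big]
  \;=\; \sum_{j=1}^{d} \mathbb{E}\big[x_j^2\big]
  \;=\; \sum_{j=1}^{d} \int_{-1}^{1} \frac{t^2}{2}\, dt
  \;=\; \frac{d}{3}.
\]
```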
