COMP6211E Optimization for Machine Learning (Final Exam)

1. (6 points) Consider the following finite sum optimization problem

\min_w \; \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\, 1 - w^\top x_i y_i\right)^2 + \frac{\lambda}{2} \|w\|_2^2,

where x_i ∈ R^d and y_i ∈ {±1}.

• (1 point) Derive the dual formulation, as in the SDCA procedure.
• (1 point) Write down the formula for the closed-form solution for SDCA (Option I of Algorithm 11.1).
• (1 point) Write down the dual-free SDCA update rule for ∆α_i in Algorithm 14.3.
• (1 point) Write down the SGD update rule (a sketch follows after this list).
• (2 points) Implement the three methods (SDCA, dual-free SDCA, SGD) in prob1() of prog-template.py, and plot the convergence curves (wrt primal suboptimality) until SDCA converges (error < 10^{-10}).
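As a reference point for the SGD bullet, a minimal sketch of one possible update loop; the function name and the step-size schedule η_t = η_0/(1 + λη_0 t) are assumptions, not part of the exam. The stochastic gradient at example i is −2 max(0, 1 − y_i w^⊤x_i) y_i x_i + λw.

import numpy as np

def sgd_squared_hinge(X, y, lam, eta0, T, seed=0):
    # SGD sketch for (1/n) sum_i max(0, 1 - y_i w^T x_i)^2 + (lam/2) ||w||^2.
    # The step-size schedule eta_t = eta0 / (1 + lam*eta0*t) is an assumption.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        i = rng.integers(n)
        margin = y[i] * (X[i] @ w)
        # stochastic gradient: squared-hinge term plus the L2 regularizer
        g = -2.0 * max(0.0, 1.0 - margin) * y[i] * X[i] + lam * w
        w -= (eta0 / (1.0 + lam * eta0 * t)) * g
    return w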

2. (6 points) Consider the minimax problem:

\min_x \max_{y \in C} \left[ x^\top A y + b^\top y + \frac{1}{2} \|x\|_2^2 \right], \qquad C = \{ y : y_j \ge 0 \}.

• (2 points) Write down the optimal solution of x as a function of y. Write the optimization problem in terms of y by eliminating x. Explain the derivation.

• (1 point) Write down the GDA update rule for this problem with learning rate η (a sketch of the GDA and extra-gradient updates follows after this list).
• (3 points) Implement GDA, extra gradient, and optimistic GDA in prob2() of prog-template.py, and plot the convergence curves (wrt the gradient norm of x and y; the 2-norm of x − Ay; and $b^\top y - \frac{1}{2}\|Ay\|_2^2$) for 100 iterations.
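For reference, a minimal sketch of the GDA and extra-gradient iterations, assuming ∇_x = Ay + x and ∇_y = A^⊤x + b, with the projection onto C realized by clipping y at zero; the zero initialization and fixed η are assumptions:

import numpy as np

def gda(A, b, eta, T):
    # Simultaneous GDA: descent step in x, projected ascent step in y.
    d, k = A.shape
    x, y = np.zeros(d), np.zeros(k)
    for _ in range(T):
        gx = A @ y + x            # grad_x of x^T A y + b^T y + 0.5 ||x||^2
        gy = A.T @ x + b          # grad_y
        x = x - eta * gx
        y = np.maximum(y + eta * gy, 0.0)   # projection onto C = {y : y_j >= 0}
    return x, y

def extragradient(A, b, eta, T):
    # Extra gradient: take a look-ahead step, then update with the
    # gradients evaluated at the look-ahead point.
    d, k = A.shape
    x, y = np.zeros(d), np.zeros(k)
    for _ in range(T):
        xh = x - eta * (A @ y + x)
        yh = np.maximum(y + eta * (A.T @ x + b), 0.0)
        x = x - eta * (A @ yh + xh)
        y = np.maximum(y + eta * (A.T @ xh + b), 0.0)
    return x, y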

3. (6 points) Consider zero-th order optimization.

• (2 points) In Theorem 18.6, assume further that f(x) is λ-strongly convex and take η_t = (λt)^{-1}. Derive the corresponding convergence result.

• (2 points) Consider the optimization problem

\min_\theta \; \mathbb{E}_{x \sim \pi(x|\theta)} f(x),

where x ∈ R^d. Assume we want to solve this problem using policy gradient, with θ = (µ, ρ), where µ is a d-dimensional vector and ρ ∈ R; both are part of the model parameters. Consider the distribution π(x|θ) = N(µ, e^{−ρ} I). Derive the policy gradient update rule for θ, including both µ and ρ (a sketch follows after this list).


• (2 points) Consider the zero-th order optimization problem over the discrete set x ∈ {0, 1}^d. Implement the policy gradients in Example 18.10 and Example 18.11 to solve the objective function

\min_{x \in \{0,1\}^d} f(x), \qquad f(x) = \frac{1}{2} x^\top A x - b^\top x + c,

in prob3() of prog-template.py, plot the convergence curves (wrt f(x)), and report your x^*, θ^* (refer to the Example; p(x_i = 1) = θ_i). A sketch of the Bernoulli policy gradient follows after this list.
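For the Gaussian policy-gradient bullet, a minimal sketch of one REINFORCE-style step; the score functions below follow from the N(µ, e^{−ρ}I) density, while the function name, batch size, and plain step size are assumptions:

import numpy as np

def gaussian_pg_step(f, mu, rho, eta, rng, batch=8):
    # One REINFORCE-style step for min_theta E_{x ~ N(mu, e^{-rho} I)} f(x).
    # Score functions (from the Gaussian density):
    #   grad_mu  log pi = e^rho (x - mu)
    #   grad_rho log pi = d/2 - (e^rho / 2) ||x - mu||^2
    d = mu.shape[0]
    g_mu, g_rho = np.zeros(d), 0.0
    for _ in range(batch):
        x = mu + np.exp(-rho / 2.0) * rng.standard_normal(d)  # std = e^{-rho/2}
        fx = f(x)
        g_mu += fx * np.exp(rho) * (x - mu)
        g_rho += fx * (d / 2.0 - 0.5 * np.exp(rho) * np.sum((x - mu) ** 2))
    return mu - eta * g_mu / batch, rho - eta * g_rho / batch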
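For the discrete bullet, a minimal sketch of the Bernoulli policy gradient with p(x_i = 1) = θ_i; the clipping that keeps θ inside (0, 1), the uniform initialization, and the constant step size are assumptions, and Examples 18.10/18.11 may use a different parameterization:

import numpy as np

def bernoulli_pg(f, d, eta, T, seed=0):
    # Policy gradient over x in {0,1}^d with p(x_i = 1) = theta_i.
    # Score: grad_{theta_i} log p(x|theta) = x_i/theta_i - (1 - x_i)/(1 - theta_i).
    rng = np.random.default_rng(seed)
    theta = 0.5 * np.ones(d)
    for _ in range(T):
        x = (rng.random(d) < theta).astype(float)
        score = x / theta - (1.0 - x) / (1.0 - theta)
        theta = np.clip(theta - eta * f(x) * score, 1e-3, 1.0 - 1e-3)
    return theta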

4. (6 points) Consider the setting of decentralized computing, where we are given m nodes. A vector x = [x_1, . . . , x_m] has m components, and each node contains a local component x_i of the vector, with local objective function f_i(x_i) + g_i(x_i). At any time step, in addition to local algebraic operations, we can perform the following function calls simultaneously on all nodes:

• (gradient computation) call grad(x): each node computes the local gradient ∇f_i(x_i).
• (proximal mapping) call prox(η, z): each node computes the local proximal mapping

\arg\min_{u_i} \left[ \tfrac{1}{2} \|u_i - z_i\|_2^2 + \eta\, g_i(u_i) \right].

• (communication) call communicate(z) = [(z_{i−1} + z_{i+1})/2]_{i=1,…,m}: each node sends its local vector z_i over the network, then receives vectors from the neighboring nodes i − 1 and i + 1 via the network, and computes the average (z_{i−1} + z_{i+1})/2 (where z_0 = z_m and z_{m+1} = z_1), as sketched below.
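A concrete reading of these primitives, simulating the stacked vector as a NumPy array whose row i lives on node i (the single-process simulation is an assumption; on real hardware each node only sees its own row):

import numpy as np

def communicate(z):
    # Ring averaging from the problem statement: entry i becomes
    # (z_{i-1} + z_{i+1}) / 2, with wrap-around (z_0 = z_m, z_{m+1} = z_1).
    return 0.5 * (np.roll(z, 1, axis=0) + np.roll(z, -1, axis=0))

def B(z):
    # [Bz]_i = z_i - (z_{i-1} + z_{i+1})/2, i.e. one communicate() per application.
    return z - communicate(z)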

If we have a variable w = [w_1, . . . , w_m], with w_i stored on node i, then node j ≠ i cannot access the information w_i on node i directly, except through calling communicate(). We want to use the above function calls to jointly optimize the following objective function:

\min \; \sum_{i=1}^{m} \left[ f_i(x_i) + g_i(z_i) \right] \quad \text{subject to} \quad x_1 = x_2 = \cdots = x_m = z_1 = \cdots = z_m,

by rewriting the above problem as

\min_{x,z} \; f(x) + g(z) \quad \text{subject to} \quad \begin{bmatrix} 0 \\ I \end{bmatrix} x - \begin{bmatrix} B \\ I \end{bmatrix} z = 0,

with [Bz]_i = z_i − (z_{i−1} + z_{i+1})/2, f(x) = \sum_i f_i(x_i), and g(z) = \sum_i g_i(z_i).

• (3 points) Write down an algorithm for decentralized optimization using linearized ADMM. Try to combine redundant communications so that no more than two communication calls are needed for each gradient computation (a sketch follows after this list).

• (3 points) Implement the method and plot convergence curves (wrt primal suboptimality) with different parameters (η and ρ in ADMM) to solve the objective function in prob4() of prog-template.py.
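One possible linearized-ADMM arrangement is sketched below, with multipliers α1 for the block Bz = 0 and α2 for x − z = 0; the update order that caches B(z) so only two communicate() calls occur per iteration is one reasonable choice, not necessarily the intended answer, and grad_f/prox_g stand for the grad and prox primitives above.

import numpy as np

def communicate(z):
    # ring averaging, as in the earlier sketch
    return 0.5 * (np.roll(z, 1, axis=0) + np.roll(z, -1, axis=0))

def B(z):
    # [Bz]_i = z_i - (z_{i-1} + z_{i+1})/2  (one communicate() per call)
    return z - communicate(z)

def linearized_admm(grad_f, prox_g, x0, eta, rho, T):
    # Multipliers: alpha1 for  0*x - B z = 0,  alpha2 for  x - z = 0.
    x, z = x0.copy(), x0.copy()
    alpha1, alpha2 = np.zeros_like(x0), np.zeros_like(x0)
    Bz = B(z)                      # communication; cached and reused below
    for _ in range(T):
        # x-step: linearized, purely local (no communication)
        x = x - eta * (grad_f(x) + alpha2 + rho * (x - z))
        # z-step: gradient of the smooth part of the augmented Lagrangian in z;
        # B(...) below is the first communicate() of this iteration
        gz = -B(alpha1 - rho * Bz) - alpha2 - rho * (x - z)
        z = prox_g(eta, z - eta * gz)
        # dual updates; B(z) is the second communicate() and is cached for reuse
        Bz = B(z)
        alpha1 = alpha1 - rho * Bz
        alpha2 = alpha2 + rho * (x - z)
    return x, z

In the single-process simulation, grad_f applies each ∇f_i to its row and prox_g implements the prox(η, ·) primitive row-wise.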
