Description
1. (6 points) Consider the following finite-sum optimization problem
$$\frac{1}{n}\sum_{i=1}^{n} \max(0,\, 1 - w^\top x_i y_i)^2 + \frac{\lambda}{2}\|w\|_2^2,$$
where $x_i \in \mathbb{R}^d$ and $y_i \in \{\pm 1\}$.
• (1 point) Derive the dual formulation, as in the SDCA procedure.
• (1 point) Write down the formula for the closed-form solution for SDCA (Option I of Algorithm 11.1).
• (1 point) Write down the dual-free SDCA update rule for $\Delta\alpha_i$ in Algorithm 14.3.
• (1 point) Write down the SGD update rule.
• (2 points) Implement the three methods (SDCA, dual-free SDCA, SGD) in prob1() of prog-template.py, and plot the convergence curves (wrt primal suboptimality) until SDCA converges (error $< 10^{-10}$).
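As a point of reference for the implementation bullet, the SGD update for this objective can be sketched as follows. This is not part of prog-template.py; the synthetic data, the choice $\lambda = 0.1$, and the step-size schedule $\eta_t = 1/(\lambda t)$ are illustrative assumptions.

```python
import numpy as np

# Minimal SGD sketch for
#   (1/n) sum_i max(0, 1 - w^T x_i y_i)^2 + (lam/2) ||w||^2
# Synthetic data; lam and the step-size schedule are illustrative choices.
rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
Y = np.sign(X @ w_true)

def primal(w):
    margins = np.maximum(0.0, 1.0 - (X @ w) * Y)
    return np.mean(margins**2) + 0.5 * lam * w @ w

w = np.zeros(d)
for t in range(1, 5000):
    i = rng.integers(n)
    margin = 1.0 - (X[i] @ w) * Y[i]
    # gradient of max(0, 1 - w^T x y)^2 is -2 max(0, margin) * y x
    grad = -2.0 * max(0.0, margin) * Y[i] * X[i] + lam * w
    w -= grad / (lam * t)  # eta_t = 1/(lam t), common for strongly convex SGD

print(primal(w))
```

The stochastic gradient of a single term $\max(0, 1 - w^\top x_i y_i)^2$ is $-2\max(0, 1 - w^\top x_i y_i)\, y_i x_i$, to which the regularizer gradient $\lambda w$ is added.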
2. (6 points) Consider the minimax problem
$$\min_{x} \max_{y \in C} \left[ x^\top A y + b^\top y + \frac{1}{2}\|x\|_2^2 \right], \qquad C = \{y : y_j \ge 0\}.$$
• (2 points) Write down the optimal solution of $x$ as a function of $y$. Write the optimization problem in terms of $y$ by eliminating $x$. Explain the derivation.
• (1 point) Write down the GDA update rule for this problem with learning rate $\eta$.
• (3 points) Implement GDA, extra gradient, and optimistic GDA in prob2() of prog-template.py, and plot the convergence curves (wrt the gradient norm of $x$ and $y$; the 2-norm of $x - Ay$; $b^\top y - \frac{1}{2}\|Ay\|_2^2$) for 100 iterations.
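A minimal sketch of plain GDA on this problem follows; $A$, $b$, $\eta$, and the iteration count are illustrative assumptions, and $y$ is kept in $C$ by projection onto the nonnegative orthant. (Extra gradient and optimistic GDA differ only in the point at which the gradients are evaluated.)

```python
import numpy as np

# GDA sketch for min_x max_{y >= 0} [x^T A y + b^T y + 0.5 ||x||^2].
# A, b, eta are illustrative; y stays in C = {y : y_j >= 0} via projection.
rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d))
b = rng.standard_normal(d)
eta = 0.05

x = np.zeros(d)
y = np.ones(d)
for _ in range(100):
    gx = A @ y + x      # gradient of the objective in x
    gy = A.T @ x + b    # gradient of the objective in y
    x = x - eta * gx                    # descent step in x
    y = np.maximum(0.0, y + eta * gy)   # ascent step in y, projected

print(np.linalg.norm(x + A @ y))  # vanishes at the saddle, where x = -Ay
```

Since the $x$-gradient is $Ay + x$, the saddle point satisfies $x = -Ay$, so the printed residual measures distance from stationarity in $x$.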
3. (6 points) Consider zero-th order optimization.
• (2 points) In Theorem 18.6, suppose we further assume that $f(x)$ is $\lambda$-strongly convex, and take $\eta_t = (\lambda t)^{-1}$. Derive the corresponding convergence result.
• (2 points) Consider the optimization problem
$$\min_{\theta} \; \mathbb{E}_{x \sim \pi(x|\theta)} f(x),$$
where $x \in \mathbb{R}^d$. Assume we want to solve this problem using policy gradient, with $\theta = (\mu, \rho)$, where $\mu$ is a $d$-dimensional vector and $\rho \in \mathbb{R}$; both are part of the model parameters. Consider the distribution $\pi(x|\theta) = N(\mu, e^{-\rho} I)$. Derive the policy gradient update rule for $\theta$, including both $\mu$ and $\rho$.
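For orientation, the score-function estimator $\nabla_\theta \mathbb{E} f(x) = \mathbb{E}[f(x)\nabla_\theta \log \pi(x|\theta)]$ for this Gaussian family can be sketched as follows. The scores follow from $\log \pi(x|\theta) = \text{const} + \frac{d}{2}\rho - \frac{e^{\rho}}{2}\|x-\mu\|^2$; the toy objective $f$, the step size, and the sample count are assumptions.

```python
import numpy as np

# Score-function sketch for pi(x|theta) = N(mu, e^{-rho} I):
#   grad_mu  log pi = e^{rho} (x - mu)
#   grad_rho log pi = d/2 - (e^{rho}/2) ||x - mu||^2
# f, the step size, and the sample count are illustrative assumptions.
rng = np.random.default_rng(2)
d = 3
f = lambda x: np.sum((x - 1.0) ** 2)  # toy objective, minimized at x = 1

mu, rho = np.zeros(d), 0.0
lr, n_samples = 0.05, 64
for _ in range(500):
    std = np.exp(-rho / 2.0)
    xs = mu + std * rng.standard_normal((n_samples, d))
    fx = np.array([f(x) for x in xs])
    diff = xs - mu
    g_mu = np.mean(fx[:, None] * np.exp(rho) * diff, axis=0)
    g_rho = np.mean(fx * (d / 2.0 - 0.5 * np.exp(rho) * np.sum(diff**2, axis=1)))
    mu -= lr * g_mu
    rho -= lr * g_rho

print(mu)  # drifts toward the minimizer of f
```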
• (2 points) Consider the zero-th order optimization problem over the discrete set $x \in \{0,1\}^d$. Implement the policy gradients in Example 18.10 and Example 18.11 to solve the objective function
$$\min_{x \in \{0,1\}^d} f(x), \qquad f(x) = \left[ \frac{1}{2} x^\top A x - b^\top x + c \right]$$
in prob3() of prog-template.py, plot the convergence curves (wrt $f(x)$), and report your $x^*$, $\theta^*$ (refer to the Example, $p(x_i = 1) = \theta_i$).
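Since Examples 18.10 and 18.11 live in the course notes, the following is only a generic score-function sketch for a product-Bernoulli policy with $p(x_i = 1) = \theta_i$, not necessarily identical to either Example; $A$, $b$, $c$, the baseline, and the step size are illustrative assumptions.

```python
import numpy as np

# Generic score-function sketch for min over x in {0,1}^d of f(x), with a
# product-Bernoulli policy p(x_i = 1) = theta_i:
#   grad_{theta_i} log p(x) = x_i / theta_i - (1 - x_i) / (1 - theta_i)
# A, b, c, the baseline, and the step size are illustrative assumptions.
rng = np.random.default_rng(3)
d = 4
A = np.eye(d)
b = np.array([1.0, 0.0, 1.0, 0.0])  # with A = I, the minimizer is x = b
c = 0.0
f = lambda x: 0.5 * x @ A @ x - b @ x + c

theta = np.full(d, 0.5)
lr, n_samples = 0.1, 64
for _ in range(300):
    xs = (rng.random((n_samples, d)) < theta).astype(float)
    fx = np.array([f(x) for x in xs])
    fx -= fx.mean()  # baseline to reduce variance
    score = xs / theta - (1.0 - xs) / (1.0 - theta)
    theta -= lr * np.mean(fx[:, None] * score, axis=0)
    theta = np.clip(theta, 0.01, 0.99)  # keep probabilities valid

print(np.round(theta))
```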
4. (6 points) Consider the setting of decentralized computing, where we are given $m$ nodes. A vector $x = [x_1, \ldots, x_m]$ has $m$ components, and each node contains a local component $x_i$ of the vector, with local objective function $f_i(x_i) + g_i(x_i)$. At any time step, in addition to local algebraic operations, we can perform the following function calls simultaneously on all nodes:
• (gradient computation) call grad(x): each node computes the local gradient $\nabla f_i(x_i)$.
• (proximal mapping) call prox(η, z): each node computes the local proximal mapping
$$\arg\min_{u_i} \left[ \tfrac{1}{2}\|u_i - z_i\|_2^2 + \eta\, g_i(u_i) \right].$$
• (communication) call communicate(z) $= [(z_{i-1} + z_{i+1})/2]_{i=1,\ldots,m}$: each node sends its local vector $z_i$ over the network, then receives vectors from the neighboring nodes $i-1$ and $i+1$ via the network, and computes the average $(z_{i-1} + z_{i+1})/2$ (where $z_0 = z_m$ and $z_{m+1} = z_1$).
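For concreteness, the ring-averaging communication call can be sketched in a few lines; storing the $m$ local vectors as rows of one array is an implementation assumption (in a real deployment each row lives on its own node).

```python
import numpy as np

# Sketch of the ring-averaging primitive described above: row i of z is the
# local vector on node i; neighbors wrap around (z_0 = z_m, z_{m+1} = z_1).
def communicate(z):
    return 0.5 * (np.roll(z, 1, axis=0) + np.roll(z, -1, axis=0))

z = np.arange(6, dtype=float).reshape(6, 1)  # m = 6 nodes, scalar components
print(communicate(z).ravel())
```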
If we have a variable $w = [w_1, \ldots, w_m]$, with $w_i$ stored on node $i$, then node $j \neq i$ cannot access the information $w_i$ on node $i$ directly, except through calling communicate(). We want to use the above function calls to jointly optimize the following objective function:
$$\min_{x,z} \; \sum_{i=1}^{m} \left[ f_i(x_i) + g_i(z_i) \right] \quad \text{subject to} \quad x_1 = x_2 = \cdots = x_m = z_1 = \cdots = z_m,$$
by rewriting the above problem as
$$\min_{x,z} \; f(x) + g(z) \quad \text{subject to} \quad \begin{bmatrix} 0 \\ I \end{bmatrix} x - \begin{bmatrix} B \\ I \end{bmatrix} z = 0,$$
with $[Bz]_i = z_i - (z_{i-1} + z_{i+1})/2$, $f(x) = \sum_i f_i(x_i)$, and $g(z) = \sum_i g_i(z_i)$.
• (3 points) Write down an algorithm for decentralized optimization using linearized ADMM. Try to combine redundant communications so that no more than two communication calls are needed for each gradient computation.
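The consensus operator $B$ in the rewritten constraint can be built directly from the communication primitive, since $[Bz]_i = z_i - (z_{i-1} + z_{i+1})/2$; a small sketch (the row-stacked array layout is an assumption):

```python
import numpy as np

# Consensus operator B from the rewritten constraint, built from the
# ring-average call: [Bz]_i = z_i - (z_{i-1} + z_{i+1})/2.
def communicate(z):
    return 0.5 * (np.roll(z, 1, axis=0) + np.roll(z, -1, axis=0))

def B(z):
    return z - communicate(z)

m = 5
consensus = np.full((m, 1), 3.0)     # all local components equal
print(np.linalg.norm(B(consensus)))  # consensus vectors lie in the null space
```

One application of $B$ (or of $B^\top$, which has the same stencil here) costs one communicate() call, which is what the "no more than two communication calls per gradient" bullet is budgeting.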
• (3 points) Implement the algorithm and plot convergence curves (wrt primal suboptimality) with different parameters ($\eta$ and $\rho$ in ADMM) to solve the objective function in prob4() of prog-template.py.