COMP6211E Assignment 4 Solved


Theoretical Problems 

  1. (4 points) Write down the ADMM algorithms for the following problems
    • (2 points) Let Σ̂ be a d × d symmetric positive semidefinite matrix. We want to solve the following problem with d × d symmetric positive definite matrices X and Z:

$$\min_{X,\,Z}\; -\log\det(X) + \operatorname{trace}(\hat{\Sigma} X) + \lambda \|Z\|_1 \quad \text{subject to}\quad X - Z = 0.$$

Write down the ADMM algorithm, and derive the closed-form solutions for the sub-optimization problems. (Input: ρ, Σ̂, X_0, T; Output: X_T.) A code sketch of this iteration is given at the end of this problem.

    • (2 points) Let x ∈ ℝ^d, b ∈ ℝ^d, let Σ be a d × d symmetric positive definite matrix, and let A be an m × d matrix. Assume that we want to solve the problem

$$\min_{x \in \mathbb{R}^d}\; \tfrac{1}{2}\, x^\top \Sigma x + b^\top x \quad \text{subject to}\quad \|Ax\| \le 1$$

using ADMM by rewriting it as follows:

$$\min_{x,\,z}\; \tfrac{1}{2}\, x^\top \Sigma x + b^\top x + I_{\{\|z\| \le 1\}}(z) \quad \text{subject to}\quad Ax - z = 0,$$

where the indicator function I_{{‖z‖ ≤ 1}}(z) equals 0 if ‖z‖ ≤ 1 and +∞ otherwise.

Write down the ADMM algorithm for this decomposition with closed-form solutions for the subproblems. (Input: ρ, Σ, A, b, x_0, T, m; Output: x_T.)
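For the first sub-problem (the sparse precision-matrix problem above), a minimal NumPy sketch of the ADMM iteration in scaled dual form is given below, assuming the graphical-lasso objective reconstructed above. The X-subproblem is solved in closed form through an eigendecomposition of ρ(Z − U) − Σ̂, and the Z-subproblem reduces to entrywise soft-thresholding; the function and variable names (admm_graphical_lasso, rho, lam) are illustrative rather than part of the assignment.

```python
import numpy as np

def soft_threshold(M, tau):
    """Entrywise soft-thresholding: the proximal operator of tau * ||.||_1."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def admm_graphical_lasso(Sigma_hat, lam, rho, T, X0=None):
    """ADMM for  min_{X,Z} -logdet(X) + trace(Sigma_hat @ X) + lam*||Z||_1  s.t. X - Z = 0.
    Returns X_T after T iterations (scaled dual form)."""
    d = Sigma_hat.shape[0]
    X = np.eye(d) if X0 is None else X0.copy()
    Z = X.copy()
    U = np.zeros((d, d))  # scaled dual variable
    for _ in range(T):
        # X-update: solve -X^{-1} + Sigma_hat + rho*(X - Z + U) = 0 via an eigendecomposition
        eigval, Q = np.linalg.eigh(rho * (Z - U) - Sigma_hat)
        x = (eigval + np.sqrt(eigval ** 2 + 4.0 * rho)) / (2.0 * rho)
        X = (Q * x) @ Q.T
        # Z-update: entrywise soft-thresholding of X + U at level lam / rho
        Z = soft_threshold(X + U, lam / rho)
        # dual ascent step
        U = U + X - Z
    return X
```

Each iteration costs one d × d eigendecomposition, which is what makes the closed-form X-update practical.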

  2. (4 points) Write down the SGD algorithm for the following problem. Consider the k-class linear structured SVM problem, which has the following loss function:

,

with the constraints

 .

  • (2 points) Write down the SGD procedure with batch size 1 (single-step update rule); see the sketch following this problem.
  • (2 points) Assume that sup_y ‖ψ(x_i, y)‖_2 ≤ A and that you have a budget of T total gradient evaluations. How do you set the learning rate in your SGD procedure, and what convergence rate do you expect based on the lecture?
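A minimal sketch of the batch-size-1 update is given below. It assumes the standard multiclass hinge formulation with joint feature map ψ(x, y), i.e. the per-example loss max(0, max_{y≠y_i}(1 + w^⊤ψ(x_i, y)) − w^⊤ψ(x_i, y_i)); this assumed loss, and the names psi, num_classes, and eta_t, are illustrative rather than taken from the assignment.

```python
import numpy as np

def sgd_step_multiclass_svm(w, x_i, y_i, psi, num_classes, eta_t):
    """One SGD step (batch size 1), assuming the per-example loss
    max(0, max_{y != y_i} (1 + w.psi(x_i, y)) - w.psi(x_i, y_i))."""
    score_correct = w @ psi(x_i, y_i)
    # find the most violating alternative label
    others = [y for y in range(num_classes) if y != y_i]
    scores = np.array([1.0 + w @ psi(x_i, y) for y in others])
    y_hat = others[int(np.argmax(scores))]
    # subgradient: zero when the margin constraint is already satisfied
    if scores.max() - score_correct > 0:
        g = psi(x_i, y_hat) - psi(x_i, y_i)
    else:
        g = np.zeros_like(w)
    return w - eta_t * g
```

Since sup_y ‖ψ(x_i, y)‖_2 ≤ A, the subgradient above has norm at most 2A; that constant is what enters the usual step-size choice and the O(1/√T) guarantee for convex, non-smooth SGD.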


  3. (6 points) Consider the L1-L2 regularized loss minimization problem:

 ,

where x_i, w ∈ ℝ^d, ‖x_i‖_2 ≤ 1, y_i ∈ {±1}, λ < 1, and (u)_+ = max(0, u).

  • (2 points) Estimate a simple upper bound on the smoothness parameter and a simple lower bound on the strong convexity parameter. Estimate a simple bound on the SGD variance V.
  • (2 points) Write down the minibatch Accelerated Proximal SGD update rule with batch size m (see the first sketch following this problem). (Input: w_0, λ, {η_t}, T, m; Output: w_T; Assume λ > 0.) For T̃ = mT total gradient computations, what is the largest batch size you can choose? How do you want to set the constant learning rate and momentum parameters for this batch size, and what is the convergence rate? You may use O(·) notation to hide constants.
  • (2 points) Assume that λ = 0, and write down the minibatch stochastic RDA update rule for this problem with minibatch size m for T̃ = mT total gradients, by setting a constant θ and η_0, and calculate the corresponding setting for η_t. (Input: w_0, η_0 > 0, θ ∈ (0,1), T, m; Output: w_T.)

What is the largest batch size m, what are the corresponding θ and η_0, and what is the convergence rate in terms of T̃? You may use Õ(·), which hides a factor of ln T̃.

By comparing the solutions of the corresponding proximal operators, can you explain why dual averaging can achieve more sparsity when the weights are near zero?
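For the Accelerated Proximal SGD question, the sketch below shows the shape of one minibatch step: extrapolate with momentum, take a minibatch gradient of the smooth part, then apply the proximal operator of the L1 term. The smooth part is accessed through a user-supplied oracle grad_smooth, and the L1 coefficient mu is an illustrative placeholder; the sketch does not fix the exact regularization constants of the problem.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def accel_prox_sgd_step(w, w_prev, grad_smooth, batch, eta, beta, mu):
    """One minibatch Accelerated Proximal SGD step.
    grad_smooth(v, batch): minibatch stochastic gradient of the smooth part at v;
    eta: constant step size, beta: momentum parameter, mu: (assumed) L1 coefficient."""
    v = w + beta * (w - w_prev)                      # momentum / extrapolation point
    g = grad_smooth(v, batch)                        # minibatch gradient at the extrapolated point
    w_next = soft_threshold(v - eta * g, eta * mu)   # proximal (L1) step
    return w_next, w                                 # new iterate and previous iterate
```

As a rough guide, when the objective has smoothness L and strong convexity λ > 0, a standard choice from the deterministic accelerated analysis is a step size on the order of 1/L and momentum β = (√(L/λ) − 1)/(√(L/λ) + 1); the batch-size question then asks how large m can be before the stochastic variance term dominates that rate.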
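For the last question, it helps to write the two proximal solutions next to each other. The sketch below assumes an L1 regularizer μ‖w‖₁ (μ is an illustrative placeholder) and a quadratic prox term in the dual-averaging step: proximal SGD soft-thresholds the point w_t − η_t g_t at the level η_t μ, which vanishes as the step size shrinks, whereas RDA soft-thresholds the running average of the gradients at the level μ itself.

```python
import numpy as np

def soft_threshold(v, tau):
    """Entrywise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_sgd_step(w, g, eta, mu):
    """Proximal SGD: the truncation level eta * mu shrinks with the step size."""
    return soft_threshold(w - eta * g, eta * mu)

def rda_step(g_avg, t, gamma, mu):
    """Dual averaging (RDA) with regularizer mu * ||w||_1 and prox term (gamma / (2*sqrt(t))) * ||w||^2:
        w_{t+1} = argmin_w  <g_avg, w> + mu * ||w||_1 + (gamma / (2 * sqrt(t))) * ||w||^2.
    The truncation level on the averaged gradient is mu, independent of the step size."""
    return -(np.sqrt(t) / gamma) * soft_threshold(g_avg, mu)
```

Near w = 0 the proximal-SGD threshold η_t μ becomes tiny, so small noisy coordinates are rarely mapped exactly to zero; RDA keeps comparing the full gradient average against μ, so any coordinate whose average gradient stays within [−μ, μ] is set exactly to zero, which is why it typically produces sparser iterates.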

Programming Problem (6 points)

  • Use the MNIST class 1 (positive) versus class 7 (negative) data.
  • Use the Python template “progtemplate.py”, and implement the functions marked with ’# implement’.
  • (4 pts) Implement RDA-ACCL and ADMM-ACCL-linear, and compare the RDA-ACCL algorithm with different values of θ. Submit your code and outputs.
  • (2 pts) Compare your plots to the theoretical convergence rates in class, and discuss your experimental results.

