## Description

- In this question we consider the hard margin SVM formulation as developed in class and the textbook. Recall that we stated that for this type of optimization problem there is no duality gap; that is, the primal and dual objective functions have the same value at their optimal solutions. Use this fact and other facts developed in class to show that the maximum margin $\gamma^*$ can be calculated from the dual solution $\alpha^*$. More specifically, show that $\gamma^* = \left(\sum_i \alpha_i^*\right)^{-1/2}$.

- The soft margin SVM formulation developed in class called for minimizing $\frac{1}{2}\|v\|^2 + C\sum_i \xi_i$ subject to the constraints (for all $i$) $1 - \xi_i - t_i(v^T \phi(x_i) + v_0) \le 0$ and $\xi_i \ge 0$.

  Consider the alternative formulation QSVM, which uses a quadratic penalty on $\xi_i$: minimize $\frac{1}{2}\|v\|^2 + C\sum_i \xi_i^2$ subject to the constraints (for all $i$) $1 - \xi_i - t_i(v^T \phi(x_i) + v_0) \le 0$.

  Note that we do not need the constraint $\xi_i \ge 0$ in this formulation, because $\xi_i < 0$ both increases the objective function and makes the constraints harder to satisfy. Develop the dual formulation of QSVM: that is, (1) compute the dual objective function, (2) state the dual optimization problem, and (3) argue that QSVM is also a kernel method.
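As a hint for the first problem, the no-duality-gap argument can be sketched as follows (this assumes the standard hard margin primal, minimize $\frac{1}{2}\|v\|^2$ subject to $t_i(v^T\phi(x_i) + v_0) \ge 1$, whose optimal margin is $\gamma^* = 1/\|v^*\|$):

```latex
\begin{align*}
\text{dual objective:}\quad
  \tilde{L}(\alpha) &= \sum_i \alpha_i
   - \tfrac{1}{2}\sum_i \sum_j \alpha_i \alpha_j t_i t_j K(x_i, x_j),\\
\text{stationarity:}\quad
  v^* &= \sum_i \alpha_i^* t_i \phi(x_i)
  \;\Rightarrow\;
  \|v^*\|^2 = \sum_i \sum_j \alpha_i^* \alpha_j^* t_i t_j K(x_i, x_j),\\
\text{no duality gap:}\quad
  \tfrac{1}{2}\|v^*\|^2 &= \sum_i \alpha_i^* - \tfrac{1}{2}\|v^*\|^2
  \;\Rightarrow\;
  \|v^*\|^2 = \sum_i \alpha_i^*
  \;\Rightarrow\;
  \gamma^* = \Bigl(\sum_i \alpha_i^*\Bigr)^{-1/2}.
\end{align*}
```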
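For the QSVM part, a sketch of where the stationarity conditions lead (assuming the quadratic-penalty objective above, with multipliers $\alpha_i \ge 0$; substituting back into the Lagrangian to obtain the dual is left to the exercise):

```latex
\begin{align*}
L &= \tfrac{1}{2}\|v\|^2 + C\sum_i \xi_i^2
     + \sum_i \alpha_i\bigl(1 - \xi_i - t_i(v^T\phi(x_i) + v_0)\bigr),\\
\frac{\partial L}{\partial v} = 0
  &\;\Rightarrow\; v = \sum_i \alpha_i t_i \phi(x_i),
\qquad
\frac{\partial L}{\partial v_0} = 0
  \;\Rightarrow\; \sum_i \alpha_i t_i = 0,\\
\frac{\partial L}{\partial \xi_i} = 0
  &\;\Rightarrow\; \xi_i = \frac{\alpha_i}{2C}.
\end{align*}
% After substitution, the objective depends on the data only through
% K(x_i, x_j) -- which is the sense in which QSVM is a kernel method.
```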

- In this problem we show that many machine learning problems have solutions that can be expressed through kernels. Consider the standard formulation where $a_i = w^T \phi(x_i)$, and where we have a loss function $\ell(a_i, t_i)$. For example we might use the square loss $\ell(a_i, t_i) = (t_i - a_i)^2$, or the classification log loss $\ell(a_i, t_i) = -t_i \ln y_i - (1 - t_i)\ln(1 - y_i)$ where $y_i = \sigma(a_i)$, etc. But for this question we consider any loss function. In addition, consider any regularization function $R(b)$ which is monotonically increasing, where $b$ is a scalar.

  Now consider a machine learning objective of the form $\sum_i \ell(a_i, t_i) + R(w^T w)$.

  Show that the optimal solution can be expressed as $w = \sum_i \alpha_i \phi(x_i)$; that is, it is a weighted sum of training examples. Note that this shows that predictions on a test example $z$ can be computed as $w^T \phi(z) = \sum_i \alpha_i K(x_i, z)$.
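The prediction identity in this problem can also be checked numerically. Below is a small illustrative sketch (not part of the assignment; the quadratic feature map, the inputs, and the coefficients are made up for the demo) showing that a weight vector of the form $w = \sum_i \alpha_i \phi(x_i)$ makes $w^T\phi(z)$ computable from kernel evaluations alone:

```python
import math

# Explicit feature map for the kernel K(x, z) = (x*z + 1)^2 on scalars,
# since (x*z + 1)^2 = x^2 z^2 + 2 x z + 1 = phi(x) . phi(z).
def phi(x):
    return [x * x, math.sqrt(2) * x, 1.0]

def K(x, z):
    return (x * z + 1.0) ** 2

xs = [0.5, -1.0, 2.0]       # toy "training" inputs (made up)
alphas = [0.3, -0.7, 1.1]   # arbitrary dual coefficients (made up)

# Primal weight vector w = sum_i alpha_i * phi(x_i).
w = [sum(a * phi(x)[j] for a, x in zip(alphas, xs)) for j in range(3)]

z = 1.5                     # test input
primal = sum(wj * pj for wj, pj in zip(w, phi(z)))    # w^T phi(z)
dual = sum(a * K(x, z) for a, x in zip(alphas, xs))   # sum_i alpha_i K(x_i, z)

print(abs(primal - dual) < 1e-9)  # → True: the two predictions agree
```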

- In lecture we provided a general form for optimizing the hyper-parameters of Gaussian Process Regression (GPR). In general there is no closed form and we must use some iterative gradient-based optimization. Consider GPR where one hyperparameter, $\theta_0$, gives the overall scale of the covariance. Specifically, assume the covariance matrix has the form $C(x_n, x_m) = \theta_0\, k(x_n, x_m)$ where $k(\cdot, \cdot)$ is any valid kernel. Show that the marginal likelihood can be optimized w.r.t. $\theta_0$ in closed form, and develop the solution for $\theta_0$.
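As a worked sketch of the kind of manipulation involved (assuming the standard GPR log marginal likelihood over targets $\mathbf{t}$ with $N$ training points, and writing $\mathbf{C} = \theta_0 \mathbf{K}$ with $\mathbf{K}$ the Gram matrix of $k$):

```latex
\begin{align*}
\ln p(\mathbf{t}) &= -\tfrac{1}{2}\ln|\mathbf{C}|
  - \tfrac{1}{2}\,\mathbf{t}^T \mathbf{C}^{-1}\mathbf{t}
  - \tfrac{N}{2}\ln(2\pi)\\
 &= -\tfrac{N}{2}\ln\theta_0 - \tfrac{1}{2}\ln|\mathbf{K}|
  - \frac{1}{2\theta_0}\,\mathbf{t}^T \mathbf{K}^{-1}\mathbf{t}
  - \tfrac{N}{2}\ln(2\pi),\\
\frac{\partial \ln p(\mathbf{t})}{\partial \theta_0} = 0
 &\;\Rightarrow\;
 -\frac{N}{2\theta_0} + \frac{1}{2\theta_0^2}\,\mathbf{t}^T \mathbf{K}^{-1}\mathbf{t} = 0
 \;\Rightarrow\;
 \theta_0^* = \frac{\mathbf{t}^T \mathbf{K}^{-1}\mathbf{t}}{N}.
\end{align*}
```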
