Exercise Sheet 3
Exercise 1: Lagrange Multipliers
Let x1, . . . , xN ∈ Rd be a dataset of N data points. We consider the objective function
N
J(θ) = ||θ − xk||2
k=1
to be minimized with respect to the parameter θ ∈ Rd. In absence of constraints, the parameter θ that minimizes this objective is given by the empirical mean m = N1 Nk=1 xk. However, this is generally not the case when the parameter θ is constrained.
- (a) Using the method of Lagrange multipliers, find the parameter θ that minimizes J(θ) subject to the constraint θ⊤b = 0, with b some unit vector in Rd. Give a geometrical interpretation to your solution.
- (b) Using the same method, find the parameter θ that minimizes J(θ) subject to ||θ − c||2 = 1, where c is a vector in Rd different from m. Give a geometrical interpretation to your solution.
Exercise 2: Principal Component Analysis (10 + 10 P)
We consider a dataset x1, . . . , xN ∈ Rd. Principal component analysis searches for a unit vector u ∈ Rd such that projecting the data on that vector produces a distribution with maximum variance. Such vector can be found by solving the optimization problem:
argmax u⊤xk− uNN
k=1
(a) Show that the problem above can be rewritten as
arg max u⊤Su u
u⊤xl
1 N 1 N
2
with ∥u∥2=1
N
where S = (xk − m)(xk − m)⊤ is the scatter matrix, and m = N xk is the empirical mean.
k=1 k=1
(b) Show using the method of Lagrange multipliers that the problem above can be reformulated as solving the
eigenvalue problem
and retaining the eigenvector u associated to the highest eigenvalue λ.
Exercise 3: Bounds on Eigenvalues (5 + 5 + 5 + 5 P)
Let λ1 denote the largest eigenvalue of the matrix S. The eigenvalue λ1 quantifies the variance of the data when projected on the first principal component. Because its computation can be expensive, we study how the latter can be bounded with the diagonal elements of the matrix S.
- (a) Show that di=1 Sii is an upper bound to the eigenvalue λ1.
- (b) State the conditions on the data for which the upper bound is tight.
- (c) Show that maxdi=1 Sii is a lower bound to the eigenvalue λ1.
- (d) State the conditions on the data for which the lower bound is tight.
l=1
with
∥u∥2 = 1
1 N
Su = λu
Exercise 4: Iterative PCA (10 P)
When performing principal component analysis, computing the full eigendecomposition of the scatter matrix S is typically slow, and we are often only interested in the first principal components. An efficient procedure to find the first principal component is power iteration. It starts with a random unit vector w(0) ∈ Rd, and iteratively applies the parameter update
w(t+1) = Sw(t) ∥Sw(t)∥
until some convergence criterion is met. Here, we would like to show the exponential convergence of power
iteration. For this, we look at the error terms
w⊤u1
and observe that they should all converge to zero as w approaches the eigenvector u1 and becomes orthogonal
to other eigenvectors.
(a) Show that Ek(w(T)) = |λk/λ1|T ·Ek(w(0)), i.e. the convergence of the algorithm is exponential with the number of time steps T . (Hint: to show this, it is useful to rewrite the scatter matrix in terms of eigenvalues and eigenvectors, i.e. S = di=1 uiu⊤i λi.)
Exercise 5: Programming (30 P)
Download the programming files on ISIS and follow the instructions.
w ⊤ u k
Ek(w) = with k = 2,…,d,
Exercise sheet 3 (programming)
[WiSe 2021/22] Machine Learning 1
In [1]:
import numpy,sklearn,sklearn.datasets,utils
%matplotlib inline
Principal Component Analysis
In this exercise, we will experiment with two different techniques to compute the PCA components of a dataset:
Singular Value Decomposition (SVD)
Power Iteration: A technique that iteratively optimizes the PCA objective.
We consider a random subset of the Labeled Faces in the Wild (LFW) dataset, readily accessible from sklearn, and we apply some basic preprocessing to discount strong variations of luminosity and contrast.
In [2]:
D = sklearn.datasets.fetch_lfw_people(resize=0.5)[‘images’]
D = D[numpy.random.mtrand.RandomState(1).permutation(len(D))[:2000]]*1.0 D = D – D.mean(axis=(1,2),keepdims=True)
D = D / D.std(axis=(1,2),keepdims=True)
print(D.shape)
(2000, 62, 47)
Twofunctionsareprovidedforyourconvenienceandareavailablein utils.py thatisincludedintheziparchive.Thefunctionsarethefollowing:
utils.scatterplot producesascatterplotfromatwo-dimensionaldataset.
utils.render takesanarrayofdatapointsorobjectsofsimilarshape,andrendersthemintheIPythonnotebook.
Some demo code that makes use of these functions is given below.
In [3]:
utils.scatterplot(D[:,32,20],D[:,32,21]) # Plot relation between adjacent pixels utils.render(D[:30],15,2,vmax=5) # Display first 10 examples in the data
PCA with Singular Value Decomposition
Principal components can be found computing a singular value decomposition. Specifically, we assume a matrix Xˉ whose columns contain the data points represented as vectors, and where the data points have been centered (i.e. we have substracted to each of them the mean of the dataset). The matrix Xˉ is
of size d × N where d is the number of input features and N is the number of data points. This matrix, more specifically, the rescaled matrix Z = decomposed using singular value decomposition:
UΛV = Z The k principal components can then be found in the first k columns of the matrix U.
Tasks:
1 Xˉ is then √N
Compute the principal components of the data using the function numpy.linalg.svd .
Measure the computational time required to find the principal components. Use the function time.time() for that purpose. Do not include in your estimate the computation overhead caused by loading the data, plotting and rendering.
Plot the projection of the dataset on the first two principal components using the function utils.scatterplot .
Visualize the 60 leading principal components using the function utils.render .
Note that if the algorithm runs for more than 3 minutes, there may be some error in your implementation.
In [4]:
### REPLACE BY YOUR CODE
import solutions; solutions.pca_svd(D) ###
Time: 42.489 seconds
When looking at the scatter plot, we observe that much more variance is expressed in the first two principal components than in individual dimensions as it was plotted before. When looking at the principal components themselves which we render as images, we can see that the first principal components correspond to low-frequency filters that select for coarse features, and the following principal components capture progressively higher-frequency information and are also becoming more noisy.
PCA with Power Iteration (15 P)
The first PCA algorithm based on singular value decomposition is quite expensive to compute. Instead, the power iteration algorithm looks only for the first component and finds it using an iterative procedure. It starts with an initial weight vector w ∈ Rd, and repeatedly applies the update rule
w ← Sw / ‖
where S is the covariance matrix defined as S = \frac1N \bar{X}\bar{X}^\top. Like for standard PCA, the objective that iterative PCA optimizes is J(\boldsymbol{w}) = \boldsymbol{w}^\top S \boldsymbol{w} subject to the unit norm constraint for \boldsymbol{w}. We can therefore keep track of the progress of the algorithm after each iteration.
Tasks:
Implement the power iteration algorithm. Use as a stopping criterion the value of J(\boldsymbol{w}) between two iterations increasing by less than 0.01.
Print the value of the objective function J(\boldsymbol{w}) at each iteration.
Measure the time taken to find the principal component.
Visualize the the eigenvector \boldsymbol{w} obtained after convergence using the function utils.render . Note that if the algorithm runs for more than 1 minute, there may be some error in your implementation.
In [5]:
### REPLACE BY YOUR CODE
import solutions; solutions.pca_powit(D) ###
iteration 0 J(w) =
0.656
126.401
207.474
221.764
231.388
241.061
250.240
257.989
263.840
267.884
270.508
272.142
273.133
273.725
274.075
274.281
274.402
274.473
274.514
274.538
274.552
274.561
274.565
274.568
274.570
We observe that the computation time has decreased significantly. The difference of performance becomes larger as the number of dimensions and data points increases. We can observe that the principal component is the same (sometimes up to a sign flip) as the one obtained by the SVD algorithm.
iteration iteration iteration iteration iteration iteration iteration iteration iteration iteration 10 iteration 11 iteration 12 iteration 13 iteration 14 iteration 15 iteration 16 iteration 17 iteration 18 iteration 19 iteration 20 iteration 21 iteration 22 iteration 23 iteration 24 Time: 5.962 seconds
1 2 3 4 5 6 7 8 9
J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) = J(w) =
.,, A, n/3 -;,.,,n’- -1-
1.tt
I
:. o
,vJd.,..;;W
:::::.)
.J …) *e ;; 1 . ~ -AJ,-
..J …J _ ) 2.-0-1.,,,, +AJr
:::7
i
L.
+-1.t o
~
kiiv~ ·~ .,:j/,JJ… “”‘-J.A,, ~ .t.
‘° ,.r–:..)
,./
),-J ‘j((‘\)” (e)+ 1(lfJ”If0-~I/+-:\ll(-t1{“-
J J ( t , 1 ) < l ( – J ( t t , ~0 – . . l ” – – –:.___,-..:.:.._· .::: :–::::s I(J – m,I\ ::\ VI -cII – ·”‘
)
L)
(-)
->
.J
. c.ll
de J{9
~>
(11+~){e-c)-( -c) ~ . -~ -~
“””” 1±4-..,/,w.-.- ..-.A t ‘
;h~ ~,th _M ( Y h – ) r . A , _ – ~ ~
J(0J_ j.J-. t.
~~
=
c.)–:::. o ~o
-…) “-
1) C )
‘L IG
-~ – <..
111
i(e- -i/– (e- . …:. _.>
.J-21 ./ ..J _),
+~)(e-C)
( ,1
-6:) (-1+-:\)I-(
-..)
I! ~ – <.I) _
‘\.
=1
+ ~ \ ; ; . ~, . . . 1 1 ~ –
·
‘TN-~ ;tB1=1,e,~II’–~00
-..l
_;;
– –
11
,vl;t:…. .~
.
rtwi }v
.fl.
£,”




