COMP0085 Assignment 1: Approximate Inference

1.  A biochemical pathway

A friend who works for a pharmaceutical company has asked for your help analysing the concentrations of 9 species of molecule within a biochemical cascade. The process is a trade secret, so she refers to them by the letters A – I. She tells you that:

  • The concentrations of both species B and C depend on that of A.
  • Molecule C seems to redirect the process that produces enzyme D to produce enzyme E instead.
  • D catalyses the production of F from B, while E catalyses the production of H from G.
  • F and H then combine to form the end product I.

While these general facts are well established, the reaction kinetics are very sensitive to small temperature and stereochemical fluctuations, making each link rather noisy. This means that the parameters have not been well characterised.

(a) Using the information your friend has given you, draw a directed acyclic graph (DAG or Bayes net) showing how the concentrations of the 9 species depend on one another.

(b) Show:
• the moralised graph

• an efficient triangulation
• the resulting junction tree
• the junction tree redrawn as a factor graph.

(c) Find the (non-unique) smallest set of molecules such that, if the concentrations of the species within the set are known, the concentrations of the others would all be independent (conditioned on the measured ones).

Your friend has gathered data on the concentrations of some of the species over a number of experiments. She expresses each concentration as a perturbation around its mean value (the mean being calculated over a very large number of experiments). Let us write these concentration perturbations as δ[A] etc. She suggests that a reasonable parametric model for the dependence of each perturbation on that of its parents might be to take a weighted linear combination of the parent concentration perturbations and add a reaction-specific Gaussian noise variable.
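In symbols, the model she is suggesting for a species X with parents pa(X) in the graph would be something like

\delta[X] = \sum_{P \in \mathrm{pa}(X)} w_{P \to X} \, \delta[P] + \varepsilon_X, \qquad \varepsilon_X \sim \mathcal{N}(0, \sigma_X^2),

with the weights w and the noise variances being the uncharacterised, reaction-specific parameters.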

Unfortunately, the only measurements she is willing to give you are those of δ[B], δ[D], δ[E], and δ[G].

(d) Given the linear-Gaussian structure, you decide to try factor analysis on these measurements; that is, a model of the form

z \sim \mathcal{N}(0, I), \qquad x \mid z \sim \mathcal{N}(\Lambda z, \Psi)

where the covariance Ψ is diagonal, but with possibly unequal diagonal variances. Given the known graph, how many factors would you expect to recover? What would these correspond to? (A relevant identity is recalled after part (e).) [5 marks]

(e) Can the results of the factor analysis be used to recover the concentration perturbations of any other species in the cascade? What about the linear weights in the graph? Discuss the identifiability of each node and each weight in your DAG — can its value be determined up to an unknown scale factor? (Recall that factor analysis can never uniquely identify the scale of a latent parameter.) [10 marks]
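The identity referred to in part (d), worth keeping in mind for both parts: under the factor analysis model the marginal covariance of the observed vector is

\mathrm{Cov}[x] = \Lambda \Lambda^\top + \Psi,

so the data constrain Λ only through the product ΛΛ^⊤, which is the source of the rotation and scale ambiguities mentioned above.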

2. [35 marks] Bayesian linear and Gaussian process regression. The following time series of monthly mean global CO2 concentrations can be obtained from the file co2.txt (data downloaded from http://www.esrl.noaa.gov/gmd/ccgg/trends):

[Figure: global mean CO2 concentration (parts per million, roughly 330-420 ppm) against date (decimal year), roughly 1980-2025.]
We will apply Bayesian linear and Gaussian process regression to predict the CO2 concentration f(t) as a function of time t, given as the "decimal year" in the file.

(a) First we model the function using linear regression, that is, using the functional form

f(t) = a t + b + \varepsilon(t),

with i.i.d. noise residual ε(t) ∼ N(0, 1) and priors a ∼ N(0, 10²), b ∼ N(360, 100²). Compute and show the posterior mean and covariance over a and b given the CO2 data. [10 marks]
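One standard route to the posterior in part (a) is the Gaussian conjugacy result for linear regression: writing w = (a, b)^⊤, collecting the inputs into a design matrix Φ with rows (t_n, 1) and the observations into y, and using the prior w ∼ N(m_0, S_0) with m_0 = (0, 360)^⊤, S_0 = diag(10², 100²) and noise variance σ_n² = 1,

S_N = \left( S_0^{-1} + \sigma_n^{-2} \Phi^\top \Phi \right)^{-1}, \qquad m_N = S_N \left( S_0^{-1} m_0 + \sigma_n^{-2} \Phi^\top y \right).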

(b) Let a_MAP, b_MAP be the MAP estimate in the question above. The residual is the difference between the observed function values and the predicted mean function values,

g_obs(t) = f_obs(t) − (a_MAP t + b_MAP),

where f_obs(t) is the observed value of the CO2 concentration at time t.
Plot g_obs(t). Do you think these residuals conform to our prior over ε(t)? State, with justifications, which characteristics of the residual you think do or do not conform to our prior belief. [5 marks]

(c) Write a function to generate samples drawn from a GP. Specifically, given a covariance kernel function k(·,·) and a vector of input points x, return a function f(x) evaluated on the input points x, drawn randomly from a GP with the given covariance kernel and with zero mean. [10 marks]
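A minimal sketch of the kind of sampler part (c) asks for (illustrative only; the function name, the jitter constant and the double loop are choices of this sketch, not part of the specification):

function f = sample_gp(k, x)
% Draw one sample from a zero-mean GP with kernel k at the input points x.
n = numel(x);
K = zeros(n);
for i = 1:n
  for j = 1:n
    K(i,j) = k(x(i), x(j));        % fill the covariance matrix
  end
end
L = chol(K + 1e-8*eye(n), 'lower');  % small jitter for numerical stability
f = L * randn(n, 1);                 % f = L*z with z ~ N(0,I) gives f ~ N(0,K)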

(d) Test your function by plotting sample functions drawn from the following kernel, for various settings of the hyperparameters:

k(s,t) = \theta^2 \left( \exp\left( -\frac{2\sin^2(\pi(s-t)/\tau)}{\sigma^2} \right) + \phi^2 \exp\left( -\frac{(s-t)^2}{2\eta^2} \right) \right) + \zeta^2 \, \delta_{s=t}    (1)

Describe the characteristics of the drawn functions, and how the characteristics of the functions depend on the parameters. [5 marks]
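A sketch of eqn (1) as code, which may be useful for the plotting in part (d) (the anonymous-function style and the particular hyperparameter values are illustrative assumptions only):

k = @(s, t, theta, tau, sigma, phi, eta, zeta) ...
    theta^2 * ( exp(-2*sin(pi*(s - t)/tau).^2 / sigma^2) ...
              + phi^2 * exp(-(s - t).^2 / (2*eta^2)) ) ...
    + zeta^2 * (s == t);

% e.g. pair it with the sampler sketched above:
% x = linspace(1980, 2025, 300)';
% f = sample_gp(@(s,t) k(s, t, 1, 1, 1.5, 0.5, 20, 0.05), x); plot(x, f);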


(e) Suppose we were to consider modelling the residual function g(t) using a zero-mean GP with the covariance kernel above. Based on the plot of g(t) and your explorations in the preceding part, what do you think will be suitable values for the hyperparameters of k? [5 marks]

(f) [Bonus] Extrapolate the CO2 concentration levels to 2020 using the GP with covariance kernel k of eqn 1, and your chosen parameter values. Specifically, compute the predictive mean and variance of the residual g(t) for every month between September 2007 and December 2020, given the observed residuals g_obs(t). (The standard GP predictive equations are recalled after part (g) below.) Plot the means and one standard deviation error bars of the extrapolated CO2 concentration levels

f(t) = a_MAP t + b_MAP + g(t)

along with the observed CO2 levels. Does the behaviour of the extrapolation conform to your expectations? How sensitive are your conclusions to the settings of the kernel hyperparameters? [15 bonus marks]

(g) [Bonus] Why is the above procedure not fully Bayesian? How would we go about modelling f(t) in a Bayesian framework? [5 bonus marks]
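For the conditioning step in part (f), the standard zero-mean GP predictive equations are (a reminder rather than part of the question): with training inputs t, observations y = g_obs(t), and test inputs t_*,

\mathbb{E}[g(t_*)] = K_{*t} K_{tt}^{-1} y, \qquad \mathrm{Cov}[g(t_*)] = K_{**} - K_{*t} K_{tt}^{-1} K_{t*},

where [K_{tt}]_{nm} = k(t_n, t_m) (so the ζ² term acts as observation noise on its diagonal) and K_{*t}, K_{**} are the corresponding cross- and test-covariances. Whether ζ² is also added to the diagonal of K_{**} depends on whether one predicts noisy observations or the underlying residual function.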

3. [70 marks] Mean-field learning

Consider a binary latent factor model. This is a model with a vector s of K binary latent variables, s = (s_1, …, s_K), a real-valued observed vector x and parameters θ = {{μ_i, π_i}_{i=1}^K, σ²}. The model is described by:

p(s \mid \pi) = p(s_1, \ldots, s_K \mid \pi) = \prod_{i=1}^{K} p(s_i \mid \pi_i) = \prod_{i=1}^{K} \pi_i^{s_i} (1 - \pi_i)^{(1 - s_i)}

p(x \mid s_1, \ldots, s_K, \mu, \sigma^2) = \mathcal{N}\Big( \textstyle\sum_i s_i \mu_i, \; \sigma^2 I \Big)

where x is a D-dimensional vector and I is the D × D identity matrix. Assume you have a data set of N i.i.d. observations of x, i.e. X = {x^{(1)}, …, x^{(N)}}.
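As a concrete illustration of the generative process (a sketch with made-up sizes and parameter values; the actual training data come from genimages.m):

K = 8; D = 16; N = 100;
pie   = 0.3 * ones(1, K);     % Bernoulli priors pi_i
mu    = randn(D, K);          % columns are the feature vectors mu_i
sigma = 0.1;                  % observation noise standard deviation
S = double(rand(N, K) < repmat(pie, N, 1));   % binary latent vectors s^(n)
X = S * mu' + sigma * randn(N, D);            % x^(n) = sum_i s_i^(n) mu_i + noise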

Warning: Each question depends on earlier questions.

Hand in: Derivations, code and plots.

We will implement generalized EM learning using the fully factored (a.k.a. mean-field) variational approximation for the model above. That is, for each data point x^{(n)}, we will approximate the posterior distribution over the hidden variables by a distribution

q_n(s^{(n)}) = \prod_{i=1}^{K} \lambda_{in}^{s_i^{(n)}} (1 - \lambda_{in})^{(1 - s_i^{(n)})}

and find the λ^{(n)}'s that maximize F_n holding θ fixed.

(a) Write a function of the form:

[lambda,F] = MeanField(X,mu,sigma,pie,lambda0,maxsteps)

where lambda is N × K, F is the lower bound on the likelihood, X is the N × D data matrix, mu is the D × K matrix of means, pie is the 1 × K vector of priors on s, lambda0 holds the initial values of lambda, and maxsteps is the maximum number of steps of the fixed point equations. You might also want to set a convergence criterion so that the iterations halt if F changes by less than some very small number ε. (A generic identity that may help is recalled after part (d) below.) [20 marks]

(b) We have derived the M step for this model in terms of the quantities: X, ES = E_q[s], which is an N × K matrix of expected values, and ESS, which is an N × K × K array of the expected values E_q[ss^⊤] for each n. The full derivation is provided in Appendix B. Write two or three sentences discussing how the solution relates to linear regression and why. [5 marks]

(c) Using the above, we have implemented a function:

[mu, sigma, pie] = MStep(X,ES,ESS)

This can be implemented either taking in ESS as a K × K matrix (the ESS array defined above, summed over N), or taking in the full N × K × K array. This code can be found in Appendix C and on the web site. Study this code and work out its computational complexity in terms of N, K and D for the case where ESS is K × K. Write out and justify the computational complexity; don't assume that any of N, K, or D is large compared to the others. [5 marks]

(d) Examine the data images.jpg shown on the web site (do not look at genimages.m yet!). This shows 100 greyscale 4 × 4 images generated by randomly combining several features and adding a little noise. Try to guess what these features are by staring at the images. How many are there? Would you expect factor analysis to do a good job modelling this data? How about ICA? A mixture of Gaussians? Explain your reasoning. [10 marks]
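The generic identity referred to in part (a): for any fully factored q_n, holding the other factors fixed, the optimal update of the i-th factor satisfies

\log q_n(s_i^{(n)}) = \big\langle \log p(s^{(n)}, x^{(n)} \mid \theta) \big\rangle_{q_n(s^{(n)}_{\setminus i})} + \text{const},

from which the fixed-point equations for the λ_{in} in this particular model follow; deriving and implementing them is the exercise.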

(e) Put the E step and M step code together into a function:

[mu, sigma, pie] = LearnBinFactors(X,K,iterations)

where K is the number of binary factors, and iterations is the maximum number of iterations of EM. Include a check that F increases at every iteration (this is a good debugging tool; a skeleton illustrating such a check is sketched after part (g)). [10 marks]

(f) Run your algorithm for learning the binary latent factor model on the data set generated by genimages.m. What features mu does the algorithm learn (rearrange them into 4 × 4 images)? How could you improve the algorithm and the features it finds? Explain any choices you make along the way and the rationale behind them (e.g. what to set K to, how to initialize parameters, hidden states, and lambdas). [10 marks]

(g) For the setting of the parameters learned in the previous step, run the variational approximation for just the first data point (i.e. to find q_1(s^{(1)}); i.e. set N = 1). Convergence of a variational approximation results when the value of the λ's as well as F stops changing. Plot F and log(F(t) − F(t−1)) as a function of iteration number t for MeanField. How rapidly does it converge? Plot F for three widely varying sigmas. How is this affected by increases and decreases of sigma? Why? Support your arguments. [10 marks]
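The skeleton referred to in part (e), as a rough sketch only (the initialisation, the inner number of mean-field steps, and the way ESS is assembled from lambda are all illustrative choices, not prescriptions):

function [mu, sigma, pie] = LearnBinFactors(X, K, iterations)
% Alternate the variational E step and the M step, checking F never decreases.
[N, D] = size(X);
mu = randn(D, K); sigma = 1; pie = ones(1, K)/2;   % crude initialisation
lambda = rand(N, K);
Fold = -Inf;
for it = 1:iterations
  [lambda, F] = MeanField(X, mu, sigma, pie, lambda, 50);   % E step
  if F < Fold, warning('F decreased at iteration %d', it); end
  Fold = F;
  ES  = lambda;                     % E_q[s_i] = lambda_i
  ESS = ES' * ES;                   % off-diagonal: sum_n lambda_i lambda_j
  ESS(1:K+1:end) = sum(ES, 1);      % diagonal: E[s_i^2] = E[s_i], summed over n
  [mu, sigma, pie] = MStep(X, ES, ESS);                     % M step
end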

4. [Bonus: 40 marks] Variational Bayes for binary factors
For the model of the previous question:

(a) Derive a (variational) Bayesian hyperparameter optimisation algorithm to automatically determine K, the number of hidden binary variables in this model, following the approach discussed in lecture for factor analysis. [10 marks]

(b) Implement your algorithm, and demonstrate results on the data from the previous question. Fit models with a range of maximum values for K (from half the true number to twice or more).
Provide a figure showing the VB free energy of each model as a function of iteration, as well as a measure of the effective number of factors at each point. Interpret the figure: does the ordering of free energies, and their relationship to the effective number of factors, make sense? [30 marks]

5. [30 marks] EP for the binary factor model
Now derive an EP algorithm to infer the marginals on the source variables in the binary latent factor model above.

(a) First, write down the log-joint probability for a single observation-source pair, log p(s, x). Rearrange the terms to form a sum of log-factors on s (assuming x is observed), each defined either on a single source variable or on a pair:

\log p(s, x) = \sum_i \log f_i(s_i) + \sum_{ij} \log g_{ij}(s_i, s_j).

Relate your result to the Boltzmann machine. [Remember that, since the sources s are binary, s_i² = s_i.] [5 marks]
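A step that may make the grouping in part (a) easier to see (standard algebra rather than the full answer): expanding the Gaussian quadratic term of the log-joint gives

-\frac{1}{2\sigma^2} \Big\| x - \sum_i s_i \mu_i \Big\|^2 = -\frac{1}{2\sigma^2} \Big( x^\top x - 2 \sum_i s_i \, \mu_i^\top x + \sum_{i,j} s_i s_j \, \mu_i^\top \mu_j \Big),

and the i = j terms can be folded into the singleton factors because s_i² = s_i.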

(b) Next, derive a message passing scheme to find iterative approximations f̃_i and g̃_ij to each factor. Start your derivation from the KL divergence KL[p‖q] and identify clearly each time you make an approximate step. You don't need to make all of the EP approximations: which one(s) is (are) missing?
Give the final message-passing scheme in terms of updates to the natural parameters of the site approximations. There will be two different types of update: for the f̃_i and the g̃_ij respectively. [10 marks]

(c) Rewrite your message passing approximation to use factored approximate messages. Explain how this leads to a loopy BP algorithm. [5 marks]

(d) Describe a Bayesian method for selecting K, the number of hidden binary variables, using EP. Does your method pose any computational difficulties and, if so, how would you tackle them? [10 marks]

6. [Bonus: 50 marks] Implement the EP/loopy-BP algorithm that you derived in the previous question, and compare your results to those of the variational mean-field algorithm.

Appendix: M-step for Assignment [5]
Iain Murray, December 2003 (modified to match updated notation in 2006)
A Background

The generative model under consideration has a vector of K binary latent variables s. Each D-dimensional data point x^{(n)} is generated using a new hidden vector, s^{(n)}. Each s^{(n)} is identically and independently distributed according to:

P(s^{(n)} \mid \pi) = \prod_{i=1}^{K} \pi_i^{s_i^{(n)}} (1 - \pi_i)^{(1 - s_i^{(n)})}.    (2)

Once s^{(n)} has been generated, the data point is created according to the Gaussian distribution:

p(x^{(n)} \mid s^{(n)}, \mu, \sigma^2) = (2\pi\sigma^2)^{-D/2} \exp\left\{ -\frac{1}{2\sigma^2} \Big( x^{(n)} - \sum_{i=1}^{K} s_i^{(n)} \mu_i \Big)^{\!\top} \Big( x^{(n)} - \sum_{i=1}^{K} s_i^{(n)} \mu_i \Big) \right\}.    (3)

When this process is repeated we end up obtaining a set of visible data X = {x^{(1)}, …, x^{(N)}} generated by a set of N binary vectors S = {s^{(1)}, …, s^{(N)}} and some model parameters θ = {μ, σ², π}, which are constant across all the data. Given just X, both S and θ are unknown. We might want to find the set of parameters that maximise the likelihood function P(X|θ); "the parameters that make the data probable". EM is an approach towards this goal which takes our knowledge about the uncertain S into account. In the EM algorithm we optimise the objective function

F(q, \theta) = \langle \log p(S, X \mid \theta) \rangle_{q(S)} - \langle \log q(S) \rangle_{q(S)}
             = \sum_n \big\langle \log p(s^{(n)}, x^{(n)} \mid \theta) \big\rangle_{q(s^{(n)})} - \sum_n \big\langle \log q(s^{(n)}) \big\rangle_{q(s^{(n)})},    (4)

alternately increasing F by changing the distribution q(S) in the "E-step", and the parameters in the "M-step". This document gives a derivation and Matlab implementation of the M-step. In this assignment you will implement a variational E-step and apply this EM algorithm to a data set.


B M-step derivation

Here we maximise F with respect to each of the parameters using differentiation. This only requires the term with θ dependence:

\sum_n \big\langle \log p(s^{(n)}, x^{(n)} \mid \theta) \big\rangle_{q(s^{(n)})} = \sum_n \big\langle \log p(x^{(n)} \mid s^{(n)}, \theta) + \log P(s^{(n)} \mid \pi) \big\rangle_{q(s^{(n)})}.    (5)

Substituting the given distributions from equations 3 and 2 gives:

= -\frac{ND}{2} \log 2\pi - ND \log \sigma
  - \frac{1}{2\sigma^2} \left[ \sum_{n=1}^{N} x^{(n)\top} x^{(n)} + \sum_{i,j} \mu_i^\top \mu_j \sum_{n=1}^{N} \big\langle s_i^{(n)} s_j^{(n)} \big\rangle_{q(s^{(n)})} - 2 \sum_i \mu_i^\top \sum_{n=1}^{N} \big\langle s_i^{(n)} \big\rangle_{q(s^{(n)})} x^{(n)} \right]
  + \sum_{i=1}^{K} \left[ \log \pi_i \sum_{n=1}^{N} \big\langle s_i^{(n)} \big\rangle_{q(s^{(n)})} + \log(1 - \pi_i) \left( N - \sum_{n=1}^{N} \big\langle s_i^{(n)} \big\rangle_{q(s^{(n)})} \right) \right].    (6)

From this we can obtain all the required parameter settings:

\frac{\partial F}{\partial \pi_i} = \frac{1}{\pi_i} \sum_{n=1}^{N} \big\langle s_i^{(n)} \big\rangle_{q(s^{(n)})} - \frac{1}{1 - \pi_i} \left[ N - \sum_{n=1}^{N} \big\langle s_i^{(n)} \big\rangle_{q(s^{(n)})} \right] = 0    (7)

\Rightarrow \quad \pi_i = \frac{1}{N} \sum_{n=1}^{N} \big\langle s_i^{(n)} \big\rangle_{q(s^{(n)})}, \quad \text{i.e.}\quad \pi = \frac{1}{N} \sum_{n=1}^{N} \big\langle s^{(n)} \big\rangle_{q(s^{(n)})};    (8)

\frac{\partial F}{\partial \mu_i} = -\frac{1}{\sigma^2} \sum_{n=1}^{N} \left[ \sum_j \big\langle s_i^{(n)} s_j^{(n)} \big\rangle_{q(s^{(n)})} \mu_j - \big\langle s_i^{(n)} \big\rangle_{q(s^{(n)})} x^{(n)} \right] = 0    (9)

\Rightarrow \quad \mu_j = \sum_i \left[ \sum_{n=1}^{N} \big\langle s^{(n)} s^{(n)\top} \big\rangle_{q(s^{(n)})} \right]^{-1}_{ji} \sum_{n=1}^{N} \big\langle s_i^{(n)} \big\rangle_{q(s^{(n)})} x^{(n)};    (10)

and

\frac{\partial F}{\partial \sigma} = 0 \;\Rightarrow\; \sigma^2 = \frac{1}{ND} \left[ \sum_{n=1}^{N} x^{(n)\top} x^{(n)} + \sum_{i,j} \mu_i^\top \mu_j \sum_{n=1}^{N} \big\langle s_i^{(n)} s_j^{(n)} \big\rangle_{q(s^{(n)})} - 2 \sum_i \mu_i^\top \sum_{n=1}^{N} \big\langle s_i^{(n)} \big\rangle_{q(s^{(n)})} x^{(n)} \right].    (11)

Note that the required sufficient statistics of q(S) are \sum_{n=1}^{N} \langle s^{(n)} \rangle_{q(s^{(n)})} and \sum_{n=1}^{N} \langle s^{(n)} s^{(n)\top} \rangle_{q(s^{(n)})}. In the code these are known as ES and ESS.

All of the sums above can be interpreted as matrix multiplication or trace operations. This means that each of the parameter updates (8), (10) and (11) above can neatly be computed in one line of Matlab.

C M-step code

MStep.m

% [mu, sigma, pie] = MStep(X,ES,ESS)
%
% Inputs:
% -----------------
% X     NxD data matrix
% ES    NxK matrix of E_q[s]
% ESS   KxK sum over data points of E_q[ss'] (or the full NxKxK array;
%       if E_q[ss'] is provided, the sum over N is done for you)
%
% Outputs:
% --------
% mu    DxK matrix of means in p(x|{s_i},mu,sigma)
% sigma 1x1 standard deviation in same
% pie   1xK vector of parameters specifying generative distribution for s

function [mu, sigma, pie] = MStep(X, ES, ESS)

[N, D] = size(X);
if (size(ES,1) ~= N), error('ES must have the same number of rows as X'); end
K = size(ES,2);
if (isequal(size(ESS), [N,K,K])), ESS = shiftdim(sum(ESS,1), 1); end
if (~isequal(size(ESS), [K,K]))
  error('ESS must be square and have the same number of columns as ES');
end

mu    = (inv(ESS) * ES' * X)';                                                % eqn 10
sigma = sqrt((trace(X'*X) + trace(mu'*mu*ESS) - 2*trace(ES'*X*mu)) / (N*D));  % eqn 11
pie   = mean(ES, 1);                                                          % eqn 8
