## Description

In this programming project we will work with Generalized Linear Models (GLMs) as covered in class, including logistic regression, Poisson regression, and ordinal regression. Your goal is to use one generic implementation of the main algorithm that works for multiple observation likelihoods.

## Data

Data for this assignment is provided in a zip file pp3data.zip on Canvas. Each dataset is given in two files, with the data in one and the labels in the other. We will use the datasets A and usps for classification with logistic regression, the dataset AP for count prediction with Poisson regression, and the dataset AO for ordinal prediction with ordinal regression.

The datasets A, AP, and AO were artificially generated with labels that are not perfectly matched to any linear predictor, yet they are generated to be somewhat predictable. The examples in usps represent 16×16 bitmaps of the characters 3 and 5 and are taken from the well-known USPS dataset (data originally used for zip code classification).

## Implementing our variant of GLM

In this assignment we will use a Bayesian (or regularized) version of the algorithm with prior $w \sim \mathcal{N}(0, \alpha^{-1} I)$, where $\alpha = 10$, to calculate the MAP solution $w_{MAP}$.

- As discussed in class, logistic regression (and by extension GLM) relies on a free parameter ($w_0$) to capture an appropriate separating hyperplane. Therefore, you will need to add a feature fixed at one (also known as an intercept) to all datasets in the assignment. To match the test case below, please add this as the first column in the data matrix.
- The vector of first derivatives of the log posterior is $g = \frac{\partial \mathrm{LogL}}{\partial w} = \sum_i d_i \phi(x_i) - \alpha w = \Phi^T d - \alpha w$, where $d$ is a vector whose elements are $d_i$.
- The matrix of second derivatives of the log posterior is $H = \frac{\partial \mathrm{LogL}}{\partial w\,\partial w^T} = -\sum_i r_i \phi(x_i)\phi(x_i)^T - \alpha I = -\Phi^T R \Phi - \alpha I$, where $R$ is a diagonal matrix with elements $r_i$ on the diagonal.
- The GLM algorithm initializes the weight vector as $w = 0$ and then repeatedly applies an update with Newton's method, $w \leftarrow w - H^{-1} g$, until $w$ converges.
- For this assignment we consider that $w$ has converged if . If $w$ has not converged in 100 iterations we stop and output the last $w$ as our solution.
- The final vector when the algorithm stops is $w_{MAP}$. In this assignment we will use $w_{MAP}$ for prediction (i.e., we will not calculate a predictive distribution).
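The Newton iteration above can be sketched generically. This is a sketch assuming NumPy; `d_fn` and `r_fn` are placeholders for the likelihood-specific derivative terms, and the relative-change convergence test shown here is only illustrative (use the criterion specified for the assignment):

```python
import numpy as np

def fit_glm_map(Phi, t, d_fn, r_fn, alpha=10.0, max_iter=100, tol=1e-3):
    """Generic Newton's-method MAP estimate for a GLM (sketch).

    Phi  : (n, m) design matrix, first column all ones for the intercept w0
    t    : (n,) labels
    d_fn : callable (w, Phi, t) -> vector of first-derivative terms d_i
    r_fn : callable (w, Phi, t) -> vector of second-derivative terms r_i
    """
    n, m = Phi.shape
    w = np.zeros(m)                                  # initialize w = 0
    for it in range(max_iter):
        d = d_fn(w, Phi, t)
        r = r_fn(w, Phi, t)
        g = Phi.T @ d - alpha * w                    # g = Phi^T d - alpha w
        H = -(Phi.T * r) @ Phi - alpha * np.eye(m)   # H = -Phi^T R Phi - alpha I
        w_new = w - np.linalg.solve(H, g)            # Newton step w <- w - H^{-1} g
        # illustrative stopping rule on the relative change in w
        if np.linalg.norm(w_new - w) / max(np.linalg.norm(w), 1e-10) < tol:
            return w_new, it + 1
        w = w_new
    return w, max_iter
```

Note that `np.linalg.solve(H, g)` is used instead of explicitly inverting $H$, which is both cheaper and numerically safer.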

## Likelihood models

- In order to apply the algorithm for any likelihood model and to evaluate its predictions we need to specify 4 items: (1) $d_i$, (2) $r_i$, (3) how to compute our prediction $\hat{t}$ for test example $z$, and (4) how to calculate the error when we predict $\hat{t}$ and the true label is $t$.
- For the logistic likelihood we have: $y_i = \sigma(w^T \phi(x_i))$, and the first derivative term is $d_i = t_i - y_i$, or $d = t - y$ in vector form. The second derivative term is $r_i = y_i(1 - y_i)$. For test example $z$ we predict $\hat{t} = 1$ iff $p(t = 1) = \sigma(w_{MAP}^T \phi(z)) \ge 0.5$. The error is 1 if $\hat{t} \neq t$.
- Note that for the logistic model the update formula as developed in class is $w_{n+1} \leftarrow w_n - (-\alpha I - \Phi^T R \Phi)^{-1}\,[\Phi^T (t - y) - \alpha w_n]$. You might want to start developing your code and testing it with this special case and then generalize it to handle all likelihoods. To help you test your implementation of this algorithm we provide an additional dataset, *irlstest*, and a solution weight vector in *irlsw* (for $\alpha = 0.1$). The first entry in *irlsw* corresponds to $w_0$.
- For the Poisson likelihood we have: $y_i = e^{w^T \phi(x_i)}$, and the first derivative term is $d_i = t_i - y_i$, or $d = t - y$. The second derivative term is $r_i = y_i$. For test example $z$ we have $p(t) = \mathrm{Poisson}(\lambda)$, where $a = w_{MAP}^T \phi(z)$ and $\lambda = e^a$. We predict the mode $\hat{t} = \lfloor \lambda \rfloor$. For this assignment we will use the absolute error: err $= |\hat{t} - t|$.
- For the ordinal model with $K$ levels we have parameters $s$ and $\phi_0 = -\infty < \phi_1 < \dots < \phi_{K-1} < \phi_K = \infty$, where for this assignment we will use $K = 5$, $s = 1$, and $\phi_0 = -\infty < \phi_1 = -2 < \phi_2 = -1 < \phi_3 = 0 < \phi_4 = 1 < \phi_5 = \infty$. The model is somewhat sensitive to the setting of the hyperparameters, so it is important to use these settings.

Here $a_i = w^T \phi(x_i)$ and for $j \in \{0,\dots,K\}$ we have $y_{i,j} = \sigma(s \cdot (\phi_j - a_i))$. Using this notation, for example $i$ with label $t_i$ we have $d_i = y_{i,t_i} + y_{i,t_i - 1} - 1$. For the second derivative we have $r_i = s^2\,[\,y_{i,t_i}(1 - y_{i,t_i}) + y_{i,t_i - 1}(1 - y_{i,t_i - 1})\,]$.

To predict for test example $z$ we first calculate the $y$ values: $a = w_{MAP}^T \phi(z)$ and for $j \in \{0,\dots,K\}$ we have $y_j = \sigma(s \cdot (\phi_j - a))$. Then for potential labels $j \in \{1,\dots,K\}$ we calculate $p_j = y_j - y_{j-1}$. Finally, we select $\hat{t} = \arg\max_j p_j$. For this assignment we will use the absolute error, or the number of levels we are off in prediction, that is, err $= |\hat{t} - t|$.
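The ordinal formulas translate directly into code. This is a sketch (the function names are mine, not part of the assignment), using the fixed hyperparameters $K = 5$, $s = 1$, and the thresholds given above; $\sigma$ is the logistic sigmoid, and $\sigma(\pm\infty)$ conveniently evaluates to 1/0 so the boundary levels need no special-casing:

```python
import numpy as np

S = 1.0
# phi_0 .. phi_5 for K = 5 levels, as specified in the assignment
PHI = np.array([-np.inf, -2.0, -1.0, 0.0, 1.0, np.inf])

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def ordinal_d(w, Phi, t):
    """First-derivative terms d_i = y_{i,t_i} + y_{i,t_i-1} - 1."""
    a = Phi @ w
    t = t.astype(int)                      # labels in {1, ..., K}
    y_hi = sigma(S * (PHI[t] - a))         # y_{i, t_i}
    y_lo = sigma(S * (PHI[t - 1] - a))     # y_{i, t_i - 1}
    return y_hi + y_lo - 1.0

def ordinal_r(w, Phi, t):
    """Second-derivative terms r_i = s^2 [y(1-y) + y'(1-y')]."""
    a = Phi @ w
    t = t.astype(int)
    y_hi = sigma(S * (PHI[t] - a))
    y_lo = sigma(S * (PHI[t - 1] - a))
    return S**2 * (y_hi * (1 - y_hi) + y_lo * (1 - y_lo))

def ordinal_predict(w_map, phi_z):
    """Predict the level j in {1..K} maximizing p_j = y_j - y_{j-1}."""
    a = phi_z @ w_map
    y = sigma(S * (PHI - a))               # y_j for j = 0..K
    p = y[1:] - y[:-1]                     # p_j for j = 1..K
    return int(np.argmax(p)) + 1           # labels are 1-based
```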

While you could implement these as three separate algorithms, you are expected to provide one implementation of the main optimization, which is given access to procedures that calculate the 4 items above to make a concrete instance of GLM.
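One way to organize this (a sketch; the `Likelihood` bundle and all names are mine, not prescribed by the assignment) is to group the 4 items per likelihood and pass the bundle to the generic optimizer:

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class Likelihood:
    """The 4 items that specialize the generic GLM."""
    d: Callable        # (w, Phi, t) -> vector of d_i
    r: Callable        # (w, Phi, t) -> vector of r_i
    predict: Callable  # (w_map, Phi_test) -> predicted labels t_hat
    error: Callable    # (t_hat, t) -> per-example errors

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

# Logistic instance; Poisson and ordinal plug in the same way.
logistic = Likelihood(
    d=lambda w, Phi, t: t - sigma(Phi @ w),
    r=lambda w, Phi, t: sigma(Phi @ w) * (1 - sigma(Phi @ w)),
    predict=lambda w, Phi: (sigma(Phi @ w) >= 0.5).astype(float),
    error=lambda t_hat, t: (t_hat != t).astype(float),  # 0/1 loss
)

poisson = Likelihood(
    d=lambda w, Phi, t: t - np.exp(Phi @ w),
    r=lambda w, Phi, t: np.exp(Phi @ w),
    predict=lambda w, Phi: np.floor(np.exp(Phi @ w)),   # mode = floor(lambda)
    error=lambda t_hat, t: np.abs(t_hat - t),           # absolute error
)
```

The generic Newton loop then only ever touches `lik.d` and `lik.r`, so swapping likelihoods requires no change to the optimizer.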

## Evaluating the implementation

Your task is to implement the GLM algorithm and generate learning curves with error bars (i.e., ±1*σ*) as follows.

Repeat 30 times

Step 1) Set aside 1/3 of the total data (randomly selected) to use as a test set.

Step 2) Permute the remaining data and record the test set error rate as a function of increasing training set portion (0.1, 0.2, …, 1 of the total size).

Calculate the mean and standard deviation for each size and plot the result. In addition, record the number of iterations and the run time until convergence in each run and report their averages.
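The evaluation protocol above can be sketched as follows (a sketch; `fit`, `predict`, and `error` are placeholders for your GLM pieces, and I assume the training portions are taken from the remaining 2/3 pool — adjust if portions are meant relative to the total size):

```python
import numpy as np

def learning_curve(Phi, t, fit, predict, error, n_repeats=30, seed=0):
    """Mean/std of test error vs. training fraction (sketch of the protocol)."""
    rng = np.random.default_rng(seed)
    fracs = np.arange(0.1, 1.01, 0.1)        # 0.1, 0.2, ..., 1.0
    n = len(t)
    errs = np.zeros((n_repeats, len(fracs)))
    for rep in range(n_repeats):
        perm = rng.permutation(n)
        # Step 1: set aside 1/3 of the data (randomly selected) as a test set
        test_idx, train_idx = perm[: n // 3], perm[n // 3:]
        # Step 2: train on increasing portions of the (already permuted) rest
        for k, f in enumerate(fracs):
            m = max(1, int(round(f * len(train_idx))))
            w = fit(Phi[train_idx[:m]], t[train_idx[:m]])
            t_hat = predict(w, Phi[test_idx])
            errs[rep, k] = error(t_hat, t[test_idx]).mean()
    # per-size mean and standard deviation; use the std for +/- 1 sigma bars
    return errs.mean(axis=0), errs.std(axis=0)
```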

In your submission provide 4 plots, one for each dataset, with the corresponding runtime/iteration statistics, and a short discussion of the results. Are the learning curves as expected? How does learning time vary across datasets for classification? And across the likelihood models? What are the main costs affecting these (time per iteration, number of iterations)?

## Extra Credit

Explore some approach to model selection for $\alpha$ in all models and/or for $s$ and $\phi$ in the ordinal model, and report your results. You may want to generate your own data with known parameters in order to test the success of the algorithms in identifying good parameters.