CS229 Problem #2-Naive Bayes, SVMs, and Theory Solved

35.00 $

Category:

Description

Rate this product

 

1. Constructing kernels

In class, we saw that by choosing a kernel K(x,z) = φ(x)T φ(z), we can implicitly map data to a high dimensional space, and have the SVM algorithm work in that space. One way to generate kernels is to explicitly define the mapping φ to a higher dimensional space, and then work out the corresponding K.

However in this question we are interested in direct construction of kernels. I.e., suppose we have a function K(x,z) that we think gives an appropriate similarity measure for our learning problem, and we are considering plugging K into the SVM as the kernel function. However for K(x,z) to be a valid kernel, it must correspond to an inner product in some higher dimensional space resulting from some feature mapping φ. Mercer’s theorem tells us that K(x,z) is a (Mercer) kernel if and only if for any finite set {x(1),…,x(m)}, the matrix K is symmetric and positive semidefinite, where the square matrix K ∈ Rm×m is given by Kij = K(x(i),x(j)).

Now here comes the question: Let K1, K2 be kernels over Rn×Rn, let a ∈ R+ be a positive real number, let f : Rn 7→ R be a real-valued function, let φ : Rn → Rd be a function mapping from Rn to Rd, let K3 be a kernel over Rd × Rd, and let p(x) a polynomial over x with positive coefficients.

For each of the functions K below, state whether it is necessarily a kernel. If you think it is, prove it; if you think it isn’t, give a counter-example.

  • K(x,z) = K1(x,z) + K2(x,z)
  • K(x,z) = K1(x,z) − K2(x,z)
  • K(x,z) = aK1(x,z)
  • K(x,z) = −aK1(x,z)
  • K(x,z) = K1(x,z)K2(x,z)
  • K(x,z) = f(x)f(z)
  • K(x,z) = K3(φ(x)(z))
  • K(x,z) = p(K1(x,z))

[Hint: For part (e), the answer is that the K there is indeed a kernel. You still have to prove it, though. (This one may be harder than the rest.) This result may also be useful for another part of the problem.]

2.  Kernelizing the Perceptron

Let there be a binary classification problem with y ∈ {0,1}. The perceptron uses hypotheses of the form hθ(x) = g(θT x), where g(z) = 1{z ≥ 0}. In this problem we will consider a stochastic gradient descent-like implementation of the perceptron algorithm where each update to the parameters θ is made using only one training example. However, unlike stochastic gradient descent, the perceptron algorithm will only make one pass through the entire training set. The update rule for this version of the perceptron algorithm is given by

θ(i+1) := θ(i) + α[y(i+1) − hθ(i)(x(i+1))]x(i+1)

where θ(i) is the value of the parameters after the algorithm has seen the first i training examples. Prior to seeing any training examples, θ(0) is initialized to ~0.

Let K be a Mercer kernel corresponding to some very high-dimensional feature mapping φ. Suppose φ is so high-dimensional (say, ∞-dimensional) that it’s infeasible to ever represent φ(x) explicitly. Describe how you would apply the “kernel trick” to the perceptron to make it work in the high-dimensional feature space φ, but without ever explicitly computing φ(x). [Note: You don’t have to worry about the intercept term. If you like, think of φ as having the property that φ0(x) = 1 so that this is taken care of.] Your description should specify

  • How you will (implicitly) represent the high-dimensional parameter vector θ(i), including how the initial value θ(0) = ~0 is represented (note that θ(i) is now a vector whose dimension is the same as the feature vectors φ(x));
  • How you will efficiently make a prediction on a new input x(i+[1]). I.e., how you will compute hθ(i)(x(i+1)) = g(θ(i)T φ(x(i+1))), using your representation of θ(i); and (c) How you will modify the update rule given above to perform an update to θ on a new training example (x(i+1),y(i+1)); i.e., using the update rule corresponding to the feature mapping φ:

θ(i+1) := θ(i) + α[y(i+1) − hθ(i)(φ(x(i+1)))]φ(x(i+1))

[Note: If you prefer, you are also welcome to do this problem using the convention of labels y ∈ {−1,1}, and g(z) = sign(z) = 1 if z ≥ 0, −1 otherwise.]

3.     Spam classification

In this problem, we will use the naive Bayes algorithm and an SVM to build a spam classifier.

In recent years, spam on electronic newsgroups has been an increasing problem. Here, we’ll build a classifier to distinguish between “real” newsgroup messages, and spam messages. For this experiment, we obtained a set of spam emails, and a set of genuine newsgroup messages.1 Using only the subject line and body of each message, we’ll learn to distinguish between the spam and non-spam.

All the files for the problem are in /afs/ir/class/cs229/ps/ps2/. Note: Please do not circulate this data outside this class. In order to get the text emails into a form usable by naive Bayes, we’ve already done some preprocessing on the messages. You can look at two sample spam emails in the files spamsampleoriginal*, and their preprocessed forms in the files spamsamplepreprocessed*. The first line in the preprocessed format is just the label and is not part of the message. The preprocessing ensures that only the message body and subject remain in the dataset; email addresses (EMAILADDR), web addresses (HTTPADDR), currency (DOLLAR) and numbers (NUMBER) were also replaced by the special tokens to allow them to be considered properly in the classification process. (In this problem, we’ll going to call the features “tokens” rather than “words,” since some of the features will correspond to special values like EMAILADDR. You don’t have to worry about the distinction.) The files newssampleoriginal and newssamplepreprocessed also give an example of a non-spam mail.

The work to extract feature vectors out of the documents has also been done for you, so you can just load in the design matrices (called document-word matrices in text classification) containing all the data. In a document-word matrix, the ith row represents the ith document/email, and the jth column represents the jth distinct token. Thus, the (i,j)-entry of this matrix represents the number of occurrences of the jth token in the ith document.

For this problem, we’ve chosen as our set of tokens considered (that is, as our vocabulary) only the medium frequency tokens. The intuition is that tokens that occur too often or too rarely do not have much classification value. (Examples tokens that occur very often are words like “the,” “and,” and “of,” which occur in so many emails and are sufficiently content-free that they aren’t worth modeling.) Also, words were stemmed using a standard stemming algorithm; basically, this means that “price,” “prices” and “priced” have all been replaced with “price,” so that they can be treated as the same word. For a list of the tokens used, see the file TOKENSLIST.

Since the document-word matrix is extremely sparse (has lots of zero entries), we have stored it in our own efficient format to save space. You don’t have to worry about this format.[2] The file readMatrix.m provides the readMatrix function that reads in the document-word matrix and the correct class labels for the various documents. Code in nb train.m and nb test.m shows how readMatrix should be called. The documentation at the top of these two files will tell you all you need to know about the setup.

  • Implement a naive Bayes classifier for spam classification, using the multinomial eventmodel and Laplace smoothing.

You should use the code outline provided in nbtrain.m to train your parameters, and then use these parameters to classify the test set data by filling in the code in nbtest.m. You may assume that any parameters computed in nbtrain.m are in memory when nb test.m is executed, and do not need to be recomputed (i.e., that nbtest.m is executed immediately after nbtrain.m) [3].

Train your parameters using the document-word matrix in MATRIX.TRAIN, and then report the test set error on MATRIX.TEST.

Remark. If you implement naive Bayes the straightforward way, you’ll find that the computed p(x|y) = Qi p(xi|y) often equals zero. This is because p(x|y), which is the product of many numbers less than one, is a very small number. The standard computer representation of real numbers cannot handle numbers that are too small, and instead rounds them off to zero. (This is called “underflow.”) You’ll have to find a way to compute naive Bayes’ predicted class labels without explicitly representing very small numbers such as p(x|y). [Hint: Think about using logarithms.]

  • Intuitively, some tokens may be particularly indicative of an email being in a particular class. We can try to get an informal sense of how indicative token i is for the SPAM class by looking at:

 .

Using the parameters fit in part (a), find the 5 tokens that are most indicative of the SPAM class (i.e., have the highest positive value on the measure above). The numbered list of tokens in the file TOKENSLIST should be useful for identifying the words/tokens.

  • Repeat part (a), but with training sets of size ranging from 50, 100, 200, …, up to 1400, by using the files TRAIN.*. Plot the test error each time (use MATRIX.TEST as the test data) to obtain a learning curve (test set error vs. training set size). You may need to change the call to readMatrix in nbtrain.m to read the correct file each time. Which training-set size gives the best test set error?
  • Train an SVM on this dataset using the LIBLINEAR SVM library, available for download from http://www.csie.ntu.edu.tw/˜cjlin/liblinear/. This implements an SVM using a linear kernel. Like the Naive Bayes implementation, an outline for your code is provided in m and svmtest.m.

See ps2/README.txt for instructions for downloading and installing LIBLINEAR. Similar to part (c), train an SVM with training set sizes 50, 100, 200, …, 1400, by using the file MATRIX.TRAIN.50 and so on. Plot the test error each time, using MATRIX.TEST as the test data. Use the LIBLINEAR default options when training and testing. You don’t need to try different parameter values.

Running LIBLINEAR in Matlab on Windows or Octave can be buggy, depending on which version of Windows you run. We recommend that you use Matlab on the corn machines (e.g., ssh to corn.stanford.edu). However, there are command line programs you can run (without using MATLAB) which are located in liblinear-1.7/windows for Windows and liblinear-1.7/ for Linux/Unix. If you do it this way, please include the commands that you run from the command line in your solution.

  • How do naive Bayes and Support Vector Machines compare (in terms of generalizationerror) as a function of the training set size?

4.    [20 points] Properties of VC dimension

In this problem, we investigate a few properties of the Vapnik-Chervonenkis dimension, mostly relating to how VC(H) increases as the set H increases. For each part of this problem, you should state whether the given statement is true, and justify your answer with either a formal proof or a counter-example.

  • Let two hypothesis classes H1 and H2 satisfy H1 H2. Prove or disprove: VC(H1) ≤ VC(H2).
  • Let H1 = H2 ∪{h1,…,hk}. (I.e., H1 is the union of H2 and some set of k additional hypotheses.) Prove or disprove: VC(H1) ≤ VC(H2) + k. [Hint: You might want to start by considering the case of k = 1.]
  • Let H1 = H2 H3. Prove or disprove: VC(H1) ≤ VC(H2) + VC(H3).

5.     Training and testing on different distributions

In the discussion in class about learning theory, a key assumption was that we trained and tested our learning algorithms on the same distribution D. In this problem, we’ll investigate one special case of training and testing on different distributions. Specifically, we will consider what happens when the training labels are noisy, but the test labels are not.

Consider a binary classification problem with labels y ∈ {0,1}, and let D be a distribution over (x,y), that we’ll think of as the original, “clean” or “uncorrupted” distribution. Define Dτ to be a “corrupted” distribution over (x,y) which is the same as D, except that the labels y have some probability 0 ≤ τ < 0.5 of being flipped. Thus, to sample from Dτ, we would first sample (x,y) from D, and then with probability τ (independently of the observed x and y) replace y with 1 − y. Note that D0 = D.

The distribution Dτ models a setting in which an unreliable human (or other source) is labeling your training data for you, and on each example he/she has a probability τ of mislabeling it. Even though our training data is corrupted, we are still interested in evaluating our hypotheses with respect to the original, uncorrupted distribution D. We define the generalization error with respect to Dτ to be

ετ(h) = P(x,y)∼Dτ[h(x) 6= y].

Note that ε0(h) is the generalization error with respect to the “clean” distribution; it is with respect to ε0 that we wish to evaluate our hypotheses.

  • For any hypothesis h, the quantity ε0(h) can be calculated as a function of ετ(h) and

τ. Write down a formula for ε0(h) in terms of ετ(h) and τ, and justify your answer.

  • Let |H| be finite, and suppose our training set S = {(x(i),y(i));i = 1,…,m} is obtained by drawing m examples IID from the corrupted distribution Dτ. Suppose we pick h H using empirical risk minimization: hˆ = argminhH εˆS(h). Also, let h= argminhH ε0(h).

Let any δ,γ > 0 be given. Prove that for

ε0(hˆ) ≤ ε0(h) + 2γ

to hold with probability 1 − δ, it suffices that

.

Remark. This result suggests that, roughly, m examples that have been corrupted at noise level τ are worth about as much as (1 − 2τ)2m uncorrupted training examples. This is a useful rule-of-thumb to know if you ever need to decide whether/how much to pay for a more reliable source of training data. (If you’ve taken a class in information theory, you may also have heard that (1−H(τ))m is a good estimate of the information in the m corrupted examples, where H(τ) = −(τ log2 τ + (1 − τ)log2(1 − τ)) is the “binary entropy” function. And indeed, the functions (1−2τ)2 and 1−H(τ) are quite close to each other.)

 

CS229 Problem Set #2                                                                                                                                            6

(c) Comment briefly on what happens as τ approaches 0.5.

[1] Thanks to Christian Shelton for providing the spam email. The non-spam messages are from the 20 newsgroups data at http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html .

[2] Unless you’re not using Matlab/Octave, in which case feel free to ask us about it.

[3] Matlab note: If a .m file doesn’t begin with a function declaration, the file is a script. Variables in a script are put into the global namespace, unlike with functions.

  • ps2-mlu6na.zip