NLP Homework 2: Sentiment Analysis for Movie Reviews


1 Manual classification

To get started, read these two reviews, and decide which one is negative and which one is positive. Provide a short motivation, and try to anticipate what could pose an issue for automatic sentiment identification.

R1 Busy Phillips put in one hell of a performance, both comedic and dramatic. Erika Christensen was good but Busy stole the show. It was a nice touch after The Smokers, a movie starring Busy, which wasnt all that great. If Busy doesnt get a nomination of any kind for this film it would be a disaster. […]

R2 This movie was awful. The ending was absolutely horrible. There was no plot to the movie whatsoever. The only thing that was decent about the movie was the acting done by Robert DuVall and James Earl Jones. Their performances were excellent! The only problem was that the movie did not do their acting performances any justice. […]

What to submit: Your guesses, motivation & possible issues for automatic sentiment identification.

Dataset

In this assignment you will use a dataset of movie reviews from the Internet Movie Database (IMDB)[1].

The movies directory contains two subdirectories:

  • train These documents will be used to train your language model. (600 docs)
  • test These documents will be used to test your model. (50 docs)

The documents are named [sentiment]-[review ID].txt. A separate text file contains the correct labels for the documents in test. There is no need to modify the files or their directories; load them using the provided Python 3 Notebook.

2 Tokenization

The first step is to tokenize the data. Tokenization splits up a character sequence into smaller pieces (tokens). An example tokenization is:

Original sentence: “If you have the chance, watch it. Although, a warning, you’ll cry your eyes out.”

Tokens: [If, you, have, the, chance, ,, watch, it, ., Although, ,, a, warning, ,, you, ’ll, cry, your, eyes, out, .]

2.1 Making your own tokenizer

For this assignment, write a simple tokenizer of your own. Come up with three sentences and try the tokenizer out on them.

What to submit: Provide a description of how your tokenizer works. Report the tokens you obtain when using your tokenizer on your example sentences.
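As a point of reference, a tokenizer along these lines can be written in a few lines of Python with a regular expression. The pattern below (which splits off punctuation and keeps contraction suffixes such as ’ll as separate tokens) is only one possible choice, not a required solution; the function name and the rules are illustrative.

    import re

    def simple_tokenize(text):
        """Split text into word tokens, contraction suffixes and punctuation marks."""
        # A word, or an apostrophe followed by letters ('ll, 're, ...), or any
        # single non-word, non-space character (punctuation).
        return re.findall(r"['’]\w+|\w+|[^\w\s]", text)

    sentence = "If you have the chance, watch it. Although, a warning, you'll cry your eyes out."
    print(simple_tokenize(sentence))
    # ['If', 'you', 'have', 'the', 'chance', ',', 'watch', 'it', '.', 'Although',
    #  ',', 'a', 'warning', ',', 'you', "'ll", 'cry', 'your', 'eyes', 'out', '.']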

2.2 Using an off-the-shelf tokenizer

Compare the tokenizer you implemented in the previous question with one from NLTK, using the sentences provided in the Notebook.

What to submit: Reflect and answer these questions: What are the differences in the two tokenizer outputs? Which one is better? While coding your tokenizer, did you foresee all these inputs? Is there a single ‘perfect tokenizer’?
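One way to set up the comparison is sketched below using NLTK's word_tokenize. The two example sentences are placeholders for the ones provided in the Notebook, and simple_tokenize is the toy tokenizer from the previous sketch.

    import re
    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)   # tokenizer models (newer NLTK may also need "punkt_tab")

    def simple_tokenize(text):
        # toy tokenizer from the previous sketch
        return re.findall(r"['’]\w+|\w+|[^\w\s]", text)

    # Placeholder sentences; for the assignment, use the ones from the Notebook.
    sentences = [
        "Mr. Smith spent $3.50 on popcorn... wasn't it great?!",
        "I'd rate it 8/10, a must-see :-)",
    ]

    for s in sentences:
        print("own :", simple_tokenize(s))
        print("nltk:", word_tokenize(s))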

2.3 Vocabulary

Run the NLTK tokenizer on all documents in the train directory and keep track of the unigram frequencies. Since our dataset is small, it is a good idea to apply heavy normalization, for example removing punctuation and lowercasing each sentence. After you implement the normalization, the sentence “If you have the chance, watch it. Although, a warning, you’ll cry your eyes out.” should look similar to this:

Normalized tokens: [if, you, have, the, chance, watch, it, although, a, warning, you, ’ll, cry, your, eyes, out]

Answer the following questions using the documents in the train directory:

  • How many unique n-grams are there (for n = 1, 2, 3)?
  • Report the top 10 most frequent words (unigrams) and their frequencies. What kind of words are these?
  • How many words occur 1, 2, 3 and 4 times in the corpus? What kind of distribution is this?

Since words that do not occur often don’t add much information to the classification, keep only the words that occur at least 25 times as your vocabulary. Write your code such that all words that are not in your vocabulary are ignored in the rest of this assignment.

What to submit: Answers to the three points above.
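A possible way to organise this step is sketched below. The path movies/train, the normalization (lowercasing and dropping pure-punctuation tokens) and the frequency cutoff of 25 follow the description above; all function and variable names are merely illustrative.

    import os
    from collections import Counter
    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    def normalize(tokens):
        """Lowercase and drop tokens with no letters or digits (pure punctuation)."""
        return [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]

    # Count unigrams, bigrams and trigrams over all training documents.
    unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
    train_dir = "movies/train"           # adjust to wherever the dataset lives
    for fname in os.listdir(train_dir):
        with open(os.path.join(train_dir, fname), encoding="utf-8") as f:
            tokens = normalize(word_tokenize(f.read()))
        unigrams.update(tokens)
        bigrams.update(ngrams(tokens, 2))
        trigrams.update(ngrams(tokens, 3))

    print("unique 1/2/3-grams:", len(unigrams), len(bigrams), len(trigrams))
    print("top 10 unigrams   :", unigrams.most_common(10))
    for k in (1, 2, 3, 4):
        print(f"words occurring {k} time(s):", sum(1 for c in unigrams.values() if c == k))

    # Keep only words seen at least 25 times; everything else is ignored from here on.
    vocabulary = {w for w, c in unigrams.items() if c >= 25}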

3 Text classification with a unigram language model

Recall that for a text with words w1 … wn, we calculate the probability as follows using a unigram language model:

P(w1, w2, …, wn) = P(w1) P(w2) … P(wn) = ∏_{i=1}^{n} P(wi)

In order to avoid underflow, it is better to calculate this in log space (base 2)[2]:

log P(w1, w2, …, wn) = log ∏_{i=1}^{n} P(wi) = ∑_{i=1}^{n} log P(wi)
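As a quick sanity check of the underflow argument in footnote [2], the snippet below multiplies one hundred probabilities of 0.0001 directly, which collapses to 0.0 in floating point, and then does the same computation as a sum of base-2 logarithms, which stays perfectly representable. The numbers are made up purely for illustration.

    import math

    probs = [1e-4] * 100              # one hundred word probabilities of 0.0001

    product = 1.0
    for p in probs:
        product *= p
    print(product)                    # 0.0 -- underflow; the true value is 1e-400

    log_prob = sum(math.log2(p) for p in probs)
    print(log_prob)                   # about -1328.77, no underflow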

In our dataset we have two classes: positive (Pos) and negative (Neg). For each class, we will calculate a separate language model. This is the training or learning phase. In the apply phase, we will classify new texts as positive or negative. For testing our machine learning classifier, we apply the models on the documents in the test part of the corpus.

  1. TRAIN For the documents in the train directory, build two language models: one using the positive reviews and one using the negative reviews. For example, we calculate the probability for the positive language model as follows:

P(w1, w2, …, wn | Pos) = ∏_{i=1}^{n} P(wi | Pos)

where we use the conditional probability P(wi | Pos) instead of just P(wi), because we are calculating the probabilities using only the positive reviews. We estimate the conditional probabilities as:

P(wi | Pos) = C(wi, Pos) / ∑_{w∈V} C(w, Pos)

where C(wi, Pos) is the frequency of word wi in the positive reviews and ∑_{w∈V} C(w, Pos) the total number of words in the positive reviews[3].

Smoothing Use add-k smoothing to avoid zero probabilities:

P(wi | Pos) = (C(wi, Pos) + k) / (∑_{w∈V} C(w, Pos) + k·|V|)

where |V| is the size of your vocabulary. Use two settings: k = 1, and a value of k that you select yourself.

  2. TEST For the reviews in the test directory, calculate the probability under both language models and assign each review the class for which this probability is highest, using the MAP (maximum a posteriori) rule:

ĉ = argmax_{c ∈ {Pos, Neg}} P(c) · P(w1, w2, …, wn | c) = argmax_{c ∈ {Pos, Neg}} [ log P(c) + ∑_{i=1}^{n} log P(wi | c) ]
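Putting the TRAIN and TEST steps together, the whole classifier could be sketched as follows. The assumption that file names start with the sentiment label (e.g. pos-123.txt), the directory paths, and the reuse of word_tokenize and of the vocabulary set built in the Section 2.3 sketch are illustrative choices, not part of the assignment specification; compare the printed predictions against the provided label file.

    import os
    import math
    from collections import Counter
    from nltk.tokenize import word_tokenize

    # `vocabulary` is the set of words occurring at least 25 times (Section 2.3 sketch).

    def read_tokens(path, vocabulary):
        """Tokenize, lowercase, drop punctuation and filter against the vocabulary."""
        with open(path, encoding="utf-8") as f:
            tokens = word_tokenize(f.read())
        return [t.lower() for t in tokens
                if any(ch.isalnum() for ch in t) and t.lower() in vocabulary]

    def train_class_model(train_dir, prefix, vocabulary, k):
        """Add-k smoothed unigram log2-probabilities for one class (Pos or Neg)."""
        counts, n_docs = Counter(), 0
        for fname in os.listdir(train_dir):
            if not fname.startswith(prefix):          # assumes names like "pos-123.txt"
                continue
            counts.update(read_tokens(os.path.join(train_dir, fname), vocabulary))
            n_docs += 1
        denom = sum(counts.values()) + k * len(vocabulary)
        return {w: math.log2((counts[w] + k) / denom) for w in vocabulary}, n_docs

    k = 1                                             # also try a k of your own choice
    pos_lp, n_pos = train_class_model("movies/train", "pos", vocabulary, k)
    neg_lp, n_neg = train_class_model("movies/train", "neg", vocabulary, k)
    log_prior = {"Pos": math.log2(n_pos / (n_pos + n_neg)),
                 "Neg": math.log2(n_neg / (n_pos + n_neg))}

    def classify(path):
        """MAP rule: pick the class c maximizing log P(c) + sum_i log P(wi | c)."""
        tokens = read_tokens(path, vocabulary)
        scores = {"Pos": log_prior["Pos"] + sum(pos_lp[t] for t in tokens),
                  "Neg": log_prior["Neg"] + sum(neg_lp[t] for t in tokens)}
        return max(scores, key=scores.get)

    for fname in sorted(os.listdir("movies/test")):
        print(fname, "->", classify(os.path.join("movies/test", fname)))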

[1] The dataset is a subset of the original dataset by Maas et al. The full dataset can be found at http://ai.stanford.edu/~amaas/data/sentiment/.

[2] Since the probabilities are “small numbers”, the more of them we multiply together, the smaller the result becomes, up to the point where the computer can no longer represent these numbers accurately (https://en.wikipedia.org/wiki/Arithmetic_underflow). By moving everything into log space we sidestep the problem (recall that the logarithm of a product is the sum of the individual logarithms).

[3] Take a few minutes to understand the formula, and reflect on the meaning of this sentence.
