CA4023 Assignment 1 Solved


PART ONE

Sentence Splitting

Write a program that splits a document into sentences. The input to your program should be a file containing text. The output should be a new file with each sentence from the input file on a separate line.

For example, if the input file contains the following:

With all the fawning end-of-the-year kudos currently circulating, it’s easy to forget that a sizable number of actual bad movies came out in 2012. Well, consider this a refresher! From failed blockbuster tentpoles (”Battleship”) to would-be hilarious comedies (“The Watch”) to lame scare-challenged horror flicks (“The Apparition”) to…uh, well, pretty much anything involving Mr. Tyler Perry, there’s no doubt that the last 366 days have come with a heaping helping of truly heinous cinematic stinkers. So what better time for an accounting of the year’s most outrageous big-screen abominations than on the eve of the coming apocalypse?

The output file should contain the following:

With all the fawning end-of-the-year kudos currently circulating, it’s easy to forget that a sizable number of actual bad movies came out in 2012.
Well, consider this a refresher!
From failed blockbuster tentpoles (”Battleship”) to would-be hilarious comedies (“The Watch”) to lame scare-challenged horror flicks (“The Apparition”) to…uh, well, pretty much anything involving Mr. Tyler Perry, there’s no doubt that the last 366 days have come with a heaping helping of truly heinous cinematic stinkers.
So what better time for an accounting of the year’s most outrageous big-screen abominations than on the eve of the coming apocalypse?

Note that your solution should NOT make use of machine learning.
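
One possible rule-based approach is sketched below in Python. It is an illustration rather than a reference solution: the abbreviation list, the boundary pattern and the function names `split_sentences` and `split_file` are all assumptions made here, and the rules will need extending for real text. The idea is to treat `.`, `!` or `?` followed by whitespace as a candidate boundary, then reject the candidate if the preceding word is a known abbreviation (such as "Mr.") or if the next character does not look like the start of a sentence.

```python
import re

# Hypothetical abbreviation list (an assumption); extend it for your own data.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "sr", "jr", "st", "vs", "etc"}

# Candidate boundary: ., ! or ?, optional closing quotes/brackets, then whitespace.
BOUNDARY = re.compile(r'[.!?]+["\'”’)\]]*\s+')


def split_sentences(text):
    """Split text into sentences using hand-written rules only (no ML)."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # The word immediately before the punctuation mark.
        words_before = text[start:match.start()].split()
        last_word = words_before[-1].lower() if words_before else ""
        # The first character after the whitespace ("" at end of text).
        following = text[match.end():match.end() + 1]

        # Rule 1: a period after a known abbreviation is not a boundary.
        if text[match.start()] == "." and last_word.strip('("“”\'') in ABBREVIATIONS:
            continue
        # Rule 2: a new sentence should start with a capital, digit or opening quote.
        if following and not (following.isupper() or following.isdigit()
                              or following in '"“(\''):
            continue

        sentences.append(text[start:match.end()].strip())
        start = match.end()

    # Whatever is left after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def split_file(input_path, output_path):
    """Read input_path and write one sentence per line to output_path."""
    with open(input_path, encoding="utf-8") as src:
        text = " ".join(src.read().split())  # collapse line breaks into spaces
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(split_sentences(text)) + "\n")
```

Calling `split_file("input.txt", "output.txt")` (both file names are placeholders) would write the split text. Keeping the rules in a pure function such as `split_sentences` makes them easy to test on short strings before running them on whole files.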

PART TWO

Language Modelling

Implement an unsmoothed bigram language model. Train your model on the following toy corpus:

<s> a b </s> <s> b b </s> <s> b a </s> <s> a a </s>

Calculate and print out the probability of each of the following strings:

<s> b </s>
<s> a </s>
<s> a b </s>
<s> a a </s>
<s> a b a </s>
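
As a rough illustration, an unsmoothed bigram model estimates P(w | prev) as count(prev, w) / count(prev) and scores a string as the product of its bigram probabilities. The Python sketch below makes two assumptions that the brief does not state: bigrams are counted within each `<s> ... </s>` sentence only (so no `</s> <s>` bigram is counted), and each `<s> ... </s>` span in the test data is scored as its own string. The function names are illustrative.

```python
from collections import Counter

CORPUS = "<s> a b </s> <s> b b </s> <s> b a </s> <s> a a </s>"


def sentences(text):
    """Yield each <s> ... </s> sentence as a list of tokens."""
    current = []
    for token in text.split():
        current.append(token)
        if token == "</s>":
            yield current
            current = []


def train_bigram(corpus):
    """Count bigrams and their left-hand contexts, sentence by sentence."""
    bigram_counts = Counter()
    context_counts = Counter()
    for sent in sentences(corpus):
        for prev, cur in zip(sent, sent[1:]):
            bigram_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return bigram_counts, context_counts


def probability(sentence, bigram_counts, context_counts):
    """Unsmoothed probability: product of count(prev, cur) / count(prev)."""
    tokens = sentence.split()
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: no smoothing, so the string gets 0
        prob *= bigram_counts[(prev, cur)] / context_counts[prev]
    return prob


bigram_counts, context_counts = train_bigram(CORPUS)
for s in ["<s> b </s>", "<s> a </s>", "<s> a b </s>",
          "<s> a a </s>", "<s> a b a </s>"]:
    print(s, probability(s, bigram_counts, context_counts))
# <s> b </s> 0.25
# <s> a </s> 0.25
# <s> a b </s> 0.0625
# <s> a a </s> 0.0625
# <s> a b a </s> 0.015625
```

Because the model is unsmoothed, any string containing a bigram that never occurs in the corpus receives probability 0.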

PART THREE

Naive Bayes Sentiment Polarity Classifier

Write a sentiment polarity classifier that uses the Naive Bayes algorithm to assign a polarity of positive or negative to a review.

Your program should accept as input a training file and a test file. The training file contains a list of reviews and their actual sentiment labels (positive or negative). The test file contains either a list of reviews with their actual sentiment labels or a list of reviews on their own. Your program should output the prediction of the NB classifier (positive or negative) for each review in the test file. If the actual labels (sometimes referred to as gold labels or ground truth) are also available for the test reviews, your program should also print the accuracy of the classifier.
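
The training, prediction and accuracy steps could be sketched as follows in Python. This is a minimal multinomial Naive Bayes under several assumptions that the brief does not specify: whitespace tokenisation on lowercased text, add-one (Laplace) smoothing, log probabilities to avoid underflow, and words unseen in training being skipped at prediction time. The function names and the in-memory `(text, label)` format are illustrative; reading the training and test files is left out.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Naive tokeniser (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def train_nb(labelled_reviews):
    """labelled_reviews: iterable of (text, label) pairs."""
    doc_counts = Counter()              # number of reviews per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocab = set()
    for text, label in labelled_reviews:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log posterior score."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothed log likelihood of the word given the label.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

When gold labels are present in the test file, `accuracy` can be applied to the list of predictions; otherwise only the predictions are printed.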

You should use the following training data: https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz described in the following paper:

Pang and Lee 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. Proceedings of the 42nd ACL. https://www.aclweb.org/anthology/P04-1035/

There are 1000 positive reviews and 1000 negative reviews. Reserve the last 100 of each type for testing (files starting with CV9) and the first 900 for training (files starting with CV[0-8]).
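
The split described above can be made from the file names alone. The Python sketch below assumes that each file name begins with `cv` followed by a three-digit number, and it compares the prefix case-insensitively in case the names in the archive are lowercase. The helper name `split_by_fold` and the example file names are hypothetical.

```python
def split_by_fold(filenames):
    """Return (train, test): names starting with 'cv9' go to test, the rest to train."""
    train, test = [], []
    for name in sorted(filenames):
        # Case-insensitive prefix check, since the exact casing is an assumption.
        if name.lower().startswith("cv9"):
            test.append(name)
        else:
            train.append(name)
    return train, test


# Hypothetical file names, for illustration only.
train, test = split_by_fold(["cv000_a.txt", "cv899_b.txt", "cv900_c.txt", "cv999_d.txt"])
```

Apply the same function to the positive and negative directories separately so that each label contributes 900 training and 100 test reviews.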

Analyse the output of your classifier on 5 correct and 5 incorrect samples chosen at random from the test set. For each example, say why you think your classifier made the correct or incorrect decision.

Points to Note

  • You may implement the solutions in a programming language of your choice.
  • Note that you may NOT make use of external NLP libraries.

Marking Criteria

For Parts One and Two, marks will be awarded for

  1. Correct implementation (4 marks)
  2. Clear, readable, appropriately commented code (1 mark)

For Part Three, marks will be awarded for

  1. Correct implementation (7 marks)
  2. Clear, readable, appropriately commented code (1 mark)
  3. Analysis (2 marks)