CSE408 Project 1: Text Classification and Sentiment Analysis

Programming language: Matlab (recommended; template code provided) or another language (no template code provided).

Four people in a team. List the team members in your report, with name and ASU ID. Implement your code under the Code folder, and read data from the Data folder (one folder for kNN and one for sentiment analysis, or SA). Write the project report in the file P01_report.pdf and place it in the project root folder.

1 Text Classification with Bag of Words and kNN

  • Input data under ../Data/kNN/training and ../Data/kNN/testing
  • Training directories pos and neg contain 90 text files each
  • Testing directories pos and neg contain 10 text files each
  • The directory name (pos, neg) corresponds to the true classification of each file
  • Grading will include running and inspecting the code and verifying output.

1.1 Vocabulary (lexicon) creation

Implement function buildVoc.m

  1. Template file: buildVoc.m
  2. Function template: function voc = buildVoc(folder, voc, finvoc);
  3. Inputs:
    a. folder is a folder path, which contains training data
    b. voc is a cell array to which the vocabulary is added, so you can build a single lexicon for a set of folders; the first time you call the function, voc is an empty cell array { }
    c. finvoc is 0 the first time you call the function (e.g., with the pos data folder path) and 1 the second time (e.g., with the neg data folder path).
  4. Output:
    a. voc is a cell array which represents the vocabulary of the words that appear in the data files, except the stop words (the stop word list is embedded in the code template)
    b. when called with argument finvoc = 0, voc contains unique words with frequency above a value of your choice.
  5. Implement your code under the %PUT YOUR IMPLEMENTATION HERE tag;
    a. When you test the code, check that the contents of the lexicon, i.e. array voc, do not have issues such as single characters or weird words that may cause performance issues. If needed, add code to correct that.
  6. Useful Matlab functions: strtok(), lower(), regexprep(), ismember(), any();
  7. Include the instructions to generate voc, and cut and paste the output into the project report.
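
The vocabulary steps above can be tried out on in-memory strings before wiring them to the data folders. The following is only a minimal sketch, not the template's implementation: the two sample texts, the tiny stop word list, the variable names (docs, stopWords, minFreq, counts) and the threshold value are all assumptions made for illustration.

```matlab
% Hypothetical stand-ins for the contents of two review files.
docs = {'The movie was great , great fun !', ...
        'A great cast ; the plot was a mess .'};
stopWords = {'the', 'a', 'was'};   % assumed stop list, not the template's list
minFreq   = 1;                     % assumed threshold: keep words seen more than once

counts = containers.Map('KeyType', 'char', 'ValueType', 'double');
for d = 1:numel(docs)
    % Lowercase, then turn every run of non-letters into a single space.
    txt = regexprep(lower(docs{d}), '[^a-z]+', ' ');
    [tok, rest] = strtok(txt);
    while ~isempty(tok)
        % Skip single characters and stop words.
        if numel(tok) > 1 && ~any(strcmp(tok, stopWords))
            if isKey(counts, tok)
                counts(tok) = counts(tok) + 1;
            else
                counts(tok) = 1;
            end
        end
        [tok, rest] = strtok(rest);
    end
end

% Keep only the words whose frequency is above the threshold.
words = keys(counts);
freq  = cell2mat(values(counts));
voc   = words(freq > minFreq);
disp(voc)
```

A Map is used here only to keep the counting short; a pair of parallel arrays with ismember() would work as well.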

1.2 Bag of Words feature extraction (10pt)

This function computes the BOW feature vector for any text file in the ../Data/kNN/* directory

  1. Template file: cse408_bow.m
  2. Function template: function feat_vec = cse408_bow(filepath, voc)
  3. Inputs:
    a. filepath is a file path which contains one review (one .txt file)
    b. voc is the lexicon cell array from the previous sub-section.
  4. Output:
    a. feat_vec is a one-dimensional Matlab array, which represents the bag of words feature vector given the vocabulary voc
  5. Implement your code under the %PUT YOUR IMPLEMENTATION HERE tag;
  6. Useful Matlab functions: strtok(), lower(), regexprep(), ismember(), any();
  7. Try out the implementation of function cse408_bow by giving it a path to a text file and voc (the lexicon).
  8. The output can be captured and pasted into the project report. Print only the feature values in a single row; do not include the lexicon words, as the list is too long.
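
As a rough illustration of the feature extraction described above, the sketch below counts how often each lexicon word occurs in one piece of text. The three-word lexicon and the sample sentence are assumptions standing in for the real voc and for the contents of a review file; reading the file itself is left out.

```matlab
voc  = {'great', 'plot', 'boring'};               % assumed toy lexicon
text = 'Great plot , great cast , not boring .';  % stand-in for one review file

feat_vec = zeros(1, numel(voc));                  % one counter per lexicon word
txt = regexprep(lower(text), '[^a-z]+', ' ');     % lowercase, strip punctuation
[tok, rest] = strtok(txt);
while ~isempty(tok)
    [found, idx] = ismember(tok, voc);            % position of the word in voc
    if found
        feat_vec(idx) = feat_vec(idx) + 1;
    end
    [tok, rest] = strtok(rest);
end

% Print only the feature values in a single row.
fprintf('%d ', feat_vec);
fprintf('\n');
```

Words that are not in the lexicon ("cast", "not") are simply ignored, so the vector always has the same length as voc.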

1.3 k-NN Classification (15pt)

A set of reviews (training data) with their labels (pos, neg) is provided. The training data is under ../Data/kNN/training/{pos,neg}. The class labels (1 for positive, 0 for negative) should be stored in array train_label_set, and their feature vectors should be stored in array train_feat_set. The objective of the kNN classification is to use a distance metric to find the k nearest neighbors of a test file (a review under ../Data/kNN/testing/{pos,neg}).

  1. Template file: cse408_knn.m
    This function returns the label (positive or negative) predicted by the kNN algorithm for the feature vector of a test file.
  2. Function template: function pred_label = cse408_knn(test_feat, train_label_set, train_feat_set, k, DstType)
  3. Input arguments:
    a. test_feat is the feature vector of a test file (test_feat has the same size as the lexicon)
    b. train_label_set is the set of labels for the training set (its size is the number of training files)
    c. train_feat_set is the set of feature vectors for the training set (its size is the size of the lexicon x the number of training files)
    d. k is the hyperparameter of the kNN algorithm, i.e. the number of neighbors used
    e. DstType is the distance computation method: 1 for sum of squared distances (SSD), 2 for angle between vectors, and 3 for number of words in common;
  4. Output:
    a. pred_label is the predicted label of the testing file: 1 for a positive review, 0 for a negative review;
  5. Implement your code under the %PUT YOUR IMPLEMENTATION HERE tag;

  6. Useful Matlab functions: sort();
  7. Try out function cse408_knn using appropriate arguments.
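
One possible shape for the kNN step is sketched below on toy data. It is not the template code: the feature values, the variable names dist, order and nearest, and the tie-breaking rule are assumptions. In particular, "number of words in common" is treated here as a similarity and negated, so that sort() still puts the closest files first; other readings of that metric are possible.

```matlab
% Toy data: 3-word lexicon, 4 training files stored as columns (assumed values).
train_feat_set  = [3 2 0 0;
                   1 2 0 1;
                   0 0 4 2];
train_label_set = [1 1 0 0];
test_feat = [2 1 0];
k = 3;
DstType = 1;

t = test_feat(:);                       % work with a column vector
n = size(train_feat_set, 2);
dist = zeros(1, n);
for i = 1:n
    x = train_feat_set(:, i);
    switch DstType
        case 1   % sum of squared differences
            dist(i) = sum((x - t) .^ 2);
        case 2   % angle between the two vectors
            c = dot(x, t) / (norm(x) * norm(t) + eps);
            dist(i) = acos(min(max(c, -1), 1));
        case 3   % words in common, negated so more shared words sorts first
            dist(i) = -sum(x > 0 & t > 0);
    end
end

[~, order] = sort(dist);                % ascending: nearest files first
nearest = train_label_set(order(1:k));  % labels of the k nearest neighbors
% Majority vote; with an even k a tie falls to the negative class (assumption).
pred_label = double(sum(nearest) > k / 2);
disp(pred_label)
```

Changing DstType to 2 or 3 on the same toy data is a quick way to see how the neighbor ordering depends on the metric.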

1.4 Test kNN implementation (10pt)

The test script for part one is provided in P01Part1_test.m. After your implementation, run P01Part1_test.m to debug and validate your code. It iteratively selects one of the training review files as a validation file.

As you can see from the code, it uses the test data under ../Data/kNN/testing/{pos,neg}. The grader may run the implementation to verify it, so make sure it runs without problems.

2 Text Sentiment Analysis (30pt)

2.1 Implementation (15pt)

  1. Implement a basic sentiment analysis module. Read in a lexicon in which each word has a sentiment score (file wordWithStrength.txt in the SA directory). Iterate through each review file and sum up the sentiment scores for each word that exists in the sentiment strength lexicon;
  2. Template file: sentimentAnalysis.m
  3. Input: a file path filepath, which contains one review (one .txt file), and a word-with-sentiment-strength file wordWithStrength.txt under the ../Data/SA folder.
  4. Output: one sentiment score.
  5. Implement your code under the %PUT YOUR IMPLEMENTATION HERE tag;
  6. Useful Matlab functions: strtok(), lower(), regexprep(), containers.Map();
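
The scoring loop described above might look roughly like the sketch below. The lexicon entries and their integer scores are made up for illustration, since the layout and contents of wordWithStrength.txt are not shown here, and the review text is an in-memory stand-in for one .txt file.

```matlab
% Hypothetical lexicon; the real words and scores come from wordWithStrength.txt.
lexWords  = {'great', 'fun', 'boring', 'mess'};
lexScores = [3, 2, -2, -4];
lexicon   = containers.Map(lexWords, lexScores);

text = 'Great fun at first , but the ending is a boring mess .';

score = 0;
txt = regexprep(lower(text), '[^a-z]+', ' ');   % lowercase, strip punctuation
[tok, rest] = strtok(txt);
while ~isempty(tok)
    if isKey(lexicon, tok)                      % only lexicon words contribute
        score = score + lexicon(tok);
    end
    [tok, rest] = strtok(rest);
end
fprintf('sentiment score = %d\n', score);
```

A review would then typically be called positive or negative by comparing the score to a threshold such as zero, which is itself a design choice to discuss in the report.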

2.2 Test your implementation (15pt)

  1. After your implementation, run P01Part2test.m to debug and validate your code.
  2. Report on the accuracy of your code.
  3. The grader may run the implementation to verify it, so make sure it runs without problems.

3 Report Requirements (30pt)

3.1 Results presentation and analysis or discussion (20pt)

You may use the report template provided as a guideline.

Make sure to explain where the algorithms worked, where they didn't, and why. You are encouraged to use both text and plots/graphs to explain your observations.

Analyze the hyperparameter K in the kNN part: which K did you empirically observe to achieve the best performance? You may draw a graph of performance vs. K.

Among the three distance metrics (sum of squared distances (SSD), angle between vectors, and number of words in common), which one intuitively makes the most sense for classifying positive and negative reviews? Which one did you empirically observe to achieve the best performance?

For the Text Sentiment Analysis task (Part II), which review in our dataset has the highest positive score even though it is a negative review? And which one has the lowest negative score even though it is a positive review? Which set of words from these reviews confused your sentiment analysis system?
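
For the K analysis, one way to collect the numbers is to loop over candidate values of K and record the accuracy for each. The sketch below does this with a leave-one-out check on a toy feature matrix; the data, the candidate list ks and the use of SSD only are assumptions, and the real experiment would use the project's feature vectors and test files instead.

```matlab
% Toy feature matrix (one column per file) and labels; purely illustrative.
feat  = [5 4 6 1 0 1 5 1;
         0 1 1 5 6 4 1 6];
label = [1 1 1 0 0 0 1 0];
ks    = [1 3 5];                  % assumed candidate values of K

n   = numel(label);
acc = zeros(size(ks));
for a = 1:numel(ks)
    k = ks(a);
    correct = 0;
    for i = 1:n
        % Leave file i out and classify it against the remaining files.
        keep = setdiff(1:n, i);
        d = sum((feat(:, keep) - repmat(feat(:, i), 1, n - 1)) .^ 2, 1);
        [~, order] = sort(d);
        votes = label(keep(order(1:k)));
        pred  = double(sum(votes) > k / 2);
        correct = correct + (pred == label(i));
    end
    acc(a) = correct / n;
end

% acc can be passed to plot(ks, acc, '-o') to draw performance vs. K.
for a = 1:numel(ks)
    fprintf('k = %d, accuracy = %.2f\n', ks(a), acc(a));
end
```

On real review data the curve is unlikely to be flat, which is what makes the choice of K worth discussing.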

3.2 Proposed improvements

There are many ways to improve kNN and/or sentiment analysis. Select one incremental approach and pursue it as far as time allows: describe it in detail, implement it, test it, and compare with previous results. Points will be awarded according to the amount of work. Possible areas for improvement will be discussed in class. Document your approach and results in an Appendix of the report.

Matlab tips

Reading text from files:
First open the file using function fopen:
>> fd = fopen('..\P1\knn_training\reviews\neg\cv000_29416.txt','r');
Then read 10 lines (for the kNN exercise):

ns = '';
for i = 1:10
    ns = strcat(ns, fgetl(fd));
end
fclose(fd);

% The code above initializes the string to an empty string, then concatenates to it one line at a time using function strcat. Each line is read using function fgetl.
% Note: strcat drops trailing whitespace from character inputs, so the last word of one line can end up glued to the first word of the next.
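
A variant that reads a whole file, whatever its length, is sketched below. It writes its own small temporary file so it can run anywhere; the file contents and the names fname and ln are assumptions for illustration only.

```matlab
% Create a small temporary file so the example has something to read.
fname = [tempname '.txt'];
fd = fopen(fname, 'w');
fprintf(fd, 'they get into an accident .\none of the guys dies .\n');
fclose(fd);

fd = fopen(fname, 'r');
ns = '';
ln = fgetl(fd);
while ischar(ln)             % fgetl returns -1 (not a char) at end of file
    ns = [ns ' ' ln];        % explicit space keeps words on adjacent lines apart
    ln = fgetl(fd);
end
fclose(fd);
delete(fname);               % clean up the temporary file

ns = strtrim(ns);
disp(ns)
```

Concatenating with square brackets and an explicit space avoids the word-gluing effect of strcat at line boundaries.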

To remove punctuation marks (multiple spaces, etc.) you can use function replace, for example replacing '.' with '', effectively deleting the character:

>> st
st =
they get into an accident .one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares .
>> replace(st,'.','')
ans =
they get into an accident one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares

You can replace multiple substrings at once if you use a cell array for the old strings to replace:

>> nst = replace(st,{',','.',';'},'')
nst = 'they get into an accident one of the guys dies but his girlfriend continues to see him in her life and has nightmares'

To extract words from a string use function strtok:

>> [token, remain] = strtok(nst)
token = they
remain = get into an accident one of the guys dies but his girlfriend continues to see him in her life and has nightmares
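
Calling strtok repeatedly on the remainder walks through the whole string one word at a time. A minimal sketch, using a shortened copy of the string from the example and an assumed cell array name words:

```matlab
nst = 'they get into an accident';   % shortened version of the example string
words = {};
[token, remain] = strtok(nst);
while ~isempty(token)
    words{end + 1} = token;          % append the word just extracted
    [token, remain] = strtok(remain);
end
disp(numel(words))
```

The loop stops when strtok returns an empty token, which happens once only whitespace (or nothing) is left in remain.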

Use the help functions (top right search input, or select – right click on highlighted text in any view, then pick ‘help on selection’) for more information.

Other functions to try out:
lower(), regexprep(), containers.Map();
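
As a hedged example of how lower() and regexprep() can be combined, the sketch below cleans a sentence in three steps. The sample sentence and the variable name clean are assumptions; the regular expressions are one reasonable choice among several.

```matlab
st = 'They get into an accident .One of the guys dies ,  and has NIGHTMARES .';

clean = lower(st);                           % fold everything to lowercase
clean = regexprep(clean, '[^a-z\s]', ' ');   % replace punctuation with spaces
clean = regexprep(clean, '\s+', ' ');        % collapse runs of whitespace
clean = strtrim(clean);                      % drop leading/trailing spaces
disp(clean)
```

Replacing punctuation with a space, rather than deleting it, keeps words such as "accident" and "one" apart when no space surrounds the punctuation mark.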
