CS640 Assignment 3 - Pre-process and clean the data


For this assignment, you will work with the Sentiment140 dataset. The description of the dataset is in the given link. You will be focusing on the first column containing the sentiment of the text (0 = negative, 2 = neutral, 4 = positive) and the last column which contains the text itself.

There are two primary tasks you need to complete for this assignment.

  1. Pre-process and clean the data. Remove HTTP links and usernames.
  2. Train two models for classification:
    1. Fine-tune a pre-trained BERT model for sentiment analysis. You will use the BERT model provided by the HuggingFace library and add additional layers for classification.
    2. Train a Naive Bayes classifier using CountVectorizer.
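The cleaning step in task 1 could be sketched with regular expressions as follows. The patterns are assumptions, not part of the assignment: a link is taken to be anything starting with `http://`, `https://` or `www.`, and a username is `@` followed by word characters. The helper name `clean_tweet` is hypothetical.

```python
import re

# Assumed patterns (not specified by the assignment):
# links begin with http://, https:// or www. and run to the next whitespace;
# usernames are an @ sign followed by letters, digits or underscores.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
USER_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove links and @usernames, then collapse leftover whitespace."""
    text = URL_RE.sub(" ", text)
    text = USER_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()
```

Note that the username pattern also clips e-mail addresses (the part from `@` up to the first dot), so a stricter pattern may be preferable if those matter for your data.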

You cannot use a model that is already trained on Twitter data or that has built-in classification layers. The goal of the assignment is to familiarize yourself with the process of extending pre-trained models for your downstream task.
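One way to extend a pre-trained encoder is to wrap the bare `BertModel` in a small PyTorch module and put your own layers on top of the `[CLS]` token representation. The sketch below is only an illustration under several assumptions: the class name `BertSentimentClassifier`, the single dropout plus linear layer, the dropout rate, the checkpoint name `bert-base-uncased`, and the choice of three output classes (one per sentiment value 0, 2, 4) are all placeholders that you should adapt and justify in your report.

```python
import torch
from torch import nn
from transformers import BertModel

# Loss functions such as CrossEntropyLoss expect class indices 0..N-1,
# so the dataset's sentiment values are remapped (assumed mapping).
LABEL_TO_INDEX = {0: 0, 2: 1, 4: 2}


class BertSentimentClassifier(nn.Module):
    """Bare BERT encoder with a hand-written classification head (sketch)."""

    def __init__(self, encoder: BertModel, num_labels: int = 3, dropout: float = 0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        # The head maps the encoder's hidden size to one logit per class.
        self.classifier = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Position 0 of the last hidden state is the [CLS] token.
        cls_embedding = outputs.last_hidden_state[:, 0]
        return self.classifier(self.dropout(cls_embedding))


def build_model(checkpoint: str = "bert-base-uncased") -> BertSentimentClassifier:
    """Download a general-purpose checkpoint (assumed name) and attach the head."""
    return BertSentimentClassifier(BertModel.from_pretrained(checkpoint))
```

The forward pass returns raw logits, which pair naturally with `torch.nn.CrossEntropyLoss` during fine-tuning; the optimizer, learning rate and batch size are left to you.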

You are encouraged to work in groups. If you choose to do so, mention your teammates in the report.

You will need to submit the following:

  1. Details about the pre-processing (cleaning) step you performed.
  2. A short description of the layers you added and your reasoning, in addition to the training hyperparameters (optimizer, learning rate, batch size, loss function, …).
  3. The performance of your models (BERT and Bayes Classifier) on the Test-CSV provided with the dataset. For this, you can include a classification report. You are encouraged to use library functions to do this.
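For the Naive Bayes part and the classification report, a minimal scikit-learn sketch might look like the following. The function name `train_and_report` is hypothetical, `MultinomialNB` is one reasonable choice of Naive Bayes variant for word counts (the assignment does not name one), and the inputs are assumed to be lists of already cleaned texts with their sentiment labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def train_and_report(train_texts, train_labels, test_texts, test_labels):
    """Fit a bag-of-words Naive Bayes model and return it with a text report."""
    # CountVectorizer turns each text into word counts; MultinomialNB models them.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)
    predictions = model.predict(test_texts)
    # zero_division=0 avoids warnings when a class is never predicted.
    report = classification_report(test_labels, predictions, zero_division=0)
    return model, report
```

The same `classification_report` call can be reused for the BERT model by passing its predicted labels in place of `predictions`.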

https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py
