CSE508 Assignment 5 Solved

Download the 20 Newsgroups dataset. Pick the documents from five classes for text classification: comp.graphics, sci.med, talk.politics.misc, rec.sport.hockey, and sci.space.

Implement the following algorithms for text classification:

  1. Naive Bayes
  2. kNN (vary k=1,3,5)
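As a rough illustration of the two algorithms, here is a minimal sketch in plain Python. It assumes a multinomial Naive Bayes with add-one smoothing and a kNN that ranks neighbours by cosine similarity over sparse term-weight dictionaries; the assignment does not prescribe either choice, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect the counts a multinomial Naive Bayes model needs."""
    class_docs = Counter(labels)            # number of documents per class
    term_counts = defaultdict(Counter)      # class -> term -> occurrences
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab


def predict_naive_bayes(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_docs, term_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total = sum(term_counts[label].values())
        score = math.log(class_docs[label] / n_docs)
        for t in tokens:
            if t in vocab:  # terms never seen in training are ignored
                score += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_knn(train_vecs, train_labels, query_vec, k=3):
    """Majority vote among the k most similar training documents."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(pair[0], query_vec),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Calling `predict_knn` with k set to 1, 3, and 5 on the same vectors covers the required variation in k. The vectors can hold raw counts or TF-IDF weights, whichever the selected features call for.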

Feature selection techniques to be used with both algorithms:

  • TF-IDF
  • Mutual Information

Implementation points:

  • Perform the data pre-processing steps.
  • Split your dataset randomly according to the train:test ratio. Select the documents for each split at random; do not split them in sequential order, for instance by putting the first 800 documents in the train set and the last 200 in the test set for an 80:20 train:test ratio.
  • Implement the TF-IDF scoring technique and mutual information technique for efficient feature selection.
  • For each class, train your Naive Bayes classifier and kNN on the training data.
  • Test your classifiers on the testing data and report the confusion matrix and overall accuracy.
  • Perform the above steps on 50:50, 70:30, and 80:20 training and testing split ratios.
  • Compare and analyze the performance of the two classification algorithms for both feature selection techniques across the different train:test ratios. Use graphs to report the performance comparison, and state the inferences you draw from them. An example graph: the performance of kNN for different values of k.
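The splitting, feature-scoring, and evaluation steps above could be sketched as follows. This is an illustration under stated assumptions, not a reference solution: documents are taken to be pre-tokenized lists, the TF-IDF score of a term is assumed to be its tf * log(N / df) summed over the training documents, mutual information is computed per (term, class) pair from presence/absence counts, and every function name is hypothetical.

```python
import math
import random
from collections import Counter


def random_split(docs, labels, train_fraction, seed=0):
    """Shuffle document indices, then cut at the requested train fraction."""
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * len(idx)))

    def pick(ids):
        return [docs[i] for i in ids], [labels[i] for i in ids]

    return pick(idx[:cut]), pick(idx[cut:])


def tfidf_scores(docs):
    """Score each term by tf * log(N / df), summed over all documents."""
    n = len(docs)
    df = Counter(t for tokens in docs for t in set(tokens))
    scores = Counter()
    for tokens in docs:
        for t, tf in Counter(tokens).items():
            scores[t] += tf * math.log(n / df[t])
    return scores


def mutual_information(docs, labels, term, cls):
    """MI in bits between presence of `term` and membership in `cls`."""
    n = len(docs)
    # Only observed (has_term, in_class) cells appear; empty cells add 0.
    cells = Counter((term in set(tokens), label == cls)
                    for tokens, label in zip(docs, labels))
    mi = 0.0
    for (has_term, in_cls), n_tc in cells.items():
        n_t = sum(v for (ht, _), v in cells.items() if ht == has_term)
        n_c = sum(v for (_, ic), v in cells.items() if ic == in_cls)
        mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi


def confusion_matrix(true_labels, predicted, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of test documents that fall on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total if total else 0.0
```

Calling `random_split` with `train_fraction` set to 0.5, 0.7, and 0.8 gives the three required ratios. One way to turn the scores into a feature set is to keep the top-scoring terms (for mutual information, the top terms per class, merged across the five classes); the number of terms to keep is a tuning choice the assignment leaves open.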
