Description
Download the 20 Newsgroups dataset. Select the documents belonging to five classes (comp.graphics, sci.med, talk.politics.misc, rec.sport.hockey, sci.space) for text classification.
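As a starting point, here is a minimal loader sketch. It assumes scikit-learn is available; its `fetch_20newsgroups` helper downloads the corpus on first use and can restrict it to the five required classes.

```python
# Sketch of one way to obtain the five required classes, assuming
# scikit-learn's built-in downloader for the 20 Newsgroups corpus.
from sklearn.datasets import fetch_20newsgroups

CATEGORIES = ["comp.graphics", "sci.med", "talk.politics.misc",
              "rec.sport.hockey", "sci.space"]

def load_documents(subset="all"):
    # Downloads the corpus on first use; keeps only the five classes and
    # strips headers/footers/quotes so the classifier sees body text only.
    data = fetch_20newsgroups(subset=subset, categories=CATEGORIES,
                              remove=("headers", "footers", "quotes"))
    return data.data, data.target
```

You can also download the raw corpus manually and parse the per-group directories yourself; the helper above simply saves that step.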
Implement the following algorithms for text classification:
- Naive Bayes
- kNN (vary k = 1, 3, 5)
Feature selection techniques to be used with both algorithms:
- TF-IDF
- Mutual Information

Implementation Points:
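The two feature-selection scores can be implemented from scratch. The sketch below is one possible from-scratch version (function names and the toy tokenized documents are illustrative, not prescribed): TF-IDF as term frequency times log inverse document frequency, and mutual information computed per (term, class) pair from the 2x2 document contingency counts.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists -> one {term: tf-idf score} dict per doc."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # tf-idf = (term count / doc length) * log(N / df)
        scores.append({t: (c / len(d)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

def mutual_information(docs, labels, term, cls):
    """MI between term presence and class membership, over documents."""
    n = len(docs)
    n11 = sum(1 for d, y in zip(docs, labels) if term in d and y == cls)
    n10 = sum(1 for d, y in zip(docs, labels) if term in d and y != cls)
    n01 = sum(1 for d, y in zip(docs, labels) if term not in d and y == cls)
    n00 = n - n11 - n10 - n01
    mi = 0.0
    # Sum N_ij/N * log2(N * N_ij / (row_total * col_total)) over the 2x2 table.
    for nij, row, col in ((n11, n11 + n10, n11 + n01),
                          (n10, n11 + n10, n10 + n00),
                          (n01, n01 + n00, n11 + n01),
                          (n00, n01 + n00, n10 + n00)):
        if nij:
            mi += (nij / n) * math.log2(n * nij / (row * col))
    return mi
```

For feature selection, rank the vocabulary by these scores (e.g., the top-k terms by mutual information per class) and keep only the top-scoring terms as features.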
- Perform the data pre-processing steps.
- Split your dataset randomly into training and test sets. The documents must be selected at random for splitting; you are not supposed to split them in sequential order, for instance, putting the first 800 documents in the training set and the last 200 in the test set for a train:test ratio of 80:20.
- Implement the TF-IDF scoring technique and mutual information technique for efficient feature selection.
- For each class, train your Naive Bayes classifier and kNN on the training data.
- Test your classifiers on the test data and report the confusion matrix and overall accuracy.
- Perform the above steps on 50:50, 70:30, and 80:20 training and testing split ratios.
- Compare and analyze the performance of the two classification algorithms above with both feature selection techniques across the different train:test ratios. Use graphs to report the performance comparison, and mention your inferences from the graphs. Example of a graph you can report: the performance of kNN for different values of k.
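The overall evaluation loop for the steps above might look like the following sketch. It assumes scikit-learn is installed and substitutes a tiny inline corpus for the real newsgroup documents; the corpus, labels, and k value shown are illustrative placeholders.

```python
# Sketch of the split/train/evaluate loop across the three required ratios,
# assuming scikit-learn; the inline docs stand in for the 20 Newsgroups data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

docs = ["the goalie stopped the puck", "hockey players skate fast",
        "the rocket reached orbit", "nasa launched a space probe",
        "render the 3d graphics mesh", "the gpu draws each polygon"] * 5
labels = ["hockey", "hockey", "space", "space",
          "graphics", "graphics"] * 5

for test_size in (0.5, 0.3, 0.2):           # 50:50, 70:30, 80:20 splits
    # Random (not sequential) stratified split, as the assignment requires.
    X_tr, X_te, y_tr, y_te = train_test_split(
        docs, labels, test_size=test_size, random_state=0, stratify=labels)
    vec = TfidfVectorizer()
    X_tr_v, X_te_v = vec.fit_transform(X_tr), vec.transform(X_te)
    for name, clf in [("NB", MultinomialNB()),
                      ("kNN k=3", KNeighborsClassifier(n_neighbors=3))]:
        y_pred = clf.fit(X_tr_v, y_tr).predict(X_te_v)
        print(name, test_size, accuracy_score(y_te, y_pred))
        print(confusion_matrix(y_te, y_pred))
```

Collect the accuracies from a loop like this into arrays and plot them (e.g., with matplotlib) to produce the required comparison graphs, such as kNN accuracy versus k for each split ratio.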