COMP472 Mini Project 1 – Experiments with Machine Learning

Experiments with Machine Learning

For this mini-project, you will experiment with different machine learning algorithms and different data sets. As you will use built-in functions from the scikit-learn library, the focus of this mini-project lies more on experimentation and analysis than on implementation.

1 Your Tasks

You will perform 2 tasks: a text classification task to better understand the Multinomial Naive Bayes classifier, and another classification with a variety of types of features to better appreciate how to work with other types of data and machine learning models. You must use:

1. Python 3.8 and the scikit-learn library. Scikit-learn (see http://scikit-learn.org/stable/) provides an interface to program with a variety of different algorithms and built-in datasets. There are plenty of tutorials and examples of code online.

2. GitHub (make sure your project is private while developing).

2 Task 1: Text Classification

1. Download the BBC dataset provided on Moodle. The dataset, created by [Greene and Cunningham, 2006], is a collection of 2225 documents from the BBC news website already categorized into 5 classes: business, entertainment, politics, sport, and tech.

2. Plot the distribution of the instances in each class and save the graphic in a file called BBC-distribution.pdf. You may want to use matplotlib.pyplot and savefig to do this. This pre-analysis of the data set will allow you to determine if the classes are balanced, and which metric is more appropriate to use to evaluate the performance of your classifier.

3. Load the corpus using load_files and make sure you set the encoding to latin1. This will read the file structure and assign the category name to each file from its parent directory name.
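Loading the corpus and plotting the class distribution can be sketched as follows. This is only a minimal sketch: the toy folder tree stands in for the BBC dataset, and the helper name plot_class_distribution, the temporary paths, and the choice of the Agg backend are assumptions made for illustration.

```python
import tempfile
from collections import Counter
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # headless backend so savefig works without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_files


def plot_class_distribution(corpus_dir, out_path):
    """Load a one-folder-per-class corpus and save a bar chart of class sizes."""
    corpus = load_files(str(corpus_dir), encoding="latin1")
    # corpus.target holds integer labels; target_names maps them to folder names.
    counts = Counter(corpus.target_names[t] for t in corpus.target)
    names = sorted(counts)
    fig, ax = plt.subplots()
    ax.bar(names, [counts[n] for n in names])
    ax.set_xlabel("class")
    ax.set_ylabel("number of documents")
    fig.savefig(str(out_path))
    plt.close(fig)
    return dict(counts)


# Toy stand-in for the BBC folder layout (hypothetical content).
root = Path(tempfile.mkdtemp())
toy = {"sport": ["goal scored", "match won"], "tech": ["new phone"]}
for label, texts in toy.items():
    (root / label).mkdir()
    for i, text in enumerate(texts):
        (root / label / f"{i}.txt").write_text(text, encoding="latin1")

pdf_path = Path(tempfile.mkdtemp()) / "BBC-distribution.pdf"
counts = plot_class_distribution(root, pdf_path)
```

With the real dataset, corpus_dir would point at the extracted BBC folder and the bar heights show at a glance whether the five classes are balanced.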

4. Pre-process the dataset to have the features ready to be used by a multinomial Naive Bayes classifier. This means that the frequency of each word in each class must be computed and stored in a term-document matrix. For this, you can use feature_extraction.text.CountVectorizer.

5. Split the dataset into 80% for training and 20% for testing. For this, you must use train_test_split with the parameter random_state set to None.

6. Train a multinomial Naive Bayes classifier (naive_bayes.MultinomialNB) on the training set using the default parameters and evaluate it on the test set.
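The vectorize, split, and train sequence described above can be sketched on a tiny in-memory corpus. This is an illustrative sketch, not the expected solution: the documents and labels are made up, and only two classes are used to keep it short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the BBC documents.
docs = [
    "shares rise as profits grow", "bank cuts interest rates",
    "market falls on weak sales", "oil prices hit record",
    "firm reports strong profits", "striker scores late goal",
    "team wins cup final", "coach praises young players",
    "injury rules out captain", "club signs new striker",
]
labels = ["business"] * 5 + ["sport"] * 5

# Build the document-term count matrix (rows: documents, columns: words).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# 80/20 split; random_state=None means the split differs between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=None
)

# Default parameters, i.e. add-one smoothing (alpha=1.0).
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

Because random_state is None, the documents that end up in the test set change from one execution to the next, so the reported scores can vary between executions of the whole script.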

7. In a file called bbc-performance.txt, save the following information (to make it easier for the TAs, make sure that your output for each sub-question below is clearly marked in your output file, using the headings (a), (b) …):

  (a) a clear separator (a sequence of hyphens or stars) and a string clearly describing the model (e.g. “MultinomialNB default values, try 1”)
  (b) the confusion matrix (you can use confusion_matrix)
  (c) the precision, recall, and F1-measure for each class (you can use classification_report)
  (d) the accuracy, macro-average F1 and weighted-average F1 of the model (you can use accuracy_score and f1_score)
  (e) the prior probability of each class
  (f) the size of the vocabulary (i.e. the number of different words [1])
  (g) the number of word-tokens in each class (i.e. the number of words in total [2])
  (h) the number of word-tokens in the entire corpus
  (i) the number and percentage of words with a frequency of zero in each class
  (j) the number and percentage of words with a frequency of zero in the entire corpus
  (k) your 2 favorite words (that are present in the vocabulary) and their log-prob
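One possible way to obtain items (b) to (k) from a fitted model is sketched below. It is a sketch under assumptions: the four documents are made up, the model is evaluated on its own training documents only to keep the example short, and the counts are taken over the documents the model was fitted on (the handout does not say whether the training split or the full corpus is intended).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score)
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus.
docs = ["cheap stocks rally", "stocks fall",
        "team wins match", "match ends in draw"]
labels = ["business", "business", "sport", "sport"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
model = MultinomialNB().fit(X, labels)
y_pred = model.predict(X)  # toy shortcut: a real run predicts on the test set

cm = confusion_matrix(labels, y_pred)                       # (b)
report = classification_report(labels, y_pred)              # (c)
accuracy = accuracy_score(labels, y_pred)                   # (d)
macro_f1 = f1_score(labels, y_pred, average="macro")        # (d)
weighted_f1 = f1_score(labels, y_pred, average="weighted")  # (d)

priors = np.exp(model.class_log_prior_)                     # (e)
vocab_size = len(vectorizer.vocabulary_)                    # (f)
# feature_count_ has one row per class (in model.classes_ order)
# and one column per vocabulary word.
tokens_per_class = model.feature_count_.sum(axis=1)         # (g)
total_tokens = int(X.sum())                                 # (h)
zero_per_class = (model.feature_count_ == 0).sum(axis=1)    # (i)
zero_pct_per_class = 100 * zero_per_class / vocab_size
zero_in_corpus = int((model.feature_count_.sum(axis=0) == 0).sum())  # (j)

# (k) log-probability of one word in each class.
word_index = vectorizer.vocabulary_["stocks"]
stocks_log_prob = model.feature_log_prob_[:, word_index]
```

In this toy run the vocabulary has 10 words, the business class has 5 word-tokens and the sport class has 7, and with the default smoothing the probability of "stocks" given business is (2 + 1) / (5 + 10) = 0.2.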

8. Redo steps 6 and 7 without changing anything (do not redo step 5, the dataset split). Change the model name to something like “MultinomialNB default values, try 2” and append the results to the file bbc-performance.txt.

9. Redo steps 6 and 7 again, but this time, change the smoothing value to 0.0001. Append the results at the end of bbc-performance.txt.

10. Redo steps 6 and 7, but this time, change the smoothing value to 0.9. Append the results at the end of bbc-performance.txt.
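The repeated runs with different smoothing values can be sketched as a loop over configurations that reuses one fixed split and appends to the output file. This is a minimal sketch: the toy documents, the fixed first-six/last-two split, the temporary output path, and the reduced report (only the heading and accuracy) are assumptions for illustration. In scikit-learn the smoothing value of MultinomialNB is the alpha parameter.

```python
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; the last two documents play the role of the test set.
docs = ["profits rise", "bank cuts rates", "market falls",
        "team wins final", "striker scores goal", "coach praises team",
        "bank profits fall", "striker wins match"]
labels = ["business"] * 3 + ["sport"] * 3 + ["business", "sport"]

X = CountVectorizer().fit_transform(docs)
X_train, X_test = X[:6], X[6:]
y_train, y_test = labels[:6], labels[6:]

runs = [
    ("MultinomialNB default values, try 1", {}),
    ("MultinomialNB default values, try 2", {}),
    ("MultinomialNB alpha=0.0001", {"alpha": 0.0001}),
    ("MultinomialNB alpha=0.9", {"alpha": 0.9}),
]

out_path = Path(tempfile.mkdtemp()) / "bbc-performance.txt"
accuracies = []
for title, params in runs:
    model = MultinomialNB(**params).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    accuracies.append(acc)
    # Mode "a" appends, so earlier results stay in the file.
    with open(out_path, "a", encoding="utf-8") as f:
        f.write("-" * 40 + "\n")
        f.write(f"(a) {title}\n")
        f.write(f"(d) accuracy: {acc:.4f}\n")
```

Fitting MultinomialNB involves no random initialization, so on an unchanged split the two default runs produce identical numbers; only the split and the alpha value can change the outcome.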

11. In a separate plain text file called bbc-discussion.txt, explain in 1 to 2 paragraphs:

  (a) what metric is best suited to this dataset/task and why (see step (2))
  (b) why the performance of steps (8-10) is the same as or different from that of step (7) above.

In total, you should have 3 output files for task 1: bbc-distribution.pdf, bbc-performance.txt, and bbc-discussion.txt.
[1] For example, if the word potato appears 3 times, you only count it once.

[2] For example, if the word potato appears 3 times, you count it 3 times.

3 Task 2: Drug Classification

1. Download the Drug dataset on Moodle. This dataset, in CSV format, contains features that are numerical, categorical and ordinal as well as one of 5 classes to predict: DrugA, DrugB, DrugC, DrugX, or DrugY.

2. Load the dataset in Python (you can use pandas.read_csv).

3. Plot the distribution of the instances in each class and store the graphic in a file called drug-distribution.pdf. You can use matplotlib.pyplot. This pre-analysis will allow you to determine if the classes are balanced, and which metric is more appropriate to use to evaluate the performance of your classifier.

4. Convert all ordinal and nominal features to numerical format. Make sure that your converted format respects the ordering of ordinal features, and does not introduce any ordering for nominal features. You may want to take a look at pandas.get_dummies and pandas.Categorical to do this.
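The difference between the two encodings can be sketched on a small hand-made table. The column names (Sex, BP, Age) and their values are assumptions for illustration and may not match the actual Drug dataset.

```python
import pandas as pd

# Hypothetical rows; the real dataset would come from pandas.read_csv.
df = pd.DataFrame({
    "Age": [23, 47, 61],
    "Sex": ["F", "M", "M"],           # nominal: no natural order
    "BP": ["HIGH", "LOW", "NORMAL"],  # ordinal: LOW < NORMAL < HIGH
})

# Ordinal feature: declare the order explicitly, then use the integer codes.
bp_order = ["LOW", "NORMAL", "HIGH"]
df["BP"] = pd.Categorical(df["BP"], categories=bp_order, ordered=True).codes

# Nominal feature: one indicator column per value, so no order is implied.
df = pd.get_dummies(df, columns=["Sex"])
```

After this, BP holds 2, 0, 1 for HIGH, LOW, NORMAL, and Sex is replaced by the indicator columns Sex_F and Sex_M.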

5. Split the dataset using train_test_split with the default parameter values.

6. Run 6 different classifiers:

  (a) NB: a Gaussian Naive Bayes classifier (naive_bayes.GaussianNB) with the default parameters.
  (b) Base-DT: a Decision Tree (tree.DecisionTreeClassifier) with the default parameters.
  (c) Top-DT: a better performing Decision Tree found using grid search (GridSearchCV). The grid search will allow you to find the best combination of hyper-parameters, as determined by the evaluation function that you chose in step (3) above. The hyper-parameters that you will experiment with are:

    • criterion: gini or entropy
    • max_depth: 2 different values of your choice
    • min_samples_split: 3 different values of your choice

  (d) PER: a Perceptron (linear_model.Perceptron), with default parameter values.
  (e) Base-MLP: a Multi-Layered Perceptron (neural_network.MLPClassifier) with 1 hidden layer of 100 neurons, sigmoid/logistic as activation function, stochastic gradient descent, and default values for the rest of the parameters.
  (f) Top-MLP: a better performing Multi-Layered Perceptron found using grid search. For this, you need to experiment with the following parameter values:

    • activation function: sigmoid, tanh, relu and identity
    • 2 network architectures of your choice: e.g. 2 hidden layers with 30 + 50 nodes, 3 hidden layers with 10 + 10 + 10

    • solver: Adam and stochastic gradient descent
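The grid searches for Top-DT and Top-MLP can be set up roughly as follows. This is a sketch under assumptions: the iris data stands in for the Drug dataset, the specific max_depth and min_samples_split values are arbitrary picks, and scoring="f1_macro" is just one possible evaluation function. Only the decision tree search is actually fitted here, since the MLP search is considerably slower.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; the assignment uses the encoded Drug dataset instead.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Top-DT: 2 criteria x 2 depths x 3 split sizes = 12 candidates.
dt_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 6],
    "min_samples_split": [2, 5, 10],
}
dt_search = GridSearchCV(DecisionTreeClassifier(), dt_grid, scoring="f1_macro")
dt_search.fit(X_train, y_train)
top_dt = dt_search.best_estimator_
best_dt_params = dt_search.best_params_

# Base-MLP: 1 hidden layer of 100 neurons, logistic activation, SGD solver.
base_mlp = MLPClassifier(hidden_layer_sizes=(100,), activation="logistic",
                         solver="sgd")

# Top-MLP grid: 4 activations x 2 architectures x 2 solvers = 16 candidates.
mlp_grid = {
    "activation": ["logistic", "tanh", "relu", "identity"],
    "hidden_layer_sizes": [(30, 50), (10, 10, 10)],
    "solver": ["adam", "sgd"],
}
n_mlp_candidates = len(ParameterGrid(mlp_grid))
mlp_search = GridSearchCV(MLPClassifier(), mlp_grid, scoring="f1_macro")
# mlp_search.fit(X_train, y_train) would run the search (not executed here).
```

In scikit-learn the sigmoid activation is named "logistic", and best_params_ gives the hyper-parameter combination to report for the Top models.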

7. For each of the 6 classifiers above, append the following information to a file called drugs-performance.txt (to make it easier for the TAs, make sure that your output for each sub-question below is clearly marked in your output file, using the headings (a), (b) …):

  (a) a clear separator (a sequence of hyphens or stars) and a string clearly describing the model (e.g. the model name + hyper-parameter values that you changed). In the case of Top-DT and Top-MLP, display the best hyper-parameters found by the grid search.
  (b) the confusion matrix
  (c) the precision, recall, and F1-measure for each class
  (d) the accuracy, macro-average F1 and weighted-average F1 of the model

8. Redo step 6 ten times for each model and append the average accuracy, average macro-average F1, and average weighted-average F1, as well as the standard deviation of each of these three metrics, at the end of the file drugs-performance.txt. Does the same model give you the same performance every time? Explain in a plain text file called drugs-discussion.txt. A 1 or 2 paragraph discussion is expected.
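Collecting the averages and standard deviations over ten runs can be sketched as below. This is an illustrative sketch: the iris data stands in for the Drug dataset, the helper name repeat_runs is hypothetical, only two of the six models are shown, and the split is made once and reused on the reading that only step 6 is repeated.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; split once and reuse it for every run.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)


def repeat_runs(make_model, n_runs=10):
    """Train a fresh model n_runs times; return per-metric mean and std."""
    rows = []
    for _ in range(n_runs):
        y_pred = make_model().fit(X_train, y_train).predict(X_test)
        rows.append([
            accuracy_score(y_test, y_pred),
            f1_score(y_test, y_pred, average="macro"),
            f1_score(y_test, y_pred, average="weighted"),
        ])
    rows = np.array(rows)
    # np.std defaults to the population standard deviation (ddof=0).
    return rows.mean(axis=0), rows.std(axis=0)


nb_mean, nb_std = repeat_runs(GaussianNB)
dt_mean, dt_std = repeat_runs(DecisionTreeClassifier)
```

Each returned array holds accuracy, macro-average F1, and weighted-average F1 in that order. GaussianNB has no random component, so its standard deviations come out as zero on a fixed split, which is a useful baseline when comparing against models that use random initialization or tie-breaking.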

In total, you should have 3 output files for task 2: drug-distribution.pdf, drug-performance.txt, and drug-discussion.txt.

4 Deliverables

The submission of the mini-project will consist of 2 deliverables:

  1. The code & the output files:
    • Submit all files necessary to run your code. If you used a Jupyter notebook, submit the .ipynb files; otherwise submit the .py files.
    • Submit a readme.md which will contain specific and complete instructions on how to run your experiments. You do not need to submit the datasets. If the instructions in your readme file do not work, are incomplete or a file is missing, you will not be given the benefit of the doubt.
    • Submit the 6 output files: bbc-distribution.pdf, bbc-performance.txt, bbc-discussion.txt and drug-distribution.pdf, drug-performance.txt, drug-discussion.txt.
  2. The demo (8 min presentation & Q/A)

    You will have to demo your mini-project for ≈ 12 minutes. Regardless of the demo time, you will demo the program that was uploaded as the official submission on or before the due date. The schedule of the demos will be posted on Moodle. The demos will consist of 2 parts: a presentation ≈ 8 minutes and a Q/A part (≈ 4 minutes). Note that the demos will be done via Zoom and will be recorded.

    Prepare an 8-minute presentation to analyse and compare the performance of your models. The intended audience of your presentation is your TAs. Hence there is no need to explain the theory behind the models. Your presentation should focus on your work and the comparison of the performance of the models when the hyper-parameters are modified.

    Your presentation should contain at least the following:

    • An analysis of the initial dataset given on Moodle. If there is anything particular about these datasets that might have an impact on the performance of some models, explain it.
    • An analysis of the results of all the models with the data sets. In particular, compare and contrast the performance of each model with one another, and with the datasets. Please note that your presentation must be analytical. This means that in addition to stating the facts (e.g. the macro-F1 has this value), you should also analyse them (i.e. explain why some metric seems more appropriate than another, or why your model did not do as well as expected). Tables, graphs and contingency tables to back up your claims would be very welcome here.
    • In the case of team work, a description of the responsibilities and contributions of each team member.

    Any material used for the presentation (slides, …) must be uploaded on EAS before the due date.

      After your presentation, your TA will proceed with a ≈ 4 minute question period. Each student will be asked questions on the code/mini-project, and he/she will be required to answer the TA satisfactorily. In particular, each member should know what each parameter that you experimented with represents and its effect on the performance. Hence every member of the team is expected to attend the demo.

      In addition, your TA may give you a new dataset and ask you to train or run your models on this dataset. The output file generated by your program will have to be uploaded on EAS during your demo.

5 Evaluation Scheme

Students in teams can be assigned different grades based on their individual contribution to the project.

Individual grades will count for 15% and will be based on:

1. a peer-evaluation done after the submission.

2. the contribution of each student as indicated on GitHub.

3. the Q/A of each student during the demo (correct and clear answers to questions, knowledge of the program, . . . ).

The team grade will count for 85% and will be based on:

Code                             45   functionality, proper use of the datasets, design, programming style, …
Output files with given dataset  15   format, correctness and depth of discussion (x-distribution.pdf, x-performance.txt, and x-discussion.txt)
Demo – Presentation              15   depth of the analysis, clarity and conciseness, presentation, time-management, …
Output with demo-dataset         10   correctness and format
Total                            85
