CSI5155 Assignment 2 – Evaluation of Learning


The aim of this assignment is threefold.

First, we will explore the impact of resampling on model construction and model quality.

Second, we will continue to explore more supervised learning algorithms. Third, we will focus on the evaluation of the results of learning.

Context – In assignment 1, we used the Drug Consumption dataset from the UCI Machine Learning Repository to construct binary classification models to explore an individual’s risk of drug consumption and misuse. We constructed models using four (4) different learners, namely a single decision tree (DT), a random forest (RF) learner, a support vector machine (SVM), and a k‐nearest neighbor (k-NN) classifier, using the hold-out method of evaluation.  

Topic: Supervised learning and Evaluation of Learning

For assignment 2, you should select the dataset that obtained the highest overall accuracy in assignment 1 when using the hold-out method. We refer to this dataset as dataset D.

In all your evaluations, use the tenfold cross-validation approach. This implies that you will need to rerun the four algorithms against dataset D prior to completing the following tasks.
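As an illustration, a tenfold cross-validation run over the four assignment-1 learners might look like the following (a minimal sketch assuming scikit-learn; `X` and `y` are placeholders standing in for the preprocessed dataset D):

```python
# Sketch: ten-fold cross-validation of the four assignment-1 learners.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Placeholder, imbalanced data; replace with the Drug Consumption dataset D.
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

learners = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(random_state=0),
    "k-NN": KNeighborsClassifier(),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = {name: cross_val_score(clf, X, y, cv=cv).mean()
              for name, clf in learners.items()}
```

`StratifiedKFold` keeps the class ratio of D roughly constant across the ten folds, which matters here because D is imbalanced.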

Answer the following questions.

  1. Implement one (1) over-sampling method to convert dataset D to a balanced dataset DB1.

  2. Retrain the four (4) classification algorithms (DT, k-NN, SVM, and RF) using dataset DB1.

  3. Implement one (1) undersampling method to convert dataset D to a balanced dataset DB2.

  4. Retrain the four (4) classification algorithms (DT, k-NN, SVM, and RF) against dataset DB2.

  5. Use the multi-layer perceptron (MLP) algorithm and the gradient boosting (GB) ensemble to construct models against datasets D, DB1, and DB2. You should aim to produce the highest possible accuracies for the algorithms, through parameter tuning.
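The parameter tuning for MLP and GB can be organised with a grid search; the grids below are purely illustrative, not the values the assignment prescribes:

```python
# Sketch: tuning MLP and gradient boosting with ten-fold grid search.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder data; run against D, DB1, and DB2 in turn.
X, y = make_classification(n_samples=200, random_state=0)

searches = {
    "MLP": GridSearchCV(
        MLPClassifier(max_iter=500, random_state=0),
        {"hidden_layer_sizes": [(50,), (100,)], "alpha": [1e-4, 1e-3]},
        cv=10),
    "GB": GridSearchCV(
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": [100, 200], "learning_rate": [0.05, 0.1]},
        cv=10),
}
best = {name: gs.fit(X, y).best_score_ for name, gs in searches.items()}
```

`best_score_` is the mean cross-validated accuracy of the best parameter combination, which is the number to report in the comparison table later.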

Steps 1 to 5, as listed above, will result in three different sets of experiments:

  • (A) – models built against the original dataset D, using ten-fold cross-validation,
  • (B) – models built against the over-sampled dataset DB1, and
  • (C) – models built against the under-sampled dataset DB2.

 

  6. Next, apply the six (6) algorithms to the following two (2) datasets: the labor-relations and heart-disease datasets. You should aim to produce the highest possible accuracies for the algorithms, through parameter tuning.

 

 

  7. Create a table to show the accuracies of the six (6) algorithms against the five (5) datasets, namely the three (3) different versions of the drug consumption dataset (datasets D, DB1 and DB2), as well as the labor-relations and heart-disease datasets. Show the steps you followed to determine whether there are any statistically significant differences between the results using Friedman’s test, when α = 0.05. If you find a significant difference, then show how you used the Nemenyi post hoc test to determine the critical differences. [10 marks]
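The statistical comparison can be sketched as follows, using scipy's Friedman test and the Nemenyi critical-difference formula from Demšar (2006); the accuracy table below is entirely made up for illustration:

```python
# Sketch: Friedman test over a 5-datasets x 6-algorithms accuracy table,
# followed by the Nemenyi critical difference (CD) at alpha = 0.05.
import math
from scipy.stats import friedmanchisquare

# Rows: datasets (D, DB1, DB2, labor-relations, heart-disease);
# columns: DT, RF, SVM, k-NN, MLP, GB. Illustrative numbers only.
acc = [
    [0.78, 0.84, 0.81, 0.75, 0.83, 0.86],
    [0.80, 0.87, 0.83, 0.77, 0.85, 0.88],
    [0.76, 0.82, 0.80, 0.74, 0.81, 0.84],
    [0.88, 0.93, 0.90, 0.86, 0.92, 0.94],
    [0.79, 0.85, 0.82, 0.78, 0.84, 0.87],
]
# One sample per algorithm: its accuracies across the five datasets.
stat, p = friedmanchisquare(*[[row[j] for row in acc] for j in range(6)])

k, n = 6, 5          # number of algorithms, number of datasets
q_alpha = 2.850      # studentized-range value for k=6, alpha=0.05 (Demsar 2006)
cd = q_alpha * math.sqrt(k * (k + 1) / (6 * n))

significant = p < 0.05  # if True, compare mean ranks pairwise against cd
```

If the Friedman test rejects the null hypothesis, two algorithms differ significantly whenever their mean ranks differ by more than `cd`; a critical-difference diagram is the usual way to present this.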

 

  8. Write a 300 to 400 word summary discussing the lessons you learned during this assignment. Your answer should focus on the results obtained when comparing the different sampling methods, while using the various algorithms. [20 marks]

 
