CSCI5260 Project 4 – Guess What?

Background

You now work for a prominent winery that has hired you to predict the quality of the wine it produces, based on already-collected data. The winery collects two main sets of data: one on the white wines it produces (winequality-white.csv, n=4898), and one on the red wines it produces (winequality-red.csv, n=1599).

Data Description

Data are in two files: winequality-white.csv (4898 rows x 12 columns) and winequality-red.csv (1599 rows x 12 columns).

Input Variables

These input variables are based on physicochemical tests that occur regularly.

  1. fixed acidity         Range: 3.8 to 15.9
  2. volatile acidity      Range: 0.08 to 1.58
  3. citric acid           Range: 0 to 1.66
  4. residual sugar        Range: 0.9 to 65.8
  5. chlorides             Range: 0.009 to 0.611
  6. free sulfur dioxide   Range: 1 to 289
  7. total sulfur dioxide  Range: 6 to 440
  8. density               Range: 0.98711 to 1.03898
  9. pH                    Range: 2.72 to 4.01
  10. sulphates            Range: 0.22 to 2.0
  11. alcohol              Range: 8 to 14.9

Output Variable

  1. quality               Range: 0 to 10
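
As a small illustration of how the input columns and the quality column might be separated, here is a minimal sketch. The two-row sample, the reduced column set, and the helper name split_features_target are made up for the example. The semicolon separator is an assumption based on the public UCI distribution of this data set, so check the files you were given before relying on it.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for one of the CSV files.
# Assumption: semicolon-separated values with quoted column names.
SAMPLE_CSV = '''"fixed acidity";"volatile acidity";"quality"
7.4;0.7;5
7.8;0.88;6
'''


def split_features_target(df, target_col="quality"):
    """Return (features, target) without modifying the original frame."""
    target = df[target_col]
    features = df.drop(columns=[target_col])
    return features, target


data = pd.read_csv(io.StringIO(SAMPLE_CSV), sep=";")
features, target = split_features_target(data)
print(features.shape, target.tolist())
```

With the real files, the io.StringIO object would be replaced by the file name, and the features frame would have 11 columns.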

Part 1 – Unsupervised Learning

Coding and Analysis Requirements

Create a file called project4_clustering.py. Write a program that does the following:

  1. Read winequality-white.csv and winequality-red.csv into two separate Pandas data frames.
    1. Reference: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv
  2. Create a target_white data frame and a target_red data frame by selecting the data’s last column (the 'quality' column) and storing it there. For example: target_red = data_red['quality']. After copying the column, be sure to use the drop function to remove it from the original data.
    1. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html?highlight=drop#pandas.DataFrame.drop
  3. Using sklearn.cluster.KMeans, run the k-means clustering algorithm on the white wines and the red wines. You should use 11 clusters because we know there are 11 quality metrics (labeled 0-10).
    1. See https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html.
    2. Note that the fit function returns the fitted estimator, which exposes the following attributes:
      1. cluster_centers_: ndarray of shape (n_clusters, n_features)
        1. Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.
      2. labels_: ndarray of shape (n_samples,)
        1. Labels of each point.
      3. inertia_: float
        1. Sum of squared distances of samples to their closest cluster center.
      4. n_iter_: int
        1. Number of iterations run.
  4. Analyze the results for the white wine and the red wine examples. Add a discussion to the Project4.docx writeup document. Remember that the cluster labels ARE NOT predictions of quality. The label is simply the grouping to which an example belongs. To analyze this you should:
    1. Write a procedure that determines the quality for each cluster by averaging the qualities of all items in that cluster.
    2. This is OPEN-ENDED but you should use this information to plot the quality values for each cluster. Include these plots in your docx writeup.
    3. Does the data indicate 11 clearly-defined quality metrics? Explain why it does or does not.
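
The clustering and per-cluster averaging steps above can be sketched roughly as follows. This is only one possible approach, not a reference solution. The random data, the column names f1 to f4, and the helper name mean_quality_per_cluster are assumptions made so the example runs without the CSV files.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def mean_quality_per_cluster(features, target, n_clusters=11, seed=0):
    """Fit k-means, then average the known quality scores in each cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(features)
    # Group the quality values by cluster label; the label itself is not a quality.
    means = target.groupby(labels).mean()
    means.index.name = "cluster"
    return km, means


# Synthetic stand-in data (assumption): 200 samples, 4 features, qualities 3 to 8.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f1", "f2", "f3", "f4"])
target = pd.Series(rng.integers(3, 9, size=200), name="quality")

km, cluster_quality = mean_quality_per_cluster(features, target)
print(cluster_quality.round(2))
print("inertia:", km.inertia_, "iterations:", km.n_iter_)
```

The resulting Series can be passed to a plotting call such as cluster_quality.plot(kind="bar") if matplotlib is installed. If the averages for most clusters sit close together, that is one sign that the clusters do not correspond to distinct quality values.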

Part 2 – Supervised Learning

Coding and Analysis Requirements

Create a file called project4_ml.py. Using the same data set as above, do the following.

  1. Combine the data sets into a single data set.
    1. To do this, add a column called “type” to each data frame.
    2. Set red wine as type 0 and white wine as type 1.
  2. Split the data into train and test sets.
  3. Train and test two of the following learning algorithms from the scikit-learn library. Be sure to use the same train and test data for each.
    1. Decision Tree Classifier – https://scikit-learn.org/stable/modules/tree.html#classification
    2. Linear Regression Classifier – https://scikit-learn.org/stable/modules/linear_model.html#generalized-linear-regression
    3. Gaussian Naïve Bayes Classifier – https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes
    4. Nearest Neighbor Classifier – https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification
    5. Support Vector Machine – https://scikit-learn.org/stable/modules/svm.html#classification
  4. Analyze the results by showing the following (add your analysis to Project4.docx):
    1. Which classification method performed better?
      1. You should measure the number of true negatives, true positives, false negatives, and false positives. If you want to drill down, it might be helpful to track this by class.
    2. Based on the results, what could you do to improve performance?
      1. Keep in mind the ideas of feature engineering and feature scaling as you respond to this.
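
One way the Part 2 workflow might look is sketched below, using a decision tree and Gaussian naive Bayes as the two example models. The synthetic data, the three feature names, and the helpers fake_wine, combine, and per_class_counts are assumptions for illustration. The per-class counts are derived from scikit-learn's confusion_matrix, where rows are true classes and columns are predicted classes.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def combine(red, white):
    """Add a 'type' column (red = 0, white = 1) and stack the two frames."""
    return pd.concat([red.assign(type=0), white.assign(type=1)], ignore_index=True)


def per_class_counts(y_true, y_pred):
    """Return TP, FP, FN, TN for every class, treating each class one-vs-rest."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as this class, but actually another
    fn = cm.sum(axis=1) - tp  # actually this class, but predicted as another
    tn = cm.sum() - tp - fp - fn
    return pd.DataFrame({"TP": tp, "FP": fp, "FN": fn, "TN": tn}, index=labels)


# Synthetic stand-in data (assumption); the real frames have 11 input columns.
rng = np.random.default_rng(0)


def fake_wine(n):
    df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["alcohol", "pH", "density"])
    df["quality"] = rng.integers(4, 8, size=n)
    return df


data = combine(fake_wine(60), fake_wine(140))
X = data.drop(columns=["quality"])
y = data["quality"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {"tree": DecisionTreeClassifier(random_state=0), "gnb": GaussianNB()}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # same train/test split for every model
    results[name] = per_class_counts(y_test, model.predict(X_test))
    print(name, results[name], sep="\n")
```

Fixing random_state in train_test_split is one simple way to guarantee that every model, including the one in Part 3, sees the same train and test sets.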

Part 3 – Deep Learning

Coding and Analysis Requirements

Create a file called project4_nn.py.

  1. Use the combined data set from Part 2, and the same train and test sets.
  2. Train and test a Multilayer Perceptron Neural Network Classifier (MLPClassifier).
    1. https://scikit-learn.org/stable/modules/neural_networks_supervised.html#classification
  3. Analyze the results (recording the analysis in Project4.docx) by:
    1. Showing true negatives, true positives, false negatives, and false positives. If you want to drill down, you might track this by class to better analyze results.
    2. Comparing the results to the models trained above.
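
A minimal sketch of the Part 3 training and evaluation step is shown below. The random data, the single hidden layer of 32 units, and max_iter=500 are assumptions chosen only so the example runs quickly; they are not recommended settings. The StandardScaler step is included because multilayer perceptrons tend to be sensitive to feature scale.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 12 input columns, qualities 4 to 7.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
y = rng.integers(4, 8, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale the features, then train the network; the scaler is fit on the training set only.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=np.unique(y))  # rows: true, columns: predicted
acc = accuracy_score(y_test, y_pred)
print(cm)
print("accuracy:", acc)
```

The confusion matrix here has one row and one column per quality value, so per-class true and false positives and negatives can be read from it in the same way as for the Part 2 models.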