Description

The dataset for this assignment contains the prices and other attributes of 50,000 diamonds. A pre-processed and cleaned version of the dataset is available on Moodle for download. Your task is to perform hypothesis testing, regression and classification on the dataset. Submit an R Markdown report summarising your findings together with the source code. Check Moodle for the deadline. This assignment is part of the continuous assessment and is worth 30% of your module grade.

2 Dataset

First download the dataset from Moodle. As the dataset contains 50K records, generating the plots may take a few moments. One option is to start with a small sample and carry out the analysis on that: for example, you can pick 10,000 observations (without replacement) using the function sample. Run the following code to do so:

s <- sample(nrow(diamonds.dataset), size = 10000, replace = FALSE, prob = NULL)
diamonds.subset <- diamonds.dataset[s, ]

The above piece of code creates a new dataset named diamonds.subset containing 10,000 observations from the diamond dataset. You can use the sampled dataset (diamonds.subset) first to write and test your code, and then use the full dataset to complete the tasks given below. REMEMBER! You must report your findings on the full dataset. In your final report, there is no need to include your findings on the sampled dataset. You may already be familiar with the dataset; when you load it, you will find the following variables:

carat: weight of the diamond (0.2 to 5.01)
cut: quality of the cut (Fair, Good, Very Good, Premium)
color: diamond color, from D (best) to J (worst)
clarity: a measurement of how clear the diamond is, from I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, to IF (best)
table: width of the top of the diamond
x: length in mm
y: width in mm
z: depth in mm
depth: total depth percentage = 2*z/(x+y)
price: price in US dollars

3 Tasks

Complete the following tasks.

3.1 Task A

Before you start the statistical analysis, you have to define the hypotheses that will be tested. You should state at least 2 different hypotheses, each testing different data (so the hypotheses should not all check the same statement just on different variables). Remember that there are different types of tests and you should use as many as you can (provided they are valid and make sense). Your ultimate goal is to report some findings. You should also show that these findings are statistically valid. Take the points below as hints but do not limit yourself to these:

Look at the different plots you have created during exploratory analysis. What conclusions can be drawn based on these? These could become your hypotheses.
If you focus on one attribute, what is your intuition about the distribution that could explain such results? You can check and measure how well the data fits some distribution.
For each valid hypothesis test you will get 15 marks. This section consists of 30 marks in total.
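As a rough illustration of what "different types of tests" can look like in base R, the sketch below runs a distribution check, a two-group comparison and a test of association. It uses a simulated stand-in data frame called toy (an assumption made so the snippet runs on its own; it is not the Moodle data), and the particular choice of tests is only one possibility.

```r
# Simulated stand-in data; column names follow the variable list above.
set.seed(1)
n <- 2000
toy <- data.frame(
  cut   = factor(sample(c("Fair", "Good", "Very Good", "Premium"), n, replace = TRUE)),
  color = factor(sample(LETTERS[4:10], n, replace = TRUE)),  # D to J
  price = exp(rnorm(n, mean = 8, sd = 1))
)

# 1) Distribution fit: is log(price) compatible with a normal distribution?
#    shapiro.test() accepts at most 5000 values, so the full dataset
#    would need a subsample (or a different test).
sw <- shapiro.test(log(toy$price))

# 2) Two-group comparison of means: Welch t-test on log(price), Fair vs Premium.
two <- droplevels(subset(toy, cut %in% c("Fair", "Premium")))
tt  <- t.test(log(price) ~ cut, data = two)

# 3) Association between two categorical variables: chi-squared test.
cs <- chisq.test(table(toy$cut, toy$color))

# Collect the p-values for reporting.
c(shapiro = sw$p.value, welch = tt$p.value, chisq = cs$p.value)
```

Each test has its own assumptions (for example, the chi-squared test needs reasonably large expected counts), so the report should state why a given test is appropriate before quoting its p-value.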

Remember that data analysis is not only about finding and proving hypotheses but also about summarising data and communicating it.

It is not a failure if you do not get "significant" results; you still have to report that. If your analysis makes sense (e.g. it is valid from the statistical point of view), there is no such thing as a bad result. Present your analysis in the form of a report. Each hypothesis should be described, and you should state what you want to prove. If you are claiming that groups have different characteristics, first show these on plots and comment on them. The report should be written so that a person without prior knowledge of the data is able to follow it.
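The "plot first, then test" workflow described above might look like the following sketch. It again uses simulated stand-in data (toy is an assumption, not the Moodle file), and the Kruskal-Wallis test is just one reasonable choice for comparing more than two groups without assuming normality.

```r
# Simulated stand-in data with four cut groups.
set.seed(2)
n    <- 1500
cuts <- c("Fair", "Good", "Very Good", "Premium")
toy  <- data.frame(cut = factor(sample(cuts, n, replace = TRUE), levels = cuts))
# Give each simulated group a slightly different typical price.
toy$price <- exp(rnorm(n, mean = 7.5 + 0.1 * as.integer(toy$cut), sd = 0.8))

# Show the groups first ...
boxplot(price ~ cut, data = toy, log = "y",
        xlab = "cut", ylab = "price (USD, log scale)")

# ... then summarise and test them.
medians <- tapply(toy$price, toy$cut, median)
kw      <- kruskal.test(price ~ cut, data = toy)
medians
kw
```

In a report, the plot and the group medians would be commented on before the test result, so that a reader can see what difference is being tested.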

3.2 Task B

Divide the dataset into training and test data. Use a 75/25 split.
Perform Linear Regression with Multiple Variables to predict the diamond price.
Report adjusted R squared (on training data). Use RMSE and correlation to report the prediction accuracy on the test data.
Normalize the data and repeat the process of performing Linear Regression with Multiple Variables on the normalized data to predict the diamond price.
Highlight the difference in prediction accuracy of both models.
Write your findings in this section. Each valid iteration of Linear Regression will get you 15 marks. This section consists of 30 marks in total.
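The steps above can be sketched roughly as follows. The data frame toy is simulated so that the snippet runs on its own (an assumption; it is not the Moodle data), evaluate and scale01 are hypothetical helper names, and min-max scaling of the predictors is only one possible reading of "normalize".

```r
# Simulated stand-in data with numeric predictors only.
set.seed(3)
n   <- 4000
toy <- data.frame(carat = runif(n, 0.2, 5.01),
                  depth = rnorm(n, mean = 62, sd = 1.5),
                  table = rnorm(n, mean = 57, sd = 2))
toy$price <- 4000 * toy$carat + 20 * toy$table + rnorm(n, sd = 500)

# 75/25 split into training and test rows.
idx   <- sample(nrow(toy), size = floor(0.75 * nrow(toy)))
train <- toy[idx, ]
test  <- toy[-idx, ]

# Fit on the training data, then score the held-out test data.
evaluate <- function(train, test) {
  fit  <- lm(price ~ ., data = train)
  pred <- predict(fit, newdata = test)
  c(adj_r2 = summary(fit)$adj.r.squared,           # training data
    rmse   = sqrt(mean((test$price - pred)^2)),    # test data
    cor    = cor(pred, test$price))                # test data
}
raw <- evaluate(train, test)

# Min-max scaling of the predictors, with ranges taken from the training data only.
preds <- setdiff(names(toy), "price")
lo <- sapply(train[preds], min)
hi <- sapply(train[preds], max)
scale01 <- function(d) {
  d[preds] <- as.data.frame(scale(d[preds], center = lo, scale = hi - lo))
  d
}
scaled <- evaluate(scale01(train), scale01(test))

rbind(raw, scaled)
```

Because min-max scaling is an affine transformation of each predictor, ordinary least squares produces the same fitted values in both runs, so the two rows of this sketch agree up to rounding. If the target price is rescaled as well, RMSE is expressed in the rescaled units and is no longer directly comparable.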

3.3 Task C

Divide the dataset into training and test data. Use an 80/20 split.
Use kNN to classify diamond cuts into appropriate types based on their features.
Use C5.0 to classify diamond cuts into appropriate types based on their features.
Use ANN (hidden=5) to classify diamond cuts into appropriate types based on their features.
Compare the (best) performance of each classifier.
Write your findings in this section. Each valid classification technique will get you 10 marks. This section consists of 30 marks in total.
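A minimal sketch of this comparison is given below, under several assumptions: toy is simulated stand-in data rather than the Moodle file; k = 5 is an arbitrary starting value; the network is fitted with nnet (size = 5) as a stand-in, since the brief does not name a package; and the C5.0 step runs only if the C50 package happens to be installed. The helper names rescale and accuracy are hypothetical.

```r
library(class)  # knn()
library(nnet)   # single-hidden-layer neural network

# Simulated stand-in data.
set.seed(4)
n    <- 1000
cuts <- c("Fair", "Good", "Very Good", "Premium")
toy  <- data.frame(cut = factor(sample(cuts, n, replace = TRUE), levels = cuts))
toy$depth <- rnorm(n, mean = 60 + as.integer(toy$cut), sd = 1)
toy$table <- rnorm(n, mean = 59 - as.integer(toy$cut), sd = 1.5)
toy$carat <- runif(n, 0.2, 5.01)
feats <- c("depth", "table", "carat")

# 80/20 split.
idx   <- sample(n, size = floor(0.8 * n))
train <- toy[idx, ]
test  <- toy[-idx, ]

# Rescale features to [0, 1] using training ranges only (kNN is distance based).
lo <- sapply(train[feats], min)
hi <- sapply(train[feats], max)
rescale <- function(d) as.data.frame(scale(d[feats], center = lo, scale = hi - lo))
train_x <- rescale(train)
test_x  <- rescale(test)

accuracy <- function(pred, truth) mean(as.character(pred) == as.character(truth))

# kNN classifier.
knn_pred <- knn(train = train_x, test = test_x, cl = train$cut, k = 5)

# Neural network with 5 hidden units.
ann <- nnet(cut ~ ., data = cbind(train_x, cut = train$cut),
            size = 5, maxit = 200, trace = FALSE)
ann_pred <- predict(ann, newdata = test_x, type = "class")

acc <- c(kNN = accuracy(knn_pred, test$cut), ANN = accuracy(ann_pred, test$cut))

# C5.0 decision tree, only if the optional C50 package is available.
if (requireNamespace("C50", quietly = TRUE)) {
  tree <- C50::C5.0(x = train[feats], y = train$cut)
  acc["C5.0"] <- accuracy(predict(tree, newdata = test[feats]), test$cut)
}

acc
table(predicted = knn_pred, actual = test$cut)
```

Overall accuracy is only one way to compare classifiers; the confusion table in the last line shows which cut types are confused with each other, which is often more informative for a report.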

Keep in mind the following…

You can also get up to 10 points for clarity and quality of the report and the source code.
Acceptable file formats: R Markdown document (.Rmd) and PDF. Zip both files together and submit.
Your R Markdown document must compile correctly into HTML and PDF formats.
Do not submit work that's not your own. Do not let others copy work that is your own. Both copier and copyee will get ZERO marks.
