Name: Intro Data Science Homework 1 Solved

Problem 1: Movie Data Analysis with Pandas hw1_movie.ipynb

In this homework, you are asked to write a program that answers the following questions based on IMDB movie data (IMDB‐Movie‐Data.csv). The output format of each question is free. You must use the Pandas package to answer each question. In addition, you need to write your code in a Jupyter Notebook (.ipynb) and use one code block for each question.

Question

(1) Top‐3 movies with the highest ratings in 2016?
(2) The actor generating the highest average revenue?
(3) The average rating of Emma Watson’s movies?
(4) Top‐3 directors who collaborate with the most actors?
(5) Top‐2 actors playing in the most genres of movies?
(6) Top‐3 actors whose movies lead to the largest maximum gap of years?
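Several of the questions above need one row per (movie, actor) pair rather than one row per movie. The following is a minimal sketch of that reshaping, assuming the CSV has columns named "Title", "Director", "Actors" (a comma-separated string), "Year", "Rating", and "Revenue (Millions)"; those names and the toy rows are assumptions for illustration, not taken from the assignment.

```python
import pandas as pd

# Toy rows, made up for illustration; column names are assumed to match the CSV.
movies = pd.DataFrame({
    "Title": ["M1", "M2", "M3"],
    "Director": ["D1", "D1", "D2"],
    "Actors": ["A, B", "B, C", "A"],
    "Year": [2016, 2016, 2015],
    "Rating": [7.0, 8.5, 6.0],
    "Revenue (Millions)": [100.0, 50.0, 30.0],
})

def explode_actors(df):
    """Return one row per (movie, actor) pair."""
    out = df.assign(Actor=df["Actors"].str.split(","))
    out = out.explode("Actor").reset_index(drop=True)
    out["Actor"] = out["Actor"].str.strip()  # remove spaces left after the commas
    return out

per_actor = explode_actors(movies)

# Pattern for question (1): filter by year, then take the highest ratings.
top_2016 = movies[movies["Year"] == 2016].nlargest(3, "Rating")

# Pattern for question (2): average revenue per actor.
avg_revenue = per_actor.groupby("Actor")["Revenue (Millions)"].mean()

# Pattern for question (4): number of distinct actors per director.
actors_per_director = per_actor.groupby("Director")["Actor"].nunique()

print(top_2016["Title"].tolist())
print(avg_revenue.idxmax())
print(actors_per_director.sort_values(ascending=False).head(3))
```

With the real file, `movies` would instead come from `pd.read_csv("IMDB-Movie-Data.csv")`, and the same exploded frame can be reused for the genre and year-gap questions.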

Example of “maximum gap of years”:

Tom Cruise has movies: “Edge of Tomorrow” in 2014, “Mission: Impossible ‐ Rogue Nation” in 2015, “Oblivion” in 2013, “Jack Reacher” in 2012, “Mission: Impossible III” in 2006, “Jack Reacher: Never Go Back” in 2016, “Rock of Ages” in 2012, “Mission: Impossible ‐ Ghost Protocol” in 2011. The maximum gap of years is 2016 - 2006 = 10.
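The arithmetic in this example can be checked with a small groupby. This is only a sketch: the frame below contains just the Tom Cruise rows listed above, and the column names "Actor", "Title", and "Year" are assumptions.

```python
import pandas as pd

# Rows taken from the example above; column names are assumed.
films = pd.DataFrame({
    "Actor": ["Tom Cruise"] * 8,
    "Title": [
        "Edge of Tomorrow",
        "Mission: Impossible - Rogue Nation",
        "Oblivion",
        "Jack Reacher",
        "Mission: Impossible III",
        "Jack Reacher: Never Go Back",
        "Rock of Ages",
        "Mission: Impossible - Ghost Protocol",
    ],
    "Year": [2014, 2015, 2013, 2012, 2006, 2016, 2012, 2011],
})

# Maximum gap = latest year minus earliest year, computed per actor.
years = films.groupby("Actor")["Year"]
gap = years.max() - years.min()

print(gap.nlargest(3))  # Tom Cruise -> 10
```

On the full dataset the same two lines would run on a frame with one row per (movie, actor) pair, and `nlargest(3)` would give the top three actors.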

(7) Find all actors who collaborate with Johnny Depp in direct and indirect ways

Example:
A collaborates with B
B collaborates with C and D
C collaborates with E and F
D collaborates with A and G
G collaborates with H

All actors directly and indirectly collaborating with A include: [B, C, D, E, F, G, H]
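One way to read this question is as graph reachability: actors are nodes, a shared movie is an edge, and the answer is every node reachable from the starting actor. The sketch below applies a breadth-first search to the A to H example; the function names `build_graph` and `all_collaborators` are hypothetical helpers introduced for illustration.

```python
from collections import deque

def build_graph(pairs):
    """Build an undirected adjacency map from (actor, actor) pairs."""
    graph = {}
    for left, right in pairs:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)
    return graph

def all_collaborators(graph, start):
    """Return every actor reachable from `start`, excluding `start` itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return sorted(seen)

# Collaboration pairs from the example above.
pairs = [
    ("A", "B"),
    ("B", "C"), ("B", "D"),
    ("C", "E"), ("C", "F"),
    ("D", "A"), ("D", "G"),
    ("G", "H"),
]

graph = build_graph(pairs)
result = all_collaborators(graph, "A")
print(result)  # ['B', 'C', 'D', 'E', 'F', 'G', 'H']
```

For the movie data, the pairs could be produced from each movie's cast list, for example with `itertools.combinations(cast, 2)`, before calling the same search with "Johnny Depp" as the start.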

NCKU Intro Data Science 2022 Fall

Problem 2: In‐Game Purchase Data Analysis hw1_purchase.ipynb

In this homework, you are asked to analyze an “in‐game purchase” dataset. Please refer to the dataset “purchase_data.csv”. With in‐game purchasing, players can buy optional items that enhance their playing experience. Your task is to generate a report that breaks down the game’s purchasing data into meaningful insights. We provide some basic observations about the dataset below. You need to follow the instructions in the ipynb file we provide (“hw1_purchase.ipynb”) and complete each code block on your own.

• There are 1163 active players. The vast majority are male (84%). There is also a smaller but notable proportion of female players (14%).
• Our peak age demographic falls between 20‐24 (44.79%), with secondary groups falling between 15‐19 (18.58%) and 25‐29 (13.37%).
• The age group that spends the most money is 20‐24, with a total purchase value of 1,114.06 dollars and an average purchase of 4.32. In contrast, the group with the highest average purchase is 35‐39, with an average of 4.76 and a total purchase value of 147.67.
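Observations like the age-group figures above are typically produced by binning ages and aggregating per bin. The sketch below shows that pattern on made-up rows; the column names "SN", "Age", and "Price", the bin edges, and the data are all assumptions, since the handout does not list the columns of purchase_data.csv.

```python
import pandas as pd

# Hypothetical purchases; "SN" is assumed to identify a player.
purchases = pd.DataFrame({
    "SN": ["p1", "p2", "p2", "p3", "p4"],
    "Age": [21, 23, 23, 36, 17],
    "Price": [4.0, 5.0, 3.0, 4.5, 2.5],
})

# Right-inclusive bins: (0, 9], (9, 14], (14, 19], (19, 24], ...
bins = [0, 9, 14, 19, 24, 29, 34, 39, 200]
labels = ["<10", "10-14", "15-19", "20-24", "25-29", "30-34", "35-39", "40+"]
purchases["Age Group"] = pd.cut(purchases["Age"], bins=bins, labels=labels)

# Total and average purchase value per age group.
summary = purchases.groupby("Age Group", observed=True)["Price"].agg(
    total="sum", average="mean", count="count"
)

# Share of unique players per age group (each player counted once).
players = purchases.drop_duplicates("SN")
share = players["Age Group"].value_counts(normalize=True)

print(summary)
print(share)
```

Counting each player once before computing percentages matters here because one player can make several purchases.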

You are required to use the pandas package (and its data frame techniques) to generate a data frame that is exactly the same as the table shown right after each code block in “hw1_purchase.ipynb”. For more details, please refer to “hw1_purchase.ipynb”.

Problem 3: K‐means Clustering Implementation hw1_kmeans.ipynb

Your task is to use Python (along with numpy and Pandas) to implement the well‐known clustering algorithm, K‐means, on a synthetic dataset, cdata.csv. This dataset contains two data columns, “X” and “Y”, and one “cluster” column (1, 2, 3, and 4). In implementing K‐means, you need to use “X” and “Y” as features for clustering, while the “cluster” column is for your validation only. Note that it is not necessary to cluster all of the data points perfectly. Also note that the “cluster” column cannot be used in clustering.

(1) Randomly select data points as the initial centroids. By default, please set K=4. Report and plot the process until convergence. The centroids also need to be plotted. An example is shown below. Note that convergence may not take exactly 3 rounds (it can take 4 or 5 rounds, depending on the initial centroids).


[Figure: example clustering plots for Round 1, Round 2, and Round 3]
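The procedure in step (1) can be sketched with numpy as follows. This is a minimal illustration under stated assumptions, not a reference solution: the points are generated synthetically instead of being read from cdata.csv, and the names `kmeans`, `assign`, `sse`, and `history` are hypothetical.

```python
import numpy as np

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def kmeans(points, k=4, max_rounds=100, seed=0):
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initial centroids are k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    history = [centroids.copy()]  # centroids per round, useful for plotting
    for _ in range(max_rounds):
        labels = assign(points, centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
        history.append(centroids.copy())
    labels = assign(points, centroids)
    return labels, centroids, history

# Synthetic stand-in for the "X" and "Y" columns (four made-up blobs).
data_rng = np.random.default_rng(42)
centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]], dtype=float)
points = np.vstack([c + data_rng.normal(scale=0.5, size=(50, 2)) for c in centers])

labels, centroids, history = kmeans(points, k=4, seed=0)
print("rounds recorded:", len(history))
print("final SSE:", sse(points, centroids))
```

In the notebook, `points` would come from the “X” and “Y” columns of cdata.csv, and each entry of `history` could be drawn as one scatter plot per round to show the process until convergence.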

(2) Re‐execute your K‐means clustering algorithm while varying K from 2 to 50 (from 2 to 10 is also acceptable). Plot the K value (x‐axis) against the Sum of Squared Errors (SSE) (y‐axis), as in the example figure. Note that it is reasonable and acceptable if the curve is bumpy rather than smooth.
(3) Run the algorithm 10 times with different randomly initialized centroids, and plot the SSE value of each run (y‐axis), as in the example figure.
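Steps (2) and (3) both reduce to running K‐means repeatedly and recording the SSE of each run. The sketch below assumes a compact helper, `kmeans_sse` (a hypothetical name), and synthetic points in place of cdata.csv; the plotting itself is left out so the block only depends on numpy.

```python
import numpy as np

def kmeans_sse(points, k, seed, max_rounds=100):
    """Run K-means from a random initialization and return the final SSE."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_rounds):
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Synthetic stand-in for the "X" and "Y" columns (four made-up blobs).
data_rng = np.random.default_rng(42)
centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]], dtype=float)
points = np.vstack([c + data_rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Step (2): one run per K; the curve may be bumpy because each run
# starts from a random initialization.
sse_by_k = {k: kmeans_sse(points, k, seed=0) for k in range(2, 11)}

# Step (3): fixed K=4, ten different random initializations.
sse_by_seed = [kmeans_sse(points, 4, seed=s) for s in range(10)]

for k, value in sse_by_k.items():
    print(k, round(value, 2))
print([round(v, 2) for v in sse_by_seed])
```

The keys and values of `sse_by_k` give the x‐ and y‐axis of the K vs. SSE plot, and `sse_by_seed` gives the ten y‐values for step (3).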


Important Notes

This is an individual homework assignment. You are asked to write comments that describe the meaning of each part of your code, in either code blocks or markdown cells.
