Problem 1: Movie Data Analysis with Pandas hw1_movie.ipynb
In this homework, you are asked to write a program that answers the following questions based on IMDB movie data (IMDB-Movie-Data.csv). The output format of each question is free. You must use the Pandas package to answer each question. In addition, you need to write your code in a Jupyter Notebook (.ipynb), using one code block for each question.
Questions
- (1) Top-3 movies with the highest ratings in 2016?
- (2) The actor generating the highest average revenue?
- (3) The average rating of Emma Watson's movies?
- (4) Top-3 directors who collaborate with the most actors?
- (5) Top-2 actors playing in the most genres of movies?
- (6) Top-3 actors whose movies lead to the largest maximum gap of years?
Example of "maximum gap of years":
Tom Cruise has movies: "Edge of Tomorrow" in 2014, "Mission: Impossible - Rogue Nation" in 2015, "Oblivion" in 2013, "Jack Reacher" in 2012, "Mission: Impossible III" in 2006, "Jack Reacher: Never Go Back" in 2016, "Rock of Ages" in 2012, "Mission: Impossible - Ghost Protocol" in 2011. The maximum gap of years is 2016 - 2006 = 10.
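The "maximum gap of years" arithmetic can be sketched in pandas on a toy table built from the Tom Cruise example. This is only an illustration under assumptions: the column names "Title", "Year", and "Actors" (with actors stored as a comma-separated string) are guesses about the CSV layout, and "Hypothetical Co-Star" is a made-up name added to show the split step.

```python
import pandas as pd

# Toy data from the Tom Cruise example. Column names are assumptions about
# the CSV layout; "Hypothetical Co-Star" is an invented placeholder name.
movies = pd.DataFrame({
    "Title": [
        "Edge of Tomorrow", "Mission: Impossible - Rogue Nation", "Oblivion",
        "Jack Reacher", "Mission: Impossible III", "Jack Reacher: Never Go Back",
        "Rock of Ages", "Mission: Impossible - Ghost Protocol",
    ],
    "Year": [2014, 2015, 2013, 2012, 2006, 2016, 2012, 2011],
    "Actors": [
        "Tom Cruise, Hypothetical Co-Star", "Tom Cruise",
        "Tom Cruise, Hypothetical Co-Star", "Tom Cruise",
        "Tom Cruise", "Tom Cruise", "Tom Cruise", "Tom Cruise",
    ],
})

# One row per (movie, actor): split the actor string, then explode the lists.
per_actor = movies.assign(Actor=movies["Actors"].str.split(",")).explode("Actor")
per_actor["Actor"] = per_actor["Actor"].str.strip()

# Gap = latest year minus earliest year for each actor.
year_span = per_actor.groupby("Actor")["Year"].agg(["min", "max"])
year_span["gap"] = year_span["max"] - year_span["min"]

top_gaps = year_span["gap"].sort_values(ascending=False).head(3)
print(top_gaps)
```

On this toy table, Tom Cruise gets a gap of 2016 - 2006 = 10, matching the example above.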
- (7) Find all actors who collaborate with Johnny Depp in direct and indirect ways.
Example:
A collaborates with B
B collaborates with C and D
C collaborates with E and F
D collaborates with A and G
G collaborates with H
All actors directly and indirectly collaborating with A include: [B, C, D, E, F, G, H]
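One way to read "direct and indirect" collaboration is as reachability in an undirected graph whose edges are collaborating pairs. The sketch below runs a breadth-first search over the A to H example; the function name `all_collaborators` and the hard-coded `pairs` list are illustrative assumptions. In the notebook, the pairs would presumably come from the actor column of the movie data via Pandas.

```python
from collections import deque

def all_collaborators(graph, start):
    """Return every actor reachable from `start`, excluding `start` itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        actor = queue.popleft()
        for other in graph.get(actor, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    seen.discard(start)
    return sorted(seen)

# Collaboration pairs from the example above (hard-coded for illustration).
pairs = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"),
         ("C", "F"), ("D", "A"), ("D", "G"), ("G", "H")]

# Collaboration is symmetric, so each pair is stored in both directions.
graph = {}
for a, b in pairs:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

result = all_collaborators(graph, "A")
print(result)
```

Starting from A, the search reaches B, C, D, E, F, G, and H, which agrees with the expected list in the example.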
NCKU Intro Data Science 2022 Fall
Problem 2: In-Game Purchase Data Analysis hw1_purchase.ipynb
In this problem, you are asked to analyze an "in-game purchase" dataset, "purchase_data.csv". With in-game purchasing, players can buy optional items that enhance their playing experience. Your task is to generate a report that breaks down the game's purchasing data into meaningful insights. Some basic observations about the dataset are listed below. Follow the instructions in the provided notebook ("hw1_purchase.ipynb") and complete each code block on your own.
- There are 1163 active players. The vast majority are male (84%). There is also a smaller but notable proportion of female players (14%).
- Our peak age demographic falls between 20-24 (44.79%), with secondary groups falling between 15-19 (18.58%) and 25-29 (13.37%).
- The age group that spends the most money is 20-24, with 1,114.06 dollars as total purchase value and an average purchase of 4.32. In contrast, the demographic group with the highest average purchase is 35-39, with 4.76 and a total purchase value of 147.67.
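The age-group figures in the observations above are the kind of result that binning plus a group-by produces. The following is a minimal sketch on made-up rows, not the real dataset: the column names "Age" and "Price" and the bin edges are assumptions, and the actual notebook may define its groups differently.

```python
import pandas as pd

# Made-up rows for illustration; "Age" and "Price" are assumed column names.
purchases = pd.DataFrame({
    "Age": [21, 22, 17, 36, 24, 38],
    "Price": [4.0, 5.0, 3.0, 4.5, 3.0, 5.5],
})

# Right-inclusive bins: (14, 19] -> "15-19", (19, 24] -> "20-24", and so on.
bins = [14, 19, 24, 29, 34, 39]
labels = ["15-19", "20-24", "25-29", "30-34", "35-39"]
purchases["Age Group"] = pd.cut(purchases["Age"], bins=bins, labels=labels)

# Total and average purchase value per age group (empty groups are dropped).
summary = purchases.groupby("Age Group", observed=True)["Price"].agg(
    total="sum", average="mean"
)
print(summary)
```

`pd.cut` turns the numeric ages into labeled categories, so the total and average per group come from a single `groupby` call.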
You are required to use the pandas package (and its data frame techniques) to generate a data frame that is exactly the same as the table shown right after each code block of "hw1_purchase.ipynb". For more details, please refer to "hw1_purchase.ipynb".
Problem 3: K-means Clustering Implementation hw1_kmeans.ipynb
Your task is to use Python (along with numpy and Pandas) to implement the well-known clustering algorithm, K-means, based on a synthetic dataset cdata.csv. This dataset contains two data columns, "X" and "Y", and one "cluster" column (1, 2, 3, and 4). In implementing K-means, you need to use "X" and "Y" as features for clustering, while the "cluster" column is for your validation. Note that it is not necessary to cluster all of the data points perfectly. Also note that the "cluster" column cannot be used in clustering.
- (1) Randomly select data points as the initial centroids. By default, please set K=4. Report and plot the process until convergence. The centroids also need to be plotted. An example is shown below. Note that the process may not take exactly 3 rounds (it can take 4 or 5 rounds, depending on the initial centroids).
[Figure: example clustering process, with panels Round 1, Round 2, and Round 3]
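The round-by-round process can be sketched with numpy as below. This is one possible reading of the task, not a reference solution: the function names `assign` and `kmeans`, the `seed` and `max_rounds` parameters, and the synthetic blobs standing in for cdata.csv are all assumptions. Plotting is left out; the labels and centroids from each round could be passed to a scatter plot.

```python
import numpy as np

def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for each point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k=4, seed=0, max_rounds=100):
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for round_no in range(1, max_rounds + 1):
        labels = assign(points, centroids)
        # Move each centroid to the mean of its points; keep it if empty.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break  # centroids stopped moving: converged
        centroids = updated
    labels = assign(points, centroids)
    sse = float(((points - centroids[labels]) ** 2).sum())
    return centroids, labels, round_no, sse

# Synthetic stand-in for cdata.csv: four Gaussian blobs (assumed, not real data).
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(25, 2)) for c in centers])

centroids, labels, rounds, sse = kmeans(points, k=4, seed=1)
print(f"converged after {rounds} round(s), SSE = {sse:.2f}")
```

The number of rounds and the final clusters depend on the random initial centroids, so different seeds can give different results.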
- (2) Re-execute your K-means clustering algorithm, varying K from 2 to 50 (from 2 to 10 is also okay). Plot the K value (x-axis) vs. the value of Sum of Squared Error (SSE) (y-axis) as below. Note that it is reasonable and acceptable if the curve is bumpy (not smooth).
- (3) Run the algorithm 10 times with randomly initialized centroids, and plot the resulting SSE values (y-axis), as in the example below.
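The SSE values for tasks (2) and (3) can be collected with a small helper that runs K-means once and returns only the final SSE. The sketch below is hedged in the same way as any illustration here: `kmeans_sse`, its parameters, and the random points are assumptions, and the plotting step is omitted.

```python
import numpy as np

def nearest(points, centroids):
    """Index of the closest centroid for each point (squared Euclidean distance)."""
    return ((points[:, None] - centroids[None]) ** 2).sum(axis=2).argmin(axis=1)

def kmeans_sse(points, k, seed, max_rounds=100):
    """Run K-means from a random initialization and return the final SSE."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_rounds):
        labels = nearest(points, centroids)
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    labels = nearest(points, centroids)
    # SSE: sum of squared distances from each point to its assigned centroid.
    return float(((points - centroids[labels]) ** 2).sum())

# Random 2-D points as a stand-in for the "X" and "Y" columns (assumed data).
rng = np.random.default_rng(0)
points = rng.normal(size=(60, 2))

# Task (2): one SSE value per K, here for K = 2..10.
sse_by_k = {k: kmeans_sse(points, k, seed=0) for k in range(2, 11)}

# Task (3): ten runs with different random initial centroids at K = 4.
sse_by_seed = [kmeans_sse(points, k=4, seed=s) for s in range(10)]
```

Because each run starts from different random centroids, the SSE values need not decrease monotonically with K, which is why a bumpy curve is acceptable.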
Important Notes
This is an individual homework assignment. You are asked to write comments that describe the meaning of each part of your code, either in the code blocks or in markdown cells.





