comp1239 – Assignment-1 Solved

The assignment is based on two datasets:
NSW Suburbs
NSW Routes (https://webcms3.cse.unsw.edu.au/COMP9321/22T1/resources/73657): this dataset contains information about NSW public transport routes (from Transport for NSW Open Data).
You are supposed to manually inspect the dataset and answer the questions. Attributes of the dataset are self-explanatory and also detailed below, but if you have any doubt about an attribute, you can always ask for clarification.
These datasets are derived from the Australian and NSW State governments. We are not responsible for political correctness or other data accuracies as this is purely a learning exercise for data services engineering.
You are supposed to explore these datasets yourself and answer the following questions.

Question 1
Using the routes.csv dataset, load it into a data frame and add two new columns to it, titled:
“start” – which indicates the start service location from the column “service_direction_name”
“end” – which indicates the end service location from the column “service_direction_name”
For example, “start” and “end” are “Gosford” and “Wyong” in the value “Gosford, then Tuggerah, Wyong”.
Output: a data frame with all the columns you have loaded plus the two new columns “start” and “end”.
Here is a guide for what you can do for this task (but you can always be more creative). You are free to use “suburbs.csv” (please find the updated template file) in this question.
You can ignore what comes after “via” (e.g., ‘Callala to Nowra via Myola’ -> ‘Callala to Nowra’)
You can ignore anything between parentheses (e.g., ‘Callala (Currarong) to Nowra’ -> ‘Callala to Nowra’)
You can use the first suburb name found in the value as the “start”, and the last suburb name found in the value is the “end” (e.g., ‘Callala and Currarong to Nowra’ -> start is “Callala” and end is “Nowra”)
If there is no suburb name in the value but the value contains the word “to”, what comes before “to” could be used as “start”, and what’s after is “end”.
Remember that the assignment is marked manually, so we won’t mark down the whole question for missing a few cases.
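The guide above can be sketched with vectorised pandas string methods. This is only a minimal illustration, not the expected solution: the sample values are taken from the examples in the text rather than from routes.csv, the regular expressions are assumptions about how the separators look, and matching against suburb names from “suburbs.csv” is not covered.

```python
import pandas as pd

# Hypothetical sample values; the real ones come from routes.csv.
routes = pd.DataFrame({
    "service_direction_name": [
        "Gosford, then Tuggerah, Wyong",
        "Callala to Nowra via Myola",
        "Callala (Currarong) to Nowra",
    ]
})

cleaned = (
    routes["service_direction_name"]
    .str.replace(r"\s+via\s+.*$", "", regex=True)   # drop everything after "via"
    .str.replace(r"\s*\([^)]*\)", "", regex=True)   # drop text in parentheses
)

# Treat commas, "then" and "to" as separators (assumed, not from the spec).
# A multi-character pattern is interpreted as a regular expression by str.split.
parts = cleaned.str.split(r"\s*,\s*(?:then\s+)?|\s+then\s+|\s+to\s+")

routes["start"] = parts.str[0].str.strip()    # first piece
routes["end"] = parts.str[-1].str.strip()     # last piece
print(routes[["start", "end"]])
```

No row loop is needed here; every step operates on the whole column at once.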
Question 2 (1.5 Marks)
Using the data frame from Question 1, find the top five (5) most frequently visited service locations (the locations where most services start or end) and their frequencies in NSW. Only start and end service locations are eligible.
Some abbreviations exist (e.g. Nwcstl instead of Newcastle) – these can be cleaned or ignored with no penalty given.
Output: a data frame with five rows and two columns for “service_location” and “frequency”.
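One possible way to count locations across both columns is to stack “start” and “end” into a single series and use value_counts. The data frame below is a hypothetical stand-in for the Question 1 output, so it holds fewer than five distinct locations.

```python
import pandas as pd

# Hypothetical stand-in for the Question 1 output.
df1 = pd.DataFrame({
    "start": ["Gosford", "Callala", "Nowra", "Gosford"],
    "end": ["Wyong", "Nowra", "Gosford", "Nowra"],
})

# Stack both columns, count occurrences, keep the five most frequent.
counts = pd.concat([df1["start"], df1["end"]]).value_counts()
df2 = (
    counts.head(5)
    .rename_axis("service_location")
    .reset_index(name="frequency")
)
print(df2)
```

The result keeps the default integer index, with “service_location” and “frequency” as ordinary columns.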

Question 3
Using the data frame from Question 1, map the values in the “transport_name” column to one of [Bus, Ferry, Light Rail, Train,
Output: A data frame with the modified column.
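A keyword-based mapping is one way to approach this. The sketch below is built on assumptions: the category list in the text above is cut off, so “Metro” is taken from the transport types listed in the Question 8 description, and both the raw values and the keywords are invented for illustration.

```python
import pandas as pd

# Hypothetical raw values; the real ones come from routes.csv.
df1 = pd.DataFrame({"transport_name": [
    "Sydney Buses Network",
    "Sydney Ferries Network",
    "Sydney Light Rail Network",
    "Sydney Trains Network",
    "Sydney Metro Network",
]})

# Assumed keyword per category (case-insensitive substring match).
patterns = {
    "Bus": "bus",
    "Ferry": "ferr",
    "Light Rail": "light rail",
    "Train": "train",
    "Metro": "metro",
}

df3 = df1.copy()
mapped = df3["transport_name"].copy()
# The loop runs over the five categories, not over the rows of the frame.
for label, keyword in patterns.items():
    hit = df3["transport_name"].str.contains(keyword, case=False, na=False)
    mapped = mapped.mask(hit, label)
df3["transport_name"] = mapped
print(df3)
```

Values that match no keyword are left unchanged, which makes unmapped cases easy to spot.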

Question 4
Using the data frame from Question 3, create a new data frame, sorted in ascending (lowest to highest) order of frequency, with the following columns:
“transport_name” – the “transport_name” column.
“frequency” – which contains the number of occurrences of the given “transport_name” in the whole dataset
Output: a data frame with two columns as the above.
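A short sketch of such a frequency table, assuming a hypothetical Question 3 output, could look like this:

```python
import pandas as pd

# Hypothetical stand-in for the Question 3 output.
df3 = pd.DataFrame(
    {"transport_name": ["Bus", "Bus", "Bus", "Train", "Train", "Ferry"]}
)

# Count each transport type and sort from lowest to highest frequency.
df4 = (
    df3["transport_name"]
    .value_counts(ascending=True)
    .rename_axis("transport_name")
    .reset_index(name="frequency")
)
print(df4)
```

Because reset_index is used, “transport_name” stays a regular column, so selecting df4[["transport_name", "frequency"]] works.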

Question 5
Using the data frame from Question 3 and the suburbs.csv dataset, find the top five (5) ratios of depot frequency (“depot_name”) and the corresponding suburb population (“population”).
For the purposes of this assignment, you can assume the suburb name appears in the depot name. Depots that do not have an associated suburb name can be ignored.
Output: a data-frame with five rows with two columns “depot” and “ratio” as described above. “depot” is the index of the data frame. This must be ordered in descending (highest to lowest) order.
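The following is a rough sketch of one way to compute such a ratio, dividing each depot's frequency by the population of the suburb found in its name. The suburb column name “suburb” and all sample values are hypothetical, the match is case-sensitive, and the sketch assumes each suburb name appears only once in the suburbs data.

```python
import re
import pandas as pd

# Hypothetical stand-ins for the Question 3 output and suburbs.csv.
df3 = pd.DataFrame({"depot_name": [
    "Gosford Depot", "Gosford Depot", "Nowra Bus Depot", "Unknown Yard",
]})
suburbs = pd.DataFrame({"suburb": ["Gosford", "Nowra"],
                        "population": [4000, 10000]})

# Build one regex that matches any known suburb name inside a depot name.
pattern = "(" + "|".join(map(re.escape, suburbs["suburb"])) + ")"

depots = df3[["depot_name"]].copy()
depots["suburb"] = depots["depot_name"].str.extract(pattern, expand=False)

# Depots without a matching suburb are ignored; count the rest.
freq = (
    depots.dropna(subset=["suburb"])
    .groupby(["depot_name", "suburb"])
    .size()
    .reset_index(name="frequency")
)

merged = freq.merge(suburbs, on="suburb")
merged["ratio"] = merged["frequency"] / merged["population"]

df5 = (
    merged.rename(columns={"depot_name": "depot"})
    .set_index("depot")[["ratio"]]
    .nlargest(5, "ratio")          # highest ratio first
)
print(df5)
```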

Question 6
Using the data frame from Question 3, create a summary table to represent:
The number of routes managed by operators (from “operator_name”), indicating the type of transport method (from “transport_name”). The design of the summary table is left open. Design the table to maximize the insights reflected.
Output: an informative summary table providing insights about the columns mentioned above.
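Since the design is left open, the following is only one possible layout: a cross-tabulation of operators against transport types with totals. The operator names are invented, and the sketch assumes each row of the data frame represents one route.

```python
import pandas as pd

# Hypothetical stand-in for the Question 3 output.
df3 = pd.DataFrame({
    "operator_name": ["A", "A", "B", "B", "B"],
    "transport_name": ["Bus", "Train", "Bus", "Bus", "Ferry"],
})

# Rows: operators, columns: transport types, cells: number of routes.
table = pd.crosstab(
    df3["operator_name"],
    df3["transport_name"],
    margins=True,            # add row and column totals
    margins_name="Total",
)
print(table)
```

The totals row and column make it easy to see both the largest operators and the dominant transport types at a glance.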

Question 7
From the suburbs.csv dataset, create a visualization to compare NSW Local Government Areas (LGA) within the “statistic_area” of only “Greater Sydney” based on the following metrics:
(1) “population” – the number of people living in the LGA.
(2) “median_income” – the median income of all residents in the LGA. This is the level of income that divides the LGA into two equal parts based on their sorted individual income.
(3) “sqkm” – the area of the LGA represented in square kilometers.
Output: an image of a static visualization titled “YOUR_ZID_q7.png”. You will be marked based on how well your plot demonstrates the differences or similarities between LGAs based on the above metrics. Quality of the visualization (e.g., choice of visualization, clarity, choice of colors, labeling/legends, etc.) is to be considered in the marking (check the lectures on visualization for guidance on what makes a good visualization).

Question 8
Using the data frame from Question 3 and the suburbs.csv dataset, create a visualization to represent:
– The latitude and longitude of all suburbs in NSW, including start and end service locations of all routes (except for Bus services) in NSW.
– An indication of the relative size of the above locations based on their “sqkm” (do not explicitly mention the area of the suburbs).
– Connections between these locations that represent the routes of the above service locations, with the service name.
– A use of color to separate transport type (Ferry, Light Rail, Train, Metro).
– There is no need to use a street map for this.
Output: an image of a static visualization titled “YOUR_ZID_q8.png”. Quality of the visualization (e.g., choice of visualization, clarity, choice of colors, labeling/legends, etc.) is to be considered in the marking (check the lectures on visualization for guidance on what makes a good visualization).
UPDATE: You can ignore BUS routes in this question; this will ease the challenge of showing all routes in a single chart
NEW FAQ
1. Is there a time limit for my submission?
2. Can I ignore some suburbs or aggregate in visualization questions?
3. Is there an automatic test case?
We do not use any automated test cases, and your submission will be marked manually
4. Can I add my private functions anywhere in the code? Yes, you can
5. Are we required to match all “start”, and “end” locations, and “depot_name” with suburb names? No, but do it as much as you can, as it will improve your visualization
6. How can we calculate the ratio in Q5?
There are two acceptable solutions (any of the following is fine):
(1) divide depot frequency by the corresponding suburb’s population. You need to find the suburb in which the depot is located.
(2) you can merge depots located in the same suburb and treat them as a single unit. Then calculate the ratio.
7. How to calculate the median income for an LGA?
There are two acceptable solutions (without imposing any penalties)
(1) calculate the average of suburbs
(2) calculate the weighted average
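The two options can be compared on a small example. In this sketch the LGA column name “lga” and all values are hypothetical, and the weighted average is assumed to use suburb population as the weight, which the text above does not specify.

```python
import pandas as pd

# Hypothetical stand-in for suburbs.csv ("lga" is an assumed column name).
suburbs = pd.DataFrame({
    "lga": ["X", "X", "Y", "Z"],
    "statistic_area": ["Greater Sydney", "Greater Sydney",
                       "Greater Sydney", "Rest of NSW"],
    "population": [1000, 3000, 2000, 500],
    "median_income": [40000, 60000, 50000, 30000],
})

sydney = suburbs[suburbs["statistic_area"] == "Greater Sydney"]

# Option (1): plain average of the suburb medians per LGA.
simple_avg = sydney.groupby("lga")["median_income"].mean()

# Option (2): average weighted by suburb population (assumed weight).
sums = (
    sydney.assign(income_x_pop=sydney["median_income"] * sydney["population"])
    .groupby("lga")[["income_x_pop", "population"]]
    .sum()
)
weighted_avg = sums["income_x_pop"] / sums["population"]

print(simple_avg)
print(weighted_avg)
```

For LGA “X” the plain average is 50000, while the weighted average is 55000 because the larger suburb has the higher income.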
Important Notes:
Submit your script named “YOUR_ZID.py” (e.g., z2123232.py), which contains your code.
You are required to use the following code template (it is not complete; please download the file) for your submission.
You can download the code template from: https://github.com/mysilver/COMP9321-Data-Services/blob/master/22t1/z1111111.py
import json
import matplotlib.pyplot as plt
import pandas as pd
import sys
import os
import numpy as np
import math

studentid = os.path.basename(sys.modules[__name__].__file__)


def log(question, output_df, other):
    print("--------------- {}----------------".format(question))
    if other is not None:
        print(question, other)
    if output_df is not None:
        df = output_df.head(5).copy(True)
        for c in df.columns:
            df[c] = df[c].apply(lambda a: a[:20] if isinstance(a, str) else a)
        df.columns = [a[:10] + "..." for a in df.columns]
        print(df.to_string())


def question_1(routes, suburbs):
    """
    :param routes: the path for the routes dataset
    :param suburbs: the path for the routes dataset
    :return: df1
            Data Type: Dataframe
            Please read the assignment specs to know how to create the output dataframe
    """

    #################################################
    # Your code goes here ...
    #################################################

    log("QUESTION 1", output_df=df1, other=df1.shape)
    return df1

...

if __name__ == "__main__":
    df1 = question_1("routes.csv", "suburbs.csv")
    df2 = question_2(df1.copy(True))
    df3 = question_3(df1.copy(True))
    df4 = question_4(df3.copy(True))
    df5 = question_5(df3.copy(True), "suburbs.csv")
    table = question_6(df3.copy(True))
    question_7(df3.copy(True), "suburbs.csv")
    question_8(df3.copy(True), "suburbs.csv")
If you do not follow this structure, you will not be marked.
You can only add code in the specified lines (do not edit the rest of the lines):
#################################################
# Your code goes here …
#################################################
If your code does not run on CSE machines for any reason (e.g., a hard-coded file path such as C://Users/), you will be penalized by at least 5 marks. We assume that the CSV files are located in the same directory as your script, and that their names are the same as in the template (e.g., exposure.csv and Countires.csv)
Please look at the documentation for each question method; it describes the inputs (e.g., a dataframe) and output (e.g., dataframe, list of cities) of the method.
Please use the same variable names as mentioned in the comments
You are supposed to use the pandas library for all questions. That being said, it is forbidden to use regular Python code to process data. However, you can use lambdas when required and user-defined functions for pandas methods such as ‘apply’.
In the last two questions, you need to plot charts; please do not use the “plt.show()” function to pop up charts. The code template will automatically save the chart on the disk. All you need to do is call the plot functions of the dataframe (e.g., df.plot.pie()). We highly recommend you go through the lab activities to learn how to plot charts.
You should not edit the dataset files. You can only submit your code.
You cannot use other python libraries unless it is already listed in the template file
Use the latest version of the python libraries
Please read all highlighted texts; as they indicate answer/clarification for some questions other students have asked since the assignment has been released.
For the questions, you must use pandas features to answer the questions, and you are not allowed to iterate over the rows of the data frame using a loop. You will lose the mark for the question otherwise. You must use the code template without changing the output format. 1 mark penalty will be applied otherwise.
For the last two questions, you should NOT show the plot; instead, the plot should be saved as a file, and only the saved file will be marked. The template does the work for you. Do not call plt.show(); the graphs should not pop up. A 1 mark penalty will be applied otherwise.
You should not overwrite any of the dataset files; you can keep your changes in Dataframes, and reuse them.
Datasets should be located in the same directory as your script – please do not use absolute path – 1 mark penalty will be applied otherwise.
For visualization questions, you will be marked based on how well you represent the information, including but not limited to “choosing appropriate chart”, “using appropriate chart elements e.g., legends, labels, scales”, etc.

FAQ:
Can I pass extra variables to functions?
No
Can we create our own functions besides the question functions (e.g., question_1)?
Yes
Can I call another function inside the question functions? e.g., calling question_1 inside question_2
Yes
What should I do if my charts are not shown automatically?
How are our submissions marked?
You can only use packages imported in the template file to do the assignment.
What version of Python should I use?
Python 3+
What version of pandas should I use?
The latest version; you can update the version on your CSE account to make sure you can test your code.
How can I submit my assignment?
Go to the assignment page and click on the “Make Submission” tab; pick your files, which must be named “YOUR_ZID.py”. Make sure that the files are not empty, and submit the files together.
Yes, you can. But 5% of your assignment will be deducted as a late penalty per day. If you are late for more than 5 days, you will not be marked.

Yes
Thanks!
Regarding the output format of q2 and q4, both have two columns. Should we keep the default numeric index or set the first column of the output as the index? If we set “transport_name” as the index of the output dataframe, then in q4 log("QUESTION 4", output_df=df4[["transport_name", "frequency"]], other=df4.shape) will complain, because the index “transport_name” cannot be accessed like a normal column.
Thanks!
Yes, keep the default
For Q7, do we need to use df3 since it is passed as a parameter?
If there is no use for it, you can ignore it.
There is a big difference between the pandas versions: on my local computer it is pandas 1.2.4, and on CSE it is 0.23.3+dfsg. I suppose this is the reason I am getting a ValueError with one of the graphs. I remember Morty saying in the lecture that any latest version should be fine. So which version do we need to follow?
The latest version of pandas; you can upgrade it on CSE machines using “pip install … --upgrade”
A very strange problem: when I run df.plot.bar in my own environment, there is no error message and I get the correct result, but in the CSE environment I get a TypeError. This is probably a pandas version conflict, but after searching the documentation I found that df.plot.bar is used the same way in both pandas 0.23 and the current 1.4. The screenshots show the error message and my df data type. Hope to get help!
Error msg: df type:
Thanks, I will attend tomorrow.
I found that there was a problem with the screenshot upload, so I am adding the following:
Error msg: TypeError: Cannot interpret '<attribute 'dtype' of 'numpy.generic' objects>' as a data type
df type: <class 'pandas.core.frame.DataFrame'>
Try upgrading numpy.
The structure above does not import re, but the structure on GitHub (https://github.com/mysilver/COMP9321-Data-Services/blob/master/22t1/z1111111.py) does import re.
You can use re if you wish
Hi, Mohammadali
When I test my code on vlab, an error happens when I plot hist and kde with DataFrame.plot:
TypeError: Cannot interpret '<attribute 'dtype' of 'numpy.generic' objects>' as a data type
This is because the versions of NumPy and pandas are not the latest.
I can run it on my machine with pandas==1.3.5 and numpy==1.21.5.
But on vlab, with pandas==0.23.3+dfsg and numpy==1.21.5, my code can’t run.
When I upgrade them, the plot result is still a little different from mine.
What should I do?
But in vlab, something so weird happened: the legend number is 4.
Their code is the same.

It is hard to debug without actual code; but first try to upgrade all your python packages to the latest versions