Description
Some of the queries in this assignment involve floating point data. In the expected output shown below, there might be slight differences in floating point values due to rounding and formatting issues. You may assume that floating point values in your query results match the expected output if they are equivalent to four decimal places (e.g. if the expected output is written as ‘0.1234’, the values ‘0.123398’ and ‘0.1234567’ would both be considered correct, but the value ‘0.123’ would not).
Question 1: IMDB Queries
Create queries for each of the data retrieval problems below, using the imdb database. In the questions below, any reference to ‘films’ refers to titles with title_type = ‘movie’.
- For each year between 2000 and 2017 (inclusive), list the primary name, production year, rating and number of votes of the film or films which attained the highest rating among all movies produced in that year which received at least 10000 votes. Both the rating and number of votes are stored in the ratings
Expected Query Result |
primary name |
year |
rating |
votes |
Memento |
2000 |
8.5000 |
942432 |
Gladiator |
2000 |
8.5000 |
1095891 |
The Lord of the Rings: The Fellowship of the Ring |
2001 |
8.8000 |
1370113 |
The Lord of the Rings: The Two Towers |
2002 |
8.7000 |
1221886 |
The Lord of the Rings: The Return of the King |
2003 |
8.9000 |
1349934 |
Anbe Sivam |
2003 |
8.9000 |
10116 |
Black Friday |
2004 |
8.6000 |
13513 |
Earthlings |
2005 |
8.7000 |
14649 |
The Prestige |
2006 |
8.5000 |
959259 |
The Lives of Others |
2006 |
8.5000 |
290278 |
The Departed |
2006 |
8.5000 |
975612 |
Like Stars on Earth |
2007 |
8.5000 |
113421 |
The Dark Knight |
2008 |
9.0000 |
1864795 |
Home |
2009 |
8.6000 |
19621 |
Inception |
2010 |
8.8000 |
1653611 |
The Intouchables |
2011 |
8.6000 |
591006 |
The Dark Knight Rises |
2012 |
8.4000 |
1269644 |
Django Unchained |
2012 |
8.4000 |
1087769 |
CM101MMXI Fundamentals |
2013 |
9.3000 |
38106 |
Interstellar |
2014 |
8.6000 |
1120400 |
RangiTaranga |
2015 |
8.7000 |
10286 |
The Mountain II |
2016 |
9.6000 |
93071 |
Ayla: The Daughter of War |
2017 |
9.1000 |
15807 |
- Select the primary name and episode count of each TV series (contained in the tv_series table) for which at least 6000 episodes have been produced.
Expected Query Result |
|
series name |
episode count |
Days of Our Lives |
10240 |
Ohayou Tokushima |
9502 |
Coronation Street |
9316 |
Neighbours |
8911 |
Six O’Clock News |
8646 |
The Price Is Right |
8461 |
One O’Clock News |
8100 |
Jeopardy! |
7599 |
EastEnders |
7393 |
The Bold and the Beautiful |
7236 |
Home and Away |
7081 |
Wheel of Fortune |
6564 |
Gute Zeiten, schlechte Zeiten |
6413 |
The Tonight Show Starring Johnny Carson |
6037 |
The Young and the Restless |
6024 |
Question 2: BC Ferries Queries [14 marks]
The queries you write below should work correctly on any of the BC Ferries databases (ferries_1month, ferries_3months, ferries_6months, ferries_9months, ferries_1year or ferries_3years). For comparison, sample output is shown for both ferries_1month and ferries_3years.
The following definitions apply to all of the parts below unless otherwise stated.
- The ‘duration’ of a sailing is defined to be the number of minutes between the scheduled departure (not the actual departure) and the arrival of that sailing.
- Any reference to the ‘day of a sailing’ refers to the calendar day of the scheduled departure (not the actual departure) of that sailing.
- For many routes, there are simultaneous sailings in both directions at the same time. For example, on a typical day at 9:00am, a ferry leaves Tsawwassen for Swartz Bay and a different ferry leaves Swartz Bay for Tsawwassen. Not all routes or sailings have this property. We will define two vessels as ‘paired up’ if they have both served the same route number at the same (scheduled) departure time/date. Construct a query to count the number of times each distinct pair of ferries have been paired up. Your result must not contain counts for pairs of vessels which have never been paired up. Each distinct pair of ferries should appear only once in your result, and in the result rows, the vessel names of each pair must be ordered alphabetically (so the alphabetically lowest vessel name will be listed first).
|
Expected Query Result (ferries 1month) |
|
|
|
vessel1 |
vessel2 |
num pairings |
|
|
Coastal Inspiration |
Queen of Alberni |
158 |
|
|
Spirit of British Columbia |
Spirit of Vancouver Island |
10 |
|
Expected Query Result (ferries 3years) |
vessel1 |
vessel2 |
num pairings |
Coastal Inspiration |
Queen of Alberni |
5449 |
Spirit of British Columbia |
Spirit of Vancouver Island |
3552 |
Queen of Cowichan |
Queen of Oak Bay |
3048 |
Coastal Celebration |
Queen of New Westminster |
2261 |
Coastal Renaissance |
Queen of Oak Bay |
1825 |
Coastal Celebration |
Spirit of British Columbia |
1653 |
Coastal Celebration |
Spirit of Vancouver Island |
1247 |
Coastal Renaissance |
Spirit of Vancouver Island |
847 |
Coastal Renaissance |
Queen of Alberni |
814 |
Coastal Celebration |
Coastal Renaissance |
622 |
Coastal Renaissance |
Queen of New Westminster |
594 |
Coastal Inspiration |
Queen of New Westminster |
592 |
Queen of Coquitlam |
Queen of Cowichan |
548 |
Queen of Coquitlam |
Queen of Oak Bay |
481 |
Queen of Alberni |
Queen of Coquitlam |
400 |
Queen of Coquitlam |
Queen of New Westminster |
152 |
Coastal Renaissance |
Spirit of British Columbia |
53 |
Coastal Celebration |
Coastal Inspiration |
24 |
Queen of New Westminster |
Spirit of Vancouver Island |
17 |
Coastal Inspiration |
Spirit of British Columbia |
8 |
Queen of Coquitlam |
Queen of Surrey |
3 |
Coastal Renaissance |
Queen of Cowichan |
1 |
|
|
|
|
|
|
- The routes table contains the ‘nominal duration’ of each route, which is the expected crossing time. The nominal duration is determined by BC Ferries based on the average marine and traffic conditions, along with information like loading times and the speed of the vessels. We can test the accuracy of this calculation by computing the average time of each crossing. Construct a query to find, for each route number, the nominal duration (in minutes) and the average duration (in minutes) of a crossing based on all available data for that route. For this question, assume that the ‘duration’ of a particular sailing is the time between its scheduled departure and its arrival.
Expected Query Result (ferries 1month) |
route number |
nominal duration |
avg duration |
1 |
95 |
101.2424 |
3 |
40 |
44.9415 |
4 |
35 |
33.2479 |
8 |
20 |
20.5182 |
30 |
120 |
129.0709 |
Expected Query Result (ferries 3years) |
route number |
nominal duration |
avg duration |
1 |
95 |
92.8867 |
2 |
100 |
106.8488 |
3 |
40 |
45.1520 |
4 |
35 |
31.8939 |
8 |
20 |
21.9507 |
30 |
120 |
123.0126 |
- Suppose we define a sailing to be ‘late’ if the duration is at least five minutes longer[1] than the nominal duration in the routes Construct a query to find, for each month, the number of days for which at least one sailing occurred on Route 1 but no late sailings occurred on route number 1.
Expected Query Result (ferries 1month) |
month |
count |
4 |
1 |
Expected Query Result (ferries 3years) |
month |
count |
1 |
65 |
2 |
53 |
3 |
58 |
4 |
29 |
5 |
19 |
6 |
23 |
7 |
22 |
8 |
18 |
9 |
43 |
10 |
45 |
11 |
51 |
12 |
60 |
- Construct a query to find, for each vessel with any sailings, the total number of sailings it has made, the number of late sailings it has made (which may be zero) and the fraction of its sailings that were late (that is, the number of late sailings divided by the total number of sailings). You should be careful with this query: a vessel may be involved in multiple routes, each with a different nominal duration.
|
Expected Query Result (ferries 1month) |
|
|
|
vessel name |
total sailings |
late sailings |
late fraction |
|
|
Coastal Inspiration |
232 |
166 |
0.7155 |
|
|
Coastal Renaissance |
39 |
14 |
0.3590 |
|
|
Queen of Alberni |
163 |
43 |
0.2638 |
|
|
Queen of Capilano |
440 |
30 |
0.0682 |
|
|
Queen of Coquitlam |
68 |
16 |
0.2353 |
|
|
Queen of Surrey |
274 |
100 |
0.3650 |
|
|
Skeena Queen |
234 |
22 |
0.0940 |
|
|
Spirit of British Columbia |
164 |
98 |
0.5976 |
|
|
Spirit of Vancouver Island |
94 |
23 |
0.2447 |
|
Expected Query Result (ferries 3years) |
|
vessel name |
total sailings |
late sailings |
late fraction |
Bowen Queen |
1268 |
83 |
0.0655 |
Coastal Celebration |
6506 |
406 |
0.0624 |
Coastal Inspiration |
6545 |
2502 |
0.3823 |
Coastal Renaissance |
5637 |
1220 |
0.2164 |
Mayne Queen |
1 |
0 |
0.0000 |
Queen of Alberni |
6903 |
1164 |
0.1686 |
Queen of Capilano |
14277 |
2806 |
0.1965 |
Queen of Coquitlam |
6650 |
2230 |
0.3353 |
Queen of Cowichan |
7071 |
2946 |
0.4166 |
Queen of Cumberland |
974 |
113 |
0.1160 |
Queen of New Westminster |
3966 |
439 |
0.1107 |
Queen of Oak Bay |
6609 |
4592 |
0.6948 |
Queen of Surrey |
12759 |
4555 |
0.3570 |
Quinitsa |
8 |
8 |
1.0000 |
Skeena Queen |
7312 |
378 |
0.0517 |
Spirit of British Columbia |
5539 |
1013 |
0.1829 |
Spirit of Vancouver Island |
5916 |
791 |
0.1337 |
|
|
|
|
|
|
|
|
|
- Usually, you would expect that when the beginning of a sailing is delayed (that is, when actual_departure is much later than scheduled_departure), the arrival is also delayed and the sailing is late. However, in some cases, a vessel that departs late may still arrive on time. Define a ‘made up sailing’ to be any sailing which leaves at least 15 minutes after its scheduled departure but arrives less than (or equal to) five minutes late. Write a query to list the number of made up sailings in the dataset for each vessel. Only vessels with at least one made up sailing should be listed.
|
Expected Query Result (ferries 1month) |
|
|
vessel name |
made up sailings |
|
|
Coastal Inspiration |
4 |
|
|
Coastal Renaissance |
2 |
|
|
Spirit of British Columbia |
3 |
|
Expected Query Result (ferries 3years) |
vessel name |
made up sailings |
Coastal Celebration |
109 |
Coastal Inspiration |
70 |
Coastal Renaissance |
53 |
Queen of Alberni |
11 |
Queen of Coquitlam |
7 |
Queen of Cowichan |
1 |
Queen of New Westminster |
28 |
Queen of Oak Bay |
1 |
Skeena Queen |
2 |
Spirit of British Columbia |
57 |
Spirit of Vancouver Island |
75 |
|
|
|
|
|
- Define a ‘good day’ for a particular route number to be a day where at least one sailing occurred and no late sailings occurred. For each route, find the maximum number of good days in a row over the entire dataset.
Note that the rather unimpressive results for the ferries_1month dataset compared to the other datasets seem to be the result of COVID-19 related service reductions and extra precautions (which may be increasing the number of late sailings).
Hint: This is probably easiest when you use a large number of CTE expressions (that is, subqueries in the WITH clause).
Expected Query Result (ferries 1month) |
route number |
max consecutive good days |
1 |
1 |
2 |
0 |
3 |
1 |
4 |
3 |
8 |
4 |
30 |
0 |
Expected Query Result (ferries 3years) |
route number |
max consecutive good days |
1 |
15 |
2 |
2 |
3 |
4 |
4 |
27 |
8 |
8 |
30 |
6 |
- For each route, output all date ranges where the maximum number of consecutive good days (as defined in part (f) above) was achieved. For some routes, only one date range will meet this criteria, but for others (e.g. route 2 in the ferries_3years dataset) there may be multiple date ranges of the same size. The columns containing the start and end dates should contain values of type DATE (created, for example, by the make_date function).
Expected Query Result (ferries 1month) |
route number |
start date |
end date |
max consecutive good days |
1 |
2020-04-26 |
2020-04-26 |
1 |
3 |
2020-04-15 |
2020-04-15 |
1 |
4 |
2020-04-19 |
2020-04-21 |
3 |
8 |
2020-04-29 |
2020-05-02 |
4 |
Expected Query Result (ferries 3years) |
route number |
start date |
end date |
max consecutive good days |
1 |
2017-12-21 |
2018-01-04 |
15 |
2 |
2017-05-20 |
2017-05-21 |
2 |
2 |
2017-09-26 |
2017-09-27 |
2 |
3 |
2018-02-24 |
2018-02-27 |
4 |
4 |
2018-09-02 |
2018-09-28 |
27 |
8 |
2019-03-06 |
2019-03-13 |
8 |
30 |
2017-09-23 |
2017-09-28 |
6 |
Question 3: VWSN Queries [10 marks]
Create queries for each of the data retrieval problems below, using the vwsn 1year database.
- Find the highest observed temperature in the dataset, along with the station number, station name and observation time of all cases where that temperature was reported.
Note: The expected output below is correct (that is, the observation shown is actually in the dataset, even though it appears to be unseasonably warm).
|
Expected Query Result |
|
|
station id |
name |
temperature |
observation time |
180 |
Captain Meares Elementary Secondary School |
42.6000 |
2019-10-05 15:11:00 |
- For each station with station ID between 1 and 10 (inclusive), list the station ID, station name, maximum temperature observed at that station and observation time of all observations in the dataset in which the maximum temperature was attained at that station. Only include stations which actually have recorded observations in the dataset.
Expected Query Result |
station id |
name |
max temperature |
observation time |
1 |
Ian Stewart Complex/Mt. Douglas High School |
29.8000 |
2019-08-05 16:31:00 |
3 |
Strawberry Vale Elementary School |
30.2000 |
2019-06-12 15:15:00 |
3 |
Strawberry Vale Elementary School |
30.2000 |
2019-06-12 15:21:00 |
4 |
Oaklands Elementary School |
30.3000 |
2019-06-12 14:46:00 |
4 |
Oaklands Elementary School |
30.3000 |
2019-06-12 14:52:00 |
5 |
Cedar Hill Middle School |
30.2000 |
2019-06-12 17:12:00 |
6 |
Marigold Elementary School/Spectrum High School |
29.8000 |
2019-06-12 14:41:00 |
7 |
Campus View Elementary |
29.6000 |
2019-06-12 17:22:00 |
8 |
Victoria High School |
29.7000 |
2019-06-12 14:41:00 |
9 |
Frank Hobbs Elementary School |
29.2000 |
2019-08-05 17:27:00 |
10 |
Macaulay Elementary School |
27.3000 |
2019-06-12 12:26:00 |
- Find the IDs and names of all stations which have reported at least one observation at some point, but which did not report any observations in January 2020.
Expected Query Result |
station id |
name |
13 |
Shoreline Middle School |
16 |
Tillicum Elementary School |
31 |
Sangster Elementary School |
32 |
Colwood Elementary School |
36 |
Hans Helgesen Elementary School |
57 |
Wishart Elementary School |
62 |
Deep Cove Elementary School |
70 |
Parkland Secondary School |
81 |
Braefoot Elementary School |
94 |
Pender Islands Elementary and Secondary School |
103 |
Frances Kelsey Secondary School |
105 |
Port Renfrew Elementary School |
108 |
Alberni Weather |
109 |
Brentwood Elementary School |
112 |
Saturna Elementary School |
113 |
Mayne Island School |
117 |
Saltspring Elementary and Middle Schools |
124 |
Seaview Elementary School |
131 |
Glenlyon Norfolk Junior School |
133 |
Discovery Elementary School |
136 |
Pleasant Valley Elementary School |
160 |
Shawnigan Lake School |
161 |
Bamfield Marine Sciences Centre |
166 |
Maquinna Elementary School |
180 |
Captain Meares Elementary Secondary School |
195 |
North Island Distance Education School |
196 |
Valley View Elementary School |
226 |
Ditidaht Community School |
- In this question (and the next question), define the ‘daily average temperature’ for a particular day to be the average of all observations from all stations on that day. Furthermore, for each month, define the ‘10 hottest days’ to be the top 10 days (by daily average temperature) and define the ‘10 coolest days’ to be the bottom 10 days (by daily average temperature). For each month/year pair in the dataset, compute the average daily average temperature across the ten hottest days and the average daily average temperature across the ten coolest days. Notice that you are computing an average of averages (first, find the average daily temperatures of each of the ten hottest days, then average those temperatures to produce the result). Hint:
Use the rank() function with an appropriate over() clause (maybe in multiple places)
Expected Query Result |
year |
month |
hottest10 average |
coolest10 average |
2019 |
5 |
16.8834 |
12.0071 |
2019 |
6 |
17.8665 |
13.8339 |
2019 |
7 |
19.0044 |
15.9400 |
2019 |
8 |
19.7309 |
16.5326 |
2019 |
9 |
17.0929 |
12.3717 |
2019 |
10 |
10.8171 |
6.5475 |
2019 |
11 |
9.2537 |
3.6983 |
2019 |
12 |
7.3156 |
4.1378 |
2020 |
1 |
7.8368 |
0.9990 |
2020 |
2 |
6.3289 |
3.1054 |
2020 |
3 |
7.3842 |
3.7168 |
2020 |
4 |
11.5621 |
7.1426 |
- List the day, month and year (in separate columns) of all days whose daily average temperature (see previous question) was lower than the daily average temperature of each of the previous 28 days in the dataset. The result should not list any of the first 28 days in the dataset, since the metric cannot be computed for those days, but the data for those days should be used to compute the rest of the result. Hint: You may want to use both the min() and count() aggregation functions in combination with an over() clause that includes both partitioning and windowing.
Expected Query Result |
year |
month |
day |
2019 |
8 |
21 |
2019 |
8 |
23 |
2019 |
9 |
12 |
2019 |
9 |
14 |
2019 |
9 |
15 |
2019 |
9 |
16 |
2019 |
9 |
17 |
2019 |
9 |
23 |
2019 |
9 |
27 |
2019 |
9 |
30 |
2019 |
10 |
1 |
2019 |
10 |
8 |
2019 |
10 |
9 |
2019 |
10 |
10 |
2019 |
10 |
28 |
2019 |
10 |
29 |
2019 |
10 |
30 |
2019 |
11 |
26 |
2019 |
11 |
28 |
2019 |
11 |
29 |
2020 |
1 |
9 |
2020 |
1 |
12 |
2020 |
1 |
13 |
2020 |
1 |
14 |
2020 |
3 |
14 |
[1] . A duration which is exactly five minutes longer is still considered late.