CSE545 Assignment 1

Overview

Part I. Streaming (35 points)

Part II. MapReduce (65 points)

Overview

Goal: Gain experience working with streaming algorithms as well as MapReduce workflow systems. Gain experience simulating work with a distributed file system.

Requirements. You must use Python version 3.5 or later.  You do not need a distribution of MapReduce for this assignment as we will use our own basic simulator (see task II.A.) — future assignments will use MapReduce on a cluster.

Templates and Python Libraries. Template code is provided for each part of the assignment. Each template includes all of the data science, machine learning, or statistics libraries that you may import. Other data science, machine learning, or statistics-related libraries are prohibited unless listed below — ask if unsure. The intention is for you to implement the algorithms we have gone over and problem-solve in order to best understand the concepts of this course and their practical application.

Within the templates, all provided method names and classes must be used as provided with the same parameters. However, you may also use additional methods to keep your code clean.

Additional approved libraries that are not in the template will be listed here (if any): datetime, random.shuffle

 

Part I. Streaming (35 points)

Here, you will implement and evaluate the runtime of the typical multi-level sampler as well as the “streaming” sampler that we went over in class. We call it a “multi-level sampler” because the level of analysis we wish to sample (e.g. user level) is not the same level at which the data arrives (e.g. transaction level).

Data: to complete these tasks, you will be provided transaction data of the following form: record_id, date, user_id, amount

record_id — unique id for the transaction

date — date of the transaction

user_id — id for the user making the transaction

amount — the amount of the transaction

The data is provided in three files:

1)  transactions_small.csv

2)  transactions_medium.csv [zip]

3)  transactions_large.csv [zip; to be released]

Task I.A) Typical non-streaming multi-level sampler (15 points)

Implement the standard non-streaming sampling method (a sketch follows the steps):

Step 1: read the file to pull out the unique user_ids

Step 2: subset to a random 1% of the user_ids

Step 3: read the file again to pull out the records from the 1% of user_ids, and compute the mean and standard deviation within
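A minimal sketch of these three steps, assuming the four-column CSV layout described above with no header row; the function name and percent parameter are illustrative, and the template's required method signatures take precedence:

    import csv
    import random

    def non_streaming_sample(filename, percent=0.01):
        # Pass 1: collect the set of unique user_ids.
        user_ids = set()
        with open(filename) as f:
            for record_id, date, user_id, amount in csv.reader(f):
                user_ids.add(user_id)

        # Subset to a random ~percent of the user_ids (random.shuffle is
        # on the approved-libraries list above).
        ids = sorted(user_ids)
        random.shuffle(ids)
        sampled = set(ids[:max(1, int(len(ids) * percent))])

        # Pass 2: re-read the file, keeping amounts for sampled users only.
        amounts = []
        with open(filename) as f:
            for record_id, date, user_id, amount in csv.reader(f):
                if user_id in sampled:
                    amounts.append(float(amount))

        # Mean and (sample) standard deviation of the kept amounts.
        n = len(amounts)
        mean = sum(amounts) / n
        std = (sum((x - mean) ** 2 for x in amounts) / (n - 1)) ** 0.5 if n > 1 else 0.0
        return mean, std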

Task I.B) Streaming multi-level sampler (15 points)

Implement the streaming multi-level sampling code we went over in class, which uses hash functions on the user_id to read through the file once. Technically, we are simulating a web stream of data by taking a single pass over a file, but you should see the advantage of this algorithm even for sampling from files. Record the information needed to compute the mean and standard deviation. Your sample should correspond to 2% and .5% of the user_ids in each file (approximate, especially in the case of the small file). Make sure to use a streaming mean and std-dev (see rules in the method description).
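One possible shape for this, pairing a hash-bucket membership test with Welford's online algorithm for the streaming mean and standard deviation (the accumulators your class version tracks may differ; note also that Python's built-in hash() for strings is salted per process, which is fine within a single run):

    import csv

    def streaming_sample(filename, numerator=2, denominator=100):
        # Welford accumulators for a streaming mean and std-dev:
        # n = count, mean = running mean, m2 = sum of squared deviations.
        n, mean, m2 = 0, 0.0, 0.0
        with open(filename) as f:
            for record_id, date, user_id, amount in csv.reader(f):
                # A user is "in" the sample iff their id hashes into the
                # first numerator of denominator buckets; the same id always
                # hashes the same way within a run, so every transaction of
                # a sampled user is kept. 2/100 gives ~2%, 5/1000 gives ~.5%.
                if hash(user_id) % denominator < numerator:
                    x = float(amount)
                    n += 1
                    delta = x - mean
                    mean += delta / n
                    m2 += delta * (x - mean)
        std = (m2 / (n - 1)) ** 0.5 if n > 1 else 0.0
        return mean, std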

Task I.C) Timing (5 points)

Time wall-clock processing time, in milliseconds, over different sizes of data: small (10,000), medium (1,000,000), and large (100,000,000).

Report runtimes and results for both implementations above, using percentages of .02 and .005 for each of the three files (small may not have an adequate sample at .005: that’s ok).
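A small illustrative timing helper (not part of the template), assuming the sampler functions sketched above:

    import time

    def time_ms(fn, *args):
        # Wall-clock duration of a single call, in milliseconds.
        start = time.perf_counter()
        result = fn(*args)
        return result, (time.perf_counter() - start) * 1000.0

    # Hypothetical usage, comparing the two samplers on one file:
    # stats, ms = time_ms(streaming_sample, 'transactions_small.csv', 2, 100)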

Template Code for Part I. A template to be filled in with your code is provided here: sampler_lastname_id.py

Part II. MapReduce (65 points)

Here, you will complete a back-end for a MapReduce system and test it on a couple of MapReduce jobs: word count (provided), and MeanCharsMR (you must implement). Template code is provided. Specifically, you must complete:

Task II.A) PartitionFunction (10 points)

Complete the partition function, making sure to use a hash that can handle both integers and strings.
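One workable approach, shown as a plain function (the template's method takes self and reads the reducer count from the simulator's own attribute): summing character codes turns a string into a stable integer, avoiding Python's per-process string-hash salt in case partitions need to be reproducible across runs.

    def partition_function(k, num_reduce_tasks):
        # Reduce a string key to a stable integer; integer keys pass through.
        if isinstance(k, str):
            k = sum(ord(c) for c in k)  # stable across runs, unlike hash(str)
        return k % num_reduce_tasks

    # Both key types land on a reducer index in [0, num_reduce_tasks).
    assert partition_function('cat', 4) == partition_function('cat', 4)
    assert 0 <= partition_function(12345, 4) < 4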

Task II.B) RunSystem (20 points)

Complete the “runSystem(self)” method, which divides the data into chunks and schedules the running of mapTasks and reduceTasks. There are two places to complete (see the sketch after the list):

(1) Divide up the data into chunks according to num_map_tasks, and launch a map task per chunk.

(2) Send each key-value pair to its assigned reducer.
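A plain-function sketch of those two completion points; in the template they live inside runSystem(self) and use the simulator's own attributes and helpers (the parameter names here are illustrative):

    def run_system(data, num_map_tasks, num_reduce_tasks, map_task, partition):
        # (1) Split the data into num_map_tasks roughly equal chunks and run
        #     one map task per chunk, collecting all emitted (key, value) pairs.
        chunk_size = max(1, -(-len(data) // num_map_tasks))  # ceiling division
        map_outputs = []
        for i in range(0, len(data), chunk_size):
            map_outputs.extend(map_task(data[i:i + chunk_size]))

        # (2) Send each pair to its assigned reducer: the partition function
        #     guarantees that equal keys always land on the same reducer.
        to_reduce = [[] for _ in range(num_reduce_tasks)]
        for k, v in map_outputs:
            to_reduce[partition(k) % num_reduce_tasks].append((k, v))
        return to_reduce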

Task II.C) Combiner (15 points)

Edit the “MapTask” method to add support for running a Combiner. Look for “#<<COMPLETE>>” within the method. Remember, a combiner runs the reduce task at the end of the map task in order to save the communication cost of sending to multiple reducers. Note: main will run the WordCountBasicMR to test with and without the combiner. It is recommended that you look over WordCountBasicMR to understand what it is doing. You can assume your combiner code will only run on reduce functions that are both commutative and associative (see the hint at the bottom).
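The general shape of a combiner, sketched as a standalone function; in the template this logic goes inside the map task at the marked spot, and the reduce signature should be matched to WordCountBasicMR's:

    def combine(pairs, reduce_fn):
        # Group this map task's (key, value) pairs by key, then apply the
        # reduce function locally so only one pair per key leaves the mapper.
        grouped = {}
        for k, v in pairs:
            grouped.setdefault(k, []).append(v)
        # reduce_fn(key, values) -> (key, combined_value); only safe when
        # the reduce is commutative and associative, per the hint below.
        return [reduce_fn(k, vs) for k, vs in grouped.items()]

    # Example with a word-count style reduce:
    wc_reduce = lambda k, vs: (k, sum(vs))
    print(combine([('a', 1), ('b', 1), ('a', 1)], wc_reduce))  # [('a', 2), ('b', 1)]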

Task II.D) MeanCharsMR (20 points)

Edit the “map” and “reduce” methods of “MeanCharsMR” to implement a MapReduce computation of the mean and standard deviation of the number of occurrences of each character (a-z, case insensitive) per document.
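One way the pair of methods could look, assuming map receives (document id, document text) and reduce receives a key with the list of its values; emitting zero counts ensures every document contributes to every letter's statistics (check the template's actual signatures):

    from string import ascii_lowercase

    class MeanCharsMRSketch:
        # Illustrative only: the real template defines MeanCharsMR with
        # its own base class and signatures.

        def map(self, k, v):
            # k: document id, v: document text. Emit (letter, count) for
            # every letter a-z, zeros included, so documents that lack a
            # letter still pull that letter's mean down.
            text = v.lower()
            return [(c, text.count(c)) for c in ascii_lowercase]

        def reduce(self, k, vs):
            # k: a letter, vs: its per-document counts across all documents.
            n = len(vs)
            mean = sum(vs) / n
            var = sum((x - mean) ** 2 for x in vs) / n  # population variance
            return (k, (mean, var ** 0.5))

Note that this reduce computes a mean over the full list of values, which is not commutative/associative in the combiner sense, so a combiner must not run on it; that is exactly why Task II.C restricts combiners to commutative and associative reduces.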

Template Code for Part II. A template to be filled in with your code is provided here: MRSystemSimulator2020_lastname_id.py

Submission

Please use Blackboard to submit two files, each named with your last name and student id:

1.  sampler_<lastname>_<id>.py

2.  MRSystemSimulator2020_<lastname>_<id>.py

Do not upload a zip file. Double-check that your files are there and correct after uploading, and make sure to submit. Uploading files that are zips or any type other than Python code files will result in the submission being considered invalid. Partially uploaded files or non-submitted files will count as unsubmitted.

 

Questions: Please post questions to the course Piazza page.

Hints

As questions come in, this location will be used to track suggestions.

● Combiners require a reduce function that is both commutative and associative: f(v1, v2) = f(v2, v1) and f(f(v1, v2), v3) = f(v1, f(v2, v3)). For example, addition satisfies both; a pairwise mean f(v1, v2) = (v1 + v2)/2 is commutative but not associative.

 
