Big Data Homework 4 Solved


Enron Emails

Folder /home/public/enron contains individual emails of former Enron employees. There is one file per email, and the name of the employee is given by the subfolder name. These emails were used in the actual trial, but the judge decided to release them for public consumption. (I have all of them, but I am sharing only a few with you so that you do not necessarily need to write a script to load the data. You are encouraged to write a script, but because you have only a few emails you can insert them manually.)

Load the following attributes of each email into an appropriate hbase model: name of the employee (obtained from the name of the folder), email address of the sender, date of the email, the recipients (the "To" field) as a string, and the email body as a string.
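As a rough illustration of the extraction step, the sketch below pulls the five attributes out of one raw message using only the Python standard library. It assumes the files follow the usual header, blank line, body layout; the helper names `employee_from_path` and `parse_email`, the sample message, and the file name in the sample path are all hypothetical.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime
from pathlib import PurePosixPath


def employee_from_path(path, root="/home/public/enron"):
    """Return the first folder below root, which names the employee."""
    return PurePosixPath(path).relative_to(root).parts[0]


def parse_email(raw_text, employee):
    """Extract the five attributes the assignment asks for."""
    msg = message_from_string(raw_text)
    sent = parsedate_to_datetime(msg["Date"])
    return {
        "employee": employee,
        "sender": msg["From"],
        # Sortable timestamp string (in the sender's UTC offset).
        "date": sent.strftime("%Y%m%d%H%M%S"),
        # Header values may be folded over several lines; collapse whitespace.
        "to": " ".join((msg["To"] or "").split()),
        "body": msg.get_payload(),
    }


# Made-up sample message; real files carry more headers.
sample = (
    "Date: Mon, 14 May 2001 16:39:00 -0700\n"
    "From: jane.doe@example.com\n"
    "To: richard.roe@example.com\n"
    "Subject: Forecast\n"
    "\n"
    "Here is the forecast.\n"
)
record = parse_email(sample, employee_from_path("/home/public/enron/doe-j/1"))
```

Storing the date as a fixed-width `YYYYMMDDhhmmss` string is a design choice made here so that string order matches chronological order; it is not something the assignment prescribes.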

Recently I read in the newspaper FakeInnovations that entrepreneur Bogus John Enron wants to fund a company with the same employees (those jailed will work from the jail). Bogus John needs an hbase database that will track all the emails. The management of the company wants to be able to quickly query all emails for a given user, all emails during a given time period, and all emails for a given user during a given time period.
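Because hbase stores rows sorted by row key, the three queries map naturally onto key ranges. The sketch below shows one possible design, not necessarily the intended answer: a key of the form `employee|timestamp|id` serves the per-user and per-user-per-month queries, and a second key of the form `timestamp|employee|id` (for example in a second table) serves the month-only query. The `|` separator, the function names, and the sample keys are assumptions made for illustration.

```python
def user_key(employee, timestamp, msg_id):
    """Row key that groups all emails of one employee, ordered by time."""
    return f"{employee}|{timestamp}|{msg_id}"


def time_key(timestamp, employee, msg_id):
    """Row key that orders all emails by time, regardless of employee."""
    return f"{timestamp}|{employee}|{msg_id}"


def month_bounds(year, month):
    """Return (start, stop) timestamp prefixes for a month; stop is exclusive."""
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    return f"{year:04d}{month:02d}", f"{next_year:04d}{next_month:02d}"


def user_range(employee):
    """Key range covering every email of one employee."""
    # '}' directly follows '|' in ASCII, so the stop key sorts after
    # every key that starts with "employee|".
    return employee + "|", employee + "}"


def user_month_range(employee, year, month):
    """Key range covering one employee's emails in one month."""
    start, stop = month_bounds(year, month)
    return f"{employee}|{start}", f"{employee}|{stop}"


def select(keys, start, stop):
    """Emulate a range scan: start inclusive, stop exclusive."""
    return [k for k in sorted(keys) if start <= k < stop]


# Hypothetical keys for three emails.
keys = [
    user_key("doe-j", "20010514163900", "1"),
    user_key("doe-j", "20010602090000", "2"),
    user_key("roe-r", "20010520120000", "3"),
]
all_doe = select(keys, *user_range("doe-j"))
doe_may = select(keys, *user_month_range("doe-j", 2001, 5))
```

The month-only query would use `month_bounds` directly as the start and stop keys against rows built with `time_key`. Keeping two key layouts duplicates data but avoids a full-table scan for either access pattern.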

You have to perform the following tasks:

  1. Create an hbase database model.
  2. Import all emails in hbase.
  3. Return the bodies of all emails for a user of your choice (as a single text file).
  4. Return the bodies of all emails written during a particular month of your choice (as a single text file).

  5. Return the bodies of all emails of a given user during a particular month, both of your choice (as a single text file).
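For tasks 3 to 5, the export step is the same regardless of which rows were selected. A minimal sketch, assuming the rows arrive as `(row_key, data)` pairs with byte-string column names and values (the shape HappyBase's `Table.scan` yields); the column name `msg:body`, the separator, and the sample rows are hypothetical.

```python
import io


def write_bodies(rows, out, body_column=b"msg:body", separator="\n-----\n"):
    """Write the body of every row to one text stream; return the row count."""
    count = 0
    for _row_key, data in rows:
        body = data.get(body_column, b"")
        # Bodies are stored as bytes; replace anything that is not valid UTF-8.
        out.write(body.decode("utf-8", errors="replace"))
        out.write(separator)
        count += 1
    return count


# Hypothetical rows standing in for the result of a scan.
rows = [
    (b"doe-j|20010514163900|1", {b"msg:body": b"First body."}),
    (b"doe-j|20010602090000|2", {b"msg:body": b"Second body."}),
]
buffer = io.StringIO()
written = write_bodies(rows, buffer)
```

To produce the actual file, the same function can be given a file opened with `open(path, "w", encoding="utf-8")` in place of the in-memory buffer.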

Here are 2 options that you can choose from.

  1. Write an hbase script for all of the tasks. This is the route of least effort, but also the least rewarding.
  2. The hbase on wolf offers RESTful access. You can use either Python or Java to perform the task.

    Python: use the starbase module (https://github.com/barseghyanartur/starbase) or HappyBase (https://happybase.readthedocs.io/en/latest/).

    The RESTful interface for hbase on wolf runs on port 20550, so you have to initialize the connection as c = Connection(port=20550).

To fetch a single record, use 'get'; to fetch a range of records, use 'scan'.
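The difference between the two operations can be tried out without a cluster. The sketch below uses a small in-memory stand-in whose `put`, `row`, and `scan` methods mimic the HappyBase `Table` methods of the same names (with `row` playing the role of 'get'); the class `InMemoryTable`, the row keys, and the column name `msg:body` are hypothetical. Note that HappyBase normally connects through hbase's Thrift gateway rather than the REST port, so the port given above applies to starbase.

```python
class InMemoryTable:
    """Stand-in that mimics three HappyBase Table methods for local testing."""

    def __init__(self):
        self._rows = {}

    def put(self, row, data):
        # Merge the given columns into the row, creating it if needed.
        self._rows.setdefault(row, {}).update(data)

    def row(self, row):
        # 'get': direct lookup of exactly one row key; empty dict if absent.
        return dict(self._rows.get(row, {}))

    def scan(self, row_start=None, row_stop=None):
        # 'scan': walk keys in sorted order, start inclusive, stop exclusive.
        for key in sorted(self._rows):
            if row_start is not None and key < row_start:
                continue
            if row_stop is not None and key >= row_stop:
                break
            yield key, dict(self._rows[key])


table = InMemoryTable()
table.put(b"doe-j|20010514163900|1", {b"msg:body": b"May mail"})
table.put(b"doe-j|20010602090000|2", {b"msg:body": b"June mail"})
table.put(b"roe-r|20010520120000|3", {b"msg:body": b"Other user"})

# Single record: the full row key must be known.
single = table.row(b"doe-j|20010514163900|1")

# Range of records: only the boundaries of the key range are needed.
may = list(table.scan(row_start=b"doe-j|200105", row_stop=b"doe-j|200106"))
```

Against a real HappyBase table the last two calls should look the same, since `Table.row` and `Table.scan` accept these arguments; only the construction of `table` changes.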
