Project 1 Computer Science 143 Solved -Project 1 Part A&B Solved

70.00 $ 35.00 $

Category:
Click Category Button to View Your Next Assignment | Homework

You'll get a download link with a: . zip solution files instantly, after Payment

Description

5/5 - (1 vote)

TuneSearch

 

In this project, you will design a (very) simple search engine to search for song lyrics. We will call this search engine ​TuneSearch​ for lack of creativity and take a trip back to the 1990s, a very exciting time for the World Wide Web and music if I say so myself.

 

Before Google debuted in 1998, there were a ton of other search engines with their own pros and cons and technical achievements such as AltaVista, Excite, Lycos, Yahoo (it was a search engine at one point), HotBot, Dogpile, Ask Jeeves, WebCrawler, Inktomi, MetaCrawler, InfoSeek and All the Web. Most of these search engines used some form of an index, inverted index, and some ranking method. In later years, song lyric sites started popping up. In this project, we will use a sample of data from LyricsFreak.com.

 

In this project, students will build a basic search engine for song lyrics backed by several technologies, but most importantly PostgreSQL. Just think, this project could have made you a billionaire back in the 90s.

 

The other technologies that will be used in this project (​though students will not need to touch most of it​) are:

 

  1. Nginx, a high performance HTTP/web server. It has many other uses as a reverse proxy server, mail proxy server and a generic TCP/UDP proxy server, but we are only using the web server part.
  2. Flask​, a popular micro-platform for web and API development. Flask is a great way to prototype web projects and host small projects quickly. It is fairly lightweight.
  3. uWSGI​, is a protocol and serves as a “glue” between Nginx and Flask.
  4. Jinja2, ​is a templating engine used by Flask. Developers write standard HTML files and embed special tokens where variables passed to the template should be displayed.
  5. PostgreSQL, is an open source object-relational database management system with an​ emphasis on extensibility and standards compliance. It can handle workloads ranging from small single-machine applications to large Internet-facing applications with many concurrent users.
  6. psycopg2, a Python package for interacting with PostgreSQL from Python.
  7. Python 3, the world’s best programming language.

 

Since this is the beginning of a databases and data systems class, the focus of this project is on writing queries​ and the amount of new Python code that must be written should be minimal.

System Setup

To help students set up the uniform environment for the class project, we will be using VirtualBox to run the Linux operating system in a virtual machine. VirtualBox allows a single machine to share resources and run multiple operating systems simultaneously. You will need to download the following files

 

  • VirtualBox binary file for your host Operating System. The latest version is 6.0.4 and we have not experienced problems with it.
  • VirtualBox Image: ​CS143-19S-P1.ova ​- This is a very large file (~2.5GB), so it may take a while to download.

 

and follow our VirtualBox setup instruction to install VirtualBox on your own machine. The provided virtual machine image is based on Ubuntu 18.04 LTS, PostgreSQL 10.6, Nginx 1.14, Flask 1.0.2, Python 3.6.7.

 

If you have access to an equivalent machine that has PostgreSQL, Nginx, Python 3, and Flask installed, you may use it instead of the virtual machine image. In this case, we do recommend setting up Python and Flask using virtualenv. However, please note that we ​will not​ provide support for systems other than the virtual machine image, and that your project MUST be runnable on the provided virtual machine. We will be using the virtual machine image for grading purposes, and if your submission does not work within this setup, you may get zero points. We cannot make any exceptions to your project schedule for problems incurred by using your own computing facilities.

 

Project

Your project is to build a simple song lyrics search engine. Your system manages all of its data at the back-end in a PostgreSQL database and provides a Web interface to the users at the front-end. While this task may sound daunting at first, you will be surprised how easy and helpful a commercial DBMS is for such a project.

 

To develop Project 1 all students will use the PostgreSQL DBMS and the Ubuntu Linux operating system. You will also use the Python programming language to access PostgreSQL and the Flask microplatform (powered by Nginx) to provide a Web interface. While there are many other ways to develop Web applications, and while Apache and PHP is one of the most popular method these days to implement a database-backed server for a small-to-medium scale Web site, we want to use a standard language across both CS 143 projects and also introduce students to Flask, which is very popular in the data community and lightweight. We will provide online references that will help you learn Python and PostgreSQL. We will also give you a considerable amount of real data to populate your system.

 

For Project 1, we will expect certain minimal functionality in your system – beyond that, the sky’s the limit. Minimal functionality includes a variety of queries and searching capabilities over the data in your system. Various integrity constraints must be monitored. Although you will use PostgreSQL, which has reasonable transaction support, multi-user issues are not a focus of the project.

 

The main reason for using PostgreSQL in lieu of MySQL this quarter is that it follows the ANSI SQL standard presented in the textbook much, much more closely than does MySQL. While MySQL may be more popular (currently) and in some cases easier to use, it only implements a subset of the ANSI SQL standard and even fewer features.

 

Project 1 consists of two parts:

 

Part A: Creating Schemas and Loading Data

In this part, you will get familiar with the overall project environment and learn basic Python if you do not know them yet. You will write several queries to create database tables with the correct data types, keys and constraints and then load data into these tables. This part does not require​         Python, though some students may wish to practice by writing a Python script, using the psycopg2​               package, to load the data from the script instead.

 

Part B: PostgreSQL Backed Search Engine

In this part, you will write queries and some Python code that fetches search results given an initial search query, and display the results on a search page. This task involves SELECT statements, JOINs and subqueries and will provide good practice for the midterm.

 

TuneSearch Project 1 Part A

Partners

The CS143 project may be completed in teams of one or two students. The choice is up to each student. Partnership rules consist of the following:

 

  • An identical amount of work is expected and the same grading scale is used regardless of the size of your team.
  • If you work in a team of two, choose your partner carefully. Partners are permitted to “divorce” at any time during the course, and “single” student may choose to find a partner as the project progresses. However students from divorced teams may not form new teams or join other teams.
  • Partners in a team will receive exactly the same grade for each project part submitted jointly.

If you have two students in your team, please make sure to submit your work jointly – do ​not​ turn your project in twice, once for each partner.

 

Scope

The primary purpose of Part A is the following:

 

  1. To provide you with the data you need to complete the project.
  2. To create the PostgreSQL relation/table schemas you will need to complete the project.
  3. To load the data.
  4. To make sure the VirtualBox image is working correctly to ensure your further success.

 

We provide you with a whole bunch of basic information to get everyone up-to-speed on our computing systems and the languages and tools we will be using. Those of you who have used a database server before (remember, we do not expect any familiarity), will find this part nearly trivial. Those of you who haven’t will find it mostly straightforward. With all that said, ​please don’t start at the last minute​ — as with all programming and systems work, everything takes a bit of time, and unforeseen snafus do crop up.

 

 

System Setup

We will be using VirtualBox to run the Linux operating system in a virtual machine. VirtualBox allows a single machine to share resources and run multiple operating systems simultaneously. Read our VirtualBox setup instruction and follow the instructions to install VirtualBox and our virtual-machine image on your own machine.

 

The provided virtual machine image is based on Ubuntu 18.04.2, PostgreSQL 10.1, Nginx, Python 3 and Flask. You will need to use the provided VirtualBox guest OS to develop and test all projects for this class.

 

Your VirtualBox guest is essentially a Linux machine. Your guest is accessible from your host through secure shell (SSH) at localhost​          port ​     1422​     with username ​               cs143​   and password ​ cs143​   .​ The web server and framework (nginx + uwsgi + Flask) is accessible from your host at ​localhost port ​1480​.

 

If you have any problems with the image, let the TAs know ASAP on Piazza.

The Big Picture: What we are Going to Do

 

In this project, we are creating a very basic search engine a la the 1990s, called TuneSearch. We will assume that a TuneSearch user enters his/her query as a space-delimited string of tokens/words. The user can then use one of two radio buttons labeled AND or OR to determine how we will search for the terms. If OR is selected, any song that contains any of the words will be returned as search results. If AND is selected, only songs that contain ALL of the words will be returned. Nowadays, search engines like Google assume AND queries, but believe it or not, several search engines used to require the user to specify how to create the boolean predicate.

 

An example of searching from some lyrics:

 

An example of an AND query:

 

 

 

 

 

The results:

 

 

But it is not just enough to search for words in a song and present them to the user. We also need to rank ​them in a way that makes sense. This is very complicated for modern search engines, so we are going to keep it simple. One way is to just rank by how many times the words in the query appear in each song result. A slightly more sophisticated method that we will use involves using TF-IDF, which is a metric that explains how ​important ​a particular word is in a song. Words with high TF-IDF values are those that will lead to more specific results. TF stands for term frequency, and is simply the number of times a word appears in a song. IDF stands for Inverse Document Frequency. Its reciprocal is the number of songs where the particular term is used. TF-IDF will prioritize songs that use a rare word frequently. While a set of songs may contain the word ​raspberry​, it is rare enough that the song ​Raspberry Beret ​by Prince should be ranked first, because the word ​raspberry appears many times compared to others and not very often in other songs. A word such as ​and ​has a very low TF-IDF because it appears in practically all songs. We will discuss TF-IDF more in the specification for Project 1B. ​For project 1A, it is important to know that there is a TF-IDF score for each song-token pair.

The Data

The dataset consists of lyrics from several thousand songs collected from the popular website LyricsFreak in CSV format. This data was heavily cleaned by your professor. In your VirtualBox image, there is a directory named ​data​. This directory contains three files:

 

  1. csv​ contains data about each artist in the dataset. There is an ID number, and their name in quotation marks. The quotation marks are necessary because some names may have commas in them.
  2. csv​ contains data about each song in the dataset. There is an ID for both the song and the artist, the name of the song in quotation marks, and the suffix of the LyricsFreak URL where the data came from. For example, the song ​Ants Marching​ by

Dave Matthews Band has the URL

/d/dave+matthews+band/ants+marching_20036621.html​, which, as a full URL is http://www.lyricsfreak.com/d/dave+matthews+band/ants+marching_2 0036621.html

  1. csv contains data about the individual words (referred to as ​ ​tokens​) used in each song. There is an ID for the song, a token, and the number of times that word was used in the song. This table basically contains TF, or at least part of it.

 

The Database

Your VirtualBox instance contains a Postgres database server on it. You can access the command-line interface with the command ​psql​. You will be authorized as user ​cs143​, though note that this username did not need to be the same as the Linux username. We have set the password to ​cs143​.

 

If you are familiar with MySQL, you will immediately notice some differences. Let’s start with looking at what databases are stored in this server by using the ​\l​ command (lowercase L). Every database user (not necessarily Linux user) has their own database that they can do whatever they want with thus there are databases for ​cs143​ and postgres. ​postgres​ is a system user that is created when the server is installed. We will only be using user ​cs143​. There is a ​TEST database you are free to use for whatever you like. You can use ​              ​cs143​ or TEST​ for development purposes if you wish. There are databases for homework, and searchengine​. You will use database ​searchengine​ for project 1. ​template0​ and template1​ are system databases — don’t modify or delete them. ​template1​ is a backup of template0.​

 

Below are a few more commands that will be of use:

 

Command Postgres / psql MySQL Equivalent
List databases \l SHOW DATABASES;
List relations \d SHOW TABLES;
Show schema for relation ​r \d <r> DESCRIBE <r>;
Switch databases \c <dbname> USE <dbname>;

All SQL queries must be suffixed with a semicolon, but commands do not need to be.

Task 0: Load the Skeleton

Most of the Python code and web pages are already written for you. You can download these files at this link​. Navigate to your ​vm-shared​ directory on your system and unzip the file so that your directory structure looks like the following:

 

vm-shared |

|– logs

|–SearchEngine

 

Depending on your operating system, or your zip client, it may create a directory named after the ZIP file. If this happens, move everything within that directory so that it is directly under vm-shared​. This content is accessible on your VM at directory ​www​ in the home directory.

Task 1: Create the Schema

Your first task is to create the database schema for this problem. You will design four tables with appropriate names, specify types for each of the attributes, and specify the proper primary and foreign keys. The fourth table will contain the TF-IDF score, even though we are not computing it yet. What attributes should it contain? What should the keys be?

This type of problem would usually be solved with a system like ElasticSearch backed by an indexing engine like Lucene, but here we will use PostgreSQL and text processing code your professor has written for you.

 

Create a directory in your shared folder named  ​sql​. Create a file ​schema.sql​ that creates the relations in your schema. A SQL file is simply a collection of SQL queries that are executed in order. For your sanity, you are required to drop relations if they already exist. If you do not do this, Postgres will throw an error at each query saying the relation already exists.

 

You will be graded on using the correct data types without being wasteful, picking the correct keys, and of course, not yielding any errors or warnings.

 

Task 2: Load the Data

This task in some ways will allow you to check your work for any major problems from Task 1.

 

Now you will load the data into the schema you just created. There are a variety of ways to do this including command-line utilities, or writing a Python script with the ​psycopg2​ package. PostgreSQL uses the ​\copy ​(and/or ​COPY​ construct)​ ​command to read delimited files into Postgres pretty easily. Please take a look here for more information. You will want to pay attention to the file format, delimiter, quoting and header. This is similar to MySQL’s ​LOAD LOCAL INFILE​ construct.

 

In your ​sql​ directory, create a file called ​load.sql​ that contains, one on each line, a ​\copy command that loads the three CSV files into Postgres.

 

Once you are finished, you can check the number of rows in each relation to make sure they match:

 

Relation Description Expected Tuples
Artist info 643
Song info 57,650
Token info 5,375,064
TF-IDF scores 0

 

 

 

Late Submission Policy

Part 1A must be submitted by the deadline. See the syllabus for more information about the late policy.

 

Submission Instruction

Preparing Your Submission

Please create a folder named as your UID, put all your files into the folder, then compress this folder into a single zip file called ​P1A.zip​. If you are working with a partner, ​one​ partner will use their UID as the name for the folder. We will figure out who the partner is from ​TEAM.txt​.That is, the zip file should have the following structure.

P1A.zip

|

+- Folder named with Your UID, like “904200000” (without quotes)

|

+- schema.sql

|

+- load.sql

|

+- TEAM.txt

|

+- README.txt

 

Please note that the file names are case sensitive, so you should use the exact same cases for the file names. (For teamwork, use the submitter’s UID to name the folder)

 

  • sql​: A SQL file containing the schema to create the necessary relations.
  • sql​: A SQL file containing the ​\copy​ commands to load the data into Postgres.

Please make the paths absolute so we can check them automatically.

  • txt​: A plain text file that contains the UID(s) of every member of your team. If you worked by yourself, just include your UID. If you worked with a partner, write both UIDs on one line separated by a comma (,).
  • txt: A ​plain text​ file that contains anything thing else you feel may be useful to us. If you had any problems getting something to work they way we want it to, include it in this file. While you may receive a point deduction, it is better than 0. This is also the file to include any citations.

Projects are submitted to CCLE, and only to CCLE.

 

 

TuneSearch Project 1 Part B

Overview

In this part of the project, you will use the data you loaded into PostgreSQL to create a simple search engine. In this part, you will accomplish the following:

 

  1. Connect to the PostgreSQL database via the psycopg2 package for Python 3.
  2. Use the parameters submitted by the user to form a SQL query that will fetch and rank results.
  3. Gracefully pass the results to an HTML search results page.
  4. Allow the user to page through the results in multiples of 20. (20 results shown at a time)
  5. Protect the database against SQL injections, a common attack vector.

 

Project Layout

In the www​          ​ directory in your home directory on the VM (your shared folder), you will see three directories: ​SearchEngine​, ​logs​, and ​sql​. We will get to the logs directory later. The SearchEngine directory contains all of the files for this project, including files you will not need to touch that create a “glue” between nginx and Flask. This “glue” is uWSGI. There are a few different files in this directory, none which need to be touched:

 

  1. ini​. While nginx handles the web connections over port 80 on the VM, nginx passes all requests to uWSGI. This .ini file configures ​SearchEngine​ for use with uWSGI. It mostly defines file permissions and the number of processes that uWSGI will run. There are a couple of parameters though that will be of interest to you. touch-reload​ and​ py-autoreload​. When a Python file in this project changes, uWSGI and Flask will reload them. Without these options, you would need to restart uwsgi or nginx every time you changed a Python file. ​logto​ specifies where the contents of stderr will be written on disk. This is where the logs directory comes in.

 

  1. py ​loads the ​searchengine​ module so that Flask can see it. It also sets up logging to stderr and appends the path of the module to the Python search path. It then runs the Flask app.

 

Inside the ​SearchEngine​ directory, there is another directory called ​SearchEngine​ which houses a Python module called ​searchengine​. Got that? If you cd into that inner SearchEngine​ directory, you can tell it is a Python module because it contains a file called __init__.py​. This directory contains all of the files that process the data for the search engine as well as the files that control the display of the website. There are several files and directories here:

 

  1. static​ contains files that control the look and feel of the website, including the header image used for TuneSearch and a CSS file. ​Feel free to play with these​.
  2. templates contains an index.html file of plain HTML, and a results.html file which is a​ Jinja2 ​           ​You will need to modify this Jinja2 template to get pagination to work.
  3. py​ has two functions you will need to modify: dosearch()​ and​       index()​ to handle the HTTP request sent by the user.
  4. py​ which contains the main logic for connecting to the database, running the proper query and passing the results back to the Jinja2 template.

 

In the outermost ​www​ directory, there are two other directories: ​logs​ and ​sql​.

 

  1. logs​ contains one file, ​errlog​ which contains errors reported by Python to uWSGI.
  2. sql​ is an empty directory, and you may wish to house your queries from Part 1A in this directory.

 

How TuneSearch Works

As stated in the project overview, a user issues a simple query containing words (tokens). These tokens may also have punctuation attached to them, such as when searching for specific lyrics “Could anything ever feel this real forever?” The punctuation must be stripped. I have already written some code to do this for you.

 

The user selects either the AND constraint, or the OR constraint. If the user selects AND, only songs that contain ALL of the words in the query are returned. If the user selects OR, songs containing ANY of the words in the query are returned.

 

When the user visits the main search page, the function ​index()​ in ​searchengine.py​ fires off and simply renders an HTML file called ​index.html​. Whenever the user enters a search query and hits the search button, an HTTP GET request is passed to ​/search​, which is handled by the dosearch()​         function in the same file. You can access data from the request​        using the ​request.args​ dict including the query and the query type (AND or OR). We then move to the ​search()​ function in ​search.py​ where we connect to a database (if needed) and run a Postgres query that finds songs that match the terms in the query. The results are then returned back to the calling function to be rendered in the ​results.html​ Jinja2 template.

 

This may sound like a lot, but we have broken down this project into a series of small tasks to keep your organized. Tasks 1, 4, and 5 will probably take the most time.

 

To view the website, you do not need to open the HTML file or run Python. Just open a browser to http://localhost:1480!

 

Task 0: Redo Part 1A with Our Solution

To prevent double jeopardy and have everyone start off on the same foot, please see our solution for project 1A which will be posted after the two day grace period. Please do not wait for that. If there are errors, you can always reload the data. We want all students to use the same schema.

 

Task 1: Search Ranking Query

Write a SQL query that computes the TF-IDF score for each token in each song. To do this, you will need to first compute the number of songs that each token appears in (the Inverse Document Frequency, or IDF). You already have the term frequency TF. There are several variations on how to compute TF-IDF. We will use this simple variation:

 

where TF is the term (token) frequency of token ​i ​in song ​j​, |J| is the number of songs and DF​i is​                    the number of songs term ​i​ appears in. You will probably want to do this computation in a separate table, and as a one-time data processing step rather than as part of the app.

 

Then write a query against this table that handles both the AND case and the OR case. We can determine how well a song matches the query by summing across the TF-IDF scores in the search query and then ordering in descending order by TF-IDF. You will want to write your query such that you can test it using some of your own sample queries at the ​psql command​     line.

Task 2: Write Code to Connect and Disconnect from the Database

In ​search.py​, there is a comment with some instructions. You must connect to the Postgres instance running on your VM. Whenever you open a connection, you must also close it including when a failure occurs. Take a look at ​try/except/finally exception handling in Python.

 

Task 3: Integrate your Database Query in the search()​ Function​

In the search()​           function in ​       search.py​     , enter your PostgreSQL query or queries to handle​  the user’s input and extract the results from the query. See the helpful hints later in this document.

 

Each result lists the name of the song (which links to the corresponding LyricsFreak page), as well as the artist that performed it.

 

Task 4: Enable Pagination

You now have a TF-IDF table, and a Postgres query that ranks the search results. Enough Python code is written for you already to output all results in one page. But now we are going to add a Previous and Next button at the bottom of the search results page, if there are more than 20 search results. Each page of results should contain 20 results.

 

To enable pagination, we can rerun the query you just wrote, once for each page of results, but this is very slow and wasteful. Instead, we should use some other construct we learned in lecture to store the search results, and then write another query against those results to pull the results for each page. Do you know what construct that is? How will you maintain state as you move among pages of results? You may want to play with the ​request.args​ dict as well as embedding some information in the search results page that will indicate where we are in the list of full results.

 

Finally, the Previous/Next buttons should only show up when they are necessary. For example, The Previous button should not show up on the first page of results, and the Next button should not appear on the final page of results. Additionally, Next button should not be displayed if there are no more results to display.

 

Task 5: Protect Your Database from SQL Injections

Finally, handle user input in such a way that the user cannot commit a SQL injection. Note that because some text processing is done on the query, you won’t be able to actually do a SQL injection yourself. We just want your query to be protected from them.

 

Late Submission Policy

Part 1B must be submitted the deadline. See the syllabus for more information about the late policy.

 

Submission Instructions

Preparing Your Submission

Please create a folder named as your UID, put all your files into the folder, then compress this folder into a single zip file called ​P1B.zip​. If you are working with a partner, ​one​ partner will use their UID as the name for the folder. We will figure out who the partner is from TEAM.txt​      ​.That is, the zip file should have the following structure.

 

P1B.zip

|

+- Folder named with Your UID, like “904200000” (without quotes)

|

+- SearchEngine (the outermost directory, just like in the ZIP file)

|

+- sql

|

+- TEAM.txt

|

+- README.txt

(We do not need your logs directory)

 

Please note that the file names are case sensitive, so you should use the exact same cases for the file names. (For teamwork, use the submitter’s UID to name the folder)

 

  • The​ sql ​directory should contain the following:

○ load.sql​ containing the query you used to populate the TF-IDF table.

○    Please make the paths absolute so we can check them automatically.

  • txt​: A plain text file that contains the UID(s) of every member of your team. If you worked by yourself, just include your UID. If you worked with a partner, write both UIDs on one line separated by a comma (,).
  • txt​: A plain text​ file that contains anything thing else you feel may be useful to us.

If you had any problems getting something to work they way we want it to, include it in this

file. While you may receive a point deduction, it is better than 0. This is also the file to include any citations.

Projects are submitted to CCLE, and only to CCLE.

Testing of Your Submission

Grading is a difficult and time-consuming process. In order to help you ensure you have the correct files for submission, you can test your packaging by downloading this Test Script​.

In essence, this script unzips your submission to a temporary directory and tests whether or not you have included all your submission files. Once you download the test script, it can be executed like:

~$ ./p1b_test <Your UID>

(Put your P1B.zip file in the same directory with this test script, you may need to use “chmod +x p1b_test” if there is a permission error).

You ​MUST​ test your submission using the script before your final submission to minimize the chance of an unexpected error during grading. Significant points may be deducted if the grader encounters an error during grading. When everything runs properly, you will see an output similar to the following from this script:

Check File Successfully. Please upload your P1B.zip file to CCLE.

Helpful Hints and Troubleshooting

When you make a Python or SQL error, nginx will throw an unhelpful HTTP 502 error. If this happens, your first course of action is to look at logs/errlog​        ​ as this is where most errors from Python and Postgres will be recorded. If you catch exceptions, you need to be very careful or else you may never be able to see the error message that is causing you grief.

 

You can also look at the nginx error log at ​/var/log/nginx/error_log​ but that is generally more helpful for configuration problems.

 

The file​ search.py​ contains a small ​main()​ function that you can use to pass search queries to Postgres via your Python code. You can then embed ​print​ statements, use a debugger, or other methods to see what is being returned from Postgres. You can change the ​main() function as you see fit for your testing.

 

The goal of this project is to do something interesting with Postgres. We are not interested in testing a bunch of edge and corner cases on the parsing of the search query (except for preventing SQL injections), so don’t go overboard. 🙂

 

Resources

psycopg2 Postgres Driver for Python 3

 

Jinja2 Documentation

 

Flask Documentation

 

Textbook, 7th Edition, ​Chapter 9

This chapter focuses on Java, but the concepts are highly relevant.

  • Project-1-TuneSearch-arxogj.zip