Name: EECS 485 Project 5: Wikipedia Search Engine Solved
SKU: 53975
Price: 45.00 USD
Availability: InStock

5/5 - (3 votes)

Introduction

Build a scalable search engine that is similar to a commercial search engine. The search engine in this assignment has several features:

Indexing implemented with MapReduce so it can scale to very large corpus sizes

Information retrieval based on both tf-idf and PageRank scores

A new search engine interface with two special features: user-driven scoring and summarization.

The learning goals of this project are information retrieval concepts like PageRank and tf-idf, parallel data processing with MapReduce, and writing an end-to-end search engine. Table of Contents

Setup

Inverted Index with MapReduce

Index server

Search interface

Deploy to AWS

Submitting and grading

FAQ

Setup

Group registration

AWS account and instance

You will use Amazon Web Services (AWS) to deploy your project. AWS account setup may take up to 24 hours, so get started now. Create an account, launch and configure the instance. Don’t deploy yet. AWS Tutorial .

Project folder

Create a folder for this project (instructions ). Your folder location might be different.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

Version control

Set up version control using the Version control tutorial.

Be sure to check out the Version control for a team tutorial.

After you’re done, you should have a local repository with a “clean” status and your local repository should be connected to a remote GitLab repository.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ git status

On branch master

Your branch is up-to-date with ‘origin/master’.

nothing to commit, working tree clean

$ git remote -v

origin https://gitlab.eecs.umich.edu/awdeorio/p5-search-engine.git (fetch) origin https://gitlab.eecs.umich.edu/awdeorio/p5-search-engine.git (push) You should have a _.gitignore file (instructions).

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ head .gitignore

This is a sample .gitignore file that’s useful for EECS 485 projects.

…

Python virtual environment

Create a Python virtual environment using the Python Virtual Environment Tutorial.

Check that you have a Python virtual environment, and that it’s activated (remember _sourceenv/bin/activate ).

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ ls -d env env

$ echo $VIRTUAL_ENV

/Users/awdeorio/src/eecs485/p5-search-engine/env

Starter files

Download and unpack the starter files.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ wget https://eecs485staff.github.io/p5-search-engine/starter_files.tar.gz $ tar -xvzf starter_files.tar.gz

Move the starter files to your project directory and remove the original _{starter_files/}directory and tarball.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ mv starter_files/* .

$ rm -rf starter_files starter_files.tar.gz

This project involves very few starter files. You have learned everything you need to build this project from (almost) scratch. You are responsible for converting the starter files structure into the final structure. Each part of the spec will walk you through what we expect in terms of structure.

bin/ : As in previous projects, this is where convenience shell scripts will be located. You will

write search , index , and indexdb hadoop/ : MapReduce programs live here. See Inverted Index with Mapreduce and the

following Hadoop setup and example instructions for more details. hadoop/word_count/ : Sample MapReduce program, with input. hadoop/inverted_index/ : Build a pipeline of MapReduce programs in this directory. You will

write several map*.py and reduce*.py files, as well as a pipeline.sh script. index/index/ : Python code for the index server. See Index server for more details. search/search/ : Search interface app. See Search interface for more details.

Both the index server and the search server are separate Flask apps. The search server will serve a bundle.js of your react source code. The index server will be styled after the project 3 REST API. As in previous projects, both will be python packages that should have a _setup.py.

At the end of this project your directory structure should look something like this:

$ tree –matchdirs -I ‘env|__pycache__’

├── bin

│ ├── index

│ ├── indexdb

│ └── search

├── hadoop

│ ├── hadoop-streaming-2.7.2.jar

│ ├── inverted_index

│ │ ├── input.txt

│ │ ├── input_split.py

│ │ ├── map0.py

│ │ ├── map1.py

│ │ ├── …

│ │ ├── output_sample.txt

│ │ ├── pipeline.sh

│ │ ├── reduce0.py

│ │ ├── reduce1.py

│ │ ├── …

│ │ └── stopwords.txt

│ └── word_count

│ ├── input

│ │ ├── file01

│ │ └── file02

│ ├── map.py

│ └── reduce.py

├── index

│ ├── index

│ │ ├── __init__.py

│ │ ├── api

│ │ │ └── *.py

│ │ ├── inverted_index.txt

│ │ ├── pagerank.out

│ │ └── stopwords.txt

│ └── setup.py

├── search

│ ├── node_modules

│ ├── package-lock.json

│ ├── package.json

│ ├── search

│ │ ├── __init__.py

│ │ ├── api

│ │ │ └── *.py

│ │ ├── config.py │ │ ├── js

│ │ │ └── *.jsx

│ │ ├── sql

│ │ │ └── wikipedia.sql

│ │ ├── static

│ │ │ └── js

│ │ │ └── bundle.js

│ │ ├── templates

│ │ │ └── *.html

│ │ ├── var

│ │ │ └── wikipedia.sqlite3

│ │ └── views

│ │ └── *.py

│ ├── setup.py

│ └── webpack.config.js

└── tests

Install Hadoop

Hadoop does not work on all machines. For this project, we will use a lightweight Python implementation of Hadoop. The source code is in _{tests/utils/hadoop.py}.

$ pwd # Make sure you are in the root directory

/Users/awdeorio/src/eecs485/p5-search-engine

$ echo $VIRTUAL_ENV # Check that you have a virtual environment

/Users/awdeorio/src/eecs485/p5-search-engine/env

$ pip install sh # Install a required third party sh module Collecting sh

…

$ pushd $VIRTUAL_ENV/bin

$ ln -sf ../../tests/utils/hadoop.py hadoop # Link to provided hadoop implementation

$ popd

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ which hadoop

/Users/awdeorio/src/eecs485/p5-search-engine/env/bin/hadoop

$ hadoop -h # hadoop command usage usage: hadoop [-h] -D PROPERTIES -input INPUT -output OUTPUT -mapper MAPPER

-reducer REDUCER

Lightweight Hadoop work-alike.

optional arguments:

-h, –help show this help message and exit required arguments:

-D PROPERTIES

-input INPUT

-output OUTPUT

-mapper MAPPER

-reducer REDUCER

If you want to play around with the real Hadoop, Appendix: B has the installation steps listed. However, for this project, we will be using our Python implemented version of Hadoop. You do not need to use the real Hadoop for this project.

End-to-End Testing

See the End-to-End Testing Tutorial in Project 3 to install Google Chrome and ChromeDriver.

Install script

Installing all of the tool chain requires a lot of steps! Follow the Project 3 Tutorial and write a bash script _bin/install to install your app. However, we need to make some small changes from your previous install script.

This project has a different folder structure, so the server and front-end installation instruction needs to change. You will also have two backend servers running for this project, so you will need to install two different python packages.

Install back end

pip install -e index # index server pip install -e search # search server

Install front end

pushd search npm install . popd

If running the install script on CAEN, _{pip install} might throw PermissionDenied errors when updating packages. To allow installation and updates of packages using pip add the following lines before the first _{pip install} command.

Tell pip to write to a different tmp directory.

mkdir -p tmp export TMPDIR=tmp

You will also need to install the python implementation of hadoop. Add this to the end of your install script.

Install the hadoop implementation

pushd $VIRTUAL_ENV/bin

ln -sf ../../tests/utils/hadoop.py hadoop popd

“Hello World” with Hadoop

Next we will run a sample mapreduce word count program with Hadoop. This example will run on both real Hadoop and the provided lightweight Python implementation.

The example MapReduce program lives in _{hadoop/word_count/}. Change directory.

$ cd hadoop/word_count/

$ ls

input map.py reduce.py

The inputs live in hadoop/word_count/input/ .

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/word_count $ tree

├── input

│ ├── file01

│ └── file02

├── map.py

└── reduce.py $ cat input/file* hadoop map reduce file map map streaming file reduce map reduce is cool hadoop file system google file system

Run the Hadoop mapreduce job.

jar index/hadoop/hadoop-streaming-2.7.2.jar is required by the real Hadoop. The

provided implementation ignores this argument.

-D mapreduce.job.maps=N number of mappers

-D mapreduce.job.reduces=N number of reducers

-input DIRECTORY input directory

-output DIRECTORY output directory

–_{mapper FILE} mapper executable

-reducer FILE reducer executable

$ hadoop \ jar ../hadoop-streaming-2.7.2.jar \

-D mapreduce.job.maps=2 \

-D mapreduce.job.reduces=2 \

-input input \

-output output \

-mapper ./map.py \

-reducer ./reduce.py

Starting map stage

+ ./map.py < output/hadooptmp/mapper-input/part-00000 > output/hadooptmp/mapper-output/par + ./map.py < output/hadooptmp/mapper-input/part-00001 > output/hadooptmp/mapper-output/par Starting group stage

+ cat output/hadooptmp/mapper-output/* | sort > output/hadooptmp/grouper-output/sorted.out Starting reduce stage

+ ./reduce.py < output/hadooptmp/grouper-output/part-00000 > output/hadooptmp/reducer-outp + ./reduce.py < output/hadooptmp/grouper-output/part-00001 > output/hadooptmp/reducer-outp Output directory: output

You will see an _output directory created. This is where the output of the MR job lives. You are interested in all _part-XXXXX files.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/word_count

$ ls output/ hadooptmp part-00000 part-00001

$ cat output/part-* cool 1 google 1 is 1 reduce 3 system 2 file 4

hadoop 2

map 4

streaming 1

Inverted Index with MapReduce

You will be building and using a MapReduce based indexer in order to process the large dataset for this project. You will be using Hadoop’s command line streaming interface that lets you write your indexer in the language of your choice instead of using Java (which is what Hadoop is written in). However, you are limited to using Python 3 so that the course staff can provide better support to students. In addition, all of your mappers must output both a key and a value (not just a key). Moreover, keys and values BOTH must be non-empty. If you do not need a key or a value, common practice is to emit a “1”

There is one key difference between the MapReduce discussed in class and the Hadoop streaming interface implementation: In the Java Interface (and in lecture) one instance of the reduce function was called for each intermediate key. In the streaming interface, one instance of the reduce function may receive multiple keys. However, each reduce function will receive all the values for any key it receives as input.

You will not actually run your program on hundreds of nodes; it is possible to run a MapReduce program on just one machine: your local one. However, a good MapReduce program that can run on a single node will run fine on clusters of any size. In principle, we could take your MapReduce program and use it to build an index on 100B web pages.

For this project you will create an inverted index for the documents in hadoop/inverted_index/input.txt through a series of MapReduce jobs. This inverted index will

contain the idf, term frequency, and document normalization factor as specified in the information retrieval lecture slides. The format of your inverted index must follow the format, with each data element separated by a space:

<term> <idf> <doc_id_x> <occurrences in doc_id_x> <doc_id_x normalization factor before sq

A sample of the correctly formatted output for the inverted index can be found in hadoop/inverted_index/output_sample.txt . You can also use this sample output to check the

accuracy of your inverted index.

For your reference, here are the definitions of term frequency and idf:

And here is the definition for normalization factor for a document:

Splitter script

Each document has three properties: _{doc_id}, _{doc_title}, and _{doc_body}. Your mapper code will be receiving input via standard in and line-by-line. As a result, the input is newline separated and each document is represented by 3 lines: 1st is the _{doc_id}, 2nd is the _{doc_title}, and the 3rd is the _{doc_body}. This will allow you to read the input in your mapper function easily.

Input for Hadoop programs is an input directory, not file. As you will notice, our input is in one large file called _{hadoop/inverted_index/input.txt} but your code runs with multiple mappers, and each mapper gets one input file. Consequently, you must write a custom Python script to break the large _input.txt input file with over 3,000 documents (3 lines each) into smaller files and place these files into the _{hadoop/inverted_index/input} directory. You can name your smaller input files however you like, as long as they are under the _input/ directory. Do not split the input file into one file per document; this will make your MapReduce pipeline run very slowly! We do not require a specific number of smaller input files, but ~30 is fine.

Example

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index

$ ./input_split.py

Output directory: input/ $ ls -d input.txt input input/ input.txt

At this point, you should be able to pass _{test_pipeline03_split_file} in test_pipeline_public.py .

$ pytest -v ./tests/test_pipeline_public.py::TestPipelinePublic::test_pipeline03_split_fil

Hadoop pipeline script

This script will execute several MR jobs. The first one will be responsible for counting the number of documents in the input data. Name the mapper and reducer for this map0.py and reduce0.py, respectively. The next N jobs will be a “pipeline”, meaning that you will chain multiple jobs together such that the input of a job is the output of the previous job.

We have provided _{hadoop/inverted_index/pipeline.sh}, which has helpful comments for getting started.

Job 0

The first MapReduce job that you will create counts the total number of documents. This should be run with as many mapper workers as input files and only ONE reducer worker. The mapper and reducer executables should be named _map0.py and _reduce0.py, respectively. The reducer should save the total number of documents in a file called _{total_document_count.txt}. The only data in this file will be an integer representing the total document count. The total_document_count.txt file should be created in whichever directory the pipeline is executed

from.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index

$ ./pipeline.sh

…

Output directory: output

$ cat total_document_count.txt

3268

MapReduce pipeline

You will be going from large datasets to an inverted index, which involves calculating quite a few values for each word. As a result, you will need to write several MapReduce jobs and run them in a pipeline to be able to generate the inverted index. The first MapReduce job in this pipeline will get input from _input/ and write its output to _output/. The starter file _pipeline.sh shows how to pipe the output from one MapReduce job into the next one.

To test your MapReduce program, we recommend that you make a new test file, with only 10-20 of the documents from the original large file. Once you know that your program works with a single input file, try breaking the input into more files, and using more mappers.

Each of your MapReduce jobs in this pipeline should have mappers and reducers equal to the number of your input files. However, your code should still work when the number of input files and correspondingly the number of mappers is changed. You may only use maximum of 9 MapReduce jobs in this pipeline (but the inverted index can be produced in fewer). The first job in the pipeline (the document counter) must have mapper and reducer executables named

map0.py , reduce0.py respectively, and the second job should be map1.py , reduce1.py , etc.

The format of your inverted index must follow the format, with each data element separated by a space: word -> idf -> doc_id -> number of occurrences in doc_id -> doc_id’s normalization factor (before sqrt)

Sample of formatted output can be found in hadoop/inverted_index/output_sample.txt . The order of words in the inverted index does not matter. If a word appears in more than one doc it does not matter which doc_id comes first in the inverted index. Note that the log for idf is computed with base 10 and that some of your decimals may be slightly off due to rounding errors. In general, you should be within 5% of these sample output decimals.

When building the inverted index file, you should use a list of predefined stopwords to remove words that are so common that they do not add anything to search quality (“and”, “the”, etc). We have given you a list of stopwords to use in hadoop/inverted_index/stopwords.txt .

When creating your index you should treat capital and lowercase letters as the same (case insensitive). You should also only include alphanumeric words in your index and ignore any other symbols. If a word contains a symbol, simply remove it from the string. Do this with the following code snippet:

import re

re.sub(r'[^a-zA-Z0-9]+’, ”, word)

You should remove non-alphanumeric characters before you remove stopwords. Your inverted index should include both _{doc_title} and _{doc_body} for each document.

To construct the inverted index file used by the index server, concatenate all the files in the output directory from the final MapReduce job, and put them into a new file. Make sure to copy any updated inverted index file to your index Flask app ( _index/index). This can either be done at the end of the pipeline.sh or run manually after the ouput files are created.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index cat output/* > inverted_index.txt

MapReduce specifications

There are a few things you must ensure to get your mappers and reducers playing nicely with both hadoop and our autograder. Here is the list:

Emit both keys and values from all mappers.

Do not emit empty keys or values.

Split key value pairs using a tab. It should be as such “key\tvalue”

Please have a newline at the end of each of your mapper and reducer outputs at every stage

Your map-reduce output must emit more than a single line.

Sample input.

The Document: A

This document is about Mike Bostock. He made d3.js and he’s really cool 2

The Document: B

Another flaw in the human character is that everybody wants to build and nobody wants to d

Document C:

Originality is the fine art of remembering what you hear but forgetting where you heard it

Sample output.

character 0.47712125471966244 2 1 1.593512841936855 maintenance 0.47712125471966244 2 1 1.593512841936855 mike 0.47712125471966244 1 1 1.138223458526325 kurt 0.47712125471966244 2 1 1.593512841936855 peter 0.47712125471966244 3 1 2.048802225347385 flaw 0.47712125471966244 2 1 1.593512841936855 heard 0.47712125471966244 3 1 2.048802225347385 cool 0.47712125471966244 1 1 1.138223458526325 remembering 0.47712125471966244 3 1 2.048802225347385 laurence 0.47712125471966244 3 1 2.048802225347385 d3js 0.47712125471966244 1 1 1.138223458526325 made 0.47712125471966244 1 1 1.138223458526325 build 0.47712125471966244 2 1 1.593512841936855 document 0.0 2 1 1.593512841936855 3 1 2.048802225347385 1 2 1.138223458526325 originality 0.47712125471966244 3 1 2.048802225347385 bostock 0.47712125471966244 1 1 1.138223458526325 forgetting 0.47712125471966244 3 1 2.048802225347385 hear 0.47712125471966244 3 1 2.048802225347385 art 0.47712125471966244 3 1 2.048802225347385 human 0.47712125471966244 2 1 1.593512841936855 fine 0.47712125471966244 3 1 2.048802225347385 vonnegut 0.47712125471966244 2 1 1.593512841936855

Common problems

Your map and reduce programs should use relative paths when opening an input file. For example, to access stopwords.txt :

with open(“stopwords.txt”, “r”) as stopwords: for line in stopwords:

# Do something with line

Testing

Once you have implemented your MapReduce pipeline to create the inverted index, you should be able to pass all tests in test_doc_counts_public.py and test_pipeline_public.py .

$ pytest -v tests/test_doc_counts_public.py tests/test_pipeline_public.py

Index server

The index server is a separate service from the search interface that handles search queries and returns a list of relevant results. Your index server is a RESTful API that returns search results in JSON format that the search server processes to display results to the user.

Directory structure

In this project since you have two servers, the index server and the search server, you will need separate directories for each of your server applications. Your index server code is going to be inside the directory _index/index and your _setup.py is going to be inside the directory _index. This is to allow you to use python packages with two different servers in the same project root directory. Note that the _setup.py is always in a directory above the directory in which your application’s code is. Below is an example of what your final index directory structure should look like.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ tree index index/

├── index

│ ├── __init__.py

│ ├── api

│ │ ├── __init__.py

│ │ └── *.py

│ ├── inverted_index.txt

│ ├── pagerank.out

│ └── stopwords.txt

Your _index and _index/api directories need to be python packages.

PageRank integration

In lecture, you learned about PageRank, an algorithm used to determine the relative importance of a website based on the sites that link to that site, and the links to other sites that appear on that website. Sites that are more important will have a higher PageRank score, so they should be returned closer to the top of the search results.

In this project, you are given a set of pages and their PageRank scores in _pagerank.out, so you do not need to implement PageRank. However, it is still important to understand how the PageRank algorithm works!

Your search engine will rank documents based on both the query-dependent tf-idf score, as well as the query-independent PageRank score. The formula for the score of a query q on a single document d should be:

where w is a decimal between 0 and 1, inclusive. The value w will be a URL parameter. The w parameter represents how much weight we want to give the document’s PageRank versus its cosine similarity with the query. This is a completely different variable from the “w_ik” given by the formula in Inverted Index with MapReduce. The final score contains two parts, one from pagerank and the other from a tf-idf based cosine similarity score. The PageRank(d) is the pagerank score of document d, and the tfIdf(q, d) is the cosine similarity between the query and the document tf-idf weight vectors. Each weight in a tf-idf weight vector is calculated via the formula in Inverted Index with MapReduce for “w_ik”. Treat query q as a simple AND, non-phrase query(that is, assume the words in a query do not have to appear consecutively and insequence).

Integrating PageRank scores will require creating a second index, which maps each _{doc_id} to its corresponding precomputed PageRank score, which is given to you in _pagerank.out. You can do this where you read the inverted index. This index should be accessed at query time by your index server.

If you still have some confusion with these calculations, please refer to Appendix A

Returning hits

When the Index server is run, it will load the inverted index file, pagerank file, and stopwords file into memory and wait for queries. It should only load the inverted index and pagerank into memory once, when the index server is first started.

Note: The t2 micro instance used to create the AWS instance has a small memory capacity, so please make sure the data structure used to store the inverted index is memory effiicent where possible. You can check how much memory your server is using by starting up the index server and checking the memory footprint of it using top or other MacOS/Windows memory analysis tools.

Every time you send an appropriate request to the index server, it will process the user’s query and use the inverted index loaded into memory to return a list of all of the “hit” _{doc_ids}. A hit will be a document that is similar to the user’s query. When finding similar documents, only include documents that have every word from the query in the document. The index server should not load the inverted index or pagerank into memory every time it receives a query!

Index Server Endpoint

Route: http://{host}:{port}?w=<w>&q=<query>

Your index server should only have one endpoint, which receives the pagerank weight and query as URL parameters w and q, and returns the search results, including the relevance score for each result (calculated according to the previous section). For example, the endpoint _?

w=0.3&q=michigan%20wolverine would correspond to a pagerank weight of 0.3, and a query “michigan wolverine”.

Your index server should return a JSON object in the following format. A full walkthrough of the example is shown in Appendix A.

{

“hits”: [

{

“docid”: 868657,

“score”: 0.071872572435374

}

]

}

The documents in the hits array must be sorted in order of relevance score, with the most relevant documents at the beginning, and the least relevant documents at the end. If multiple documents have the same relevance score, sort them in order of _{doc_id} (with the smallest

doc_id first).

When you get a user’s query make sure you remove all stopwords. Since we removed them from our inverted index, we need to remove them from the query. Also clean the user’s query of any nonalphanumeric characters as you did with the inverted index.

Running the index server

Install you index server app by running the command below in your project root directory:

$ pip install -e index

To run the index server, you will use environment variables to put the development server in debug mode, specify the name of your app’s python module and locate an additional config file. (These commands assume you are using the bash shell.)

$ export FLASK_DEBUG=True

$ export FLASK_APP=index

$ export INDEX_SETTINGS=config.py

$ flask run –host 0.0.0.0 –port 8001

Serving Flask app “index”
Forcing debug mode on
Running on http://127.0.0.1:8001/ (Press CTRL+C to quit)

Common problems

Make sure that your index server reads stopwords.txt , pagerank.out from the index/index/ directory.

Pro-tip: use an absolute path and Python’s _{__file__} variable. The following code will result in stopwords_filename = “index/index/stopwords.txt” .

“””index/index/api/views.py”””

index_package_dir = os.path.dirname(os.path.dirname(__file__)) stopwords_filename = os.path.join(index_package_dir, “stopwords.txt”)

Testing

Once you have implemented your Index Server, you should be able to pass all tests in test_index_server_public.py .

$ pytest -v tests/test_index_server_public.py

Search interface

The third component of the project is a React driven interface for the search engine. The search interface app will provide a GUI for the user to enter a query and specify a Pagerank weight, and will then send a request to your index server. When it receives the search results from the index server, it will display them on the webpage.

Directory structure

As mentioned earlier, in this project since you have two separate servers, the index server and the search server, you will need a separate directory for you search server. Your search interface server code is going to be inside the directory _{search/search} and your _setup.py is going to be inside the directory _search. Your search server is supposed to be a python package. Below is an example of what your final search directory structure should look like.

$ tree search

├── search

│ ├── package.json

│ ├── package-lock.json

│ ├── node_modules

│ │ └── */

│ ├── search

│ │ ├── api

│ │ │ └── *.py

│ │ ├── js

│ │ │ └── *.jsx

│ │ ├── config.py

│ │ ├── sql

│ │ │ └── wikipedia.sql

│ │ ├── static

│ │ │ └── js

│ │ │ └── bundle.js

│ │ ├── templates

│ │ │ └── *.html

│ │ ├── var

│ │ │ └── wikipedia.sqlite3

│ │ └── views

│ │ └── *.py

│ ├── setup.py

│ └── webpack.config.js

Your _search, _views, and _api directories need to be Python packages.

Database

You will need to create a new database for project 5, with a table called Documents as follows:

docid : INT and PRIMARY KEY title : _VARCHAR with max length 100 categories : VARCHAR with max length 5000 image : _VARCHAR with max length 200 summary : _VARCHAR with max length 5000

The SQL to create this table and load the necessary data into it is provided in search/search/sql/wikipedia.sql inside the starter files.

GUI

Make a simple search interface for your database of Wikipedia articles which allows users to input a query and a w value, and view a ranked list of relevant docs. This is the main route ( http://{host}:{port}/ ).

Users can enter their query into a text input box ( _{<input type=”text”>}), and specify the w value using a slider. The query is executed when a user presses a ( _{<input type=”submit”>}). You can assume anyone can use your search engine; you do not need to worry about access rights or sessions like past projects. Your engine should receive a simple AND, non-phrase query (that is, assume the words in a query do not have to appear consecutively and in-sequence), and return the ranked list of docs. Your search results will be titles of Wikipedia documents, displayed exactly as they appear in the database. Search results will show at most 10 documents. Any other content should appear below.

This page must include a slider ( _{<input type=”range”>}) that will set the w GET query parameter in the URL, and will be a decimal value between 0-1, inclusive. You should set the range slider’s step value to be 0.01.

When the user clicks the submit button of the search GUI, you should fetch the appropriate titles from the index server. For each title returned, display the title in a p tag with a class=”doc_title” ( _{<p className=”doc_title”>}) as well as a button to display the summary for that title with a value=”Show Summary”( <input type=”submit” value=”Show Summary”> )

If there are no search results, no action is required

Upon clicking the ‘show summary’ button for a title, the list of results sould be replaced with a summary of the document. You may simply place the summary data within a _<p> tag with a

class=”doc\_summary” ( <p className=”doc_summary”> ). In addition to having this summary, this

page will also have a “Similar Documents” section, which will show at most 10 documents using w=0.15, and the current document’s title as the search query. For each related document title, render a p tag with class=”doc_title” ( <p className=”doc_title”> ).

Running the search server

Install you search server app by running the command below in your project root directory:

$ pip install -e search

To run the search server, you will use environment variables to put the development server in debug mode, specify the name of your app’s python module and locate an additional config file. (These commands assume you are using the bash shell.)

$ export FLASK_DEBUG=True $ export FLASK_APP=search

$ export SEARCH_SETTINGS=config.py

$ flask run –host 0.0.0.0 –port 8000

Serving Flask app “search”
Forcing debug mode on
Running on http://127.0.0.1:8000/ (Press CTRL+C to quit)

In order, to be able to search on the search server, you must have your index server up and running, since it is the response of the index server that the search server uses to display the search results.

Shell scripts to launch both servers

You will be responsible for writing shell scripts to launch both the index server and the search server. Hint: these will be somewhat similar to last project’s bin/mapreduce script. The name of these scripts must be bin/index and bin/search .

Example: start search.

$ ./bin/search start starting search server …

+ export FLASK_APP=search

+ export SEARCH_SETTINGS=config.py

+ flask run –host 0.0.0.0 –port 8000 &> /dev/null &

Example: stop search.

$ ./bin/search stop stopping search server …

+ pkill -f ‘flask run –host 0.0.0.0 –port 8000’ Example: restart search. Your PIDs may be different.

$ ./bin/search restart stopping search server …

+ pkill -f ‘flask run –host 0.0.0.0 –port 8000’ starting search server …

+ export FLASK_APP=search

+ export SEARCH_SETTINGS=config.py

+ flask run –host 0.0.0.0 –port 8000 &> /dev/null &

Example: start search when something is already running on port 8000

$ ./bin/search start

Error: a process is already using port 8000 Example: start index.

$ ./bin/index start starting index server …

+ export FLASK_APP=index

+ flask run –host 0.0.0.0 –port 8001 &> /dev/null & Example: stop index.

$ ./bin/index stop stopping index server …

+ pkill -f ‘flask run –host 0.0.0.0 –port 8001’ Example: restart index.

$ ./bin/index restart stopping index server …

+ pkill -f ‘flask run –host 0.0.0.0 –port 8001’ starting index server …

+ export FLASK_APP=index

+ flask run –host 0.0.0.0 –port 8001 &> /dev/null &

Example: start index when something is already running on port 8001

$ ./bin/index start

Error: a process is already using port 8001

flask run –host 0.0.0.0 –port 8000 &> /dev/null & will start up your flask server in the

background. The _{&> /dev/null} will prevent your Flask app from outputting text to the terminal.

Index database management script

You will also write a database script for the index server.

Example create.

$ ./bin/indexdb create

+ mkdir -p search/search/var/

+ sqlite3 search/search/var/wikipedia.sqlite3 < search/search/sql/wikipedia.sql Example create when database already exists.

$ ./bin/indexdb create

Error: database already exists Example destroy.

$ ./bin/indexdb destroy

+ rm -f search/search/var/wikipedia.sqlite3 Example reset.

$ ./bin/indexdb reset

+ rm -f search/search/var/wikipedia.sqlite3

+ mkdir -p search/search/var/

+ sqlite3 search/search/var/wikipedia.sqlite3 < search/search/sql/wikipedia.sql

Deploy to AWS

You should have already created an AWS account and instance (instructions ). Resume the AWS Tutorial – Deploy a web app .

Once you have installed your front-end, go to your instance. In the “Description”, click on the name of your security group.

In this case, the security group is “launch-wizard 2”.

On this page, click on “Actions”, then “Edit inbound rules”.

Then, add a “Custom TCP Protocol” for port 8001.

Remember that you need to run two servers, the index server and search server.

$ gunicorn -b localhost:8000 -D search:app

$ gunicorn -b 0.0.0.0:8001 -D index:app

After you have deployed your site, download the search page along with a log. Also download the index server’s response with a log. Do this from your local machine.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ curl -v <Public DNS (IPv4)>/ > deployed_search.html 2> deployed_search.log

$ curl -v “<Public DNS (IPv4)>:8001/?q=world%20flags&w=0.5” > deployed_index.json 2> deplo

Verify that the output in deployed_search.log and deployed_index.log doesn’t include errors like “Couldn’t connect to server”.

Testing

Run published unit tests.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine $ pytest -v

Code style

As in previous projects, all your Python code for the Flask servers is expected to be _pycodestyle, pydocstyle , and _pylint compliant. You don’t need to lint the mapper/reducer executables

(although it may be a good idea to do so still!)

Also as usual, you may not use any external dependencies aside from what is provided in the setup.py’s.

Submitting and grading

One team member should register your group on the autograder using the create new invitation feature.

Submit a tarball to the autograder, which is linked from https://eecs485.org . Include the _–disable-copyfile flag only on macOS.

$ tar \

-cjvf submit.tar.bz2 \

–disable-copyfile \

–exclude ‘*input.txt’ \

–exclude ‘*/input/*’ \

–exclude ‘*/input_multi/*’ \

–exclude ‘*output*’ \

–exclude ‘*__pycache__*’ \

–exclude ‘*node_modules*’ \

–exclude ‘*stopwords.txt’ \

–exclude ‘*.out’ \

–exclude ‘*.sql’ \

–exclude ‘*.sqlite3’ \

–exclude ‘*.jar’ \

–exclude ‘*.egg-info’ \

–exclude ‘*var*’ \

–exclude ‘*tmp*’ \

–exclude ‘hadoop/inverted_index/inverted_index.txt’ \ bin \ hadoop \ hadoop/inverted_index/input_split.py \ index \ search \ deployed_index.json \ deployed_index.log \ deployed_search.html \ deployed_search.log

Note that the submission command uses the _-j flag: This tells _tar to use the _bzip2compression algorithm. _bzip2 is slower than _gzip (which we used in past projects), but generally results in better compression and smaller file sizes. Make sure your submission tarball is around 10MB.

$ du -h submit.tar.bz2

9.6M submit.tar.gz

Rubric

This is an approximate rubric.

Deliverable	Value
Public tests	50%
Hidden unit tests run after the deadline	40%
Public style tests	10%

FAQ

API Request URLs

When you are attempting to send a request to search interface API from your ReactJS search interface frontend, do not include the _host:port in the url you are attempting to fetch.

If you want to send a request to your index server API from your search interface API, you will want to include the following _host:port pair in the url you are attempting to send a request to: http://localhost:8001

Appendix A: TFIDF Calculation Walkthrough

You have a query, let’s make a query vector for that

You have a document, let’s make a document vector for that

Let’s define a norm factor for our document

Let’s define a norm factor for our query

Note: that the IDFs for both norm factors come from the inverted index. The norm factor for the document can be lifted from your inverted index (but you have sqrt it), where as the query norm factor needs to be computed on the fly.

If you have defined your query and document vectors this way, then the final score will be

Example

Let’s walk through an example for the query _{?w=0.3&q=michigan%20wolverine}. The query is michigan wolverine . The weight of PageRank is _0.3 and the weight tf-idf is _0.7.

Read inverted index and PageRank

Search the inverted index for _michigan.

$ egrep ‘^michigan[[:space:]]’ ./index/index/inverted_index.txt michigan 1.509960674077735 10883747 14 8852.693606436345 … $ egrep ‘^wolverine[[:space:]]’ ./index/index/inverted_index.txt wolverine 2.669184007846121 5791422 1 66539.45074152622 …

We see that the idf for _michigan is _1.509 and the idf for _wolverine is _2.669. When we look at the docids for _michigan and _wolvinere, we see that there is one document (docid ₈₆₈₆₅₇) that contains both terms.

$ egrep ‘^(michigan|wolverine)[[:space:]]’ ./index/index/inverted_index.txt | grep 868657 wolverine … 868657 1 126536.87039603827 … michigan … 868657 46 126536.87039603827 …

In document 868657 , michigan appears 46 times, and wolverine occurs 1 time.

We’ll also read the value of PageRank for document ₈₆₈₆₅₇ from _pagerank.out.

$ egrep 868657 ./index/index/pagerank.out

868657,5.41822e-06

Query vector and document vector

The Query vector has one position for each term ( _{[“michigan”, “wolverine”]}). Each position is a number calculated as <term frequency in query> * <idf> .

q = [1*1.509, 1*2.669] = [1.509, 2.669]

The Document vector similarly has one position for each term ( _{[“michigan”, “wolverine”]}). Each position is a number calculated as <term frequency in document> * idf .

d = [46*1.509, 1*2.669]

= [69.458, 2.669] tf-idf

Compute the dot product of _q and _d. q * d = 112.003

Compute the normalization factor for the query. The normalization factor is the length of the query vector, which is the sum-of-squares of the query vector.

q = [1.509, 2.669] norm_q = sqrt(1.509^2 + 2.669^2) = 3.066

Compute the normalization factor for the document, which is the square root of the normalization factor read from the inverted Index.

norm_d_squared = 126536.870 norm_d = sqrt(126536.870)

= 355.7128124 Compute tf-idf.

tfidf = q * d / (norm_q * norm_d)

= 112.003 / (3.066 * 355.712)

= 0.102

Weighted score

Finally, we combine tf-idf and PageRank using the weight.

weighted_score = w * PageRank + (1-w) (tdidf)

= (0.3 * 5.418e-06) + (1-0.3) * (0.102)

= 0.071

Appendix B: Installing Real Hadoop

Although not needed for this project, below are the steps to install hadoop on OSX and WSL if you want to play around with it.

If you are on OSX, install Hadoop with Homebrew

$ brew cask install java

$ brew install hadoop

If you are on Ubuntu Linux, Ubuntu Linux VM or Windows 10 WSL, you will have to install Hadoop manually. OSX users skip the following. Resume below, where specified Install Java

$ sudo -s # Become root

$ sudo apt-get update

$ sudo apt-get install default-jdk # Install Java $ java -version # Check Java version openjdk version “1.8.0_151”

OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)

OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

Download Hadoop and unpack.

$ cd /opt

$ wget http://apache.osuosl.org/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz

$ wget https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.

$ shasum -a 256 hadoop-2.8.5.tar.gz e8bf9a53337b1dca3b152b0a5b5e277dc734e76520543e525c301a050bb27eae hadoop-2.8.5.tar.gz

$ grep SHA256 hadoop-2.8.5.tar.gz.mds

SHA256 = E8BF9A53 337B1DCA 3B152B0A 5B5E277D C734E765 20543E52 5C301A05 0BB27EAE

$ tar -xvzf hadoop-2.8.5.tar.gz

$ rm hadoop-2.8.5.tar.gz hadoop-2.8.5.tar.gz.mds

Locate the path to your Java interpreter.

$ which java | xargs readlink -f | sed ‘s:bin/java::’

/usr/lib/jvm/java-8-openjdk-amd64/jre/

Edit /opt/hadoop-2.8.5/etc/hadoop/hadoop-env.sh to change the JAVA_HOME .

#export JAVA_HOME=${JAVA_HOME} # remove/comment export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

Write a launch script, /usr/local/bin/hadoop

#!/bin/bash

exec “/opt/hadoop-2.8.5/bin/hadoop” “$@”

Make the script executable and check that it’s working.

$ chmod +x /usr/local/bin/hadoop

$ which hadoop

/usr/local/bin/hadoop

$ hadoop -h

Usage: hadoop [–config confdir] [COMMAND | CLASSNAME]

…

$ exit # drop root privileges

OSX Java Home Error

Note to OSX users: If you run Hadoop and Hadoop is unable to find JAVA_HOME, please refer to the fix below.

Find _{hadoop-env.sh}. You location might be different.

$ find / -name hadoop-env.sh

/usr/local/Cellar/hadoop/2.8.1/libexec/etc/hadoop/hadoop-env.sh

Uncomment and edit the export JAVA_HOME line in hadoop-env.sh :

# The java implementation to use.

export JAVA_HOME=“$(/usr/libexec/java_home)”

[SOLVED] EECS 485 Project 5: Wikipedia Search Engine

Introduction

Setup

AWS account and instance

Project folder

Version control

Python virtual environment

Starter files

Install Hadoop

Install script

“Hello World” with Hadoop

Inverted Index with MapReduce

Splitter script

Hadoop pipeline script

Job 0

MapReduce pipeline

MapReduce specifications

Common problems

Testing

Index server

Directory structure

PageRank integration

Returning hits

Index Server Endpoint

Running the index server

Common problems

Testing

Search interface

Directory structure

Database

GUI

Running the search server

Shell scripts to launch both servers

Index database management script

Deploy to AWS

Testing

Code style

Submitting and grading

Rubric

FAQ

API Request URLs

Appendix A: TFIDF Calculation Walkthrough

Example

Read inverted index and PageRank

Query vector and document vector

Weighted score

Appendix B: Installing Real Hadoop

OSX Java Home Error

Related products

SOLVED:CMIS 320 Project 2

Project 4 Introduction – the SeaPort Project series

EECS485 P2: Server-side Dynamic Pages

Related in this category

More in this category

EECS485 P2: Server-side Dynamic Pages

EECS485 P1: Templated Static Site Generator

EECS 485 Project 4: Map Reduce

EECS485 P3: Client-side Dynamic Pages

Want to See Past Work First?