Name: COMP479 Project 3-Implement SPIMI Solved
SKU: 75906
Price: 25.00 USD
Availability: InStock

Description

5/5 - (1 vote)

Objectives: Implement SPIMI. Implement ranking of returns. Test and analyze your system, discuss how your design decisions influence the results.

Data: Use Reuters21578 for testing and if needed, continue your text scrubbing skills for the final project. Note that the text preprocessing should be secondary in this project.

Description: this project consists of two subprojects that build on each other. Each subproject should be very simple to execute, discuss with your peers and during Lab Q&A if there are any hurdles.

Subproject I: Implement SPIMI using your Project 2 Subproject 1 system. In particular:

(Project 2 Subproject I item 1:) develop a module that while there are still more documents to be processed, accepts a document as a list of tokens and outputs term-documentID pairs. Instead of appending new termdocID pairings (since you are not to compress the index, a matched token and docID, not a pair data structure. Omit punctuation.) to a global list, do:
SPIMI:
- in the following, replace K with 500 for submitting your first and last block for grading
- replace K with 10000 for comparison with naive indexer

for K term-docIDs, create a new hash key for the term if necessary and/or append the docID to the postings list associated with the hashed term if it is not already listed in the postings list or if it is already listed, augment a term counter to calculate tf.

when the block is full (representing K term-docIDs), collect the index, sort, and ”store” in consecutively labelled BlockX
disk block merging: when all term-docID pairs of your input are stored in block-sized indices, merge the miniindeces into a global index. You can hold the merged index in memory
compare timing with the naive indexer (for 10000 term-docID pairings).
compile an inverted index for Reuters21578 without using any compression techniquesdocID hint: Use the NEWID values from the Reuters corpus to make your retrieval comparable.

Subproject II: Convert your indexer into a probabilistic search engine

using the assumptions made in Chapter 11 about independence of terms and documents etc. and
using the BM25 formula (11.32),
rank the documents your SPIMI implementation returns and
for a given query, return a ranked list of results.

Notes: experiment with different values for the parameters k₁and b as described in the textbook.

Test queries:

design four test queries:
- a single keyword query, to comapre with Project 2
- a query consisting of several keywords for BM25
- a multiple keyword query returning documents containing all the keywords (AND), for unranked Boolean retrieval
- a multiple keywords query returning documents containing at least one keyword (OR), where documentsare ordered by how many keywords they contain), for unranked Boolean retrieval
run your four test queries to showcase your code and comment on the results in your report Deliverables:
individual project
well documented code
well documented sample runs for your queries on the information needs:
- Democrats’ welfare and healthcare reform policies
- Drug company bankruptcies
- George Bush
any additional testing or aborted design ideas that show off particular aspects of your project

a project report that summarizes your approach, illustrates your design and discusses what you have learnedfrom the project. Note that a summary and commentary on your sample runs has to be included in the report

Project3-yvz4cg.zip

COMP479 Project 3-Implement SPIMI Solved

If Helpful Share:

Description

Related products

COMP479 Project 2-Naive indexer Solved

COMP479 Project 1-Text preprocessing with NLTK, proofreading results Solved