CSE508 Assignment 4 Solved


Download the 20 Newsgroups dataset and use only the documents from these five folders: comp.graphics, sci.med, talk.politics.misc, rec.sport.hockey, and sci.space.

  1. Implement a cosine similarity measure with tf-idf weighting. Your index should store everything needed to compute the cosine similarity, such as the tf and idf values.
  2. Implement the Rocchio algorithm (with query refinement).
    1. Display the top k documents (k should be at least 100) for the initial query.
    2. To provide feedback, mark p% of the k documents as relevant. The marked documents must come from the folder that is assumed to be the actual relevant set (ground truth).

For example, if the ground-truth relevant documents are those in the folder sci.med and k = 100, then for p = 10% you mark 10% of 100, i.e. 10 sci.med documents, as relevant, taking them from the top of the retrieved list of 100 documents.
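The marking rule in this example can be sketched as a small function. This is only an illustration under assumptions: the function name `mark_relevant`, the id format `folder/number`, and the toy ranked list are hypothetical and not part of the assignment.

```python
def mark_relevant(ranked_ids, ground_truth, p):
    """Pick the ids to mark as relevant from a ranked list of k documents.

    ranked_ids   -- the k retrieved document ids, best first
    ground_truth -- set of ids assumed to be relevant (e.g. one folder)
    p            -- percentage of k to mark, e.g. 10 for 10%
    """
    quota = len(ranked_ids) * p // 100  # 10% of 100 -> 10 documents
    marked = []
    for doc_id in ranked_ids:  # scan from the top of the list
        if len(marked) >= quota:
            break
        if doc_id in ground_truth:
            marked.append(doc_id)
    return marked


# Toy ranked list of k = 100 hypothetical ids; every third one is from sci.med.
ranked = [f"sci.med/{i}" if i % 3 == 0 else f"sci.space/{i}" for i in range(100)]
ground_truth = {d for d in ranked if d.startswith("sci.med/")}
marked = mark_relevant(ranked, ground_truth, p=10)
print(len(marked), marked[:3])
```

If fewer than p% of the top k documents belong to the ground-truth folder, this sketch simply marks all of them; whether that is acceptable is a design choice the assignment text leaves open.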

    3. Show the revised top k results after performing relevance feedback. Mark with * the documents that were judged relevant during the relevance feedback phase.
    4. Use 𝛼 = 1, 𝛽 = 0.75, and 𝛾 = 0.25 as the parameters for Rocchio's algorithm. In each iteration of relevance feedback, show the top k documents and provide feedback in the same way as described above.
    5. Consider the following queries, with all documents inside the given folder as the relevant set (ground truth).

Query 1: Pretty good opinions on biochemistry machines

Relevant set 1: Documents inside folder sci.med

Query 2: Scientific tools for preserving rights and body

Relevant set 2: Documents inside folder talk.politics.misc

Query 3: Frequently asked questions on State-of-the-art visualisation tools

Relevant set 3: Documents inside folder sci.med
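For orientation, the tf-idf cosine ranking asked for in step 1 could look roughly like the sketch below, run on Query 1. It rests on several assumptions that the assignment does not fix: raw term counts as tf, idf = log(N / df), a simple lowercase alphanumeric tokenizer, and a three-document toy corpus with hypothetical ids instead of the real dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """docs: dict doc_id -> text. Returns (tf, idf) as plain dictionaries."""
    tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    idf = {term: math.log(len(docs) / d) for term, d in df.items()}
    return tf, idf


def tfidf_vector(counts, idf):
    # Terms missing from the index (unseen query words) are dropped.
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, tf, idf, k):
    q = tfidf_vector(Counter(tokenize(query)), idf)
    scores = {d: cosine(q, tfidf_vector(c, idf)) for d, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy corpus standing in for the newsgroup folders (hypothetical texts).
docs = {
    "sci.med/a": "biochemistry machines analyse blood samples",
    "sci.space/b": "the shuttle launch was delayed",
    "rec.sport.hockey/c": "good hockey game last night",
}
tf, idf = build_index(docs)
top = rank("Pretty good opinions on biochemistry machines", tf, idf, k=2)
print(top)
```

Keeping `tf` and `idf` as separate dictionaries mirrors the requirement that the index hold the values needed for the similarity computation; other tf or idf variants (log-scaled tf, smoothed idf) would slot into `tfidf_vector` and `build_index` without changing the rest.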

 

Report the following:

  1. A PR curve plot for each of the queries. Plot the PR curve after each relevance feedback iteration (run around 3 to 4 feedback iterations).
  2. The MAP for the query set above after each relevance feedback iteration.
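The two reported quantities can be computed from a ranked list and a relevant set along the following lines. This is a sketch under assumptions: the function names are hypothetical, and average precision is normalized here by the size of the full relevant set, which is one common convention; normalizing by the number of relevant documents actually retrieved in the top k is another.

```python
def precision_recall_points(ranked_ids, relevant):
    """Return one (recall, precision) pair per rank position."""
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return points


def average_precision(ranked_ids, relevant):
    """Mean of the precision values at the ranks where a relevant doc appears."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two toy queries with hypothetical document ids.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d2", "d1"], {"d1"}),
]
points = precision_recall_points(*runs[0])
map_score = mean_average_precision(runs)
print(points, map_score)
```

The `(recall, precision)` pairs can be passed directly to a plotting library, with one curve per feedback iteration.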

 

Note: These queries are for reference only. Keep your code general enough to accept different queries and to mark any documents as relevant, as we choose.
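One way to keep the feedback step general is to write the Rocchio update as a function of the query vector and arbitrary lists of relevant and non-relevant document vectors. The sketch below uses the parameter values given for Rocchio's algorithm as defaults. It assumes sparse vectors stored as `term -> weight` dictionaries and drops negative weights from the result, a common convention that the assignment does not prescribe. Which documents count as non-relevant (for example, the unmarked ones in the top k) is likewise left to the caller.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Return the modified query vector.

    q_new = alpha * q
            + beta  * centroid(relevant)
            - gamma * centroid(non_relevant)

    All vectors are dicts mapping term -> weight.
    """
    new_q = {t: alpha * w for t, w in query.items()}
    for vectors, coeff in ((relevant, beta), (non_relevant, -gamma)):
        if not vectors:
            continue  # an empty set contributes nothing
        for vec in vectors:
            for t, w in vec.items():
                new_q[t] = new_q.get(t, 0.0) + coeff * w / len(vectors)
    # Assumption: weights pushed to zero or below are dropped from the query.
    return {t: w for t, w in new_q.items() if w > 0}


# Toy vectors with hypothetical terms and weights.
q0 = {"biochemistry": 1.0}
rel = [{"biochemistry": 1.0, "enzyme": 2.0}, {"enzyme": 2.0}]
non_rel = [{"hockey": 4.0}]
q1 = rocchio(q0, rel, non_rel)
print(q1)
```

Calling `rocchio` again with `q1` and the next round of marked documents gives the following iteration, so the same function serves any query and any choice of relevant documents.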

  3. In the report, show how the query vector changes (for this particular set of queries) after each iteration of the Rocchio algorithm. Justify or explain the results after each iteration. Show a 2D t-SNE plot of the vectors to demonstrate the difference. You can use scikit-learn's built-in functions to make this plot.
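A possible starting point for the 2D projection is scikit-learn's `TSNE` class. The sketch below assumes the document vectors and the query vector from each iteration have already been converted to dense rows over a shared vocabulary; the random matrices are placeholders for real tf-idf vectors, and the helper name `tsne_2d` is hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE


def tsne_2d(vectors, random_state=0):
    """Project row vectors to 2D with t-SNE and return an (n, 2) array."""
    X = np.asarray(vectors, dtype=float)
    # scikit-learn requires perplexity to be smaller than the number of samples.
    perplexity = min(30.0, max(1.0, (len(X) - 1) / 3))
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        init="random",
        random_state=random_state,
    )
    return tsne.fit_transform(X)


# Placeholder data: 12 document vectors and 4 query vectors
# (initial query plus 3 feedback iterations), 20 terms each.
rng = np.random.default_rng(0)
doc_vectors = rng.random((12, 20))
query_vectors = rng.random((4, 20))

# Embed documents and queries together so they share one 2D space.
coords = tsne_2d(np.vstack([doc_vectors, query_vectors]))
doc_xy, query_xy = coords[:12], coords[12:]
print(coords.shape)
```

`doc_xy` and `query_xy` can then be drawn as two scatter layers, labelling each query point with its iteration number to show how the query moves. Because t-SNE is fit jointly on all points, the queries from every iteration should be embedded in a single call rather than one call per iteration.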

 

 
