## Description

**Introduction:**

Recommendation systems and “suggestions” have become an essential part of a lot of web applications. Examples of these applications include recommended items in online stores, pages or people in social networking sites, movies or songs in streaming websites, etc. What has enabled these systems to expand widely to so many different applications is the rapidly growing collective user data and the use of Collaborative Filtering. “Collaborative Filtering” refers to methods of predicting a user’s opinion on an entity using other users’ opinion. These methods are based on the notion that there exist other users with similar behavior as the target user and finding them and observing their actions will reveal information that we could use to predict the target user’s behavior.

We can view this as a matrix problem by putting all users on rows and all items on columns of the matrix. Then the entry (𝑖, 𝑗) of the matrix will be the rating user 𝑖 has given to item 𝑗. The problem will now become estimating the empty entries of the matrix to predict what items a user will most probably like other than the ones they have rated.

In this project we are going to use a popular method for collaborative filtering called “Alternating Least Squares” to make a recommendation system on the MovieLens^{[1]} dataset that contains movies and user ratings.

- Download the dataset with 100k ratings and create a matrix named R containing user ratings with users on rows and movies on columns. Note that a lot of the values in the matrix will be missing and only a small fraction of the matrix entries will contain known data points. We want to run a matrix factorization job to get matrices U and V such that 𝑅
_{𝑚×𝑛 }≈ 𝑈_{𝑚×𝑘}𝑉_{𝑘×𝑛}in known data points. For this purpose, we calculate matrices 𝑈 and 𝑉 such that the squared error is minimized:

2 min ∑ (𝑟_{𝑖𝑗 }− (𝑈𝑉)_{𝑖𝑗})

𝑘𝑛𝑜𝑤𝑛 𝑖,𝑗

This can be implemented by putting 0 s where the data is missing and creating a weight matrix to calculate the squared error. Assume that the weight matrix 𝑊_{𝑚×𝑛} contains 1 in entries where we have known data points and 0 in entries where the data is missing. Then if R contains 0 in missing entries, we can formulate the above problem as:

You are free to choose your programming language; and there are implementations of the least square factorization in related packages for various programming languages; e.g. wnmfrule function in the Matrix Factorization Toolbox^{[2]} in MATLAB. Choose 𝑘 equal to 10, 50, and 100. What is the total least squared error in each case?

- Now test the recommendation system you designed by a “10-fold Cross-validation”; that is randomly split the data and take 90% of the known ratings for training and the other 10% unknown for testing. Note that in our model this can be achieved by putting zero weight on the data we want to make unknown. You shouldn’t remove rows or columns of the matrix entirely because that would jeopardize the idea behind this method. One way to avoid this problem is taking random points for testing, so that the data points we remove are dispersed throughout the matrix. However, we want the 10% testing data in each test to not overlap with the other 9 tests.

Assume that we assign an index 1 to N to the N known data points. Create a random ordering of the numbers 1 to N and split this list of numbers into 10 parts. Now each time take one of these 10 parts for testing and train on the rest.

Train your system as in part 1 and find the average absolute error over testing data (prediction error

|𝑅_{𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑒𝑑 }− 𝑅_{𝑎𝑐𝑡𝑢𝑎𝑙}|). What is the average error? (Average among 10 the tests for each system, and for each test among all entries)

What are the highest and lowest values of average error among the 10 tests? (Average over the testing entries in each test and find which of the 10 tests has highest or lowest average error)

- Now assume that if a user has rated a movie 3 or lower we conclude they didn’t like the movie, and if a user has rated a movie 4 or higher they have liked it. Based on the estimated values in the R matrix in parts 1 and 2, we can use a threshold on the estimated entry to predict if the user will like the movie or not. In your tests with the method in part 2, find out the number of entries where your system predicted the user would like the movie. Out of these, for what percentage did the user actually like the movie? (
*Precision*) Now find the entries in the testing data where the user did actually like the movie. Out of these entries, for what percentage did your algorithm predict the user would like the movie? (R*ecall*) Plot the ROC curve (*precision*over*recall)*for different values of the threshold for the designed recommendation system. - Try changing the weight matrix and use rating values as weights, instead of 1, for the known data points. Turn R into a 0-1 matrix where 𝑟
_{𝑖𝑗 }= 1 for known ratings and 𝑟_{𝑖𝑗 }= 0 for unknown entries. Run matrix factorization again with the same values of 𝑘 as in part 1. Is there a change in the total squared error?

Now, in order to avoid singular solutions, we modify the cost function to add a regularization term 𝜆:

This will make the matrix calculations more stable. You can find out how to implement the regularized version of ALS in a set of slides found in the reference link^{[3]} or in the reference paper [1]. Run your code for 𝜆 = 0.01, 0.1,1 and compare the results with the previous parts. Following the same evaluation method, plot the ROC curve.

- Create a recommendation system in the following manner:

Convert R to a 0-1 matrix where the entries are 1 for available data points and 0 for missing data points. Use ratings as weights and run your algorithm to find the top 𝐿 movies recommended to each user. Using the cross-validation method described earlier, find out the average precision of your algorithm for 𝐿 = 5. While calculating precision, ignore the picked entries for which you don’t have the actual rating.

Find your algorithm’s hit rate (what *fraction* of the test movies liked by the users are suggested by your system) and false-alarm rate (what *fraction* of the test movies not actually liked by the user, are suggested by your algorithm). Start from 𝐿 = 1 and while increasing 𝐿 measure hit rate and false-alarm rate for different values of 𝐿. Plot these values as points in a two-dimensional space with hit rate on the 𝑦 axis and false-alarm rate on the 𝑥 axis. Note that you do not need to run the factorization every time you change L, only the number of results you pick for recommendation will be different.

[1] Li, Yifeng, and Alioune Ngom. “The non-negative matrix factorization toolbox for biological data mining.” Source code for biology and medicine 8.1 (2013): 1.