ELEN4720 Problem 4 Solved

30.00 $

Category:

Description

5/5 - (1 vote)

Problem 1 (Markov chains)

In this problem, you will rank 769 college football teams based on the scores of every game in the 2019 season. The data provided in CFB2019scores.csv contains the result of one game on each line in the format

Team A index, Team A points, Team B index, Team B points.

If Team A has more points than Team B, then Team A wins, and vice versa. The index of a team refers to the row of “TeamNames.txt” where that team’s name can be found.

Construct a 769×769 random walk matrix M on the college football teams. First construct the unnormalized matrix Mc with values initialized to zeros. For one particular game, let i be the index of Team A and j the index of Team B. Then update

Mcii Mcii + 1{Team A wins} + pointspointsi+pointsi               j ,

pointsj

c                                             {Team B wins} + pointsi+pointsj ,

Mcij Mcij + 1{Team B wins} + pointspointsi+pointsj        j , {Team A wins} + pointspointsi+pointsi      j .

After processing all games, let M be the matrix formed by normalizing the rows of Mc so they sum to one. Let wt be the 1×769 state vector at step t. Set w0 to the uniform distribution. Therefore, wt is the marginal distribution on each state after t steps given that the starting state is chosen uniformly at random.

  1. Use wt to rank the teams by sorting in decreasing value according to this vector. List the top 25 team names (see accompanying file) and their corresponding values in wt for t = 10,100,1000,10000.
  2. We saw that wis related to the first eigenvector of MT. That is, we can find wby getting the first eigenvector and eigenvalue of MT and post-processing:

This is because by convention. Also, we observe that λ1 = 1 for this specific matrix. Plot kwt wk1 as a function of t for t = 1,…,10000.

Problem 2 (Nonnegative matrix factorization) – 40 points

In this problem you will factorize an N × M matrix X into a rank-K approximation WH, where W is N × K, H is K × M and all values in the matrices are nonnegative. Each value in W and H can be initialized randomly to a positive number, e.g., from a Uniform(1,2) distribution.

The data to be used for this problem consists of 8447 documents from The New York Times. (See below for how to process the data.) The vocabulary size is 3012 words. You will need to use this data to construct the matrix X, where Xij is the number of times word i appears in document j. Therefore, X is 3012×8447 and most values in X will equal zero.

  1. Implement and run the NMF algorithm on this data using the divergence penalty. Set the rank to 25 and run for 100 iterations. This corresponds to learning 25 topics. Plot the objective as a function of iteration.
  2. After running the algorithm, normalize the columns of W so they sum to one. For each column of W, list the 10 words having the largest weight and show the weight. The ith row of W corresponds to the ith word in the “dictionary” provided with the data. Organize these lists in a 5×5

Comments about Problem 2:

  • When dividing, you may get 0/0 = NaN. This will cause the algorithm to return all NaN values. You can add a very small number (e.g., 10−16) to the denominator to avoid this.
  • Each row in nytdata.txt corresponds to a single document. It gives the index of words appearing in that document and the number of times they appear. It uses the format “idx:cnt” with commas separating each unique word in the document. Any index that doesn’t appear in a row has a count of zero for that word. The vocabulary word corresponding to each index is given in the corresponding row of nytvocab.dat.
  • HW4-s6tyya.zip