1 Written: Understanding word2vec (23 points)

Name: CS224n Assignment #2: word2vec Solved

Let’s have a quick refresher on the word2vec algorithm. The key insight behind word2vec is that ‘a word is known by the company it keeps’. Concretely, suppose we have a ‘center’ word c and a contextual window surrounding c. We shall refer to words that lie in this contextual window as ‘outside words’. For example, in Figure 1 we see that the center word c is ‘banking’. Since the context window size is 2, the outside words are ‘turning’, ‘into’, ‘crises’, and ‘as’.

The goal of the skip-gram word2vec algorithm is to accurately learn the probability distribution P(O|C). Given a specific word o and a specific word c, we want to calculate P(O = o|C = c), which is the probability that word o is an ‘outside’ word for c, i.e., the probability that o falls within the contextual window of c.

Figure 1: The word2vec skip-gram prediction model with window size 2

In word2vec, the conditional probability distribution is given by taking vector dot-products and applying the softmax function:

$$P(O = o \mid C = c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in \text{Vocab}} \exp(u_w^\top v_c)} \tag{1}$$

Here, u_o is the ‘outside’ vector representing outside word o, and v_c is the ‘center’ vector representing center word c. To contain these parameters, we have two matrices, U and V. The columns of U are all the ‘outside’ vectors u_w. The columns of V are all of the ‘center’ vectors v_w. Both U and V contain a vector for every w ∈ Vocabulary.[1]
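
As a rough illustration of the softmax in Equation (1), the sketch below computes P(O | C = c) for every word in a toy vocabulary with NumPy. The function name `softmax_probs`, the sizes d and n, and the random matrices are assumptions made for this example and are not part of the assignment code. U and V store one vector per column, as described above.

```python
import numpy as np

def softmax_probs(U, v_c):
    """Return P(O = w | C = c) for every word w, following Equation (1).

    U   : (d, n) matrix whose columns are the 'outside' vectors u_w
    v_c : (d,) 'center' vector of the center word c
    """
    scores = U.T @ v_c               # u_w^T v_c for each word w
    scores = scores - scores.max()   # constant shift for stability; probabilities unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Hypothetical toy setup: 5 words, 3-dimensional vectors.
rng = np.random.default_rng(0)
d, n = 3, 5
U = rng.normal(size=(d, n))
V = rng.normal(size=(d, n))

c = 2                                # index of the center word
y_hat = softmax_probs(U, V[:, c])
print(y_hat, y_hat.sum())
```

Subtracting the maximum score before exponentiating is a common stability trick; it cancels in the ratio, so the result is the same distribution.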

Recall from lectures that, for a single pair of words c and o, the loss is given by:

$$J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c). \tag{2}$$

Another way to view this loss is as the cross-entropy[2] between the true distribution y and the predicted distribution ŷ. Here, both y and ŷ are vectors with length equal to the number of words in the vocabulary. Furthermore, the k-th entry in these vectors indicates the conditional probability of the k-th word being an ‘outside word’ for the given c. The true empirical distribution y is a one-hot vector with a 1 for the true outside word o, and 0 everywhere else. The predicted distribution ŷ is the probability distribution P(O|C = c) given by our model in equation (1).

(a) (3 points) Show that the naive-softmax loss given in Equation (2) is the same as the cross-entropy loss between y and ŷ; i.e., show that

CS 224n Assignment #2: word2vec (43 Points)

$$-\sum_{w \in \text{Vocab}} y_w \log(\hat{y}_w) = -\log(\hat{y}_o). \tag{3}$$

Your answer should be one line.
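
A quick numerical check can make the claim concrete before writing the one-line argument. The sketch below uses a made-up predicted distribution and a one-hot true distribution; the helper `cross_entropy` and all numbers are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Compute -sum_w y_w * log(y_hat_w), skipping entries where y_w = 0."""
    mask = y > 0
    return -np.sum(y[mask] * np.log(y_hat[mask]))

# Made-up predicted distribution over a 4-word vocabulary.
y_hat = np.array([0.1, 0.2, 0.6, 0.1])

# One-hot true distribution with the true outside word at index o.
o = 2
y = np.zeros_like(y_hat)
y[o] = 1.0

ce = cross_entropy(y, y_hat)
nll = -np.log(y_hat[o])
print(ce, nll)   # the two values agree
```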

(b) (5 points) Compute the partial derivative of J_{naive-softmax}(v_c, o, U) with respect to v_c. Please write your answer in terms of y, ŷ, and U.
(c) (5 points) Compute the partial derivatives of J_{naive-softmax}(v_c, o, U) with respect to each of the ‘outside’ word vectors, u_w’s. There will be two cases: when w = o, the true ‘outside’ word vector, and w ≠ o, for all other words. Please write your answer in terms of y, ŷ, and v_c.
(d) (3 points) The sigmoid function is given by Equation (4):

$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1} \tag{4}$$

Please compute the derivative of σ(x) with respect to x, where x is a vector.
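
The two forms in Equation (4) are algebraically equal, but each is numerically safer on a different side of zero. The sketch below is one possible elementwise implementation that picks the safer form per entry, together with a central finite-difference estimate that can be compared against a hand-derived derivative. The function body, the test points, and the step size h are assumptions for illustration, not the assignment's reference code.

```python
import numpy as np

def sigmoid(x):
    """Elementwise sigmoid of a NumPy array, using both forms of Equation (4)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, 1 / (1 + e^{-x}) never exponentiates a large positive number.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, e^x / (e^x + 1) keeps the exponent negative.
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (exp_x + 1.0)
    return out

x = np.array([-2.0, 0.0, 3.0])
s = sigmoid(x)

# Central finite difference: a numerical estimate of d sigma / dx at each entry.
h = 1e-5
numeric_derivative = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(s)
print(numeric_derivative)
```

Evaluating a closed-form answer at the same points and comparing it with `numeric_derivative` is a cheap sanity check.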

(e) (4 points) Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that K negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as w₁, w₂, …, w_K and their outside vectors as u₁, …, u_K. Note that o ∉ {w₁, …, w_K}. For a center word c and an outside word o, the negative sampling loss function is given by:

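$$J_{\text{neg-sample}}(v_c, o, U) = -\log(\sigma(u_o^\top v_c)) - \sum_{k=1}^{K} \log(\sigma(-u_k^\top v_c)) \tag{5}$$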

for a sample w₁,…w_K, where σ(·) is the sigmoid function.³

Please repeat parts (b) and (c), computing the partial derivatives of J_{neg-sample} with respect to v_c, with respect to u_o, and with respect to a negative sample u_k. Please write your answers in terms of the vectors u_o, v_c, and u_k, where k ∈ [1,K]. After you’ve done this, describe in one sentence why this loss function is much more efficient to compute than the naive-softmax loss. Note, you should be able to use your solution to part (d) to help compute the necessary gradients here.
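
To make the negative sampling loss concrete, here is a minimal sketch that evaluates it in its usual word2vec form, −log σ(u_o^T v_c) − Σ_k log σ(−u_k^T v_c). The function name `neg_sampling_loss`, the toy sizes, and the hard-coded negative sample indices are assumptions for this example; the assignment's own function signatures may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, o, neg_indices, U):
    """-log sigma(u_o^T v_c) - sum_k log sigma(-u_k^T v_c).

    v_c         : (d,) center vector
    o           : index of the true outside word
    neg_indices : indices of the K negative samples (none equal to o)
    U           : (d, n) matrix whose columns are the outside vectors
    """
    u_o = U[:, o]
    U_neg = U[:, neg_indices]                  # (d, K): only K columns are read
    pos_term = -np.log(sigmoid(u_o @ v_c))
    neg_term = -np.sum(np.log(sigmoid(-(U_neg.T @ v_c))))
    return pos_term + neg_term

# Hypothetical toy setup: 10 words, 4-dimensional vectors, K = 3.
rng = np.random.default_rng(1)
d, n = 4, 10
U = rng.normal(size=(d, n))
v_c = rng.normal(size=d)
o = 3
neg = [1, 5, 7]

loss = neg_sampling_loss(v_c, o, neg, U)
print(loss)
```

Note that the function reads only K + 1 columns of U, whereas the softmax in Equation (1) needs a dot product with every column.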

(f) (3 points) Suppose the center word is c = w_t and the context window is [w_{t−m}, …, w_{t−1}, w_t, w_{t+1}, …, w_{t+m}], where m is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:

$$J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U) = \sum_{\substack{-m \le j \le m \\ j \ne 0}} J(v_c, w_{t+j}, U) \tag{6}$$

Here, J(v_c, w_{t+j}, U) represents an arbitrary loss term for the center word c = w_t and outside word w_{t+j}. J(v_c, w_{t+j}, U) could be J_{naive-softmax}(v_c, w_{t+j}, U) or J_{neg-sample}(v_c, w_{t+j}, U), depending on your implementation.

Write down three partial derivatives:

(i) ∂J_{skip-gram}(v_c, w_{t−m}, …, w_{t+m}, U)/∂U

(ii) ∂J_{skip-gram}(v_c, w_{t−m}, …, w_{t+m}, U)/∂v_c

(iii) ∂J_{skip-gram}(v_c, w_{t−m}, …, w_{t+m}, U)/∂v_w when w ≠ c

Write your answers in terms of ∂J(v_c, w_{t+j}, U)/∂U and ∂J(v_c, w_{t+j}, U)/∂v_c. This is very simple; each solution should be one line.

Once you’re done: Given that you computed the derivatives of J(v_c, w_{t+j}, U) with respect to all the model parameters U and V in parts (a) to (c), you have now computed the derivatives of the full loss function J_{skip-gram} with respect to all parameters. You’re ready to implement word2vec!

³Note: the loss function here is the negative of what Mikolov et al. had in their original paper, because we are doing a minimization instead of maximization in our assignment code. Ultimately, this is the same objective function.
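
The skip-gram loss is simply a sum of per-pair losses over the window, which the sketch below spells out. It plugs in a naive-softmax pair loss, but any function with the same arguments could be passed instead. The names `naive_softmax_loss` and `skipgram_loss`, the five-word toy vocabulary taken from the Figure 1 example, and the random vectors are assumptions for illustration and do not mirror the assignment's starter code.

```python
import numpy as np

def naive_softmax_loss(v_c, o, U):
    """-log P(O = o | C = c), computed through a stable log-softmax."""
    scores = U.T @ v_c
    scores = scores - scores.max()
    log_probs = scores - np.log(np.sum(np.exp(scores)))
    return -log_probs[o]

def skipgram_loss(center_idx, outside_indices, V, U, pair_loss=naive_softmax_loss):
    """Sum the per-pair loss J(v_c, w_{t+j}, U) over all outside words in the window."""
    v_c = V[:, center_idx]
    return sum(pair_loss(v_c, o, U) for o in outside_indices)

# Toy vocabulary from the Figure 1 example, window size m = 2.
words = ["turning", "into", "banking", "crises", "as"]
rng = np.random.default_rng(2)
d, n = 3, len(words)
U = rng.normal(size=(d, n))
V = rng.normal(size=(d, n))

center = words.index("banking")
outside = [words.index(w) for w in ["turning", "into", "crises", "as"]]

total = skipgram_loss(center, outside, V, U)
print(total)
```

Because the total is a plain sum, its gradients with respect to U and v_c are built from the per-pair gradients in the same way.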

2 Coding: Implementing word2vec (20 points)

In this part you will implement the word2vec model and train your own word vectors with stochastic gradient descent (SGD). Before you begin, first run the following commands within the assignment directory in order to create the appropriate conda virtual environment. This guarantees that you have all the necessary packages to complete the assignment.

conda env create -f env.yml
conda activate a2

Once you are done with the assignment you can deactivate this environment by running:

conda deactivate

(a) (12 points) First, implement the sigmoid function in word2vec.py to apply the sigmoid function to an input vector. In the same file, fill in the implementation for the softmax and negative sampling loss and gradient functions. Then, fill in the implementation of the loss and gradient functions for the skip-gram model. When you are done, test your implementation by running python word2vec.py.
(b) (4 points) Complete the implementation for your SGD optimizer in sgd.py. Test your implementation by running python sgd.py.
(c) (4 points) Show time! Now we are going to load some real data and train word vectors with everything you just implemented! We are going to use the Stanford Sentiment Treebank (SST) dataset to train word vectors, and later apply them to a simple sentiment analysis task. You will need to fetch the datasets first. To do this, run sh getdatasets.sh. There is no additional code to write for this part; just run python run.py.

Note: The training process may take a long time depending on the efficiency of your implementation (an efficient implementation takes approximately an hour). Plan accordingly!

After 40,000 iterations, the script will finish and a visualization for your word vectors will appear. It will also be saved as wordvectors.png in your project directory. Include the plot in your homework write-up. Briefly explain in at most three sentences what you see in the plot.
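
The core update behind the SGD optimizer mentioned in this section can be sketched in a few lines. The interface below, where a callable returns a (loss, gradient) pair, the function name `sgd`, and the quadratic test objective are all assumptions chosen for the example; they are not the interface of the assignment's sgd.py. With a deterministic objective this reduces to plain gradient descent; in word2vec training the stochasticity comes from evaluating the loss on randomly sampled context windows.

```python
import numpy as np

def sgd(f, x0, step, iterations):
    """Repeatedly apply x <- x - step * gradient.

    f          : callable returning (loss, gradient) at a parameter vector
    x0         : initial parameters
    step       : learning rate
    iterations : number of update steps
    """
    x = np.array(x0, dtype=float)   # copy so the caller's array is not modified
    for _ in range(iterations):
        loss, grad = f(x)
        x -= step * grad
    return x

# Hypothetical sanity check: minimize f(x) = sum(x^2), whose gradient is 2x.
def quadratic(x):
    return np.sum(x ** 2), 2 * x

x_final = sgd(quadratic, [1.0, -2.0], step=0.1, iterations=100)
print(x_final)   # very close to the minimizer at the origin
```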

3 Submission Instructions

You shall submit this assignment on GradeScope as two submissions: one for “Assignment 2 [coding]” and another for “Assignment 2 [written]”:

Run the sh script to produce your assignment2.zip file.
Upload your zip file to GradeScope to “Assignment 2 [coding]”.
Upload your written solutions to GradeScope to “Assignment 2 [written]”.


[1] Assume that every word in our vocabulary is matched to an integer number k. u_k is both the k-th column of U and the ‘outside’ word vector for the word indexed by k. v_k is both the k-th column of V and the ‘center’ word vector for the word indexed by k. In order to simplify notation we shall interchangeably use k to refer to the word and the index-of-the-word.

[2] The Cross Entropy Loss between the true (discrete) probability distribution p and another distribution q is $-\sum_i p_i \log(q_i)$.
