CS 234 Assignment #3

These questions require thought, but do not require long answers. Please be as concise as possible.

1        Policy Gradient Methods (50 pts coding + 15 pts writeup)

The goal of this problem is to experiment with policy gradient and its variants, including variance reduction methods. Your goals will be to set up policy gradient for both continuous and discrete environments, and to implement a neural network baseline for variance reduction. The framework for the policy gradient algorithm is set up in main.py, and everything that you need to implement is in the files network_utils.py, policy.py, policy_gradient.py, and baseline_network.py. These files contain detailed instructions for each implementation task, but an overview of the key steps in the algorithm is provided here.

1.1       REINFORCE

Recall the policy gradient theorem,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) \right]$$

REINFORCE is a Monte Carlo policy gradient algorithm, so we will use the sampled returns $G_t$ as unbiased estimates of $Q^{\pi_\theta}(s,a)$. The REINFORCE estimator can be expressed as the gradient of the following objective function:

$$J(\theta) = \frac{1}{|D|} \sum_{\tau_i \in D} \sum_{t=0}^{T_i} \log \pi_\theta(a^i_t \mid s^i_t)\, G^i_t$$

where $D$ is the set of all trajectories collected by policy $\pi_\theta$, and $\tau_i = (s^i_0, a^i_0, r^i_0, s^i_1, \ldots)$ is trajectory $i$.
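For intuition only, here is a minimal PyTorch sketch of how this sampled objective can be turned into a loss that an optimizer minimizes. The function and tensor names (reinforce_loss, log_probs, returns) are illustrative placeholders, not the starter code's interface:

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """Negative of the sampled REINFORCE objective, averaged over the flattened batch.

    log_probs: log pi_theta(a_t | s_t) for every timestep in the batch, shape (batch,)
    returns:   Monte Carlo returns G_t for the same timesteps, shape (batch,)
    """
    # Maximizing J(theta) is implemented as minimizing -J(theta); the returns are
    # treated as constants, so gradients flow only through the log-probabilities.
    return -(log_probs * returns.detach()).mean()
```

A single stochastic gradient step on this loss is equivalent to ascending the sampled policy gradient.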

1.2       Baseline

One difficulty of training with the REINFORCE algorithm is that the Monte Carlo sampled returns $G_t$ can have high variance. To reduce variance, we subtract a baseline $b_\phi(s)$ from the estimated returns when computing the policy gradient. A good baseline is the state value function, $V^{\pi_\theta}(s)$, which requires a training update to $\phi$ to minimize the following mean-squared error loss:

$$L_{\mathrm{MSE}}(\phi) = \frac{1}{|D|} \sum_{\tau_i \in D} \sum_{t=0}^{T_i} \left( b_\phi(s^i_t) - G^i_t \right)^2$$
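As a rough sketch of the kind of update this loss implies, assuming a PyTorch value network (the starter code's actual update_baseline interface may differ):

```python
import torch
import torch.nn as nn

def update_baseline_mse(baseline: nn.Module,
                        optimizer: torch.optim.Optimizer,
                        observations: torch.Tensor,
                        returns: torch.Tensor) -> float:
    """One gradient step on the MSE between predicted values b_phi(s_t) and returns G_t."""
    predictions = baseline(observations).squeeze(-1)   # b_phi(s_t), shape (batch,)
    loss = nn.functional.mse_loss(predictions, returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```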

1.3       Advantage Normalization

After subtracting the baseline, we get the following new objective function:

$$J(\theta) = \frac{1}{|D|} \sum_{\tau_i \in D} \sum_{t=0}^{T_i} \log \pi_\theta(a^i_t \mid s^i_t)\, \hat{A}^i_t$$

where

$$\hat{A}^i_t = G^i_t - b_\phi(s^i_t)$$

A second variance reduction technique is to normalize the computed advantages $\hat{A}^i_t$ so that they have mean 0 and standard deviation 1. From a theoretical perspective, centering the advantages simply adjusts them by a constant baseline, which does not change the policy gradient. Likewise, rescaling the advantages effectively changes the learning rate by a factor of $1/\sigma$, where $\sigma$ is the standard deviation of the empirical advantages.
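This normalization step is small enough to sketch directly. The eps term below is an added numerical safeguard that is not part of the math above, and the function name is illustrative rather than the starter code's exact signature:

```python
import numpy as np

def normalize(advantages: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Rescale the empirical advantages to mean 0 and standard deviation 1.

    eps guards against dividing by zero when every advantage in the batch is identical.
    """
    return (advantages - advantages.mean()) / (advantages.std() + eps)
```

Because the same constants are applied to every advantage in the batch, this only recenters and rescales the gradient as described above.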

1.4       Coding Questions (50 pts)

The functions that you need to implement in network_utils.py, policy.py, policy_gradient.py, and baseline_network.py are enumerated here. Detailed instructions for each function can be found in the comments in each of these files.

Note: the "batch size" for all of the arguments is $\sum_i T_i$, since we have already flattened out all the episode observations, actions, and rewards for you. (A sketch of what build_mlp might look like is given after the function lists below.)

In network_utils.py,

  • build_mlp

In policy.py,

  • act
  • action_distribution
  • __init__
  • std
  • action_distribution

In policy_gradient.py,

  • init_policy
  • get_returns
  • normalize_advantage
  • update_policy

In baseline_network.py,

  • __init__
  • forward
  • calculate_advantage
  • update_baseline
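For reference, one plausible shape for build_mlp in network_utils.py is sketched below. The argument names and exact architecture expected by the starter code may differ, so treat this purely as a sketch:

```python
import torch.nn as nn

def build_mlp(input_size: int, output_size: int, n_layers: int, size: int) -> nn.Module:
    """Fully connected network: n_layers hidden Linear layers of width `size`
    with ReLU activations, followed by a final Linear layer to output_size
    (no output activation)."""
    layers = []
    in_dim = input_size
    for _ in range(n_layers):
        layers += [nn.Linear(in_dim, size), nn.ReLU()]
        in_dim = size
    layers.append(nn.Linear(in_dim, output_size))
    return nn.Sequential(*layers)
```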

1.5       Testing

We have provided some basic tests to sanity check your implementation. Please note that the tests are not comprehensive, and passing them does not guarantee a correct implementation. Use the following command to run the tests:

You can also add additional tests of your own design in tests/test_basic.py.

1.6       Writeup Questions (15 pts)

  • (3 pts) To compute the REINFORCE estimator, you will need to calculate the values $\{G_t\}_{t=0}^{T}$ (we drop the trajectory index $i$ for simplicity), where

$$G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$$

Naively, computing all of these values takes $O(T^2)$ time. Describe how to compute them in $O(T)$ time.

  • (12 pts) The general form for running your policy gradient implementation is as follows:

if not using a baseline, or

if using a baseline. Here ENV should be cartpole, pendulum, or cheetah, and SEED should be a positive integer.

For each of the 3 environments, choose 3 random seeds and run the algorithm both with and without the baseline. Then plot the results using

where SEEDS should be a comma-separated list of seeds which you want to plot (e.g. --seeds 1,2,3). Please include the plots (one for each environment) in your writeup, and comment on whether or not you observe improved performance when using a baseline.

We have the following expectations about performance to receive full credit:

  • cartpole: Should reach the max reward of 200 (although it may not stay there)
  • pendulum: Should reach the max reward of 1000 (although it may not stay there)
  • cheetah: Should reach at least 200 (Could be as large as 950)

2        Reducing Variance in Policy Gradient Methods (35 pts)

In class, we explored REINFORCE as a policy gradient method with no bias but high variance. In this problem, we will explore methods to dramatically reduce variance in policy gradient methods, potentially at the cost of increased bias.

Let us consider an infinite-horizon MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{T} \rangle$. Let us define

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$

An approximation to the policy gradient is defined as

$$g = \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}} \left[ \sum_{t=0}^{\infty} A^\pi(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \tag{1}$$

where the colon notation $a\!:\!b$ represents the range $[a, a+1, a+2, \ldots, b]$, inclusive of both ends.

  • (5 pts) Let us define the partial sum

$$R_t = \sum_{i=0}^{t} \gamma^i r_i \tag{2}$$

Show that it is not necessarily true that $\mathrm{Var}(R_{t+1}) \geq \mathrm{Var}(R_t)$. [Hint: Construct a counterexample MDP where this statement does not hold.]
  • (10 pts) Prove that $\mathrm{Var}(R_{t+1}) \geq \mathrm{Var}(R_t)$ is true if we assume that $r_{t+1}$ is, on average, correlated with the previous rewards, i.e. $\mathrm{Cov}(r_i, r_{t+1}) \geq 0$ for all $0 \leq i \leq t$.
  • (5 pts) In practice, we do not have access to the true function $A^\pi(s_t, a_t)$, so we would like to obtain an estimate instead. We will consider the general form of an estimator $\hat{A}_t(s_{0:\infty}, a_{0:\infty})$ that can be a function of the entire trajectory.

Let $\hat{A}_t(s_{0:\infty}, a_{0:\infty}) = \hat{Q}_t(s_{t:\infty}, a_{t:\infty}) - b_t(s_{0:t}, a_{0:t-1})$, where for all $s_t, a_t$ we have that $\hat{Q}_t$ is an unbiased estimator of the true $Q^\pi$. Namely, we have that $\mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty}}\left[ \hat{Q}_t(s_{t:\infty}, a_{t:\infty}) \right] = Q^\pi(s_t, a_t)$. Note that $b_t$ is an arbitrary function of the actions and states sampled before $a_t$. Prove that by using this estimate of $\hat{A}_t$, we obtain an unbiased estimate of the policy gradient $g$. In other words, prove that

$$\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}} \left[ \sum_{t=0}^{\infty} \hat{A}_t(s_{0:\infty}, a_{0:\infty})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] = g.$$

  • (5 pts) We will now look at a few different variants of $\hat{A}_t$. Recall the TD error $\delta_t^{\hat{V}}(s_t, a_t) = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)$. If $\hat{V} = V^\pi$, prove that $\delta_t^{V^\pi}$ is an unbiased estimate of $A^\pi(s_t, a_t)$.
  • (5 pts) Let us define $\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}^{\hat{V}}$. Show that $\hat{A}_t^{(k)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k \hat{V}(s_{t+k}) - \hat{V}(s_t)$. In general, how do bias and variance change as $k$ increases? (A few sentences of justification would suffice; no formal proof is necessary.) An illustrative sketch of these quantities is given after this list.

  • (5 pts) Show that the exponentially-weighted average of the $k$-step estimators, $\hat{A}_t^{GAE} := (1 - \lambda)\left( \hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \cdots \right)$, can be written as $\hat{A}_t^{GAE} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}^{\hat{V}}$.
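To make the estimators in the last few parts concrete, the following small numerical sketch (purely illustrative and not part of the assignment) computes the TD errors and the k-step estimates for a short finite trajectory, assuming the definitions given above and an arbitrary, made-up value estimate:

```python
import numpy as np

def td_errors(rewards: np.ndarray, values: np.ndarray, gamma: float) -> np.ndarray:
    """delta_t = r_t + gamma * V_hat(s_{t+1}) - V_hat(s_t) for t = 0..T-1.

    rewards has length T; values holds V_hat(s_0), ..., V_hat(s_T) (length T+1).
    """
    return rewards + gamma * values[1:] - values[:-1]

def k_step_advantage(rewards: np.ndarray, values: np.ndarray,
                     gamma: float, t: int, k: int) -> float:
    """A_hat_t^(k) = sum_{l=0}^{k-1} gamma^l * delta_{t+l}, truncated at the trajectory end."""
    deltas = td_errors(rewards, values, gamma)
    horizon = min(k, len(deltas) - t)
    return float(sum((gamma ** l) * deltas[t + l] for l in range(horizon)))

# Example trajectory of length 4 with an arbitrary value estimate V_hat.
rewards = np.array([1.0, 0.0, 2.0, 1.0])
values = np.array([0.5, 0.4, 0.9, 0.3, 0.0])    # V_hat(s_0), ..., V_hat(s_4)
print(td_errors(rewards, values, gamma=0.99))   # one delta_t per timestep
print(k_step_advantage(rewards, values, gamma=0.99, t=0, k=3))
```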