CMPE58y Homework 3: Policy Gradient with Function Approximation

In this homework you will implement the policy gradient algorithm with a neural network for the cart pole task [1] in the OpenAI Gym environment. As in the previous homework, ignore the done variable and terminate the episode after 500 iterations. You can consider the task solved if you consistently get a reward of +450.
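For concreteness, here is a minimal roll-out sketch, assuming CartPole-v1 and the classic Gym step API that returns (observation, reward, done, info); the function and variable names are illustrative, not part of the assignment. The done flag is deliberately ignored and the episode is simply cut off after 500 steps.

```python
import gym

env = gym.make("CartPole-v1")

def rollout(policy, max_steps=500):
    """Run one roll-out and return a list of (state, action, reward) tuples."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)                      # 0 = push left, 1 = push right
        next_state, reward, done, _ = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state                          # `done` is deliberately ignored
    return trajectory
```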

2 Policy Gradient

As explained in the lecture, your RL agent can be a neural network. Since the environment is not complex, in this homework you will use a single layer with at most 4 neurons. (Our implementation uses a single neuron and can solve the task in approximately 50 episodes, with 50 roll-outs in each episode. Considering the state space, the neuron has 4 weights and 1 bias. The activation function is a sigmoid, the discount factor is 0.99, and the learning rate is 0.05. The average reward of the roll-outs is used as the baseline. Remember to check the course website for the explanation of the causality principle.)
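As an illustration of these hyperparameters, here is a hedged sketch of a single-neuron sigmoid policy (4 weights and 1 bias, discount factor 0.99, learning rate 0.05) together with Bernoulli action sampling over the two discrete actions, which is described in the next paragraph. All names here are our own choices, not requirements.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = rng.normal(scale=0.1, size=4)   # one weight per state dimension
bias = 0.0                              # single bias term
gamma = 0.99                            # discount factor
alpha = 0.05                            # learning rate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def action_probability(state):
    """p = sigmoid(theta . s + b): probability of taking action 1."""
    return sigmoid(np.dot(theta, state) + bias)

def sample_action(state):
    """Sample an action from the Bernoulli distribution with parameter p."""
    p = action_probability(state)
    n = 1 if rng.random() < p else 0
    return n, p
```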

In the lecture we used a Gaussian distribution as the probability distribution over actions, but there are 2 discrete actions in the cart pole task, so a Gaussian distribution is not a reasonable choice. Instead, a Bernoulli distribution will be used: the output of the network is the probability p of one of the two actions (pushing the cart either to the left or to the right), and the other action naturally has probability 1 − p. Remember that:

∇_θ J(θ) = E_{π_θ}[ Σ_{t=1}^{T} ∇_θ log π_θ(a_t|s_t) ( Σ_{k=t}^{T} γ^{k−t} ∗ R(s_k, a_k) ) ]    (1)

Here:

π_θ(a_t|s_t) = p^n ∗ (1 − p)^{1−n}    (2)

Where n ∈ {0, 1} denotes the sampled action and:

p = P_θ(a_t) = sigmoid(θ ∗ s + b)    (3)

So, substituting (2) into (1):

∇_θ J(θ) = E_{π_θ}[ Σ_{t=1}^{T} ∇_θ log(p^n ∗ (1 − p)^{1−n}) ( Σ_{k=t}^{T} γ^{k−t} ∗ R(s_k, a_k) ) ]    (4)

and, taking the logarithm of the Bernoulli likelihood:

log π_θ(a_t|s_t) = n ∗ log(p) + (1 − n) ∗ log(1 − p)    (5)

Because of the derivative property of the sigmoid:

∇_θ p = ∇_θ P_θ(a_t) = p ∗ (1 − p) ∗ s    (6)

Then:

∇_θ log π_θ(a_t|s_t) = n ∗ (1 − p) ∗ s + (1 − n) ∗ (−p) ∗ s    (7)
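Note that (7) simplifies to (n − p) ∗ s, which is the convenient form to implement. Below is a small sketch of the per-step gradient, reusing the names from the earlier sketches (state is the 4-dimensional observation, action is n, and p comes from the policy); the bias is treated as a weight on a constant input of 1.

```python
def grad_log_pi(state, action, p):
    """Gradient of log pi_theta(a|s) for the Bernoulli policy.

    Uses the simplification n*(1-p)*s - (1-n)*p*s = (n - p)*s.
    """
    grad_theta = (action - p) * state   # gradient w.r.t. the 4 weights
    grad_bias = (action - p)            # gradient w.r.t. the bias (input = 1)
    return grad_theta, grad_bias
```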

After you correctly calculate these gradients, you can update your parameters using stochastic gradient descent as in the second homework.
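To make the whole update concrete, here is a hedged sketch of one training episode (50 roll-outs, discounted reward-to-go for causality, the average roll-out return as baseline, and a gradient-ascent step with learning rate alpha). It reuses the helper functions from the earlier sketches and is only one reasonable reading of the description, not the required implementation.

```python
def policy_gradient_step(num_rollouts=50):
    """One episode: collect roll-outs, estimate the gradient, ascend on J(theta)."""
    global theta, bias
    trajectories = [rollout(lambda s: sample_action(s)[0]) for _ in range(num_rollouts)]

    # Baseline: average total reward of the roll-outs.
    baseline = np.mean([sum(r for _, _, r in traj) for traj in trajectories])

    grad_theta = np.zeros_like(theta)
    grad_bias = 0.0
    for traj in trajectories:
        rewards = [r for _, _, r in traj]
        # Discounted reward-to-go (causality): G_t = sum_{k >= t} gamma^(k-t) * r_k
        reward_to_go = [0.0] * len(rewards)
        G = 0.0
        for t in reversed(range(len(rewards))):
            G = rewards[t] + gamma * G
            reward_to_go[t] = G
        for t, (s, a, _) in enumerate(traj):
            g_theta, g_bias = grad_log_pi(s, a, action_probability(s))
            advantage = reward_to_go[t] - baseline
            grad_theta += g_theta * advantage
            grad_bias += g_bias * advantage

    # Stochastic gradient ascent on J(theta), averaged over the roll-outs.
    theta += alpha * grad_theta / num_rollouts
    bias += alpha * grad_bias / num_rollouts
    return baseline   # average reward of this episode, useful for the plot
```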

3 Deliverables

Plot the reward over episodes. Submit your code (a Jupyter notebook is preferred) to [email protected]. For any questions regarding the description, environment installation, hyperparameters, and so on, you can send an e-mail. Cheating will be penalized by −200 points.
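For the reward plot, collecting the value returned by each update and plotting it is enough; a minimal sketch, assuming the policy_gradient_step sketch above:

```python
import matplotlib.pyplot as plt

episode_rewards = [policy_gradient_step() for _ in range(50)]   # average reward per episode

plt.plot(episode_rewards)
plt.xlabel("Episode")
plt.ylabel("Average roll-out reward")
plt.title("Policy gradient on cart pole")
plt.show()
```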

References

[1] Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5):834–846, 1983.
