EECS545 Homework #6 – Reinforcement Learning


1        Conditional Variational Autoencoders

In this problem, you will implement a conditional variational autoencoder (CVAE) from [1] and train it on the MNIST dataset.

  • [10 points] Derive the variational lower bound of a conditional variational autoencoder. Show that:

\log p_\theta(x \mid y) \ge \mathcal{L}(\theta, \phi; x, y)
= \mathbb{E}_{q_\phi(z \mid x, y)}\left[\log p_\theta(x \mid z, y)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid y)\right),                                         (1)

where x is a binary vector of dimension d, y is a one-hot vector of dimension c defining a class, and z is a vector of dimension m sampled from the posterior distribution qϕ (z|x,y). The posterior distribution is modeled by a neural network with parameters ϕ. The generative distribution pθ (x|y) is modeled by another neural network with parameters θ. As with the VAE covered in class, we assume conditional independence of the components of z, i.e.,

q_\phi(z \mid x, y) = \prod_{j=1}^{m} q_\phi(z_j \mid x, y) \quad \text{and} \quad p_\theta(z \mid y) = \prod_{j=1}^{m} p_\theta(z_j \mid y).
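For reference, one standard route to this bound is the usual VAE argument with Jensen's inequality (a sketch only; the question asks for the full derivation):

\log p_\theta(x \mid y)
= \log \int q_\phi(z \mid x, y)\, \frac{p_\theta(x \mid z, y)\, p_\theta(z \mid y)}{q_\phi(z \mid x, y)}\, dz
\ge \mathbb{E}_{q_\phi(z \mid x, y)}\left[\log p_\theta(x \mid z, y)\right] + \mathbb{E}_{q_\phi(z \mid x, y)}\left[\log \frac{p_\theta(z \mid y)}{q_\phi(z \mid x, y)}\right],

and the second expectation is exactly −DKL (qϕ (z|x,y)∥pθ (z|y)).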

  • [10 points] Derive the analytical solution to the KL-divergence between two Gaussian distributions DKL (qϕ (z|x,y)∥pθ (z|y)). Let us assume that pθ (z|y) ∼ N(0,I) and show that:

D_{\mathrm{KL}}\left(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid y)\right) = -\frac{1}{2} \sum_{j=1}^{m} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right),                                        (2)

where µj and σj are the outputs of the neural network that estimates the parameters of the posterior distribution qϕ (z|x,y).

You can assume without proof that

D_{\mathrm{KL}}\left(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid y)\right) = \sum_{j=1}^{m} D_{\mathrm{KL}}\left(q_\phi(z_j \mid x, y) \,\|\, p_\theta(z_j \mid y)\right).

This is a consequence of conditional independence of the components of z.
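As a sanity check on Eq. (2), the closed form can be compared numerically against PyTorch's built-in KL for diagonal Gaussians (a hypothetical snippet for illustration only; it is not part of the assignment):

import torch
from torch.distributions import Normal, kl_divergence

# Hypothetical check: compare the closed form in Eq. (2) with PyTorch's
# kl_divergence for q = N(mu, diag(sigma^2)) against the prior p = N(0, I).
torch.manual_seed(0)
m = 5                                  # latent dimension
mu = torch.randn(m)                    # posterior means mu_j
log_var = torch.randn(m)               # posterior log-variances log(sigma_j^2)
sigma = torch.exp(0.5 * log_var)       # posterior standard deviations sigma_j

closed_form = -0.5 * torch.sum(1 + log_var - mu.pow(2) - sigma.pow(2))
reference = kl_divergence(Normal(mu, sigma), Normal(torch.zeros(m), torch.ones(m))).sum()
print(closed_form.item(), reference.item())   # the two numbers should match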

  • [15 points] Fill in code for the CVAE network as a Module class called CVAE in the starter code q1_cvae.py (a minimal sketch of how these pieces fit together appears after this list):
    • Implement the recognition_model function qϕ (z|x,y).
    • Implement the generative_model function pθ (x|z,y).
    • Implement the forward function by inferring the Gaussian parameters using the recognition model, sampling a latent variable using the reparametrization trick and generating the data using the generative model.
    • Implement the variational lower bound loss_function L(θ,ϕ;x,y).
    • Train the CVAE and visualize the generated image for each class (i.e., 10 images).
    • Repeat the image generation three times with different random noise. Submit a 3 × 10 array of subplots showing all the generated images, where images in the same row are generated from the same random noise and images in the same column are generated from the same class label.
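A minimal sketch of how these pieces might fit together, assuming a simple MLP encoder/decoder with hypothetical layer sizes (the actual interfaces, shapes, and training loop are fixed by the starter code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    # Hypothetical sizes: d = 784 (flattened MNIST), c = 10 classes, m-dimensional latent.
    def __init__(self, d=784, c=10, m=20, hidden=400):
        super().__init__()
        self.enc = nn.Linear(d + c, hidden)        # recognition model q_phi(z|x,y)
        self.enc_mu = nn.Linear(hidden, m)
        self.enc_logvar = nn.Linear(hidden, m)
        self.dec = nn.Linear(m + c, hidden)        # generative model p_theta(x|z,y)
        self.dec_out = nn.Linear(hidden, d)

    def recognition_model(self, x, y):
        h = F.relu(self.enc(torch.cat([x, y], dim=1)))
        return self.enc_mu(h), self.enc_logvar(h)  # mu_j and log(sigma_j^2)

    def generative_model(self, z, y):
        h = F.relu(self.dec(torch.cat([z, y], dim=1)))
        return torch.sigmoid(self.dec_out(h))      # Bernoulli parameters for x

    def forward(self, x, y):
        mu, logvar = self.recognition_model(x, y)
        eps = torch.randn_like(mu)                 # reparameterization trick
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.generative_model(z, y), mu, logvar

def loss_function(x_hat, x, mu, logvar):
    # Negative of the variational lower bound in Eq. (1): reconstruction term
    # plus the KL term from Eq. (2); minimize this during training.
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl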

If trained successfully, you should be able to sample images x that reflect the given label y given the noise vector z.

2    Generative Adversarial Networks

This problem asks you to implement a generative adversarial network and train it on the MNIST dataset. Specifically, you will implement the Deep Convolutional Generative Adversarial Network (DCGAN) [2]. In the generative adversarial network formulation, we have a generator network G that takes in a random vector z and a discriminator network D that takes in an input image x. The parameters of G and D are optimized via the adversarial objective:

\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right].                                                        (3)

In practice, we alternate between training D and G where we first train G to maximize:

\mathbb{E}_{z \sim p(z)}\left[\log D(G(z))\right],                                                                                          (4)

followed by training D to maximize:

\mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right].                                                                  (5)

Therefore, the two separate optimizations make up one full training step.
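Concretely, one such training step could look like the following sketch, with the updates in the order described above (G, D, G_solver, D_solver, sample_noise, and the two loss functions are placeholders for what you will implement below):

# A sketch of one full training step, alternating the two updates above.
# All names here are placeholders for pieces defined in the starter code.
def training_step(G, D, G_solver, D_solver, real_images, noise_dim,
                  sample_noise, generator_loss, discriminator_loss):
    batch_size = real_images.size(0)

    # Generator update: maximize Eq. (4) by minimizing its negation.
    G_solver.zero_grad()
    fake_images = G(sample_noise(batch_size, noise_dim))
    g_loss = generator_loss(D(fake_images))
    g_loss.backward()
    G_solver.step()

    # Discriminator update: maximize Eq. (5) by minimizing its negation.
    D_solver.zero_grad()
    fake_images = G(sample_noise(batch_size, noise_dim)).detach()  # no gradient into G
    d_loss = discriminator_loss(D(real_images), D(fake_images))
    d_loss.backward()
    D_solver.step()

    return g_loss.item(), d_loss.item()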

  1. [20 points] Fill in code for the DCGAN network in the starter code. Descriptions of what should be filled in are written as comments in the code itself. A sketch of the loss-related pieces appears after this list.
    • Implement the sample_noise
    • Implement the build_discriminator
    • Implement the build_generator
    • Implement the get_optimizer
    • Implement the bce_loss
    • Using bce_loss, implement the discriminator_loss
    • Using bce_loss, implement the generator_loss
  2. [10 points] Using the default hyper-parameters in the starter code (NUM_TRAIN, NUM_VAL, NOISE_DIM, batch size), train the DCGAN network and report 6 plots of sample images at the (0, 250, 500, 1000, 2000, 3000)th iterations. Note that each plot should include 16 (4×4) sample images of digits.
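A sketch of the noise sampler and the two losses built on bce_loss, assuming the discriminator outputs raw scores (logits); the exact conventions (shapes, devices, any label smoothing) are set by the starter code:

import torch
import torch.nn.functional as F

def sample_noise(batch_size, noise_dim):
    # Uniform noise in [-1, 1], a common choice for DCGAN input.
    return 2 * torch.rand(batch_size, noise_dim) - 1

def bce_loss(scores, targets):
    # Numerically stable binary cross-entropy on logits.
    return F.binary_cross_entropy_with_logits(scores, targets)

def discriminator_loss(scores_real, scores_fake):
    # Real images should be classified as 1, generated images as 0 (Eq. (5)).
    real_targets = torch.ones_like(scores_real)
    fake_targets = torch.zeros_like(scores_fake)
    return bce_loss(scores_real, real_targets) + bce_loss(scores_fake, fake_targets)

def generator_loss(scores_fake):
    # The generator tries to make D output 1 on generated images (Eq. (4)).
    return bce_loss(scores_fake, torch.ones_like(scores_fake))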

If trained successfully, you should see improvements in the sample qualities as the training progresses.

References

  [1] Kihyuk Sohn, Xinchen Yan, and Honglak Lee. Learning structured output representation using deep conditional generative models. In NeurIPS, 2015.
  [2] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

3      Deep Q-Network (DQN)

In this problem, you are asked to implement the DQN algorithm [1] and train an agent to play the OpenAI Gym CartPole task. We first encourage you to familiarize yourself with OpenAI Gym (http://gym.openai.com/) and the CartPole environment (https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py), as we will be using them in this homework.
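For orientation, the basic interaction loop with the CartPole environment looks like this (a sketch using the classic Gym reset/step API; the starter code sets this up for you):

import gym

# CartPole: 4-dimensional observation; actions 0 = push left, 1 = push right.
env = gym.make('CartPole-v0')
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()           # random policy, just to explore the API
    state, reward, done, info = env.step(action)
    total_reward += reward
print('episode reward:', total_reward)
env.close()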

  1. [15 points] Implement the DQN algorithm in the starter code. Specifically (a sketch of the key pieces appears after this list):
    • Fill in the blank(s) in function select_action to implement ϵ-greedy action selection method.
    • Fill in the blank(s) in function optimize to train the deep Q network.
    • Fill in the blank(s) in function main.
    • Fill in the blank(s) in class DQN.
  2. [10 points] Run the script to train the agent on the CartPole task and report the learning curve of episode duration for the first 1000 episodes, using the provided function plot_durations. While we suggest an optional DQN architecture (fc-relu-fc), you may change the network architecture and hyper-parameters to make the agent perform better. The agent is expected to achieve a 100-episode running-average reward greater than 300 within 1000 episodes.
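A sketch of the ϵ-greedy action selection and the Q-learning update used in optimize, with hypothetical names (policy_net, target_net, batched replay tensors, gamma); the actual function signatures and replay-memory handling come from the starter code:

import random
import torch
import torch.nn.functional as F

def select_action(state, policy_net, num_actions, epsilon):
    # Epsilon-greedy: random action with probability epsilon, greedy otherwise.
    if random.random() < epsilon:
        return torch.tensor([[random.randrange(num_actions)]], dtype=torch.long)
    with torch.no_grad():
        return policy_net(state).argmax(dim=1, keepdim=True)

def optimize(policy_net, target_net, optimizer,
             states, actions, rewards, next_states, dones, gamma=0.99):
    # Q-learning target: r + gamma * max_a' Q_target(s', a'), zeroed at terminal states.
    q_values = policy_net(states).gather(1, actions).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()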

4    Policy Gradients

For this section, we will grade the best-scoring question. If you answer only one of the questions, we will grade only that question.

Recall the REINFORCE algorithm with objective

J(\theta) = \mathbb{E}_{\tau \sim p(\tau; \theta)}\left[r(\tau)\right].

Remember that the gradient of the objective function above can be approximated as follows:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_t^i\right) r(\tau_i),

where the outer sum runs over different agent episodes τ1, …, τN and \tau_i = (s_1^i, a_1^i, \dots, s_T^i, a_T^i) represents a single episode.
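This approximation follows from the likelihood-ratio (log-derivative) trick; a brief sketch of the underlying identity:

\nabla_\theta J(\theta)
= \nabla_\theta \int p(\tau; \theta)\, r(\tau)\, d\tau
= \int p(\tau; \theta)\, \nabla_\theta \log p(\tau; \theta)\, r(\tau)\, d\tau
= \mathbb{E}_{\tau \sim p(\tau; \theta)}\left[\nabla_\theta \log p(\tau; \theta)\, r(\tau)\right],

and because the environment dynamics do not depend on θ, \nabla_\theta \log p(\tau; \theta) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t); sampling N episodes gives the Monte Carlo estimate above.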

  1. [10 points] Let r(\tau) = \sum_{t=1}^{T} r(s_t, a_t). In this case, the above gradient estimator becomes

\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_t^i\right) \left( \sum_{t'=1}^{T} r\!\left(s_{t'}^i, a_{t'}^i\right) \right).

Show that the following is an unbiased estimate of the ∇θJ(θ) above:

\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_t^i\right) \left( \sum_{t'=t}^{T} r\!\left(s_{t'}^i, a_{t'}^i\right) \right).

That is, we can omit the rewards collected in the past while keeping the estimator unbiased. The new estimator has the advantage of lower variance than the original estimator. (A short numerical illustration of the reward-to-go term appears after question 2 below.)

  2. [10 points] Show that adding a state-dependent baseline b(s) does not introduce any bias into the estimator, i.e., show that the following is still an unbiased estimator of the gradient (adding a baseline can further reduce the variance of the estimator):

\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_t^i\right) \left( \sum_{t'=t}^{T} r\!\left(s_{t'}^i, a_{t'}^i\right) - b\!\left(s_t^i\right) \right).
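A small numerical illustration of the reward-to-go term used in both estimators above (a hypothetical helper, not part of the starter code):

import numpy as np

# Reward-to-go: for each step t, keep only the rewards from t onward.
def rewards_to_go(rewards):
    return np.cumsum(rewards[::-1])[::-1]

print(rewards_to_go(np.array([1.0, 2.0, 3.0])))   # -> [6. 5. 3.]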

[1] You can find the detailed algorithm in the Nature paper: Mnih et al., “Human-level control through deep reinforcement learning”, 2015.
