CSE571 – Assignment 5

Exercise 1.1 (8pt)
Sometimes MDPs are formulated with a reward function R(s, a) that depends only on the current state and action taken or with a reward function R(s) that only depends on the current state.
a. (2pt) Write the Bellman equations for these formulations for the optimal value function.
b. (3pt) Show how an MDP with reward function R(s, a, s′) can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP. You must formally define the new MDP (its components) based on the old MDP.
c. (3pt) Now do the same to convert MDPs with R(s, a, s′) into MDPs with R(s). You must formally define the new MDP (its components) based on the old MDP.
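For reference, a common textbook form of the Bellman optimality equation when the reward depends on the full transition (s, a, s′) is written below in LaTeX; the notation T(s, a, s′) for the transition probability is an assumption about the course's conventions. Parts (a)–(c) ask for the analogous equations under the simpler reward signatures and for the transformations between formulations.

V^*(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V^*(s') \right]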
Exercise 1.2 (10pt)
Consider the 3 × 3 world shown below. The transition model is the same as in our robot domain: 80% of the time the agent goes in the direction it selects; the rest of the time it moves at right angles to the intended direction.

Use discounted rewards with a discount factor of 0.99. Show the policy obtained for each of the following values of r, and explain intuitively why each value of r leads to that policy. (A value iteration sketch for this transition model is given after the list of cases.)
a. r = 100
b. r = −3
c. r = 0
d. r = +3
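Since the 3 × 3 figure is not reproduced here, the grid layout, terminal cell, and reward placement in the sketch below are placeholder assumptions; only the 80/10/10 transition model and the discount factor of 0.99 come from the exercise. The sketch runs value iteration and extracts the greedy policy for each candidate value of r.

# Value iteration sketch for Exercise 1.2.  Only the transition model
# (80% intended direction, 10% each perpendicular direction) and the
# discount factor 0.99 come from the exercise text; the grid layout,
# terminal cell, and reward placement below are placeholder assumptions.

GAMMA = 0.99
ACTIONS = {'Up': (-1, 0), 'Down': (1, 0), 'Left': (0, -1), 'Right': (0, 1)}
PERP = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
        'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}
TERMINAL = (0, 2)  # assumed terminal cell with exit reward +10

def move(state, action):
    # Deterministic move; bumping into a wall leaves the agent in place.
    dr, dc = ACTIONS[action]
    nr, nc = state[0] + dr, state[1] + dc
    return (nr, nc) if 0 <= nr < 3 and 0 <= nc < 3 else state

def reward(next_state, r):
    # Assumed reward layout: +10 on entering the terminal cell, r elsewhere.
    return 10.0 if next_state == TERMINAL else r

def q_value(V, s, a, r):
    # 80% intended direction, 10% each of the two perpendicular directions.
    outcomes = [(0.8, move(s, a))] + [(0.1, move(s, p)) for p in PERP[a]]
    return sum(p * (reward(s2, r) + GAMMA * V[s2]) for p, s2 in outcomes)

def value_iteration(r, sweeps=2000):
    V = {(i, j): 0.0 for i in range(3) for j in range(3)}
    for _ in range(sweeps):
        V = {s: (0.0 if s == TERMINAL else
                 max(q_value(V, s, a, r) for a in ACTIONS))
             for s in V}
    policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a, r))
              for s in V if s != TERMINAL}
    return V, policy

for r in (100, -3, 0, 3):
    _, policy = value_iteration(r)
    print(f"r = {r:>4}: {policy}")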

Exercise 1.3 (8pt)
Consider the 101 × 3 world shown below. In the start state the agent has a choice of two deterministic actions, Up or Down, but in the other states the agent has one deterministic action, Right. Assuming a discounted reward function, for what values of the discount factor γ should the agent choose Up, and for which should it choose Down? Compute the utility of each action as a function of γ. (Note that this simple example actually reflects many real-world situations in which one must weigh the value of an immediate action versus the potential continual long-term consequences, such as choosing to dump pollutants into a lake.)
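Since the 101 × 3 figure is not reproduced here, the reward values along the two rows are not restated; whichever they are, the utility of each action is a discounted sum of a known reward sequence, so the finite geometric series identity below (in LaTeX, with placeholder rewards r_t and constant c) is the main tool for the comparison.

U([r_0, r_1, \dots, r_n]) = \sum_{t=0}^{n} \gamma^{t} r_t,
\qquad
\sum_{t=k}^{n} \gamma^{t} c = c \, \frac{\gamma^{k} - \gamma^{n+1}}{1 - \gamma} \quad (\gamma \neq 1)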

Exercise 1.4 (12pt)
Apply policy iteration, showing each step in full, to determine the optimal policy when the initial policy is π(cool) = Slow and π(warm) = Fast. Show both the policy evaluation and policy improvement steps clearly until convergence. Assume a discount factor of 0.5.
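A minimal policy iteration sketch for this exercise follows. The transition probabilities and rewards are not reproduced in this handout, so the table in the code follows the commonly used racing-car MDP (cool/warm/overheated, Slow/Fast) and should be treated as an assumption to be replaced by the values in the figure; the discount factor 0.5 and the initial policy π(cool) = Slow, π(warm) = Fast come from the exercise.

GAMMA = 0.5

# T[(state, action)] = list of (probability, next_state, reward).
# These numbers follow the usual racing-car example and are an ASSUMPTION;
# substitute the values from the assignment's figure.
T = {
    ('cool', 'Slow'): [(1.0, 'cool', 1)],
    ('cool', 'Fast'): [(0.5, 'cool', 2), (0.5, 'warm', 2)],
    ('warm', 'Slow'): [(0.5, 'cool', 1), (0.5, 'warm', 1)],
    ('warm', 'Fast'): [(1.0, 'overheated', -10)],
}
STATES = ['cool', 'warm']   # 'overheated' is treated as terminal with value 0

def evaluate(policy, sweeps=100):
    # Iterative policy evaluation; for a hand computation, solving the
    # two linear equations exactly gives the same values.
    V = {s: 0.0 for s in STATES + ['overheated']}
    for _ in range(sweeps):
        V = {s: (sum(p * (r + GAMMA * V[s2]) for p, s2, r in T[(s, policy[s])])
                 if s in policy else 0.0)
             for s in V}
    return V

def improve(V):
    # Greedy policy improvement with respect to the current value estimates.
    return {s: max(('Slow', 'Fast'),
                   key=lambda a: sum(p * (r + GAMMA * V[s2])
                                     for p, s2, r in T[(s, a)]))
            for s in STATES}

policy = {'cool': 'Slow', 'warm': 'Fast'}
while True:
    V = evaluate(policy)
    improved = improve(V)
    print(policy, {s: round(V[s], 4) for s in STATES})
    if improved == policy:      # policy is stable -> optimal
        break
    policy = improved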

Exercise 1.5 (12pt)

Consider the car domain above, without knowing T or R, and given the following experiences, where each transition is written as (state, action, next state, reward):

Episode 1:
cool, slow, cool, +1
cool, slow, cool, +1
cool, fast, cool, +2
cool, fast, cool, +2
cool, fast, warm, +2
warm, fast, overheated, -10

Episode 2:
warm, slow, warm, +1
warm, slow, cool, +1
cool, slow, cool, +1
cool, fast, warm, +2
warm, slow, warm, +1
warm, slow, warm, +1
warm, fast, overheated, -10

a. (2pt) Estimate the parameters of T and R for model-based reinforcement learning.
b. (2pt) Use the Monte Carlo (MC) reinforcement learning method (direct evaluation) to estimate the V function, assuming γ = 1.0.
c. (4pt) Assuming that the initial state values are all zeros, compute the updates in TD learning for policy evaluation (passive RL) to the state values after running through episodes 1 and 2 in sequence. Show the steps for α = 0.5 and γ = 1.0.
d. (4pt) Assuming that the initial Q values are all zeros, compute the updates in Q-learning (active RL) to the Q values after running through episodes 1 and 2 in sequence. Show the steps for α = 0.5 and γ = 1.0. (A sketch covering parts a–d follows this list.)
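The following sketch works through parts (a)–(d) on the transitions listed above, with α = 0.5 and γ = 1.0. It assumes the class convention that 'overheated' is terminal with value (and Q-value) zero, and uses every-visit averaging for the direct evaluation in part (b); adjust if the course uses first-visit averaging.

from collections import defaultdict

ALPHA, GAMMA = 0.5, 1.0

# Each transition is (state, action, next_state, reward).  'overheated' is
# treated as a terminal state with value 0 -- an assumption about the
# convention used in class.
EPISODES = [
    [('cool', 'slow', 'cool', 1), ('cool', 'slow', 'cool', 1),
     ('cool', 'fast', 'cool', 2), ('cool', 'fast', 'cool', 2),
     ('cool', 'fast', 'warm', 2), ('warm', 'fast', 'overheated', -10)],
    [('warm', 'slow', 'warm', 1), ('warm', 'slow', 'cool', 1),
     ('cool', 'slow', 'cool', 1), ('cool', 'fast', 'warm', 2),
     ('warm', 'slow', 'warm', 1), ('warm', 'slow', 'warm', 1),
     ('warm', 'fast', 'overheated', -10)],
]
TRANSITIONS = [t for ep in EPISODES for t in ep]

# Part (a): model-based estimates from counts:
# T_hat(s, a, s') = N(s, a, s') / N(s, a),  R_hat(s, a, s') = mean observed reward.
n_sas, n_sa, r_sum = defaultdict(int), defaultdict(int), defaultdict(float)
for s, a, s2, r in TRANSITIONS:
    n_sas[(s, a, s2)] += 1
    n_sa[(s, a)] += 1
    r_sum[(s, a, s2)] += r
T_hat = {k: n_sas[k] / n_sa[k[:2]] for k in n_sas}
R_hat = {k: r_sum[k] / n_sas[k] for k in n_sas}
print("T_hat:", T_hat)
print("R_hat:", R_hat)

# Part (b): direct (Monte Carlo) evaluation -- average the observed return
# following every visit to each state.
returns = defaultdict(list)
for ep in EPISODES:
    G = 0.0
    for s, a, s2, r in reversed(ep):   # accumulate returns back to front
        G = r + GAMMA * G
        returns[s].append(G)
print("V_MC:", {s: sum(gs) / len(gs) for s, gs in returns.items()})

# Part (c): TD(0) update  V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
V = defaultdict(float)
for s, a, s2, r in TRANSITIONS:
    V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])
    print("TD after", (s, a, s2, r), "->", dict(V))

# Part (d): Q-learning update
# Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
ACTIONS = ('slow', 'fast')
Q = defaultdict(float)
for s, a, s2, r in TRANSITIONS:
    target = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
    print("Q after", (s, a, s2, r), "->", round(Q[(s, a)], 4))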
