CS7642 Homework #4: Q-Learning Solution

In this homework you will have the complete RL experience. You will work towards implementing and evaluating the Q-learning algorithm on a simple domain. Q-learning is a fundamental RL algorithm and has been successfully used to solve a variety of decision-making problems. For this homework, you will have to think carefully about algorithm implementation, especially the exploration parameters.


The domain you will be tackling is called Taxi (Taxi-v2). It is a discrete MDP which has been used for RL research in the past. This will also be your first opportunity to become familiar with the OpenAI Gym environment (https://gym.openai.com/). This is a cool and unique platform where users can test their RL algorithms over a selection of domains.


The Taxi problem was introduced in Dietterich (2000). It is a grid-based domain where the goal of the agent is to pick up a passenger at one location and drop them off in another. There are 4 fixed locations, each assigned a different letter. The agent has 6 actions: 4 for movement, 1 for pickup, and 1 for dropoff. The domain has a discrete state space and deterministic transitions.
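To make the discrete state space concrete, here is a minimal sketch of how a Taxi state can be packed into a single integer. It assumes the layout used by OpenAI Gym's Taxi implementation (a 5x5 grid, 5 passenger positions counting "in the taxi", 4 destinations, and the action order shown), none of which is stated in this handout; the function names `encode` and `decode` are illustrative.

```python
# Assumption: layout of OpenAI Gym's Taxi environment (not given in the handout):
# 5 rows x 5 columns x 5 passenger positions x 4 destinations = 500 states.
ACTIONS = ["south", "north", "east", "west", "pickup", "dropoff"]


def encode(taxi_row, taxi_col, passenger, destination):
    """Pack the four state components into one integer in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination


def decode(state):
    """Inverse of encode: (taxi_row, taxi_col, passenger, destination)."""
    state, destination = divmod(state, 4)
    state, passenger = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger, destination


print(decode(462))  # (4, 3, 0, 2)
```

Under this assumed encoding, a state index such as 462 corresponds to the taxi at row 4, column 3, with the passenger waiting at location 0 and destination 2.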



Implement a basic version of the Q-learning algorithm and use it to solve the Taxi domain. The agent should explore the MDP and collect data to learn the optimal policy and the optimal Q-value function. (Be mindful of how you handle terminal states: typically, if S_{t+1} is a terminal state, V(S_{t+1}) = 0.) Use γ = 0.90. Also, you will see how an epsilon-greedy strategy can find the optimal policy despite learning sub-optimal Q-values. Because we are looking for optimal Q-values, you will have to try different exploration strategies.
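The following is a minimal sketch of tabular Q-learning with epsilon-greedy exploration, not a reference solution. It assumes an environment with the older Gym-style interface (`reset()` returns a state, `step(action)` returns `(next_state, reward, done, info)`). The learning rate, the epsilon decay schedule, and the tiny `ChainEnv` corridor used to exercise it are all hypothetical choices for illustration; only γ = 0.90 comes from the assignment.

```python
import random
from collections import defaultdict


def q_learning(env, n_actions, episodes=2000, alpha=0.2, gamma=0.90,
               eps_start=1.0, eps_min=0.05, eps_decay=0.995, seed=0):
    """Tabular Q-learning; alpha and the epsilon schedule are illustrative."""
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0] * n_actions)  # Q[state][action]
    epsilon = eps_start
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: random action with probability epsilon.
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])
            next_state, reward, done, _ = env.step(action)
            # A terminal next state has value 0, so do not bootstrap from it.
            target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
        epsilon = max(eps_min, epsilon * eps_decay)  # decay once per episode
    return Q


class ChainEnv:
    """Hypothetical 4-state corridor (not Taxi); state 3 is terminal."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # 0 = left, 1 = right
        if action == 0:
            self.state = max(0, self.state - 1)
        else:
            self.state += 1
        done = self.state == 3
        reward = 20 if done else -1
        return self.state, reward, done, {}


Q = q_learning(ChainEnv(), n_actions=2)
print([round(max(Q[s]), 3) for s in range(3)])
```

For this toy corridor the optimal values of moving right are 20 from state 2, -1 + 0.9 * 20 = 17 from state 1, and -1 + 0.9 * 17 = 14.3 from state 0, so the printed maxima should approach those numbers. The Q-values of rarely taken actions can remain inaccurate even when the greedy policy is already optimal, which is the effect the assignment asks you to watch for.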


You can evaluate your agent offline or by uploading your experiment file to the OpenAI server using a GitHub account. The latter will generate a learning curve (reward/steps vs. episodes), which is indicative of performance. The OpenAI server will indicate whether your implementation has solved the domain, that is, found an optimal policy. Note that all evaluations uploaded to the OpenAI server are publicly accessible, so please do not attach your code as a gist write-up to these evaluations.
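For offline evaluation, a learning curve can be approximated by smoothing the per-episode returns. The sketch below is one simple way to do that, assuming you have recorded the total reward of each episode in a list; the function name `moving_average` and the window size are illustrative choices.

```python
def moving_average(returns, window=100):
    """Trailing mean: entry i averages up to the last `window` returns ending at i."""
    averages, running = [], 0.0
    for i, episode_return in enumerate(returns):
        running += episode_return
        if i >= window:
            running -= returns[i - window]  # drop the value leaving the window
        averages.append(running / min(i + 1, window))
    return averages


print(moving_average([1, 2, 3, 4], window=2))  # [1.0, 1.5, 2.5, 3.5]
```

Plotting these averages against the episode index gives a rough reward-versus-episodes curve.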




Below are the optimal Q values for 5 (state, action) pairs of the Taxi domain.

  • Q(462, 4) = -11.374402515
  • Q(398, 3) = 4.348907
  • Q(253, 0) = -0.5856821173
  • Q(377, 1) = 683
  • Q(83, 5) = -12.8232660372
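One way to use reference values like these is to measure how far a learned table is from them. The sketch below assumes the learned table can be indexed as `Q[state][action]`; the helper name `max_abs_error` and the stand-in table `learned` are hypothetical, and only a few of the listed values are copied in for illustration.

```python
def max_abs_error(Q, reference):
    """Largest absolute gap between learned and reference Q-values."""
    return max(abs(Q[s][a] - q) for (s, a), q in reference.items())


# A few (state, action) -> Q-value pairs copied from the list above.
reference = {
    (462, 4): -11.374402515,
    (253, 0): -0.5856821173,
    (83, 5): -12.8232660372,
}

# Hypothetical stand-in for a learned table: 500 states x 6 actions of zeros.
learned = {s: [0.0] * 6 for s in range(500)}
print(max_abs_error(learned, reference))  # 12.8232660372
```

An untrained all-zero table is off by the magnitude of the largest reference value; a converged agent should drive this number close to zero.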


The concepts explored in this homework are covered by:

  • Lectures
    ○ Convergence
    ○ Exploring Exploration

  • Readings
    ○ Asmuth-Littman-Zinkov-2008.pdf
    ○ (chapters 1-2)

  • hw4-1.zip