This homework follows up on the lecture about exploration algorithms, specifically the multi-armed bandit. For this assignment, you will need the basics of the bandit setting we covered in class, along with the epsilon-greedy, upper confidence bound (UCB), and Thompson sampling algorithms. If you have not already, we suggest you review the lecture notes and read about these algorithms online.
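As a quick refresher, epsilon-greedy explores a random arm with probability ε and otherwise exploits the arm with the highest estimated reward. A minimal sketch (the function name and argument layout here are illustrative, not from the provided skeleton):

```python
import random

def epsilon_greedy_action(estimates, epsilon=0.1):
    """With probability epsilon pick a uniformly random arm;
    otherwise pick the arm with the highest reward estimate."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda a: estimates[a])
```

With ε = 0 this reduces to the purely greedy policy, and with ε = 1 it matches the FullyRandom solver.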
The provided code is in the following Google Colab notebook: https://colab.research.google.com/drive/19ht5cd7CoEkotj3bWnBaGKHaSdWhObH4?usp=sharing
Make a copy of the notebook, complete it, and submit it with your homework writeup.
- Make a copy of the Colab to your Drive, then go through the skeleton code. At the very end you will find a function that is supposed to run the bandit algorithms and plot their cumulative regret over time. Complete this function and verify that it works by testing it with the two given environments and the FullyRandom solver.
Note: For a full score on this problem, the following must hold: each solver must be drawn in a different color, and each environment (Bernoulli bandit and Gaussian bandit) must appear on its own plot. Label each of the two plots, and label each line with its associated algorithm. For formatting guidance, see the given plot.
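One way to structure the plotting helper so that it meets the requirements above (distinct colors, a title per environment, labeled lines, axis labels). The function name and the dict-of-regret-curves input are assumptions for illustration; adapt them to the skeleton's actual solver interface:

```python
import matplotlib.pyplot as plt

def plot_regret(solver_regrets, title):
    """Plot cumulative regret over time, one labeled line per solver.

    solver_regrets: dict mapping solver name -> list of cumulative regret values.
    """
    plt.figure()
    for name, regret in solver_regrets.items():
        # matplotlib cycles through distinct colors automatically
        plt.plot(regret, label=name)
    plt.title(title)
    plt.xlabel("Time step")
    plt.ylabel("Cumulative regret")
    plt.legend()
    plt.show()
```

Calling this once with the Bernoulli-bandit results and once with the Gaussian-bandit results produces the two required plots.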
- Once the plotting function works, implement the EpsilonGreedy, UCB, and Thompson Sampling solvers. Make sure that running the notebook generates the two associated plots: one for the Bernoulli bandit with all the algorithms, and one for the Gaussian bandit with all the algorithms.
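For reference, the action-selection rules for the other two solvers can be sketched as below. These are standalone illustrations under common textbook parameterizations (UCB1-style bonus, Beta(1, 1) priors for the Bernoulli case), not the skeleton's class interface:

```python
import math
import random

def ucb_action(estimates, counts, t, c=2.0):
    """UCB1-style rule: mean estimate plus an exploration bonus
    that shrinks as an arm is pulled more often."""
    for a, n in enumerate(counts):
        if n == 0:  # pull each arm once before the bonus is defined
            return a
    return max(range(len(estimates)),
               key=lambda a: estimates[a] + math.sqrt(c * math.log(t) / counts[a]))

def thompson_action(successes, failures):
    """Thompson sampling for Bernoulli arms: draw one sample from each
    arm's Beta posterior and play the arm with the largest sample."""
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])
```

For the Gaussian bandit, Thompson sampling would instead sample from a Gaussian posterior over each arm's mean; the argmax-over-samples structure is the same.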
Submit a link to the completed Colab via the submission link above. Before submitting, once again make sure that the following are in order:
a) plot titles,
b) axis labels,
c) line legends.
Also, make sure link sharing is enabled so we can view and run your solution.