IE613 Assignment 1


Question 1 Consider the problem of prediction with expert advice with d = 10 experts. Assume that the losses assigned to each expert are generated according to independent Bernoulli distributions. The adversary/environment generates the loss of experts 1 to 8 according to Ber(0.5) in each round. For the 9th expert, the loss is generated according to Ber(0.5 − ∆) in each round. The losses of the 10th expert are generated according to a different Bernoulli random variable in each round: for the first T/2 rounds they are generated according to Ber(0.5 + ∆), and for the remaining T/2 rounds according to Ber(0.5 − 2∆). Take ∆ = 0.1 and T = 10^5. Generate (pseudo) regret values for different learning rates (η) for the Weighted Majority algorithm. The averages should be taken over at least 20 sample paths (more is better). Display 95% confidence intervals for each plot. Vary c in steps of size 0.2 over the given interval to get different learning rates. Implement the Weighted Majority algorithm with

    η = c·√(2 log(d)/T).
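For reference, below is a minimal NumPy sketch of one way to simulate this setup. The environment follows the description above; the function names (expert_losses, weighted_majority), the sampled-expert reading of Weighted Majority (i.e. the exponentially weighted forecaster), and the learning rate η = c√(2 log(d)/T) are my own choices and assumptions, so adjust them to whatever your course notes specify.

```python
import numpy as np

def expert_losses(T, d=10, delta=0.1, rng=None):
    """Bernoulli losses as described above: experts 1-8 ~ Ber(0.5),
    expert 9 ~ Ber(0.5 - delta), expert 10 ~ Ber(0.5 + delta) for the
    first T/2 rounds and Ber(0.5 - 2*delta) for the rest."""
    rng = np.random.default_rng() if rng is None else rng
    means = np.full((T, d), 0.5)
    means[:, 8] = 0.5 - delta
    means[: T // 2, 9] = 0.5 + delta
    means[T // 2 :, 9] = 0.5 - 2 * delta
    losses = (rng.random((T, d)) < means).astype(float)
    return losses, means

def weighted_majority(losses, eta, rng=None):
    """Exponentially weighted forecaster: sample an expert from the current
    weights, then observe the full loss vector (full information)."""
    rng = np.random.default_rng() if rng is None else rng
    T, d = losses.shape
    log_w = np.zeros(d)                  # log-weights, for numerical stability
    picks = np.empty(T, dtype=int)
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        picks[t] = rng.choice(d, p=p)    # expert followed in round t
        log_w -= eta * losses[t]         # multiplicative-weights update
    return picks
```

The pseudo regret of one such sample path can then be computed from the mean matrix and the sequence of picks exactly as in the note at the end of this assignment.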

Question 2 Consider the problem of multi-armed bandits with K = 10 arms. Assume that the losses are generated as in Question 1. For each of the following algorithms, generate the (pseudo) regret for different learning rates (η); a minimal code sketch of the three update rules is given after the list. The averages should be taken over at least 50 sample paths (more is better). Display 95% confidence intervals for each plot. Vary c in the interval [0.1, 2.1] in steps of size 0.2 to get different learning rates.

  • EXP3. Set η = c√(2 log(K)/(KT)).
  • EXP3.P. Set η = c√(2 log(K)/(KT)), β = η, γ = Kη.
  • EXP3-IX. Set η = c√(2 log(K)/(KT)), γ = η/2.
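The sketch below shows one way to cover all three update rules with a single routine. The parameter names match the bullets above, but the function name run_exp3_family and the exact loss-based forms of the EXP3.P bonus and EXP3-IX estimator are my reading of the standard textbook versions, so verify them against your course notes before relying on them.

```python
import numpy as np

def run_exp3_family(losses, eta, beta=0.0, gamma=0.0, ix=0.0, rng=None):
    """One sample path on a (T, K) loss matrix.
    beta = gamma = ix = 0      -> EXP3
    beta = eta, gamma = K*eta  -> EXP3.P-style variant
    ix = eta / 2               -> EXP3-IX
    Returns the sequence of pulled arms."""
    rng = np.random.default_rng() if rng is None else rng
    T, K = losses.shape
    log_w = np.zeros(K)
    arms = np.empty(T, dtype=int)
    for t in range(T):
        q = np.exp(log_w - log_w.max())
        q /= q.sum()
        p = (1.0 - gamma) * q + gamma / K               # uniform exploration (EXP3.P)
        arm = rng.choice(K, p=p)
        arms[t] = arm
        loss_hat = np.zeros(K)
        loss_hat[arm] = losses[t, arm] / (p[arm] + ix)  # importance weighting; ix biases for stability
        loss_hat -= beta / p                            # optimistic bonus (EXP3.P, loss form)
        log_w -= eta * loss_hat
    return arms
```

With η = c√(2 log(K)/(KT)), the three settings in the list then correspond to run_exp3_family(losses, eta), run_exp3_family(losses, eta, beta=eta, gamma=K*eta), and run_exp3_family(losses, eta, ix=eta/2).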

Question 3  In Question 2, which one of EXP3, EXP3.P and EXP3-IX performs better and why?

Question 4 Show that for any deterministic policy π there exists an environment ν such that R_T(π, ν) ≥ T(1 − 1/K) for T rounds and K arms.

Question 5 Suppose we had defined the regret by

$$\hat R_T = \max_{i \in [K]} \sum_{t=1}^{T} x_{ti} - \sum_{t=1}^{T} x_{t I_t},$$

where I_t is the arm chosen by the policy π in round t and x_{t I_t} is the reward observed in round t. At first sight this definition seems like the right thing because it measures what you actually care about.


Unfortunately, however, it gives the adversary too much power. Show that for any policy π (randomized or not) there exists a ν ∈ [0,1]^{K×T} such that

Question 6 Let p ∈ P_k be a probability vector and suppose X̂ : [k] × ℝ → ℝ is a function such that for all x ∈ ℝ^k, if A ∼ p, then

$$\mathbb{E}[\hat X(A, x_A)] = \sum_{i=1}^{k} p_i \hat X(i, x_i) = x_1.$$

Show there exists an a ∈ ℝ^k such that … and … .

Question 7 Suppose we have a two-armed stochastic Bernoulli bandit with µ1 = 0.5 and µ2 = 0.55. Test your implementation of EXP3 from Question 2. What happens when T = 10^5 and the sequence of rewards is x_{t1} = I{t ≤ T/4} and x_{t2} = I{t > T/4}?
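One hedged way to set up the deterministic-reward part of this test, reusing the run_exp3_family sketch from Question 2 (which expects losses, so rewards are converted via loss = 1 − reward; the variable names are again my own):

```python
import numpy as np

T = 10**5
t = np.arange(1, T + 1)
rewards = np.column_stack([(t <= T // 4).astype(float),   # arm 1 pays only in the first quarter
                           (t > T // 4).astype(float)])    # arm 2 pays afterwards
losses = 1.0 - rewards                                      # convert rewards to losses
eta = np.sqrt(2 * np.log(2) / (2 * T))                      # c = 1 with K = 2
# arms = run_exp3_family(losses, eta)   # then inspect how often arm 2 is pulled in the later rounds
```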

Submission Format and Evaluation: You should submit a report along with your code. Please zip all your files and upload via Moodle. The zipped folder should be named as YourRegistrationNo.zip, e.g. ‘154290002.zip’. The report should contain two figures: the first figure should have one plot corresponding to the algorithm in Q.1, and the other should have 3 plots, one corresponding to each algorithm in Q.2. For each figure, write a brief summary of your observations. We may also call you to a face-to-face session to explain your code.

Note: Please calculate (pseudo) regret for each algorithm in Q.2 for a given set of parameters as follows:

Let µ_{it} denote the mean of arm i in round t. Suppose an adversary generates a sequence of loss vectors and an algorithm generates a sequence of pulls I_1, …, I_T; the (pseudo) regret for this sample path is

$$\sum_{t=1}^{T} \mu_{I_t t} \;-\; \min_{i} \sum_{t=1}^{T} \mu_{i t}. \qquad (1.2)$$

Note that in this calculation we only considered the mean values of the losses, not the actual losses suffered. It is okay if this value turns out to be negative. There is no expectation over the random choices of I_t here. Now generate 20 such sample paths and take their average.
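For concreteness, a minimal sketch of this calculation in the same style as the snippets above (here `means` is assumed to be the (T, K) matrix of loss means and `arms` the sequence of pulls, as produced by the earlier sketches; the 1.96 factor is the usual normal-approximation 95% interval and only one reasonable choice):

```python
import numpy as np

def pseudo_regret(means, arms):
    """Equation (1.2): cumulative mean loss of the pulled arms minus the
    cumulative mean loss of the best fixed arm.  Uses only the means, never
    the realised losses, and may be negative on a given sample path."""
    T = means.shape[0]
    return means[np.arange(T), arms].sum() - means.sum(axis=0).min()

def average_with_ci(values, z=1.96):
    """Average over sample paths plus a normal-approximation 95% half-width."""
    values = np.asarray(values, dtype=float)
    half = z * values.std(ddof=1) / np.sqrt(len(values))
    return values.mean(), half
```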
