Description
The code in the ipynb file should do Problem 1 if you set hw7MC.algorithm = 'value'. It should do Problem 2 if you set hw7MC.algorithm = 'policy'.
Problem 1
Complete the coding of the provided ipynb file, which prices the Bermudan put option under GBM, with the same parameters as in the Excel worksheet from class (which has been posted on Canvas), using the Longstaff-Schwartz method.
Report an estimated price, based on 10000 paths.
At each exercise date, do the regression using only the paths that are in-the-money (at that specific date, so there may be different subsamples on different dates), not all of the paths.
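The procedure described above can be sketched as follows. This is a minimal illustration, not the notebook's code: the function name `lsm_bermudan_put`, the quadratic polynomial regression basis, the equally spaced exercise dates (with no exercise at time 0), and all numerical parameter values are assumptions chosen for the example. They are not the values from the Excel worksheet.

```python
import numpy as np

def lsm_bermudan_put(S0, K, r, sigma, T, n_steps, n_paths, seed=0):
    """Longstaff-Schwartz estimate of a Bermudan put price under GBM (sketch)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    disc = np.exp(-r * dt)

    # Simulate GBM at the exercise dates t_1, ..., t_N (column j is time (j+1)*dt).
    z = rng.standard_normal((n_paths, n_steps))
    log_inc = (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    X = S0 * np.exp(np.cumsum(log_inc, axis=1))

    # cash[m] = payout of path m under the current policy, valued at the current date.
    cash = np.maximum(K - X[:, -1], 0.0)

    for n in range(n_steps - 2, -1, -1):
        cash *= disc                          # discount one step back to this date
        x = X[:, n]
        payoff = np.maximum(K - x, 0.0)
        itm = payoff > 0.0                    # regress on in-the-money paths only
        if itm.sum() >= 3:
            # Assumed basis: quadratic polynomial in x, fitted by least squares.
            coeffs = np.polyfit(x[itm], cash[itm], 2)
            f_hat = np.polyval(coeffs, x[itm])
            exercise = f_hat <= payoff[itm]   # exercise when f_hat <= Payoff
            idx = np.flatnonzero(itm)[exercise]
            cash[idx] = payoff[idx]

    return disc * cash.mean()                 # discount from t_1 back to time 0

# Placeholder parameters (hypothetical, not the worksheet's).
price = lsm_bermudan_put(S0=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0,
                         n_steps=4, n_paths=10000)
print(price)
```

Out-of-the-money paths never enter the regression and are never exercised at that date, so their discounted continuation payouts simply carry backward.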
Problem 2
The Longstaff-Schwartz method can be regarded as an example of a Reinforcement Learning (RL) algorithm. It selects actions (exercise vs. continue) to try to maximize an expected reward (option payoff) that depends on the transitions of a state variable (the underlying X).
In particular, Longstaff-Schwartz takes a Value-function approach to solving the dynamic programming formulation of the Reinforcement Learning problem. It finds an estimate $\hat{f}$ (same notation as L7) of the value function for the continuation action, by using OLS regression of simulated continuation payoffs on the state variable. This estimated continuation value $\hat{f}$ is compared against the value function for the exercise action, which is just the payoff function (for example Payoff(X) = K − X in the case of a put):
If $\hat{f}_n(X_{t_n}) > \mathrm{Payoff}(X_{t_n})$ then continue to hold at time $t_n$
If $\hat{f}_n(X_{t_n}) \le \mathrm{Payoff}(X_{t_n})$ then exercise at time $t_n$
Here we will consider a different approach to RL.
In contrast to Value-function RL, another approach to Reinforcement Learning is the Policy-based approach. Rather than trying to estimate the value function (for the continuation action), it tries to more directly optimize the time-$t_n$ policy function, let's denote it Φ, which maps each X to one of two outputs: {0,1}, where 0 denotes continuing to hold, while 1 denotes stopping (exercising).
If $\Phi(X_{t_n}) = 0$ then continue to hold at time $t_n$
If $\Phi(X_{t_n}) = 1$ then exercise at time $t_n$
In the particular one-dimensional example of put pricing that we have been studying, we know what form the stopping policy function should take. In theory it should be an indicator function
$$\Phi_{c_n}(X) = \mathbf{1}_{X \le c_n}$$
where the parameter $c_n$ is a specific critical, or threshold, level of the stock price X. Below $c_n$ you should exercise, and above $c_n$ you should continue to hold the put. So, in principle, we could try to estimate the optimal threshold $c_n$ by choosing it to maximize the average, across all simulated paths, of the simulated payout resulting from the policy $\Phi_{c_n}$ at time $t_n$.
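The threshold search just described can be sketched at a single time slice. Everything here is illustrative: the function name `average_payout_hard`, the grid of candidate thresholds, and the synthetic stock prices and continuation payouts are assumptions made up for the example, not data from the assignment.

```python
import numpy as np

def average_payout_hard(c, x, continuation, K):
    """Average payout across paths when exercising exactly on the paths with x <= c."""
    exercise = x <= c
    return np.mean(np.where(exercise, K - x, continuation))

# Synthetic stand-ins for X_{t_n} and the continuation payouts (hypothetical data).
rng = np.random.default_rng(0)
K = 100.0
x = rng.uniform(70.0, 110.0, size=2000)
continuation = np.maximum(K - x * np.exp(rng.normal(0.0, 0.1, size=2000)), 0.0)

# Candidate thresholds, capped at K so that exercised paths are never out-of-the-money.
grid = np.linspace(70.0, 100.0, 61)
values = np.array([average_payout_hard(c, x, continuation, K) for c in grid])
c_hat = grid[int(np.argmax(values))]
print(c_hat, values.max())
```

The average payout changes only when c crosses one of the simulated values of X, so as a function of c it is a step function. A grid search copes with that, but a gradient-based optimizer gets no useful slope information from it.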
However, this optimization has some numerical difficulties, due to the discontinuity of this hard stopping decision function Φ, which only has two outputs {0,1}. So suppose that we optimize a smoother function, a soft stopping decision function $\varphi$ which produces outputs in the interval between 0 and 1. Let $\varphi$ have two parameters a, b (which may depend on the time slice n) and specifically let $\varphi$ be a sigmoid or logistic function of $b(X - a)$ [1]:
$$\varphi_{a,b}(X) = \frac{1}{1 + \exp(-b(X - a))}. \qquad (*)$$
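To get a feel for how the two parameters in (*) act, the following small sketch evaluates the sigmoid for increasingly negative b and compares it with a hard threshold at the same level. The function names `soft_stop` and `hard_stop_threshold` and the sample values of X, a, and b are assumptions for illustration only.

```python
import numpy as np

def soft_stop(x, a, b):
    """Soft stopping decision (*): 1 / (1 + exp(-b (x - a)))."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-b * (x - a)))

def hard_stop_threshold(x, c):
    """Hard stopping decision: 1 where x <= c, else 0."""
    return (np.asarray(x, dtype=float) <= c).astype(float)

x = np.array([80.0, 89.0, 91.0, 100.0])   # hypothetical stock prices
for b in (-0.1, -1.0, -10.0):
    print(b, np.round(soft_stop(x, a=90.0, b=b), 3))
print("hard", hard_stop_threshold(x, c=90.0))
```

With negative b the output decreases in X and crosses 0.5 exactly at X = a, so a plays the role of the threshold and |b| controls how sharp the transition is.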
For large negative $b$, $\varphi_{a,b}$ will behave similarly to $\Phi_a$, in that it is near 1 for $X < a$ and near 0 for $X > a$. But unlike the hard stopping decision function, the soft decision function $\varphi$ is more optimizer-friendly, because it varies continuously between 0 and 1. It can be interpreted as making the exercise decision randomly, with probability $\varphi_{a,b}(X)$ of exercising, and probability $1 - \varphi_{a,b}(X)$ of continuing to hold, conditional on $X$. At time $t_n$ the optimizer should optimize
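$$\max_{a,b} \; \frac{1}{M} \sum_{m=1}^{M} \Big[ \varphi_{a,b}(X^m_{t_n}) \times (K - X^m_{t_n}) + \big(1 - \varphi_{a,b}(X^m_{t_n})\big) \times (\text{Continuation payout on the } m\text{th path}) \Big]$$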
where $X^m$ denotes the $m$th simulated path. Then calculate payouts by converting this optimized soft stopping decision into a hard stopping decision by
$$\Phi(X_{t_n}) = \mathbf{1}_{\varphi_{\hat{a},\hat{b}}(X_{t_n}) \ge 0.5} \times \mathbf{1}_{\mathrm{Payoff}(X_{t_n}) > 0}$$
where $\hat{a}$ and $\hat{b}$ denote the optimized parameter values. Multiplying by $\mathbf{1}_{\mathrm{Payoff}(X_{t_n}) > 0}$ makes sure that you are not exercising OTM options. It should not be needed if your $\varphi$ has been trained correctly, but we include it as a precaution.
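A single time slice of this policy optimization might look like the sketch below. It is only an illustration under stated assumptions: the names `soft_objective` and `hard_stop` are hypothetical, the stock prices and continuation payouts are synthetic, and a coarse grid search over (a, b) stands in for whatever optimizer the notebook actually uses.

```python
import numpy as np

def soft_stop(x, a, b):
    """Soft stopping decision: 1 / (1 + exp(-b (x - a)))."""
    return 1.0 / (1.0 + np.exp(-b * (x - a)))

def soft_objective(a, b, x, continuation, K):
    """Sample average of phi * (K - x) + (1 - phi) * continuation payout."""
    phi = soft_stop(x, a, b)
    return np.mean(phi * (K - x) + (1.0 - phi) * continuation)

def hard_stop(x, a_hat, b_hat, K):
    """Hard decision: 1{phi >= 0.5} * 1{Payoff > 0}, with Payoff(x) = K - x."""
    return (soft_stop(x, a_hat, b_hat) >= 0.5) & ((K - x) > 0.0)

# Synthetic stand-ins for X_{t_n} and the continuation payouts (hypothetical data).
rng = np.random.default_rng(0)
K = 100.0
x = rng.uniform(70.0, 110.0, size=2000)
continuation = np.maximum(K - x * np.exp(rng.normal(0.0, 0.1, size=2000)), 0.0)

# Coarse grid search over (a, b); an assumed substitute for a real optimizer.
a_grid = np.linspace(75.0, 100.0, 26)
b_grid = np.array([-0.5, -1.0, -2.0, -5.0])
scores = np.array([[soft_objective(a, b, x, continuation, K) for b in b_grid]
                   for a in a_grid])
i, j = np.unravel_index(np.argmax(scores), scores.shape)
a_hat, b_hat = a_grid[i], b_grid[j]

# Convert to a hard decision and compute the resulting payouts at this date.
exercise = hard_stop(x, a_hat, b_hat, K)
payout = np.where(exercise, K - x, continuation)
print(a_hat, b_hat, payout.mean())
```

In a full backward induction, `payout` (discounted one step) would serve as the continuation payout at the preceding exercise date.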
Implement this policy optimization approach by completing the code in the ipynb file. Most of the coding is already provided.
[1] On this problem, which is simple in the sense that the exercise region in X-space is just a one-dimensional interval, a single sigmoid function (*) is sufficient to approximate the optimal stopping policy.
On harder problems, where the exercise region may be a complicated subset of a multidimensional X-space, the function (*) can be upgraded to a deep neural network.
For instance see http://jmlr.org/papers/volume20/18-232/18-232.pdf