Machine Learning 1 Week 11-Decision Trees+Random Forests Solved


Decision Trees, Random Forests, Boosting

The goal of this homework is to extend decision trees, using (1) random forests or (2) boosting. For this, we will make use of an existing decision tree implementation (available in scikit-learn) that we can then reuse for implementing the two models of interest. As a first step, we download a simple two-dimensional classification dataset: the Iris data. The following code loads the data and retains only the first two input dimensions so that the problem can be easily visualized.

In [1]:

import sklearn,sklearn.datasets

iris = sklearn.datasets.load_iris()
X,T = iris.data[:,:2],iris.target

The function plot_iris from the module utils.py takes as input a classification function mapping a data matrix (containing the two input features of each data point) to a vector representing the classification of each data point. The plot_iris function then plots the decision function superimposed on the Iris dataset. In the example below, the prediction function assigns to each data point the output 0 (corresponding to the first class, shown in red).

In [2]:

%matplotlib inline
import numpy,utils
utils.plot_iris(X,T,lambda X: numpy.dot(X,[0,0]))

Decision Trees

We now consider the decision tree classifier readily available in scikit-learn. We use the default parameters of the classifier and only specify its maximum tree depth.

In [3]:

import sklearn.tree

classifier = sklearn.tree.DecisionTreeClassifier(max_depth=5)

In order to test the prediction accuracy of the classifier, one needs to split the dataset into a training and a test set. The function utils.split achieves this by assigning a random 50% of the data for training and the remaining 50% for testing.

In [4]:
(Xtrain,Ttrain),(Xtest,Ttest) = utils.split(X,T)
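As a side note, a minimal sketch of what such a splitting helper might look like is shown below. This is not the actual implementation in utils.py; it simply assumes that the data is shuffled with a given seed and then cut in half.

import numpy

def split(X, T, seed=0):
    # Shuffle the samples reproducibly, then assign half to training and half to testing
    rng = numpy.random.RandomState(seed)
    idx = rng.permutation(len(X))
    n = len(X) // 2
    return (X[idx[:n]], T[idx[:n]]), (X[idx[n:]], T[idx[n:]])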

Once the splitting is done, the training data can be used to fit the classifier. The learned prediction function and the test data are then sent to the Iris plotting function to visualize the classifier.

In [5]:

classifier.fit(Xtrain,Ttrain)
utils.plot_iris(Xtest,Ttest,classifier.predict)

Here, the classifier does a reasonable job at classifying the data, although the decision boundaries are a bit too rectangular, and somewhat unnatural.

Random Forest Classifier (30 P)

We would now like to compare the decision boundary of the decision tree with the one obtained with a random forest classifier. We consider a random forest composed of 100 trees. Each tree is trained on a 50% subset of the training set. (Hint: the function utils.split can be called with seeds from 0 to 100 in order to build these random subsets.) The prediction function should implement majority voting between the trees of the forest. Voting ties do not need to be handled in a particular way.

Implement the fit and predict functions of the random forest classifier below.

In [6]:

class RandomForestClassifier:

    def __init__(self):
        self.trees = [sklearn.tree.DecisionTreeClassifier(max_depth=5)
                      for _ in range(100)]

    def fit(self,X,y):
        # -------------------------------
        # TODO: replace by your solution
        # -------------------------------
        import solutions
        solutions.rfcfit(self,X,y)
        # -------------------------------

    def predict(self,X):
        # -------------------------------
        # TODO: replace by your solution
        # -------------------------------
        import solutions
        return solutions.rfcpredict(self,X)
        # -------------------------------
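A minimal sketch of how fit and predict could be implemented is given below. It is not the official solution; it assumes that utils.split accepts a seed argument so that each tree sees a different random half of the training data, and it uses a simple bincount-based majority vote.

import numpy, utils

def rfc_fit(self, X, y):
    # Train each tree on its own random 50% subset of the training data
    for i, tree in enumerate(self.trees):
        (Xsub, ysub), _ = utils.split(X, y, seed=i)  # assumed seed parameter
        tree.fit(Xsub, ysub)

def rfc_predict(self, X):
    # Stack the per-tree predictions into an array of shape (nb_trees, nb_samples)
    P = numpy.array([tree.predict(X) for tree in self.trees]).astype(int)
    # Majority vote for each sample (ties are resolved arbitrarily by argmax)
    return numpy.array([numpy.bincount(p).argmax() for p in P.T])

The bodies of these two functions could be pasted into the fit and predict methods above, in place of the calls to the solutions module.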

The code below runs the random forest classifier on the same dataset as before.

In [7]:
cl = RandomForestClassifier()
(Xtrain,Ttrain),(Xtest,Ttest) = utils.split(X,T)
cl.fit(Xtrain,Ttrain)
utils.plot_iris(Xtest,Ttest,cl.predict)

Unlike the decision boundary obtained by a single decision tree, the random forest tends to produce more curved and natural-looking decision functions.

Quantitative Experiments

We now focus on understanding more quantitatively how the model accuracy is affected by the choice of model and its parameters. For this, we switch to the regression case, and consider two different datasets also available in scikit-learn: the Boston dataset and the diabetes dataset.

In [8]:

boston   = sklearn.datasets.load_boston()
diabetes = sklearn.datasets.load_diabetes()

The file utils.py provides a method benchmark that tests the performance of a model on 100 different train/test splits and returns the average training and test performance scores. For regression tasks, the performance score is the R2 coefficient of determination (see https://en.wikipedia.org/wiki/Coefficient_of_determination). A score of 1 is optimal; a score of 0 corresponds to essentially random guessing.
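For reference, here is a hypothetical sketch of what such a benchmark helper could look like. This is not the code shipped in utils.py; it assumes that utils.split accepts a seed argument and that the dataset is a scikit-learn Bunch with .data and .target attributes.

import numpy, utils

def benchmark(model, dataset, nb_splits=100):
    # Average the R^2 score of the model over many random train/test splits
    strain, stest = [], []
    for seed in range(nb_splits):
        (Xtr, Ttr), (Xte, Tte) = utils.split(dataset.data, dataset.target, seed=seed)  # assumed seed parameter
        model.fit(Xtr, Ttr)
        strain.append(model.score(Xtr, Ttr))  # R^2 on the training split
        stest.append(model.score(Xte, Tte))   # R^2 on the test split
    return numpy.mean(strain), numpy.mean(stest)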

In [9]:

regressor = sklearn.tree.DecisionTreeRegressor(max_depth=5)
strain,stest = utils.benchmark(regressor,boston)
print('training: %.3f | test score: %.3f'%(strain,stest))

training: 0.938 | test score: 0.731

In the example above, the test data is predicted fairly well, with a coefficient of determination above 0.7. Furthermore, we can investigate the effect of depth on the decision tree:

In [10]:

for d in range(1,10):
    regressor = sklearn.tree.DecisionTreeRegressor(max_depth=d)
    strain,stest = utils.benchmark(regressor,boston)
    print('depth: %d | training score: %.3f | test score: %.3f'%
                                                 (d,strain,stest))
depth: 1 | training score: 0.479 | test score: 0.382
depth: 2 | training score: 0.717 | test score: 0.630
depth: 3 | training score: 0.835 | test score: 0.684
depth: 4 | training score: 0.904 | test score: 0.719
depth: 5 | training score: 0.938 | test score: 0.723
depth: 6 | training score: 0.962 | test score: 0.722
depth: 7 | training score: 0.976 | test score: 0.720
depth: 8 | training score: 0.986 | test score: 0.715
depth: 9 | training score: 0.992 | test score: 0.709

Although the training score keeps increasing, the test score saturates once a depth of 5 has been reached. The same experiment can be performed on the diabetes dataset:

In [11]:

for d in range(1,10):
    regressor = sklearn.tree.DecisionTreeRegressor(max_depth=d)
    strain,stest = utils.benchmark(regressor,diabetes)
    print('depth: %d | training score: %.3f | test score: %.3f'%
                                                  (d,strain,stest))
depth: 1 | training score: 0.319 | test score: 0.220
depth: 2 | training score: 0.462 | test score: 0.334
depth: 3 | training score: 0.557 | test score: 0.315
depth: 4 | training score: 0.649 | test score: 0.253
depth: 5 | training score: 0.739 | test score: 0.184
depth: 6 | training score: 0.820 | test score: 0.107
depth: 7 | training score: 0.884 | test score: 0.046
depth: 8 | training score: 0.930 | test score: -0.004
depth: 9 | training score: 0.960 | test score: -0.040

Here, the best depth is just 2, and the model quality seriously degrades as we continue growing the tree. This is the result of overfitting: as we fit the model more closely to the data (bias reduction), we also become highly sensitive to noise in the data and in the sampling process (variance increase).

Implementing a Random Forest Regressor (30 P)

One way of reducing variance is to average a large number of models. This is the idea of random forests. Here, we consider a random forest regressor. As for the random forest classifier, each tree is grown on a random subset of the training set containing only half of the examples. As in the first exercise, the function utils.split can be used to generate these subsets. Because we are now implementing a regression model, we replace the majority voting by a simple averaging of the predictions of the different trees. The implementation below inherits some useful methods from the class sklearn.base.RegressorMixin, in particular the function score, which measures the coefficient of determination and therefore does not need to be reimplemented.

Implement the fit and predict functions of the random forest regressor below.

In [12]:
class RandomForestRegressor(sklearn.base.RegressorMixin):

    def __init__(self,max_depth=None,nb_trees=10):
        self.trees = [sklearn.tree.DecisionTreeRegressor(max_depth=max_depth)
                      for _ in range(nb_trees)]

    def fit(self,X,y):
        # -------------------------------
        # TODO: replace by your solution
        # -------------------------------
        import solutions
        solutions.rfrfit(self,X,y)
        # -------------------------------

    def predict(self,X):
        # -------------------------------
        # TODO: replace by your solution
        # -------------------------------
        import solutions
        return solutions.rfrpredict(self,X)
        # -------------------------------
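A possible sketch of the regression variant is shown below (again not the official solution, and again assuming a seed argument for utils.split). The only conceptual change compared to the classifier is that the per-tree predictions are averaged instead of voted on.

import numpy, utils

def rfr_fit(self, X, y):
    # Train each tree on its own random 50% subset of the training data
    for i, tree in enumerate(self.trees):
        (Xsub, ysub), _ = utils.split(X, y, seed=i)  # assumed seed parameter
        tree.fit(Xsub, ysub)
    return self

def rfr_predict(self, X):
    # Average the real-valued predictions of all trees
    return numpy.mean([tree.predict(X) for tree in self.trees], axis=0)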

To check whether the random forest brings an improvement over the simple decision tree algorithm, we select the best decision tree obtained so far (d=7) and compare its accuracy to our random forest regressor. Here, because of the averaging effect of the random forest, we can afford higher depths, for example d=9. The code below tests the performance of random forests with an increasing number of trees.

In [13]:

# Benchmark for baseline decision tree model

regressor = sklearn.tree.DecisionTreeRegressor(max_depth=7)
strain,stest = utils.benchmark(regressor,boston)
print("decision tree (optimal depth): | train: %.3f | test: %.3f"%
      (strain,stest))

# Benchmark for the random forest model with a growing number of trees

for nb_trees in [1,2,4,8,16,32]:
    regressor = RandomForestRegressor(max_depth=9,nb_trees=nb_trees)
    strain,stest = utils.benchmark(regressor,boston)
    print("random forest with %2d tree(s): | train: %.3f | test: %.3f"%
          (nb_trees,strain,stest))
decision tree (optimal depth): | train: 0.976 | test: 0.718
random forest with  1 tree(s): | train: 0.808 | test: 0.635
random forest with  2 tree(s): | train: 0.885 | test: 0.738
random forest with  4 tree(s): | train: 0.919 | test: 0.790
random forest with  8 tree(s): | train: 0.937 | test: 0.814
random forest with 16 tree(s): | train: 0.947 | test: 0.827
random forest with 32 tree(s): | train: 0.951 | test: 0.830

As can be observed from the results above, the test scores of the random forest are much better. Due to their high performance, random forests are often used in practical applications.

Implementing a Simple Boosted Tree Regressor (40 P)

Another extension of the simple decision tree regressor is the boosted tree regressor. Here, instead of averaging a large number of trees grown from randomly sampled data, the extra trees serve to predict what the previous trees failed to predict, i.e. the residual error. Technically, the variant of the boosted tree regressor we consider here is defined as follows:

Let $F_k(x) = f_1(x) + f_2(x) + \dots + f_k(x)$ be the prediction of a boosted regressor with $k$ trees, and let $y(x)$ be the ground truth function. The next boosted regressor adds an additional decision tree $f_{k+1}(x)$ trained on the residual function $r(x) = y(x) - F_k(x)$, and the resulting boosted regressor becomes $F_{k+1}(x) = f_1(x) + f_2(x) + \dots + f_k(x) + f_{k+1}(x)$.

Implement the methods fit and predict of the simple boosted regression tree below.

In [14]:
class SimpleBoostedTreeRegressor(sklearn.base.RegressorMixin):

    def __init__(self,max_depth=None,nb_trees=10):
        self.trees = [sklearn.tree.DecisionTreeRegressor(max_depth=max_depth)
                      for _ in range(nb_trees)]

    def fit(self,X,y):
        # -------------------------------
        # TODO: replace by your solution
        # -------------------------------
        import solutions
        solutions.btrfit(self,X,y)
        # -------------------------------

    def predict(self,X):
        # -------------------------------
        # TODO: replace by your solution
        # -------------------------------
        import solutions
        return solutions.btrpredict(self,X)
        # -------------------------------
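A minimal sketch of the boosting procedure defined above is given below (not the official solution). Each tree is fit on the current residual y(x) - F_k(x), and the final prediction is the sum of all tree predictions.

import numpy

def btr_fit(self, X, y):
    # Fit each tree on the residual left over by the trees before it
    residual = numpy.array(y, dtype=float)
    for tree in self.trees:
        tree.fit(X, residual)
        residual = residual - tree.predict(X)  # update the residual for the next tree
    return self

def btr_predict(self, X):
    # The boosted prediction F_k(x) is the sum of the individual tree predictions
    return numpy.sum([tree.predict(X) for tree in self.trees], axis=0)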

The code below compares the boosted tree regressor to the simple decision tree on the diabetes dataset. Here, we use a depth of 2 for the decision tree, which yields maximum accuracy on this dataset. As boosting allows complex decision functions to be grown from weak regressors, we set the maximum tree depth of the boosted model to 1.

In [15]:

# Benchmark for baseline decision tree model

regressor = sklearn.tree.DecisionTreeRegressor(max_depth=2)
strain,stest = utils.benchmark(regressor,diabetes)
print("decision tree (optimal depth): | train: %.3f | test: %.3f"%
      (strain,stest))

# Benchmark for the boosted tree regressor model with a growing number of trees

for nb_trees in [1,2,4,8,16,32,64]:
    regressor = SimpleBoostedTreeRegressor(max_depth=1,nb_trees=nb_trees)
    strain,stest = utils.benchmark(regressor,diabetes)
    print("boosting with %2d trees(s): | train: %.3f | test: %.3f"%
          (nb_trees,strain,stest))
decision tree (optimal depth): | train: 0.462 | test: 0.334
boosting with  1 trees(s): | train: 0.319 | test: 0.220
boosting with  2 trees(s): | train: 0.427 | test: 0.317
boosting with  4 trees(s): | train: 0.487 | test: 0.342
boosting with  8 trees(s): | train: 0.558 | test: 0.358
boosting with 16 trees(s): | train: 0.628 | test: 0.360
boosting with 32 trees(s): | train: 0.699 | test: 0.350
boosting with 64 trees(s): | train: 0.766 | test: 0.324

Like the random forest, the boosted tree regressor brings an improvement compared to the simple decision tree. Note that adding too many trees may still cause overfitting (here, a good number of trees is 16). If we wanted to include even more trees, an even weaker base model should be used, if available.

Dependency of regression performance on model complexity

Finally, we can study how the performance of each model depends on the tree depth. In this last experiment, the number of trees in the random forest and in the boosted model is kept fixed, and the tree depth is varied. Experiments are performed for all datasets and algorithms, and the results are shown as plots.

In [ ]:

depths     = [1,2,3,4,5,6,7,8]
datasets   = [boston,diabetes]
names      = ['boston','diabetes']
algorithms = [sklearn.tree.DecisionTreeRegressor,
              RandomForestRegressor,
              SimpleBoostedTreeRegressor]

from matplotlib import pyplot as plt

for dataset,name in zip(datasets,names):
    plt.figure()
    plt.title(name)
    for algorithm in algorithms:
        acc = [utils.benchmark(algorithm(max_depth=i),dataset)[1] for i in depths]
        plt.plot(depths,acc,'o-',label=algorithm.__name__)
    plt.grid(True)
    plt.xlabel('tree depth')
    plt.ylabel('coefficient of determination')
    plt.legend(loc='lower right')
    plt.show()

It can be observed that the random forest method tends to prefer deep trees. Indeed, the variance increase caused by deeper trees is countered by the averaging mechanism. Conversely, the boosting algorithm prefers small trees as it is able to build complex models even from simple weak regressors.
