The objective of this assignment is for you to experiment with supervised learning using random forests. We will focus on the classification task. The following explains what you need to implement.
You need to first implement a decision tree. Your implementation will include the induction part (for building the tree from the training data) and the inference part (for classifying new samples). We will use only datasets with real-valued attributes. Use CART as the base tree, which is a binary tree. No pruning is required.
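As an illustration of the two parts, here is one possible Python representation of a binary CART node together with the inference routine. This is only a sketch; the `Node` fields and function names are my own choices, not part of the assignment.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """A CART node. Internal nodes store an attribute index and a
    threshold; leaves store only a predicted class label."""
    attr: Optional[int] = None       # attribute index used for the split
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # samples with value <= threshold
    right: Optional["Node"] = None   # samples with value > threshold
    label: Optional[int] = None      # set only for leaf nodes

def predict(node: Node, x) -> int:
    """Inference: walk from the root to a leaf, comparing the sample's
    attribute value against each node's threshold."""
    while node.label is None:
        node = node.left if x[node.attr] <= node.threshold else node.right
    return node.label
```

The induction part fills in `attr` and `threshold` at each internal node using the split-selection procedure described next.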
When attempting to split a node, you need to select an attribute as well as a threshold. The process:
- To select a threshold for a given attribute:
  - Let the values of the given attribute be v1, v2, …, vn, where n is the number of samples associated with the node.
  - The easiest way is to sort the values so that v1 < v2 < … < vn. Then you only need to test the midpoint thresholds (v1+v2)/2, (v2+v3)/2, …, (vn-1+vn)/2.
  - For each candidate threshold, divide the n samples into two groups according to the threshold.
  - For each group, compute its Gini impurity G = 1 − Σk pk², where pk is the fraction of the group's samples belonging to class k (the summation is over the output classes).
  - For each threshold, compute the total remaining impurity nA·GA + nB·GB, where A and B denote the two groups of samples, nA and nB are their sizes, and GA and GB are their Gini impurities. Select the threshold with the lowest total remaining impurity; this is the threshold for that attribute.
- Repeat the above for all the attributes (only those considered when splitting the current node; see attribute bagging), and select the attribute whose best threshold gives the lowest total remaining impurity for splitting the node.
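The split-selection procedure above can be sketched in Python roughly as follows. Names such as `best_split` are illustrative, and the sketch assumes the data sit in a NumPy array with one sample per row:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 1 - sum_k p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, attr_indices):
    """Find the (attribute, threshold) pair minimizing nA*GA + nB*GB
    over the given candidate attributes."""
    best = (None, None, np.inf)  # (attribute, threshold, total impurity)
    for j in attr_indices:
        values = np.unique(X[:, j])                    # sorted, duplicates removed
        thresholds = (values[:-1] + values[1:]) / 2.0  # midpoints between neighbors
        for t in thresholds:
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            total = len(left) * gini(left) + len(right) * gini(right)
            if total < best[2]:
                best = (j, t, total)
    return best
```

Note that `np.unique` both sorts the values and removes duplicates, so midpoints between equal values are never tested.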
The next task is to build a forest. For this, you need to implement tree bagging (bagging of samples) and attribute bagging.
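A minimal sketch of the two bagging ingredients, assuming one bootstrap sample per tree and a fresh attribute subset per node split (the function names and the sqrt(d) default are my own choices, not requirements):

```python
import math
import random

def bootstrap_sample(n, rng):
    """Tree bagging: draw n indices with replacement from range(n).
    Each tree is trained on one such bootstrap sample."""
    return [rng.randrange(n) for _ in range(n)]

def attribute_subset(d, rng, m=None):
    """Attribute bagging: at each node split, consider only m of the
    d attributes, chosen without replacement. m = round(sqrt(d)) is
    a common default for classification."""
    if m is None:
        m = max(1, round(math.sqrt(d)))
    return rng.sample(range(d), m)
```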
You need to do some experiments with your trees/forest. Three datasets are provided (as text files) for you to test with. In addition, you can check out the UCI Machine Learning Repository for other datasets. (There are numerous datasets there.)
You need to divide a dataset into a training subset (for tree induction) and a validation subset (for evaluation). Be sure to do the division randomly. You can also try cross-validation, but this is not required. Report the correct classification rates for both the training and validation subsets.
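One way the random division might look; the 70/30 split here is just an example, not a requirement:

```python
import random

def train_val_split(samples, train_fraction=0.7, seed=None):
    """Randomly partition a dataset into training and validation subsets
    by shuffling the indices and cutting at the chosen fraction."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(len(samples) * train_fraction)
    train = [samples[i] for i in idx[:cut]]
    val = [samples[i] for i in idx[cut:]]
    return train, val
```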
Go to the UCI Machine Learning Repository to get datasets for your experiments. To start, choose from the following datasets, which have been used extensively in the literature: Iris, Breast Cancer, Glass, Ionosphere, Optical Recognition of Handwritten Digits, and Wine. You are not required to use all of them. You can also go through the list of hundreds of datasets there to find others to play with. In addition, I will provide two synthetic datasets, ellipse.txt and cross.txt, each with two attributes and two classes. These two are useful for visualizing the decision boundaries.
Regarding the experiments, below are a few things you can try. You are not required to try them all.
- Relative sizes of the training and validation subsets.
- Number of trees in the forest.
- Parameters used during tree induction, such as how many attributes to consider at each node splitting.
- Methods that limit a tree’s size. Examples include the minimum number of samples per node, or an upper bound on the tree’s depth.
- Comparisons of out-of-bag errors and validation-set errors.
- Extremely randomized forest: at each node split, just select an attribute at random instead of searching over all candidates.
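For the out-of-bag comparison in particular, a rough sketch of the bookkeeping might look like this. The `fit` and `predict` callables stand in for your own tree-induction and inference code:

```python
import random
from collections import Counter

def oob_error(X, y, n_trees, fit, predict, seed=0):
    """Out-of-bag error estimate: each sample is classified only by the
    trees whose bootstrap sample did NOT contain it, via majority vote."""
    rng = random.Random(seed)
    n = len(X)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]
        tree = fit([X[i] for i in bag], [y[i] for i in bag])
        for i in set(range(n)) - set(bag):     # out-of-bag samples
            votes[i][predict(tree, X[i])] += 1
    scored = [i for i in range(n) if votes[i]]
    wrong = sum(votes[i].most_common(1)[0][0] != y[i] for i in scored)
    return wrong / len(scored) if scored else float("nan")
```

This estimate can then be compared against the error on the held-out validation subset; the two should be broadly similar.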