Chapter 6 - Decision Trees

Material

I have attached a zip file with the files we will discuss at the seminar. It includes code examples (to visualize decision trees) as well as a PDF discussing topics and summarizing the chapter. The zip file can be found here.

Summary

Decision trees are a class of machine learning algorithms that are powerful predictors, especially when combined into Random Forests (see Chapter 7 - Ensemble Learning and Random Forests).

Decision trees are binary, generally unbalanced trees where each decision node is characterized by a feature $k$ and a threshold $t_k$.
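
As a concrete illustration, here is a minimal sketch (assuming scikit-learn, the library used in the book's code examples) that fits a shallow tree and prints the feature $k$ and threshold $t_k$ of each decision node:

    # Fit a shallow decision tree and inspect its decision nodes.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

    t = clf.tree_
    for node in range(t.node_count):
        if t.children_left[node] != -1:  # internal (decision) node
            print(f"node {node}: feature k={t.feature[node]}, "
                  f"threshold t_k={t.threshold[node]:.3f}")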

Impurity measurements

The cost function of a decision tree can be described by

$$J(k, t_k) = \frac{m_{left}}{m} I_{left} + \frac{m_{right}}{m} I_{right}$$

where $I_{node}$ is the impurity of the left or right child node, $m_{node}$ is the number of samples in the left or right child node, and $m$ is the total number of samples in the node being split.

For classification: $I_{node}$ can be either the Gini impurity, $G_i$, or the Shannon entropy, $H_i$.
For regression: $I_{node}$ is the mean squared error, $\mathrm{MSE}_i$.
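
To make the weighted average concrete, here is a minimal sketch (in Python, with hypothetical names) of computing $J(k, t_k)$ from the sizes and impurities of the two child nodes:

    def split_cost(m_left, I_left, m_right, I_right):
        # Weighted CART cost J(k, t_k) of a candidate split.
        m = m_left + m_right
        return (m_left / m) * I_left + (m_right / m) * I_right

    # Example: 40 samples go left with impurity 0.1, 60 go right with impurity 0.3
    print(split_cost(40, 0.1, 60, 0.3))  # 0.4 * 0.1 + 0.6 * 0.3 = 0.22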

Gini Impurity

The Gini impurity is defined as

$$G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$$

where $p_{i,k}$ is the ratio of class $k$ samples among all the training samples in the $i$-th node.
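
A small sketch (assuming NumPy) of computing the Gini impurity from the labels that reach a node:

    import numpy as np

    def gini(y):
        # p_{i,k}: ratio of each class among the samples in the node
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    print(gini(np.array([0, 0, 1, 1])))  # 0.5 (maximally impure two-class node)
    print(gini(np.array([0, 0, 0, 0])))  # 0.0 (pure node)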

Shannon Entropy

The Shannon entropy is defined as

$$H_i = - \sum_{k=1,\; p_{i,k} \neq 0}^{n} p_{i,k} \log_2\left(p_{i,k}\right)$$
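
The corresponding sketch for the entropy (again assuming NumPy), skipping classes with $p_{i,k} = 0$:

    import numpy as np

    def entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        p = p[p > 0]                    # only classes with p_{i,k} != 0
        return -np.sum(p * np.log2(p))

    print(entropy(np.array([0, 0, 1, 1])))  # 1.0 bit for a 50/50 split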

Mean Squared Error (MSE)

The mean squared error is given by

$$\begin{cases}
\mathrm{MSE}_{node} = \sum_{i \in node} \left( \hat{y}_{node} - y^{(i)} \right)^2 \\
\hat{y}_{node} = \dfrac{1}{m_{node}} \sum_{i \in node} y^{(i)}
\end{cases}$$
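
A sketch of the regression impurity, matching the definition above (note that it is the sum, not the mean, of squared deviations from the node's prediction $\hat{y}_{node}$):

    import numpy as np

    def mse_node(y):
        y_hat = y.mean()               # \hat{y}_{node}
        return np.sum((y_hat - y) ** 2)

    print(mse_node(np.array([1.0, 2.0, 3.0])))  # (2-1)^2 + (2-2)^2 + (2-3)^2 = 2.0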

 

Regularization

  • max_depth: Limits the maximum depth of the decision tree.
  • min_samples_split: Minimum number of samples a node must have before it can be split.
  • min_samples_leaf: Minimum number of samples necessary to be a leaf node.
  • min_weight_fraction_leaf: Same as min_samples_leaf, but expressed as a fraction of the total number of weighted samples.
  • max_leaf_nodes: Maximum number of leaf nodes allowed.
  • max_features: Maximum number of features evaluated for splitting a node.
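
These hyperparameter names are the ones used by scikit-learn's decision tree estimators; below is a minimal sketch of passing them (the values are arbitrary and only illustrate the interface):

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(
        max_depth=4,           # tree never grows deeper than 4 levels
        min_samples_split=10,  # a node needs at least 10 samples to be split
        min_samples_leaf=5,    # every leaf must keep at least 5 samples
        max_leaf_nodes=20,     # at most 20 leaves in total
        max_features=2,        # consider at most 2 features per split
    )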

Session agenda

  • Summary of the chapter. Slides will be provided shortly.
  • Discussion regarding when Decision trees could be a powerful tool.
  • Discussion regarding the decision tree's sensitivity to small perturbations in the training data.
  • Exercise discussions (Focus: 2, 3, 4. Extra: 7+8)

Extra

Try to reflect on the greedy heuristic used to train decision trees. Are there similar methods that are provably optimal?