Chapter 6 - Decision Trees

Material

I have attached a zip file with the files we will discuss at the seminar. It includes code examples (to visualize decision trees) as well as a PDF discussing topics and summarizing the chapter. The zip file can be found here.

Summary

Decision trees are a class of machine learning algorithms that are powerful predictors, especially when combined into Random Forests (see Chapter 7 - Ensemble Learning and Random Forests).

Decision trees are binary, generally unbalanced trees where each decision node is characterized by a feature $k$ and a threshold $t_k$.
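
As a concrete illustration, here is a minimal sketch (assuming scikit-learn, the library used in the book's code examples) that fits a shallow tree and prints the feature $k$ and threshold $t_k$ of each decision node:

    # Fit a shallow decision tree and inspect its decision nodes.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

    t = clf.tree_
    for node in range(t.node_count):
        if t.children_left[node] != -1:  # internal (decision) node
            print(f"node {node}: feature k={t.feature[node]}, "
                  f"threshold t_k={t.threshold[node]:.3f}")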

Impurity measurements

The cost function of a decision tree can be described by

$$J(k, t_k) = \frac{m_{left}}{m} I_{left} + \frac{m_{right}}{m} I_{right}$$

where $I_{node}$ is the impurity of the left or right child node, $m_{node}$ is the number of samples in the left or right child node, and $m$ is the total number of samples in the node being split.

For classification: $I_{node}$ can be either the Gini impurity, $G_i$, or the Shannon entropy, $H_i$.
For regression: $I_{node}$ is the mean squared error, $\mathrm{MSE}_i$.
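
To make the weighted average concrete, here is a minimal sketch (in Python, with hypothetical names) of computing $J(k, t_k)$ from the sizes and impurities of the two child nodes:

    def split_cost(m_left, I_left, m_right, I_right):
        # Weighted CART cost J(k, t_k) of a candidate split.
        m = m_left + m_right
        return (m_left / m) * I_left + (m_right / m) * I_right

    # Example: 40 samples go left with impurity 0.1, 60 go right with impurity 0.3
    print(split_cost(40, 0.1, 60, 0.3))  # 0.4 * 0.1 + 0.6 * 0.3 = 0.22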

Gini Impurity

The Gini impurity is defined as

$$G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$$

where $p_{i,k}$ is the ratio of class $k$ samples among all the training samples in the $i$-th node.
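
A small sketch (assuming NumPy) of computing the Gini impurity from the labels that reach a node:

    import numpy as np

    def gini(y):
        # p_{i,k}: ratio of each class among the samples in the node
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    print(gini(np.array([0, 0, 1, 1])))  # 0.5 (maximally impure two-class node)
    print(gini(np.array([0, 0, 0, 0])))  # 0.0 (pure node)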

Shannon Entropy

The Shannon entropy is defined as

$$H_i = - \sum_{k=1,\; p_{i,k} \neq 0}^{n} p_{i,k} \log_2\left(p_{i,k}\right)$$
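
The corresponding sketch for the entropy (again assuming NumPy), skipping classes with $p_{i,k} = 0$:

    import numpy as np

    def entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        p = p[p > 0]                    # only classes with p_{i,k} != 0
        return -np.sum(p * np.log2(p))

    print(entropy(np.array([0, 0, 1, 1])))  # 1.0 bit for a 50/50 split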

Mean Squared Error (MSE)

The mean squared error is given by

$$\begin{cases}
\mathrm{MSE}_{node} = \sum_{i \in node} \left( \hat{y}_{node} - y^{(i)} \right)^2 \\
\hat{y}_{node} = \dfrac{1}{m_{node}} \sum_{i \in node} y^{(i)}
\end{cases}$$
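
A sketch of the regression impurity, matching the definition above (note that it is the sum, not the mean, of squared deviations from the node's prediction $\hat{y}_{node}$):

    import numpy as np

    def mse_node(y):
        y_hat = y.mean()               # \hat{y}_{node}
        return np.sum((y_hat - y) ** 2)

    print(mse_node(np.array([1.0, 2.0, 3.0])))  # (2-1)^2 + (2-2)^2 + (2-3)^2 = 2.0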

 

Regularization

  • max_depth: Limits the maximum depth of the decision tree.
  • min_samples_split: Minimum number of samples a node must have before it can be split.
  • min_samples_leaf: Minimum number of samples necessary to be a leaf node.
  • min_weight_fraction_leaf: Same as min_samples_leaf, but expressed as a fraction of the total number of weighted samples.
  • max_leaf_nodes: Maximum number of leaf nodes allowed.
  • max_features: Maximum number of features evaluated for splitting a node.
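
These hyperparameter names are the ones used by scikit-learn's decision tree estimators; below is a minimal sketch of passing them (the values are arbitrary and only illustrate the interface):

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(
        max_depth=4,           # tree never grows deeper than 4 levels
        min_samples_split=10,  # a node needs at least 10 samples to be split
        min_samples_leaf=5,    # every leaf must keep at least 5 samples
        max_leaf_nodes=20,     # at most 20 leaves in total
        max_features=2,        # consider at most 2 features per split
    )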

Session agenda

  • Summary of the chapter. Slides will be provided shortly.
  • Discussion regarding when Decision trees could be a powerful tool.
  • Discussion regarding the decision tree's sensitivity to small perturbations in the training data.
  • Exercise discussions (Focus: 2, 3, 4. Extra: 7+8)

Extra

Try to reflect on the greedy heuristic used to train decision trees. Are there similar methods that are provably optimal?