Chapter 7 - Ensemble Learning and Random Forests
Ensemble learners
Bootstrap aggregating (bagging)
Boosting
Many weak learners can be combined into one strong learner: train many diverse models (decision trees, KNNs, logistic regressions, etc.) and combine their results, and by the law of large numbers the ensemble's accuracy ends up higher than any individual learner's (it acts like regularization and increases stability). Prerequisite: the learners' errors should be as uncorrelated as possible.
Hard voting vs. soft voting: hard = each classifier votes for a class and the majority wins; soft = average the predicted class probabilities and pick the class with the highest average (usually works better, but requires classifiers that can estimate class probabilities).
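A minimal sketch of such a voting ensemble with scikit-learn; the dataset (make_moons) and all hyperparameter values are illustrative choices, not from the notes:

```python
# Combine diverse learners with hard or soft voting.
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("dt", DecisionTreeClassifier(random_state=42)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",  # "hard" = majority class vote, "soft" = average class probabilities
)
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))
```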
Row sampling (training instances). Bagging: sampling with replacement, so the same instance can be drawn several times for the same predictor. Pasting: sampling without replacement, so no repeats. Bagging has slightly more bias than pasting (because of the resampling), but the predictors end up less correlated, so the ensemble's variance is lower. With bootstrap samples the size of the training set, only about 63% of the instances are drawn for each predictor: the probability of an instance being picked at least once is 1 − (1 − 1/m)^m, which approaches 1 − 1/e ≈ 63% as m grows.
oob (out-of-bag) evaluation: evaluate each predictor on the ~37% of training instances it never saw, instead of using a separate validation set (set oob_score=True). Bagging as a whole is used to reduce variance/overfitting.
Column sampling (features): useful when there are many features. Random patches: sampling both training instances and features. Random subspaces: sampling only features (while using the whole training set). Use it to reduce variance/overfitting at the cost of a bit more bias.
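A minimal sketch of these row- and column-sampling options with scikit-learn's BaggingClassifier; the dataset and hyperparameter values are illustrative, not from the notes:

```python
# Bagging vs. pasting, out-of-bag evaluation, and feature (column) sampling.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = load_breast_cancer(return_X_y=True)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    bootstrap=True,            # True = bagging (with replacement), False = pasting
    oob_score=True,            # out-of-bag evaluation, no separate validation set needed
    max_features=0.7,          # column sampling (random subspaces / random patches)
    bootstrap_features=False,  # True would also sample features with replacement
    n_jobs=-1,
    random_state=42,
)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)      # accuracy estimated on the instances each tree never saw
```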
Random forest: very similar to a BaggingClassifier wrapped around DecisionTreeClassifier, but optimized. The built-in version searches for the best split feature among a random subset of features at each node (controlled by max_features; the equivalent BaggingClassifier would use max_samples=1.0, i.e. bootstrap samples the size of the training set) -> reduces variance for some increase in bias. E.g. if there is only one feature that is highly correlated with the target, two trees that both rely on that feature will make the same mistakes; restricting the feature subset decorrelates the trees. Weak learners are not weighted (e.g. according to how well they predict training/validation sets) in vanilla RandomForest implementations, but some studies show that weighting can improve performance or at least stability (https://link.springer.com/article/10.1007/s00357-019-09322-8). Extremely randomized trees (Extra-Trees): same as random forest, but the splitting thresholds are also randomized.
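A minimal sketch comparing the built-in RandomForestClassifier, a roughly equivalent BaggingClassifier, and ExtraTreesClassifier; hyperparameter values are illustrative, not from the notes:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = load_breast_cancer(return_X_y=True)

# Built-in, optimized version: best split searched among a random subset of features.
rnd_clf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                 n_jobs=-1, random_state=42).fit(X_train, y_train)

# Roughly equivalent "manual" version: bagged trees with the feature subsetting
# pushed into the tree itself, and bootstrap samples the size of the training set.
bag_equiv = BaggingClassifier(DecisionTreeClassifier(max_features="sqrt"),
                              n_estimators=500, max_samples=1.0, bootstrap=True,
                              n_jobs=-1, random_state=42).fit(X_train, y_train)

# Extra-Trees: the split thresholds are randomized as well.
ext_clf = ExtraTreesClassifier(n_estimators=500, n_jobs=-1,
                               random_state=42).fit(X_train, y_train)
```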
Feature importance: a great “bonus” with RFs: how much a feature reduces impurity on average. Look at every node in the ensemble that splits on the feature and measure how much the impurity drops from that node to its children, weighted by the number of training samples reaching the node.
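A minimal sketch: after fitting, the feature_importances_ attribute holds this impurity-based importance for each feature (the iris dataset here is just an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)

# Importances sum to 1.0 across all features.
for name, score in zip(iris.feature_names, rnd_clf.feature_importances_):
    print(f"{name}: {score:.3f}")
```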
Boosting: focus on fixing what the previous learners get wrong. It has similarities with gradient descent, but instead of tweaking pre-set parameters it adds new ones (weak learners) sequentially. In AdaBoost, each weak learner gets a weight based on how well it predicts the (reweighted) training data: with weighted error rate r_j, the learner's weight is alpha_j = eta * log((1 − r_j) / r_j), so the lower the error, the more say the learner gets in the final vote.
Training-data (instance) weight updates: a simple exponential rule: instances the new learner misclassifies get their weights multiplied by exp(alpha_j) (if the error on an instance is high, make it more likely to be emphasized by the next learner), and then all instance weights are normalized.
Often trained with decision stumps (decision trees with max_depth=1).
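A minimal sketch of AdaBoost with decision stumps in scikit-learn; the dataset and hyperparameter values are illustrative, not from the notes:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # decision stump as the weak learner
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)
ada_clf.fit(X_train, y_train)
print(ada_clf.score(X_test, y_test))
```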
Opinions on importance and meaning of boosting learning rate:
“In optimization literature, the two approaches [use of different learning rates] are the same. As they both converge to optimal solution. On the other hand, in machine learning, they are not equal. Because in most cases we do not make the loss in training set to 0 which will cause over-fitting. We can think about the first approach as a "coarse level grid search", and second approach as a "fine level grid search". Second approach usually works better, but needs more computational power for more iterations.” https://stats.stackexchange.com/questions/168666/boosting-why-is-the-learning-rate-called-a-regularization-parameter
“Speaking from personal experience, the difference between using a learning rate of 1.0 and a smaller learning rate like 0.1 is that models with the larger learning rate rapidly reduce the loss, but also rapidly reach a plateau. It's very common in the neural network community to call the "step size" of a gradient descent algorithm the "learning rate". But this is not the same as the "learning rate" in gradient boosting. In gradient boosting, the "learning rate" is used to "dampen" the effect of each additional tree to the model.” https://stats.stackexchange.com/questions/354484/why-does-xgboost-have-a-learning-rate
Gradient Boosting: each new weak learner is trained to predict the pseudo-residuals (the negative gradient of the loss) of the current ensemble rather than the original targets, and the final prediction is the sum of all the weak learners' predictions. With MSE loss this amounts to gradient descent on the squared error, and the pseudo-residuals are just the ordinary residuals. Shrinkage: a low learning rate dampens the contribution of each new weak learner (NOT THE SAME THING AS THE LEARNING RATE IN DEEP LEARNING); more trees are needed, but the ensemble usually generalizes better. [vs AdaBoost]: gradient boosting is more flexible than AdaBoost since it does gradient descent on a general (differentiable) loss function, whereas AdaBoost's weighting scheme is more particular (more heuristic); AdaBoost is probably easier to understand intuitively, though. https://datascience.stackexchange.com/questions/39193/adaboost-vs-gradient-boosting
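A minimal sketch of that idea: three regression trees are fit sequentially to the residuals of the ensemble built so far, and their predictions are summed; the synthetic data and hyperparameters are illustrative. The built-in GradientBoostingRegressor at the end does the same thing and adds shrinkage via learning_rate:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

# Squared loss -> pseudo-residuals are just the plain residuals.
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)
res1 = y - tree1.predict(X)                  # residuals of the ensemble so far
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, res1)
res2 = res1 - tree2.predict(X)
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, res2)

# The ensemble prediction is the sum of all the trees' predictions.
X_new = np.array([[0.2]])
y_pred = sum(tree.predict(X_new) for tree in (tree1, tree2, tree3))

# Built-in equivalent; learning_rate < 1 is the shrinkage discussed above.
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=200,
                                 learning_rate=0.1, random_state=42).fit(X, y)
```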
XGBoost: optimizes the above by 1. a regularized objective that penalizes overly complex trees (number of leaves, size of leaf weights), 2. shrinkage of the leaf weights, 3. using second-order gradient information (a Newton-style step rather than plain gradient descent, i.e. a cleverer step size), and 4. extra randomization (row and column subsampling).
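A minimal sketch, assuming the separate xgboost package is installed; the hyperparameters shown roughly map onto the points above and their values are illustrative, not from the notes:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=200)

xgb_reg = xgb.XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,     # shrinkage
    max_depth=3,
    reg_lambda=1.0,        # L2 penalty on leaf weights (tree-complexity regularization)
    subsample=0.8,         # row subsampling per tree (extra randomization)
    colsample_bytree=0.8,  # column subsampling per tree
)
xgb_reg.fit(X, y)
```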
Stacking (stacked generalization): model blending: instead of a simple vote, train a meta-learner (the "blender") on the out-of-sample predictions of the base learners and let it produce the final prediction.
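A minimal sketch using scikit-learn's StackingClassifier; the choice of base learners, meta-learner (final_estimator), and dataset is illustrative, not from the notes:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),  # the "blender", trained on out-of-fold predictions
    cv=5,
)
stack_clf.fit(X_train, y_train)
print(stack_clf.score(X_test, y_test))
```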