Chapter 5 - Support Vector Machines
Session Agenda
- Chapter summary
- Take a quick look at Exercises 6 and 10
- Discuss example applications
Bring to the meeting: an example application of an SVM. It can be something you think up yourself, or an existing application that you've found. What makes it a good/bad use of an SVM? What are the limitations?
Chapter Summary
Support vector machines can be used for linear and nonlinear classification, regression, and outlier detection. They are well suited to complex but small- to medium-sized datasets and scale well with the number of features, especially sparse features (see p. 164 of the digital pre-print textbook).
All images from https://github.com/ageron/handson-ml2 (Apache 2.0 license).
Linear SVM Classification
A linear SVM classifier draws a straight line (more generally, a hyperplane) in feature space that separates the classes. Large margin classification aims to fit the largest possible "street" (margin) between the classes (right in the figure below).
The right image above illustrates a hard margin SVM classifier. Adding training instances "off the street" does not change the decision boundary; only new instances falling within the margin would. The instances on the edges of the margin fully determine (support) the boundary between the classes; these are the support vectors.
SVMs are sensitive to feature scales: without scaling, the decision boundary tends to neglect features with relatively small values (e.g. x0 below), so apply feature scaling (e.g. StandardScaler) first:
Hard margin classification requires all instances to be linearly separable (see left below), and a single outlier can skew the entire model (see right below), so it tends not to generalize well.
Soft margin classification lets us balance margin width (better generalization) against margin violations (training instances that end up inside the margin or on the wrong side of it). The hyperparameter C controls this trade-off: a small C means more regularization (wider street, more violations), while a large C yields a model that fits tightly to the training data (narrower street, fewer violations).
Two ways to implement this with sklearn (see the sketch after this list):
- LinearSVC(C=1, loss="hinge"): faster than SVC(kernel="linear", C=1), but does not support the kernel trick for handling nonlinear datasets efficiently.
- SGDClassifier(loss="hinge", alpha=1/(m*C)): stochastic gradient descent. Not as fast as LinearSVC, but useful for online learning and for datasets that do not fit in memory.
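A minimal sketch of the LinearSVC option, wrapped in a pipeline with feature scaling (the iris-based virginica detector and the value of C are illustrative choices, not the only reasonable ones):

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Illustrative task: detect Iris virginica from petal length and width
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                 # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)

svm_clf = Pipeline([
    ("scaler", StandardScaler()),           # SVMs are sensitive to feature scales
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)

print(svm_clf.predict([[5.5, 1.7]]))        # classify a new instance
```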
Nonlinear SVM Classification
Datasets that are not linearly separable can be handled by adding features, e.g. polynomial features or similarity features. Adding many features explicitly scales poorly, but the "kernel trick" achieves the same effect as adding them without actually creating the features, so the model does not slow down excessively. Kernelized SVC still scales poorly with the number of instances, however, so it is best for small- to medium-sized datasets.
Polynomial Features
The figure below illustrates how a polynomial feature (x₂ = (x₁)²) can be added to make the data linearly separable.
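A sketch of doing this feature engineering explicitly, here on the moons toy dataset (the dataset, degree, and C are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),   # add polynomial combinations of the inputs
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge", max_iter=10000)),
])
polynomial_svm_clf.fit(X, y)
```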
The polynomial kernel adds two hyperparameters: the polynomial degree (d below) and coef0 (r below), which controls how much the model is influenced by high-degree terms versus low-degree terms.
e.g. SVC(kernel="poly", degree=3, coef0=1, C=5)
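A sketch of the same idea via the kernel trick, so no features are actually added (data and hyperparameter values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# 3rd-degree polynomial kernel; coef0 (r) balances high- vs low-degree terms
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])
poly_kernel_svm_clf.fit(X, y)
```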
Similarity Features
Measuring how much an instance resembles a landmark in the feature space can help make nonlinear datasets linearly separable, as shown below. The Gaussian Radial Basis Function (RBF) is used to measure each instance's similarity to two landmarks, giving two new features x₂ and x₃.
The hyperparameter γ defines the sharpness of the RBF bell curve. A small γ widens the curve, making each instance's influence more diffuse and giving a more regularized model. A large γ narrows the curve, sharpening each instance's influence and giving a less regularized (tighter-fitting) model.
e.g. SVC(kernel="rbf", gamma=5, C=0.001)
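A sketch with the RBF kernel (again on the moons toy data; the gamma and C shown are just one combination worth trying, typically tuned via grid search):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    # large gamma = narrow bell curve, small C = strong regularization
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001)),
])
rbf_kernel_svm_clf.fit(X, y)
```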
SVM Regression
To use SVM for regression, the objective is reversed: instead of trying to fit the largest "street" (margin) between classes, the model tries to fit as many instances as possible within the margins. Hyperparameter ϵ defines the width of the margins.
e.g. LinearSVR(epsilon=1.5)
As with SVM classifiers, kernelized SVM models can be used for nonlinear regression.
e.g. SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
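A sketch of both regressors on small synthetic data (the generated datasets and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVR, SVR

# Noisy linear data for LinearSVR, noisy quadratic data for SVR (illustrative)
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y_lin = (4 + 3 * X + np.random.randn(100, 1)).ravel()
y_quad = (0.2 + 0.1 * X + 0.5 * X**2 + np.random.randn(100, 1) / 10).ravel()

svm_reg = LinearSVR(epsilon=1.5)          # wide epsilon-insensitive "street"
svm_reg.fit(X, y_lin)

svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y_quad)
```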