Chapter 5 - Support Vector Machines
Session Agenda
- Chapter summary
- Take a quick look at Exercises 6 and 10
- Discuss example applications
Bring to the meeting: an example application of an SVM. It can be something you think up yourself, or an existing application that you've found. What makes it a good/bad use of an SVM? What are the limitations?
Chapter Summary
Support vector machines can be used for linear and nonlinear classification, regression, and outlier detection. They are well suited to complex but small- to medium-sized datasets and scale well with the number of features, especially sparse features (see p. 164 of the digital pre-print textbook).
All images from https://github.com/ageron/handson-ml2 (Apache 2.0 license).
Linear SVM Classification
A linear SVM classifier draws a straight line (more generally, a hyperplane) in feature space that separates the classes. Large margin classification aims to fit the largest possible "street" (margin) between the classes (right in the figure below).
The right image above illustrates a hard margin SVM classifier. Adding training instances "off the street" does not change the decision boundary; only new instances falling within the margin would. The instances on the edges of the margin fully determine (support) the boundary between the classes; these are the support vectors.
SVMs are sensitive to feature scales: without scaling, the decision boundary tends to neglect features with relatively small values (e.g. x0 below), so apply feature scaling (e.g. StandardScaler) first:
Hard margin classification requires all instances to be linearly separable (see left below), and a single outlier can skew the entire model (see right below), so it tends not to generalize well.
Soft margin classification lets us balance margin width (better generalization) against margin violations (training instances that end up inside the margin or on the wrong side of it). The hyperparameter C controls this trade-off: a small C means more regularization (wider street, more violations), while a large C yields a model that fits tightly to the training data (narrower street, fewer violations).
Two ways to implement this with sklearn (see the sketch after this list):
- LinearSVC(C=1, loss="hinge"): faster than SVC(kernel="linear", C=1), but does not support the kernel trick for handling nonlinear datasets efficiently.
- SGDClassifier(loss="hinge", alpha=1/(m*C)): stochastic gradient descent. Not as fast as LinearSVC, but useful for online learning and for datasets that do not fit in memory.
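A minimal sketch of the LinearSVC option, wrapped in a pipeline with feature scaling (the iris-based virginica detector and the value of C are illustrative choices, not the only reasonable ones):

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Illustrative task: detect Iris virginica from petal length and width
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                 # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)

svm_clf = Pipeline([
    ("scaler", StandardScaler()),           # SVMs are sensitive to feature scales
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)

print(svm_clf.predict([[5.5, 1.7]]))        # classify a new instance
```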
Nonlinear SVM Classification
Datasets that are not linearly separable can be handled by adding features, e.g. polynomial features or similarity features. Adding many features explicitly scales poorly, but the "kernel trick" achieves the same effect as adding them without actually creating the features, so the model does not slow down excessively. Kernelized SVC still scales poorly with the number of instances, however, so it is best for small- to medium-sized datasets.
Polynomial Features
The figure below illustrates how a polynomial feature (x₂ = (x₁)²) can be added to make the data linearly separable.
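A sketch of doing this feature engineering explicitly, here on the moons toy dataset (the dataset, degree, and C are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),   # add polynomial combinations of the inputs
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge", max_iter=10000)),
])
polynomial_svm_clf.fit(X, y)
```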
The polynomial kernel adds two hyperparameters: the polynomial degree (d below) and coef0 (r below), which controls how much the model is influenced by high-degree terms versus low-degree terms.
e.g. SVC(kernel="poly", degree=3, coef0=1, C=5)
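A sketch of the same idea via the kernel trick, so no features are actually added (data and hyperparameter values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# 3rd-degree polynomial kernel; coef0 (r) balances high- vs low-degree terms
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])
poly_kernel_svm_clf.fit(X, y)
```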
Similarity Features
Measuring how much an instance resembles a landmark in the feature space can help make nonlinear datasets linearly separable, as shown below. The Gaussian Radial Basis Function (RBF) is used to measure each instance's similarity to two landmarks, giving two new features x₂ and x₃.
The hyperparameter γ defines the sharpness of the RBF bell curve. A small γ widens the curve, making each instance's influence more diffuse and giving a more regularized model. A large γ narrows the curve, sharpening each instance's influence and giving a less regularized (tighter-fitting) model.
e.g. SVC(kernel="rbf", gamma=5, C=0.001)
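A sketch with the RBF kernel (again on the moons toy data; the gamma and C shown are just one combination worth trying, typically tuned via grid search):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    # large gamma = narrow bell curve, small C = strong regularization
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001)),
])
rbf_kernel_svm_clf.fit(X, y)
```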
SVM Regression
To use SVM for regression, the objective is reversed: instead of trying to fit the largest "street" (margin) between classes, the model tries to fit as many instances as possible within the margins. Hyperparameter ϵ defines the width of the margins.
e.g. LinearSVR(epsilon=1.5)
As with SVM classifiers, kernelized SVM models can be used for nonlinear regression.
e.g. SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
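A sketch of both regressors on small synthetic data (the generated datasets and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVR, SVR

# Noisy linear data for LinearSVR, noisy quadratic data for SVR (illustrative)
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y_lin = (4 + 3 * X + np.random.randn(100, 1)).ravel()
y_quad = (0.2 + 0.1 * X + 0.5 * X**2 + np.random.randn(100, 1) / 10).ravel()

svm_reg = LinearSVR(epsilon=1.5)          # wide epsilon-insensitive "street"
svm_reg.fit(X, y_lin)

svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y_quad)
```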