Chapter 9 - Unsupervised learning techniques

Responsible for the session: Carolina and Bo

 

Chapter summary and additional resources

1. Clustering

  • Clustering - group similar instances into clusters (there is no universal definition of what a cluster is: it depends on the context) - used for customer segmentation, data analysis, dimensionality reduction, anomaly detection (outlier detection), semi-supervised learning, search engines, image segmentation, etc.
  • Anomaly detection - learn what normal data looks like, then flag instances that deviate from it
  • Density estimation - estimating the PDF of the random process that generated the dataset - used for anomaly detection (low-density regions are likely anomalies), data analysis and visualization

K-means (proposed 1957)

Description of algorithm: Specify the number of clusters k that the algorithm must find. It finds the centroids of each cluster iteratively. It does not behave well when the “data blobs” have very different diameters (it cares only about the distance to the cluster centroids) and may converge to a local minimum (depending on the centroid initialization). The latter can be somewhat tackled through the hyperparameter n_init (default 10, meaning the best of 10 random initializations is kept), and further improvements are found in K-means++.

See http://shabal.in/visuals/kmeans/1.html for a visualization of the K-means algorithm.
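A minimal sketch of fitting K-means with scikit-learn (the library and the synthetic blob data are assumptions for illustration, not from the chapter notes):

```python
# Minimal K-means sketch, assuming scikit-learn and synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# n_init controls how many random centroid initializations are tried;
# the run with the lowest inertia is kept, which mitigates bad local minima.
# init="k-means++" spreads out the initial centroids (scikit-learn's default).
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # the learned centroids
print(kmeans.inertia_)          # sum of squared distances to the closest centroid
```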

Finding the optimal number of clusters:

  • plot the inertia as a function of k and choose the “elbow value” of k.
  • use the silhouette score (the mean silhouette coefficient over all instances): an instance’s silhouette coefficient is (b-a)/max(a,b), where a is the mean distance to the other instances in the same cluster and b is the mean distance to the instances of the next closest cluster. A sketch of both approaches follows below.
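A sketch comparing the two criteria for several values of k (this reuses the synthetic X from the K-means snippet above, which is an assumption):

```python
# Elbow method (inertia) vs. silhouette score for a range of k values.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={sil:.3f}")

# Inertia always decreases as k grows, so look for the "elbow" where it flattens;
# the silhouette score typically peaks near a good choice of k.
```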

DBSCAN

Description of algorithm: counts the instances within an instance’s ε-neighborhood. The instance is considered a core instance if it has at least min_samples instances in its ε-neighborhood. All instances in the neighborhood of a core instance belong to the same cluster (a long sequence of neighboring core instances forms a single cluster). Any instance that is not a core instance and has no core instance in its neighborhood is considered an anomaly (labeled -1). Works well if the clusters are dense enough and well separated.
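A minimal DBSCAN sketch with scikit-learn (the “two moons” toy data and the eps/min_samples values are illustrative assumptions):

```python
# DBSCAN sketch: eps is the neighborhood radius, min_samples the core-instance threshold.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X_moons, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X_moons)

print(set(db.labels_))               # cluster indices; -1 marks anomalies
print(len(db.core_sample_indices_))  # number of core instances found
```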

2. Gaussian Mixtures 

A Gaussian mixture has a distribution of the form $\sum_i p_i \, N(\mu_i, \Sigma_i)$, where the $p_i$ are positive numbers summing to 1 and can be interpreted as probabilities. To estimate the parameters $p_i, \mu_i, \Sigma_i$, iterative algorithms similar to those described for clustering are used.
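A minimal sketch of fitting a Gaussian mixture with EM in scikit-learn (the synthetic data and the choice of 3 components are assumptions for illustration):

```python
# Gaussian mixture fitted with the EM algorithm (scikit-learn's GaussianMixture).
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X_gm, _ = make_blobs(n_samples=500, centers=3, random_state=42)

gm = GaussianMixture(n_components=3, n_init=10, random_state=42).fit(X_gm)

print(gm.weights_)                  # the mixture weights p_i
print(gm.means_)                    # the means mu_i
print(gm.predict_proba(X_gm[:3]))   # soft (probabilistic) cluster assignments
```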

 

[Figure from Wikipedia: comparison of k-means and EM on artificial data, visualized with ELKI. Using the variances, the EM algorithm can describe the normal distributions exactly, while k-means splits the data into Voronoi cells. The cluster center is indicated by the lighter, bigger symbol.]

An explanation and the math behind the so-called EM algorithm (expectation-maximization) can be found on Wikipedia.

There are some criteria that can be used for guessing the number of mixture components. The BIC and AIC are mathematically motivated ways of avoiding overfitting when choosing the number p of parameters in a model when you have m data points. Here $\hat L$ is the maximized value of the likelihood function of the model on the data.

\begin{align*}
BIC &= \log(m)\,p - 2\log(\hat{L}) \\
AIC &= 2p - 2\log(\hat{L})
\end{align*}

For further information I recommend the Wikipedia pages on the Bayesian Information Criterion and the Akaike Information Criterion.
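A sketch of using the BIC/AIC to pick the number of components (scikit-learn's GaussianMixture exposes bic() and aic(); reusing X_gm from the snippet above is an assumption):

```python
# Model selection for a Gaussian mixture via BIC and AIC.
from sklearn.mixture import GaussianMixture

for n in range(1, 7):
    gm = GaussianMixture(n_components=n, n_init=10, random_state=42).fit(X_gm)
    print(f"n={n}  BIC={gm.bic(X_gm):.1f}  AIC={gm.aic(X_gm):.1f}")

# The n with the lowest BIC (or AIC) is preferred; BIC penalizes extra
# parameters more heavily than AIC once m is larger than about 7 (log(m) > 2).
```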

 

3. Word2Vec (not in book)

Word2vec is a way to find a vector space representation for words given a text corpus. 

Word2vec description and fun observations

Explore word analogies, word2vec illustration

Hinton explains the idea in this video (the rest are available here)

Lecture 4.1 — Learning to predict the next word [Neural Networks for Machine Learning]

 

If you want a code example for training, this Gensim word2vec tutorial seems to be a good start; a minimal sketch is also included below.

There is also a Word2vec tutorial in TensorFlow.
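A minimal training sketch with Gensim (this assumes Gensim 4.x, where the embedding dimension argument is called vector_size; the toy corpus is made up):

```python
# Word2vec sketch with Gensim: learn word vectors from a tiny made-up corpus.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sleeps", "on", "the", "mat"],
]

# sg=1 selects the skip-gram architecture; window is the context size around each word.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["king"].shape)                 # the learned 50-dimensional vector
print(model.wv.most_similar("king", topn=3))  # nearest neighbours in the vector space
```

A real corpus (millions of sentences) is needed before analogies like king - man + woman ≈ queen start to appear.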

Session agenda and exercises

The chapter summary will be interlaced with the following:

1. Questions and tasks on clustering - to discuss at session

Q: Can you prove why the K-means algorithm is guaranteed to converge in a finite number of steps? 

Q: Is a high silhouette score good or bad? Explain why.

Task: choose a picture of your liking and perform image segmentation.

Q: Out of all the clustering algorithms mentioned in the chapter, which one would you use if you have a large dataset and computational time is of great importance?

2. Questions and tasks on Gaussian mixtures - to discuss at session

Q: What kind of convergence is guaranteed (for the likelihood L) when the EM algorithm is applied to estimate a Gaussian mixture? Local or global? Monotone?

Q: What is the number p in the BIC and AIC for a Gaussian mixture with n components?

Task: Try out different numbers n of Gaussian mixture components for the Iris data set, and use the BIC to find the best n.

3. Word2vec

Q: When do you think it is a good idea to use an unsupervised learning technique, such as word2vec, as the first layer in your network, and when do you think it is better to train everything from scratch?

 

4. Any questions regarding the ML-competition/project