Chapter 8 - Dimensionality reduction
SVD
- Singular-value decomposition. Any matrix can be factorized as A = UΣV∗, where U and V are orthogonal (in general, unitary) matrices and Σ is a diagonal matrix with non-negative entries. For this session it is important to relate matrix multiplication to data transformation. Consider two independent uniformly distributed random variables X1, X2, and define X = [X1, X2]⊤, sampled and plotted below.
What does AX = UΣV∗X do? Specifically, what does each of the matrices U, Σ and V∗ do when applied to this dataset?
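To build intuition, here is a minimal sketch in numpy (the matrix A below is an arbitrary example, not one given in the text): it samples the uniform data, computes the SVD of A, and applies V∗, then Σ, then U one factor at a time, so each intermediate dataset can be plotted and compared.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.random((2000, 2)) - 0.5        # two independent uniform variables, centred at the origin

A = np.array([[2.0, 1.0],
              [0.5, 1.5]])             # arbitrary example matrix
U, s, Vt = np.linalg.svd(A)            # A = U @ np.diag(s) @ Vt

X1 = X @ Vt.T                          # apply V*: a rotation/reflection
X2 = X1 * s                            # apply Σ: stretch each axis by a singular value
X3 = X2 @ U.T                          # apply U: another rotation/reflection; X3 equals X @ A.T

for data, title in [(X, "X"), (X1, "V*X"), (X2, "ΣV*X"), (X3, "UΣV*X")]:
    plt.figure()
    plt.scatter(data[:, 0], data[:, 1], s=2)
    plt.title(title)
    plt.axis("equal")
plt.show()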
PCA
- SVD on Gaussian Variables
I believe this is quite well covered in the book, but we will discuss a couple of things. If you work with sampled signals or pictures, numpy.linalg's SVD algorithm cannot be applied naively. Why? Run the following code:
import numpy as np
X = np.random.rand(100000, 2)   # 100000 samples of two signals
np.linalg.svd(X)                # by default this builds the full 100000 x 100000 matrix U
You will most likely get
MemoryError: Unable to allocate 74.5 GiB for an array with shape (100000, 100000) and data type float64
Is it an unreasonable request to want to perform SVD on this extremely tall matrix? Not really: say you want to whiten a couple of signals; 100000 samples is not that much. Is there a workaround?
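One possible workaround, sketched below: ask numpy for the reduced ("economy") SVD, so the enormous square U is never formed. The last two lines show one way to use it for whitening.
import numpy as np

X = np.random.rand(100000, 2)

# Reduced SVD: U has shape (100000, 2) instead of (100000, 100000)
U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)

# Whitened signals: decorrelated and rescaled to unit variance
X_white = U * np.sqrt(X.shape[0] - 1)
print(np.cov(X_white, rowvar=False))   # approximately the identity matrix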
A *note* on incremental PCA. The original authors of the algorithm included a forgetting factor in order to prioritize new data over old data (in online settings). This is, unfortunately, not implemented in scikit-learn. Does anyone know what algorithms to use in this case?
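For reference, a minimal sketch of scikit-learn's IncrementalPCA (the version without a forgetting factor), fitted batch by batch with partial_fit; the data here is synthetic and only for illustration.
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
ipca = IncrementalPCA(n_components=2)

# Feed the data in batches instead of all at once
for _ in range(100):
    batch = rng.standard_normal((1000, 10))   # synthetic batch: 1000 samples, 10 features
    ipca.partial_fit(batch)

Z = ipca.transform(rng.standard_normal((5, 10)))   # project new samples onto the learned components
print(ipca.explained_variance_ratio_)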
ICA
Independent Component Analysis - Gaussian variables are forbidden.
ICA is not in the book. Using PCA you can whiten (decorrelate) variables. If your input data is normally distributed, it will look something like the above picture. A linear transformation of data points can reflect/rotate (around the origin), stretch, or combine these, meaning that any linear transformation of normally distributed data will still look like an ellipsoid. We see that the two random variables are negatively correlated in the above example. If the input data were instead decorrelated and normalized, it would look like a ball (circle). We can reconstruct this ball by using PCA to find the axes of highest variance, which are the semi-axes of the ellipsoid, and normalize them. The problem is that a ball is invariant under rotation, meaning we cannot separate X from AX if A is a rotation matrix. Contrast this with the uniform square from the first section, which is invariant only to rotations of 90°.
Exercise 1.
a) Sample a 2d independent uniformly distributed random variable and multiply it by a rotation matrix. The following code can be used to sample.
import numpy as np
X = np.random.rand(2000, 2) - 0.5   # 2000 samples of two independent uniform variables on [-0.5, 0.5)
Hint: A rotation around the origin by θ is of the following form
A = [[cos(θ), −sin(θ)],
     [sin(θ),  cos(θ)]].
b) On your now rotated dataset, apply a stretching matrix (i.e. a diagonal matrix with strictly positive entries).
c) Once more rotate your dataset (using a different angle than in a). What does your dataset look like?
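A minimal sketch of steps a) to c); the particular angle and stretch factors below are arbitrary choices, not prescribed by the exercise.
import numpy as np
import matplotlib.pyplot as plt

def rotation(theta):
    """2x2 rotation matrix for an angle theta (radians)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

X = np.random.rand(2000, 2) - 0.5          # centred uniform samples
X_a = X @ rotation(0.4).T                  # a) rotate
X_b = X_a @ np.diag([3.0, 0.5])            # b) stretch the axes
X_c = X_b @ rotation(1.1).T                # c) rotate again, with a different angle

plt.scatter(X_c[:, 0], X_c[:, 1], s=2)
plt.axis("equal")
plt.show()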
-----------------------------------------------------------------------------
Linear ICA is about doing precisely steps a), b) and c), but backwards, i.e. starting from a dataset, find the sources with the least dependence. Assuming sources of unit variance (you can always rescale them later), we can reverse-engineer the last two steps of the matrix multiplication with PCA, i.e. find the axes ordered from highest to least variance and normalize them. The remaining step is not so easy: we want to undo the first rotation by minimizing dependence. There is no foolproof metric for dependence between random variables, so ICA is largely about finding a good proxy for independence. Once you have such a metric, the last step becomes an optimization problem. One approach is to look at higher-order moments, for example kurtosis, but there are other ways as well.
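As an illustration of that idea, here is a minimal sketch, assuming two sources and using excess kurtosis as the (imperfect) proxy for independence: whiten with PCA/SVD, then brute-force search over rotation angles for the one that makes the components look least Gaussian. This is a toy version of the idea, not how FastICA is actually implemented.
import numpy as np
from scipy.stats import kurtosis

def whiten(X):
    """Centre the data and rescale the principal axes to unit variance."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U * np.sqrt(X.shape[0] - 1)

def rotation(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def unmix(X, n_angles=360):
    """Search for the rotation maximizing total |excess kurtosis| of the components."""
    Z = whiten(X)
    best_angle, best_score = 0.0, -np.inf
    for theta in np.linspace(0, np.pi / 2, n_angles):
        S = Z @ rotation(theta).T
        score = np.abs(kurtosis(S, axis=0)).sum()
        if score > best_score:
            best_angle, best_score = theta, score
    return Z @ rotation(best_angle).T

# Example: mix two independent uniform sources and try to recover them
rng = np.random.default_rng(0)
S_true = rng.random((2000, 2)) - 0.5
X = S_true @ np.array([[2.0, 1.0], [0.5, 1.5]]).T   # arbitrary mixing matrix
S_est = unmix(X)    # recovered up to permutation, sign and scale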
Exercise 2.
Download teach.zip and extract the two images, import them into Python, resize them to have equal dimensions and flatten them into 1d (or 2d) arrays. Then plot them in a scatter plot. Helpful code:
from PIL import Image
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
im_1 = Image.open(<path>) # To read image from file
plt.imshow(im_1, cmap = mpl.cm.binary, interpolation="nearest") # To show an image in python
im_2 = Image.open(<path>)
im_2_resized = im_2.resize(<tuple>) # Resizing
S1 = np.array(im_1) # Convert to numpy array
...
plt.plot(S1.flatten(), ...)
Challenge 1.
Download separate_this.zip and extract the two images, convert them to numpy arrays, flatten them, and, using sklearn's FastICA algorithm, separate the images; plot them and show me in class.
Clarification: The images, of shape (200, 150, 3) = (height, width, #colors), are already mixed; you only need to separate them.
Hint: When recovering the images you will have to rescale and translate them, then convert them to integers so that they again have shape (200, 150, 3) with pixels taking integer values in [0, 255].
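A possible skeleton for the challenge; here S1 and S2 are assumed to hold the two mixed images as numpy arrays, as in the helpful code above, and the variable names are only placeholders.
import numpy as np
from sklearn.decomposition import FastICA

# X has one mixed, flattened image per column: shape (n_pixels, 2)
X = np.column_stack([S1.flatten(), S2.flatten()]).astype(float)

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)             # estimated sources, shape (n_pixels, 2)

# Rescale/translate each recovered source to integer pixel values in [0, 255]
S_est -= S_est.min(axis=0)
S_est /= S_est.max(axis=0)
imgs = (255 * S_est).astype(np.uint8)

img_1 = imgs[:, 0].reshape(200, 150, 3)  # back to (height, width, #colors)
img_2 = imgs[:, 1].reshape(200, 150, 3)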
----------------------------------------------------------------------------
Sklearn has implemented the FastICA algorithm, which is nicely described in the following survey paper: https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf. However, you should know that ICA is a technology by now, meaning you can apply it without really knowing exactly how it works.
----------------------------------------------------------------------------
t-SNE and UMAP
Have a look at First million integers, laid out with UMAP.
We will discuss t-SNE and UMAP in more depth in class.
UMAP explanation
Don't miss the UMAP Audio Explorer.
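For those who want to experiment before class, a minimal sketch using scikit-learn's TSNE and the separate umap-learn package (assuming it is installed); the digits dataset is just a convenient stand-in.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import umap   # from the umap-learn package

X, y = load_digits(return_X_y=True)     # 64-dimensional images of handwritten digits

X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
X_umap = umap.UMAP(n_components=2, random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, s=3, cmap="tab10")
axes[0].set_title("t-SNE")
axes[1].scatter(X_umap[:, 0], X_umap[:, 1], c=y, s=3, cmap="tab10")
axes[1].set_title("UMAP")
plt.show()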