Chapter 14 - Deep Computer Vision Using Convolutional Neural Networks
Responsible for the session: Frida Heskebeck
Chapter summary
Basic idea
- Look at small parts of a figure to find lines in different directions.
- Look at the lines you found and combine them into shapes.
- Look at the shapes to get the full picture.
Receptive field
"The window a neuron is viewing from the layer below"
A neuron at position (i, j) in a layer is connected to the neurons at
- rows i to i + fh − 1, and
- columns j to j + fw − 1
in the layer below, where fh and fw are the height and width of the window. One can also use a stride larger than 1, and then the receptive fields are spaced out on the layer below: with strides sh and sw, the receptive field starts at row i × sh and column j × sw instead.
Example from the picture below: the red neuron is positioned at (5, 0) and has as its receptive field the neurons at rows 5–7 and columns 0–2 in the layer below (a 3 × 3 window).
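A minimal sketch of this index arithmetic in plain Python (the function name is made up for illustration):

```python
def receptive_field(i, j, fh, fw, sh=1, sw=1):
    """Rows and columns in the layer below seen by the neuron at (i, j)."""
    rows = range(i * sh, i * sh + fh)
    cols = range(j * sw, j * sw + fw)
    return list(rows), list(cols)

# The example above: neuron (5, 0) with a 3x3 window sees rows 5-7, cols 0-2.
print(receptive_field(5, 0, fh=3, fw=3))  # ([5, 6, 7], [0, 1, 2])
```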
Convolutional layers
The output of a neuron is the weighted sum of the neurons in its receptive field. The weights are called a filter. A feature map is the picture you get after using the same filter for all possible receptive fields in a layer (the picture above shows one feature map). A convolutional layer has many feature maps, meaning that it looks for many features in the layer below (it uses many different filters). To be precise: the output of a neuron is the weighted sum of the neurons in the receptive field over all feature maps in the layer below. This can be expressed with this monster (more details in the book):

$$z_{i,j,k} = b_k + \sum_{u=0}^{f_h - 1} \sum_{v=0}^{f_w - 1} \sum_{k'=0}^{f_{n'} - 1} x_{i \times s_h + u,\; j \times s_w + v,\; k'} \cdot w_{u,v,k',k}$$

where z_{i,j,k} is the output of the neuron at (i, j) in feature map k, b_k is the bias term for feature map k, and s_h, s_w are the strides.
The size of the weights for a convolutional layer is [fh, fw, fn′, fn], corresponding to: [height of filter, width of filter, number of feature maps in the layer below, number of feature maps in this layer].
The picture below illustrates that a neuron positioned at (i, j) has the same receptive field regardless of which feature map it belongs to, but applies a different filter to that receptive field depending on the feature map.
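A quick shape check (a minimal sketch assuming TensorFlow/Keras, as used in the book; sizes are arbitrary):

```python
import tensorflow as tf

# A 3x3 convolutional layer with 32 feature maps, applied to a
# 28x28 input that has 3 feature maps (channels).
conv = tf.keras.layers.Conv2D(filters=32, kernel_size=3, padding="same")
y = conv(tf.zeros([1, 28, 28, 3]))  # build the layer on a dummy batch

print(conv.kernel.shape)  # (3, 3, 3, 32), i.e. [fh, fw, fn', fn]
print(y.shape)            # (1, 28, 28, 32): one output feature map per filter
```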
Padding
What to do at the edges:
- Valid - Only look at valid data; positions where the window would stick out over the edge are skipped, so some data may be ignored.
- Same - Pad with zeros so that all data is used; with stride 1 the output then has the same dimensions as the input (see the shape check after this list).
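A minimal sketch of the two options, assuming Keras (layer sizes are arbitrary):

```python
import tensorflow as tf

x = tf.zeros([1, 28, 28, 3])
same = tf.keras.layers.Conv2D(8, 3, padding="same")(x)
valid = tf.keras.layers.Conv2D(8, 3, padding="valid")(x)
print(same.shape)   # (1, 28, 28, 8): zero-padded, same spatial size (stride 1)
print(valid.shape)  # (1, 26, 26, 8): edge positions without a full window are dropped
```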
Pooling layers
Reduce the size of the images, keep the important parts of the images, and provide some degree of invariance to small translations.
The receptive field of the layers works in the same way as before. The difference here is that instead of calculating the weighted sum, the neuron takes the maximum value of the receptive field (max pooling) or the average of the receptive field (average pooling).
The pooling is usually done on the feature maps individually.
Max pooling is used more today than average pooling. There is also a global average pooling layer: it calculates the average over each entire feature map, hence outputting as many numbers as there were feature maps in the previous layer.
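The three variants in Keras (a minimal sketch; shapes are arbitrary):

```python
import tensorflow as tf

x = tf.random.normal([1, 28, 28, 32])
print(tf.keras.layers.MaxPool2D(pool_size=2)(x).shape)        # (1, 14, 14, 32)
print(tf.keras.layers.AveragePooling2D(pool_size=2)(x).shape) # (1, 14, 14, 32)
print(tf.keras.layers.GlobalAveragePooling2D()(x).shape)      # (1, 32): one value per feature map
```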
Data augmentation
Increase the number of training instances to get more training data. Tweak the input in some way so that each instance is slightly different but still the same thing (if the picture is of a dog, the tweaked picture should still be of a dog). For pictures one can, for example, shift, rotate, resize, crop, flip, or change exposure/contrast.
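A sketch of how this can be done with the Keras preprocessing layers (the exact factors are arbitrary choices):

```python
import tensorflow as tf

# Random tweaks are applied on the fly during training; at inference time
# these layers pass the image through unchanged.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.05),  # rotate up to ~18 degrees
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomContrast(0.2),
])
```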
Architectures
The overall message for the architecture of a CNN: alternate convolutional layers with pooling layers, and end with some fully connected layers. The number of feature maps is usually doubled after each pooling layer.
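A minimal sketch of this pattern in Keras (layer sizes are illustrative, assuming 28 × 28 grayscale input and 10 classes):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(64, 3, padding="same", activation="relu",
                  input_shape=[28, 28, 1]),
    layers.MaxPool2D(),
    layers.Conv2D(128, 3, padding="same", activation="relu"),  # doubled
    layers.MaxPool2D(),
    layers.Conv2D(256, 3, padding="same", activation="relu"),  # doubled again
    layers.MaxPool2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
```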
Inception module
Several very small convolutional layers run in parallel and their outputs are then concatenated into one output. The 1 × 1 convolutional layers inside the module only look in the depth dimension of the feature maps and act as bottleneck layers, reducing the dimensionality of the feature maps (outputting fewer feature maps than their input). The inception module can be used like any other layer in your network.
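A simplified sketch of such a module with the Keras functional API (the helper name and the uniform filter count n are made up; the real GoogLeNet modules use different counts per branch):

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, n):
    # Four parallel branches; the 1x1 convolutions are the bottleneck layers.
    b1 = layers.Conv2D(n, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(n, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(n, 3, padding="same", activation="relu")(b2)
    b3 = layers.Conv2D(n, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(n, 5, padding="same", activation="relu")(b3)
    b4 = layers.MaxPool2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(n, 1, padding="same", activation="relu")(b4)
    return layers.Concatenate()([b1, b2, b3, b4])  # stack along the depth axis
```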
Skip connections
Add the input of a layer to the output of a layer a few steps ahead. This helps when some layer in between has not started learning yet: the signal can still flow through the network via the skip connection.
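A minimal residual-block sketch in Keras (hypothetical helper; real ResNet blocks also use batch normalization):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, n):
    # Assumes x already has n feature maps so the shapes match for the add.
    z = layers.Conv2D(n, 3, padding="same", activation="relu")(x)
    z = layers.Conv2D(n, 3, padding="same")(z)
    z = layers.Add()([z, x])  # the skip connection: add the input back in
    return layers.Activation("relu")(z)
```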
Depthwise separable convolutional layer
The first part looks at only one feature map at a time and uses one spatial filter per feature map. The second part is a regular convolutional layer with a 1 × 1 filter, and hence only looks across the feature maps, not spatially.
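The two parts spelled out in Keras, plus the single-layer version (shapes are arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.zeros([1, 28, 28, 32])

# The two parts as separate layers:
z = layers.DepthwiseConv2D(kernel_size=3, padding="same")(x)  # spatial, one filter per feature map
z = layers.Conv2D(64, kernel_size=1)(z)                       # 1x1, across feature maps only

# Keras also provides the combination as a single layer:
z2 = layers.SeparableConv2D(64, kernel_size=3, padding="same")(x)
```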
Fully convolutional networks
Replace the dense layers at the top of a network with convolutional layers. By doing this, the network can be used for inputs of different sizes.
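A sketch of why this works, assuming Keras (a dense layer on top of 7 × 7 feature maps becomes a 7 × 7 "valid" convolution):

```python
import tensorflow as tf
from tensorflow.keras import layers

# A Dense(10) on top of 7x7 feature maps is equivalent to a 7x7 "valid"
# convolution with 10 filters; the convolutional version also accepts
# larger inputs and then outputs a grid of predictions.
fc = layers.Conv2D(10, kernel_size=7, padding="valid", activation="softmax")
print(fc(tf.zeros([1, 7, 7, 64])).shape)    # (1, 1, 1, 10): one prediction
print(fc(tf.zeros([1, 14, 14, 64])).shape)  # (1, 8, 8, 10): one per position
```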
Applications
There are many existing pre-trained networks that one can use. One can use these for transfer learning and retrain only the top layers.
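A minimal transfer-learning sketch with one of the pretrained Keras models (Xception here; input size and number of classes are arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Freeze the pretrained base and train only the new top layers.
base = tf.keras.applications.Xception(weights="imagenet", include_top=False)
base.trainable = False

inputs = tf.keras.Input(shape=[224, 224, 3])
z = tf.keras.applications.xception.preprocess_input(inputs)
z = base(z, training=False)
z = layers.GlobalAveragePooling2D()(z)
outputs = layers.Dense(5, activation="softmax")(z)  # 5 = number of classes here
model = tf.keras.Model(inputs, outputs)
```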
Classification and localization - Find an object, mark its bounding box, and label it.
Semantic segmentation - Classify each pixel in a picture.
WaveNet - Generate human-like speech.
EEG data - Many different examples of CNNs applied to EEG data.
Many, many more....
Additional resources
- Paper where the inception module was presented (same reference as in the book): Szegedy et al., "Going Deeper with Convolutions".
- WaveNet.
- EEGNet.
- A nice overview.
- Some applications.
Session Agenda
The meeting plan:
- Go through the chapter summary and discuss that.
- Discuss the suggested exercises.
- If we have time and if you are interested, I can show you what I have done with CNNs and EEG data.
Recommended exercises
The following exercises from the book, with the stated focus:
1. What are the pros and cons of CNNs?
3. What can we do to reduce memory usage while training?
4. What is the point of pooling layers?
9. Practice building a CNN, no need to get super performance but make it run.