Chapter 15 - Processing Sequences Using RNNs and CNNs
Chapter summary
All images from https://github.com/ageron/handson-ml2 (Apache 2.0), unless otherwise specified.
Introduction
- RNN (recurrent neural network): can be used to predict the next value in a sequence
- takes input sequences of arbitrary length
- trained with backpropagation through time
- difficulties: unstable gradients, very limited short-term memory
- other nets can also handle sequences:
- deep feedforward nets: short sequences
- CNNs: even very long sequences
Recurrent neurons and layers
- so far, activations flow from in to out (feedforward)
- RNN includes feedback: a recurrent neuron takes a weighted sum of its inputs and its own output from the previous time step
- package into layers
- each neuron in the layer receives all the inputs plus all of the layer's outputs from the previous time step
- each neuron has 2 sets of weights, one for the inputs and one for the previous outputs
- considering the whole layer, we can express the weights as matrices $W_x$ and $W_y$, giving the output vector
$y_{(t)} = \phi(W_x^T x_{(t)} + W_y^T y_{(t-1)} + b)$ with activation function $\phi$ and bias vector $b$
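A tiny NumPy sketch of this formula (all names and shapes here are illustrative; it uses the row-vector convention, so x @ W_x corresponds to $W_x^T x_{(t)}$):

```python
import numpy as np

# Toy recurrent layer: 3 inputs, 5 recurrent neurons, one time step.
rng = np.random.default_rng(42)
x_t = rng.normal(size=(1, 3))       # input x_(t), batch of 1
y_prev = np.zeros((1, 5))           # previous output y_(t-1)
W_x = rng.normal(size=(3, 5))       # input weights
W_y = rng.normal(size=(5, 5))       # recurrent weights
b = np.zeros(5)                     # bias vector

# phi = tanh (a common choice, and the Keras default for SimpleRNN)
y_t = np.tanh(x_t @ W_x + y_prev @ W_y + b)
print(y_t.shape)                    # (1, 5): one output per neuron
```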
memory cells
- RNN output depends on all previous inputs: a form of memory
- a single layer can only learn short patterns (e.g. 10 steps)
- can generalize the RNN feedback as feedback of a hidden state that is not necessarily the RNN output (as in the layers we've looked at so far)
- the hidden state is a representation of the previous inputs
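- more generally (one common way to write it): $h_{(t)} = f(h_{(t-1)}, x_{(t)})$, with the output $y_{(t)}$ computed from the state and the current inputs; for the basic recurrent layer above the hidden state is just the previous output, $h_{(t)} = y_{(t)}$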
IO sequences
- sequence to sequence: take a sequence as input and output a sequence (e.g. shifted one time step forward to predict the next value). Why bother with this when only the last output contains new info? Because the net may be trained to do more than just project one step forward in time; it may involve other transformations
- sequence to vector: input sequence but ignore all outputs until the last one (e.g. feed words of a review and output a sentiment score... +1 good or -1 bad)
- vector to sequence: input a vector once (or same vector repeatedly) and have RNN output a sequence (e.g. input an image and generate sequence of words used as a caption)
- encoder-decoder: sequence to vector followed by vector to sequence, e.g. translating a sentence. Better than word-by-word translation with a sequence-to-sequence RNN, since words at the end of the sentence may affect how words at the start should be translated.
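A rough tf.keras sketch of the first two configurations (layer sizes are arbitrary); the main switch is return_sequences on the recurrent layers:

```python
from tensorflow import keras

# sequence-to-vector: only the last output of the RNN layer is kept
seq_to_vec = keras.models.Sequential([
    keras.layers.SimpleRNN(20, input_shape=[None, 1]),   # return_sequences=False
    keras.layers.Dense(1),
])

# sequence-to-sequence: one output per time step
seq_to_seq = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.TimeDistributed(keras.layers.Dense(1)),
])
```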
Training RNNs
- unroll the network through time, then use regular backpropagation: backpropagation through time (BPTT)
- the weights and biases (the same set at every time step) are tuned via backpropagation, with the gradient contributions summed over all time steps
NOTE: only one set of weights and biases is used across the unrolled RNN layer during backpropagation. The changes from each gradient step are applied cumulatively to the same W and b, which are then used at every step in the sequence during the next training iteration.
- 1) forward pass, 2) evaluate the outputs against the target values using the cost function, 3) backpropagate to update the weights and biases
- longer sequence -> deeper unrolled net -> vanishing gradients
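A stripped-down illustration of BPTT (names and shapes are invented for the example): a single SimpleRNNCell is applied at every time step, and the gradient tape accumulates the contributions from all steps into the one shared set of weights.

```python
import tensorflow as tf

cell = tf.keras.layers.SimpleRNNCell(4)        # one shared set of weights
x = tf.random.normal([8, 10, 1])               # batch of 8, 10 time steps, univariate
y_target = tf.random.normal([8, 4])

state = [tf.zeros([8, 4])]
with tf.GradientTape() as tape:
    for t in range(10):                        # unroll through time
        output, state = cell(x[:, t, :], state)
    loss = tf.reduce_mean(tf.square(output - y_target))

# one gradient per shared variable, summing contributions from every time step
grads = tape.gradient(loss, cell.trainable_variables)
```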
Source: Illustrated Guide to Recurrent Neural Networks
Forecasting a time series
follow along in Jupyter
- forecast future values or fill in missing values (impute)
- example sequence: 2 superimposed sine waves with noise
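A rough sketch of how such a toy dataset can be generated (this is not the book's exact function; the constants and names are illustrative):

```python
import numpy as np

def generate_series(batch_size, n_steps, seed=42):
    """Two superimposed sine waves with random frequency/offset, plus noise."""
    rng = np.random.default_rng(seed)
    freq1, freq2, offs1, offs2 = rng.random((4, batch_size, 1))
    t = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((t - offs1) * (freq1 * 10 + 10))      # wave 1
    series += 0.2 * np.sin((t - offs2) * (freq2 * 20 + 20))     # wave 2
    series += 0.1 * (rng.random((batch_size, n_steps)) - 0.5)   # noise
    return series[..., np.newaxis].astype(np.float32)           # add channel dim (univariate)

series = generate_series(10000, n_steps=51)
X_train, y_train = series[:7000, :50], series[:7000, -1]        # predict the 51st value
```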
baseline metrics
- zero-order hold, i.e. naive forecasting: just predict the last observed value (~0.02 MSE)
- linear regression, all time steps as input to 1 neuron (~0.004 MSE); both baselines are sketched below
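A hedged sketch of the two baselines (reusing X_train/y_train from the generation sketch above; the exact MSEs will differ from the numbers in these notes):

```python
import numpy as np
from tensorflow import keras

# 1) zero-order hold / naive forecasting: predict the last observed value
y_naive = X_train[:, -1]
naive_mse = np.mean((y_train - y_naive) ** 2)

# 2) linear regression: flatten the 50 time steps into a single Dense neuron
linear = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[50, 1]),
    keras.layers.Dense(1),
])
linear.compile(loss="mse", optimizer="adam")
linear.fit(X_train, y_train, epochs=20, verbose=0)
```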
implementing simple RNNs
- 1-neuron RNN layer (~0.01 MSE), with 3 trainable parameters (input weight, recurrent weight, bias) vs. (sequence length) + 1 for linear regression; sketched below
- input_shape=[None, 1] -> None: the sequence length does not need to be specified; 1: we're feeding in a univariate sequence
- SimpleRNN: the neuron's output is fed back as the "hidden state" input at the next time step, together with the next input in the sequence
- outputs only the last output (a vector) by default; set return_sequences=True for sequence output
- why not just use moving-average methods? With an RNN we don't have to remove trends and seasonality before making predictions
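Minimal sketch of the single-neuron RNN (assumes the X_train/y_train arrays from the sketches above):

```python
from tensorflow import keras

model = keras.models.Sequential([
    # None: any sequence length; 1: univariate input
    keras.layers.SimpleRNN(1, input_shape=[None, 1]),
])
model.compile(loss="mse", optimizer="adam")
model.fit(X_train, y_train, epochs=20, verbose=0)
model.summary()   # 3 parameters: input weight, recurrent weight, bias
```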
deep RNNs
- more neurons, more layers
- pass all outputs between layers (return_sequences=True)
- SimpleRNN layers: 20, 20, 1 (MSE ~0.003)
- the last layer only feeds back a single output; can use a plain Dense layer instead for a bit more activation-function flexibility and faster convergence (sketched below)
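Sketch of the 20-20-1 deep RNN, with a Dense output layer in place of the final SimpleRNN(1) as suggested above:

```python
from tensorflow import keras

deep_model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),      # last RNN layer returns only its final output
    keras.layers.Dense(1),
])
deep_model.compile(loss="mse", optimizer="adam")
deep_model.fit(X_train, y_train, epochs=20, verbose=0)
```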
forecasting several time steps ahead
- using the previous model, we can make several predictions in a row, shifting each prediction into our input sequence before making the next one (see the step-by-step sketch after this list); errors can accumulate, and MSE [ZOH, linreg, deep] = [0.2, 0.02, 0.03] for 10 steps ahead, so it's no better than linear regression
- keeping the sequence-to-vector RNN structure, train the model explicitly to output a sequence of 10 future values (the target dimension is 10); gives MSE ~0.008, however we're only including the last output vector in the loss function
- using a sequence-to-sequence RNN structure, at every time step the model outputs a sequence of the next ten values, so the loss accounts for the output at each time step, not just the last one (see the sequence-to-sequence sketch after this list)
- all layers use return_sequences=True, including the output layer, which is applied at each time step: use TimeDistributed(Dense(10)) instead of a plain Dense layer
- since we're interested in the MSE of the last time step only, define a custom metric to be used in the model (though the loss used for training is still plain MSE, applied at every time step)
- gives MSE ~0.006
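Step-by-step sketch of the first approach (reusing deep_model and series from the earlier sketches): predict one step, append it to the window, repeat.

```python
import numpy as np

X = series[7000:7001, :50].copy()              # one series of 50 known steps
for step in range(10):
    y_next = deep_model.predict(X[:, step:])   # one-step-ahead prediction
    X = np.concatenate([X, y_next[:, np.newaxis, :]], axis=1)

y_forecast = X[:, -10:]                        # the 10 generated steps
```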
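Sequence-to-sequence sketch with TimeDistributed and a custom last-step metric, as described in this list (the targets must be prepared with shape [batch, 50, 10], i.e. the next 10 values at every time step):

```python
from tensorflow import keras

def last_time_step_mse(y_true, y_pred):
    # report the MSE of the final time step only
    return keras.metrics.mean_squared_error(y_true[:, -1], y_pred[:, -1])

seq2seq = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10)),
])
# training loss is still plain MSE, applied at every time step
seq2seq.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
```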
Handling long sequences
- fighting the unstable gradients problem
- Tackling the short-term memory problem
LSTM
Illustrated Guide to LSTM's and GRU's: A step by step explanation
Intuitive description of the difference between cell state (long term state) and hidden state:
Let us assume we are using an LSTM model to classify a movie review as good/bad.
Just as when we read a review we keep track of the essence of what we have read so far, the hidden state keeps track of the “essence” of all the words the model has seen so far.
Now imagine we read a review that starts off saying “this movie sucks…”; no matter what is said after that, we keep track of that key beginning and factor it into our decision to classify the review as bad.
Implementing this kind of selective remembering of something that was said way back at the beginning of a sentence cannot be accomplished by the hidden state alone. It requires some additional mechanism to selectively remember/forget the past, and most importantly this capacity needs to be learnt automatically from data.
So in summary, the hidden state is the overall state of what we have seen so far, while the cell state is a selective memory of the past. The mechanisms that update both of these states are learned from data.
GRU
One problem with the LSTM cell is its complexity. In a simple RNN cell there is one set of trainable input weights per neuron, but in the LSTM there are four (the forget, input and output gates plus the candidate state), so it is a big leap in complexity and LSTMs also take longer to train. The GRU cell is a compromise with three sets of weights. In the GRU the separate cell state is removed (merged into the hidden state) and the gating is simplified: the update gate takes over the role of the forget and input gates, and the reset gate decides how much past information to forget.
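In tf.keras the LSTM and GRU layers are drop-in replacements for SimpleRNN; a sketch for the forecasting models above (hyperparameters are illustrative):

```python
from tensorflow import keras

lstm_model = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.LSTM(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10)),
])

gru_model = keras.models.Sequential([
    keras.layers.GRU(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10)),
])
```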
Using conv1D layers instead of the above RNN architectures
this article argues that the above RNN-based models should be "reconsidered" and that conv1D networks should be the go-to choice for sequence prediction:
https://arxiv.org/pdf/1803.01271.pdf
The basic building blocks above are stacked conv1D layers combined with so-called dilated convolutions, where the kernel skips inputs with a fixed step (the dilation rate), so deeper layers see a much wider stretch of the sequence. The convolution kernel is also only slid backwards in time (causal convolution), rendering the network asymmetric compared to a normal CNN. The output from each forward pass is used as input at the next time step.
Note that there are multiple publications on this issue and no consensus on best practices (it is a fast moving field). A conv1D layer is different from an RNN layer as it does not hold any memory (but can be used in a way which does - see below).
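A small sketch of one way to use a causal Conv1D layer in front of an RNN (one possible arrangement, not the paper's exact architecture): padding="causal" pads only on the left, so each output depends only on past and current inputs.

```python
from tensorflow import keras

conv_rnn = keras.models.Sequential([
    keras.layers.Conv1D(filters=20, kernel_size=4, strides=2,
                        padding="causal", input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10)),
])
# note: strides=2 halves the sequence length, so sequence-to-sequence targets
# must be downsampled to match.
```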
WaveNet
WaveNet https://arxiv.org/abs/1609.03499v2 started as a special type of sequence-to-sequence NN with stacked conv1D layers.
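A rough WaveNet-style sketch: stacked causal Conv1D layers whose dilation rates double at each layer, so the receptive field grows exponentially with depth (layer sizes are illustrative):

```python
from tensorflow import keras

wavenet_like = keras.models.Sequential()
wavenet_like.add(keras.layers.InputLayer(input_shape=[None, 1]))
for dilation_rate in (1, 2, 4, 8) * 2:
    wavenet_like.add(keras.layers.Conv1D(filters=20, kernel_size=2,
                                         padding="causal", activation="relu",
                                         dilation_rate=dilation_rate))
wavenet_like.add(keras.layers.Conv1D(filters=10, kernel_size=1))   # 10 outputs per step
wavenet_like.compile(loss="mse", optimizer="adam")
```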
Additional resources
RNN layer: first think of a normal ANN with three identical hidden layers. The output of the first one becomes the input to the next, along with the next input in the sequence, and so on. The key difference from a normal ANN is that the weights are shared between all layers. This is somewhat similar to CNNs, BUT in a CNN the weights are shared within one layer (the same kernel is applied across positions), NOT between different layers.
Session agenda
We'll start with a summary of the chapter, take a look at a simple implementation of RNNs using tf.keras, then discuss the recommended exercises.
Recommended exercises
1. Can you think of a few applications for a sequence-to-sequence RNN? What about a sequence-to-vector RNN, and a vector-to-sequence RNN?
8. Which neural network architecture could you use to classify videos?
9. Train a classification model for the SketchRNN dataset, available in TensorFlow Datasets. See https://arxiv.org/abs/1704.03477 for the SketchRNN paper. Here is an interesting article about it: https://medium.com/analytics-vidhya/analyzing-sketches-around-the-world-with-sketch-rnn-c6cbe9b5ac80