Chapter 15 - Processing Sequences Using RNNs and CNNs
Chapter summary
All images from https://github.com/ageron/handson-ml2 (Apache 2.0), unless otherwise specified.
Introduction
- RNN (recurrent neural network): can be used to predict the next value in a sequence
- takes input sequences of arbitrary length
- trained with backpropagation through time
- difficulties: unstable gradients, very limited short-term memory
- other nets can also handle sequences:
- deep feedforward nets: short sequences
- CNNs: even very long sequences
Recurrent neurons and layers
- so far, activations flow from in to out (feedforward)
- RNN includes feedback: a recurrent neuron takes a weighted sum of its inputs and its own output from the previous time step
- package into layers
- each neuron in the layer receives all the inputs plus all of the layer's outputs from the previous time step
- each neuron has 2 sets of weights, one for the inputs and one for the previous outputs
- considering the whole layer, we can express the weights as matrices $W_x$ and $W_y$, giving the output vector
$y_{(t)} = \phi(W_x^T x_{(t)} + W_y^T y_{(t-1)} + b)$ with activation function $\phi$ and bias vector $b$
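A tiny NumPy sketch of this formula (all names and shapes here are illustrative; it uses the row-vector convention, so x @ W_x corresponds to $W_x^T x_{(t)}$):

```python
import numpy as np

# Toy recurrent layer: 3 inputs, 5 recurrent neurons, one time step.
rng = np.random.default_rng(42)
x_t = rng.normal(size=(1, 3))       # input x_(t), batch of 1
y_prev = np.zeros((1, 5))           # previous output y_(t-1)
W_x = rng.normal(size=(3, 5))       # input weights
W_y = rng.normal(size=(5, 5))       # recurrent weights
b = np.zeros(5)                     # bias vector

# phi = tanh (a common choice, and the Keras default for SimpleRNN)
y_t = np.tanh(x_t @ W_x + y_prev @ W_y + b)
print(y_t.shape)                    # (1, 5): one output per neuron
```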
memory cells
- RNN output depends on all previous inputs: a form of memory
- a single layer can only learn short patterns (e.g. 10 steps)
- can generalize the RNN feedback as feedback of a hidden state that is not necessarily the RNN output (as in the layers we've looked at so far)
- the hidden state is a representation of the previous inputs
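- more generally (one common way to write it): $h_{(t)} = f(h_{(t-1)}, x_{(t)})$, with the output $y_{(t)}$ computed from the state and the current inputs; for the basic recurrent layer above the hidden state is just the previous output, $h_{(t)} = y_{(t)}$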
IO sequences
- sequence to sequence: take a sequence as input and output a sequence (e.g. shifted one time step forward to predict the next value). Why bother with this when only the last output contains new info? Because the net may be trained to do more than just project one step forward in time; it may involve other transformations
- sequence to vector: input sequence but ignore all outputs until the last one (e.g. feed words of a review and output a sentiment score... +1 good or -1 bad)
- vector to sequence: input a vector once (or same vector repeatedly) and have RNN output a sequence (e.g. input an image and generate sequence of words used as a caption)
- encoder-decoder: sequence to vector followed by vector to sequence, e.g. translating a sentence. Better than word-by-word translation with a sequence-to-sequence RNN, since words at the end of the sentence may affect how words at the start should be translated.
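A rough tf.keras sketch of the first two configurations (layer sizes are arbitrary); the main switch is return_sequences on the recurrent layers:

```python
from tensorflow import keras

# sequence-to-vector: only the last output of the RNN layer is kept
seq_to_vec = keras.models.Sequential([
    keras.layers.SimpleRNN(20, input_shape=[None, 1]),   # return_sequences=False
    keras.layers.Dense(1),
])

# sequence-to-sequence: one output per time step
seq_to_seq = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.TimeDistributed(keras.layers.Dense(1)),
])
```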
Training RNNs
- unroll the network through time, then use regular backpropagation: backpropagation through time (BPTT)
- the weights and biases (the same set at every time step) are tuned via backpropagation, with the gradient contributions summed over all time steps
NOTE: only one set of weights and biases is used across the unrolled RNN layer during backpropagation. The changes from each gradient step are applied cumulatively to the same W and b, which are then used at every step in the sequence during the next training iteration.
- 1) forward pass, 2) evaluate the outputs against the target values using the cost function, 3) backpropagate to update the weights and biases
- longer sequence -> deeper unrolled net -> vanishing gradients
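A stripped-down illustration of BPTT (names and shapes are invented for the example): a single SimpleRNNCell is applied at every time step, and the gradient tape accumulates the contributions from all steps into the one shared set of weights.

```python
import tensorflow as tf

cell = tf.keras.layers.SimpleRNNCell(4)        # one shared set of weights
x = tf.random.normal([8, 10, 1])               # batch of 8, 10 time steps, univariate
y_target = tf.random.normal([8, 4])

state = [tf.zeros([8, 4])]
with tf.GradientTape() as tape:
    for t in range(10):                        # unroll through time
        output, state = cell(x[:, t, :], state)
    loss = tf.reduce_mean(tf.square(output - y_target))

# one gradient per shared variable, summing contributions from every time step
grads = tape.gradient(loss, cell.trainable_variables)
```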
Source: Illustrated Guide to Recurrent Neural Networks
Forecasting a time series
follow along in Jupyter
- forecast future values or fill in missing values (impute)
- example sequence: 2 superimposed sine waves with noise
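A rough sketch of how such a toy dataset can be generated (this is not the book's exact function; the constants and names are illustrative):

```python
import numpy as np

def generate_series(batch_size, n_steps, seed=42):
    """Two superimposed sine waves with random frequency/offset, plus noise."""
    rng = np.random.default_rng(seed)
    freq1, freq2, offs1, offs2 = rng.random((4, batch_size, 1))
    t = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((t - offs1) * (freq1 * 10 + 10))      # wave 1
    series += 0.2 * np.sin((t - offs2) * (freq2 * 20 + 20))     # wave 2
    series += 0.1 * (rng.random((batch_size, n_steps)) - 0.5)   # noise
    return series[..., np.newaxis].astype(np.float32)           # add channel dim (univariate)

series = generate_series(10000, n_steps=51)
X_train, y_train = series[:7000, :50], series[:7000, -1]        # predict the 51st value
```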
baseline metrics
- zero-order hold, i.e. naive forecasting: just predict the last observed value (~0.02 MSE)
- linear regression, all time steps as input to 1 neuron (~0.004 MSE); both baselines are sketched below
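A hedged sketch of the two baselines (reusing X_train/y_train from the generation sketch above; the exact MSEs will differ from the numbers in these notes):

```python
import numpy as np
from tensorflow import keras

# 1) zero-order hold / naive forecasting: predict the last observed value
y_naive = X_train[:, -1]
naive_mse = np.mean((y_train - y_naive) ** 2)

# 2) linear regression: flatten the 50 time steps into a single Dense neuron
linear = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[50, 1]),
    keras.layers.Dense(1),
])
linear.compile(loss="mse", optimizer="adam")
linear.fit(X_train, y_train, epochs=20, verbose=0)
```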
implementing simple RNNs
- 1-neuron RNN layer (~0.01 MSE), with 3 trainable parameters (input weight, recurrent weight, bias) vs. (sequence length) + 1 for linear regression; sketched below
- input_shape=[None, 1] -> None: the sequence length does not need to be specified; 1: we're feeding in a univariate sequence
- SimpleRNN: the neuron's output is fed back as the "hidden state" input at the next time step, together with the next input in the sequence
- outputs only the last output (a vector) by default; set return_sequences=True for sequence output
- why not just use moving-average methods? With an RNN we don't have to remove trends and seasonality before making predictions
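Minimal sketch of the single-neuron RNN (assumes the X_train/y_train arrays from the sketches above):

```python
from tensorflow import keras

model = keras.models.Sequential([
    # None: any sequence length; 1: univariate input
    keras.layers.SimpleRNN(1, input_shape=[None, 1]),
])
model.compile(loss="mse", optimizer="adam")
model.fit(X_train, y_train, epochs=20, verbose=0)
model.summary()   # 3 parameters: input weight, recurrent weight, bias
```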
deep RNNs
- more neurons, more layers
- pass all outputs between layers (return_sequences=True)
- SimpleRNN layers: 20, 20, 1 (MSE ~0.003)
- the last layer only feeds back a single output; can use a plain Dense layer instead for a bit more activation-function flexibility and faster convergence (sketched below)
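Sketch of the 20-20-1 deep RNN, with a Dense output layer in place of the final SimpleRNN(1) as suggested above:

```python
from tensorflow import keras

deep_model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),      # last RNN layer returns only its final output
    keras.layers.Dense(1),
])
deep_model.compile(loss="mse", optimizer="adam")
deep_model.fit(X_train, y_train, epochs=20, verbose=0)
```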
forecasting several time steps ahead
- using the previous model, we can make several predictions in a row, shifting each prediction into our input sequence before making the next one (see the step-by-step sketch after this list); errors can accumulate, and MSE [ZOH, linreg, deep] = [0.2, 0.02, 0.03] for 10 steps ahead, so it's no better than linear regression
- keeping the sequence-to-vector RNN structure, train the model explicitly to output a sequence of 10 future values (the target dimension is 10); gives MSE ~0.008, however we're only including the last output vector in the loss function
- using a sequence-to-sequence RNN structure, at every time step the model outputs a sequence of the next ten values, so the loss accounts for the output at each time step, not just the last one (see the sequence-to-sequence sketch after this list)
- all layers use return_sequences=True, including the output layer, which is applied at each time step: use TimeDistributed(Dense(10)) instead of a plain Dense layer
- since we're interested in the MSE of the last time step only, define a custom metric to be used in the model (though the loss used for training is still plain MSE, applied at every time step)
- gives MSE ~0.006
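Step-by-step sketch of the first approach (reusing deep_model and series from the earlier sketches): predict one step, append it to the window, repeat.

```python
import numpy as np

X = series[7000:7001, :50].copy()              # one series of 50 known steps
for step in range(10):
    y_next = deep_model.predict(X[:, step:])   # one-step-ahead prediction
    X = np.concatenate([X, y_next[:, np.newaxis, :]], axis=1)

y_forecast = X[:, -10:]                        # the 10 generated steps
```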
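Sequence-to-sequence sketch with TimeDistributed and a custom last-step metric, as described in this list (the targets must be prepared with shape [batch, 50, 10], i.e. the next 10 values at every time step):

```python
from tensorflow import keras

def last_time_step_mse(y_true, y_pred):
    # report the MSE of the final time step only
    return keras.metrics.mean_squared_error(y_true[:, -1], y_pred[:, -1])

seq2seq = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10)),
])
# training loss is still plain MSE, applied at every time step
seq2seq.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
```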
Handling long sequences
- fighting the unstable gradients problem
- Tackling the short-term memory problem
LSTM
Illustrated Guide to LSTM's and GRU's: A step by step explanation
Intuitive description of the difference between cell state (long term state) and hidden state:
Let us assume we are using an LSTM model to classify a movie review as good/bad.
Just as when we read a review we keep track of the essence of what we have read so far, the hidden state keeps track of the “essence” of all the words the model has seen so far.
Now imagine we read a review that starts off saying “this movie sucks…”; no matter what is said after that, we keep track of that key beginning and factor it into our decision to classify the review as bad.
Implementing this kind of selective remembering of something that was said way back at the beginning of a sentence cannot be accomplished by the hidden state alone. It requires some additional mechanism to selectively remember/forget the past, and most importantly this capacity needs to be learnt automatically from data.
So in summary, the hidden state is the overall state of what we have seen so far, while the cell state is a selective memory of the past. The mechanisms that update both of these states are learned from data.
GRU
One problem with the LSTM cell is its complexity. In a simple RNN cell there is one set of trainable input weights per neuron, but in the LSTM there are four (the forget, input and output gates plus the candidate state), so it is a big leap in complexity and LSTMs also take longer to train. The GRU cell is a compromise with three sets of weights. In the GRU the separate cell state is removed (merged into the hidden state) and the gating is simplified: the update gate takes over the role of the forget and input gates, and the reset gate decides how much past information to forget.
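In tf.keras the LSTM and GRU layers are drop-in replacements for SimpleRNN; a sketch for the forecasting models above (hyperparameters are illustrative):

```python
from tensorflow import keras

lstm_model = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.LSTM(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10)),
])

gru_model = keras.models.Sequential([
    keras.layers.GRU(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10)),
])
```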
Using conv1D layers instead of the above RNN architectures
this article argues that the above RNN-based models should be "reconsidered" and that conv1D networks should be the go-to choice for sequence prediction:
https://arxiv.org/pdf/1803.01271.pdf
The basic building blocks above are stacked conv1D layers combined with so-called dilated convolutions, where the kernel skips inputs with a fixed step (the dilation rate), so deeper layers see a much wider stretch of the sequence. The convolution kernel is also only slid backwards in time (causal convolution), rendering the network asymmetric compared to a normal CNN. The output from each forward pass is used as input at the next time step.
Note that there are multiple publications on this issue and no consensus on best practices (it is a fast moving field). A conv1D layer is different from an RNN layer as it does not hold any memory (but can be used in a way which does - see below).
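A small sketch of one way to use a causal Conv1D layer in front of an RNN (one possible arrangement, not the paper's exact architecture): padding="causal" pads only on the left, so each output depends only on past and current inputs.

```python
from tensorflow import keras

conv_rnn = keras.models.Sequential([
    keras.layers.Conv1D(filters=20, kernel_size=4, strides=2,
                        padding="causal", input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10)),
])
# note: strides=2 halves the sequence length, so sequence-to-sequence targets
# must be downsampled to match.
```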
WaveNet
WaveNet https://arxiv.org/abs/1609.03499v2 started as a special type of sequence-to-sequence NN with stacked conv1D layers.
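A rough WaveNet-style sketch: stacked causal Conv1D layers whose dilation rates double at each layer, so the receptive field grows exponentially with depth (layer sizes are illustrative):

```python
from tensorflow import keras

wavenet_like = keras.models.Sequential()
wavenet_like.add(keras.layers.InputLayer(input_shape=[None, 1]))
for dilation_rate in (1, 2, 4, 8) * 2:
    wavenet_like.add(keras.layers.Conv1D(filters=20, kernel_size=2,
                                         padding="causal", activation="relu",
                                         dilation_rate=dilation_rate))
wavenet_like.add(keras.layers.Conv1D(filters=10, kernel_size=1))   # 10 outputs per step
wavenet_like.compile(loss="mse", optimizer="adam")
```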
Additional resources
RNN layer: first think of a normal ANN with three identical hidden layers. The output of the first one becomes the input to the next, along with the next input in the sequence, and so on. The key difference from a normal ANN is that the weights are shared between all layers. This is somewhat similar to CNNs, BUT in a CNN the weights are shared within one layer (the same kernel is applied across positions), NOT between different layers.
Session agenda
We'll start with a summary of the chapter, take a look at a simple implementation of RNNs using tf.keras, then discuss the recommended exercises.
Recommended exercises
1. Can you think of a few applications for a sequence-to-sequence RNN? What about a sequence-to-vector RNN, and a vector-to-sequence RNN?
8. Which neural network architecture could you use to classify videos?
9. Train a classification model for the SketchRNN dataset, available in TensorFlow Datasets. See https://arxiv.org/abs/1704.03477 for the SketchRNN paper. Here is an interesting article about it: https://medium.com/analytics-vidhya/analyzing-sketches-around-the-world-with-sketch-rnn-c6cbe9b5ac80