Chapter 16 - Natural Language Processing with RNNs and Attention

Responsible for the session: Pex Tufvesson

 

Note: The third release of O'Reilly's book "Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow" was published in November 2019. A year has passed since then, and in addition to the NLP models the book refers to, such as OpenAI's GPT-2 model (updated in November 2019 as a 1.5B-parameter model with source code) and BERT, we should keep an eye on at least OpenAI's GPT-3 (175B parameters).

 

Chapter summary

Your task is to read chapter 16 in the book, then read my summary below. When we meet, let's discuss if you agree with me or not on how I see things! I have made two experiments based on the theories in this chapter:

  • Generating Swedish proverbs using character RNNs, aka "are you drunk?"
  • Translating English idioms to Swedish using sentence embeddings (from the Sentiment Analysis section), aka "Better than Google Translate"

Enjoy! / Pex

 

Generating Text Using a Character RNN

In his blog post The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy trains an RNN on Paul Graham essays, Shakespearean text, Wikipedia articles, the LaTeX source of an algebraic geometry text and the entire Linux source code.

He used Lua and Torch for the task, but there is also a fully working 100-line bare-metal vanilla Python implementation. A faster and more readable Lua/Torch implementation has been written as well.

Results? Plain text, LaTeX code and C code that would fool you into believing it's the real thing unless you take a closer look.

The book then implements these ideas using TensorFlow / Keras / Scikit-Learn. It's a matter of grabbing the input text, tokenizing it, and splitting it into a 90% training, 5% validation and 5% test set.

To make gradient descent happy, the input text is windowed and shuffled. The goal is to produce the next character after a window:

Figure 16.1: windowing of the input text

The RNN handles each character as a one-hot encoded vector: 39 elements (one per distinct character in the book's corpus), where exactly one element is supposed to be a '1'.
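For reference, here is roughly how the book builds that windowed dataset with tf.data (a sketch assuming encoded is the text as integer character ids, and that train_size, n_steps, window_length = n_steps + 1, batch_size and max_id are defined as in the book's notebook):

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
dataset = dataset.window(window_length, shift=1, drop_remainder=True)   # sliding windows
dataset = dataset.flat_map(lambda window: window.batch(window_length))  # nested datasets -> flat tensors
dataset = dataset.shuffle(10000).batch(batch_size)                      # shuffle to keep gradient descent happy
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))  # inputs, targets = next characters
dataset = dataset.map(lambda X, Y: (tf.one_hot(X, depth=max_id), Y))      # one-hot encode the inputs
dataset = dataset.prefetch(1)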

Stateless RNN

The model they've chosen is:

model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.GRU(128, return_sequences=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax"))
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
history = model.fit(dataset, epochs=20)

...and training takes a couple of hours.

The resulting text generator lets you choose two things:

  1. The seed. The initial text fed to the RNN - the first few characters that set the tone of the generated text to come.
  2. The temperature. The allowed randomness. A temperature of 0 always picks the next character as the one with the highest probability from the final softmax layer. The higher you raise the temperature, the more likely the generator is to pick some other character, sampled from the probability vector produced by that softmax layer.

Choosing a low temperature will likely push the RNN into an endless loop, repeating the same characters over and over again.
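To make the temperature knob concrete, here is a sketch of the sampling step, close to the book's helper functions (it assumes the trained model, the character tokenizer and a preprocess() helper that tokenizes and one-hot encodes a seed string):

import tensorflow as tf

def next_char(text, temperature=1.0):
    X_new = preprocess([text])                              # tokenize + one-hot encode the seed
    y_proba = model.predict(X_new)[0, -1:, :]               # probability of each possible next character
    rescaled_logits = tf.math.log(y_proba) / temperature    # low temperature -> sharper distribution
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]

def complete_text(text, n_chars=100, temperature=1.0):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text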

 

Let the fun begin! aka: Are you drunk?

The course is Hands-On, so let's get started: I want to generate text myself. As you may or may not know, I am the webmaster of the largest Swedish proverb collection at https://livet.se/ord - which has been online since 2005. I grabbed the latest MySQL dump from the site and ran

mysql -uroot -p -Dpex.livet -s -e"SELECT prov_se FROM ord_proverbs WHERE NOT ((prov_se = '') || (prov_se = '_'));" > mysql_log.txt

...which results in a text file with 3,146 kB of Swedish text from 42,318 Swedish proverbs.

I will not paste the code here, since it's quite long. Total training time on an Intel Core i9 laptop was 30 hours. Training time on an AMD Ryzen 9 3900XT was 60 hours.

I did use an NVIDIA 1080 Ti GPU for training, and it took 1h45m. However, TensorFlow's fast cuDNN GPU kernel for GRU layers only kicks in if certain constraints are met: recurrent_dropout was supposed to be set to 0.2, but the cuDNN implementation requires recurrent_dropout = 0, so I had to drop it to get the speed-up. You can read more about that here: https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU

Before doing anything clever, the characters in the source text were:

!"#%&'()*+,-./0123456789:;<=>?[\]_`abcdefghijklmnopqrstuvwxyz|}~ ¤¥¨«­´¶·»½¿àãäåæçèéëíïôöøüý–—‘’”…

However, to keep the number of tokens down, I replaced or removed some characters. All characters in the source text, after this replacement:

!"#%&'()*+,-./0123456789:;<=>?[\]_`abcdefghijklmnopqrstuvwxyz|}~ «­´ãäåéöøü

Using these chars, tokenizing a string like "Skulle gå ud etter öl?" becomes: [[7, 14, 22, 9, 9, 3, 1, 13, 21, 1, 22, 10, 1, 3, 5, 5, 3, 6, 1, 23, 9, 34]] ...and then back to "['skulle gå ud efter öl?']" again. There are 77 unique tokens in my encoding.
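The tokenization itself is just Keras' character-level Tokenizer, roughly like this (a sketch, where full_text stands for the concatenated proverb file):

from tensorflow import keras

tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)  # character-level, lowercases by default
tokenizer.fit_on_texts([full_text])
max_id = len(tokenizer.word_index)                               # number of distinct tokens (77 here)

encoded = tokenizer.texts_to_sequences(["Skulle gå ud etter öl?"])
decoded = tokenizer.sequences_to_texts(encoded)                  # back to (lower-cased) text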

The neural network I used was

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
gru (GRU) (None, None, 128) 79488
_________________________________________________________________
gru_1 (GRU) (None, None, 128) 99072
_________________________________________________________________
time_distributed (TimeDistri (None, None, 77) 9933
=================================================================
Total params: 188,493
Trainable params: 188,493
Non-trainable params: 0
_________________________________________________________________


As a first test, I predicted one character to complete the sentence "hon är kvinn". It chose "o", which seems fair. I was aiming for "a", but I should be happy with an "o" as well.

Playing around with the temperature leads to results like:


### Using seed_text='e' t=0.200000,
generated_text='en av det som ska vara som för att spela sig själv.
beskull: en person som ska vara stora problem och'

### Using seed_text='e' t=1.000000,
generated_text='er framgång skulle titta jord för att många nationer förstår att göra.
uttami: elemerings pirsom att '

### Using seed_text='e' t=2.000000,
generated_text='ers djug: göds vnamt scekjofin.
scojöbuke: hnviextmåb.. utmansata advinuten. så lämj "naskel: klok.
v'


For the rest of my generated proverbs, I chose 0.3 as the temperature, and I have manually removed all text after the first "." character.
The generated proverbs, when using the letters "a"-"k" as seeds, are:

'a stället av stället som kan göra det som för att man är beskyller det att man ska bli värd.'
'ber som för att man inte har till dem.'
'cken med en stor förstånd.'
'de för att de ska bli valje som ska man inte vara som den man som gör någon in för att spela en skill'
'finner sig själva.'
'g som man inte har till den som ska på det som inte behöver honom.'
'h som man inte har till den som ska på det som inte behöver honom.'
'inte med en stor första känniskor som gör en vidskepelse.'
'ja för att de ska bli valje som ska man inte vara som den man som gör någon in för att spela en skill'
'ka för min själv.'


To get texts that make more sense, I used longer seeds:


'En sommar ' => 'En sommar som gör aldrig något som de ska vara så länge ett genom att skydda den som inte har till den som ska'
'Som man ' => 'Som man inte har gjort någonting.
fisk för att man ska vara sälja som talar om det som lever något annat än '
'Tidigare ' => 'Tidigare är det inte skillnaden mellan två saken som vill ha ett litet till den som inte har till den.
man sk'
'Man tar ' => 'Man tar sig själva.
det är bete vädret som gör det som är betänga som talar om det som lever något.
det är b'
'Utanför ' => 'Utanför att man inte har till den som är betende men inte betänga att ta andra att skilja sig av att glömma '
'I Lund ' => 'I Lund för att man inte har till folk som gör det som är betänga som talar om det som lever något.
det är b'
'Tvinga ' => 'Tvinga som gör det som för tiden för att spela sinnet för att skilja sig en största första första graden.
v'
'Efteråt ' => 'Efteråt som gör det som för tiden för att spela sinnet för att skilja sig en största första första graden.
v'
'Liggande ' => 'Liggande som gör det som för tiden för att spela sinnet för att skilja sig en största första första graden.
v'
'Tango ' => 'Tango sig själva.
det är bete vädret som gör det som är betänga som talar om det som lever något.
det är b'
'En Dans, ' => 'En Dans, men det är betenare att vara för att man inte har gått ut som till den.
en stor man är en stor som v'
'Fredag ' => 'Fredag som gör aldrig något som de ska vara så länge ett genom att skydda den som inte har till den som ska'
'Vi ska gå ' => 'Vi ska gå i filmen.
det är bete som finns i spillning av att betrakta sig en av det med som ska avskilla sig a'
'Spännande ' => 'Spännande springer och skillnaden med vatten på det som inte behöver honom.
det är inte samma som man ska beta'
'Tomten ' => 'Tomten som gör aldrig något som de ska vara så länge ett genom att skydda den som inte har till den som ska'
'Lucia ' => 'Lucia som de första vänner.
det finns inget som får inte betraktar att ge det som för mycket som ger sig e'
'En dvärg ' => 'En dvärg och skydda sig själva.
kvinnor skulle hålla vilja som inte har till dem.
det är bete när du har till'
'Med svärd ' => 'Med svärd för att man inte har till folk som gör det som är betänga som talar om det som lever något.
det är b'
'Jul ' => 'Jul som gör det som för tiden för att spela sinnet för att skilja sig en största första första graden.
v'
'Glögg ' => 'Glögg i hemma.
konsten att tro att de inte har fått en man som man inte har till den.
man ska vara skillna'
'Slut ' => 'Slut är det som kan göra sig själv.
det är bete när det är en sallan som inte har för mycket stor skapar '

 

Looking at texts coming out of a badly trained RNN like this actually helps you identify what's wrong with texts coming out of a well-trained neural network like GPT-2 or GPT-3. The same kind of "I have no clue what this text is supposed to say" is present in more advanced text generators as well. However, they make fewer grammatical errors.

I was forced to use recurrent_dropout = 0.0 instead of 0.2. Regardless, creating sentences from two GRU layers may not be the ideal way of inventing proverbial masterpieces.

 

Stateful RNN

To get generated text that makes more sense, the RNN needs to keep its state between consecutive training batches, instead of resetting it for every window (the state is then only reset at the start of each epoch). The new model code becomes:

model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, stateful=True, dropout=0.2,
                     recurrent_dropout=0.2, batch_input_shape=[batch_size, None, max_id]),
    keras.layers.GRU(128, return_sequences=True, stateful=True, dropout=0.2,
                     recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax"))
])

...and the windows cannot be shuffled as wildly as before, since the GRU layers keep their state between windows: each batch must continue exactly where the previous one left off. The upside is that the state is meaningful for every character the network gets as input, speeding up training. The downside is that batching and handling of the input text needs to be done more carefully.
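A sketch of that more careful handling, essentially the book's recipe (assuming the same encoded, train_size, n_steps, window_length and max_id as before): the windows must be consecutive and non-overlapping, they are not shuffled, and the hidden state is reset at the start of each epoch.

dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)  # non-overlapping windows
dataset = dataset.flat_map(lambda window: window.batch(window_length))
dataset = dataset.batch(1)                                                   # simplest case: batch_size = 1
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(lambda X, Y: (tf.one_hot(X, depth=max_id), Y))
dataset = dataset.prefetch(1)

class ResetStatesCallback(keras.callbacks.Callback):
    # The state should only carry over within an epoch, not from one epoch to the next:
    def on_epoch_begin(self, epoch, logs=None):
        self.model.reset_states()

history = model.fit(dataset, epochs=50, callbacks=[ResetStatesCallback()])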

 

Sentiment Analysis

To be able to "understand" text, or at least "be able to draw some conclusion based on text", we need to group characters together into words. Tokenizing words by separating them by the space character is the naïve approach, but we can do better.

Stemming

Identifying words using the space character is OK for English and many other written languages that use spaces between words, but not all languages work this way. Chinese does not use spaces between words, Vietnamese uses spaces even within words, and languages such as German often glue multiple words together without spaces. Swedish concatenates words and adds suffixes more often than English does. Think of "car" vs. "the car" vs. "the car's" and the Swedish equivalents "bil", "bilen" and "bilens".

The process of tokenizing words has many solutions, but you'll do better at analysing texts when using stemming: the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.
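The book's examples don't actually apply stemming, but as a quick illustration, NLTK ships a Snowball stemmer that handles both English and Swedish (a sketch; the exact output depends on the stemmer version):

from nltk.stem.snowball import SnowballStemmer

for word in ["car", "cars", "car's"]:
    print(word, "->", SnowballStemmer("english").stem(word))

for word in ["bil", "bilen", "bilens"]:
    print(word, "->", SnowballStemmer("swedish").stem(word))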

Stemming has been around since roughly 1960. A more recent tokenization technique mentioned in the book is unsupervised tokenization at the subword level, for instance subword regularization, a method from 2018.

The IMDb reviews dataset

The book walks through how to map highly subjective movie-review texts onto the two categories "positive" and "negative" using the IMDb reviews dataset. It contains 50,000 movie reviews together with their corresponding binary positivity labels.

You can read the detailed description of how this is done in the book. Briefly, they grab the first 300 characters of each review, split them into words, and pad each review to a fixed length with a "padding" token. It may seem like the wrong order of doing things, but they then build a vocabulary of the 10,000 most frequent words and tokenize everything into those 10,000 "known words" plus 1,000 out-of-vocabulary buckets.
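In code, the book's preprocessing looks roughly like this (a sketch; X_batch is a batch of raw review strings):

import tensorflow as tf

def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)                      # keep only the first 300 characters
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")  # strip HTML line breaks
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")  # keep letters and apostrophes
    X_batch = tf.strings.split(X_batch)                               # split into words on whitespace
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch         # pad to the longest review in the batch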

Embeddings

Since 11,000 features is far too many for a one-hot binary representation, they use embeddings instead. An embedding is a trainable dense vector that represents a category. Since these embeddings are trainable, they gradually improve during training; and if two embeddings represent fairly similar categories, gradient descent will push them closer together. The number of dimensions of the embedding is a hyperparameter to tweak.

The idea of using vectors to represent words dates back to the 1960s, and many sophisticated techniques have been used to generate useful vectors, including neural networks. But things really took off in 2013, when Tomáš Mikolov and other Google researchers published a paper describing an efficient technique to learn word embeddings using neural networks, significantly outperforming previous attempts.

As a rule of thumb, embeddings typically have 10 to 300 dimensions, depending on the task.
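In the book's sentiment model, such a trainable Embedding layer is simply the first layer of the network; roughly (with vocab_size = 10,000 known words and num_oov_buckets = 1,000):

embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           input_shape=[None]),              # word id -> trainable dense vector
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")               # positive / negative
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])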

To improve their small IMDb example, they use the pre-trained "nnlm-en-dim50" sentence-embedding module, which is already packed with knowledge: it has learned to group synonyms close together by training on a 7-billion-word corpus.

It feels like they are cheating big time. But I guess we should see pre-trained embeddings as just another tool in the toolbox.
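Using such a pre-trained sentence embedding takes only a few lines with TensorFlow Hub; a sketch close to the book's example:

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow import keras

model = keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
                   dtype=tf.string, input_shape=[], output_shape=[50]),  # whole sentence -> 50-dim vector
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])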

 

Let the fun begin! aka: Better than Google Translate.

The course is about getting our hands dirty, so let's get started. I happen to collect proverbs, and one particular class of proverbs is idioms. They are really difficult to translate, since the words you see in the sentence are often not at all related to its actual meaning. For instance, take these examples:

That's a close shave. -> Det var nära ögat.
That's another cup of tea. -> Det är en annan femma.
You look like a dying duck in a thunderstorm. -> Du ser ut som om du sålt smöret och tappat pengarna.
Put that in your pipe and smoke it! -> Där fick du så du teg!
Nothing ventured, nothing gained. -> Friskt vågat, hälften vunnet.
Make a mountain out of a molehill. -> Göra en höna av en fjäder.
Have a finger in the pie. -> Ha ett finger med i spelet.
Give somebody the benefit of the doubt. -> Hellre fria än fälla.
I smell a rat. -> Här ligger en hund begraven.
Like a bolt from the blue. -> Som en blixt från en klar himmel.

Putting these into Google Translate gives us:

That's a close shave. -> Det är en nära rakning.
That's another cup of tea. -> Det är ännu en kopp te.
You look like a dying duck in a thunderstorm. -> Du ser ut som en döende anka i åskväder.
Put that in your pipe and smoke it! -> Lägg det i röret och rök det!
Nothing ventured, nothing gained. -> Inget vågat ingenting vunnit. ***
Make a mountain out of a molehill. -> Gör ett berg av en mullvad.
Have a finger in the pie. -> Ha ett finger i pajen.
Give somebody the benefit of the doubt. -> Ge någon fördelen av tvivel.
I smell a rat -> jag känner lukten av en råtta.
Like a bolt from the blue. -> Som en bult från det blå.

...and it's obvious that one of the world's richest companies doesn't know a thing about idioms. At least not when it comes to translating them into Swedish.

*** Google is actually boasting about having "verified translations" that are clearly dead wrong:

Screenshot (2020-12-05): Google Translate marking the incorrect translation as a "verified translation"

Can we do better than Google - with the help of Google? Let's find out:

We could for instance take a look at tokenizing whole sentences. Google has published their Universal Sentence Encoder. It will, for instance, let you compute a textual-similarity matrix for sentences like this:

Figure: semantic textual-similarity matrix for a set of example sentences

...how can we utilize this for translating English idioms to Swedish? I have a list of translated idioms at https://livet.se/ord/källa/Idiom - there are 1,890 "perfect" translations there (for a certain definition of perfect). Let's see if we can get hold of these in a way that's more manageable for Python than a webpage.

sudo mysql -uroot -p -Dpex.livet -s -e"SELECT prov_en,prov_se FROM ord_proverbs WHERE ((author_id = 20402) AND NOT ((prov_en = '') || (prov_se = '') || (prov_se = '_')));" > idioms_log.txt
(vpex1) pex@p16 ~/phd/201205_Translate_English_idioms_to_Swedish $ head idioms_log.txt
Good riddance! Ajöss med den!
An/one's Achilles' heel En/någons akilleshäl
Not on your life/nelly! Aldrig i livet!
All roads lead to Rome Alla vägar bär till Rom
All hands to the pumps! Alle man till pumparna!
All or nothing Allt eller intet
All told Allt som allt
For all one is worth Allt vad tygen håller
All that glitters is not gold Allt är inte guld som glimmar
Always fall on one's feet Alltid komma ned på fötterna

...seems like a good start. We now have a text file where each line contains one English idiom and its Swedish translation, separated by a tab character.

# Encoding the English idioms with the Universal Sentence Encoder.
# The Universal Sentence Encoder is a powerful Transformer model (in its large version),
# allowing us to extract embeddings directly from sentences instead of from individual words.

import numpy as np
import tensorflow as tf
import tensorflow_hub as tfhub

# Load a pre-trained model for sentence encoding from TensorFlow Hub:
print("### Grabbing pre-trained sentence encoder:")
model = tfhub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Grab the English idioms that we know a perfect translation of:
print("### Loading idioms:")
English_idioms = []
Swedish_idioms = []
with open("idioms_log.txt") as f:
    all_lines = f.readlines()
for line in all_lines:
    sp = line.split("\t")
    English_idioms.append(sp[0])
    Swedish_idioms.append(sp[1][:-1])  # strip the trailing newline

# First use the model to encode all the idioms that the user can search among:
print("### Translating all English_idioms to embeddings:")
print("There are %d idioms to process" % len(English_idioms))
batch_size = 10
embeddings = []
for i in range(0, len(English_idioms), batch_size):
    embeddings.append(model(English_idioms[i : i + batch_size]))
English_idioms_embeddings = tf.concat(embeddings, axis=0)

# When a user searches for an idiom, we extract its embedding and find the
# most similar idiom in our database of translated idioms. In our case we
# use a simple vector dot product as the similarity function:
def find_best_translation(English_idiom: str) -> int:
    embedding = model([English_idiom])
    # Compute the dot product with each English idiom:
    scores = English_idioms_embeddings @ tf.transpose(embedding)
    idiom_number = np.argmax(tf.squeeze(scores).numpy())
    return idiom_number

search_texts = [
    "Sweat like a horse",
    "Work for a week",
    "My cup of tea",
    "child of time",
    "six and a half dozen",
    "penny",
    "pretty",
]
for search_text in search_texts:
    idiom_no = find_best_translation(search_text)
    print(
        "You searched for '%s', and I found: '%s' -> '%s'."
        % (search_text, English_idioms[idiom_no], Swedish_idioms[idiom_no])
    )

while search_text != "quit":
    search_text = input(
        "Write an English idiom you'd like to translate to Swedish, 'quit' to quit:\n"
    )
    idiom_no = find_best_translation(search_text)
    print(
        "You searched for '%s', and I found: '%s' -> '%s'."
        % (search_text, English_idioms[idiom_no], Swedish_idioms[idiom_no])
    )

 

How well does this perform, then? Let's run it:

### Grabbing pre-trained sentence encoder:
2020-12-05 10:54:45.915011: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-05 10:54:45.926230: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f9b5fe629c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-05 10:54:45.926247: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
### Loading idioms:
### Translating all English_idioms to embeddings:
There are 1890 idioms to process
You searched for 'Sweat like a horse', and I found: 'Rough it; work like a horse' -> 'Slita hund'.
You searched for 'Work for a week', and I found: 'Work like a slave' -> 'Arbeta som en slav'.
You searched for 'My cup of tea', and I found: 'Not be somebody's cup of tea' -> 'Inte vara någons likör'.
You searched for 'child of time', and I found: 'Child's play' -> 'En barnlek'.
You searched for 'six and a half dozen', and I found: 'It's as broad as it's long; it's six of one and half a dozen of the other' -> 'Det går på ett ut'.
You searched for 'penny', and I found: 'A pretty penny' -> 'En vacker slant'.
You searched for 'pretty', and I found: 'A raving beauty' -> 'En strålande skönhet'.
Write an English idiom you'd like to translate to Swedish, 'quit' to quit:
flower
You searched for 'flower', and I found: 'Hair-splitting' -> 'Hårklyverier'.
Write an English idiom you'd like to translate to Swedish, 'quit' to quit:
welcome
You searched for 'welcome', and I found: 'There you are!' -> 'Där ser du!'.
Write an English idiom you'd like to translate to Swedish, 'quit' to quit:
first and last
You searched for 'first and last', and I found: 'First and foremost' -> 'Först och främst'.
Write an English idiom you'd like to translate to Swedish, 'quit' to quit:
split hair
You searched for 'split hair', and I found: 'Split hairs' -> 'Ägna sig åt hårklyverier'.
Write an English idiom you'd like to translate to Swedish, 'quit' to quit:
haircut
You searched for 'haircut', and I found: 'Tear one's hair' -> 'Slita sitt hår'.
Write an English idiom you'd like to translate to Swedish, 'quit' to quit:
quit
You searched for 'quit', and I found: 'Kill (the) time' -> 'Fördriva tiden'.

 

Well, it's not perfect, but it's a lot better than Google Translate!

 

Encoder-Decoder NN for Machine Translation

This sub-chapter describes the paper Sequence to Sequence Learning with Neural Networks, applied to translating English to French. The author of the book is French, by the way.

The encoder is fed the source sentences in reversed order, "because doing so introduced many short term dependencies between the source and the target sentence". Translation then behaves a bit like a LIFO queue: the last word fed into the encoder (the first word of the source sentence) is the first one the decoder needs to translate and output.

Figure 16.3: a simple machine translation model

The source sentences are tokenized into word IDs, and an embedding layer then turns each word into an n-dimensional vector.

This simple model is missing a lot of fine details. They use a mixed bag of tricks to make it work and to speed things up:

  • Start-of-sequence (SOS) and end-of-sequence (EOS) tokens, to handle variable-length sentences.
  • Feeding the decoder "the correct French word" after each step at training time, to speed up learning.
  • Grouping sentences to be translated into buckets of similar length, and using differently trained NNs for different lengths; padding makes a too-short sentence fit a NN trained for longer sentences.
  • Masking out everything past the EOS token when computing the loss.
  • Using sampled softmax instead of a full softmax during training.
  • Bidirectional RNNs.

Figure 16.5: a bidirectional RNN

In a bidirectional RNN, they run two recurrent layers on the same inputs, one reading the words from left to right and the other reading them from right to left, and then simply combine their outputs at each time step by concatenating them.
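In Keras this is just a wrapper around the recurrent layer, as in the book:

model = keras.models.Sequential([
    keras.layers.GRU(10, return_sequences=True, input_shape=[None, 10]),
    keras.layers.Bidirectional(keras.layers.GRU(10, return_sequences=True))  # outputs concatenated: 20 units per step
])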

Beam Search

One way of improving language translation is to keep track of several of the most promising candidate outputs. Instead of committing to a single word at each step, a set of "translations under trial" is kept.

The softmax score for a single word is replaced by a combined score for each candidate sentence - the sum of the log-probabilities of all the words chosen so far - and only the k best candidates (the beam width) are kept at each step.
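In practice you would use something like TensorFlow Addons' tfa.seq2seq.BeamSearchDecoder on top of the encoder-decoder model, but the idea itself fits in a few lines of plain Python. This is my own toy sketch, not the book's code; next_probs is a hypothetical function that returns {token: probability} for a partial sentence:

import math

def beam_search(start_token, next_probs, k=3, max_len=10):
    # Keep the k best partial translations ("beams"), scored by their summed log-probability.
    beams = [(0.0, [start_token])]
    for _ in range(max_len):
        candidates = []
        for log_prob, tokens in beams:
            if tokens[-1] == "<eos>":                 # a finished beam is kept as it is
                candidates.append((log_prob, tokens))
                continue
            for token, p in next_probs(tokens).items():
                candidates.append((log_prob + math.log(p), tokens + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]  # prune to the k best
    return beams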

Attention Mechanisms

In 2014, the paper Neural Machine Translation by Jointly Learning to Align and Translate was published, describing what is now called Bahdanau attention.

Attention mechanisms shorten the path through the neural network during translation by letting an alignment model choose which encoder outputs to match up with the decoder's current internal state. The alignment model is a time-distributed dense layer, trained along with the encoder and decoder.

Attention mechanisms have the additional benefit of making neural networks more explainable. Since they pinpoint what data was used to make a certain decision, they also give humans a way to correct the network's mistakes: when a bad decision was made, we can inspect the attention vectors and figure out what kind of training data needs to be added to avoid making the same bad decision again.
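A minimal, self-contained sketch of the mechanism (my own toy example, Bahdanau-style: concatenate each encoder output with the decoder state and score the pair with a small dense layer):

import tensorflow as tf
from tensorflow import keras

encoder_outputs = tf.random.normal([1, 5, 16])    # 5 source words, 16-dim encoder outputs
decoder_state = tf.random.normal([1, 16])         # current decoder hidden state

alignment = keras.layers.Dense(1)                 # the (time-distributed) alignment model
tiled_state = tf.tile(decoder_state[:, tf.newaxis, :], [1, 5, 1])
scores = alignment(tf.concat([encoder_outputs, tiled_state], axis=-1))   # shape [1, 5, 1]

weights = tf.nn.softmax(scores, axis=1)                       # attention weights over the source words
context = tf.reduce_sum(weights * encoder_outputs, axis=1)    # weighted sum = context vector, shape [1, 16]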

 

Transformers

From the paper Attention Is All You Need. The funny thing about having spent so much time reading about the neural network architectures in this chapter is that you can almost forget all about them, at least when it comes to language processing and translation, where the Transformer architecture renders the others obsolete.

While using a small fraction of the training cost of previous state-of-the-art models - orders of magnitude fewer operations in some comparisons, which at this level is a lot of $$$ on the electricity bill - the Transformer exhibits similar or better performance, measured through BLEU score.
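The core building block, scaled dot-product attention, is short enough to write out; a sketch following the formula in the paper, softmax(Q K^T / sqrt(d_k)) V:

import tensorflow as tf

def scaled_dot_product_attention(Q, K, V):
    # Score every query against every key, scale by sqrt(d_k), softmax, then weight the values.
    d_k = tf.cast(tf.shape(K)[-1], tf.float32)
    scores = tf.matmul(Q, K, transpose_b=True) / tf.sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, V)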

The book describes the Transformer architecture in words, but if you prefer a description in code, head over to https://www.tensorflow.org/tutorials/text/transformer - there you could easily spend a couple of days (or perhaps a lifetime) with Python, translating Portuguese to English!

 

My general NLP observations (...and others' observations)

GPT-3 has trouble with Bias & Unreliability:

  • If you ask it to write a kids' story, it may deviate and start writing a horror plot instead.
  • If we give it a task, it picks the solution that looks most likely given the training data.

Francois Chollet, author of "Deep Learning with Python", states:
"It is only constrained by plausibility and not other important things such as factualness or consistency, which is why it's so easy to generate things with GPT-3 that are untrue or even self-contradictory."

 

Commercial uses of NLP

GPT-2 and BERT can be downloaded and run locally. OpenAI has a price list for using their GPT-3 model: 100,000 tokens are "free", so you can get started at no cost, but you'll need to pay when you're using it at an enterprise level.

Cost of training is substantial: https://www.theregister.com/2020/11/04/gpt3_carbon_footprint_estimate/
"More specifically, they estimated teaching the [GPT-3] neural super-network in a Microsoft data center using Nvidia GPUs required roughly 190,000 kWh"

OpenAI has licensed GPT-3 exclusively to Microsoft.

 

Additional resources

Papers and webpages:

 

Session Agenda

The meeting plan for the 11th of December 2020:

  • Go through the chapter summary above and discuss that.
  • Discuss the suggested exercises.
  • If we have time and you are interested, let's talk about GPT-3.

 

Recommended exercises

If we get the time, we could discuss the following exercises from the book:

  1. What are the pros and cons of using a stateful RNN versus a stateless RNN?

  2. Why do people use Encoder–Decoder RNNs rather than plain sequence-to-sequence RNNs for automatic translation?

  3. How can you deal with variable-length input sequences? What about variable-length output sequences?

  4. What is beam search and why would you use it? What tool can you use to implement it?

  5. What is an attention mechanism? How does it help?

  6. When would you need to use sampled softmax?

There's a Jupyter notebook with NLP examples to play around with:

https://github.com/ageron/handson-ml2/blob/master/16_nlp_with_rnns_and_attention.ipynb