Meeting date: April 19th, 2017
For our third consecutive session, we focused on CS224d, which is taught by Richard Socher and covers Natural Language Processing with Deep Learning. The Stanford University School of Engineering released the Winter 2017 lectures on April 3rd, so we began working from that collection.
In addition, we were treated to relevant talks by two heavy-hitters from the field of data science:
- Claudia Perlich on predictability and how it creates biases when your target is created by mixtures (slides here)
- Brian Dalessandro on generating text with Keras LSTM models
A summary blog post, replete with photos of the session, can be found here.
The recommended preparatory work for Session XI was lectures seven through nine of CS224d (2017), each of which is 75 to 80 minutes long:
- Introduction to TensorFlow
- Recurrent Neural Networks and Language Models, and
- Machine Translation and Advanced Recurrent LSTMs and GRUs
Topic highlights of the session included:
- "the big idea": express a numeric computation as a graph
- graph nodes:
- operations
- have any number of inputs and outputs
- graph edges:
- tensors
- flow between nodes
- three kinds of graph nodes:
- variables:
- "stateful" nodes
- output their current value
- their state is retained across multiple executions of a graph
- primarily used for model parameters
- placeholders:
- nodes whose values are fed in at execution time
- used for, e.g., model inputs, labels
- mathematical operations, e.g.:
- MatMul: multiply two matrix values
- Add: add elementwise (with broadcasting)
- ReLU: activate with elementwise rectified linear function
- run the graph with, e.g.:
sess.run(fetches, feeds)
- fetches:
- list of graph nodes
- return the outputs of these nodes
- feeds:
- dictionary-mapping from graph nodes to concrete values
- specifies the value of each graph node given in the dictionary
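Putting these pieces together, here is a minimal sketch in the TensorFlow 1.x-style API that was current at the time of the session; the layer sizes and variable names are illustrative assumptions rather than anything prescribed by the lecture:

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API, current at the time of this session

# placeholder: a node whose value is fed in at execution time (e.g., model inputs)
x = tf.placeholder(tf.float32, shape=[None, 784])

# variables: "stateful" nodes whose values persist across executions (model parameters)
W = tf.Variable(tf.random_uniform([784, 100], -1.0, 1.0))
b = tf.Variable(tf.zeros([100]))

# operations are the graph nodes; the tensors flowing between them are the edges
h = tf.nn.relu(tf.matmul(x, W) + b)  # MatMul, Add (with broadcasting), ReLU

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # fetches: the node(s) whose outputs we want; feeds: concrete values for placeholders
    h_value = sess.run(h, {x: np.random.random([64, 784])})
    print(h_value.shape)  # (64, 100)
```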
- language model
- computes a probability for a sequence of words
- e.g.,
P(w_1, ..., w_T)
- useful for machine translation, e.g.:
- word ordering:
p(the cat is small) > p(small the is cat)
- word choice:
p(walking home after school) > p(walking house after school)
- probability is usually conditioned on window of n previous words
- an incorrect, but necessary, Markov assumption
- to estimate probabilities, compute the probability of:
- unigrams, bigrams
- ...conditioned on one, two previous word(s)
- even with a small-ish corpus (e.g., 100k words), this quickly becomes a lot of probabilities
- i.e., the number of possible n-grams increases exponentially with n
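As a concrete illustration of this count-based estimation (not from the lecture itself), bigram probabilities can be computed by dividing pair counts by single-word counts; the toy corpus below is an assumption purely for demonstration:

```python
from collections import Counter

corpus = "the cat is small . the cat sat on the mat .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_bigram(word, prev):
    """Estimate P(word | prev) from counts: count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_bigram("cat", "the"))  # 2/3 ~= 0.67 in this toy corpus
```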
- RNNs tie the weights at each time step
- condition the neural network on all previous words
- RAM requirement only scales linearly with the number of words
- use the cross-entropy loss function, but predict words instead of classes
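To make the tied-weight idea concrete, here is a minimal numpy sketch of a single forward step of an RNN language model; the dimensions, initialisation scale, and names (W_h, W_e, U) are illustrative assumptions rather than the lecture's exact notation:

```python
import numpy as np

V, d, Dh = 10000, 50, 100               # vocabulary, embedding, and hidden sizes (assumed)
W_h = np.random.randn(Dh, Dh) * 0.01    # hidden-to-hidden weights, tied across all time steps
W_e = np.random.randn(Dh, d) * 0.01     # input-to-hidden weights, tied across all time steps
U = np.random.randn(V, Dh) * 0.01       # hidden-to-output weights, tied across all time steps

def rnn_lm_step(h_prev, x_t, target_id):
    """One time step: update the hidden state, predict a distribution over the
    vocabulary, and return the cross-entropy loss for the true next word."""
    h_t = np.tanh(W_h @ h_prev + W_e @ x_t)   # h_prev summarises all previous words
    scores = U @ h_t
    y_hat = np.exp(scores - scores.max())
    y_hat /= y_hat.sum()                      # softmax over the whole vocabulary
    loss = -np.log(y_hat[target_id])          # cross-entropy: predict words, not classes
    return h_t, loss

h, loss = rnn_lm_step(np.zeros(Dh), np.random.randn(d), target_id=42)
```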
- gradients can vanish or explode
- multiplying the same matrix at each step during backpropagation makes training RNNs hard
- typically gradients vanish, and in the case of language modelling or question-answering, words from time steps far away are not taken into consideration when training to predict the next word
- an example where this is a problem:
- Jane walked into the room. John walked in too. It was late in the day. Jane said hi to __.
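The following toy numpy experiment (an illustration added here, not from the lecture) shows why this happens: backpropagating through time multiplies the gradient by the same recurrent matrix at every step, so its norm shrinks or grows geometrically depending on the weight scale:

```python
import numpy as np

np.random.seed(0)
Dh = 100
for scale in (0.01, 0.3):                  # two illustrative weight scales
    W = np.random.randn(Dh, Dh) * scale    # the same matrix is used at every time step
    grad = np.ones(Dh)
    for step in range(50):                 # backpropagate through 50 time steps
        grad = W.T @ grad                  # repeated multiplication by the same matrix
    print(f"scale={scale}: gradient norm after 50 steps = {np.linalg.norm(grad):.3g}")
# the small-scale setting vanishes towards 0 and the large-scale setting explodes,
# which is why words from far-away time steps contribute little (or unstable) signal
```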
- clipping trick for exploding gradients:
- introduced by Tomas Mikolov
- makes a big difference for RNNs
- clip gradients to a maximum value
- e.g.: clip large value (say, 100) to some maximum (say, 5)
- doesn't work for vanishing gradients, because scaling small gradient values up would cause jumps over local minima
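A hedged sketch of the clipping trick, written here as the common rescale-to-a-maximum-norm variant (the threshold of 5 mirrors the example above; the lecture's exact formulation may differ):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]   # e.g., a norm of 100 is scaled down to 5
    return grads

clipped = clip_gradients([np.random.randn(100, 100) * 10, np.random.randn(100) * 10])
```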
- trick for vanishing gradients:
- initialise weights to identity matrix I and
f(z) = rect(z) = max(z,0)
- makes a "huge difference" (Socher)
- idea first introduced in Socher et al. (2013)
- new experiment with RNNs in Le et al. (2015)
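A minimal sketch of this trick (dimensions and input-weight scale are assumptions), combining identity-matrix initialisation of the recurrent weights with the ReLU activation, as in the Le et al. (2015) recipe:

```python
import numpy as np

Dh, d = 100, 50                      # hidden and input dimensions (assumed)
W_h = np.eye(Dh)                     # recurrent weights initialised to the identity matrix I
W_x = np.random.randn(Dh, d) * 0.01  # input weights initialised small

def rnn_step_relu(h_prev, x_t):
    """One recurrent step with f(z) = rect(z) = max(z, 0) as the activation."""
    return np.maximum(W_h @ h_prev + W_x @ x_t, 0.0)

h = rnn_step_relu(np.zeros(Dh), np.random.randn(d))
```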
- classify each word into:
- Named Entity Recognition
- entity-level sentiment in context
- opinionated expressions
- example application and slides in Irsoy and Cardie (2014)
- the F1 score is a common evaluation metric for these tasks
- more hidden layers does not always improve network performance
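For reference, a small sketch of how the F1 score is computed from true-positive, false-positive, and false-negative counts (the counts below are purely illustrative):

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=70, fp=10, fn=20))  # precision 0.875, recall ~0.778 -> F1 ~0.82
```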
- word2vec
- GloVe
- neural net & max-margin error
- multi-layer neural net & backpropagation
- recurrent neural networks
- cross-entropy error
- mini-batched stochastic gradient descent
- traditional machine-translation methods are statistical
- use large-scale parallel corpora, e.g., those produced by European Parliament
- the first parallel corpus was the Rosetta Stone
- the systems in traditional approaches are very complex
- traditional Machine Translation systems required hundreds of curated features and decades of research, leading to many specialised companies
- for short sentences, a simple RNN encoder-decoder pair works (e.g., an encoder reading German and a decoder outputting English)
- train different RNN weights for encoding and decoding
- compute every hidden state in the decoder from the previous hidden state, the previous predicted output word, and the encoder's final hidden vector
- train deep RNNs, i.e., with multiple layers
- potentially train bidirectional encoder
- train on the input sequence in reverse order for a simpler optimisation problem
- i.e., instead of
ABC --> XY
train with
CBA --> XY
so that corresponding source and target words tend to be closer together
- better units:
- the "main improvement" (Socher)
- Gated Recurrent Units
- introduced by Cho et al. (2014)
- keep around memories to capture long-distance dependencies
- allow error messages to flow at different strengths depending on the inputs
- contain:
- update gate
- reset gate
- new memory content: if reset gate unit is ~0, then previous memory is ignored and only the new word's information is stored
- final memory at time step t combines current and previous time steps
- take-home message: essentially, RNNs weight each word equally; GRUs, meanwhile, ignore unimportant words in a sequence while retaining important words in memory
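A hedged numpy sketch of a single GRU step in the Cho et al. (2014) formulation summarised above (bias terms omitted; dimensions and initialisation are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Dh, d = 100, 50                                                           # hidden and input sizes (assumed)
Wz, Uz = np.random.randn(Dh, d) * 0.01, np.random.randn(Dh, Dh) * 0.01    # update gate weights
Wr, Ur = np.random.randn(Dh, d) * 0.01, np.random.randn(Dh, Dh) * 0.01    # reset gate weights
W,  U  = np.random.randn(Dh, d) * 0.01, np.random.randn(Dh, Dh) * 0.01    # new-memory weights

def gru_step(h_prev, x_t):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))   # new memory: r ~ 0 ignores previous memory
    return z * h_prev + (1.0 - z) * h_tilde         # final memory combines current and previous

h = gru_step(np.zeros(Dh), np.random.randn(d))
```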
- Long Short-Term Memory Units (LSTMs)
- introduced by Hochreiter & Schmidhuber (1997)
- relative to GRUs, these units are even more complex
- at each time step, LSTMs are able to modify:
- input gate: "current cell matters"
- forget gate: a value near 0 means the past cell state is forgotten
- output: how much cell is exposed
- new memory cell
- the final memory cell and final hidden state have separate equations (see slide 41 for all; the standard formulation is also sketched at the end of this LSTM summary)
- "very hip" (Socher)
- en vogue default model for most sequence-labelling tasks
- "very powerful", especially when stacked and made even deeper (each hidden layer is already computed by a deep internal network)
- most useful if you have "lots and lots" of data
- in 2015, Deep LSTMs were slightly behind the performance of traditional methods
- by 2016, Deep LSTMs were unquestionably better (e.g., at the WMT 16 competition, the MetaMind ensemble finished second, and all of the top performers with respect to BLEU score, the standard metric for evaluating machine translation, were Deep LSTMs)
- PCA projections of the hidden-layer vectors from the final time step (e.g., those in Sutskever et al., 2014), while they should be interpreted with caution owing to selection bias, suggest that meaning, not simply word order, is captured by the LSTM approach
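For reference, here is a sketch of the standard LSTM formulation referenced above (bias terms omitted; see the lecture slides for the exact variant used), where sigma is the logistic sigmoid and the circle denotes elementwise multiplication:

```latex
\begin{aligned}
i_t &= \sigma\!\left(W^{(i)} x_t + U^{(i)} h_{t-1}\right) && \text{(input gate)} \\
f_t &= \sigma\!\left(W^{(f)} x_t + U^{(f)} h_{t-1}\right) && \text{(forget gate)} \\
o_t &= \sigma\!\left(W^{(o)} x_t + U^{(o)} h_{t-1}\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\!\left(W^{(c)} x_t + U^{(c)} h_{t-1}\right) && \text{(new memory cell)} \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t && \text{(final memory cell)} \\
h_t &= o_t \circ \tanh(c_t) && \text{(final hidden state)}
\end{aligned}
```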
We are taking a break until early June while I (Jon Krohn) work on an Introduction to Deep Learning with TensorFlow project. When we return, we'll cover the next six lectures of the course, which cover:
- Neural Machine Translation and Models with Attention
- GRUs and Further Topics in NMT
- End-to-End Models for Speech Processing
- Convolutional Neural Networks
- Tree Recursive Neural Networks and Constituency Pairing
- Coreference resolution