Meeting date: April 19th, 2017
For our third consecutive session, we focused on CS224d, which is taught by Richard Socher and covers Natural Language Processing with Deep Learning. The Stanford University School of Engineering released the Winter 2017 lectures on April 3rd, so we began working from that collection.
In addition, we were treated to relevant talks by two heavy-hitters from the field of data science:
- Claudia Perlich on predictability and how it creates biases when your target is created by mixtures (slides here)
- Brian Dalessandro on generating text with Keras LSTM models
A summary blog post, replete with photos of the session, can be found here.
The recommended preparatory work for Session XI was lectures seven through nine of CS224d (2017), each of which is 75 to 80 minutes long:
- Introduction to TensorFlow
- Recurrent Neural Networks and Language Models, and
- Machine Translation and Advanced Recurrent LSTMs and GRUs
Topic highlights of the session included:
- "the big idea": express a numeric computation as a graph
- graph nodes:
- operations
- have any number of inputs and outputs
- graph edges:
- tensors
- flow between nodes
- three kinds of graph nodes:
- variables:
- "stateful" nodes
- output their current value
- their state is retained across multiple executions of a graph
- primarily used for model parameters
- placeholders:
- nodes whose values are fed in at execution time
- used for, e.g., model inputs, labels
- mathematical operations, e.g.:
- MatMul: multiply two matrix values
- Add: add elementwise (with broadcasting)
- ReLU: activate with elementwise rectified linear function
- run the graph with, e.g.:
sess.run(fetches, feeds)
- fetches:
- list of graph nodes
- return the outputs of these nodes
- feeds:
- dictionary-mapping from graph nodes to concrete values
- specifies the value of each graph node given in the dictionary
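Putting these pieces together, here is a minimal sketch in the TensorFlow 1.x-style API that was current at the time of the session; the layer sizes and variable names are illustrative assumptions rather than anything prescribed by the lecture:

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API, current at the time of this session

# placeholder: a node whose value is fed in at execution time (e.g., model inputs)
x = tf.placeholder(tf.float32, shape=[None, 784])

# variables: "stateful" nodes whose values persist across executions (model parameters)
W = tf.Variable(tf.random_uniform([784, 100], -1.0, 1.0))
b = tf.Variable(tf.zeros([100]))

# operations are the graph nodes; the tensors flowing between them are the edges
h = tf.nn.relu(tf.matmul(x, W) + b)  # MatMul, Add (with broadcasting), ReLU

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # fetches: the node(s) whose outputs we want; feeds: concrete values for placeholders
    h_value = sess.run(h, {x: np.random.random([64, 784])})
    print(h_value.shape)  # (64, 100)
```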
- language model
- computes a probability for a sequence of words
- e.g.,
P(w_1, ..., w_T)
- useful for machine translation, e.g.:
- word ordering:
p(the cat is small) > p(small the is cat)
- word choice:
p(walking home after school) > p(walking house after school)
- probability is usually conditioned on window of n previous words
- an incorrect, but necessary, Markov assumption
- to estimate probabilities, compute the probability of:
- unigrams, bigrams
- ...conditioned on one, two previous word(s)
- even with a small-ish corpus (e.g., 100k words), this quickly becomes a lot of probabilities
- i.e., the number of possible n-grams increases exponentially with n
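As a concrete illustration of this count-based estimation (not from the lecture itself), bigram probabilities can be computed by dividing pair counts by single-word counts; the toy corpus below is an assumption purely for demonstration:

```python
from collections import Counter

corpus = "the cat is small . the cat sat on the mat .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_bigram(word, prev):
    """Estimate P(word | prev) from counts: count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_bigram("cat", "the"))  # 2/3 ~= 0.67 in this toy corpus
```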
- RNNs tie the weights at each time step
- condition the neural network on all previous words
- RAM requirement only scales linearly with the number of words
- use the cross-entropy loss function, but predict words instead of classes
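To make the tied-weight idea concrete, here is a minimal numpy sketch of a single forward step of an RNN language model; the dimensions, initialisation scale, and names (W_h, W_e, U) are illustrative assumptions rather than the lecture's exact notation:

```python
import numpy as np

V, d, Dh = 10000, 50, 100               # vocabulary, embedding, and hidden sizes (assumed)
W_h = np.random.randn(Dh, Dh) * 0.01    # hidden-to-hidden weights, tied across all time steps
W_e = np.random.randn(Dh, d) * 0.01     # input-to-hidden weights, tied across all time steps
U = np.random.randn(V, Dh) * 0.01       # hidden-to-output weights, tied across all time steps

def rnn_lm_step(h_prev, x_t, target_id):
    """One time step: update the hidden state, predict a distribution over the
    vocabulary, and return the cross-entropy loss for the true next word."""
    h_t = np.tanh(W_h @ h_prev + W_e @ x_t)   # h_prev summarises all previous words
    scores = U @ h_t
    y_hat = np.exp(scores - scores.max())
    y_hat /= y_hat.sum()                      # softmax over the whole vocabulary
    loss = -np.log(y_hat[target_id])          # cross-entropy: predict words, not classes
    return h_t, loss

h, loss = rnn_lm_step(np.zeros(Dh), np.random.randn(d), target_id=42)
```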
- gradients can vanish or explode
- multiplying the same matrix at each step during backpropagation makes training RNNs hard
- typically gradients vanish, and in the case of language modelling or question-answering, words from time steps far away are not taken into consideration when training to predict the next word
- an example where this is a problem:
- Jane walked into the room. John walked in too. It was late in the day. Jane said hi to __.
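The following toy numpy experiment (an illustration added here, not from the lecture) shows why this happens: backpropagating through time multiplies the gradient by the same recurrent matrix at every step, so its norm shrinks or grows geometrically depending on the weight scale:

```python
import numpy as np

np.random.seed(0)
Dh = 100
for scale in (0.01, 0.3):                  # two illustrative weight scales
    W = np.random.randn(Dh, Dh) * scale    # the same matrix is used at every time step
    grad = np.ones(Dh)
    for step in range(50):                 # backpropagate through 50 time steps
        grad = W.T @ grad                  # repeated multiplication by the same matrix
    print(f"scale={scale}: gradient norm after 50 steps = {np.linalg.norm(grad):.3g}")
# the small-scale setting vanishes towards 0 and the large-scale setting explodes,
# which is why words from far-away time steps contribute little (or unstable) signal
```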
- clipping trick for exploding gradients:
- introduced by Tomas Mikolov
- makes a big difference for RNNs
- clip gradients to a maximum value
- e.g.: clip large value (say, 100) to some maximum (say, 5)
- doesn't work for vanishing gradients, because scaling small gradient values up would cause jumps over local minima
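A hedged sketch of the clipping trick, written here as the common rescale-to-a-maximum-norm variant (the threshold of 5 mirrors the example above; the lecture's exact formulation may differ):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]   # e.g., a norm of 100 is scaled down to 5
    return grads

clipped = clip_gradients([np.random.randn(100, 100) * 10, np.random.randn(100) * 10])
```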
- trick for vanishing gradients:
- initialise weights to identity matrix I and
f(z) = rect(z) = max(z,0)
- makes a "huge difference" (Socher)
- idea first introduced in Socher et al. (2013)
- new experiment with RNNs in Le et al. (2015)
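A minimal sketch of this trick (dimensions and input-weight scale are assumptions), combining identity-matrix initialisation of the recurrent weights with the ReLU activation, as in the Le et al. (2015) recipe:

```python
import numpy as np

Dh, d = 100, 50                      # hidden and input dimensions (assumed)
W_h = np.eye(Dh)                     # recurrent weights initialised to the identity matrix I
W_x = np.random.randn(Dh, d) * 0.01  # input weights initialised small

def rnn_step_relu(h_prev, x_t):
    """One recurrent step with f(z) = rect(z) = max(z, 0) as the activation."""
    return np.maximum(W_h @ h_prev + W_x @ x_t, 0.0)

h = rnn_step_relu(np.zeros(Dh), np.random.randn(d))
```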
- classify each word into:
- Named Entity Recognition
- entity-level sentiment in context
- opinionated expressions
- example application and slides in Irsoy and Cardie (2014)
- the F1 score is a common evaluation metric for these tasks
- more hidden layers does not always improve network performance
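For reference, a small sketch of how the F1 score is computed from true-positive, false-positive, and false-negative counts (the counts below are purely illustrative):

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=70, fp=10, fn=20))  # precision 0.875, recall ~0.778 -> F1 ~0.82
```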
- word2vec
- GloVe
- neural net & max-margin error
- multi-layer neural net & backpropagation
- recurrent neural networks
- cross-entropy error
- mini-batched stochastic gradient descent
- traditional machine-translation methods are statistical
- use large-scale parallel corpora, e.g., those produced by European Parliament
- the first parallel corpus was the Rosetta Stone
- the systems in traditional approaches are very complex
- traditional Machine Translation systems required hundreds of curated features and decades of research, leading to many specialised companies
- for short sentences, a simple RNN encoder-decoder pair works (e.g., an encoder reading German and a decoder outputting English)
- train different RNN weights for encoding and decoding
- compute every hidden state in the decoder from the previous hidden state, the previous predicted output word, and the encoder's final hidden vector
- train deep RNNs, i.e., with multiple layers
- potentially train bidirectional encoder
- train on the input sequence in reverse order for a simpler optimisation problem
- i.e., instead of
ABC --> XY
train with
CBA --> XY
so that corresponding source and target words tend to be closer together
- better units:
- the "main improvement" (Socher)
- Gated Recurrent Units
- introduced by Cho et al. (2014)
- keep around memories to capture long-distance dependencies
- allow error messages to flow at different strengths depending on the inputs
- contain:
- update gate
- reset gate
- new memory content: if reset gate unit is ~0, then previous memory is ignored and only the new word's information is stored
- final memory at time step t combines current and previous time steps
- take-home message: essentially, RNNs weight each word equally; GRUs, meanwhile, ignore unimportant words in a sequence while retaining important words in memory
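A hedged numpy sketch of a single GRU step in the Cho et al. (2014) formulation summarised above (bias terms omitted; dimensions and initialisation are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Dh, d = 100, 50                                                           # hidden and input sizes (assumed)
Wz, Uz = np.random.randn(Dh, d) * 0.01, np.random.randn(Dh, Dh) * 0.01    # update gate weights
Wr, Ur = np.random.randn(Dh, d) * 0.01, np.random.randn(Dh, Dh) * 0.01    # reset gate weights
W,  U  = np.random.randn(Dh, d) * 0.01, np.random.randn(Dh, Dh) * 0.01    # new-memory weights

def gru_step(h_prev, x_t):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))   # new memory: r ~ 0 ignores previous memory
    return z * h_prev + (1.0 - z) * h_tilde         # final memory combines current and previous

h = gru_step(np.zeros(Dh), np.random.randn(d))
```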
- Long Short-Term Memory Units (LSTMs)
- introduced by Hochreiter & Schmidhuber (1997)
- relative to GRUs, these units are even more complex
- at each time step, LSTMs are able to modify:
- input gate: "current cell matters"
- forget gate: a value near 0 means the past cell state is forgotten
- output: how much cell is exposed
- new memory cell
- the final memory cell and final hidden state have separate equations (see slide 41 for all; the standard formulation is also sketched at the end of this LSTM summary)
- "very hip" (Socher)
- en vogue default model for most sequence-labelling tasks
- "very powerful", especially when stacked and made even deeper (each hidden layer is already computed by a deep internal network)
- most useful if you have "lots and lots" of data
- in 2015, Deep LSTMs were slightly behind the performance of traditional methods
- by 2016, Deep LSTMs were unquestionably better (e.g., at the WMT 16 competition, the MetaMind ensemble finished second, and all of the top performers with respect to BLEU score, the standard metric for evaluating machine translation, were Deep LSTMs)
- PCA projections of the hidden-layer vectors from the final time step (e.g., those in Sutskever et al., 2014), while they should be interpreted with caution owing to selection bias, suggest that meaning, not simply word order, is captured by the LSTM approach
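For reference, here is a sketch of the standard LSTM formulation referenced above (bias terms omitted; see the lecture slides for the exact variant used), where sigma is the logistic sigmoid and the circle denotes elementwise multiplication:

```latex
\begin{aligned}
i_t &= \sigma\!\left(W^{(i)} x_t + U^{(i)} h_{t-1}\right) && \text{(input gate)} \\
f_t &= \sigma\!\left(W^{(f)} x_t + U^{(f)} h_{t-1}\right) && \text{(forget gate)} \\
o_t &= \sigma\!\left(W^{(o)} x_t + U^{(o)} h_{t-1}\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\!\left(W^{(c)} x_t + U^{(c)} h_{t-1}\right) && \text{(new memory cell)} \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t && \text{(final memory cell)} \\
h_t &= o_t \circ \tanh(c_t) && \text{(final hidden state)}
\end{aligned}
```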
We are taking a break until early June while I (Jon Krohn) work on an Introduction to Deep Learning with TensorFlow project. When we return, we'll cover the next six lectures of the course, which cover:
- Neural Machine Translation and Models with Attention
- GRUs and Further Topics in NMT
- End-to-End Models for Speech Processing
- Convolutional Neural Networks
- Tree Recursive Neural Networks and Constituency Pairing
- Coreference resolution