Meeting date: March 6th, 2017
This session marked the beginning of our coverage of Stanford's CS224d course, which is taught by Richard Socher and focuses on Deep Learning applied to Natural Language Processing.
In addition, we enjoyed fascinating technical talks from:
- Jessica Graves on AI applications in the fashion industry, and
- Grant Beyleveld on his implementation of U-Net convolutional networks for object recognition in images (slides here)
A summary blog post, replete with photos of the session, can be found here.
N.B.: this document was updated on May 22nd, 2017 to reflect Christopher Manning's 2017 iteration of the course.
The recommended preparatory work for Session IX was the first three lectures of CS224d, each of which runs 75 to 80 minutes.
Topic highlights of the session included:
From Lecture 1 (Course Intro, NLP, Deep NLP; 2017 lecture here)
- proficiency in Python
- college-level calculus and linear algebra
- understanding of the fundamentals of probability and statistics
- knowledge of machine learning (i.e., equivalent of Stanford CS229), e.g.:
- cost functions
- simple derivatives
- how to optimise with gradient descent
- the traditional levels of NLP analysis:
- phonetic/phonological analysis (if starting with speech) or OCR (optical character recognition)/tokenization (if starting with text)
- morphological analysis
- syntactic analysis
- semantic interpretation
- discourse processing
Of these, phonetic/phonological analysis (from level 1), syntactic analysis (level 3), and semantic interpretation (level 4) are covered in this course.
- simple:
- spell checking
- keyword search
- finding synonyms
- moderate:
- extracting information from websites, e.g.:
- product price
- dates
- location
- people or company names
- classifying
- school-grade reading level
- sentiment of longer documents
- complex:
- machine translation
- spoken-dialog systems
- answering non-straightforward questions
- search
- written
- spoken
- digital advertising
- language translation
- automated
- assisted
- sentiment analysis
- marketing
- finance/trading
- speech recognition
- chatbots / dialog agents:
- automation of customer support
- controlling devices (2017)
- ordering goods (2017)
- representing, learning, and using linguistic, situational, world, or visual information is complex
- examples:
- the referent of "she" in this example depends on the associated verb:
- "Jane hit June and then she [fell / ran]."
- the ambiguity of words: "I made her duck."
- it is a subfield of machine learning, specifically of representation learning (i.e., where representations (=features) are learned by machines as opposed to created by humans)
- ML works well because of human-designed representations and input features
- e.g.: the features for named entity recognition (locations, organisation names, etc.; Finkel, 2010)
- ML becomes a weight-optimisation problem to make the best final prediction
- ~80% of time: describing the data with features a computer can understand requires domain-specific knowledge, typically Ph.D.-level talent
- ~20% of time: optimising weights on features
- in contrast, deep learning:
- representation learning attempts to automatically learn useful features or representations
- algorithms attempt to learn (multiple levels of) representation and an output
- modelling directly on raw inputs (e.g., words)
- CS224d focuses on various families of artificial neural networks
- (A)NNs are the dominant model family inside deep learning
- is DL simply stacked logistic regression units?
- to an extent; however, the end-to-end (e.g., text input to probability output) modelling principles distinguish it, and there are connections to biological neuroscience in some cases
- CS224d does not take a historical approach, instead focusing on leading contemporary methods for NLP problems
- the history of Deep Learning models (i.e., since the ~1960s) is well-covered by Jürgen Schmidhuber (2015) "Deep Learning in Neural Networks: An Overview"
- manually-designed features are often:
- over-specified
- incomplete
- take a long time to design, validate
- in contrast, learned features are:
- easy to adapt
- fast to learn
- therefore, deep learning provides a framework for representing (e.g., linguistic, visual, world) information that is:
- flexible
- universal
- learnable
- deep learning is useful for both:
- unsupervised learning (e.g., with raw text alone)
- supervised learning (e.g., with labelled data like positive or negative sentiment)
- the first large-data-set DL breakthrough happened in speech recognition
- the University of Toronto's Dahl et al. (2012) "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition"
- until 2014, the bulk of deep-learning research groups focused on machine vision, which was the second application of DL after speech recognition
- i.e., AlexNet (Krizhevsky et al., 2012, "ImageNet Classification with Deep Convolutional Neural Networks")
- this approach combines the ideas and goals of NLP with representation-learning and deep-learning methods to solve them
- in recent years, this approach has facilitated large strides across broad aspects of NLP, e.g.:
- levels:
- speech
- morphology
- syntax
- semantics
- applications:
- machine translation
- sentiment analysis
- question answering
- levels:
- traditional: phonemes
- DL: train model to predict phonemes (or words, directly) from sound features and represent them as vectors
- traditional: morphemes (e.g., "un-" (prefix), "-interest-" (stem), "-ed" (suffix))
- DL:
- every morpheme is a vector
- neural network combines two vectors into one vector
- neural word vectors can be visualised in two-dimensional space
- e.g., Luong, Socher & Manning, 2013, "Better Word Representations with Recursive Neural Networks for Morphology"
- traditional: phrases, in discrete categories like NP or VP
- DL:
- every word and every phrase is a vector
- a neural network combines two vectors into one vector
- e.g., Socher, Lin, Ng & Manning, 2011
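- as a rough sketch of the "combine two vectors into one" idea in that line of work, a parent vector p is computed from two child vectors c_1 and c_2 with a single learned layer:

```latex
p = \tanh\left( W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b \right),
\qquad c_1, c_2, p \in \mathbb{R}^{d}, \; W \in \mathbb{R}^{d \times 2d}
```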
- traditional:
- lambda calculus
- carefully-engineered functions
- takes specific other functions as inputs
- no notion of similarity or fuzziness of language
- DL:
- vectors represent every:
- word
- phrase
- logical expression
- again, neural network combines two vectors into one vector
- e.g., Bowman, Angeli, Potts & Manning, 2014
- traditional: curated sentiment dictionaries combined with either:
- bag-of-words representations (i.e., ignoring word order)
- hand-designed negation features (this doesn't capture "everything"); a toy bag-of-words example follows this list
- DL: one RNN model used simultaneously for:
- morphology
- syntax
- logical semantics
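- to make the bag-of-words limitation above concrete, here is a toy sketch (the lexicon and scoring rule are invented for illustration, not taken from the course): because word order is ignored, negation is invisible without extra hand-designed features

```python
# Toy bag-of-words sentiment scorer; the lexicon below is invented for illustration.
LEXICON = {"good": +1.0, "great": +1.0, "bad": -1.0, "awful": -1.0}

def bow_sentiment(text: str) -> float:
    """Sum lexicon scores over tokens; word order plays no role."""
    return sum(LEXICON.get(token, 0.0) for token in text.lower().split())

print(bow_sentiment("a good movie"))      # 1.0
print(bow_sentiment("not a good movie"))  # also 1.0 -- negation is invisible
```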
- common: a lot of feature engineering to capture world and other knowledge, e.g., regular expressions (Berant et al., 2014)
- DL: can use the same model as in the morphology section above
- stores in vectors:
- morphology
- syntax
- logical semantics
- sentiment
- traditional:
- many levels of translation have been tried in the past
- were very large complex systems
- DL: vectors!
- Socher's start-up
- acquired by Salesforce
- performs:
- sentiment analysis
- named-entity recognition
- part-of-speech tagging
- answers synthetic questions (whoa!)
- even if it requires multiple passes to understand meaning (wow wow wow)
- machine translation
- leverages a ConvNet to label images
- n dimensions in word vector:
- minimum 25
- typically 300
- 1000 for advanced cases
From Lecture 2 (Word Vectors; 2017 lecture here)
- traditional:
- use a dictionary definition (this is a "denotational" representation)
- use a taxonomy like WordNet that has:
- hypernyms ("is-a") relationships and synonym sets
- problems with this discrete representation:
- great as a resource but misses nuances, e.g., of synonyms (these are not binary but gradual)
- does not include new words
- subjective
- requires human time to create and to adapt
- not straightforward to compute word similarity accurately
- nearly all rule-based and statistical NLP work regards words as atomic symbols, creating massive one-hot representation vectors:
- speech: 20k
- PTB (Penn Treebank 3): 50k
- big vocabulary: 500k
- Google 1TB web-crawl corpus: 13 million
- with one-hot encoding (a "localist" representation, in contrast to a "distributed" representation, where meaning is "smeared" across continuous vector dimensions), similar words are encoded no more alike than unrelated ones
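- a minimal sketch of why one-hot vectors carry no similarity information (the three-word vocabulary is invented for illustration): every pair of distinct words is orthogonal, so "motel" is no closer to "hotel" than to any other word

```python
import numpy as np

# Toy vocabulary; real vocabularies run from ~20k (speech) to ~13M (Google 1TB crawl).
vocab = ["hotel", "motel", "banana"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Every distinct pair has dot product zero: no similarity information at all.
print(one_hot["hotel"] @ one_hot["motel"])   # 0.0
print(one_hot["hotel"] @ one_hot["banana"])  # 0.0
```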
- "distributional" contrasts with the "denotational" representations above
- "You shall know a word by the company it keeps" (JR Firth, 1957, 11) -- Wittgenstein proposed similar
- "One of the most successful ideas of modern statistical NLP" (Manning)
- there is great value in representing a word by means of its neighbours
- earliest paper on this idea is Rumelhart, Hinton and Williams, 1986 (see the authors' letter to Nature as well)
- the most influential modern paper on this is Bengio et al., 2003, though it was largely ignored at the time
- Collobert & Weston, 2008 revived Bengio's modern approach
- a recent, even simpler and faster model is Mikolov et al., 2013
- "predict between every word and its context words"
- contains two algorithms:
- Skip-Grams (SG)
- predict context words given a target (this is position independent)
- Continuous Bag of Words (CBOW)
- predict the target word from a bag-of-words context, i.e., from the average of the context-word vectors
- contains two (moderately efficient) training methods
- hierarchical softmax
- negative sampling
- Christopher covers the Skip-Gram algorithm and (inefficient) naive softmax training
- predict (i.e., output) the probability of context words, e.g., p(w_{t-2} | w_t) and p(w_{t+5} | w_t), in a word window of length 2m around the center word at position t
- predict surrounding words in a window of length 2m around every word in the corpus
- objective function: maximise the log-probability of any context word given the current center word (written out after this list)
- every word has two vectors, which makes the math easier (and gives slightly better results):
- as center word
- as output word
- for capturing semantics (as opposed to syntax), ignoring word order improves results
- one could "cheat" at getting a high similarity score by making the vectors arbitrarily long (cosine distance can't be gamed this way)
- essentially "dynamic logistic regression"
- analogies
- linear relationships between vectors efficiently encode dimensions of similarity
- analogies testing dimensions of similarity can be solved quite well by doing vector subtraction in the embedding space
- e.g.:
- syntactically:
- x_apple - x_apples ~= x_car - x_cars ~= x_family - x_families (and similarly for verb and adjective morphological forms)
- semantically (SemEval-2012 Task 2):
- x_shirt - x_clothing ~= x_chair - x_furniture
- x_king - x_man ~= x_queen - x_woman
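- a minimal sketch of solving an analogy a : b :: c : ? by vector arithmetic plus cosine similarity (the embedding dictionary `vecs` is a stand-in for trained word vectors); normalising by vector length is what prevents the "arbitrarily long vectors" cheat mentioned above

```python
import numpy as np

def solve_analogy(a, b, c, vecs):
    """Return the word whose vector is closest (by cosine) to vecs[b] - vecs[a] + vecs[c]."""
    target = vecs[b] - vecs[a] + vecs[c]
    target = target / np.linalg.norm(target)
    candidates = (w for w in vecs if w not in {a, b, c})
    return max(candidates, key=lambda w: (vecs[w] @ target) / np.linalg.norm(vecs[w]))

# With well-trained embeddings one would hope for:
# solve_analogy("man", "king", "woman", vecs) -> "queen"
```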
- symmetric windows (left and right context treated equivalently) with lengths of five to ten are common
- in 2017, Lecture 3:
- Richard indicates that building the co-occurrence matrix directly achieves a similar outcome to word2vec, but by counting co-occurrences rather than by minimising the w2v cost function
- store most of the important information in:
- dense vector: fixed, small number of dimensions
- typically 25-100 dimensions
- dimensionality-reduction methods:
- singular value decomposition of co-occurrence matrix X (a toy sketch appears below)
- hacks to X:
- function words (the, he, has) are too frequent
- ...therefore, syntax has too much impact
- solutions:
- min(X, t), with t~100
- ignore them all
- ramped windows that count closer words more
- Pearson correlations instead of counts, with negative values floored at zero
- and more
- problems with SVD:
- computational cost scales quadratically for an n-by-m matrix (O(mn^2) flops when n < m)
- impractical to fit millions of words or documents
- challenging to incorporate new words or documents
- the learning regime is different relative to DL models
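- a toy sketch of this count-based route, using the three-sentence corpus from the lecture slides (window size and dimensionality k chosen arbitrarily): build a window-based co-occurrence matrix X, then keep the top-k singular directions as dense word vectors

```python
import numpy as np

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
window = 1  # symmetric window: one word on each side

vocab = sorted({word for sentence in corpus for word in sentence})
idx = {word: i for i, word in enumerate(vocab)}

# Window-based co-occurrence counts
X = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                X[idx[word], idx[sentence[j]]] += 1

# Reduce with SVD; the first k columns of U (scaled by the singular values) serve as word vectors
U, S, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k] * S[:k]
print(dict(zip(vocab, word_vectors.round(2))))
```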
- directly learn low-dimensional word vectors
- an old idea
- learn representations by backpropagating errors (Rumelhart, Hinton & Williams, 1986)
- a neural probabilistic language model (Bengio et al., 2003)
- NLP ("almost") from scratch (Collobert & Weston, 2008)
- word2vec (Mikolov et al. 2013)
- recent
- simpler
- faster
- Richard Socher's table summarises the traditional (count-based) and contemporary (DL) NLP approaches with respect to:
- techniques
- key papers
- pros and cons
From Lecture 3 ("More on Word Vectors" in 2016; "GloVe" in 2017)
- with a large corpus (e.g., Google 1TB corpus):
- you could have 40B tokens and windows
- you would not have enough memory for a single update with gradient descent, or you'd have to wait a very long time
- ergo, full-batch gradient descent is impractical for (probably) all neural nets
- stochastic gradient descent:
- the solution!
- update model parameters after each window t
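- concretely, instead of a full-corpus gradient, the parameters are nudged after each window (or small mini-batch of windows) t, with learning rate alpha:

```latex
\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_{\theta} J_t(\theta)
```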
- Global Vectors for Word Representation (Pennington, Socher, & Manning (2014))
- the "best of both worlds" (Richard), i.e., count-based and direct prediction approaches
- enables fast training and scales to huge corpora
- nevertheless has good performance even with a small corpus and/or small vectors (because of efficient use of statistics)
- we have U and V from all vectors u and v
- both capture similar co-occurrence information
- best solution is to sum them, i.e., X_final = U + V
- one of many hyperparameters explored in Pennington et al.
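- for reference, the GloVe objective from Pennington et al. (2014) fits log co-occurrence counts, with a weighting function f that caps the influence of very frequent pairs; the two vector sets w and w-tilde correspond to the U and V summed above:

```latex
J = \sum_{i,j=1}^{V} f\left(X_{ij}\right) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}
```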
- intrinsic:
- evaluation on a specific/intermediate subtask
- fast to compute
- helps to understand that system
- not clear if really helpful unless correlation to real task is established
- e.g.:
- word-vector analogies, e.g., man:woman :: king:queen
- can evaluate systematically with the Word Vector Analogies list from Google (N.B.: I was unable to find this immediately and Richard's link didn't work)
- word similarity scores from here
- extrinsic:
- evaluation on a real task
- can take a long time to compute accuracy
- doesn't clarify whether subsystem is the problem or its interaction with other subsystems
- progress is made if replacing one subsystem with another improves accuracy
- e.g., named-entity recognition
- extrinsic evaluations are the focus of CS224d/n
- in Richard's example, the following choices affect accuracy (one should evaluate these for one's own data set and model):
- increasing dimensionality of vector space up to ~300
- window size of eight around each center word suits GloVe vectors well
- asymmetric context (e.g., only words to the left) underperforms symmetric context
- more training time (i.e., iterations) improves accuracy
- more data improves accuracy (e.g., [Common Crawl with 42B tokens] > [Wikipedia with 1.6B tokens])
- N.B.: better results could potentially be obtained on downstream tasks with different hyperparameters
- ability to also classify words and phrases accurately
- that said, for some advanced models encountered later in the course, like sentiment analysis, re-training the word vector space from scratch can yield much better results
- Huang, Socher, Manning & Ng (2012) makes strides toward resolving word ambiguity (by allowing the same word to occupy multiple locations in vector space)
- the next three lectures of the course, which cover the use of neural networks to learn word-vector features