1. How do we have usable meaning in a computer?
Common solution: WordNet, a thesaurus containing lists of synonym sets and hypernyms (“is a” relationships)
2. Problems with resources like WordNet
- Missing nuance (e.g., “proficient” is listed as a synonym for “good”, which is only appropriate in some contexts)
- Missing new meanings of words; impossible to keep up-to-date
- Requires human labor to create and adapt
- Can’t compute accurate word similarity
3. Representing words as discrete symbols
In traditional NLP, we regard words as discrete symbols: hotel, conference, motel - a localist representation. Such words can be represented by one-hot vectors:
motel = [0 1 0 0 0 0 0 0]
hotel = [0 0 0 0 0 1 0 0]
Vector dimension = number of words in vocabulary (e.g., 500,000)
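To make the localist representation concrete, here is a minimal Python sketch with a toy eight-word vocabulary (the real vocabulary would have on the order of 500,000 words); the particular words and index assignments are illustrative assumptions.

```python
# Toy vocabulary; indices chosen so the vectors match the examples above
# (motel -> index 1, hotel -> index 5). Illustrative only.
vocab = ["the", "motel", "is", "near", "a", "hotel", "in", "Seattle"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot (localist) vector: a single 1 at the word's index, 0 elsewhere.
    The vector dimension equals the vocabulary size."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("motel"))  # [0, 1, 0, 0, 0, 0, 0, 0]
print(one_hot("hotel"))  # [0, 0, 0, 0, 0, 1, 0, 0]
```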
4. Problems with words as discrete symbols
Problem: in web search, if a user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”.
But, motel = [0 1 0 0 0 0 0 0]
hotel = [0 0 0 0 0 1 0 0]
These two vectors are orthogonal. There is no natural notion of similarity for one-hot vectors! (A small sketch after this list demonstrates this.)
- Could we rely on WordNet’s list of synonyms to get similarity? It is well known to fail badly due to incompleteness
- Instead: learn to encode similarity in the vectors themselves
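A minimal sketch of the problem, reusing the toy one-hot vectors from above: the dot product of any two distinct one-hot vectors is 0, so every pair of different words looks equally unrelated.

```python
# One-hot vectors for "motel" and "hotel" (toy 8-word vocabulary from the
# earlier sketch).
motel = [0, 1, 0, 0, 0, 0, 0, 0]
hotel = [0, 0, 0, 0, 0, 1, 0, 0]

# Their dot product is 0: the vectors are orthogonal, so one-hot vectors carry
# no graded notion of similarity between related words.
print(sum(m * h for m, h in zip(motel, hotel)))  # 0
```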
5. Representing words by their context
- Distributional semantics: A word’s meaning is given by the words that frequently appear close-by
- “You shall know a word by the company it keeps” (J. R. Firth, 1957)
- When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window)
- Use the many contexts of w to build up a representation of w
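A minimal sketch of this idea, assuming a tiny made-up corpus and a window of size 2 (both are illustrative, not taken from these notes): count which words occur inside the window around each word, and use those many contexts as the raw material for its representation.

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus and window size (assumptions for this sketch).
corpus = "government debt problems turning into banking crises as reported".split()
window = 2  # fixed-size context window

# For each word, count the words appearing within the window around it.
context_counts = defaultdict(Counter)
for t, center in enumerate(corpus):
    for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
        if j != t:
            context_counts[center][corpus[j]] += 1

print(context_counts["banking"])
# Counter({'turning': 1, 'into': 1, 'crises': 1, 'as': 1})
```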
6. Word vectors
We will build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts.
banking = [0.286 0.792 -0.177 … 0.271]
Note: word vectors are sometimes called word embeddings or word representations. They are a distributed representation.
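A minimal sketch of what dense vectors buy us: cosine similarity gives a graded notion of relatedness. The 4-dimensional vectors below are made-up illustrations (real learned word vectors typically have hundreds of dimensions), and "finance" and "tomato" are hypothetical neighbors.

```python
import numpy as np

# Made-up 4-dimensional word vectors for illustration only.
banking = np.array([0.286, 0.792, -0.177, 0.271])
finance = np.array([0.301, 0.751, -0.140, 0.230])   # hypothetical similar word
tomato  = np.array([-0.410, 0.052, 0.613, -0.322])  # hypothetical unrelated word

def cosine(a, b):
    """Cosine similarity: dot product of the two vectors after normalization."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine(banking, finance), 3))  # close to 1: similar directions
print(round(cosine(banking, tomato), 3))   # well below 1: dissimilar directions
```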
7. Word meaning as a neural word vector - visualization
8. Word2vec: Overview
- We have a large corpus of text
- Every word in a fixed vocabulary is represented by a vector
- Go through each position t in the text, which has a center word c and context (“outside”) words o (see the pair-generation sketch after this list)
- Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
- Keep adjusting the word vectors to maximize this probability
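A minimal sketch of the “go through each position t” step: generate (center, outside) training pairs from a window of size m around every position. The corpus and window size are illustrative assumptions.

```python
# Illustrative corpus and window size m (assumptions for this sketch).
corpus = "problems turning into banking crises as reported".split()
m = 2

# For every position t, pair the center word with each outside word in its window.
pairs = []
for t, center in enumerate(corpus):
    for j in range(max(0, t - m), min(len(corpus), t + m + 1)):
        if j != t:
            pairs.append((center, corpus[j]))

print(pairs[:4])
# [('problems', 'turning'), ('problems', 'into'),
#  ('turning', 'problems'), ('turning', 'into')]
```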
9. Word2vec: objective function
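The formula from this slide is not reproduced in the notes; for reference, the standard word2vec (skip-gram) likelihood over a corpus of positions t = 1, ..., T with window size m, and the objective obtained from it, are:

$$
L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t;\ \theta)
$$

$$
J(\theta) = -\frac{1}{T}\log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t;\ \theta)
$$

Maximizing the likelihood L(θ) is equivalent to minimizing the average negative log-likelihood J(θ), which is the quantity the word vectors are adjusted to reduce.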
10. Word2vec Overview with Vectors
11. Word2vec: prediction function
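Again for reference (the slide’s formula is not in the notes): the standard word2vec prediction function is a softmax over dot products, using two vectors per word, v_w when w is a center word and u_w when it is an outside word:

$$
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
$$

The dot product u_o^T v_c measures how similar o is to c, and the softmax normalizes these scores into a probability distribution over the vocabulary V.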
12. Training a model by optimizing parameters
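In the standard setup (stated here for reference, not reproduced from the notes), θ collects both vectors of every word, so with d-dimensional vectors and V vocabulary words θ has 2dV entries:

$$
\theta = \begin{bmatrix} v_{\text{aardvark}} \\ \vdots \\ v_{\text{zebra}} \\ u_{\text{aardvark}} \\ \vdots \\ u_{\text{zebra}} \end{bmatrix} \in \mathbb{R}^{2dV}
$$

Training then means adjusting θ, for example by gradient descent, to minimize the objective J(θ).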
13. To train the model: compute all vector gradients!
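A minimal NumPy sketch of one training step, assuming the naive-softmax loss J = -log P(o | c) for a single (center, outside) pair; the vocabulary size, vector dimension, learning rate, and random initialization are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes: V words, d-dimensional vectors (assumptions for this sketch).
V, d = 8, 4
rng = np.random.default_rng(0)
center_vecs = rng.normal(scale=0.1, size=(V, d))   # v_w: vectors used in the center role
outside_vecs = rng.normal(scale=0.1, size=(V, d))  # u_w: vectors used in the outside role

def loss_and_grads(c, o):
    """Naive-softmax loss -log P(o | c) and its gradients for one (c, o) pair."""
    v_c = center_vecs[c]
    scores = outside_vecs @ v_c            # u_w . v_c for every word w
    probs = np.exp(scores - scores.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()                   # probs[w] = P(w | c)
    loss = -np.log(probs[o])
    # dJ/dv_c = -u_o + sum_w P(w|c) u_w  (observed minus expected outside vector)
    grad_center = -outside_vecs[o] + probs @ outside_vecs
    # dJ/du_w = (P(w|c) - 1[w == o]) v_c for every word w
    err = probs.copy()
    err[o] -= 1.0
    grad_outside = np.outer(err, v_c)
    return loss, grad_center, grad_outside

# One stochastic gradient descent step on the single pair (center=2, outside=5).
learning_rate = 0.1
loss, g_center, g_outside = loss_and_grads(2, 5)
center_vecs[2] -= learning_rate * g_center
outside_vecs -= learning_rate * g_outside
print(f"loss for this pair: {loss:.3f}")
```

The gradient with respect to v_c is the observed outside vector minus the model’s expected outside vector, which is why repeating such steps over many (c, o) pairs pulls the vectors of words that share contexts toward each other.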