1. How do we have usable meaning in a computer?

Common solution: WordNet, a thesaurus containing lists of synonym sets and hypernyms (“is a” relationships)

2. Problems with resources like WordNet

  • Missing nuance
  • Missing new meanings of words; impossible to keep up to date
  • Subjective
  • Requires human labor to create and adapt
  • Can’t compute accurate word similarity

3. Representing words as discrete symbols

In traditional NLP, we regard words as discrete symbols: hotel, conference, motel. This is a localist representation. Words can be represented by one-hot vectors:

motel = [0 1 0 0 0 0 0 0]

hotel = [0 0 0 0 0 1 0 0]

Vector dimension = number of words in vocabulary (e.g., 500,000)
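
A minimal sketch of how such vectors can be built. The 8-word vocabulary and the helper name `one_hot` are assumptions made for illustration; the word positions are chosen only so that the output matches the vectors above.

```python
def one_hot(word, vocab):
    """Return a one-hot list for `word` over the ordered vocabulary `vocab`."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1  # raises ValueError for out-of-vocabulary words
    return vec

# Hypothetical 8-word vocabulary (not from the notes).
vocab = ["the", "motel", "a", "conference", "in", "hotel", "at", "walk"]

motel = one_hot("motel", vocab)
hotel = one_hot("hotel", vocab)
print(motel)  # [0, 1, 0, 0, 0, 0, 0, 0]
print(hotel)  # [0, 0, 0, 0, 0, 1, 0, 0]
```

With a realistic vocabulary, each vector would have hundreds of thousands of entries, all zero except one.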

4. Problems with words as discrete symbols

Problem: in web search, if a user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”.

But:

motel = [0 1 0 0 0 0 0 0]

hotel = [0 0 0 0 0 1 0 0]

These two vectors are orthogonal. There is no natural notion of similarity for one-hot vectors!


  • Could try to rely on WordNet’s list of synonyms to get similarity? But it is well known to fail badly, e.g. because of incompleteness
  • Instead: learn to encode similarity in the vectors themselves
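
The orthogonality of one-hot vectors can be checked with a dot product. This is a small sketch; the helper name `dot` is an assumption for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length lists of numbers."""
    return sum(a * b for a, b in zip(u, v))

motel = [0, 1, 0, 0, 0, 0, 0, 0]
hotel = [0, 0, 0, 0, 0, 1, 0, 0]

motel_hotel = dot(motel, hotel)  # 0: distinct words never share a non-zero position
motel_motel = dot(motel, motel)  # 1: a word only "matches" itself
print(motel_hotel, motel_motel)
```

Every pair of distinct words gets the same score of 0, so the representation cannot tell that “motel” is closer to “hotel” than to “conference”.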

5. Representing words by their context

  • Distributional semantics: a word’s meaning is given by the words that frequently appear close by
  • “You shall know a word by the company it keeps”
  • When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window)
  • Use the many contexts of w to build up a representation of w
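
The fixed-size window idea can be sketched as follows. The function name `context_windows`, the window size of 2, and the example sentence are assumptions chosen for illustration.

```python
def context_windows(tokens, window_size):
    """Yield (center_word, context_words) for every position in `tokens`."""
    for t, center in enumerate(tokens):
        left = tokens[max(0, t - window_size):t]       # up to window_size words before t
        right = tokens[t + 1:t + 1 + window_size]      # up to window_size words after t
        yield center, left + right

# Illustrative sentence; every token is unique, so a dict keyed by word is safe here.
tokens = "government debt problems turning into banking crises".split()
contexts = dict(context_windows(tokens, window_size=2))
print(contexts["banking"])  # ['turning', 'into', 'crises']
```

In a real corpus the same word occurs at many positions, and it is the collection of all those contexts that is used to build its representation.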

6. Word vectors

We will build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts

banking = [0.286 0.792 -0.177 … 0.271]

Note: word vectors are sometimes called word embeddings or word representations. They are a distributed representation.
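
Unlike one-hot vectors, dense vectors support a graded notion of similarity. The sketch below uses cosine similarity. The 4-dimensional vectors are made-up values for illustration only, not learned embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length lists of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (assumed values, not trained).
vectors = {
    "hotel":   [0.8, 0.1, 0.6, -0.2],
    "motel":   [0.7, 0.2, 0.5, -0.1],
    "banking": [-0.3, 0.9, -0.1, 0.4],
}

sim_hotel_motel = cosine(vectors["hotel"], vectors["motel"])
sim_hotel_banking = cosine(vectors["hotel"], vectors["banking"])
print(sim_hotel_motel > sim_hotel_banking)  # True for these toy values
```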

7. Word meaning as a neural word vector - visualization


8. Word2vec: Overview


  • We have a large corpus of text
  • Every word in a fixed vocabulary is represented by a vector
  • Go through each position t in the text, which has a center word c and context (“outside”) words o
  • Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
  • Keep adjusting the word vectors to maximize this probability


9. Word2vec: objective function
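
As a sketch, assuming the standard skip-gram formulation: for each position t = 1, …, T, predict the context words within a window of size m given the center word w_t. The symbols T, m, and θ (all the word vectors) are assumptions introduced here. The likelihood, and the objective as the average negative log-likelihood, would then be:

```latex
L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)

J(\theta) = -\frac{1}{T} \log L(\theta)
          = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)
```

Minimizing J(θ) is equivalent to maximizing the probability of the observed context words.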


10. Word2vec Overview with Vectors


11. Word2vec: prediction function
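
A minimal sketch, assuming the usual softmax form P(o | c) = exp(u_o · v_c) / Σ_w exp(u_w · v_c), where each word w has a center vector v_w and an outside vector u_w. The function name `softmax_prob` and the 2-dimensional toy vectors are assumptions for illustration.

```python
import math

def softmax_prob(o, c, U, V):
    """P(o | c) as a softmax over dot products u_w . v_c for all words w."""
    v_c = V[c]
    scores = {w: sum(a * b for a, b in zip(u_w, v_c)) for w, u_w in U.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {w: math.exp(s - max_score) for w, s in scores.items()}
    return exp_scores[o] / sum(exp_scores.values())

# Hypothetical toy parameters: V holds center vectors, U holds outside vectors.
V = {"banking": [1.0, 0.5], "crises": [0.8, 0.6], "hotel": [-0.7, 0.2]}
U = {"banking": [0.3, 0.1], "crises": [0.9, 0.4], "hotel": [-0.5, 0.3]}

probs = {w: softmax_prob(w, "banking", U, V) for w in U}
print(max(probs, key=probs.get))  # crises
```

The dot product measures similarity, exponentiation makes every score positive, and the normalization over the vocabulary turns the scores into a probability distribution.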


12. Training a model by optimizing parameters


13. To train the model: compute all vector gradients!
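
A sketch of one such gradient, assuming the softmax prediction function P(o | c) = exp(u_o · v_c) / Σ_w exp(u_w · v_c). Under that assumption, the gradient of log P(o | c) with respect to the center vector is u_o − Σ_w P(w | c) u_w, i.e. observed minus expected outside vector. The names, toy vectors, and learning rate below are all hypothetical; the code checks the formula against a finite-difference estimate and takes one gradient step.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_prob(o, v_c, U):
    """log P(o | c) under a softmax over the outside vectors in U."""
    scores = {w: dot(u_w, v_c) for w, u_w in U.items()}
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return scores[o] - log_z

def grad_v_c(o, v_c, U):
    """Analytic gradient of log P(o | c) w.r.t. v_c: u_o - sum_w P(w | c) u_w."""
    probs = {w: math.exp(log_prob(w, v_c, U)) for w in U}
    expected = [sum(probs[w] * U[w][k] for w in U) for k in range(len(v_c))]
    return [U[o][k] - expected[k] for k in range(len(v_c))]

# Hypothetical toy outside vectors and center vector.
U = {"banking": [0.3, 0.1], "crises": [0.9, 0.4], "hotel": [-0.5, 0.3]}
v_c = [1.0, 0.5]
grad = grad_v_c("crises", v_c, U)

# Central finite differences as an independent check of the analytic gradient.
eps = 1e-6
numeric = []
for k in range(len(v_c)):
    plus, minus = list(v_c), list(v_c)
    plus[k] += eps
    minus[k] -= eps
    numeric.append((log_prob("crises", plus, U) - log_prob("crises", minus, U)) / (2 * eps))

# One gradient-ascent step on log P(o | c) (assumed learning rate).
learning_rate = 0.1
before = log_prob("crises", v_c, U)
v_c_new = [x + learning_rate * g for x, g in zip(v_c, grad)]
after = log_prob("crises", v_c_new, U)
print(after > before)  # the step increases the log-probability
```

Full training would also update the outside vectors and repeat this over every (center, context) pair in the corpus.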