1. Review: Main idea of word2vec


2. Word2vec parameters and computations


3. Word2vec maximizes objective function by putting similar words nearby in space

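For reference, a recap in the usual notation (my own summary, not text from the slides): with window size m, corpus w_1, …, w_T, center vectors v_w and outside vectors u_w, the model maximizes the likelihood of the context words around each center word, i.e. it minimizes the average negative log-likelihood

$$
J(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log P\!\left(w_{t+j} \mid w_t;\ \theta\right),
\qquad
P(o \mid c) = \frac{\exp\!\left(u_o^{\top} v_c\right)}{\sum_{w \in V} \exp\!\left(u_w^{\top} v_c\right)}
$$

Similar words end up nearby in space because pairs that frequently co-occur need large dot products u_o^T v_c to make P(o | c) large.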

4. Optimization: Gradient Descent

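A one-line reminder (standard notation, assumed rather than copied from the slide): gradient descent minimizes J(θ) by repeatedly stepping against the gradient with learning rate α,

$$
\theta^{\text{new}} = \theta^{\text{old}} - \alpha\,\nabla_{\theta} J(\theta)
$$

Computing the gradient over the whole corpus for every step is very expensive, which motivates the stochastic version below.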

5. Stochastic Gradient Descent


6. Stochastic gradients with word vectors!

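To make the sparsity point concrete, here is a minimal NumPy sketch (my own illustration; the names V, U and sgd_step are not from the lecture) of one SGD step on the naive-softmax loss −log P(o | c) for a single (center, outside) pair. Only one row of the center-vector matrix V receives a nonzero gradient; with negative sampling (below), the updates to U become sparse as well.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sgd_step(V, U, c, o, lr=0.05):
    """One SGD step on -log P(o | c) with the naive softmax.
    V: (vocab, d) center-word vectors, U: (vocab, d) outside-word vectors,
    c, o: integer indices of the center word and the observed outside word."""
    v_c = V[c]
    y_hat = softmax(U @ v_c)        # predicted distribution over the vocabulary
    y_hat[o] -= 1.0                 # y_hat - y, where y is one-hot at o
    grad_vc = U.T @ y_hat           # dJ/dv_c
    grad_U = np.outer(y_hat, v_c)   # dJ/dU
    V[c] -= lr * grad_vc            # sparse: only the center word's row changes
    U -= lr * grad_U                # dense under naive softmax
    return V, U
```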

7. Word2vec: More details

Why two vectors? -> Easier optimization; average the two at the end.

  • But it can be done with just one vector per word

Two model variants:

  1. Skip-grams (SG): predict context (“outside”) words (position independent) given the center word
  2. Continuous Bag of Words (CBOW): predict the center word from a (bag of) context words

We presented the Skip-gram model.

Additional efficiency in training: negative sampling. So far we have focused on the naive softmax (a simpler, but expensive, training method).

8. The skip-gram model with negative sampling

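For completeness, the per-pair objective for skip-gram with negative sampling, written from memory of Mikolov et al. (2013): make the true (center, outside) pair look real and K sampled noise words look fake, with negatives drawn from the unigram distribution raised to the 3/4 power,

$$
J_{\text{neg}}(u_o, v_c) = -\log \sigma\!\left(u_o^{\top} v_c\right)
  \;-\; \sum_{k=1}^{K} \log \sigma\!\left(-u_k^{\top} v_c\right),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
$$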

9. But why not capture co-occurrence counts directly?

With a co-occurrence matrix X

  • Two options: windows vs. full documents
  • Window: similar to word2vec, use a window around each word -> captures both syntactic (POS) and semantic information
  • A word-document co-occurrence matrix will give general topics (all sports terms will have similar entries), leading to “Latent Semantic Analysis”

10. Example: Window-based co-occurrence matrix

  • Window length 1 (more common: 5-10)
  • Symmetric (irrelevant whether left or right context)
  • Example corpus: I like deep learning. I like NLP. I enjoy flying.

Resulting co-occurrence matrix (assuming windows do not cross sentence boundaries):

| counts   | I | like | enjoy | deep | learning | NLP | flying | . |
|----------|---|------|-------|------|----------|-----|--------|---|
| I        | 0 | 2    | 1     | 0    | 0        | 0   | 0      | 0 |
| like     | 2 | 0    | 0     | 1    | 0        | 1   | 0      | 0 |
| enjoy    | 1 | 0    | 0     | 0    | 0        | 0   | 1      | 0 |
| deep     | 0 | 1    | 0     | 0    | 1        | 0   | 0      | 0 |
| learning | 0 | 0    | 0     | 1    | 0        | 0   | 0      | 1 |
| NLP      | 0 | 1    | 0     | 0    | 0        | 0   | 0      | 1 |
| flying   | 0 | 0    | 1     | 0    | 0        | 0   | 0      | 1 |
| .        | 0 | 0    | 0     | 0    | 1        | 1   | 1      | 0 |
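A short Python sketch (variable names are mine) that builds this matrix from the example corpus, under the same assumption that windows do not cross sentence boundaries:

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        # count neighbors within the window, symmetrically
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                X[idx[w], idx[words[j]]] += 1
```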

11. Problems with simple co-occurrence vectors

  • Increase in size with vocabulary
  • Very high dimensional: requires a lot of storage
  • Subsequent classification models have sparsity issues -> Models are less robust

12. Solution: Low dimensional vectors

  • Idea: store “most” of the important information in a fixed, small number of dimensions: a dense vector
  • Usually 25-1000 dimensions, similar to word2vec
  • How to reduce the dimensionality?

13. Method 1: Dimensionality Reduction on X


14. Simple SVD word vectors in Python

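A minimal sketch of what such a script might look like (assuming the X and vocab from the co-occurrence example above; this is not the original slide's code): run an SVD on X, keep the first k left-singular vectors as k-dimensional word vectors, and plot the words in 2-D.

```python
import numpy as np
import matplotlib.pyplot as plt

# X and vocab as built in the co-occurrence example above
U, s, Vh = np.linalg.svd(X, full_matrices=False)

# First k singular vectors, optionally scaled by the singular values
k = 2
word_vectors = U[:, :k] * s[:k]

# Scatter the words in the reduced 2-D space
for i, w in enumerate(vocab):
    plt.text(word_vectors[i, 0], word_vectors[i, 1], w)
plt.xlim(word_vectors[:, 0].min() - 1, word_vectors[:, 0].max() + 1)
plt.ylim(word_vectors[:, 1].min() - 1, word_vectors[:, 1].max() + 1)
plt.show()
```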

15. Hacks to X (several used in Rohde et al. 2005)

Scaling the counts in the cells can help a lot

  • Problem: function words (the, he, has) are too frequent -> syntax has too much impact. Some fixes: min(X, t) with t ~ 100, or ignore them all (see the sketch after this list)
  • Ramped windows that count closer words more
  • Use Pearson correlations instead of counts, then set negative values to 0
  • Etc.
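A tiny illustration of the first fix (my own example, reusing the X and vocab built earlier; the stop-word list is purely for demonstration):

```python
import numpy as np

t = 100
X_clipped = np.minimum(X, t)                       # cap raw counts at t

stop_words = {"the", "he", "has"}                  # or drop function words entirely
keep = [i for i, w in enumerate(vocab) if w not in stop_words]
X_pruned = X_clipped[np.ix_(keep, keep)]
```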

16. Interesting semantic patterns emerge in the vectors

(Figures from Rohde et al. 2005.)

17. Count based vs. direct prediction


18. Encoding meaning in vector differences [Pennington, Socher, and Manning, EMNLP 2014]

19. Combining the best of both worlds: GloVe [Pennington et al., EMNLP 2014]
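The GloVe objective from the cited paper (reproduced from memory; w_i are word vectors, w̃_j context vectors, the b terms biases, X_ij co-occurrence counts, and f a weighting function that caps the influence of very frequent pairs):

$$
J = \sum_{i,j=1}^{V} f\!\left(X_{ij}\right)
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}
$$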

20. GloVe results

21. How to evaluate word vectors?

  • Related to general evaluation in NLP: Intrinsic vs. Extrinsic
  • Intrinsic:
    • Evaluation on a specific/intermediate subtask
    • Fast to compute
    • Helps to understand that system
    • Not clear if really helpful unless correlation to real task is established
  • Extrinsic:
    • Evaluation on a real task
    • Can take a long time to compute accuracy
    • Unclear if the subsystem is the problem, or its interaction with other subsystems
    • If replacing exactly one subsystem with another improves accuracy -> Winning!

22. Intrinsic word vector evaluation

23. GloVe Visualizations

24. Details of intrinsic word vector evaluation
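The standard intrinsic test is the word-vector analogy a : b :: c : ?, answered by maximizing cosine similarity to x_b − x_a + x_c. A minimal sketch (function name and arguments are mine), assuming the rows of W are L2-normalized word vectors:

```python
import numpy as np

def analogy(a, b, c, W, vocab, idx):
    """Return the word d maximizing cos(w_d, w_b - w_a + w_c)."""
    target = W[idx[b]] - W[idx[a]] + W[idx[c]]
    target /= np.linalg.norm(target)
    scores = W @ target
    for w in (a, b, c):                 # common convention: exclude the query words
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

# e.g. analogy("man", "woman", "king", W, vocab, idx) -> hopefully "queen"
```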

25. Analogy evaluation and hyperparameters

26. On the Dimensionality of Word Embedding [Zi Yin and Yuanyuan Shen, NeurIPS 2018]

27. Another intrinsic word vector evaluation

28. Correlation evaluation
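One common setup (a sketch assuming a WordSim-353-style dataset of word pairs with human similarity judgments; names are mine): compare the model's cosine similarities against the human scores with a Spearman rank correlation.

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_eval(pairs, human_scores, W, idx):
    """pairs: list of (word1, word2); human_scores: matching human judgments.
    Returns the Spearman rank correlation between model cosine similarity
    and the human scores."""
    model_scores = []
    for w1, w2 in pairs:
        v1, v2 = W[idx[w1]], W[idx[w2]]
        model_scores.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```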

29. Word senses and word sense ambiguity

  • Most words have lots of meanings!
    • Especially common words
    • Especially words that have existed for a long time
  • Example: pike
  • Does one vector capture all these meanings, or do we have a mess?

30. Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)

31. Linear Algebraic Structure of Word Senses, with Applications to Polysemy (Arora, …, Ma, …, TACL 2018)

32. Extrinsic word vector evaluation