### 1. Review: Main idea of word2vec

### 2. Word2vec parameters and computations

### 3. Word2vec maximizes objective function by putting similar words nearby in space

### 4. Optimization: Gradient Descent

### 5. Stochastic Gradient Descent

### 6. Stochastic gradients with word vectors!

### 7. Word2vec: More details

Why two vectors per word? -> Easier optimization. Average the two at the end.

- But it can be done with just one vector per word

Two model variants:

1. Skip-gram (SG): predict context (“outside”) words (position independent) given the center word
2. Continuous Bag of Words (CBOW): predict the center word from a (bag of) context words

We presented: the Skip-gram model

Additional efficiency in training:

1. Negative sampling

So far: focus on naive softmax (a simpler, but expensive, training method)

### 8. The skip-gram model with negative sampling
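The negative-sampling objective replaces the expensive softmax over the full vocabulary with a binary discrimination: reward the true (center, outside) pair and push away K randomly sampled words. A minimal numpy sketch of the per-pair loss, with hypothetical vector names (`v_c` for the center vector, `u_o` for the true outside vector, `U_neg` for sampled negatives):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_c, u_o, U_neg):
    """Negative-sampling loss for one (center, outside) pair.

    v_c   : center word vector, shape (d,)
    u_o   : true outside word vector, shape (d,)
    U_neg : K sampled negative word vectors, shape (K, d)
    """
    pos = np.log(sigmoid(u_o @ v_c))             # reward the true context word
    neg = np.sum(np.log(sigmoid(-U_neg @ v_c)))  # push sampled words away
    return -(pos + neg)

rng = np.random.default_rng(0)
d, K = 4, 5
loss = sgns_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(K, d)))
```

In practice the negatives are drawn from a unigram distribution raised to the 3/4 power rather than uniformly, which up-weights rarer words.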

### 9. But why not capture co-occurrence counts directly?

With a co-occurrence matrix X

- 2 options: windows vs. full document
- Window: Similar to word2vec, use window around each word -> captures both syntactic (POS) and semantic information
- Word-document co-occurrence matrix will give general topics (all sports terms will have similar entries) leading to “Latent Semantic Analysis”

### 10. Example: Window based co-occurrence matrix

- Window length 1 (more common: 5-10)
- Symmetric (irrelevant whether left or right context)
- Example corpus: I like deep learning. I like NLP. I enjoy flying.
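With window length 1 and symmetric counting, the matrix for this corpus can be built in a few lines; a sketch (treating each sentence separately and keeping the period as a token):

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in tokens:
    for i, w in enumerate(sent):
        # count every word within `window` positions of w (left and right)
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1
```

For example, “I” and “like” co-occur twice, and symmetric counting makes X equal to its transpose.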

### 11. Problems with simple co-occurrence vectors

- Increase in size with vocabulary
- Very high dimensional: requires a lot of storage
- Subsequent classification models have sparsity issues -> Models are less robust

### 12. Solution: Low dimensional vectors

- Idea: store “most” of the important information in a fixed, small number of dimensions: a dense vector
- Usually 25-1000 dimensions, similar to word2vec
- How to reduce the dimensionality?

### 13. Method 1: Dimensionality Reduction on X

### 14. Simple SVD word vectors in Python
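A minimal sketch of the idea: run `numpy.linalg.svd` on the toy co-occurrence matrix from the example corpus and keep the top-k singular directions as dense word vectors. The matrix values below follow the window-1 symmetric counts for “I like deep learning. I like NLP. I enjoy flying.”:

```python
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],   # I
    [2, 0, 0, 1, 0, 1, 0, 0],   # like
    [1, 0, 0, 0, 0, 0, 1, 0],   # enjoy
    [0, 1, 0, 0, 1, 0, 0, 0],   # deep
    [0, 0, 0, 1, 0, 0, 0, 1],   # learning
    [0, 1, 0, 0, 0, 0, 0, 1],   # NLP
    [0, 0, 1, 0, 0, 0, 0, 1],   # flying
    [0, 0, 0, 0, 1, 1, 1, 0],   # .
], dtype=float)

# Full SVD; U's columns are ordered by singular value
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
vectors = U[:, :k] * s[:k]   # rank-k dense word vectors, one row per word
```

Plotting the two columns of `vectors` against each other is what produces the classic 2-D scatter of words.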

### 15. Hacks to X (several used in Rohde et al. 2005)

Scaling the counts in the cells can help a lot

- Problem: function words (the, he, has) are too frequent -> syntax has too much impact. Some fixes:
  - Cap the counts: min(X, t), with t ~ 100
  - Ignore them all
- Ramped windows that count closer words more
- Use Pearson correlations instead of counts, then set negative values to 0
- Etc.
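Two of these hacks are one-liners; a sketch (the ramp weight here is one plausible linear scheme, not the exact one from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 500, size=(6, 6)).astype(float)

# Fix 1: cap raw counts so overly frequent function words cannot dominate
t = 100
X_capped = np.minimum(X, t)

# Fix 2 (sketch): ramped window -- a co-occurrence at distance d within a
# window of size w contributes (w - d + 1) / w instead of a full count of 1,
# so closer words count more
def ramp_weight(d, w=5):
    return (w - d + 1) / w
```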

### 16. Interesting semantic patterns emerge in the vectors

### 17. Count based vs. direct prediction

### 18. Encoding meaning in vector differences [Pennington, Socher, and Manning, EMNLP 2014]

### 19. Combining the best of both worlds GloVe [Pennington et al., EMNLP 2014]

### 20. GloVe results

### 21. How to evaluate word vectors?

- Related to general evaluation in NLP: Intrinsic vs. Extrinsic
- Intrinsic:
  - Evaluation on a specific/intermediate subtask
  - Fast to compute
  - Helps to understand that system
  - Not clear if really helpful unless correlation to a real task is established

- Extrinsic:
  - Evaluation on a real task
  - Can take a long time to compute accuracy
  - Unclear if the subsystem is the problem, or its interaction with other subsystems
  - If replacing exactly one subsystem with another improves accuracy -> Winning!

### 22. Intrinsic word vector evaluation
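The standard intrinsic test is word vector analogies: a : b :: c : ?, answered by the word whose vector has the highest cosine similarity to x_b - x_a + x_c, excluding the query words. A sketch with hand-made toy vectors (any real evaluation would use trained embeddings such as GloVe):

```python
import numpy as np

# Toy vectors with made-up dimensions (royalty, male, female)
vecs = {
    "king":  np.array([0.9, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def analogy(a, b, c):
    """a : b :: c : ?  via cosine similarity to x_b - x_a + x_c."""
    target = vecs[b] - vecs[a] + vecs[c]
    best, best_sim = None, -np.inf
    for w, x in vecs.items():
        if w in (a, b, c):
            continue  # exclude the query words themselves
        sim = x @ target / (np.linalg.norm(x) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best
```

Excluding the input words matters: x_c itself is often the nearest vector to the target, so without the exclusion the method looks artificially broken.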

### 23. Glove Visualizations

### 24. Details of intrinsic word vector evaluation

### 25. Analogy evaluation and hyperparameters

### 26. On the Dimensionality of Word Embedding [Zi Yin and Yuanyuan Shen, NeurIPS 2018]

### 27. Another intrinsic word vector evaluation

### 28. Correlation evaluation
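Correlation evaluation compares model cosine similarities against human similarity judgments (as in datasets like WordSim-353) using Spearman rank correlation. A minimal sketch with invented scores and a tie-free rank correlation (real evaluations would use `scipy.stats.spearmanr`):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation between two score lists (no tie handling)."""
    ra = np.argsort(np.argsort(a)).astype(float)  # ranks of a
    rb = np.argsort(np.argsort(b)).astype(float)  # ranks of b
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / (np.linalg.norm(ra) * np.linalg.norm(rb)))

# Hypothetical human judgments vs. model cosine similarities for word pairs
human = [9.8, 7.4, 3.1, 1.0]
model = [0.92, 0.70, 0.35, 0.20]
rho = spearman(human, model)
```

Rank correlation only asks whether the model orders the pairs the same way humans do, so the raw cosine scale does not matter.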

### 29. Word senses and word sense ambiguity

- Most words have lots of meanings!
- Especially common words
- Especially words that have existed for a long time

- Example: pike
- Does one vector capture all these meanings, or do we have a mess?