Word2Vec
Where word embeddings began
Word2Vec was published by Tomas Mikolov and colleagues at Google in 2013. The core idea is straightforward: words that appear in similar contexts carry similar meanings.
Two architectures exist. CBOW takes surrounding words as input and predicts the center word. Skip-gram does the reverse: one word predicts its neighbors.
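The two training directions can be made concrete by listing the (input, target) pairs each architecture extracts from a sentence. This is a minimal sketch; the sentence and window size are made-up examples.

```python
# Toy sentence and context window (hypothetical parameters).
tokens = ["the", "cute", "cat", "sat", "down"]
window = 2

# Skip-gram pairs: (center word, one context word) -- the center predicts each neighbor.
pairs = []
for i, center in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((center, tokens[j]))

# CBOW reverses the direction: (all context words, center word).
cbow = []
for i, center in enumerate(tokens):
    ctx = [tokens[j] for j in range(max(0, i - window), min(len(tokens), i + window + 1)) if j != i]
    cbow.append((ctx, center))

print(pairs[:4])   # skip-gram: one training example per (center, neighbor) pair
print(cbow[2])     # CBOW: the neighbors of "cat" jointly predict "cat"
```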
Structure: a shallow 2-layer neural network
Word2Vec is technically a neural network, just a very shallow one. Input layer → one hidden layer → output layer. That's it.
Feed in "cat" as a one-hot vector, and the output layer predicts context words like "cute" or "animal." After training, you extract the hidden layer's weight matrix: those weights ARE the word vectors.
If the hidden layer size is 300, every word becomes a 300-dimensional vector. Cosine similarity between these vectors gives you semantic distance.
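Cosine similarity is just the normalized dot product of two embedding vectors. A minimal sketch with made-up 4-dimensional vectors standing in for real 300-dimensional embeddings:

```python
import numpy as np

# Hypothetical low-dimensional stand-ins for trained word vectors.
cat = np.array([0.8, 0.1, 0.6, 0.2])
dog = np.array([0.7, 0.2, 0.5, 0.3])
car = np.array([0.1, 0.9, 0.0, 0.8])

def cosine(a, b):
    # Dot product divided by the product of vector lengths: 1 = same direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(cat, dog))  # high: vectors point in similar directions
print(cosine(cat, car))  # lower: semantically farther apart
```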
Training your own vs. pretrained models
Word2Vec trains a neural network directly on your data. You need tens or hundreds of thousands of examples, and a GPU helps. The payoff is domain-optimized vectors.
Pretrained models like OpenAI's text-embedding-3-small are already trained on massive internet text. You call an API and get vectors back. They work with small datasets and handle multiple languages out of the box, but they won't be tuned to your domain.
Why it matters for RecSys
The real value is that vector arithmetic works: "king − man + woman ≈ queen." Apply this idea to items instead of words and you get Item2Vec.
Compute similarity in the learned vector space, and you can find "items like this one" without explicit tags or categories.
How It Works
Extract word-context pairs from large text corpus
Train prediction with shallow neural net (1 hidden layer)
Speed up training with Negative Sampling
Hidden layer weights become word embeddings
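The four steps above can be sketched end to end in numpy: extract (center, context) pairs, score them with a one-hidden-layer network, train with negative sampling, and read the embeddings off the hidden-layer weights. This is a toy implementation under made-up hyperparameters, not the original C tool.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: toy corpus of tokenized sentences (hypothetical data).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cute cat is an animal".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V, D, window, n_neg, lr = len(vocab), 16, 2, 5, 0.05

# Step 2: a shallow network = two weight matrices; W_in holds the embeddings.
W_in = rng.normal(0, 0.1, (V, D))
W_out = rng.normal(0, 0.1, (V, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Step 3: skip-gram with negative sampling, plain SGD.
for epoch in range(200):
    for sent in corpus:
        ids = [idx[w] for w in sent]
        for pos, center in enumerate(ids):
            lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
            for ctx in ids[lo:pos] + ids[pos + 1:hi]:
                # One real context word plus n_neg randomly sampled "fake" ones.
                targets = [ctx] + list(rng.integers(0, V, n_neg))
                labels = np.array([1.0] + [0.0] * n_neg)
                h = W_in[center].copy()                # hidden layer = embedding lookup
                scores = sigmoid(W_out[targets] @ h)   # "is this a real context word?"
                grad = (scores - labels)[:, None]      # binary cross-entropy gradient
                W_in[center] -= lr * (grad * W_out[targets]).sum(axis=0)
                W_out[targets] -= lr * grad * h

# Step 4: W_in's rows are the word embeddings; rank neighbors by cosine similarity.
def most_similar(word, k=3):
    v = W_in[idx[word]]
    sims = W_in @ v / (np.linalg.norm(W_in, axis=1) * np.linalg.norm(v) + 1e-9)
    return [vocab[i] for i in np.argsort(-sims) if vocab[i] != word][:k]

print(most_similar("cat"))
```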
Pros
- ✓ Fast training, handles millions of words
- ✓ Captures semantic relations via vector arithmetic
Cons
- ✗ Cannot distinguish polysemy (bank = finance? riverbank?)
- ✗ Ignores long-range dependencies beyond the context window