Word2Vec
Where word embeddings began
Word2Vec was published by Tomas Mikolov and colleagues at Google in 2013. The core idea is straightforward: words that appear in similar contexts carry similar meanings.
Two architectures exist. CBOW takes surrounding words as input and predicts the center word. Skip-gram does the reverse: one word predicts its neighbors.
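The two training directions can be made concrete by listing the (input, target) pairs each architecture extracts from a sentence. This is a minimal sketch; the sentence and window size are made-up examples.

```python
# Toy sentence and context window (hypothetical parameters).
tokens = ["the", "cute", "cat", "sat", "down"]
window = 2

# Skip-gram pairs: (center word, one context word) -- the center predicts each neighbor.
pairs = []
for i, center in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((center, tokens[j]))

# CBOW reverses the direction: (all context words, center word).
cbow = []
for i, center in enumerate(tokens):
    ctx = [tokens[j] for j in range(max(0, i - window), min(len(tokens), i + window + 1)) if j != i]
    cbow.append((ctx, center))

print(pairs[:4])   # skip-gram: one training example per (center, neighbor) pair
print(cbow[2])     # CBOW: the neighbors of "cat" jointly predict "cat"
```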
Structure: a shallow 2-layer neural network
Word2Vec is technically a neural network, just a very shallow one. Input layer → one hidden layer → output layer. That's it.
Feed in "cat" as a one-hot vector, and the output layer predicts context words like "cute" or "animal." After training, you extract the hidden layer's weight matrix: those weights ARE the word vectors.
If the hidden layer size is 300, every word becomes a 300-dimensional vector. Cosine similarity between these vectors gives you semantic distance.
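Cosine similarity is just the normalized dot product of two embedding vectors. A minimal sketch with made-up 4-dimensional vectors standing in for real 300-dimensional embeddings:

```python
import numpy as np

# Hypothetical low-dimensional stand-ins for trained word vectors.
cat = np.array([0.8, 0.1, 0.6, 0.2])
dog = np.array([0.7, 0.2, 0.5, 0.3])
car = np.array([0.1, 0.9, 0.0, 0.8])

def cosine(a, b):
    # Dot product divided by the product of vector lengths: 1 = same direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(cat, dog))  # high: vectors point in similar directions
print(cosine(cat, car))  # lower: semantically farther apart
```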
Training your own vs. pretrained models
Word2Vec trains a neural network directly on your data. You need tens or hundreds of thousands of examples, and a GPU helps. The payoff is domain-optimized vectors.
Pretrained models like OpenAI's text-embedding-3-small are already trained on massive internet text. You call an API and get vectors back. They work with small datasets and handle multiple languages out of the box, but they won't be tuned to your domain.
Why it matters for RecSys
The real value is that vector arithmetic works: "king − man + woman ≈ queen." Apply this idea to items instead of words and you get Item2Vec.
Compute similarity in the learned vector space, and you can find "items like this one" without explicit tags or categories.
How It Works
Extract word-context pairs from large text corpus
Train prediction with shallow neural net (1 hidden layer)
Speed up training with Negative Sampling
Hidden layer weights become word embeddings
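The four steps above can be sketched end to end in numpy: extract (center, context) pairs, score them with a one-hidden-layer network, train with negative sampling, and read the embeddings off the hidden-layer weights. This is a toy implementation under made-up hyperparameters, not the original C tool.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: toy corpus of tokenized sentences (hypothetical data).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cute cat is an animal".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V, D, window, n_neg, lr = len(vocab), 16, 2, 5, 0.05

# Step 2: a shallow network = two weight matrices; W_in holds the embeddings.
W_in = rng.normal(0, 0.1, (V, D))
W_out = rng.normal(0, 0.1, (V, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Step 3: skip-gram with negative sampling, plain SGD.
for epoch in range(200):
    for sent in corpus:
        ids = [idx[w] for w in sent]
        for pos, center in enumerate(ids):
            lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
            for ctx in ids[lo:pos] + ids[pos + 1:hi]:
                # One real context word plus n_neg randomly sampled "fake" ones.
                targets = [ctx] + list(rng.integers(0, V, n_neg))
                labels = np.array([1.0] + [0.0] * n_neg)
                h = W_in[center].copy()                # hidden layer = embedding lookup
                scores = sigmoid(W_out[targets] @ h)   # "is this a real context word?"
                grad = (scores - labels)[:, None]      # binary cross-entropy gradient
                W_in[center] -= lr * (grad * W_out[targets]).sum(axis=0)
                W_out[targets] -= lr * grad * h

# Step 4: W_in's rows are the word embeddings; rank neighbors by cosine similarity.
def most_similar(word, k=3):
    v = W_in[idx[word]]
    sims = W_in @ v / (np.linalg.norm(W_in, axis=1) * np.linalg.norm(v) + 1e-9)
    return [vocab[i] for i in np.argsort(-sims) if vocab[i] != word][:k]

print(most_similar("cat"))
```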
Pros
- ✓ Fast training, handles millions of words
- ✓ Captures semantic relations via vector arithmetic
Cons
- ✗ Cannot distinguish polysemy (bank = finance? riverbank?)
- ✗ Ignores long-range dependencies beyond the context window