What Is Embedding?
You need numbers to compute: the starting point of all AI
Computers don't know that "cat" and "dog" are similar. They're just different strings. Embedding solves this: similar meanings become nearby numbers, different meanings become distant numbers.
"cat" β [0.82, -0.15, 0.41, ...]
"dog" β [0.79, -0.12, 0.38, ...]
"car" β [-0.33, 0.67, -0.21, ...]
Cat and dog vectors are close, car is far. That's all embedding is.
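"Close" and "far" are usually measured with cosine similarity. A minimal sketch using the toy 3-dimensional vectors above (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): near 1.0 = same direction (similar meaning),
    # near 0 = unrelated, negative = pointing away from each other
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat = [0.82, -0.15, 0.41]
dog = [0.79, -0.12, 0.38]
car = [-0.33, 0.67, -0.21]

print(cosine_similarity(cat, dog))  # close to 1.0 -> similar
print(cosine_similarity(cat, car))  # negative -> dissimilar
```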
Embedding is the goal, methods vary
"I want to convert words into vectors" β that's the goal called embedding. The method has evolved over time.
Statistics-based (pre-embedding era):
TF-IDF: Vectors from word frequency. No semantics
LSA: Dimensionality reduction via matrix decomposition. Slight semantic capture
Neural network-based (2013~):
Word2Vec: 2-layer neural net. Made "king - man + woman = queen" possible
GloVe: Hybrid of co-occurrence statistics + matrix decomposition
FastText: Character-level decomposition, robust to typos and neologisms
Transformer-based (2018~):
BERT: Context-aware embeddings. Same "bank" gets different vectors for finance vs. riverbank
GPT family: Large-scale pretrained models
OpenAI text-embedding-3-small: 1536 dimensions, multilingual, ready via API
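The Word2Vec-style analogy arithmetic ("king - man + woman = queen") can be illustrated with hand-made 2-dimensional vectors, one "maleness" axis and one "royalty" axis. These toy values are purely illustrative; real Word2Vec vectors are learned from a corpus, not hand-set:

```python
# Toy 2-d vectors: [maleness, royalty]. Illustrative only; real
# Word2Vec learns ~100-300 dimensional vectors from text.
vectors = {
    "king":  [1.0, 1.0],
    "queen": [0.0, 1.0],
    "man":   [1.0, 0.0],
    "woman": [0.0, 0.0],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

# king - man + woman: subtract "maleness", keep "royalty"
result = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
print(result)  # [0.0, 1.0] -> the vector for "queen"
```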
Why embeddings matter for RecSys
The core question in recommendation is "will this user like this item?"
Put users and items in the same vector space, and distance becomes preference. Close means likely to enjoy, far means unlikely.
This idea is shared by most RecSys approaches from Matrix Factorization to Two-Tower models.
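The shared idea can be sketched with made-up user and item vectors in the same space; all values and labels below are hypothetical:

```python
# Hypothetical vectors: user and items live in the same 3-d space.
user         = [0.9, 0.1, -0.2]   # say, likes action, dislikes romance
item_action  = [0.8, 0.0, -0.1]   # an action movie
item_romance = [-0.7, 0.2, 0.9]   # a romance movie

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Higher score = closer in the space = predicted preference
print(dot(user, item_action))   # high -> likely to enjoy
print(dot(user, item_romance))  # low -> unlikely
```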
Training your own vs. pretrained models
Training your own creates vectors from your data. Domain-optimized but needs tens of thousands of examples.
Pretrained models (OpenAI, etc.) take your text and return vectors from an already-trained model. Works with small datasets and supports multilingual out of the box. Most projects should start here.
How It Works
Input unstructured data (text, images, etc.)
Embedding model converts to fixed-size numeric vector
Measure similarity via vector distance (cosine similarity, etc.)
Close vectors = semantically similar things
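The four steps above, sketched end to end. The lookup table stands in for a real embedding model, and every vector in it is invented for illustration:

```python
import math

# Step 2 stand-in: a real system would call an embedding model here;
# this lookup table of invented vectors is just a stub.
FAKE_MODEL = {
    "cat": [0.82, -0.15, 0.41],
    "dog": [0.79, -0.12, 0.38],
    "car": [-0.33, 0.67, -0.21],
}

def embed(text):
    # Step 1 -> 2: unstructured input in, fixed-size vector out
    return FAKE_MODEL[text]

def cosine(a, b):
    # Step 3: similarity via vector distance
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Step 4: rank candidates by similarity to the query
query = embed("cat")
ranked = sorted(["dog", "car"], key=lambda t: cosine(query, embed(t)),
                reverse=True)
print(ranked)  # ["dog", "car"] -> "dog" is semantically closest
```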
Pros
- ✓ Makes unstructured data mathematically comparable
- ✓ Multilingual and multimodal unification happens naturally in vector space
Cons
- ✗ Vectors alone cannot explain "why" things are similar (not interpretable)
- ✗ Embedding quality heavily depends on training data: biased data produces biased vectors