Transformer-based Recommendation
Self-Attention processes the entire behavior sequence at once
SASRec (2018) is the landmark model: the first to apply Transformer Self-Attention to recommendations.
Same goal as GRU4Rec: predict the next item. The difference is in the methodology.
GRU vs Transformer
A GRU processes the sequence front to back. For the 10th item's representation to carry information from the 1st item, the hidden state must propagate through 9 steps. Information dilutes along the way.
A Transformer lets every position attend to every other position, so the 10th item can reference the 1st directly. That makes it stronger on long-range dependencies and far better suited to GPU parallelization.
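To make the contrast concrete, here is a minimal PyTorch sketch (the library choice, dimensions, and head count are illustrative assumptions, not from the original): the GRU builds step 10's state through 9 sequential updates, while self-attention gives step 10 a direct weight on step 1.

```python
# Minimal sketch: sequential GRU updates vs. one-shot self-attention
# over the same 10-step sequence of item embeddings (values are illustrative).
import torch
import torch.nn as nn

seq_len, d = 10, 64
x = torch.randn(1, seq_len, d)  # one user's sequence of 10 item embeddings

# GRU: information from step 1 reaches step 10 only through 9 hidden-state updates
gru = nn.GRU(d, d, batch_first=True)
gru_out, _ = gru(x)              # (1, 10, 64), computed step by step

# Self-attention: step 10 attends to step 1 in a single weighted sum
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
attn_out, weights = attn(x, x, x)
print(weights[0, -1, 0])         # direct attention weight of position 10 on position 1
```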
Real-world impact
Alibaba, JD.com, and other large e-commerce platforms reported significant CTR improvements after replacing GRU4Rec with Transformer-based models. However, the larger models make serving latency an issue.
How It Works
1. Add Position Encoding to the item sequence
2. Learn item-item relationships via Multi-Head Self-Attention
3. Refine representations with Feed-Forward + Layer Norm
4. Predict the next item from the last position's output (see the sketch below)
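A minimal SASRec-style sketch in PyTorch, assuming a single Transformer block and illustrative hyperparameters (embedding size, head count, and sequence length are not the paper's exact values). It walks through the four steps above; the real model stacks multiple blocks and adds dropout.

```python
# SASRec-style next-item prediction sketch (single block, illustrative sizes).
import torch
import torch.nn as nn

class SASRecSketch(nn.Module):
    def __init__(self, num_items, max_len=50, d=64, heads=2):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, d, padding_idx=0)  # id 0 = padding
        self.pos_emb = nn.Embedding(max_len, d)                        # learned positions
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)

    def forward(self, seq):                                   # seq: (batch, max_len) item ids
        L = seq.size(1)
        pos = torch.arange(L, device=seq.device)
        h = self.item_emb(seq) + self.pos_emb(pos)            # step 1: add position encoding
        causal = torch.triu(                                   # mask future positions
            torch.ones(L, L, dtype=torch.bool, device=seq.device), diagonal=1)
        a, _ = self.attn(h, h, h, attn_mask=causal)            # step 2: multi-head self-attention
        h = self.norm1(h + a)                                  # step 3: residual + layer norm
        h = self.norm2(h + self.ffn(h))                        #         feed-forward + layer norm
        last = h[:, -1, :]                                     # step 4: last position's output
        return last @ self.item_emb.weight.T                   # scores over all items

# Usage: score the next item for one user's most recent 50 item ids
model = SASRecSketch(num_items=10_000)
seq = torch.randint(1, 10_001, (1, 50))
next_item_scores = model(seq)                                  # (1, 10001)
print(next_item_scores.argmax(dim=-1))                         # predicted next item id
```

Sharing the item embedding table between the input and the output scoring layer keeps the sketch small; the causal mask ensures each position only attends to earlier items, which is what makes training on every prefix of the sequence possible.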
Pros
- ✓ Directly captures long-range dependencies (advantage over GRU)
- ✓ Fast training via GPU parallelization
Cons
- ✗ O(n²) attention complexity: cost grows quadratically with sequence length
- ✗ Serving latency harder to manage than with GRU