Transformer-based Recommendation
Self-Attention processes the entire behavior sequence at once
SASRec (2018) is the landmark model: the first to apply Transformer Self-Attention to recommendations.
Same goal as GRU4Rec: predict the next item. The difference is in the methodology.
GRU vs Transformer
A GRU processes the sequence front to back. For the 10th item's representation to carry information from the 1st item, the hidden state must propagate through 9 steps. Information dilutes along the way.
A Transformer lets every position attend to every other position, so the 10th item can reference the 1st directly. That makes it stronger on long-range dependencies and far better suited to GPU parallelization.
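To make the contrast concrete, here is a minimal PyTorch sketch (the library choice, dimensions, and head count are illustrative assumptions, not from the original): the GRU builds step 10's state through 9 sequential updates, while self-attention gives step 10 a direct weight on step 1.

```python
# Minimal sketch: sequential GRU updates vs. one-shot self-attention
# over the same 10-step sequence of item embeddings (values are illustrative).
import torch
import torch.nn as nn

seq_len, d = 10, 64
x = torch.randn(1, seq_len, d)  # one user's sequence of 10 item embeddings

# GRU: information from step 1 reaches step 10 only through 9 hidden-state updates
gru = nn.GRU(d, d, batch_first=True)
gru_out, _ = gru(x)              # (1, 10, 64), computed step by step

# Self-attention: step 10 attends to step 1 in a single weighted sum
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
attn_out, weights = attn(x, x, x)
print(weights[0, -1, 0])         # direct attention weight of position 10 on position 1
```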
Real-world impact
Alibaba, JD.com, and other large e-commerce platforms reported significant CTR improvements after replacing GRU4Rec with Transformer-based models. However, the larger models make serving latency an issue.
How It Works
1. Add Position Encoding to the item sequence
2. Learn item-item relationships via Multi-Head Self-Attention
3. Refine representations with Feed-Forward + Layer Norm
4. Predict the next item from the last position's output (see the sketch below)
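A minimal SASRec-style sketch in PyTorch, assuming a single Transformer block and illustrative hyperparameters (embedding size, head count, and sequence length are not the paper's exact values). It walks through the four steps above; the real model stacks multiple blocks and adds dropout.

```python
# SASRec-style next-item prediction sketch (single block, illustrative sizes).
import torch
import torch.nn as nn

class SASRecSketch(nn.Module):
    def __init__(self, num_items, max_len=50, d=64, heads=2):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, d, padding_idx=0)  # id 0 = padding
        self.pos_emb = nn.Embedding(max_len, d)                        # learned positions
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)

    def forward(self, seq):                                   # seq: (batch, max_len) item ids
        L = seq.size(1)
        pos = torch.arange(L, device=seq.device)
        h = self.item_emb(seq) + self.pos_emb(pos)            # step 1: add position encoding
        causal = torch.triu(                                   # mask future positions
            torch.ones(L, L, dtype=torch.bool, device=seq.device), diagonal=1)
        a, _ = self.attn(h, h, h, attn_mask=causal)            # step 2: multi-head self-attention
        h = self.norm1(h + a)                                  # step 3: residual + layer norm
        h = self.norm2(h + self.ffn(h))                        #         feed-forward + layer norm
        last = h[:, -1, :]                                     # step 4: last position's output
        return last @ self.item_emb.weight.T                   # scores over all items

# Usage: score the next item for one user's most recent 50 item ids
model = SASRecSketch(num_items=10_000)
seq = torch.randint(1, 10_001, (1, 50))
next_item_scores = model(seq)                                  # (1, 10001)
print(next_item_scores.argmax(dim=-1))                         # predicted next item id
```

Sharing the item embedding table between the input and the output scoring layer keeps the sketch small; the causal mask ensures each position only attends to earlier items, which is what makes training on every prefix of the sequence possible.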
Pros
- ✓ Directly captures long-range dependencies (advantage over GRU)
- ✓ Fast training via GPU parallelization
Cons
- ✗ O(n²) attention complexity: cost grows quadratically with sequence length
- ✗ Serving latency harder to manage than with GRU