🎭

BERT4Rec

Masked Language Model for recommendations — learning preferences by filling in the blanks

SASRec uses left-to-right, unidirectional attention. BERT4Rec (Sun et al., 2019) goes bidirectional.

In the sequence [A, B, C, D, E], replace C with [MASK] and predict it using context from both sides: A, B, D, and E. This is exactly the Cloze task BERT performs on text.
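The masking step can be sketched as follows. This is a minimal illustration, not the paper's exact recipe: `MASK = 0` is a hypothetical reserved item id, and the masking proportion is a tunable hyperparameter.

```python
import random

MASK = 0  # hypothetical reserved id for the [MASK] token

def cloze_mask(sequence, mask_prob=0.2, seed=None):
    """Randomly replace items with MASK; return (masked_seq, labels).

    labels[i] holds the original item at masked positions and None
    elsewhere, so the loss is computed only where a mask was placed.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for item in sequence:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(item)
        else:
            masked.append(item)
            labels.append(None)
    return masked, labels

masked, labels = cloze_mask(["A", "B", "C", "D", "E"], mask_prob=0.4, seed=3)
```

The model is then trained to recover `labels` at the masked positions only; unmasked positions contribute no loss.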

Unidirectional vs Bidirectional

SASRec: A → B → C → ? (each position sees only its left context)
BERT4Rec: A → ? → C → D (a masked position sees both sides)
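The difference comes down to the attention mask. A minimal sketch with plain Python booleans (real implementations use tensor masks, but the pattern is the same):

```python
def causal_mask(n):
    # SASRec-style: position i may attend only to positions j <= i
    # (lower-triangular mask, left-to-right)
    return [[j <= i for j in range(n)] for i in range(n)]

def full_mask(n):
    # BERT4Rec-style: every position may attend to every position
    return [[True] * n for _ in range(n)]
```

With `causal_mask`, position 0 can never see positions 1 and 2; with `full_mask`, every position attends everywhere, which is what lets a masked item use context from both sides.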

Bidirectional attention seems intuitively stronger, but serving still requires next-item prediction, so at inference time you mask the last position and predict it.

Train-serve gap

Training masks random positions, while serving always predicts the last position. This mismatch can hurt performance, and recent variants explicitly address it.
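One common mitigation, which I believe the original paper also uses, is to mix in training samples where only the final position is masked, so the training objective occasionally matches the serving task. A sketch (with `MASK = 0` as a hypothetical reserved id):

```python
MASK = 0  # hypothetical reserved id for the [MASK] token

def mask_last(sequence):
    """Serving-style training sample: mask only the final position."""
    masked = sequence[:-1] + [MASK]
    labels = [None] * (len(sequence) - 1) + [sequence[-1]]
    return masked, labels

m, l = mask_last(["A", "B", "C"])
# m == ["A", "B", 0], l == [None, None, "C"]
```

Blending these samples with random Cloze samples narrows the train-serve gap without changing the model.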

How It Works

1. Replace random positions in the user's behavior sequence with [MASK]
2. Predict the masked items with a bidirectional Transformer
3. Each item's representation reflects context from both directions
4. At serving time, mask the last position to predict the next item
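The serving step above can be sketched as follows. `model` and `toy_model` are hypothetical stand-ins for a trained BERT4Rec that maps a sequence of item ids to per-position score vectors over the catalog:

```python
MASK = 0  # hypothetical reserved id for the [MASK] token

def predict_next(model, history, max_len=50):
    # Serving: append [MASK] at the end, truncate to the model's
    # window, and read the prediction at the final position.
    seq = (history + [MASK])[-max_len:]
    scores = model(seq)   # shape: (len(seq), num_items)
    return scores[-1]     # scores over candidates for the next item

def toy_model(seq):
    # Stub standing in for a trained network, just to show shapes:
    # uniform scores over a 10-item catalog at every position.
    num_items = 10
    return [[1.0 / num_items] * num_items for _ in seq]

next_scores = predict_next(toy_model, [3, 7, 2])
```

Ranking `next_scores` (e.g. top-k) then yields the recommendation list.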

Pros

  • Bidirectional context — richer representations than unidirectional
  • Can reuse BERT ecosystem tools and techniques

Cons

  • Objective mismatch between training and serving (train-serve gap)
  • Not always superior to SASRec (depends on data)

Use Cases

  • Recovering missing preferred items from user history
  • Next-click prediction in e-commerce sessions

References