Gated Bidirectional Linear Attention for Generative Retrieval

AIPR assessment

Problem difficulty: hard, competitive, and saturated. Long-sequence encoder efficiency for industrial recommendation sits in a crowded area with strong prior work on FlashAttention, linear attention, state-space models, and systems-level scaling. The strengths reinforce each other: a simple architectural idea, production-scale validation, explicit ablations, and latency measurements make the main result believable and useful. The weaknesses also interact: novelty is local, the closest bidirectio

Abstract

In recommender systems, generative retrieval typically uses an encoder-decoder setup: an encoder processes a user interaction history, and an autoregressive decoder then generates recommended items. In large-scale streaming services, active users accumulate very long histories over time. As histories grow, the encoder becomes a major latency bottleneck because softmax attention scales quadratically with sequence length. In our experiments, using bidirectional attention in the encoder substantially improves quality. However, most sub-quadratic attention methods focus on causal attention. We propose Gated Bidirectional Linear Attention (GBLA), a linear-time bidirectional attention layer that extends kernelized linear attention with three lightweight components: local causal mixing (Conv1D), sequence-level key gating for soft forgetting, and a gated RMSNorm output. On a large-scale Yandex Music dataset, a hybrid encoder that interleaves self-attention (SA) and GBLA in a 1:2 ratio (one SA block followed by two GBLA blocks) matches bidirectional self-attention quality. On H100 GPUs, GBLA reaches up to an $8.2\times$ single-layer speedup at a history length of 32768, compared to FlashAttention-v3. Finally, we show that the same hybrid design generalizes beyond our proprietary setting, consistently preserving self-attention retrieval quality on public Amazon benchmarks.
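
The linear-time claim in the abstract rests on the standard kernelized linear attention identity, which can be illustrated with a short NumPy sketch. This is not the authors' GBLA layer: the Conv1D mixing, key gating, and gated RMSNorm output are omitted, and the `elu(x) + 1` feature map, the function names, and the `eps` constant are assumptions chosen for illustration. The sketch only shows why the bidirectional (non-causal) case can be computed without forming the full n-by-n score matrix, and how a 1:2 SA-to-GBLA layout could be enumerated.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1: a positive feature map often used in linear attention.
    # Assumption for illustration; the abstract does not name the kernel.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def bidirectional_linear_attention(q, k, v, eps=1e-6):
    """Non-causal kernelized attention in O(n * d * d_v) time.

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    Every position attends to the whole sequence, so the key-value
    summary is computed once and shared by all queries.
    """
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v           # (d, d_v) summary of all key-value pairs
    z = phi_k.sum(axis=0)      # (d,) normalizer summary
    num = phi_q @ kv           # (n, d_v)
    den = phi_q @ z + eps      # (n,)
    return num / den[:, None]


def quadratic_reference(q, k, v, eps=1e-6):
    # Same computation via the explicit (n, n) score matrix, for checking.
    scores = feature_map(q) @ feature_map(k).T
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)


def hybrid_layout(num_blocks):
    # One self-attention (SA) block followed by two GBLA blocks, repeated.
    # Hypothetical helper; only the 1:2 ratio comes from the abstract.
    return ["SA" if i % 3 == 0 else "GBLA" for i in range(num_blocks)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((16, 8))
    k = rng.standard_normal((16, 8))
    v = rng.standard_normal((16, 4))
    fast = bidirectional_linear_attention(q, k, v)
    slow = quadratic_reference(q, k, v)
    print(np.allclose(fast, slow))
    print(hybrid_layout(6))
```

Both functions compute the same output; the first reorders the matrix products so that cost grows linearly with the history length n, which is the property the reported single-layer speedup depends on.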

Score Breakdown

Holistic Impression: 76
Novelty: 67
Rigor: 83
Applicability: 82
Clarity: 79
Citation: 69

Confidence: 85%
