Gated Bidirectional Linear Attention for Generative Retrieval

Information Retrieval | arXiv:2606.07317

AIPR assessment

Problem difficulty: hard, competitive, and saturated. Long-sequence encoder efficiency for industrial recommendation sits in a crowded area with strong prior work on FlashAttention, linear attention, state-space models, and systems-level scaling. The strengths reinforce each other: a simple architectural idea, production-scale validation, explicit ablations, and latency measurements make the main result believable and useful. The weaknesses also interact: novelty is local, the closest bidirectio

Abstract

In recommender systems, generative retrieval typically uses an encoder-decoder setup: an encoder processes a user interaction history, and an autoregressive decoder then generates recommended items. In large-scale streaming services, active users accumulate very long histories over time. As histories grow, the encoder becomes a major latency bottleneck because softmax attention scales quadratically with sequence length. In our experiments, using bidirectional attention in the encoder substantially improves quality. However, most sub-quadratic attention methods focus on causal attention. We propose Gated Bidirectional Linear Attention (GBLA), a linear-time bidirectional attention layer that extends kernelized linear attention with three lightweight components: local causal mixing (Conv1D), sequence-level key gating for soft forgetting, and a gated RMSNorm output. On a large-scale Yandex Music dataset, a hybrid encoder that interleaves self-attention (SA) and GBLA in a 1:2 ratio (one SA block followed by two GBLA blocks) matches bidirectional self-attention quality. On H100 GPUs, GBLA reaches up to an $8.2\times$ single-layer speedup at a history length of 32768, compared to FlashAttention-v3. Finally, we show that the same hybrid design generalizes beyond our proprietary setting, consistently preserving self-attention retrieval quality on public Amazon benchmarks.
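The linear-time claim in the abstract can be illustrated with a minimal NumPy sketch of non-causal (bidirectional) kernelized linear attention. This is not the authors' implementation. The `elu(x) + 1` feature map, the scalar per-position `key_gate`, and every function name below are assumptions made for illustration, and the Conv1D mixing and gated RMSNorm output are omitted. The key point is that the key-value summary is built once over the whole sequence, so the cost grows linearly in the history length rather than quadratically.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1, a positive feature map often used in linear attention.
    # Assumption: the abstract does not say which kernel GBLA uses.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def bidirectional_linear_attention(q, k, v, key_gate=None, eps=1e-6):
    """Non-causal linear attention in O(n * d * d_v) time.

    q, k: (n, d); v: (n, d_v); key_gate: optional (n,) weights in [0, 1].
    The gate is a hypothetical stand-in for "sequence-level key gating".
    """
    phi_q, phi_k = feature_map(q), feature_map(k)
    if key_gate is not None:
        phi_k = phi_k * key_gate[:, None]   # down-weight ("softly forget") keys
    kv = phi_k.T @ v                        # (d, d_v) summary of all positions
    z = phi_k.sum(axis=0)                   # (d,) normalizer summary
    num = phi_q @ kv                        # (n, d_v)
    den = phi_q @ z + eps                   # (n,)
    return num / den[:, None]


def quadratic_reference(q, k, v, key_gate=None, eps=1e-6):
    # Same computation via the explicit (n, n) score matrix, for checking only.
    phi_q, phi_k = feature_map(q), feature_map(k)
    if key_gate is not None:
        phi_k = phi_k * key_gate[:, None]
    scores = phi_q @ phi_k.T
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)


def block_pattern(num_blocks):
    # 1:2 interleaving from the abstract: one SA block, then two GBLA blocks.
    return ["SA" if i % 3 == 0 else "GBLA" for i in range(num_blocks)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, d_v = 128, 16, 16
    q, k, v = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d_v))
    gate = rng.uniform(0.1, 1.0, size=n)
    fast = bidirectional_linear_attention(q, k, v, gate)
    slow = quadratic_reference(q, k, v, gate)
    print("max abs difference:", np.abs(fast - slow).max())
    print(block_pattern(6))
```

Because every query reads the same `kv` and `z` summaries, each position attends to both past and future items, which is what separates this from the causal (prefix-sum) variants that most sub-quadratic methods target.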

Score Breakdown

Holistic Impression: 76
Novelty: 67
Rigor: 83
Applicability: 82
Clarity: 79
Citation: 69
Confidence: 85%
