Gated Bidirectional Linear Attention for Generative Retrieval
AIPR assessment
Problem difficulty: hard, competitive, and saturated. Long-sequence encoder efficiency for industrial recommendation sits in a crowded area with strong prior work on FlashAttention, linear attention, state-space models, and systems-level scaling. The strengths reinforce each other: a simple architectural idea, production-scale validation, explicit ablations, and latency measurements make the main result believable and useful. The weaknesses also interact: novelty is local, the closest bidirectio
Abstract
In recommender systems, generative retrieval typically uses an encoder-decoder setup: an encoder processes a user interaction history, and an autoregressive decoder then generates recommended items. In large-scale streaming services, active users accumulate very long histories over time. As histories grow, the encoder becomes a major latency bottleneck because softmax attention scales quadratically with sequence length. In our experiments, using bidirectional attention in the encoder substantially improves quality. However, most sub-quadratic attention methods focus on causal attention. We propose Gated Bidirectional Linear Attention (GBLA), a linear-time bidirectional attention layer that extends kernelized linear attention with three lightweight components: local causal mixing (Conv1D), sequence-level key gating for soft forgetting, and a gated RMSNorm output. On a large-scale Yandex Music dataset, a hybrid encoder that interleaves self-attention (SA) and GBLA in a 1:2 ratio (one SA block followed by two GBLA blocks) matches bidirectional self-attention quality. On H100 GPUs, GBLA reaches up to an $8.2\times$ single-layer speedup at a history length of 32768, compared to FlashAttention-v3. Finally, we show that the same hybrid design generalizes beyond our proprietary setting, consistently preserving self-attention retrieval quality on public Amazon benchmarks.
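The linear-time claim in the abstract can be illustrated with a minimal NumPy sketch of non-causal (bidirectional) kernelized linear attention. This is not the authors' implementation: the `elu(x) + 1` feature map, the per-position multiplicative `key_gate`, and the helper names `feature_map`, `bidirectional_linear_attention`, `quadratic_reference`, and `hybrid_layer_pattern` are all assumptions made for illustration, and the Conv1D mixing and gated RMSNorm output are omitted. The sketch only shows why summing over all keys once gives the same result as the quadratic form, and how a 1:2 SA-to-GBLA layer schedule might be laid out.

```python
import numpy as np


def feature_map(x):
    # Positive feature map elu(x) + 1 (an assumption; the abstract does not name the kernel).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def bidirectional_linear_attention(q, k, v, key_gate=None, eps=1e-6):
    """Non-causal linear attention over a single sequence.

    q, k: (n, d); v: (n, d_v); key_gate: optional (n,) weights in [0, 1].
    Cost is O(n * d * d_v) because the (n, n) score matrix is never formed.
    """
    phi_q = feature_map(q)
    phi_k = feature_map(k)
    if key_gate is not None:
        # Hypothetical soft forgetting: down-weight individual keys.
        phi_k = phi_k * key_gate[:, None]
    kv = phi_k.T @ v             # (d, d_v): summary of all keys and values
    z = phi_k.sum(axis=0)        # (d,): normalizer summary
    numerator = phi_q @ kv       # (n, d_v)
    denominator = phi_q @ z + eps
    return numerator / denominator[:, None]


def quadratic_reference(q, k, v, key_gate=None, eps=1e-6):
    # Same computation written with the explicit (n, n) score matrix, for checking.
    phi_q = feature_map(q)
    phi_k = feature_map(k)
    if key_gate is not None:
        phi_k = phi_k * key_gate[:, None]
    scores = phi_q @ phi_k.T
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)


def hybrid_layer_pattern(num_layers):
    # One SA block followed by two GBLA blocks, repeated (the 1:2 ratio in the abstract).
    return ["SA" if i % 3 == 0 else "GBLA" for i in range(num_layers)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, d_v = 16, 4, 3
    q = rng.normal(size=(n, d))
    k = rng.normal(size=(n, d))
    v = rng.normal(size=(n, d_v))
    gate = 1.0 / (1.0 + np.exp(-rng.normal(size=n)))  # sigmoid gate values
    fast = bidirectional_linear_attention(q, k, v, gate)
    slow = quadratic_reference(q, k, v, gate)
    print(np.allclose(fast, slow))
    print(hybrid_layer_pattern(6))
```

Because every query attends to the same set of keys in the bidirectional case, the key-value summary `kv` is computed once per sequence and reused for all positions, which is where the linear scaling in sequence length comes from in this sketch.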