FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing
AIPR assessment
Problem difficulty is high: efficient inference for OCR on large VLMs is a crowded and technically unforgiving setting, where many recent methods trade off accuracy against speed. The main strengths reinforce each other: the method is simple and training-free, and it is supported by consistent gains on realistic document benchmarks and across multiple model families. The weaknesses also compound, because the mechanism is only indirectly validated, the cache is not actually shrunk, and the pape
Abstract
Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0×.
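The distinction the abstract draws between evicting tokens and merely choosing which ones to attend can be illustrated with a toy example. The sketch below is not the authors' implementation: every function name, the `window` parameter, and the neighbourhood-based warm start are assumptions made for illustration, and the "focal layer" scoring is replaced by a plain scaled dot product over a single toy layer. It only shows that a per-step selection leaves the full cache in place, so a token skipped at one step stays available at the next.

```python
import math

def attention_scores(query, keys):
    """Scaled dot-product score of one query against each given key."""
    scale = 1.0 / math.sqrt(len(query))
    return [scale * sum(q * k for q, k in zip(query, key)) for key in keys]

def warm_start_candidates(previous, window, num_tokens):
    """Hypothetical reuse rule: revisit tokens within `window` of the last fixation."""
    candidates = set()
    for i in previous:
        candidates.update(range(max(0, i - window), min(num_tokens, i + window + 1)))
    return sorted(candidates)

def sparse_attend(query, keys, values, active):
    """Softmax attention restricted to `active` indices; the cache is never modified."""
    scores = attention_scores(query, [keys[i] for i in active])
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, active)) / total
            for d in range(dim)]

def decode_step(query, keys, values, keep_ratio, previous=None, window=2):
    """Pick this step's visual tokens, then attend to them only."""
    num_tokens = len(keys)
    budget = max(1, int(keep_ratio * num_tokens))
    if previous is None:
        candidates = list(range(num_tokens))  # first step: score every visual token
    else:
        candidates = warm_start_candidates(previous, window, num_tokens)
    scores = attention_scores(query, [keys[i] for i in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)[:budget]
    active = sorted(index for _, index in ranked)
    return active, sparse_attend(query, keys, values, active)

def make_key(i, num_tokens):
    """Toy key: a unit vector whose angle encodes the token position."""
    angle = 2.0 * math.pi * i / num_tokens
    return [math.cos(angle), math.sin(angle)]

# Toy cache of 20 visual tokens; the value of token i is simply [i].
keys = [make_key(i, 20) for i in range(20)]
values = [[float(i)] for i in range(20)]

# Two decoding steps whose queries point at neighbouring tokens (3, then 4).
first, out1 = decode_step(keys[3], keys, values, keep_ratio=0.05)
second, out2 = decode_step(keys[4], keys, values, keep_ratio=0.05, previous=first)
print(first, second, len(keys))
```

With a 5% budget over 20 tokens, each step attends to a single token, and the second step scores only the five candidates around the previous fixation instead of all 20. The cache length is unchanged after both steps, which is the property that separates this kind of selection from physical eviction.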