FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

AIPR assessment

Problem difficulty is high: efficient inference for OCR on large VLMs is a crowded and technically unforgiving setting, where many recent methods trade accuracy for speed. The main strengths reinforce each other: the method is simple, training-free, and supported by consistent gains on realistic document benchmarks and across multiple model families. The weaknesses also compound, because the mechanism is only indirectly validated, the cache is not actually shrunk, and the pape

Abstract

Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0$\times$.
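The per-step selection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `reuse_fraction` parameter, and the specific warm-start rule (carry over part of the previous step's selection before filling the rest by score) are assumptions made for illustration only. The sketch assumes one relevance score per visual token, as might be aggregated from the focal layers, and shows that attention is restricted to the selected indices while the cached values are left untouched.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_focal_tokens(focal_scores, keep_ratio, previous=None, reuse_fraction=0.5):
    """Choose which visual tokens to attend to at one decoding step.

    focal_scores   : one relevance score per visual token (assumed to come
                     from the focal layers; how they are aggregated is not
                     specified here).
    keep_ratio     : fraction of visual tokens to attend to, e.g. 0.05.
    previous       : indices selected at the previous step (hypothetical
                     warm-start input).
    reuse_fraction : hypothetical share of the budget reserved for tokens
                     carried over from the previous step.
    """
    n = len(focal_scores)
    k = max(1, round(keep_ratio * n))
    ranked = sorted(range(n), key=lambda i: focal_scores[i], reverse=True)

    chosen = []
    if previous:
        # Assumed warm start: keep the best-scoring tokens from the
        # previous fixation, up to a fixed share of the budget.
        budget = int(reuse_fraction * k)
        prev_set = set(previous)
        chosen = [i for i in ranked if i in prev_set][:budget]

    chosen_set = set(chosen)
    for i in ranked:
        if len(chosen) >= k:
            break
        if i not in chosen_set:
            chosen.append(i)
            chosen_set.add(i)
    return sorted(chosen)


def masked_attention(scores, values, keep):
    """Attend only to the `keep` indices; `scores` and `values` are not modified."""
    keep = sorted(keep)
    weights = softmax([scores[i] for i in keep])
    dim = len(values[0])
    out = [0.0] * dim
    for w, i in zip(weights, keep):
        for d in range(dim):
            out[d] += w * values[i][d]
    return out
```

Because nothing is removed from `values`, a token skipped at one step can still be selected at a later step, which is the property the abstract contrasts with permanent eviction. In this toy form the selection itself still scans every score, so it says nothing about where the reported latency savings come from.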

Score Breakdown

Holistic Impression: 75
Novelty: 72
Rigor: 74
Applicability: 77
Clarity: 83
Citation: 79

Confidence: 85%
