RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents

Computer Vision | arXiv:2606.07401

AIPR assessment

Problem difficulty is high. This is a competitive, fairly saturated benchmark area where strong systems already exist, and the paper still separates systems across real regulated document genres, which is harder than reporting gains on an uncrowded task. The strengths reinforce each other: real documents, typed field supervision, a paired layout track, and a uniform harness make the benchmark more credible and more useful. The weaknesses also interact: vendor authorship, model-assisted gold crea

Abstract

Document parsing systems are increasingly deployed in high-stakes, regulated workflows such as mortgage underwriting, financial reporting, supply-chain logistics, and clinical records. Yet most public benchmarks evaluate parsers on clean academic layouts or synthetic prose, and report a single OCR or markdown-level similarity score. Such documents and metrics correlate poorly with what downstream agents actually need: the correct value for a specific field on a messy real-world page. We introduce RealDocBench, a two-track benchmark built from real regulated documents. The QA track contains 1,356 field-level questions over 581 documents spanning four domains, where each question is paired with a typed gold_dict of key-to-value answers and parsers are scored on both per-field and strict per-question accuracy. The layout track contains 1,500 human-verified page images annotated with COCO-style bounding boxes under a nine-class public taxonomy, scored with a Hungarian matcher that includes adjacency-aware split/merge recovery. We evaluate eighteen systems, spanning commercial parsing APIs, general-purpose VLMs, and open-source OCR models, under a uniform extraction-and-scoring protocol, and report accuracy alongside per-page cost and cache-busted latency. RealDocBench exposes a wide performance spread that single-number benchmarks hide, a persistently hard medical sub-domain, and sharp cost/latency trade-offs across operating points. We release the datasets, parser adapters, and evaluation harness to support reproducible, field-level comparison of document parsing systems.
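The QA-track scoring described in the abstract (per-field accuracy and strict per-question accuracy over a typed gold_dict) can be illustrated with a minimal sketch. The abstract does not give the benchmark's normalization rules or averaging scheme, so this sketch assumes exact match after whitespace and case normalization, and micro-averaging over all fields. The names `normalize`, `score_question`, `score_benchmark`, and the example field keys are hypothetical and are not taken from the released harness.

```python
from typing import Dict, List, Tuple


def normalize(value: object) -> str:
    """Assumed normalization: stringify, collapse whitespace, lowercase."""
    return " ".join(str(value).split()).lower()


def score_question(
    gold_dict: Dict[str, object], pred_dict: Dict[str, object]
) -> Tuple[int, int, bool]:
    """Return (fields_correct, fields_total, strict_correct) for one question.

    A field counts as correct only if the predicted key exists and its
    normalized value equals the normalized gold value. The question is
    strictly correct only if every gold field is correct.
    """
    correct = sum(
        1
        for key, gold in gold_dict.items()
        if key in pred_dict and normalize(pred_dict[key]) == normalize(gold)
    )
    total = len(gold_dict)
    return correct, total, total > 0 and correct == total


def score_benchmark(
    pairs: List[Tuple[Dict[str, object], Dict[str, object]]]
) -> Dict[str, float]:
    """Aggregate (gold_dict, pred_dict) pairs into the two accuracy figures."""
    fields_correct = fields_total = strict_hits = 0
    for gold, pred in pairs:
        correct, total, strict = score_question(gold, pred)
        fields_correct += correct
        fields_total += total
        strict_hits += int(strict)
    return {
        # Micro-average over fields (an assumption, see the lead-in).
        "per_field_accuracy": fields_correct / fields_total if fields_total else 0.0,
        "strict_question_accuracy": strict_hits / len(pairs) if pairs else 0.0,
    }


# Illustrative data only; these field names and values are made up.
example_pairs = [
    (
        {"borrower_name": "Jane Doe", "loan_amount": "250000"},
        {"borrower_name": "jane  doe", "loan_amount": "25000"},
    ),
    ({"dosage": "5 mg"}, {"dosage": "5 mg"}),
]
example_result = score_benchmark(example_pairs)
```

In this example two of the three gold fields match and one of the two questions is fully correct, so the strict score is lower than the per-field score. That gap is the reason to report both: a parser can get most fields right while still failing many questions outright.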

Score Breakdown

Holistic Impression: 82
Novelty: 83
Rigor: 79
Applicability: 83
Clarity: 85
Citation: 76
Confidence: 85%
