RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents

AIPR assessment

Problem difficulty is high. This is a competitive, fairly saturated benchmark area where strong systems already exist, and the paper still separates systems across real regulated document genres, which is harder than reporting gains on an uncrowded task. The strengths reinforce each other: real documents, typed field supervision, a paired layout track, and a uniform harness make the benchmark more credible and more useful. The weaknesses also interact: vendor authorship, model-assisted gold crea

Abstract

Document parsing systems are increasingly deployed in high-stakes, regulated workflows such as mortgage underwriting, financial reporting, supply-chain logistics, and clinical records. Yet most public benchmarks evaluate parsers on clean academic layouts or synthetic prose, and report a single OCR or markdown-level similarity score. Such documents and metrics correlate poorly with what downstream agents actually need: the correct value for a specific field on a messy real-world page. We introduce RealDocBench, a two-track benchmark built from real regulated documents. The QA track contains 1,356 field-level questions over 581 documents spanning four domains, where each question is paired with a typed gold_dict of key-to-value answers and parsers are scored on both per-field and strict per-question accuracy. The layout track contains 1,500 human-verified page images annotated with COCO-style bounding boxes under a nine-class public taxonomy, scored with a Hungarian matcher that includes adjacency-aware split/merge recovery. We evaluate eighteen systems, spanning commercial parsing APIs, general-purpose VLMs, and open-source OCR models, under a uniform extraction-and-scoring protocol, and report accuracy alongside per-page cost and cache-busted latency. RealDocBench exposes a wide performance spread that single-number benchmarks hide, a persistently hard medical sub-domain, and sharp cost/latency trade-offs across operating points. We release the datasets, parser adapters, and evaluation harness to support reproducible, field-level comparison of document parsing systems.
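
The difference between the two QA-track metrics named in the abstract, per-field accuracy and strict per-question accuracy, can be illustrated with a minimal sketch. It is not the benchmark's released harness: the function names, the string normalization, and the example fields below are assumptions chosen for illustration, and the actual typed comparison rules are not described in this abstract.

```python
from typing import Dict, List, Tuple


def normalize(value: object) -> str:
    # Hypothetical normalization (assumption): trim whitespace and ignore case.
    # The benchmark's real typed comparison is not specified in the abstract.
    return str(value).strip().lower()


def score_question(gold: Dict[str, object], pred: Dict[str, object]) -> Tuple[int, int]:
    """Return (number of gold fields matched, number of gold fields)."""
    correct = sum(
        1
        for key, gold_value in gold.items()
        if key in pred and normalize(pred[key]) == normalize(gold_value)
    )
    return correct, len(gold)


def score_benchmark(
    pairs: List[Tuple[Dict[str, object], Dict[str, object]]]
) -> Tuple[float, float]:
    """Return (per-field accuracy, strict per-question accuracy)."""
    field_correct = 0
    field_total = 0
    strict_correct = 0
    for gold, pred in pairs:
        correct, total = score_question(gold, pred)
        field_correct += correct
        field_total += total
        # A question counts under the strict metric only if every field matches.
        strict_correct += int(correct == total)
    if not pairs or field_total == 0:
        return 0.0, 0.0
    return field_correct / field_total, strict_correct / len(pairs)


# Illustrative (gold_dict, prediction) pairs; the field names are made up.
pairs = [
    (
        {"loan_amount": "250000", "borrower": "Jane Doe"},
        {"loan_amount": "250000", "borrower": "jane doe"},
    ),
    (
        {"dosage": "5 mg", "frequency": "daily"},
        {"dosage": "5 mg", "frequency": "weekly"},
    ),
]
field_acc, strict_acc = score_benchmark(pairs)
print(field_acc, strict_acc)
```

In this toy input, three of the four gold fields match but only one of the two questions is fully correct, so the strict metric penalizes a single wrong field more heavily than the per-field metric does.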

Score Breakdown

Holistic Impression: 82
Novelty: 83
Rigor: 79
Applicability: 83
Clarity: 85
Citation: 76

Confidence: 85%
