RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents
AIPR assessment
Problem difficulty is high. This is a competitive, fairly saturated benchmark area where strong systems already exist, and the paper still separates systems across real regulated document genres, which is harder than reporting gains on an uncrowded task. The strengths reinforce each other: real documents, typed field supervision, a paired layout track, and a uniform harness make the benchmark more credible and more useful. The weaknesses also interact: vendor authorship, model-assisted gold crea
Abstract
Document parsing systems are increasingly deployed in high-stakes, regulated workflows such as mortgage underwriting, financial reporting, supply-chain logistics, and clinical records. Yet most public benchmarks evaluate parsers on clean academic layouts or synthetic prose, and report a single OCR or markdown-level similarity score. Such documents and metrics correlate poorly with what downstream agents actually need: the correct value for a specific field on a messy real-world page. We introduce RealDocBench, a two-track benchmark built from real regulated documents. The QA track contains 1,356 field-level questions over 581 documents spanning four domains, where each question is paired with a typed gold_dict of key-to-value answers and parsers are scored on both per-field and strict per-question accuracy. The layout track contains 1,500 human-verified page images annotated with COCO-style bounding boxes under a nine-class public taxonomy, scored with a Hungarian matcher that includes adjacency-aware split/merge recovery. We evaluate eighteen systems, spanning commercial parsing APIs, general-purpose VLMs, and open-source OCR models, under a uniform extraction-and-scoring protocol, and report accuracy alongside per-page cost and cache-busted latency. RealDocBench exposes a wide performance spread that single-number benchmarks hide, a persistently hard medical sub-domain, and sharp cost/latency trade-offs across operating points. We release the datasets, parser adapters, and evaluation harness to support reproducible, field-level comparison of document parsing systems.
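The two QA-track metrics named in the abstract, per-field accuracy and strict per-question accuracy, can be illustrated with a minimal sketch. This is not the released harness. The function names (`field_matches`, `score_question`, `aggregate`), the string normalization rule, the type-equality check, and the sample fields are all assumptions made for illustration. Only the idea of a typed `gold_dict` of key-to-value answers comes from the abstract.

```python
from typing import Any, Dict, List, Tuple


def field_matches(gold_value: Any, pred_value: Any) -> bool:
    """Hypothetical match rule: types must agree, strings compare trimmed and case-folded."""
    if type(gold_value) is not type(pred_value):
        return False
    if isinstance(gold_value, str):
        return gold_value.strip().casefold() == pred_value.strip().casefold()
    return gold_value == pred_value


def score_question(gold_dict: Dict[str, Any], pred_dict: Dict[str, Any]) -> Tuple[int, int]:
    """Return (correct fields, total gold fields) for one question."""
    correct = sum(
        1
        for key, gold_value in gold_dict.items()
        if key in pred_dict and field_matches(gold_value, pred_dict[key])
    )
    return correct, len(gold_dict)


def aggregate(pairs: List[Tuple[Dict[str, Any], Dict[str, Any]]]) -> Dict[str, float]:
    """Compute per-field accuracy and strict per-question accuracy over (gold, pred) pairs."""
    total_correct = total_fields = strict_hits = 0
    for gold_dict, pred_dict in pairs:
        correct, n_fields = score_question(gold_dict, pred_dict)
        total_correct += correct
        total_fields += n_fields
        # A question counts as strictly correct only if every gold field matches.
        strict_hits += int(correct == n_fields)
    return {
        "per_field": total_correct / total_fields if total_fields else 0.0,
        "strict": strict_hits / len(pairs) if pairs else 0.0,
    }


# Illustrative data only; these fields are not taken from the benchmark.
pairs = [
    ({"loan_amount": 250000, "borrower": "Jane Doe"},
     {"loan_amount": 250000, "borrower": " jane doe "}),
    ({"dosage_mg": 50, "drug": "Metformin"},
     {"dosage_mg": 500, "drug": "metformin"}),
]
scores = aggregate(pairs)
print(scores)
```

In this toy input, three of four fields match but only one of two questions is fully correct, which shows how the strict metric can diverge from the per-field metric on the same predictions.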