DORA: Dataflow-Instruction Orchestration Architecture for DNN Acceleration

Computer Architecture | arXiv:2605.23833

AIPR assessment

This is a hard, crowded systems problem, not a niche benchmark. Many groups have optimized Versal and FPGA DNN acceleration for years, so consistent gains against strong baselines matter. The strengths reinforce each other: a working hardware prototype, a concrete ISA, and a compiler flow that generates deployable instructions. The weaknesses also compound: the design is specialized, some evaluation components are not fully open or directly reproducible, and several claims of broad flexibility r

Abstract

As deep neural networks grow significantly more diverse and complex, achieving high performance and efficiency on them becomes increasingly challenging. Modern DNN workloads vary widely in operation types, tensor shapes, and execution dependencies, making it difficult to sustain high hardware efficiency across models. In addition, a generic accelerator often incurs substantial overhead when executing diverse workloads. To address these problems, we propose DORA, an instruction-based overlay architecture that explicitly describes dataflow via a proposed ISA, enabling fine-grained control of data movement, computation, and synchronization at the layer level. To support flexibility while achieving high performance, DORA adopts novel mechanisms for on-chip memory management and computation parallelism management. DORA includes a compilation framework that generates instructions for a given DNN workload after a two-stage design space exploration. The framework also incorporates MILP-based and heuristic-based search engines that generate schedules for different needs and constraints. We prototype DORA on the AMD Versal VCK190 platform, demonstrating its deployability on existing reconfigurable systems. Experimental results show that DORA maintains stable efficiency, with less than 5% variation on a single vector processor across workloads exhibiting up to 6× variation in operation counts. Compared to state-of-the-art accelerators, DORA consistently achieves higher performance, delivering up to 5× throughput improvement. The heuristic-based scheduler further achieves up to 90% optimality under practical time constraints. DORA is open-sourced at https://github.com/arc-research-lab/DORA.git.
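
To make the "optimality" figure for a heuristic scheduler concrete, the toy sketch below compares a greedy heuristic against an exact search on a small load-balancing problem. It is not DORA's scheduler: the exhaustive search stands in for the MILP engine, the greedy longest-first rule stands in for the heuristic engine, and the function names, cost values, and the definition of optimality as optimal latency divided by heuristic latency are all assumptions made for illustration.

```python
from itertools import product


def makespan(assignment, costs, num_units):
    """Latency of a schedule: the load of the busiest compute unit."""
    loads = [0.0] * num_units
    for cost, unit in zip(costs, assignment):
        loads[unit] += cost
    return max(loads)


def exhaustive_schedule(costs, num_units):
    """Exact search over every assignment (hypothetical stand-in for a MILP solver)."""
    return min(
        makespan(assignment, costs, num_units)
        for assignment in product(range(num_units), repeat=len(costs))
    )


def greedy_schedule(costs, num_units):
    """Longest-first greedy: place each task on the currently least-loaded unit."""
    loads = [0.0] * num_units
    for cost in sorted(costs, reverse=True):
        loads[loads.index(min(loads))] += cost
    return max(loads)


# Made-up per-layer costs and unit count, chosen only to show a gap.
layer_costs = [3, 3, 2, 2, 2]
units = 2

optimal_latency = exhaustive_schedule(layer_costs, units)
heuristic_latency = greedy_schedule(layer_costs, units)

# Assumed definition: 1.0 means the heuristic matches the exact optimum.
optimality = optimal_latency / heuristic_latency
print(f"optimal={optimal_latency}, heuristic={heuristic_latency}, "
      f"optimality={optimality:.2f}")
```

The exact search grows exponentially with the number of tasks while the greedy rule stays cheap, which is the usual reason a flow would offer both an exact and a heuristic engine.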

Score Breakdown

Holistic Impression: 74
Novelty: 72
Rigor: 74
Applicability: 76
Clarity: 73
Citation: 80
Confidence: 85%
