DORA: Dataflow-Instruction Orchestration Architecture for DNN Acceleration
AIPR assessment
This is a hard, crowded systems problem, not a niche benchmark. Many groups have optimized Versal and FPGA DNN acceleration for years, so consistent gains against strong baselines matter. The strengths reinforce each other: a working hardware prototype, a concrete ISA, and a compiler flow that generates deployable instructions. The weaknesses also compound: the design is specialized, some evaluation components are not fully open or directly reproducible, and several claims of broad flexibility r
Abstract
As deep neural networks grow more diverse and complex, sustaining high performance and efficiency across them becomes a pressing challenge. Modern DNN workloads vary widely in operation types, tensor shapes, and execution dependencies, making it difficult to sustain high hardware efficiency across models. In addition, a generic accelerator often incurs substantial overhead when executing diverse workloads. To address these problems, we propose DORA, an instruction-based overlay architecture that explicitly describes dataflow through its own ISA, enabling fine-grained control of data movement, computation, and synchronization at the layer level. To stay flexible while achieving high performance, DORA adopts novel mechanisms for managing on-chip memory and computation parallelism. DORA's compilation framework generates instructions for a given DNN workload after a two-stage design space exploration. The framework also incorporates both a MILP-based and a heuristic-based search engine to generate schedules under different needs and constraints. We prototype DORA on the AMD Versal VCK190 platform, demonstrating its deployability on existing reconfigurable systems. Experimental results show that DORA maintains stable efficiency, with less than 5% variation on a single vector processor across workloads exhibiting up to 6× variation in operation counts. Compared to state-of-the-art accelerators, DORA consistently achieves higher performance, delivering up to 5× throughput improvement. The heuristic-based scheduler further achieves up to 90% optimality under practical time constraints. DORA is open-sourced at https://github.com/arc-research-lab/DORA.git.
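The abstract reports scheduler quality as a percentage of optimality without defining the metric. One plausible reading, assumed here, is the ratio of the exact optimum to the heuristic's objective value. The Python sketch below illustrates that ratio on a toy problem: a greedy longest-task-first heuristic is compared with a brute-force search that stands in for an exact solver. The task costs, the makespan objective, and every function name are hypothetical and are not taken from DORA's scheduler or its MILP formulation.

```python
from itertools import product


def makespan(assignment, costs, num_units):
    """Finish time of the busiest unit for a given task-to-unit assignment."""
    loads = [0] * num_units
    for cost, unit in zip(costs, assignment):
        loads[unit] += cost
    return max(loads)


def greedy_schedule(costs, num_units):
    """Heuristic: take tasks longest first, place each on the least-loaded unit."""
    loads = [0] * num_units
    for cost in sorted(costs, reverse=True):
        loads[loads.index(min(loads))] += cost
    return max(loads)


def exact_schedule(costs, num_units):
    """Brute force over all assignments (toy stand-in for an exact solver)."""
    return min(
        makespan(assignment, costs, num_units)
        for assignment in product(range(num_units), repeat=len(costs))
    )


def optimality(costs, num_units):
    """Assumed metric: optimal makespan divided by heuristic makespan (1.0 is best)."""
    return exact_schedule(costs, num_units) / greedy_schedule(costs, num_units)


if __name__ == "__main__":
    layer_costs = [3, 3, 2, 2, 2]  # hypothetical per-layer costs
    print(f"heuristic makespan: {greedy_schedule(layer_costs, 2)}")
    print(f"optimal makespan:   {exact_schedule(layer_costs, 2)}")
    print(f"optimality:         {optimality(layer_costs, 2):.1%}")
```

On this toy input the greedy schedule finishes at 7 while the optimum is 6, giving roughly 86% optimality. The brute-force search grows exponentially with the number of tasks, which is the usual reason for pairing an exact solver with a faster heuristic under time limits.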