vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models
AIPR assessment
Problem difficulty is high. Portable edge inference for modern VLA policies is a crowded systems problem, and the paper has to contend with heterogeneous backbones, iterative action heads, low-memory hardware, and real-robot control latency. The strengths reinforce each other: the runtime abstraction is concrete, the benchmark coverage is broad, the kernel optimization is measured at the right granularity, and the on-robot test makes the latency argument believable. The weaknesses also interact:
Abstract
Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for the hardware on which robots actually run. We present vla.cpp, a portable C++ inference runtime built on llama.cpp. To our knowledge, it is the first ggml-class engine to natively serve the flow-matching and diffusion VLA inference pattern, in which a cached vision-language prefix is consumed by a cross-attending action expert integrated over several solver steps. A single runtime serves seven architectures spanning five backbone and four action-head families behind one request/response protocol, with each model packaged as a self-contained bundle. On LIBERO-Object, the engine matches a state-of-the-art checkpoint to within one episode out of 200, and runs BitVLA at 100% success in 1.3 GiB of memory. The same bundle runs unchanged across three hardware tiers, from a consumer GPU down to an 8 GB embedded module. A cross-hardware roofline analysis shows that batch-1 VLA inference is compute-bound, so utilization rather than bandwidth is the deployment lever; an IMMA ladder GEMM derived from this analysis cuts BitVLA per-step latency by 4.5x. We then frame an on-robot stress test on an ALOHA arm that isolates the latency constraint under which a learned VLA must replan against a moving target on the hardware it was trained for. Code, demo videos, and the reproducible benchmark scaffold are available at https://fai-modelopt-tech.github.io/vla-cpp.github.io/.
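The inference pattern named in the abstract (a prefix computed once, then an action expert evaluated over several solver steps) can be sketched as follows. This is a minimal illustration, not the vla.cpp API: `PrefixCache`, `VelocityFn`, and `integrate_action` are hypothetical names, the forward-Euler solver is an assumed choice, and the time direction (t = 0 to t = 1) is a convention picked for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for the cached vision-language prefix
// (in a real engine this would be something like a KV cache).
struct PrefixCache {
    std::vector<float> features;  // computed once per observation
};

// Hypothetical action-expert signature: velocity v(x_t, t | prefix).
using VelocityFn = std::function<std::vector<float>(
    const std::vector<float>& x, float t, const PrefixCache& prefix)>;

// Forward-Euler integration of the action from t = 0 to t = 1.
// The prefix is passed by const reference and reused at every step,
// so only the action expert is re-evaluated inside the loop.
std::vector<float> integrate_action(std::vector<float> x,
                                    const PrefixCache& prefix,
                                    const VelocityFn& velocity,
                                    int steps) {
    assert(steps > 0);
    const float dt = 1.0f / static_cast<float>(steps);
    for (int k = 0; k < steps; ++k) {
        const float t = static_cast<float>(k) * dt;
        const std::vector<float> v = velocity(x, t, prefix);
        assert(v.size() == x.size());
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += dt * v[i];  // x_{t+dt} = x_t + dt * v(x_t, t)
        }
    }
    return x;
}
```

In this sketch the cost of one request is one prefix computation plus `steps` evaluations of the action expert. That repeated evaluation is why the abstract reports latency per step.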