May 25 – May 31, 2026

Preprint Report: Agent reliability, reasoning efficiency, and safety-constrained robotics

Roughly 4,400 CS preprints hit arXiv this week. Approximately 3% touched LLM agents, about 6% fell into LLM inference research, and around 7% sat in robotics research. LLM agent reliability was one of the clearest slices inside its parent area, at roughly 33% of LLM agent research. Work on reasoning efficiency at inference time made up about 17% of LLM inference research, with effort shifting toward cheaper verification and decoding. Work on robotic autonomy under constraints made up about 14% of robotics research, where planning must respect geometry, dynamics, and safety limits.
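For a sense of scale, the shares above can be turned into rough paper counts. The sketch below multiplies the rounded figures quoted in this report, so the results are estimates only; the names `TOTAL_PREPRINTS`, `SHARES`, and `implied_counts` are illustrative and not part of any dataset.

```python
# Rough paper counts implied by the rounded shares quoted in the report.
# Inputs are approximations, so every output is an estimate.
TOTAL_PREPRINTS = 4400

# topic -> (parent area's share of all CS preprints, topic's share of that area)
SHARES = {
    "LLM agent reliability": (0.03, 0.33),
    "reasoning inference efficiency": (0.06, 0.17),
    "robotic autonomy under constraints": (0.07, 0.14),
}


def implied_counts(total, shares):
    """Return {topic: (parent_area_count, topic_count)} in whole papers."""
    counts = {}
    for topic, (parent_share, topic_share) in shares.items():
        parent = total * parent_share
        counts[topic] = (round(parent), round(parent * topic_share))
    return counts


for topic, (parent, sub) in implied_counts(TOTAL_PREPRINTS, SHARES).items():
    print(f"{topic}: ~{sub} of ~{parent} papers")
```

On these figures, each of the three themes works out to roughly 43 to 45 papers for the week, even though their parent areas differ in size by more than a factor of two.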

Agent failure tracing

Agent work is moving away from polished workflow demos and toward the messier question of how systems fail once tools, memory, and long trajectories enter the loop. In FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search, Rafi et al. at Concordia University tackle the usual debugging problem, where a bad final answer hides the real earlier mistake, by searching dependency links to locate failure points in the trajectory. Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations by Merin et al. at Institut Teknologi Bandung addresses a different blind spot, showing that agents can look competent in one session yet fail when memory must persist across repeated interactions. Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults pushes on external influence, testing how manipulated information streams can bend agent choices away from their nominal behavior.
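To make the idea of following dependency links concrete, here is a minimal sketch of a backward search over a trajectory's step graph. It is a generic illustration, not the FALAT algorithm: the `deps` mapping, the `is_faulty` check, and the rule "a root failure is a faulty step whose direct inputs all look fine" are assumptions made for the example.

```python
from collections import deque


def trace_root_failures(deps, is_faulty, final_step):
    """Walk dependency links backward from the final step.

    deps: dict mapping a step id to the step ids whose output it consumed.
    is_faulty: callable(step id) -> bool, a stand-in for any per-step check.
    Returns the faulty steps reachable from final_step whose direct
    dependencies are all non-faulty, i.e. where the error first appears.
    """
    seen = {final_step}
    queue = deque([final_step])
    roots = []
    while queue:
        step = queue.popleft()
        parents = deps.get(step, [])
        if is_faulty(step) and not any(is_faulty(p) for p in parents):
            roots.append(step)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(roots)


# Hypothetical trajectory: step 5 is the bad final answer.
deps = {1: [], 2: [1], 3: [1], 4: [2], 5: [3, 4], 6: []}
faulty = {2, 4, 5, 6}  # step 6 is faulty but never feeds the final answer
print(trace_root_failures(deps, faulty.__contains__, final_step=5))
```

In this toy case the search reports step 2 only: steps 4 and 5 inherit its error, and step 6 is ignored because nothing downstream depends on it.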

Cheaper reasoning at run time

Reasoning work has a distinctly practical mood this week: the open problem is less how to make models think in principle, and more how to spend compute only where it helps. In Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding, Su et al. at Thoughtworks address the waste in checking every candidate token by learning where verification is worth paying for. WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering by Yang et al. at Nanjing University of Science and Technology tackles long-context slowdown by filtering the key-value cache, the model's stored attention state, so generation keeps more useful context with less overhead. Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs tries to improve reasoning behavior at inference time rather than by retraining the base model.
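The trade-off behind selective verification can be shown with a toy speculative decoding step. This is a simplified greedy sketch under stated assumptions, not the method from Hybrid Verified Decoding: the models are plain functions, the `should_verify` policy is hypothetical, and target checks are counted one token at a time even though real systems verify a draft in a single batched pass.

```python
def speculative_step(prefix, draft_next, target_next, k, should_verify):
    """Draft k tokens, then accept them left to right.

    draft_next / target_next: callable(list of tokens) -> next token.
    should_verify: callable(position, token) -> bool (hypothetical policy).
    Returns (new_tokens, target_calls).
    """
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target model checks only the positions the policy selects.
    accepted, target_calls, ctx = [], 0, list(prefix)
    for i, tok in enumerate(proposed):
        if should_verify(i, tok):
            target_calls += 1
            expected = target_next(ctx)
            if expected != tok:
                accepted.append(expected)  # keep the target's token and stop
                break
        accepted.append(tok)
        ctx.append(tok)
    return accepted, target_calls


# Toy models: the target counts upward mod 10; the draft errs after a 2.
def target(ctx):
    return (ctx[-1] + 1) % 10


def draft(ctx):
    return 9 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


verify_all = speculative_step([0], draft, target, 4, lambda i, t: True)
verify_even = speculative_step([0], draft, target, 4, lambda i, t: i % 2 == 0)
verify_none = speculative_step([0], draft, target, 4, lambda i, t: False)
print(verify_all, verify_even, verify_none)
```

Checking every position catches the draft's mistake at the cost of three target calls, the even-position policy happens to catch it with two, and skipping verification entirely lets the wrong token through. A learned allocation policy is aiming for the middle case without relying on luck.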

Safety-bounded robot planning

Robotics keeps pressing toward systems that can act in the world without pretending the world is neat. In Robust Integrated Planning and Control for Quadrotors in Dynamic Environments via NMPC with CBF Penalties, Shayan et al. at Toronto Metropolitan University deal with moving obstacles by adding control barrier function (CBF) penalties to a nonlinear model predictive control (NMPC) problem, a way to encode safety limits directly in the optimization. ImagineUAV: Aerial Vision-Language Navigation via World-Action Modeling and Kinodynamic Planning by Liu et al. at Pengcheng Laboratory tackles aerial navigation from language and vision by pairing a world-action model with kinodynamic planning, which keeps motions physically feasible. SkyShield: Occupancy as a Safety Interface for Low-Altitude UAV Autonomy by Gao et al. at Xiamen University addresses a common deployment gap by turning uncertain scene understanding into an occupancy-based safety interface that downstream autonomy stacks can actually use.
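For readers unfamiliar with barrier penalties, the sketch below shows one common textbook form: a discrete-time control barrier condition turned into a soft cost term. It is an illustration under assumptions, not the formulation from the quadrotor paper; the circular 2D obstacle, the decay rate `gamma`, and the quadratic `weight` are all chosen for the example.

```python
def barrier(position, obstacle, radius):
    """h >= 0 means the point lies outside the obstacle's safety circle."""
    dx = position[0] - obstacle[0]
    dy = position[1] - obstacle[1]
    return dx * dx + dy * dy - radius * radius


def cbf_penalty(p_now, p_next, obs_now, obs_next, radius,
                gamma=0.2, weight=100.0):
    """Soft penalty for violating h(next) >= (1 - gamma) * h(now).

    The condition lets the safety margin shrink by at most a fraction
    gamma per step; any shortfall is squared and scaled by weight so it
    can be added to a planner's stage cost.
    """
    h_now = barrier(p_now, obs_now, radius)
    h_next = barrier(p_next, obs_next, radius)
    violation = max(0.0, (1.0 - gamma) * h_now - h_next)
    return weight * violation ** 2


# Gentle approach toward a static obstacle: no penalty.
print(cbf_penalty((3.0, 0.0), (2.9, 0.0), (0.0, 0.0), (0.0, 0.0), 1.0))
# Obstacle moves toward a hovering vehicle: the margin drops too fast.
print(cbf_penalty((3.0, 0.0), (3.0, 0.0), (0.0, 0.0), (1.0, 0.0), 1.0))
```

Because the obstacle position is passed separately for the current and next step, the same term penalizes a fast-closing obstacle even when the vehicle itself does not move, which is the situation a dynamic environment adds over a static one.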
