May 4, 2026

Preprint Report: LLM reasoning efficiency, live agent evaluation, and 3D scene reconstruction


Across the roughly 5,000 preprints submitted to arXiv this week, language-model work remains the largest single share, with about 14% touching LLMs in some form and a visible slice of that focused on reasoning or post-training efficiency. Agent and tool-use evaluation is smaller but still prominent at about 4% of weekly submissions, especially around live workflows. Geometry-rich 3D vision is a comparable thread at around 5%, spanning reconstruction, scene models, and embodied perception.

Leaner reasoning loops

Amid the steady flow of LLM preprints, the sharper turn is toward trimming reasoning overhead without giving up reliability. Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning attacks the problem of verbose, brittle reasoning traces by optimizing in a latent space so the model can learn shorter internal deliberation. Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding tackles wasted verifier work by checking generated tokens adaptively, which makes speculative decoding fit mixture-of-experts systems better. Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding addresses recurring structured queries by restricting generation to reusable templates, which reduces both errors and latency.
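To make the GRPO family concrete: the core move, which a latent variant would presumably inherit, is scoring each sampled rollout relative to the mean and spread of its own group rather than against a learned value baseline. The sketch below shows only that group-relative advantage step, not anything specific to Latent-GRPO's latent-space formulation.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward against
    the mean and standard deviation of its own sampled group, so no
    separate value network is needed as a baseline."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts sampled for one prompt, scored 1/0 by a verifier.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Rollouts that beat their group's average get positive advantages and are reinforced; the rest are pushed down, and the advantages sum to roughly zero by construction.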

Agents in live workflows

On the agent side, evaluation is getting less toy-like and more operational. Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows addresses the problem of stale agent benchmarks by grounding tasks in changing workflows and grading actions taken during execution rather than polished final answers. AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go? tackles vague claims about small models as tool users by building a controlled ladder of tool tasks with explicit cost and latency reporting. Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis targets opaque agent failure modes by generating targeted test cases from skill constraints, giving developers a more surgical way to probe where an agent actually breaks.
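The shift from answer-grading to action-grading is easy to illustrate. The sketch below is a minimal, hypothetical grader in the spirit of Claw-Eval-Live, not its actual harness: it scores an agent's executed actions against a workflow's required steps in order, so a lucky final answer reached by the wrong procedure no longer earns full credit.

```python
def grade_trajectory(actions, required_steps):
    """Score an agent run by checking the executed actions against the
    workflow's required steps, in order, rather than only inspecting
    the final answer. Returns the fraction of steps completed in sequence."""
    idx = 0
    for action in actions:
        if idx < len(required_steps) and action == required_steps[idx]:
            idx += 1
    return idx / len(required_steps)

# A workflow that changed this week now requires a token refresh first.
workflow = ["refresh_token", "fetch_orders", "file_report"]
run = ["fetch_orders", "refresh_token", "fetch_orders", "file_report"]
score = grade_trajectory(run, workflow)  # 1.0: every step hit in order
```

Because the spec lives outside the grader, updating the benchmark as the real workflow evolves means editing the step list, not the scoring code.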

Faster scene representations

In 3D vision, the action sits in the representation itself: how scenes are parameterized, densified, and made useful early in training. Faster 3D Gaussian Splatting Convergence via Structure-Aware Densification addresses slow, somewhat wasteful splat growth by adding an analytic densification rule that pushes geometry into useful places sooner. Beyond Visual Fidelity: Benchmarking Super-Resolution Models for Large-Scale Remote Sensing Imagery via Downstream Task Integration tackles the problem of image enhancement judged only by appearance by evaluating super-resolution through downstream tasks, which better reflects whether reconstructed detail is actually useful. Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies confronts the gap between offline robotics training and messy deployment by learning from a live robot fleet, where scene understanding and control have to hold up outside the lab.
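For context on what a densification rule decides: in standard 3D Gaussian Splatting, Gaussians whose accumulated view-space positional gradient exceeds a threshold are densified, with small ones cloned and large ones split. The sketch below shows that baseline decision rule only; the structure-aware criterion the paper proposes would replace or reweight the gradient test, and the threshold values here are illustrative, not the reference implementation's.

```python
import numpy as np

def densify_mask(grad_accum, scales, grad_thresh=0.0002, scale_thresh=0.01):
    """Baseline 3DGS-style densification decision: Gaussians whose
    accumulated positional gradient exceeds grad_thresh are densified.
    Small candidates are cloned in place; large ones are split in two."""
    candidates = grad_accum > grad_thresh
    clone = candidates & (scales <= scale_thresh)
    split = candidates & (scales > scale_thresh)
    return clone, split

grads = np.array([0.0001, 0.0005, 0.0004])   # accumulated gradients
scales = np.array([0.005, 0.005, 0.02])      # per-Gaussian extents
clone, split = densify_mask(grads, scales)
# Gaussian 0 is left alone; 1 gets cloned; 2 gets split.
```

A structure-aware rule, as the title suggests, would steer this growth toward geometrically informative regions earlier instead of waiting for gradients to accumulate wherever rendering error happens to pool.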