Apr 27 – May 3, 2026

Preprint Report: LLM reasoning control, live agent benchmarks, and faster 3D reconstruction


Across approximately 3,100 CS preprints on arXiv this week, three threads stood out. Language model research makes up about 19% of submissions, AI agent research about 15%, and 3D vision research about 12%. Within language model research, cheaper tuning and test-time control for reasoning is the most active thread, accounting for roughly 21% of language model submissions. Live workflow and tool-use benchmarking makes up about 27% of AI agent research, with evaluation shifting toward verifiable actions under changing tasks. Faster Gaussian-splatting-based reconstruction makes up about 25% of 3D vision research, where the pressure is to reach stable geometry sooner.
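The shares above are nested (a share of a share), so it can help to turn them into rough paper counts. The sketch below multiplies the figures quoted in this report; every input is approximate, so the outputs are ballpark numbers only, and the helper name implied_count is just an illustration.

```python
# Rough paper counts implied by the quoted shares (all inputs are approximate).
TOTAL = 3100  # approximate number of CS preprints this week


def implied_count(total, *shares):
    """Multiply a total by a chain of nested shares and round to whole papers."""
    count = float(total)
    for share in shares:
        count *= share
    return round(count)


llm = implied_count(TOTAL, 0.19)                     # language model research
llm_reasoning = implied_count(TOTAL, 0.19, 0.21)     # reasoning-control thread
agents = implied_count(TOTAL, 0.15)                  # AI agent research
agent_benchmarks = implied_count(TOTAL, 0.15, 0.27)  # live benchmarking thread
vision3d = implied_count(TOTAL, 0.12)                # 3D vision research
splatting = implied_count(TOTAL, 0.12, 0.25)         # faster splatting thread

for name, value in [
    ("language models", llm),
    ("  reasoning control", llm_reasoning),
    ("AI agents", agents),
    ("  live benchmarks", agent_benchmarks),
    ("3D vision", vision3d),
    ("  faster splatting", splatting),
]:
    print(f"{name}: ~{value} papers")
```

On these numbers, each highlighted thread corresponds to roughly 90 to 130 preprints.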

Cheaper reasoning control

Across language model work, the pressure is to get better reasoning traces without defaulting to bigger models or more expensive supervision. Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning by Deng et al. at the Institute of Computing Technology, Chinese Academy of Sciences, pushes reinforcement learning into a latent reasoning space so the model can improve its internal steps more efficiently. Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors by Yuan et al. at Hong Kong Baptist University treats capabilities as pieces that can be mixed at inference time, instead of retraining one monolithic model for every tradeoff. AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs by Lin et al. at Harbin Institute of Technology tackles the same budget problem from the systems side, shrinking training memory so reasoning-oriented tuning is easier to run at all.
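To make the idea of mixing capabilities concrete, here is a minimal sketch of generic task-vector arithmetic: each fine-tuned checkpoint is reduced to its difference from a shared base, and the differences are added back with chosen weights. This illustrates the general technique only, not the specific synthesis procedure of Yuan et al.; the toy weights, the mixing coefficients, and the function names are assumptions made for illustration.

```python
# Generic task-vector arithmetic on toy "checkpoints" (dicts of float lists).
# This is NOT the method from the paper; it only illustrates the general idea.


def task_vector(tuned, base):
    """Return the per-parameter difference between a tuned model and its base."""
    return {k: [t - b for t, b in zip(tuned[k], base[k])] for k in base}


def merge(base, vectors, weights):
    """Add weighted task vectors onto a copy of the base parameters."""
    merged = {k: list(v) for k, v in base.items()}
    for vec, w in zip(vectors, weights):
        for k in merged:
            merged[k] = [m + w * d for m, d in zip(merged[k], vec[k])]
    return merged


# Hypothetical toy parameters: one tensor "w" with two entries.
base = {"w": [1.0, 2.0]}
sft = {"w": [1.5, 2.0]}    # stands in for a supervised fine-tuned checkpoint
rlvr = {"w": [1.0, 3.0]}   # stands in for an RLVR-tuned checkpoint

merged = merge(
    base,
    [task_vector(sft, base), task_vector(rlvr, base)],
    [1.0, 0.5],  # assumed mixing weights; changing them needs no retraining
)
print(merged)
```

Because the mixing weights are applied after training, different tradeoffs can be explored by changing two numbers rather than running another tuning job.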

Agent evaluation under use

Within agent work, the recurring question is whether a benchmark still means anything once tasks change and tool calls cost real time. Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows by Li et al. at CUHK in Hong Kong replaces static task sets with live workflow demand and grades agents on executed actions rather than fluent summaries. AgentFloor: How Far Up the Tool Use Ladder Can Small Open-Weight Models Go? by Karmakar et al. at Harvard asks how much practical tool use smaller open models can actually sustain, instead of assuming tool access is always a free win. To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling by Wu et al. at the Max Planck Institute for Software Systems turns that concern into a measurable decision problem, trying to cut wasteful calls without losing task success.
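One simple way to picture tool calling as a decision problem is an expected-value rule: call the tool only when the estimated gain in task success outweighs the cost of the call. The sketch below is a hypothetical illustration of that framing, not the framework proposed by Wu et al.; the class, the field names, and the example probabilities are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ToolCallEstimate:
    """Hypothetical estimates for one candidate tool call."""
    p_success_without: float  # estimated task success if the tool is skipped
    p_success_with: float     # estimated task success if the tool is called
    call_cost: float          # cost of the call, in the same units as task value


def should_call(est, task_value=1.0):
    """Call the tool only if the expected gain exceeds the cost of calling."""
    expected_gain = (est.p_success_with - est.p_success_without) * task_value
    return expected_gain > est.call_cost


# A call that lifts success from 0.6 to 0.9 is worth a cost of 0.1 ...
helpful = ToolCallEstimate(p_success_without=0.6, p_success_with=0.9, call_cost=0.1)
# ... but a call that lifts it only from 0.85 to 0.9 is not.
wasteful = ToolCallEstimate(p_success_without=0.85, p_success_with=0.9, call_cost=0.1)

print(should_call(helpful), should_call(wasteful))
```

In practice the hard part is estimating the two success probabilities, which is where an evaluation framework has to do the real work.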

Faster splats, better geometry

In 3D vision, the center of gravity is moving from raw photorealism toward scene representations that settle quickly and stay geometrically trustworthy. Faster 3D Gaussian Splatting Convergence via Structure-Aware Densification by Lyu et al. at Max Planck speeds training by adding points in a way that respects scene structure, rather than relying on looser densification heuristics. 2D-SuGaR: Surface-Aware Gaussian Splatting for Geometrically Accurate Mesh Reconstruction by Gupta et al. at TU Darmstadt pushes splats closer to usable surfaces, addressing the common complaint that fast renderers still leave messy geometry. HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation by Zhou et al. at Huazhong University of Science and Technology extends the same mood into driving scenes, using a shared world model so perception and generation operate on a common 3D description.