Apr 20 – Apr 26, 2026

Preprint Report: Clinical decision tools, agent evaluation, and efficient inference methods


Across approximately 3,100 CS preprints on arXiv this week, three threads stood out. Clinical AI research makes up about 6% of submissions, language model research about 18%, and machine learning systems research about 9%. Within clinical AI, multimodal hospital decision support is the most active thread, accounting for roughly 33% of that area's submissions. Work on LLM and agent evaluation for real tasks makes up about 17% of language model research, with the field testing models on messier tasks that reveal hidden failure modes. Work on efficient inference methods makes up about 11% of machine learning systems research, where the pressure is to cut cost without losing usable structure.
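Because the thread percentages above are shares of an area rather than of all submissions, it can help to convert them into rough preprint counts. The sketch below does that arithmetic using only the approximate figures quoted in this report; the variable and function names are illustrative, and the results are estimates, not actual tallies.

```python
# Rough counts implied by the percentages quoted in this report.
# Every input is an approximate figure from the text, so the outputs are estimates only.
TOTAL_PREPRINTS = 3100

# name -> (area's share of all submissions, highlighted thread's share of that area)
shares = {
    "clinical AI / multimodal hospital decision support": (0.06, 0.33),
    "language models / LLM and agent evaluation": (0.18, 0.17),
    "ML systems / efficient inference methods": (0.09, 0.11),
}


def estimate_counts(total, area_share, thread_share):
    """Return (area_count, thread_count), each rounded to whole preprints."""
    area = total * area_share
    return round(area), round(area * thread_share)


estimates = {
    name: estimate_counts(TOTAL_PREPRINTS, area_share, thread_share)
    for name, (area_share, thread_share) in shares.items()
}

for name, (area_count, thread_count) in estimates.items():
    print(f"{name}: ~{area_count} in area, ~{thread_count} in thread")
```

Under these inputs the estimates come to roughly 186 clinical AI preprints (about 61 on hospital decision support), 558 language model preprints (about 95 on evaluation), and 279 machine learning systems preprints (about 31 on efficient inference). Since the source percentages are themselves rounded, these counts are only indicative.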

Bedside models meet deployment

Across the clinical prediction work, the pressure has shifted from building another risk score to showing how a model might survive contact with hospital practice. "An Integrated Framework for Explainable, Fair, and Observable Hospital Readmission Prediction: Development and Validation on MIMIC-IV" adds calibrated outputs, subgroup checks, and observability hooks to the familiar readmission task, addressing the common problem of models that look fine offline but give little help once deployed. "Machine Intelligence-Driven Forecasting for ED Triage and Dynamic Hospital Patient Routing" takes a broader operational view of emergency department triage and routing, aiming to support patient-flow decisions rather than a single diagnosis. "High-resolution disconnectome predicts outcome and response to thrombectomy in basilar artery occlusion" carries the same translational instinct into stroke care, using lesion-disconnection structure to predict who benefits from intervention.

Harder tests for agents

In the agent and LLM block, the mood is less about claiming general capability and more about finding where systems break. "AgentSearchBench: A Benchmark for AI Agent Search in the Wild" addresses the easy-benchmark problem by grounding search tasks in live, messy environments where retrieval, planning, and execution all matter. "Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems" goes after another blind spot, separating whether a bad outcome came from coordination, tool use, or the underlying model itself. "Measuring and Mitigating Persona Distortions from AI Writing Assistance" adds a different kind of stress test: it shows that assistance can warp the user's voice and proposes mitigation strategies, which turns evaluation toward behavioral side effects rather than task accuracy alone.

Efficiency with structure

Away from the biomedical surge, several methods preprints try to speed up inference and retrieval without treating hardware or data patterns as afterthoughts. "Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation" targets sparse-attention decoding, using temporal correlation to avoid wasting work on unlikely top-k candidates while keeping the search efficient on Blackwell GPUs. "RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment" addresses a related efficiency problem by learning which inputs deserve the expensive model path and which do not. "COMPASS: A Unified Decision-Intelligence System for Navigating Performance Trade-off in HPC" tackles the same pressure from the systems side, helping operators choose among competing throughput and efficiency trade-offs instead of tuning by hand.