Intelligence is not the bottleneck: results from the AIPR study
AIPR reads a submitted manuscript and assigns a quality score. This study validates that score against 300 ICLR 2026 submissions with public peer-review outcomes. Before we graded a single paper, the analysis plan was registered publicly, with its hypotheses, primary metric, and score thresholds fixed in advance. That is the basis for our confidence in what follows: nothing below was tuned after the outcomes were known. AIPR's frontier score agrees with the human outcome: it separates accepted from rejected work at AUROC 0.87, and on the production model every paper in its lowest-score quintile was rejected. A second result is about the model itself: even a bare one-paragraph prompt scores competently (AUROC 0.80), so within this cohort a frontier model already produces a first-pass signal that agrees substantially with human outcomes, with or without our prompt. The observed AIPR-vs-bare-prompt AUROC difference was not statistically resolved (p=0.09), so we do not claim equivalence; the paired cohort was not powered to resolve a small gap over an already strong baseline. What the pipeline adds is reliability (it grades far more consistently across runs than the bare model) and a grounded, rubric-anchored review in the same pass; the reviewer keeps the decision.
The score agrees with the human outcome
Each point is an ICLR submission: its AIPR overall score (x) against the mean rating its human reviewers gave (y), colored by decision tier. The score rises with reviewer rating (Spearman ρ = 0.60) and separates rejected from accepted work (AUROC 0.87). Mean score also climbs monotonically across the reject, poster, and oral tiers. Hover any point: its score breakdown (overall, reviewer rating, and the novelty / rigor / applicability / clarity subscores) shows in an overlay on the figure, and the other decision tiers dim so the point's own group stands out. Toggle above the plot between the frontier (production) model, the cheaper full-mini model, and the bare naive judge.
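For readers who want the separation claim in concrete terms, AUROC can be read as the probability that a randomly chosen accepted paper scores higher than a randomly chosen rejected one, with ties counted as half. The sketch below computes it that way. The function name `auroc` and the toy scores are illustrative assumptions, not the study's code or data.

```python
def auroc(scores, accepted):
    """Fraction of (accepted, rejected) pairs in which the accepted paper
    has the higher score; tied pairs count as half a win."""
    pos = [s for s, a in zip(scores, accepted) if a]
    neg = [s for s, a in zip(scores, accepted) if not a]
    if not pos or not neg:
        raise ValueError("need at least one accepted and one rejected paper")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores and decisions (hypothetical, not from the study).
toy_scores = [35, 42, 50, 58, 61, 70, 77, 83]
toy_accepted = [False, False, False, True, False, True, True, True]
print(auroc(toy_scores, toy_accepted))
```

The pairwise loop is quadratic in cohort size, which is fine for a few hundred papers; a rank-based formula gives the same value faster on larger sets.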
● reject ● poster ● oral
The model already carries much of the signal
We graded the same papers a second way: a single one-paragraph prompt to the same frontier model, with no rubric and no audit. On its own the bare prompt already separates accepted from rejected work (AUROC 0.80 [0.70, 0.88]). The full pipeline is higher, but the paired difference is not statistically resolved (ΔAUROC 0.07 [-0.01, 0.16], p=0.09 by a paired stratified bootstrap). That is not an equivalence result: we powered the study to detect a larger gap, and resolving a small difference over this already strong naive baseline would need a larger paired cohort. With or without our prompt, within this cohort the model already produces a first-pass signal that agrees substantially with human outcomes; the reviewer keeps the decision.
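One way to picture a paired stratified bootstrap is sketched below: each replicate resamples accepted and rejected papers separately (stratified), and scores both graders on the same resampled papers (paired). This is a minimal sketch under assumptions. The percentile interval, the two-sided p-value rule, the function names, and the toy data are all illustrative, and the study's registered procedure may differ in those details.

```python
import random


def auroc(scores, accepted):
    """Pairwise AUROC; ties count as half a win."""
    pos = [s for s, a in zip(scores, accepted) if a]
    neg = [s for s, a in zip(scores, accepted) if not a]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_delta_auroc(score_a, score_b, accepted, n_boot=2000, seed=0):
    """Return (observed delta, 95% percentile CI, two-sided p) for AUROC(a) - AUROC(b)."""
    rng = random.Random(seed)
    pos = [i for i, a in enumerate(accepted) if a]
    neg = [i for i, a in enumerate(accepted) if not a]
    observed = auroc(score_a, accepted) - auroc(score_b, accepted)
    deltas = []
    for _ in range(n_boot):
        # Stratified: resample within each outcome class, keeping class sizes fixed.
        idx = rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))
        labels = [accepted[i] for i in idx]
        # Paired: both graders are evaluated on the same resampled papers.
        a = [score_a[i] for i in idx]
        b = [score_b[i] for i in idx]
        deltas.append(auroc(a, labels) - auroc(b, labels))
    deltas.sort()
    lo = deltas[int(0.025 * n_boot)]
    hi = deltas[int(0.975 * n_boot) - 1]
    # Assumed p-value rule: twice the smaller tail mass on either side of zero.
    at_or_below = sum(d <= 0 for d in deltas) / n_boot
    at_or_above = sum(d >= 0 for d in deltas) / n_boot
    p = min(1.0, 2 * min(at_or_below, at_or_above))
    return observed, (lo, hi), p


# Toy paired scores for 12 papers (hypothetical, not from the study).
toy_accepted = [False] * 6 + [True] * 6
toy_pipeline = [30, 35, 41, 44, 52, 60, 55, 63, 68, 72, 80, 85]
toy_bare = [33, 48, 40, 58, 45, 66, 50, 57, 70, 62, 75, 81]
observed, (ci_lo, ci_hi), p_value = paired_bootstrap_delta_auroc(
    toy_pipeline, toy_bare, toy_accepted, n_boot=500, seed=0
)
print(observed, (ci_lo, ci_hi), p_value)
```

Pairing matters because the two graders see the same papers: resampling them together removes paper-to-paper difficulty from the comparison, which is what allows a modest cohort to say anything about a small gap.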
Score distributions by tier; brackets mark pairwise Mann–Whitney significance (* p<.05, ** p<.01, *** p<.001).
Hover a ROC curve to highlight it and read its AUROC with 95% CI; hover an AIPR distribution point for its dimension subscores (the naive judge emits only an overall score).
AIPR grades far more consistently
Reliability is what the pipeline adds over the bare model. We graded the same papers repeatedly and measured how much each paper's score moves run to run. AIPR's scores barely move (median within-paper SD 0.7 points); the bare prompt swings several times more (median 2.8), occasionally by ten points or more on the same paper. A deployable signal has to return the same verdict on the same paper. That consistency is the engineering, not the raw intelligence.
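The run-to-run measure described above can be sketched in a few lines: compute the standard deviation of each paper's repeated scores, then take the median across papers. The sample (n-1) standard deviation, the function name, and the toy runs below are assumptions for illustration; the source does not say which SD convention was used.

```python
from statistics import median, stdev


def median_within_paper_sd(runs_by_paper):
    """Median, across papers, of the SD of each paper's repeated scores.

    runs_by_paper maps a paper id to its list of scores from repeated runs
    (at least two runs per paper, since stdev needs two points).
    """
    return median(stdev(runs) for runs in runs_by_paper.values())


# Toy repeated runs (hypothetical, not from the study).
toy_pipeline_runs = {
    "paper_a": [70, 71, 70],
    "paper_b": [55, 54, 56],
    "paper_c": [62, 62, 63],
}
toy_bare_runs = {
    "paper_a": [60, 72, 68],
    "paper_b": [50, 58, 47],
    "paper_c": [66, 55, 61],
}
stable = median_within_paper_sd(toy_pipeline_runs)
noisy = median_within_paper_sd(toy_bare_runs)
print(stable, noisy)
```

Taking the median rather than the mean keeps a handful of unusually unstable papers from dominating the summary.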
A cheap model proxies the frontier score
Each point is a paper graded twice, once on a cheap model and once on the frontier model. They agree closely (Spearman ρ = 0.81; the dashed line is exact agreement), which is what lets the large-scale results run on the cheap model. Hover for the paired scores and the frontier dimension subscores.
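Spearman's ρ, the agreement measure quoted above, is the Pearson correlation of the two score lists after each is converted to ranks. The sketch below implements it with average ranks for ties. The function names and toy scores are illustrative assumptions rather than the study's code or data.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Toy paired scores (hypothetical, not from the study).
toy_cheap = [40, 55, 62, 70, 81]
toy_frontier = [45, 52, 68, 66, 85]
rho = spearman_rho(toy_cheap, toy_frontier)
print(rho)
```

Because only ranks enter the calculation, a cheap model that is systematically harsher or more lenient than the frontier model can still reach a high ρ; the dashed exact-agreement line in the figure is the separate check on absolute scores.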
Score versus manuscript covariates
A score that simply rewarded longer or more polished manuscripts would track a surface feature, not quality. Each point is a paper; the y-axis is its AIPR overall score and the x-axis is a non-identifying manuscript covariate you can switch below (word, page, token, or figure count). The rank correlation stays weak and the decision tiers overlap across the whole range, so the score is not a length or complexity proxy. The last option, review length, is the system's own output rather than a manuscript property — shown for interest, not as a confound control.
Conclusion
AIPR earns a specific role: a low score reliably flags work that is weak relative to the venue bar (on the production model, every paper in the bottom quintile was rejected), which is exactly what a first-pass triage tool is for. The distinction that keeps the claim honest is between agreement and prediction: the score agrees with human outcomes as a measurement (it discriminates and correlates), but it does not predict the acceptance decision for a given paper. It is not calibrated to an acceptance probability, it does not rank strong papers against each other, and the reviewer keeps the decision. We characterize where the score errs rather than burying those errors. The open question the study points to is not whether a model can produce a meaningful first-pass score, which it can in this cohort, but whether a system can do it reliably and with grounded evidence. That is an engineering problem, not a shortage of intelligence.
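The bottom-quintile check behind the triage claim can be sketched as follows: sort papers by score, take the lowest fifth, and report the share that were rejected. The function name, the floor-division cutoff, the handling of ties at the boundary, and the toy data are assumptions; the study's registered thresholds may be defined differently.

```python
def bottom_quintile_rejection_rate(scores, accepted):
    """Share of rejected papers among the lowest-scoring fifth of the cohort.

    Assumes the quintile holds len(scores) // 5 papers (at least one) and
    that ties at the boundary are broken by input order.
    """
    if not scores:
        raise ValueError("need at least one paper")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    k = max(1, len(scores) // 5)
    lowest = order[:k]
    return sum(not accepted[i] for i in lowest) / k


# Toy cohort of 10 papers (hypothetical, not from the study).
toy_scores = [22, 31, 38, 45, 51, 56, 63, 69, 74, 88]
toy_accepted = [False, False, False, True, False, True, False, True, True, True]
rate = bottom_quintile_rejection_rate(toy_scores, toy_accepted)
print(rate)
```

A rate of 1.0 on a cohort says every paper in that slice was rejected; it does not by itself give an acceptance probability for any other score range, which matches the agreement-versus-prediction distinction drawn above.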