Methodology

Humans in the Loop

Every AI-generated review on AIPR is a starting point for a human reviewer, not a final verdict. The model produces scores and per-dimension comments; a reviewer reads the draft, edits whatever needs editing, overrules any score they disagree with, and decides what the author sees. The AI never sends output directly to an author: what the author receives is the reviewer's final version, not the model's.

Borderline scores, citation concerns, and suspected misconduct are flagged for the reviewer's attention but are never decided by the AI alone.

A longer version of why this matters lives on the purpose page.

How It Works

Every week we look at thousands of newly published research papers. Each paper selected from that pool is read in full by the model and graded across five scoring dimensions, producing a structured draft for a human reviewer to take forward.

The grading pass produces per-dimension scores with a written rationale for each one, so the reviewer (and later the author) can see why the model assigned what it did. Reviewers can revise scores, rewrite rationales, or send the draft back for another pass before it is finalised.
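To make the shape of such a draft concrete, here is a minimal sketch of how per-dimension scores, rationales, and reviewer actions could be represented. The class names, field names, and status values (`draft`, `returned`, `approved`) are assumptions made for illustration; they are not taken from AIPR's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# The five dimension names follow this page; the identifiers are illustrative.
DIMENSIONS = ("novelty", "rigor", "applicability", "clarity", "citation_quality")


@dataclass
class DimensionScore:
    score: int       # 0 to 100, as described on this page
    rationale: str   # written explanation for the score

    def __post_init__(self) -> None:
        if not 0 <= self.score <= 100:
            raise ValueError(f"score must be between 0 and 100, got {self.score}")


@dataclass
class DraftReview:
    paper_id: str
    scores: Dict[str, DimensionScore]
    status: str = "draft"  # hypothetical states: draft, returned, approved

    def __post_init__(self) -> None:
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"missing dimensions: {sorted(missing)}")

    def revise(self, dimension: str, score: Optional[int] = None,
               rationale: Optional[str] = None) -> None:
        """Reviewer overrides a score, a rationale, or both."""
        current = self.scores[dimension]
        self.scores[dimension] = DimensionScore(
            current.score if score is None else score,
            current.rationale if rationale is None else rationale,
        )

    def send_back(self) -> None:
        """Reviewer requests another model pass."""
        self.status = "returned"

    def approve(self) -> None:
        """Reviewer finalises the review."""
        self.status = "approved"


# Usage: a model draft that the reviewer adjusts and then approves.
draft = DraftReview(
    paper_id="example-paper",
    scores={name: DimensionScore(70, "model rationale") for name in DIMENSIONS},
)
draft.revise("rigor", score=55, rationale="Control group is missing.")
draft.approve()
```

Keeping the rationale next to each score means a reviewer's override replaces both in one place, which is one way to ensure the author only ever sees the reviewer's version.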

Citations are checked against external academic databases to verify that referenced works exist and are attributed correctly. Within a given weekly cohort, papers are also compared against each other so that the relative ranking reflects differences the model can defend, not differences the model happened to imagine.

Once a reviewer approves the review, the paper and its full evaluation breakdown can be published to the leaderboard. Readers see the reviewer-approved version, not the raw model output.

Scoring Dimensions

Each paper is scored on five dimensions, each from 0 to 100. Scores are combined into an overall score, then adjusted by model confidence.
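This page does not state the exact combination formula, so the following is only a sketch of one plausible reading. It assumes a weighted mean of the five dimension scores (equal weights by default) and a confidence value between 0 and 1 that pulls the result toward a neutral baseline of 50 when confidence is low. The weights, the baseline, and the form of the adjustment are all assumptions, not AIPR's published method.

```python
from typing import Dict, Optional

DIMENSIONS = ("novelty", "rigor", "applicability", "clarity", "citation_quality")


def overall_score(
    scores: Dict[str, float],
    confidence: float,
    weights: Optional[Dict[str, float]] = None,
    baseline: float = 50.0,
) -> float:
    """Combine per-dimension scores, then shrink toward `baseline` by confidence.

    Assumed formula: confidence * weighted_mean + (1 - confidence) * baseline.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    for name in DIMENSIONS:
        if not 0 <= scores[name] <= 100:
            raise ValueError(f"{name} score must be between 0 and 100")

    if weights is None:
        weights = {name: 1.0 for name in DIMENSIONS}  # equal weights (assumption)

    total_weight = sum(weights[name] for name in DIMENSIONS)
    weighted_mean = sum(scores[name] * weights[name] for name in DIMENSIONS) / total_weight
    return confidence * weighted_mean + (1.0 - confidence) * baseline


# Usage: the same raw scores under full and partial confidence.
raw = {name: 80 for name in DIMENSIONS}
full = overall_score(raw, confidence=1.0)   # stays at the weighted mean
half = overall_score(raw, confidence=0.5)   # halfway between the mean and the baseline
```

Under this assumed formula a low-confidence paper cannot reach an extreme overall score on the strength of uncertain dimension scores alone, which is one reason a confidence adjustment might be applied.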

Novelty

Evaluates the originality of the contribution. How new are the ideas, methods, or findings? Does the work introduce genuinely novel concepts or is it an incremental improvement?

Rigor

Assesses the methodological soundness and technical correctness. Are the proofs valid? Are experiments well-designed with proper controls? Are claims supported by evidence?

Applicability

Measures real-world relevance and potential impact. Can the methods be applied to practical problems? How broad is the potential audience? Does it solve a real need?

Clarity

Judges the quality of writing and presentation. Is the paper well-organized? Are the key ideas explained clearly? Are figures and tables informative?

Citation Quality

Evaluates the quality of references and related work coverage. Does the paper cite relevant prior work? Are comparisons fair and comprehensive? Are key baselines included?
