Evaluation infrastructure for AI in high-stakes domains — starting with life sciences

Evaluation infrastructure for AI that can’t afford to be wrong.

Hopperlace builds the evaluation layer that tells you when an AI system’s outputs can be trusted — and when they can’t. Our first product is in life sciences, where the cost of a confident mistake is highest.

Evidence Synthesis AI screens studies the way a careful expert would. Built for the teams who can’t afford a confident mistake: systematic reviewers, and the drug-safety groups monitoring literature for adverse events.

~80 systematic reviews published every day

>1 yr for the average review, from registration to publication

>8× swing in a model’s confident-error rate between reviews

01 · The problem

Screening is the bottleneck. It’s also the trap.

Title-and-abstract screening is the single largest time sink in evidence work. The obvious fix is to automate it, but a naive AI screener makes a high-stakes review riskier, not faster. An overconfident one corrupts every downstream step. An over-cautious one wipes out the time savings that justified using AI at all. The question was never whether AI can screen. It’s whether you can trust how it decides, and prove it afterwards.

02 · How it works

Decisive where it’s sure. Careful where it isn’t.

1. Handles the confident calls on its own

Clear includes and excludes are decided automatically where the system is well-calibrated.

2. Surfaces only the genuine judgment calls

Ambiguous studies are routed to a human, so reviewer attention goes where it actually matters.

3. Shows the reasoning behind every decision

Each include, exclude, or deferral comes with the rationale the system used to get there.

4. Override anything, audit everything

Reviewers can change any decision, and the full log is the trail regulators expect.
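For readers who want a concrete picture, the four steps above can be sketched roughly as follows. This is an illustrative sketch only, not Hopperlace's implementation: the thresholds, the `Decision` record, and the `screen` and `override` helpers are all hypothetical names and values invented for this example, and it assumes the model emits a calibrated probability that a study should be included.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical cut-offs, chosen only for illustration.
INCLUDE_AT = 0.90
EXCLUDE_AT = 0.10


@dataclass
class Decision:
    study_id: str
    label: str        # "include", "exclude", or "defer"
    rationale: str    # why the decision was made
    decided_by: str = "model"


# Append-only log: every decision and every override is kept.
audit_log: List[Decision] = []


def screen(study_id: str, p_include: float, rationale: str) -> Decision:
    """Decide confident cases automatically; defer the rest to a human."""
    if p_include >= INCLUDE_AT:
        label = "include"
    elif p_include <= EXCLUDE_AT:
        label = "exclude"
    else:
        label = "defer"
    decision = Decision(study_id, label, rationale)
    audit_log.append(decision)
    return decision


def override(study_id: str, new_label: str, reviewer: str, reason: str) -> Decision:
    """Record a reviewer's change as a new entry; earlier entries are never edited."""
    decision = Decision(study_id, new_label, reason, decided_by=reviewer)
    audit_log.append(decision)
    return decision
```

In this sketch an override never rewrites history. It adds a new entry, so the log still shows what the model originally decided and who changed it.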

03 · Why you can trust it

Built on deference-aware evaluation.

Most AI metrics reward confident answers and penalise hesitation — the wrong incentive when overconfidence carries real cost. Deference-aware evaluation measures whether a system recognises the limits of its own competence and steps back when it should. It credits considered deferral as correct, separates it from genuine confident error, and surfaces a class of failures that more data and bigger models won’t fix.
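To make the distinction concrete, here is a minimal scoring sketch. It is not the metric set defined in the white paper: the function name, the outcome labels, and the simplification that every deferral is credited are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def deference_aware_summary(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Score (decision, gold) pairs, where decision is 'include', 'exclude',
    or 'defer' and gold is 'include' or 'exclude'.

    Simplifying assumption: every deferral is credited. A fuller metric
    would need a rule for when a deferral was actually warranted.
    """
    counts: Counter = Counter()
    for decision, gold in pairs:
        if decision == "defer":
            counts["deferred"] += 1
        elif decision == gold:
            counts["confident_correct"] += 1
        else:
            counts["confident_error"] += 1

    total = sum(counts.values())
    if total == 0:
        raise ValueError("no decisions to score")

    return {
        # Conventional view: a deferral counts the same as a wrong answer.
        "naive_accuracy": counts["confident_correct"] / total,
        # Deference-aware view: stepping back is not an error.
        "deference_aware_score": (counts["confident_correct"] + counts["deferred"]) / total,
        # Confident errors are reported on their own.
        "confident_error_rate": counts["confident_error"] / total,
        "deferral_rate": counts["deferred"] / total,
    }
```

Under this simplification a system that defers on everything would score perfectly, so the deferral rate has to be read alongside the score.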

Validation

Validated across 6 frontier models and 5 medical domains — 2,729 studies, 16,374 screening decisions.

6 frontier models · 5 medical domains · 2,729 studies · 16,374 screening decisions

04 · Who it’s for

The teams who can’t afford a confident mistake.

Systematic review teams

Title-and-abstract screening without the year-long grind, with an audit trail that holds up to peer scrutiny.

Pharmacovigilance

Continuous literature monitoring for adverse events, with the documented decision trail regulators expect.

Research consultancies

Evidence work at speed, with rigour you can defend to a client or regulator.

05 · Research

White Paper · 2026 · Hopperlace Research · DOI: 10.17605/OSF.IO/A69YH

Poster · Workshop on Technical AI Governance Research (TAIGR), ICML 2026

Deference-Aware Evaluation for Human-in-the-Loop AI Systems

A framework for evaluating AI systems on their capacity to recognise the limits of their own competence and defer when appropriate, alongside standard accuracy. The paper identifies two failure modes that conventional metrics conflate — penalised conservatism and genuine confident errors — and introduces deference-aware metrics that distinguish them. A cross-domain audit of six frontier models across five medical domains (2,729 studies, 16,374 screening decisions) shows that no single model is uniformly safe, and isolates a structural class of failures that calibration, ensembling, and model scaling cannot fix.

“A model’s confident-error rate can swing more than eightfold from one review to the next — which is why screening needs an evaluation layer that knows when its own judgments can be trusted.”

06 · Team

Who we are

YS

Yuyu Shen

Founder

A decade building production AI in regulated industries: fintech, employment, and consumer banking. Founded Hopperlace after recurring exposure to the same gap: AI deployed in high-stakes work without evaluation infrastructure capable of distinguishing trustworthy outputs from overconfident ones. Author of the deference-aware evaluation framework, presented as a poster at the TAIGR workshop, ICML 2026.

MW

Martin Walker, MPH

Co-founder, Evidence Synthesis

Background in evidence-based health and evidence synthesis for systematic reviews; brings the domain experience that keeps the system honest about clinical reality.

07 · Where we go next

Life sciences is where we start — systematic review and pharmacovigilance are the domains where calibration failures have the clearest downstream cost, and where the compliance trail is non-negotiable. But the evaluation problem is domain-general.

The same failure mode — AI that doesn’t know the limits of its own competence — appears wherever consequential decisions rest on AI outputs: legal research and contract review, safety-critical engineering, financial and regulatory filings. Hopperlace’s methodology was designed from the ground up to travel across these domains.

We’re building the infrastructure layer first. The beachhead makes it real.

Contact

Get in touch

Running a systematic review or pharmacovigilance team? We’re onboarding early pilots. Investors and partners: hello@hopperlace.ai
