Evaluation infrastructure for AI decisions that can’t afford to be wrong.

Hopperlace tells you when an AI system’s decisions can be trusted and shows you why. We start where confident mistakes carry both societal and financial costs: high-stakes evidence work in life sciences.

Start a pilot See the product

~80

systematic reviews published every day

>1 year

average review, from registration to publication

8×

swing in a model's confident-error rate between reviews

01 · The problem

Speed to accuracy is the real problem.

Dual human screening is the gold standard for a reason: single reviewers miss 13% of relevant studies. Dual screening is also why screening takes months. AI can now cut screening time, but which AI? On their most confident decisions, frontier models differ by 45× in error rate. A model that isn’t aware of its limits doesn’t just put the review at risk; it slows the review down, burying reviewers in false results.

Most AI tools don’t know their own limits. Hopperlace knows when it can’t make a confident decision and flags that decision for you.

02 · How it works

Decisive when it’s confident. Careful when it’s not.

Handles the confident calls on its own

Clear includes and excludes are decided automatically where the system is well calibrated.

Surfaces only the genuine judgment calls

Ambiguous decisions are routed to a human, so reviewer attention goes where it actually matters.

Shows the reasoning behind every decision

Each include, exclude, or deferral comes with the rationale the system used to get there.

Override anything, audit everything

Reviewers can change any decision, and the full log of human and AI decisions is auditable.
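The four behaviours above can be illustrated with a minimal sketch. It is not Hopperlace’s implementation: the names `ModelCall`, `triage`, and `override`, the 0.95 threshold, and the log fields are all hypothetical, and the sketch assumes the model reports a calibrated confidence for each decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCall:
    """One model screening decision (hypothetical structure)."""
    study_id: str
    label: str         # "include" or "exclude"
    confidence: float  # assumed to be a calibrated probability in [0, 1]
    rationale: str     # the reasoning shown to the reviewer


def triage(calls, threshold=0.95):
    """Auto-decide confident calls and defer the rest to a human.

    The threshold is an illustrative assumption, not a recommended value.
    Every call is logged, whether it was decided or deferred.
    """
    log = []
    for call in calls:
        confident = call.confidence >= threshold
        log.append({
            "study_id": call.study_id,
            "model_label": call.label,
            "rationale": call.rationale,
            "status": "auto" if confident else "deferred",
            "final_label": call.label if confident else None,
            "decided_by": "ai" if confident else None,
        })
    return log


def override(log, study_id, label, reviewer):
    """Record a reviewer's decision; the model's label and rationale stay in the log."""
    for entry in log:
        if entry["study_id"] == study_id:
            entry["final_label"] = label
            entry["decided_by"] = reviewer
            entry["status"] = "reviewed"
            return entry
    raise KeyError(study_id)
```

In this sketch a reviewer can call `override` on any entry, including one the system decided automatically, and the original model label remains alongside the final one for audit.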

See the full product

03 · Why you can trust it

Validation

Validated across 6 frontier models and 5 medical domains — 2,729 studies, 16,374 screening decisions. No single model is uniformly safe; deference-aware evaluation is how you tell which decisions to trust.

45×

error-rate gap between the best and worst model on their most-confident decisions — 99.7% vs 87.6% accurate

independent metrics that converge on the same model ranking
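For readers who want to check the arithmetic behind an error-rate gap, here is a small sketch. The function names are hypothetical, and the definition of confident-error rate used here (share of wrong decisions among those at or above a confidence threshold) is an assumption, not necessarily the white paper’s definition. Note that plugging in the rounded accuracies quoted above (99.7% and 87.6%) gives a ratio of roughly 41; the 45× headline presumably reflects unrounded values.

```python
def confident_error_rate(decisions, threshold=0.9):
    """Share of wrong decisions among those made with confidence >= threshold.

    `decisions` is an iterable of (confidence, was_correct) pairs.
    Returns None when no decision clears the threshold.
    The threshold value is an illustrative assumption.
    """
    confident = [correct for conf, correct in decisions if conf >= threshold]
    if not confident:
        return None
    return confident.count(False) / len(confident)


def error_rate_ratio(best_accuracy, worst_accuracy):
    """How many times higher the worst model's error rate is than the best's."""
    return (1.0 - worst_accuracy) / (1.0 - best_accuracy)


if __name__ == "__main__":
    # Rounded accuracies from the text: error rates of 0.3% and 12.4%.
    print(round(error_rate_ratio(0.997, 0.876), 1))
```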

04 · Who it’s for

The teams who can’t afford a confident mistake.

Systematic review teams

Title-and-abstract screening without the year-long grind, with an audit trail that holds up to peer scrutiny.

Pharmacovigilance / drug-safety teams

Continuous literature monitoring for adverse events, with documented decision trails.

Research consultancies

Evidence work at speed, with rigour you can defend to clients.

05 · Research

White Paper · 2026 · Hopperlace Research · DOI: 10.17605/OSF.IO/A69YH

Poster · Workshop on Technical AI Governance Research (TAIGR), ICML 2026

Deference-Aware Evaluation for Human-in-the-Loop AI Systems

Read on OSF

“A model’s confident-error rate can swing more than eightfold from one review to the next — which is why screening needs an evaluation layer that knows when its own judgments can be trusted.”

06 · Team

Who we are

Yuyu Shen

Founder

A decade building production AI in regulated industries: fintech, employment, and consumer banking. Founded Hopperlace after repeatedly encountering the same gap: AI deployed in high-stakes work without evaluation infrastructure capable of distinguishing trustworthy outputs from overconfident ones. Author of the deference-aware evaluation framework; poster at the ICML 2026 TAIGR workshop.

Martin Walker, MPH

Co-founder, Evidence Synthesis

Background in evidence-based health and evidence synthesis for systematic reviews; brings the domain experience that keeps the system honest about clinical reality. Resident AI skeptic.

07 · Where we go next


Life sciences is where we start. But the problem of AI that doesn’t know the limits of its own competence appears wherever consequential decisions rest on AI outputs. Hopperlace’s methodology was built to travel.

Contact

Get in touch

Running a systematic review or pharmacovigilance team? We’re onboarding early pilots. Investors and partners: hello@hopperlace.ai
