The Prova Method · Standards

The Risk Taxonomy of Evaluation Failure

How large studies fail, and how to catch it before they do.

Large evaluations fail in predictable ways. Run a pre-mortem (imagine the study has already failed and ask what most likely went wrong) and the same culprits appear again and again. Most are visible in advance to someone who knows what to look for and has no stake in the answer. This is the catalog of those failure modes, across the three dimensions where studies come apart: the design, the context, and the ethics.

It is the standard behind the Pre-Mortem: the independent stress-test we run on a planned evaluation before the money is committed. We can run it cleanly because we don’t bid on the evaluation work we review.

Design

The study can’t answer the question, even if everything goes right.

Underpowered for a realistic effect

The power calculation assumes an effect larger than comparable programs have ever produced, so the study is likely to “find nothing” even if the program works: a false null that reads as evidence of no effect.

The tell: the assumed effect size sits above the field’s track record. The fix: power to an effect the program could plausibly produce, or state plainly that the study can only detect a large one.
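To see how quickly the required sample grows as the assumed effect shrinks, here is a minimal sketch using the standard normal-approximation formula for a two-arm comparison of means with equal allocation. The function names, the default alpha and power, and the effect sizes in the loop are illustrative assumptions, not figures from any particular study.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal


def required_n_per_arm(effect_sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Participants per arm needed to detect a standardized effect.

    Normal approximation, two-sided test, two equal arms,
    individual-level randomization. effect_sd is the effect in
    standard-deviation units.
    """
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_sd ** 2)


def minimum_detectable_effect(n_per_arm: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest standardized effect a given sample can detect (same assumptions)."""
    return (_Z.inv_cdf(1 - alpha / 2) + _Z.inv_cdf(power)) * sqrt(2 / n_per_arm)


if __name__ == "__main__":
    # Hypothetical effect sizes: an optimistic assumption versus smaller ones.
    for effect in (0.5, 0.2, 0.1):
        print(f"effect {effect} SD -> {required_n_per_arm(effect)} per arm")
```

Under these assumptions, halving the assumed effect roughly quadruples the sample. A design powered for 0.5 SD therefore needs about six times more participants if the plausible effect is 0.2 SD.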

The wrong outcome

The instrument measures something adjacent to what the program actually changes, so real effects go undetected.

The tell: the primary outcome doesn’t map to the program’s mechanism. The fix: tie the measure to the theory of change, and pre-specify it.

Contamination between groups

Treatment and comparison groups interact through shared markets, classrooms, or information, so the comparison group no longer shows what would have happened without the program.

The tell: the groups aren’t actually isolated from each other. The fix: randomize at the level where isolation holds, or measure the spillover directly.
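Randomizing at a higher level (sites or classrooms rather than individuals) protects isolation but costs statistical power. A rough sketch of that cost uses the standard design-effect formula, 1 + (m - 1) × ICC, which assumes equal cluster sizes. The cluster size and intracluster correlation in the example are hypothetical.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation from randomizing clusters instead of individuals.

    Assumes equal-sized clusters; icc is the intracluster correlation.
    """
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_clusters: int, cluster_size: float, icc: float) -> float:
    """Number of independent observations the clustered sample is worth."""
    return n_clusters * cluster_size / design_effect(cluster_size, icc)


if __name__ == "__main__":
    # Hypothetical: 30 sites, 40 participants each, ICC of 0.05.
    print(design_effect(40, 0.05))
    print(effective_sample_size(30, 40, 0.05))
```

With these illustrative inputs, 1,200 enrolled participants carry roughly the information of 400 independent ones, which is why the power calculation has to be redone whenever the unit of randomization changes.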

Differential attrition

People leave the study unevenly, and by endline the groups are no longer comparable.

The tell: long follow-up, a mobile population, no tracking plan. The fix: realistic attrition assumptions, a tracking budget, and bounds for what dropout could be hiding.
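One way to put bounds on what dropout could be hiding is the worst-case (Manski-style) approach for a binary outcome. It assumes every missing person had the worst outcome, then the best, and reports the range. This is a sketch of one bounding method among several. The class name, function names, and example rates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmSummary:
    observed_rate: float  # share with the outcome among those found at endline
    response_rate: float  # share of the arm's original sample found at endline


def arm_bounds(arm: ArmSummary) -> tuple[float, float]:
    """Lowest and highest possible outcome rate for the full original arm."""
    found = arm.observed_rate * arm.response_rate
    missing = 1 - arm.response_rate
    # Lower bound: no missing person has the outcome. Upper bound: all do.
    return found, found + missing


def effect_bounds(treatment: ArmSummary, comparison: ArmSummary) -> tuple[float, float]:
    """Worst-case bounds on the difference in outcome rates."""
    t_low, t_high = arm_bounds(treatment)
    c_low, c_high = arm_bounds(comparison)
    return t_low - c_high, t_high - c_low


if __name__ == "__main__":
    # Hypothetical endline: 60% employed among the 80% of treatment found,
    # 50% employed among the 90% of comparison found.
    low, high = effect_bounds(ArmSummary(0.60, 0.80), ArmSummary(0.50, 0.90))
    print(f"effect lies between {low:+.2f} and {high:+.2f}")
```

In this illustration a naive ten-point gain is compatible with anything from a seven-point loss to a twenty-three-point gain. The width of that range is driven almost entirely by how many people were lost, which is the argument for a tracking budget.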

Open-ended analysis

With no locked plan, there are enough ways to slice the data that something will look significant.

The tell: no pre-analysis plan; outcomes and subgroups not fixed in advance. The fix: pre-register and lock before the data arrives.
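The arithmetic behind "something will look significant" is short. This sketch assumes the tests are independent and that no true effect exists in any of them. Both are simplifications, and the function names are illustrative.

```python
def chance_of_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent null tests looks significant."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the overall false-positive rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # Hypothetical: 5 outcomes crossed with 4 subgroups gives 20 comparisons.
    for k in (1, 5, 20):
        print(k, round(chance_of_false_positive(k), 3), bonferroni_threshold(k))
```

With twenty unplanned comparisons, the chance of at least one spurious "finding" is roughly 64%. Fixing outcomes and subgroups in advance keeps the number of comparisons known, so a correction such as Bonferroni can be applied honestly.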

Context

The study is sound on paper, but the world won’t let it run.

Political timing

An election, a leadership change, or a policy shift lands mid-study and disrupts the program or the comparison.

The tell: the timeline runs straight through a known political event. The fix: build in timing buffers and decide in advance how disruption will be handled.

Partner capacity

The implementing organization has never sustained a research protocol, and fidelity and data quality erode once the study is live.

The tell: no track record running a study; thin data systems. The fix: assess capacity honestly, lighten the instrumentation, or build readiness before launching.

Data that doesn’t exist in the form required

The measurement plan assumes records the partner doesn’t actually keep, or keeps on paper, fragmented.

The tell: the plan needs data nobody is capturing yet. The fix: instrument the data channel first, or rescope to what exists.

Ambition beyond the budget

Too many outcomes, sites, or arms for the time and money, so corners get cut once the study is underway.

The tell: the design’s reach exceeds its resources. The fix: narrow to the decision-relevant core, and do that part well.

Ethics

The study shouldn’t run as designed, or will lose the trust it needs.

Randomizing without equipoise or fairness

Randomizing access to something already known to help, or in a way the community reads as unfair, invites resistance, dropout, and harm.

The tell: no genuine uncertainty about the intervention, or no fair-allocation rationale. The fix: randomize only under real uncertainty, or where a lottery is the fair way to ration a scarce good; consider phase-in designs.

Consent that isn’t real

Consent is treated as a form to sign rather than a state of understanding, especially where power makes refusal hard.

The tell: consent-as-formality, a vulnerable population, no comprehension check. The fix: a genuine understanding process, revisited as the study goes on.

Data taken without a say

Data is drawn from a community that has no voice over how it is used or whether it benefits.

The tell: no governance role for the people whose data it is. The fix: shared data governance and a real account of who benefits.

A worked pre-mortem

Three failures stacked, caught before a dollar is spent.

Here is the kind of stack a pre-mortem is built to surface. Picture a foundation about to commission a $1.4M youth-employment trial across thirty sites.

The pre-mortem surfaces three problems: the power calculation assumes an effect larger than any comparable program has produced (a likely false null); the implementing partner has never run a randomized protocol (a fidelity risk); and the timeline runs through a state policy change that would alter the comparison (a timing risk). None is visible in the glossy design. Revising all three before committing is the whole point: a $40,000 stress-test, under 3% of the budget, protecting a $1.4M decision. (Illustrative, not a specific engagement.)

The discipline

Likelihood, severity, and what would reduce it.

Every risk is graded the same way: how likely it is, how bad it would be if it happened, and what would reduce it; the few that are both likely and fatal are flagged before anything is signed.
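As a sketch of how such a register might be kept, the snippet below grades each risk on likelihood and severity and flags those that are both likely and fatal. The 1-to-5 scales, the cutoffs, and the example scores are hypothetical assumptions for illustration. They are not a published scoring scheme.

```python
from dataclasses import dataclass

LIKELY = 4  # hypothetical cutoff on a 1 (rare) to 5 (near certain) scale
FATAL = 5   # hypothetical cutoff on a 1 (minor) to 5 (sinks the study) scale


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int   # 1 to 5
    severity: int     # 1 to 5
    mitigation: str   # what would reduce it


def flagged(risks: list[Risk]) -> list[Risk]:
    """Risks that are both likely and fatal, most pressing first."""
    hits = [r for r in risks if r.likelihood >= LIKELY and r.severity >= FATAL]
    return sorted(hits, key=lambda r: r.likelihood * r.severity, reverse=True)


# Illustrative scores only, loosely following the worked pre-mortem.
EXAMPLE_REGISTER = [
    Risk("Partner capacity", 3, 4, "Build readiness before launching"),
    Risk("Political timing", 4, 5, "Add timing buffers; pre-agree a disruption plan"),
    Risk("Underpowered for a realistic effect", 5, 5, "Power to a plausible effect"),
]

if __name__ == "__main__":
    for risk in flagged(EXAMPLE_REGISTER):
        print(f"FLAG: {risk.name} -> {risk.mitigation}")
```

Keeping the mitigation alongside the scores means every flagged item arrives with a proposed fix, not just a warning.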

Sometimes the honest finding is that the study shouldn’t run as designed at all. That is a real result, and we report it as one. We hold our own designs to the same catalog before we run anything ourselves.

Version 1. We add failure modes as we find them.
