How many replicates do we need to detect a 10% improvement?

Problem

It’s pretty obvious that we’d be wasting precious time and cash if we tried to run enough replicates to detect a 2% signal with 10% noise. Unfortunately, what’s not obvious, and all too common, is the opposite waste of precious time and cash: running too few replicates to detect commercially relevant signals that are small relative to our experimental noise:

  • Have we been running enough replicates to detect a 10% improvement with our 10% CV (or a 2% improvement with our 2% CV)?
  • How small an improvement can we detect with our 10% CV if we only have time or cash to run 3 replicates (or 1 replicate)?

(Do we even know how small a change would be commercially relevant? Do we even know our CV? And what does a t-test have to do with these questions, if anything?)

Solution

Noise is always going to cloud our judgment to some degree, so before we can precisely answer the questions above we have to get specific about the frequencies of False Negatives (FN) and False Positives (FP) that our R&D program can tolerate. Perhaps you’ve never thought about it this way before, but it makes sense: the greater the opportunity cost of making the wrong decision, the more replicates we ought to run to “detect” the same commercially relevant signal.

Each solid line in the interactive visualization above illustrates the combinations of N and CV that together make it possible to detect a D% improvement with FP and FN error rates less than their specified targets. Crucially, as described in more detail in the next paragraph, each pair of FP and FN error rates is calculated relative to a single decision threshold (T%), which is drawn as a vertical dotted line. (If only a single gray dotted line is visible, then the black decision threshold is underneath it.) Rolling over the underlined text within the text box above will highlight the corresponding inputs and/or graph elements. 

To be clear, our target values of FP and FN are achieved only if (1) both test and reference conditions are replicated ≥N times each, (2) process noise is the same for both conditions, (3) process noise is stable throughout all our experiments that generate these data, (4) process noise is normally distributed with width ≤%CV, and (5) we are disciplined in promoting 100% of test conditions measured as >T% improvements over the reference condition and 0% of test conditions measured as <T% improvement. It’s important to emphasize and expound on that last point: if we first use our known process CV to determine how many replicates are needed to achieve our target FP and FN rates, but then throw the 2*N measurements we collect into a t-test, our actual FP and FN rates will be larger than intended because a t-test assumes – incorrectly, in this case – that we don’t already know our process CV.
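
For those who want to compute these numbers directly rather than read them off the graph, here is a minimal sketch of the calculation (Python, using scipy). It assumes the measured improvement is normally distributed around the true improvement with a known standard deviation of CV/√N, and that a single fixed threshold T is used exactly as described above; the interactive tool may combine the noise of the two conditions slightly differently, so treat the outputs as ballpark figures.

```python
from math import ceil, sqrt
from scipy.stats import norm

def replicates_needed(D, CV, fp, fn):
    """Smallest N for which a fixed threshold T exists with
    false-positive rate < fp and false-negative rate < fn,
    assuming the measured improvement is normal with a KNOWN
    standard deviation of CV / sqrt(N)."""
    z = norm.ppf(1 - fp) + norm.ppf(1 - fn)
    return ceil((z * CV / D) ** 2)

def smallest_detectable_D(CV, N, fp, fn):
    """The flip side of the same formula: the smallest improvement D
    that N replicates can 'detect' at the same error rates."""
    return (norm.ppf(1 - fp) + norm.ppf(1 - fn)) * CV / sqrt(N)

def decision_threshold(CV, N, fp):
    """Threshold T that pins the false-positive rate at fp; the
    false-negative rate then follows from D, CV and N."""
    return norm.ppf(1 - fp) * CV / sqrt(N)

# A 10% improvement with a 10% CV, targeting FP = FN < 0.10:
N = replicates_needed(D=10, CV=10, fp=0.10, fn=0.10)
T = decision_threshold(CV=10, N=N, fp=0.10)
print(N, round(T, 1))        # 7 replicates, threshold near 4.8%

# Only 3 replicates with the same 10% CV, targeting FP < 0.1 and FN < 0.05:
D_min = smallest_detectable_D(CV=10, N=3, fp=0.10, fn=0.05)
print(round(D_min, 1))       # ~16.9%, i.e. CV is roughly 0.6*D
```

The worked examples in the comments line up with the first few insights below.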

Insights

  • It’s really hard to detect a difference (D) that is equal to our process CV: ≥7 replicates of each condition are needed to achieve FP = FN < 0.10, and ≥10 replicates are needed to achieve FP = FN < 0.05. NOTE: even more replicates are needed if we intend to run a t-test on our data! E.g., ≥27 replicates of each condition are needed to achieve FN = FP < 0.05 (see the sketch after this list). If we don’t want to run so many replicates, we need to invest in reducing our CVs.
  • If we can only afford to run 3 replicates of each condition, we need CV < 0.6*D to achieve FN < 0.05 and FP < 0.1.
  • If we can only afford to run 1 replicate of each condition, we need CV < 0.35*D to achieve FN < 0.05 and FP < 0.1.
  • Our risk tolerance generally decreases with increasing scale since opportunity costs are magnified at larger scales. If we assume our experimental noise (%CV) is independent of scale, that lower risk tolerance at the larger scale means we’ll need to run more replicates than we did at the smaller scale to “detect” the same difference of commercial significance (D%). Of course, the actual cost of each replicate also increases with scale, so it is important either (1) to run with such low CVs at the larger scale that we can “detect” our commercially relevant differences with just 1 or 2 replicates, or (2) to develop scaled-down processes with such conclusively demonstrated predictive power that our at-scale runs are conscientiously managed as mere validations of the latest small-scale results. Sometimes it’s unavoidable that we need to “run experiments at scale” to get the data we need to answer an important question, but the statistics are just as unavoidable: at minimum, make sure any decisions made with the at-scale data use confidence intervals; better still, commit to running the number of replicates calculated above as a function of %D, %CV, FP and FN; best of all, strive to run with such low CVs at scale, or to develop such highly predictive scale-down models, that it never comes to this.
  • In order to achieve the target values of FN and FP, it’s critical that the replicates sample the largest sources of process variation. If we don’t already know (1) our process CV and (2) our sub-process CVs, it is strongly recommended that each replicate in the experiment be executed as an independent instance of the full end-to-end process, and that the sequence (WHEN) and location (WHERE) of the replicated conditions be randomized. Too often we don’t even realize how wrong we’ve been when we make “common sense” assumptions that it doesn’t matter which stock of reagent, oven, measurement device, or scientist is used for each replicate.
  • Frequently our replicates of the reference condition were performed weeks (if not months) ago, and our plan is to compare those results with next week’s replicates of the test condition. This approach seems cost- and time-efficient, but that’s only true if our process has been, and remains, stable throughout the collection of all of these data. Because there’s always some risk that even our most stable processes will shift, it’s common to run at least one replicate of the reference condition as a “control” in the larger experiment focused on replicating the test condition. No doubt this is a better practice than not running the control at all, but it can give a false sense of security unless two additional actions are taken: (1) the data from the control runs performed in experiment after experiment are funneled into a control chart, and (2) the number of control runs in each experiment is sized deliberately according to estimates of process variability (and not just set to 1 because that’s cheap).
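
And here is a rough sketch of the t-test comparison flagged in the first insight above, using statsmodels’ power calculator; the choice of a two-sided test is our assumption about what produces the ≥27 figure, not something stated by the interactive tool.

```python
from math import ceil
from scipy.stats import norm
from statsmodels.stats.power import TTestIndPower

D, CV = 10.0, 10.0          # trying to detect a difference equal to our CV
fp = fn = 0.05

# Known CV, fixed threshold (the calculation from the Solution section):
n_known = ceil(((norm.ppf(1 - fp) + norm.ppf(1 - fn)) * CV / D) ** 2)

# t-test: the CV must be re-estimated from the 2*N measurements, so more
# replicates are needed to hit the same error rates (two-sided test here).
n_ttest = ceil(TTestIndPower().solve_power(effect_size=D / CV,
                                           alpha=fp,
                                           power=1 - fn,
                                           alternative='two-sided'))

print(n_known, n_ttest)      # roughly 11 vs 27 replicates of each condition
```

Depending on exactly how the threshold and noise are defined, the known-CV requirement lands at 10 or 11 replicates, while the t-test requirement is more than double that either way, which is the whole point of the NOTE above.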
 

Equations
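
The equations behind the curves live inside the interactive visualization; a standard known-CV relation consistent with the discussion above (our reading of the convention, not a formula copied from the tool) is

$$N \;\ge\; \left(\frac{\left(z_{1-\mathrm{FP}} + z_{1-\mathrm{FN}}\right)\,\mathrm{CV}}{D}\right)^{2}, \qquad T \;=\; z_{1-\mathrm{FP}}\,\frac{\mathrm{CV}}{\sqrt{N}},$$

where $z_{1-p}$ is the standard normal quantile at probability $1-p$, and D, CV, and T are all expressed in percent.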

 Interactive visualization was created by Chris @ engaging-data.com.
