Problem
Whether we are attempting to do profitable manufacturing or reputable science, our goal is the same: to predict the future. Among both scientists and engineers, there is one prediction that is so obvious it is frequently unspoken and so foundational it is frequently untested: the prediction that two people independently executing the same procedure should get the same result. To be sure, scientists will often try to reproduce the results of a new journal article before attempting to extend it, and corporate product managers will usually budget for time-consuming comparability testing when doing tech transfer or scaling up a manufacturing process to a new site. Yet these same people will implicitly assume that Joe’s and Jane’s results are comparable when the two of them do R&D in the same building, without sufficient evidence for that assumption.
We take pride in our commitment to carefully designing and executing the positive controls for each experiment we do. However, if we are not compiling the data from weeks and months of these positive controls and looking for statistically significant differences among the researchers who ran them, we are not extracting nearly as much value from those controls as we could.
Solution
This white paper, written with support from the good folks at JMP, outlines how we can learn more from our positive controls by testing for statistically significant differences in the mean, standard deviation, or outlier frequency among two or more researchers doing experiments in an industrial or academic R&D setting.
You can download the complete white paper from JMP’s Resource Center. Although its insights apply regardless of whether you use JMP, Python or R as your data analysis platform, the detailed step-by-step guides are JMP-specific (and reference this specific JMP table). Some of those insights are reproduced below.
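For a taste of what such an analysis can look like outside of JMP, here is a minimal Python sketch of the three comparisons named above. The researcher names, sample sizes, and numbers are all invented, and this is one possible translation of the ideas rather than the white paper’s recipe; the JMP-specific steps are in the white paper itself.

```python
# A minimal sketch of the three comparisons, using Python/SciPy rather than the
# white paper's JMP workflow. Researcher names, sample sizes, and numbers are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical positive-control results: 30 runs per researcher.
controls = {
    "Joe":  rng.normal(loc=100.0, scale=3.0, size=30),
    "Jane": rng.normal(loc=102.0, scale=3.0, size=30),
    "Jim":  rng.normal(loc=100.0, scale=6.0, size=30),
}
controls["Jim"][:2] += 50.0      # plant two gross errors so there are outliers to count
groups = list(controls.values())

# (1) Differences in the mean: one-way ANOVA across researchers.
_, p_mean = stats.f_oneway(*groups)

# (2) Differences in the standard deviation: Levene's test, which tolerates
#     non-normal data better than a classical F-test of variances.
_, p_sd = stats.levene(*groups)

# (3) Differences in outlier frequency: count points more than 3 robust SDs
#     from each researcher's median, then compare the counts. (With counts this
#     small the chi-square approximation is rough; it is shown only to illustrate.)
table = []
for x in groups:
    med = np.median(x)
    robust_sd = 1.4826 * np.median(np.abs(x - med))   # MAD rescaled to mimic an SD
    n_out = int(np.sum(np.abs(x - med) > 3 * robust_sd))
    table.append([n_out, len(x) - n_out])
_, p_out, _, _ = stats.chi2_contingency(np.array(table))

print(f"means differ?         p = {p_mean:.3f}")
print(f"SDs differ?           p = {p_sd:.3f}")
print(f"outlier rates differ? p = {p_out:.3f}")
```

A robust, median-based outlier rule is used here so that a single gross error cannot hide behind the inflated standard deviation it causes; any comparable rule would serve the same purpose.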
Insights
- Whenever we difference or normalize our measurements against a control, we implicitly assume: (1) the nonrandom drivers of variation within each experiment influence each of its measurements identically; and (2) the magnitude of random variation in each measurement is small relative to the signals we are trying to measure. These implicit assumptions are too often untrue when we have not done the work to validate them explicitly (the first sketch after this list shows how quickly a noisy control can distort normalized values).
- If we update the conditions we run as positive controls from time to time, we ought to update how we use our statistical software as well. Often the hardest part of using the relevant features of that software is simply knowing where to look for them.
- If we rarely repeat the same positive controls, real shifts in Joe’s outcomes can become impossible to detect. These undetected signals can become so confounded with the factors tested by our experiments that well over 5% of our 95% confidence intervals no longer contain their true values: some parameters may appear statistically significant when they are truly zero, while others could be estimated as positive when they are in fact negative. The second sketch after this list simulates exactly this failure mode.
- Tests for outliers can be performed using control chart logic. Control charts are constructed by coupling a plot of repeated data versus time (i.e., a run chart) to an algorithm for testing each individual replicate for evidence of nonrandom variation. A run chart becomes a control chart when the outputs of this algorithm are superimposed as two red lines called control limits, each equidistant from a third line representing the output mean. Each individual point that falls outside the control limits flags evidence of nonrandom variation in the illustrated data; the final sketch after this list implements a bare-bones version of this logic.
- Most outlier tests suffer a substantial false negative rate because the presence of nonrandom variation in a data set tends to inflate whatever estimate of the normal component of variation is used to test for statistical significance. To mitigate these false negative risks, it is appropriate to iteratively filter and re-test our data for additional outliers, as the same sketch illustrates.
- Every detail of this article would remain relevant if we substituted Instrument, Day of the Week, Site or any number of other nominal factors everywhere that it currently reads Researcher. Regardless of which nuisance factor turns out to explain a surprising amount of the variation in our data tables – whether it’s a who, a what, a when or a where – too often we assume the noise in our data is normally distributed when it is not. In other words, too often we assume our research processes are sufficiently stable before we have invested the effort to make them so.
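To make a few of these points concrete, the sketches below use Python, one of the platforms mentioned above, with invented numbers throughout; they illustrate the ideas rather than reproduce the white paper’s JMP workflow. First, the normalization assumptions: the same 20% effect is normalized against a quiet control and against a noisy one.

```python
# Hypothetical illustration of assumption (2): the same 20% treatment effect is
# normalized against a control whose measurement noise is either small or large
# relative to that effect. All numbers are invented.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
treated = rng.normal(loc=120.0, scale=5.0, size=n)   # true signal: 20% above control

for control_sd in (2.0, 20.0):                       # quiet control vs. noisy control
    control = rng.normal(loc=100.0, scale=control_sd, size=n)
    ratio = treated / control                        # "percent of control" normalization
    print(f"control SD = {control_sd:4.1f} -> normalized values: "
          f"mean = {ratio.mean():.2f}, SD = {ratio.std(ddof=1):.2f}")

# With the quiet control the normalized values sit near the true ratio of 1.20
# with little scatter; with the noisy control they are both far more variable and
# biased upward, because E[T/C] > E[T]/E[C] when the denominator is noisy.
```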
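Next, the confidence-interval warning. In this invented scenario, Joe’s baseline drifts upward partway through a study, the drift goes undetected because the positive control is never re-run, and the later runs happen to coincide with the new level of a factor whose true effect is zero.

```python
# Hypothetical simulation: an undetected upward drift in Joe's baseline coincides
# with a factor change, so nominal 95% confidence intervals for a factor whose
# true effect is zero miss zero far more often than 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n_per_arm = 2000, 10
sigma, drift = 1.0, 1.0            # the drift is the same size as the run-to-run noise
t_crit = stats.t.ppf(0.975, df=2 * n_per_arm - 2)
misses = 0

for _ in range(n_sims):
    # First half of the study: factor at its old level, baseline at 0.
    old = rng.normal(0.0, sigma, size=n_per_arm)
    # Second half: factor switched to its new level (true effect = 0), but
    # Joe's baseline has silently drifted upward by `drift`.
    new = rng.normal(drift, sigma, size=n_per_arm)

    diff = new.mean() - old.mean()
    pooled_var = (old.var(ddof=1) + new.var(ddof=1)) / 2.0
    se = np.sqrt(pooled_var * 2.0 / n_per_arm)
    lo, hi = diff - t_crit * se, diff + t_crit * se
    if lo > 0.0 or hi < 0.0:       # CI excludes the true factor effect (zero)
        misses += 1

print(f"share of 95% CIs that miss the true effect of zero: {misses / n_sims:.0%}")
```

The factor looks convincingly significant even though it does nothing; the drift, not the factor, moved the results.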
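Finally, the control-chart and filter-and-re-test logic from the last two bullets, as a bare-bones individuals chart whose 3-sigma limits are estimated from the average moving range (one common convention; your software’s default estimator may differ).

```python
# Bare-bones individuals chart: estimate 3-sigma control limits from the average
# moving range, flag points outside them, then drop the flagged points and
# re-estimate so that a large outlier cannot mask a smaller one. Data are invented.
import numpy as np

def control_limits(y):
    """Center line and 3-sigma limits for an individuals chart.

    Sigma is estimated as the average moving range divided by d2 = 1.128,
    the usual constant for moving ranges of span 2.
    """
    center = y.mean()
    sigma_hat = np.abs(np.diff(y)).mean() / 1.128
    return center, center - 3 * sigma_hat, center + 3 * sigma_hat

def iterative_outlier_test(y, max_passes=5):
    """Repeatedly flag out-of-limit points and re-estimate limits without them."""
    flagged = np.zeros(len(y), dtype=bool)
    for _ in range(max_passes):
        # Note: excluding points makes some moving ranges span a gap in the
        # original run order; that simplification is acceptable for a sketch.
        center, lcl, ucl = control_limits(y[~flagged])
        newly_flagged = (~flagged) & ((y < lcl) | (y > ucl))
        if not newly_flagged.any():        # no additional outliers: stop
            break
        flagged |= newly_flagged
    return flagged, (center, lcl, ucl)

# Invented positive-control results for one researcher, with two planted
# special-cause events large enough to inflate a naive sigma estimate.
rng = np.random.default_rng(3)
y = rng.normal(loc=100.0, scale=2.0, size=40)
y[[12, 30]] += 15.0

flagged, (center, lcl, ucl) = iterative_outlier_test(y)
print(f"center line = {center:.1f}, control limits = ({lcl:.1f}, {ucl:.1f})")
print("flagged runs:", np.flatnonzero(flagged))
```

The iteration matters because a single gross error can inflate the estimated sigma enough to hide a smaller one; re-estimating the limits after each pass restores the test’s sensitivity.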