As a chemical engineer with roots as an R&D process developer, I find the appeal of design of experiments (DOE) in its ability to handle multiple factors simultaneously. Traditional scientific methods restrict experimenters to one factor at a time (OFAT), which is inefficient and does not reveal interactions. However, a simple comparative OFAT often suffices for a process improvement. If this is all that’s needed, you may as well do it right statistically. As industrial-statistical guru George Box reportedly said, “DOE is a wonderful comparison machine.”

A fellow named William Sealy Gosset developed the statistical tools for simple comparative experiments (SCE) in the early 1900s. As Head Experimental Brewer for Guinness in Dublin, he evaluated hops from various regions for soft resin content, a critical ingredient for optimizing the bitterness and preserving their beer.[1] To compare the results from one source versus another with statistical rigor, Gosset invented the t-test, a great tool for DOE even today (and far easier to do with modern software!).

The t-test simply compares two means relative to the standard deviation of the difference. The result can be easily interpreted with a modicum of knowledge about normal distributions: As t increases beyond 2 standard deviations, the difference becomes more and more significant. Gosset’s breakthrough was adjusting the distribution for small sample sizes, which makes the tails of the bell-shaped curve slightly fatter and the head somewhat lower, as shown in Figure 1. The correction, in this case for a test comparing a sample of 4 results at one level versus 4 at the other, is minor but very important for getting the statistics right.

Figure 1. Normal curve versus t-distribution (probabilities plotted by standard deviations from zero)
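The shape difference described above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `normal_pdf` and `t_pdf` are illustrative, and the degrees of freedom (4 + 4 − 2 = 6) assume the 4-versus-4 comparison mentioned in the text.

```python
import math


def normal_pdf(x):
    """Density of the standard normal distribution at x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom at x."""
    # Work with log-gamma so the coefficient stays stable for large df.
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1.0 + x * x / df) ** (-(df + 1) / 2)


# Assumption: 4 results at each of two levels gives 4 + 4 - 2 = 6 degrees of freedom.
df = 4 + 4 - 2

for x in (0.0, 1.0, 2.0, 3.0):
    print(f"x = {x:.0f}: normal = {normal_pdf(x):.4f}, t(df={df}) = {t_pdf(x, df):.4f}")
```

At zero the t density sits below the normal density (the lower head), while at 3 standard deviations it sits above (the fatter tail).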

To illustrate a simple comparative DOE, consider a case study on the filling of 16-ounce plastic bottles with two production machines, line 1 and line 2.[2] The packaging engineers must assess whether they differ. To make this determination, they set up an experiment to randomly select 10 bottles from each machine. Stat-Ease software makes this easy via its Factorial, Randomized, Multilevel Categorical design option, as shown by the screenshot in Figure 2.

Figure 2. Setting up a simple comparative DOE in Stat-Ease software

The resulting volumes in ounces are shown below (mean outcome shown in parentheses).

- Line 1: 16.03, 16.04, 16.05, 16.05, 16.02, 16.01, 15.96, 15.98, 16.02, 15.99 (16.02)
- Line 2: 16.02, 15.97, 15.96, 16.01, 15.99, 16.03, 16.04, 16.02, 16.01, 16.00 (16.01)

Stat-Ease software translates the mean difference between the two machines (0.01 ounce) into a t value of 0.7989, that is, less than one standard deviation apart, which produces a p-value of 0.4347, far above the generally acceptable standard of p<0.05 for significance. Its Model Graph in Figure 3 displays all the raw data, the means of each level and their least significant difference (LSD) bars based on a t-test at p of 0.05. Notice how the bars overlap from left to right: clearly the difference is not significant.

Figure 3. Graph showing effect on fill from one machine line to the other
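For readers who want to reproduce the calculation by hand, here is a rough sketch in standard-library Python. It assumes a pooled-variance, two-tailed, two-sample t-test (an assumption about the method, not a statement of how Stat-Ease software computes it internally) and obtains the p-value by numerically integrating the t density with Simpson's rule. The helper names `t_pdf` and `t_cdf` are illustrative.

```python
import math
from statistics import mean, variance

# Fill volumes in ounces from the case study above.
line1 = [16.03, 16.04, 16.05, 16.05, 16.02, 16.01, 15.96, 15.98, 16.02, 15.99]
line2 = [16.02, 15.97, 15.96, 16.01, 15.99, 16.03, 16.04, 16.02, 16.01, 16.00]


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_coef) / math.sqrt(df * math.pi) * (1.0 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """Cumulative probability via Simpson's rule on [0, |x|] (steps must be even)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3.0
    return 0.5 + area if x >= 0 else 0.5 - area


n1, n2 = len(line1), len(line2)
df = n1 + n2 - 2

# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n1 - 1) * variance(line1) + (n2 - 1) * variance(line2)) / df

# Standard deviation (standard error) of the difference between the two means.
std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_value = (mean(line1) - mean(line2)) / std_err
p_value = 2.0 * (1.0 - t_cdf(abs(t_value), df))  # two-tailed

print(f"difference = {mean(line1) - mean(line2):.3f} oz")
print(f"t = {t_value:.4f} on {df} degrees of freedom, p = {p_value:.4f}")
```

Under these assumptions the t value and p-value should come out close to those quoted above.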

Thus, from the stats and a first glance at the effect graph, it seems that the packaging engineers need not worry about any differences between the two machine lines. But hold on before jumping to a final conclusion: What if a difference of 0.01 ounce adds up to a big expense over a long period of time? The managers overseeing the annual profit and loss for the filling operation would then be greatly concerned. Before doing any designed experiment, it pays to do a power calculation to work out how many runs are needed to see a minimal difference (signal ‘delta’) of importance relative to the variation (noise ‘sigma’). In this case, a sample size of 10 per line for a delta of 0.01 ounce with a sigma (standard deviation) of 0.028 ounces (provided by Stat-Ease software) generates a power of only 11.8%, far short of the generally acceptable level of 80%. Further calculations reveal that if a difference this small really needed to be detected, they should fill 125 or more bottles on each line.
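The power calculation can be sketched in standard-library Python as follows. This is one possible approach, not necessarily the algorithm Stat-Ease software uses: it assumes a two-tailed pooled t-test at alpha = 0.05 and evaluates the noncentral t probability by numerical integration over the chi-square distribution of the variance estimate. All function names (`power`, `runs_needed`, and the helpers) are hypothetical.

```python
import math


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def t_pdf(x, df):
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_coef) / math.sqrt(df * math.pi) * (1.0 + x * x / df) ** (-(df + 1) / 2)


def chi2_pdf(v, df):
    if v <= 0.0:
        return 0.5 if df == 2 else 0.0  # limiting value at zero for df >= 2
    log_pdf = (df / 2 - 1) * math.log(v) - v / 2 - (df / 2) * math.log(2.0) - math.lgamma(df / 2)
    return math.exp(log_pdf)


def simpson(f, a, b, steps):
    """Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def t_crit(df, alpha=0.05):
    """Two-tailed critical t value, found by bisection on the integrated density."""
    lo, hi = 0.0, 20.0
    target = 0.5 - alpha / 2.0  # area between zero and the critical value
    for _ in range(40):
        mid = (lo + hi) / 2.0
        if simpson(lambda x: t_pdf(x, df), 0.0, mid, 200) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def power(n, delta, sigma, alpha=0.05):
    """Power of a two-tailed pooled t-test with n runs at each of two levels."""
    df = 2 * n - 2
    ncp = delta / (sigma * math.sqrt(2.0 / n))  # signal-to-noise of the mean difference
    c = t_crit(df, alpha)

    def integrand(v):
        s = math.sqrt(v / df)  # ratio of estimated to true standard deviation
        reject = 1.0 - norm_cdf(c * s - ncp) + norm_cdf(-c * s - ncp)
        return reject * chi2_pdf(v, df)

    upper = df + 12.0 * math.sqrt(2.0 * df)  # far enough into the chi-square tail
    return simpson(integrand, 0.0, upper, 2000)


def runs_needed(delta, sigma, target=0.80, alpha=0.05, n_max=1000):
    """Smallest n per level reaching the target power (bisection; power rises with n)."""
    if power(n_max, delta, sigma, alpha) < target:
        return None
    lo, hi = 2, n_max
    while lo < hi:
        mid = (lo + hi) // 2
        if power(mid, delta, sigma, alpha) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


print(f"power at n = 10 per line: {power(10, 0.01, 0.028):.3f}")
print(f"runs per line for 80% power: {runs_needed(0.01, 0.028)}")
```

With delta = 0.01 and sigma = 0.028, the results should land near the 11.8% power and roughly 125 bottles per line reported above. Rounding of sigma or numerical tolerance may shift the required count by a bottle or so.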

In conclusion, it turns out that simple comparative DOEs are not all that simple to do correctly from a statistical perspective. Some keys to getting these two-level OFAT experiments done right are:

- Randomizing the run order (a DOE fundamental for washing out the impact of time-related lurking factors such as steadily increasing temperature or humidity).
- Performing at least 4 runs at each level—more if needed to achieve adequate power (always calculate this before pressing ahead!).
- Blocking out known sources of variation via a paired t-test,[3] e.g., when assessing two runners, rather than them each running a number of time trials one after the other, race them together side-by-side, thus eliminating the impact of changing wind and other environmental conditions.
- Always deploying a non-directional two-tailed t-test[4] (a fun alliteration!), as done by default in Stat-Ease software. The option for a one-tailed t-test requires an assumption that one level of the tested factor will certainly be superior to the other (i.e., directional), which may produce false-positive significance; before going this route, consult with our StatHelp consulting team.
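As a small illustration of the first two keys in the list above, the sketch below builds a randomized run order for a two-level comparison with 4 runs per level. The function name `randomized_run_order`, the level labels, and the seed are all hypothetical choices for this example; DOE software handles this step automatically.

```python
import random


def randomized_run_order(levels, runs_per_level, seed=None):
    """Return a shuffled list of factor levels, one entry per run."""
    rng = random.Random(seed)  # a fixed seed makes the plan reproducible
    runs = [level for level in levels for _ in range(runs_per_level)]
    rng.shuffle(runs)
    return runs


# Assumption: two levels with the minimum of 4 runs each, as suggested above.
plan = randomized_run_order(["line 1", "line 2"], runs_per_level=4, seed=2024)

for run_number, level in enumerate(plan, start=1):
    print(f"run {run_number}: {level}")
```

Shuffling the complete list, rather than alternating levels, keeps any steady drift over time from lining up with one level of the factor.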

1. For more background on Gosset and his work for Guinness, see my 8/9/24 StatsMadeEasy blog on “The secret sauce in Guinness beer?”
2. From Chapter 2, “Simple Comparative Experiments”, problem 2.24, *Design and Analysis of Experiments, 8th Edition*, Douglas C. Montgomery, John Wiley and Sons, New York, NY, 2013.
3. “Letter to a Young Statistician: On ‘Student’ and the Lanarkshire Milk Experiment”, *Chance Magazine*: Volume 37, No. 1, Stephen T. Ziliak.
4. Wikipedia, One- and two-tailed tests.

- One Factor tutorials in Program Help.
- Stat-Ease Academy eLearning PreDOE course, which includes a t-statistic software tutorial.