Note: To learn more and to see these graphs in action, check out the YouTube video “Dive into Diagnostics to Discover Data Discrepancies.”
The purpose of running a statistically designed experiment (DOE) is to take a strategically selected small sample of data from a larger system, and then extract a prediction equation that appropriately models the overall system. The statistical tool used to relate the independent factors to the dependent responses is analysis of variance (ANOVA). This article will lay out the key assumptions for ANOVA and how to verify them using graphical diagnostic plots.
The first assumption (and one that is often overlooked) is that the chosen model is correct. This means that the terms in the model explain the relationship between the factors and the response, and there are not too many terms (over-fitting), or too few terms (under-fitting). The adjusted R-squared and predicted R-squared values specify the amount of variation in the data that is explained by the model, and the amount of variation in predictions that is explained by the model, respectively. A lack of fit test (assuming replicates have been run) is used to assess model fit over the design space. These statistics are important but are outside the scope of this article.
The next assumptions focus on the residuals: the difference between each observed value and its predicted value from the model. If the model is correct (first assumption), then the residuals should have no “signal” or information left in them. They should look like a sample of random variables and behave as such. If the assumptions are violated, then all conclusions that come from the ANOVA table, such as p-values, and calculations like R-squared values, are wrong. The assumptions for validity of the ANOVA are that the residuals are independent, have a mean of zero, have constant variance, and follow a well-behaved (nearly normal) distribution:
Independence: since the residuals are generated based on a model (the difference between actual and predicted values) they are never completely independent. But if the DOE runs are performed in a randomized order, this reduces correlations from run to run, and independence can be nearly achieved. Restrictions on the randomization of the runs degrade the statistical validity of the ANOVA. Use a “residuals versus run order” plot to assess independence.
Mean of zero: due to the method of calculating the residuals for the ANOVA in DOE, this is given mathematically and does not have to be proven.
Constant variance: the response values will range from smaller to larger. As the response values increase, the residuals should continue to exhibit the same variance. If the variation in the residuals increases as the response increases, then this is non-constant variance. It means that you are not able to predict larger response values as precisely as smaller response values. Use a “residuals versus predicted value” graph to check for non-constant variance or other patterns.
Well-behaved (nearly normal) distribution: the residuals should be approximately normally distributed, which you can check on a normal probability plot.
A frequent misconception among researchers is that the raw response data must be normally distributed to use ANOVA. This is wrong. The normality assumption is on the residuals, not the raw data. A response transformation such as a log may be used on non-normal data to help the residuals meet the ANOVA assumptions.
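As a rough illustration of the residual checks listed above, the sketch below fits a straight line to a small set of made-up runs and computes the residuals. The data, the straight-line model, and the lag-1 autocorrelation used as a numeric stand-in for the “residuals versus run order” plot are all assumptions for illustration, not output from any Stat-Ease tool.

```python
# Minimal sketch: residuals from a straight-line least-squares fit.
# The data are hypothetical; runs are listed in the order they were performed.
from statistics import mean

x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.0, 5.0]
y = [2.1, 6.2, 3.8, 10.1, 8.3, 11.8, 4.1, 9.7]

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar

predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Mean of zero: follows from least squares whenever the model has an intercept.
residual_mean = mean(residuals)

# Independence: lag-1 autocorrelation of the residuals in run order.
# Values far from 0 hint at run-to-run correlation.
ss = sum(e ** 2 for e in residuals)
lag1 = sum(a * b for a, b in zip(residuals, residuals[1:])) / ss

# Constant variance: the pairs one would plot as "residuals versus predicted".
for p, e in sorted(zip(predicted, residuals)):
    print(f"predicted={p:6.2f}  residual={e:+.3f}")
print(f"mean residual = {residual_mean:.2e}, lag-1 autocorrelation = {lag1:+.3f}")
```

Note that the response values here are spread evenly over a wide range and are not normally distributed, yet that says nothing about the residuals, which are what the assumptions concern.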
Repeating a statement from above, if the assumptions are violated, then all conclusions that come from the ANOVA table, such as p-values, and calculations like R-squared values, are wrong, at least to some degree. Small deviations from the desired assumptions are likely to have small effects on the final predictions of the model, while large ones may have very detrimental effects. Every DOE needs to be verified with confirmation runs on the actual process to demonstrate that the results are reproducible.
Good luck with your experimentation!
The final stage of analyzing data from a designed experiment is determining the optimal set of process conditions that works for all responses. Stat-Ease software does this via a numerical optimization algorithm. This routine optimizes all responses simultaneously, based on goals set by the experimenter. This is achieved by deploying the Derringer-Suich(1) desirability criteria in conjunction with the Nelder-Mead(2) variable-sized simplex search algorithm. This optimization function balances competing response goals to find the “sweet spot” that produces the best of all worlds. Without getting deep into the mathematical weeds of these tools, I would like to provide some basic concepts and discuss how to use this method to optimize DOE results.
Starting point: minimum model requirements
Numerical optimization uses prediction models created by the analysis of each measured response. The stronger the prediction models, the more accurate the optimization results. If the analysis does not show a strong relationship between the factors and the response, then optimization will not work well. At a minimum, the model p-value should be less than 0.05, and the model should only include terms that are statistically significant plus those needed to maintain model hierarchy. If the DOE data included replicates, then there should be an insignificant lack of fit test (p-value >0.10). Key summary statistics for modeling include adjusted R-squared and predicted R-squared. Higher is better for each of these, meaning that more variation in the data and in the predictions is explained by the model. There is not a particular “cut-off” for these values but models that explain more than 50% of the variation are going to perform better than those that do not. In summary, start optimization with response models that explain the data and produce reliable predictions.
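To make the summary statistics concrete, here is a minimal sketch that computes R-squared, adjusted R-squared, and predicted R-squared for a straight-line fit. It uses the standard textbook definitions (predicted R-squared from the PRESS statistic, obtained with leverages rather than by refitting); the data are hypothetical and the software's own output may include refinements not shown here.

```python
# Minimal sketch: R-squared statistics for a straight-line fit (hypothetical data).
from statistics import mean

x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.0, 5.0]
y = [2.1, 6.2, 3.8, 10.1, 8.3, 11.8, 4.1, 9.7]
n = len(x)
p = 1  # number of model terms, not counting the intercept

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_resid = sum(e ** 2 for e in residuals)

# PRESS: sum of squared leave-one-out prediction errors, e_i / (1 - h_i),
# where h_i is the leverage of run i.
leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
press = sum((e / (1 - h)) ** 2 for e, h in zip(residuals, leverages))

r_squared = 1 - ss_resid / ss_total
adjusted_r_squared = 1 - (ss_resid / (n - p - 1)) / (ss_total / (n - 1))
predicted_r_squared = 1 - press / ss_total

print(f"R-squared           = {r_squared:.4f}")
print(f"Adjusted R-squared  = {adjusted_r_squared:.4f}")
print(f"Predicted R-squared = {predicted_r_squared:.4f}")
```

Adjusted R-squared penalizes extra model terms, while predicted R-squared reflects how well each run is predicted when it is left out of the fit, so neither can exceed the raw R-squared.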
Desirability at a specific point
Numerical optimization is driven by a mathematical calculation called desirability. Points within the design space are evaluated via the desirability function that is defined by the user-specified goals for each response. The overall (multi-response) desirability (D) is the geometric mean of the individual desirabilities (d_i) of the responses: for n responses, D = (d_1 × d_2 × ... × d_n)^(1/n).
Figure 1: Desirability function
An individual desirability “little d” (range of 0 to 1) is defined by how closely the evaluated point meets the response goal. Typical response goals are maximize, minimize or target a specific value. In addition to the goal, upper and lower “acceptable” limits on the response values must be set.
Illustration: The experimenters study a process that has 3 input factors and 2 output responses. In this example, the first response (% Conversion) has an observed range of 51-97 percent. The goal for conversion is maximize. Considering business expectations, the minimum acceptable conversion is determined to be 80%, so that is defined as the lower limit. The upper limit is set to the theoretical maximum of 100%. These limits, along with the goal, define the desirability function for the conversion response. When evaluating a particular point in the design space, if the predicted conversion is less than 80% (defined lower limit), desirability = 0. If conversion is 80-100%, desirability equals the proportion of the way towards the upper limit (100). Therefore, a conversion of 90 gives d=0.5 and a conversion of 95 gives d=0.75. Any point that gives % conversion at 100% or higher will result in d=1.
Figure 2: Response 1 goal: Maximize with an acceptable range of 80-100%.
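The maximize goal described for the conversion response can be sketched as a small function. This is a minimal illustration of the linear ramp described above; the function name is made up for this example and is not part of any Stat-Ease interface.

```python
# Minimal sketch of an individual desirability for a "maximize" goal.
# The function name is hypothetical; the ramp is linear between the limits.
def desirability_maximize(value, lower, upper):
    if value <= lower:
        return 0.0  # at or below the lower limit: unacceptable
    if value >= upper:
        return 1.0  # at or above the upper limit: fully desirable
    return (value - lower) / (upper - lower)

# Conversion example: lower limit 80%, upper limit 100%.
for conversion in (75, 80, 90, 95, 100):
    d = desirability_maximize(conversion, lower=80, upper=100)
    print(f"conversion={conversion:3d}%  d={d:.2f}")
```

Running it reproduces the values quoted in the illustration: d=0.5 at 90% conversion and d=0.75 at 95%.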
Response 2 is Activity and the goal is a Target of 63 with a range of 60-66 (Figure 3). Desirability will be 1 only at the exact value of 63. Evaluated points that result in activity levels between 60 and 63, or between 63 and 66, are rated with desirability values that fall off in proportion to the distance from the target, reaching 0 at the limits. Activity levels that are either below 60 or above 66 are assigned a desirability of 0.
Figure 3: Response 2 goal: Target 63, with acceptable range 60-66.
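A target goal, and the way individual desirabilities combine into the overall D, can be sketched in the same spirit. The function names are hypothetical, the ramps are assumed to be linear, and all responses are assumed to carry equal importance; the evaluated point is made up for illustration.

```python
# Minimal sketch: "target" desirability and the overall geometric mean.
# Function names are hypothetical; equal importance for all responses is assumed.
import math

def desirability_target(value, lower, target, upper):
    if value <= lower or value >= upper:
        return 0.0  # outside the acceptable range
    if value <= target:
        return (value - lower) / (target - lower)  # ramp up to the target
    return (upper - value) / (upper - target)      # ramp down past the target

def overall_desirability(individual):
    # Geometric mean: a single d of 0 forces the overall D to 0.
    return math.prod(individual) ** (1 / len(individual))

# Activity example: target 63, acceptable range 60-66.
for activity in (59, 61.5, 63, 64.5, 66):
    d = desirability_target(activity, lower=60, target=63, upper=66)
    print(f"activity={activity:5.1f}  d={d:.2f}")

# A made-up point where conversion gives d=0.5 and activity is 64.5.
d_activity = desirability_target(64.5, 60, 63, 66)
print(f"overall D = {overall_desirability([0.5, d_activity]):.3f}")
```

Because D is a geometric mean, a point that fails any one response (d=0) is rejected outright, no matter how well it does on the others.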
The optimization algorithm at work
Once the goals and limits for each response are defined, the search algorithm can start. Stat-Ease software begins with a set of starting points (locations in the design space). For a single starting point, overall desirability (D) is calculated. Then the simplex search starts evaluating desirability (D) in the nearby area and takes “steps” that increase desirability. Steps are taken across the design space until desirability is maximized. All the starting points follow this process, resulting in a set of final “solutions” which are process conditions that at least minimally meet the requirements for all responses (individual desirability is greater than 0).
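The multi-start idea can be sketched in a few lines. To keep the example short, a simple coordinate pattern search stands in for the Nelder-Mead simplex, so this is not the algorithm the software uses. The two prediction models, in coded factor units from -1 to +1, are hypothetical, as are the step sizes and the grid of starting points.

```python
# Minimal sketch of a multi-start desirability search.
# NOTE: a simple coordinate pattern search stands in for Nelder-Mead here,
# and the two prediction models below are hypothetical.
import math
from itertools import product

def predict(x1, x2):
    conversion = 85 + 8 * x1 + 4 * x2  # hypothetical model, coded units
    activity = 63 + 3 * x1 - 2 * x2    # hypothetical model, coded units
    return conversion, activity

def overall_desirability(point):
    conversion, activity = predict(*point)
    d1 = min(max((conversion - 80) / 20, 0.0), 1.0)  # maximize, limits 80-100
    if activity <= 63:                               # target 63, limits 60-66
        d2 = max((activity - 60) / 3, 0.0)
    else:
        d2 = max((66 - activity) / 3, 0.0)
    return math.sqrt(d1 * d2)                        # geometric mean of two

def local_search(start, step=0.5, min_step=1e-3):
    point, best = tuple(start), overall_desirability(start)
    while step > min_step:
        improved = False
        for axis in range(len(point)):
            for delta in (-step, step):
                trial = list(point)
                # Stay inside the design space (coded range -1 to +1).
                trial[axis] = min(max(trial[axis] + delta, -1.0), 1.0)
                d = overall_desirability(trial)
                if d > best:
                    point, best, improved = tuple(trial), d, True
        if not improved:
            step /= 2  # no uphill step found: search closer to the point
    return point, best

starts = list(product((-1.0, 0.0, 1.0), repeat=2))
solutions = sorted((local_search(s) for s in starts),
                   key=lambda s: s[1], reverse=True)
solutions = [s for s in solutions if s[1] > 0]  # keep points meeting all goals

for point, d in solutions:
    print(f"x1={point[0]:+.3f}  x2={point[1]:+.3f}  D={d:.3f}")
```

Starting points that land where D is 0 everywhere nearby have no uphill direction to follow, which is one reason a search of this kind is launched from many locations rather than one.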
If the process is easy to optimize (the responses don’t compete with each other too much), there may be a large robust space that meets the response goals. In this case a very large number of solutions (process conditions) may be found. These solutions are sorted by the desirability value. Common practice is to focus on the top solution(s). Remember, however, that all the solutions meet the goals set by the experimenter. Optimization does not mean there is a single set of conditions that is best. If the area is very large (many solutions found), then tightening the upper or lower limits may be merited. There may also be other external criteria to consider, such as cost of the solution, manufacturability, ease of implementation, etc. The experimenter should review all the solutions presented and consider which ones make sense from a business perspective.
Figure 4 shows the optimal conditions for the illustration. The red dots show the location of the optimal settings for the factors, within their range. In this case time is set mid-way in the range (47 min), while temperature is maximized at 90 degrees and catalyst is approximately 2.7%. These process conditions are predicted to result in a conversion of 91% and activity level of 63. Confirmation runs should be completed to verify these results.
Figure 4: Numerical solution “ramps view” for illustration
A side note: Desirability is only a mathematical evaluation tool to compare solutions. Although it ranges from 0 to 1, it is a relative measure within a set of solutions, and not a statistic that needs to be as high as possible. Within a specific DOE, higher desirability means that the solution (set of conditions) met the stated goals more closely than a solution with lower desirability.
The success of numerical optimization starts with strong prediction models from the DOE analysis. Once models are established, the experimenter specifies each response goal, as well as upper and lower limits around that goal. The numerical search algorithm evaluates areas within the design space, searching for areas that simultaneously meet the goals for all the responses. This optimization function balances competing response goals to find the “sweet spot” that produces the best of all worlds.
A central composite design (CCD) is a type of response surface design that will give you very good predictions in the middle of the design space. Many people ask how many center points (CPs) they need to put into a CCD. The number of CPs chosen (typically 5 or 6) influences how the design functions.
Two things need to be considered when choosing the number of CPs in a central composite design:
1) Replicated center points are used to estimate pure error for the lack of fit test. Lack of fit indicates how well the model you have chosen fits the data. With fewer than five or six replicates, the lack of fit test has very low power. You can compare the critical F-values (with a 5% risk level) for a three-factor CCD with 6 center points, versus a design with 3 center points. The 6 center point design will require a critical F-value for lack of fit of 5.05, while the 3 center point design uses a critical F-value of 19.30. This means that the design with only 3 center points is less likely to show a significant lack of fit, even if it is there, making the test almost meaningless.
TIP: True “replicates” are runs that are performed at random intervals during the experiment. It is very important that they capture the true normal process variation! Do not run all the center points grouped together, because their variation will then most likely underestimate the real process variation.
2) The default number of center points provides near uniform precision designs. This means that the prediction error inside a sphere that has a radius equal to the ±1 levels is nearly uniform. Thus, your predictions in this region (±1) are equally good. Too few center points inflate the error in the region you are most interested in. This effect (a “bump” in the middle of the graph) can be seen by viewing the standard error plot, as shown in Figures 1 & 2 below. (To see this graph, click on Design Evaluation, Graph and then View, 3D Surface after setting up a design.)
Figure 1 (left): CCD with 6 center points (5-6 recommended). Figure 2 (right): CCD with only 3 center points. Notice the jump in standard error at the center of Figure 2.
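The critical F-values quoted in the first consideration depend on the degrees of freedom available for lack of fit and pure error. The sketch below works through that arithmetic, assuming a full factorial core, a full quadratic model, and replication only at the center point; the function name is made up for this example.

```python
# Minimal sketch: degrees of freedom behind the lack of fit test in a CCD.
# Assumes a full factorial core, a full quadratic model, and replicates
# only at the center point. The function name is hypothetical.
def ccd_lack_of_fit_df(factors, center_points):
    factorial_runs = 2 ** factors
    axial_runs = 2 * factors
    runs = factorial_runs + axial_runs + center_points
    # Intercept + linear + two-factor interaction + squared terms.
    model_terms = 1 + factors + factors * (factors - 1) // 2 + factors
    residual_df = runs - model_terms
    pure_error_df = center_points - 1  # from the replicated center points
    lack_of_fit_df = residual_df - pure_error_df
    return runs, lack_of_fit_df, pure_error_df

for cps in (6, 3):
    runs, lof_df, pe_df = ccd_lack_of_fit_df(factors=3, center_points=cps)
    print(f"{cps} center points: {runs} runs, "
          f"lack of fit df = {lof_df}, pure error df = {pe_df}")
```

With 6 center points the test compares against an F distribution with 5 and 5 degrees of freedom (critical value 5.05 at the 5% level); with 3 center points the pure error has only 2 degrees of freedom, which is what pushes the critical value up to 19.30.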
Ask yourself this—where do you want the best predictions? Most likely at the middle of the design space. Reducing the number of center points away from the default will substantially damage the prediction capability here! Although it can seem tedious to run all of these replicates, the number of center points does ensure that the analysis of the design can be done well, and that the design is statistically sound.
There are a couple of features in the latest release of the Design-Expert and Stat-Ease 360 software programs (version 22.0) that I really love and want to draw your attention to. These features are accessible to everyone, whether you are a novice or an expert in design of experiments.
First, the Analysis Summary in the Post Analysis section: This provides a quick view of all response analyses in a set of tables, making it easy to compare model terms, statistics such as R-squared values, equations and more. We are pleased to now have this feature that has been requested many times! When you have a large number of responses, understanding the similarities and differences between the models may lead to additional insights into your product or process.
Second, the Custom Graphs (previously Graph Columns): Functionality and flexibility have been greatly expanded so that you can now plot analysis or diagnostic values, as well as design column information. Customize the colors, shapes and sizes of the points to tell your story in the way that makes sense to your audience.
Figure 1 (left) shows the layout of points in a central composite design, where the points are colored by their space point type (factorial, axial or center points) and then sized by the response value. We can visualize where in the design space the responses are smaller versus larger.
In Figure 2 (right), I had a set of existing runs that I wanted to visualize in the design space. Then I augmented the design with new runs. I set the Color By option to Block to clearly see the new (green) runs that were added to the design space.
These new features offer many new ways to visualize your design, response data, and other pieces of the analysis. What stories will you tell?
I am often asked if the results from one-factor-at-a-time (OFAT) studies can be used as a basis for a designed experiment. They can! This augmentation starts by picturing how the current data is laid out, and then adding runs to fill out either a factorial or response surface design space.
One way of testing multiple factors is to choose a starting point and then change the factor level in the direction of interest (Figure 1 – green dots). This is often done one variable at a time “to keep things simple”. This data can confirm an improvement in the response when any of the factors are changed individually. However, it does not tell you if making changes to multiple factors at the same time will improve the response due to synergistic interactions. With today’s complex processes, the one-factor-at-a-time experiment is likely to provide insufficient information.
The experimenter can augment the existing data by extending a factorial box/cube from the OFAT runs and completing the design by running the corner combinations of the factor levels (Figure 2 – blue dots). When analyzing this data together, the interactions become clear, and the design space is more fully explored.
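The bookkeeping for this kind of augmentation can be sketched as follows. The example assumes three factors in coded units, with the OFAT starting point at the low level of every factor and each OFAT run moving exactly one factor to its high level; the function names are hypothetical.

```python
# Minimal sketch: which factorial corners are missing after an OFAT study.
# Assumes coded levels (-1 = starting level, +1 = changed level) and that each
# OFAT run changed exactly one factor. Function names are hypothetical.
from itertools import product

def ofat_runs(k):
    start = (-1,) * k
    runs = [start]
    for i in range(k):
        run = list(start)
        run[i] = 1  # change one factor at a time
        runs.append(tuple(run))
    return runs

def missing_corners(existing, k):
    done = set(existing)
    return [corner for corner in product((-1, 1), repeat=k)
            if corner not in done]

existing = ofat_runs(3)
to_add = missing_corners(existing, 3)
print("existing OFAT runs:", existing)
print("corner runs to add:", to_add)
```

For three factors the OFAT study supplies 4 of the 8 corners, so 4 more runs complete the cube; every added run changes two or more factors at once, which is exactly the information needed to estimate interactions.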
In other cases, OFAT studies may be done by taking a standard process condition as a starting point and then testing factors at new levels both lower and higher than the standard condition (see Figure 3). This data can estimate linear and nonlinear effects of changing each factor individually. Again, it cannot estimate any interactions between the factors. This means that if the process optimum is anywhere other than exactly on the lines, it cannot be predicted. Data that more fully covers the design space is required.
A face-centered central composite design (CCD)—a response surface method (RSM)—has factorial (corner) points that define the region of interest (see Figure 4 – added blue dots). These points are used to estimate the linear and the interaction effects for the factors. The center point and mid points of the edges are used to estimate nonlinear (squared) terms.
If an experimenter has completed the OFAT portion of the design, they can augment the existing data by adding the corner points and then analyzing as a full response surface design. This set of data can now estimate up to the full quadratic polynomial. There will likely be extra points from the original OFAT runs, which although not needed for model estimation, do help reduce the standard error of the predictions.
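For two factors, the same bookkeeping for a face-centered CCD looks like this. The sketch assumes the OFAT study was run at exactly the coded levels -1, 0, and +1 for each factor around a standard condition at the center; real OFAT levels may need rescaling before they line up this neatly.

```python
# Minimal sketch: augmenting a two-factor OFAT study to a face-centered CCD.
# Assumes the OFAT runs sit exactly at coded levels -1, 0, +1 around the center.
from itertools import product

k = 2
center = (0,) * k

# OFAT runs: each factor moved low and high while the other stays at center.
axial = []
for i in range(k):
    for level in (-1, 1):
        point = [0] * k
        point[i] = level
        axial.append(tuple(point))

ofat_existing = [center] + axial

# Corner (factorial) points complete the face-centered design.
corners = list(product((-1, 1), repeat=k))
to_add = [p for p in corners if p not in ofat_existing]
design = ofat_existing + to_add

print("existing OFAT runs:", ofat_existing)
print("corner runs to add:", to_add)
print("total distinct design points:", len(set(design)))
```

The 9 distinct points are more than the 6 coefficients of a two-factor quadratic model require, leaving degrees of freedom for the residuals.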
Running a statistically designed experiment from the start will reduce the overall experimental resources. But it is good to recognize that existing data can be augmented to gain valuable insights!
Learn more about design augmentation at the January webinar: The Art of Augmentation – Adding Runs to Existing Designs.