Note

Screenshots may differ slightly depending on software version.

In this tutorial you will see how the regression tool in Stat-Ease®
software, intended for response surface methods (RSM), is applied to historical
data. We don’t recommend you work with such happenstance variables if there’s any
possibility of performing a designed experiment. However, if you must, take
advantage of how easy Stat-Ease makes it to develop predictive models and
graph responses, as you will see by doing this tutorial. It is assumed that at
this stage you’ve mastered many program features by completing preceding tutorials.
At the very least you ought to first do the
one-factor RSM tutorials, both part 1 and 2, prior to starting
this one.

The historical data for this tutorial, shown below, comes from the U.S. Bureau
of Labor Statistics via James Longley (An Appraisal of Least Squares Programs for
the Electronic Computer from the Point of View of the User, *Journal of the
American Statistical Association*, 62 (1967): 819-841). As discussed in *RSM
Simplified, 2nd ed.* (Mark J. Anderson and Patrick J. Whitcomb, Productivity Press,
New York, 2016: Chapter 2), it presents some interesting challenges for regression
modeling.

Assume the objective for analyzing this data is to predict future employment as a function of leading economic indicators – factors labeled A through F in the table above. Longley’s goal was different: He wanted to test regression software circa 1967 for round-off error due to highly correlated inputs. Will Stat-Ease be up to the challenge? We will see!

Let’s begin by setting up this “experiment” (quotes added to emphasize this is not really an experiment, but rather an after-the-fact analysis of happenstance data).

To save you typing time, we will re-build a previously saved design rather than
entering it from scratch. Click on the **Help, Tutorial Data** menu and select
**Employment**.

To re-build this design (and thus see how it was created), press the blank-sheet icon at the left of the toolbar.

Click **Yes** when the program queries “Use previous design info?”

Click **No** when asked to save changes. Now you can see how this design was
created via the **Custom Designs** tab and **Blank Spreadsheet** option.

Before moving ahead, you must set how many rows of data you want to type or copy/paste into the design layout. In this case there are 16 rows.

Press **Next** to view the factors. Note for each of the 6 numeric factors we
entered name, units, and range from minimum (“Min”) to maximum (“Max”). Press
**Next** to accept all entries on your screen.

You now see response details – in this case only one response.

Press **Finish** to see the resulting design layout in run order.

Note

Consider taking a shortcut by simply pressing the **Import Data**
button on the program’s opening screen (this option also appears under
Custom Designs). Then follow the on-screen instructions to bring in your
preexisting results straightaway—no need to first set up a blank
spreadsheet: Much easier!

You could now type in all data for factor levels and resulting responses, row-by-row. (Don’t worry: we won’t make you do this!) However, in most cases data is already available via a spreadsheet. If so, simply click/drag these data, or copy to the clipboard, then Edit, Paste (or right-click and Paste as shown below) into the design layout. (Be sure, as shown below, to first select the top row of all your destination cells.)

If you simply click the upper left cell in the empty run sheet, the program only pastes one value.

Normally you’d save your work at this stage, but because we already did this,
simply re-open our file: Click on the **Help, Tutorial Data** menu and select
**Employment**. Click **No** to pass up the opportunity to save what you did
previously.

Before we get started, be forewarned you will encounter many statistics related
to least squares regression and analysis of variance (ANOVA). If you are coming
into this without previous knowledge, pick up a copy of *RSM Simplified* and keep
it handy. For a good guided tour of statistics for RSM analysis, attend our
Stat-Ease workshop titled Modern DOE for Process Optimization.

Under the **Analysis** branch, click the **Employment** node. The program
displays a screen for transforming response. However, as noted by the program,
the response range in this case is so small that there is little advantage to
applying any transformation.

Press the **Start Analysis** button to bring up the **Fit Summary**.
The program evaluates each degree of the model from the mean on up. In this
case, the best that can be done is linear. Anything higher is aliased.

Move on by pressing **Model**.

It’s all set up how the program suggested. Notice many two-factor interactions
can’t be estimated due to aliasing, symbolized by a yellow triangle with an
exclamation point. Hold on to your hats (because this upcoming data
is really a lot of hot air!) and press **ANOVA** (analysis of variance).

Notice although the overall model is significant, some terms are not.

Note

**Some statistical details on how Stat-Ease does analysis of
variance**: You may have noticed this ANOVA is labeled “Sum of squares is
**Type III - Partial**”. This approach to ANOVA, done by default, causes total
sums-of-squares (SS) for the terms to come up short of the overall model when
analyzing data from a nonorthogonal array, such as historical data. If you want
SS terms to add up to the model SS, go to Edit, Preferences for Analysis and
change the default to Sequential (Type I) for these numeric factors. However,
we do not recommend this approach because it favors the first term put into the
model. For example, in this case, ANOVA by partial SS (Type III, the program
default) for the response (employment total) calculates prob>F p-value for A as
0.8631 (F=0.031) as seen above, which is not significant. Recalculating ANOVA
by sequential sum of squares (Type I) changes the p to <0.0001 (F=1876), which
looks highly significant, but only because this term (main effect of factor A)
is fit first. This simply is not correct.
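
The gap between sequential and partial sums of squares is easy to reproduce outside the program. The following is a minimal Python sketch, not Stat-Ease’s own computation. It fits least squares in exact fractions on a small made-up data set (the values of `a`, `b`, and `y` are hypothetical, not the Longley data) in which `a` and `b` are nearly collinear. The helper names `solve` and `residual_ss` are illustrative only.

```python
from fractions import Fraction

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]

def residual_ss(columns, y):
    """Residual sum of squares for y regressed on an intercept plus columns."""
    rows = [[Fraction(1)] + [Fraction(col[i]) for col in columns]
            for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(c * v for c, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

# Hypothetical data: b is nearly 2 * a, so the two inputs are highly correlated.
a = [1, 2, 3, 4, 5, 6]
b = [2, 4, 6, 8, 10, 13]
y = [3, 4, 7, 8, 11, 13]

sse_mean = residual_ss([], y)      # intercept only
sse_a = residual_ss([a], y)
sse_b = residual_ss([b], y)
sse_ab = residual_ss([a, b], y)

model_ss = sse_mean - sse_ab
seq_a = sse_mean - sse_a           # Type I: a is fit first
seq_b = sse_a - sse_ab             # Type I: b is fit after a
part_a = sse_b - sse_ab            # Type III: a adjusted for b
part_b = sse_a - sse_ab            # Type III: b adjusted for a

print("model SS      :", float(model_ss))
print("sequential A,B:", float(seq_a), float(seq_b))
print("partial A,B   :", float(part_a), float(part_b))
```

In this sketch the sequential terms add up to the model SS exactly, while the partial terms fall far short of it, and term `a` looks enormous only when it is entered first.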

Assuming Factor A (prices) is least significant of all as indicated by default
ANOVA (partial SS), let’s see what happens with it removed. However, before we do,
move to the **Fit Statistics** pane (shown below) to help us compare what happens
before and after reducing the model.

Also look at the **Coefficients** estimates.

Notice the huge VIF (variance inflation factor) values. A value of 1 is ideal (orthogonal), but a VIF below 10 is generally accepted. A VIF above 1000, such as that for factor B (GNP), indicates severe multicollinearity in the model coefficients (that’s bad!). In the follow-up tutorial (Part 2) based on this same Longley data, we delve more into this and other statistics generated by Stat-Ease for purposes of design evaluation. For now, right-click any VIF result to access context-sensitive Help, or go to Help on the main menu and search on this statistic. You will find some details there.
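
For readers who want to see where a VIF comes from, here is a minimal Python sketch based on the textbook definition VIF = 1 / (1 - R²), where R² comes from regressing one input on all the others. This is an illustration under stated assumptions, not the program’s implementation: the data values are made up (not the Longley data), and the helper names `solve`, `residual_ss`, and `vif` are hypothetical.

```python
from fractions import Fraction

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]

def residual_ss(columns, y):
    """Residual sum of squares for y regressed on an intercept plus columns."""
    rows = [[Fraction(1)] + [Fraction(col[i]) for col in columns]
            for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(c * v for c, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def vif(columns, j):
    """VIF of input j: total SS of that input over its residual SS
    after regressing it on the other inputs, which equals 1 / (1 - R^2)."""
    target = columns[j]
    others = columns[:j] + columns[j + 1:]
    return residual_ss([], target) / residual_ss(others, target)

# Hypothetical inputs: x2 is nearly 2 * x1, x3 is only weakly related to either.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 4, 6, 8, 10, 13]
x3 = [5, 1, 4, 2, 6, 3]
columns = [x1, x2, x3]

for name, j in (("x1", 0), ("x2", 1), ("x3", 2)):
    print(name, "VIF =", round(float(vif(columns, j)), 2))
```

The two nearly collinear inputs come out with VIFs far above 10, while the third stays close to the ideal value of 1. Perfectly collinear inputs would make the residual sum of squares zero, and this sketch would then fail with a division error.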

Press **Model** again. Double-click **A-Prices** to remove its model
designation and exclude the term.

You could now go back to ANOVA, look for the next least significant term,
exclude it, and so on. However, this backward-elimination process can be performed
automatically. Here’s how. First, reset **Process Order** to
**Linear**.

Now click on the **Autoselect…** button. Then change the selection to
**Backward** and the Criterion to **p-value**.

Notice a new field called “Alpha” appears. By default the program removes
the least significant term, step-by-step, as long as it exceeds the risk level
(symbolized by statisticians with the Greek letter alpha) of 0.1 (estimated by
p-value). Let’s be a bit more conservative by changing **Alpha** to **0.05**.

Now press the **Start** button to see what happens.

The automatic selection is shown, step-by-step. Scroll up to see the whole thing if you like. For now, though, let’s move on to see what model is left and check out the more user-friendly “selection log” to see what was done. The Start button becomes an Accept button, so click it and then click **ANOVA** to see the resulting model.

We are left with the model that continued manual elimination would have produced, but this was much easier.
We also get a nice summary of how we got here. Click on the **Model Selection
Log** pane.

Not surprisingly, the program first removed A and then E, and that’s it. All of the other terms on the ANOVA table come out significant. (Note: If you do not see the report of the model being “significant,” change your View to Annotated ANOVA.)
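
The backward-elimination loop itself can be sketched in a few lines of Python. This is a hedged illustration of the general technique, not the algorithm as coded in Stat-Ease: each term’s p-value comes from a partial F test with one numerator degree of freedom, evaluated by numerically integrating the t density. The data are made up (three orthogonal coded inputs, chosen so the result can be checked by hand, rather than the collinear Longley data), and all function names are hypothetical.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]

def residual_ss(columns, y):
    """Residual sum of squares for y regressed on an intercept plus columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(c * v for c, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def two_sided_p(t, df):
    """Two-sided p-value of a t statistic via Simpson integration of the t density."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    steps = 2000  # even number of Simpson intervals
    h = abs(t) / steps
    area = sum((1 if i in (0, steps) else 4 if i % 2 else 2)
               * c * (1 + (i * h) ** 2 / df) ** (-(df + 1) / 2)
               for i in range(steps + 1)) * h / 3
    return max(0.0, 1.0 - 2.0 * area)

def term_p(data, terms, name, y):
    """p-value of the partial F test for one term within the model `terms`."""
    full = residual_ss([data[k] for k in terms], y)
    reduced = residual_ss([data[k] for k in terms if k != name], y)
    df = len(y) - len(terms) - 1
    f = max(reduced - full, 0.0) / (full / df)
    return two_sided_p(math.sqrt(f), df)  # F(1, df) is t(df) squared

def backward(data, y, alpha=0.1):
    """Drop the least significant term until every p-value is at or below alpha."""
    kept, removed = list(data), []
    while kept:
        pvals = {k: term_p(data, kept, k, y) for k in kept}
        worst = max(pvals, key=pvals.get)
        if pvals[worst] <= alpha:
            break
        kept.remove(worst)
        removed.append(worst)
    return kept, removed

# Hypothetical data: strong B and C effects, a weak A effect, and a small
# three-way pattern standing in for noise.
A = [-1, 1, -1, 1, -1, 1, -1, 1]
B = [-1, -1, 1, 1, -1, -1, 1, 1]
C = [-1, -1, -1, -1, 1, 1, 1, 1]
y = [50 + 0.2 * a + 5 * b + 3 * c + 0.5 * a * b * c for a, b, c in zip(A, B, C)]
data = {"A": A, "B": B, "C": C}

kept, removed = backward(data, y, alpha=0.05)
print("kept:", kept, "removed:", removed)
```

With these made-up values the weak term A is dropped in the first pass and the loop stops, because B and C both stay well under the 0.05 threshold.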

You may have noticed that in the full model, factor B had a much higher p-value
than what’s shown above. This instability is typical of models based on
historical data. Move over to the **Fit Statistics** and **Coefficients** panes.

Now let’s try a different regression approach – building the model from the
ground (mean) up, rather than tearing terms down from the top (all terms in chosen
polynomial). Press **Model**, then re-set **Process Order** to **Linear** and
click the **Auto Select…** button. This time choose **p-value** as your criterion
and leave **Forward** for the Selection method. To provide a fair comparison of
this forward approach with that done earlier going backward, change **Alpha** to
**0.05**.

Heed the text displayed by the program (When reducing your model…) because this
approach may not work as well for this highly collinear set of factors. Press
**Start**, then **Accept**, and see what happens now in **ANOVA**.

Surprisingly, factor B now comes in first as the single most significant factor. Then comes factor C. That’s it! The next most significant factor evidently does not achieve the alpha-in significance threshold of p<0.05.

Move to the **Fit Statistics** pane.

This simpler model scores very high on all measures of R-squared, but it falls a bit short of what was achieved in the model derived from the backward regression.

Finally, go back to **Model**, re-set **Process Order** to **Linear** and go to
**Autoselect…** to try the last model **Selection** option offered by
Stat-Ease: **Stepwise** (be sure to also choose **p-value** as your
criterion). Note that AIC and BIC are newer model criteria that we will use in future
tutorials.

As you might infer from seeing both Alpha in and Alpha out now displayed,
stepwise algorithms involve elements of forward selection with bits of backward
added in for good measure. For details, search program Help, but consider this –
terms that pass the alpha test in (via forward regression) may later (after
further terms are added) become disposable according to the alpha test out (via
backward selection). If this seems odd, look back at how factor B’s p-value changed
depending on which other factors were chosen with it for modeling. To see what
happens with this stepwise method, press **Start**, **Accept**, and then
**ANOVA** again. Results depend on what you do with Alpha in and Alpha out, both
of which default back to 0.1000. With the defaults, this method selects the same
model that backward selection chose.
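
To make the interplay of Alpha in and Alpha out concrete, here is a minimal Python sketch of a generic stepwise loop. It is an assumption-laden illustration, not the program’s algorithm: p-values come from partial F tests evaluated by numeric integration, the data are made up (orthogonal coded inputs, not the Longley data), and every function name is hypothetical. Setting `alpha_out` to 1.0 turns off the removal step, which leaves plain forward selection.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]

def residual_ss(columns, y):
    """Residual sum of squares for y regressed on an intercept plus columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(c * v for c, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def two_sided_p(t, df):
    """Two-sided p-value of a t statistic via Simpson integration of the t density."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    steps = 2000  # even number of Simpson intervals
    h = abs(t) / steps
    area = sum((1 if i in (0, steps) else 4 if i % 2 else 2)
               * c * (1 + (i * h) ** 2 / df) ** (-(df + 1) / 2)
               for i in range(steps + 1)) * h / 3
    return max(0.0, 1.0 - 2.0 * area)

def term_p(data, terms, name, y):
    """p-value of the partial F test for one term within the model `terms`."""
    full = residual_ss([data[k] for k in terms], y)
    reduced = residual_ss([data[k] for k in terms if k != name], y)
    df = len(y) - len(terms) - 1
    f = max(reduced - full, 0.0) / (full / df)
    return two_sided_p(math.sqrt(f), df)  # F(1, df) is t(df) squared

def stepwise(data, y, alpha_in=0.1, alpha_out=0.1):
    """Alternate a forward (add) step and a backward (remove) step."""
    model = []
    for _ in range(2 * len(data) + 1):  # cap on passes, guards against cycling
        changed = False
        candidates = [k for k in data if k not in model]
        if candidates:
            pvals = {k: term_p(data, model + [k], k, y) for k in candidates}
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha_in:      # alpha test in
                model.append(best)
                changed = True
        if model:
            pvals = {k: term_p(data, model, k, y) for k in model}
            worst = max(pvals, key=pvals.get)
            if pvals[worst] > alpha_out:    # alpha test out
                model.remove(worst)
                changed = True
        if not changed:
            break
    return model

# Hypothetical data: strong B and C effects, a weak A effect, and a small
# three-way pattern standing in for noise.
A = [-1, 1, -1, 1, -1, 1, -1, 1]
B = [-1, -1, 1, 1, -1, -1, 1, 1]
C = [-1, -1, -1, -1, 1, 1, 1, 1]
y = [50 + 0.2 * a + 5 * b + 3 * c + 0.5 * a * b * c for a, b, c in zip(A, B, C)]
data = {"A": A, "B": B, "C": C}

print("stepwise:", stepwise(data, y))
print("forward :", stepwise(data, y, alpha_out=1.0))
```

Because the made-up inputs here are orthogonal, a term’s significance barely depends on which other terms are in the model, so the forward and stepwise runs agree. With highly correlated inputs such as those in this tutorial, the alpha test out can eject a term that an earlier forward step admitted.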

As you see in the message displayed for both forward and stepwise (in essence an enhancement of forward) approaches, we favor the backward approach if you decide to make use of an automated selection method. Ideally, an analyst is also a subject-matter expert, or such a person is readily accessible. Then they could do model reduction via the manual method filtered not only by the statistics, but also by simple common sense from someone with profound system knowledge.

Note

Consider preserving each alternative model for comparison via the
**Analysis [+]** button in the tree on the left. This creates multiple
analyses for any given response. Hint: To keep track of which is which,
name each of your models in the field provided, for example,
“p-value backward at 0.01”.

This concludes part 1 of our Longley data-set exploration. In Part 2 we mine deeper into Stat-Ease to see interesting residual analysis aspects within Diagnostics, and we also see what can be gleaned from its sophisticated tools within Design Evaluation.