# Historical Data

## Part 1 – The Basics

## Introduction

In this tutorial you will see how the regression tool in Design-Expert®
software, intended for response surface methods (RSM), is applied to historical
data. We don’t recommend you work with such happenstance variables if there’s any
possibility of performing a designed experiment. However, if you must, take
advantage of how easy Design-Expert makes it to develop predictive models and
graph responses, as you will see by doing this tutorial. It is assumed that at
this stage you’ve mastered many program features by completing preceding tutorials.
At the very least you ought to first do the one-factor RSM tutorials, both basic
and advanced, prior to starting this one.

The historical data for this tutorial, shown below, comes from the U.S. Bureau
of Labor Statistics via James Longley (An Appraisal of Least Squares Programs for
the Electronic Computer from the Point of View of the User, *Journal of the
American Statistical Association*, 62 (1967): 819-841). As discussed in *RSM
Simplified* (Mark J. Anderson and Patrick J. Whitcomb, Productivity, Inc., New
York, 2005: Chapter 2), it presents some interesting challenges for regression
modeling.

Assume the objective for analyzing this data is to predict future employment as a function of leading economic indicators – factors labeled A through F in the table above. Longley’s goal was different: He wanted to test regression software circa 1967 for round-off error due to highly correlated inputs. Will Design-Expert be up to the challenge? We will see!

Let’s begin by setting up this “experiment” (quotes added to emphasize this is not really an experiment, but rather an after-the-fact analysis of happenstance data).

## Design the “Experiment”

Click the Design-Expert icon that may appear on your desktop. To save you typing
time, we will re-build a previously saved design rather than entering it from
scratch. Click on the **Help, Tutorial Data** menu and select **Employment**.

To re-build this design (and thus see how it was created), press the blank-sheet icon at the left of the toolbar.

Click **Yes** when Design-Expert queries “Use previous design info?”

Now you can see how this design was created via the Response Surface tab and Historical Data option.

Before moving ahead, you must tell Design-Expert how many rows of data you want to key in or copy and paste into the design layout. In this case there are 16 rows.

Press **Next** to view the factors. Note for each of the 6 numeric factors we
entered name, units, and range from minimum (“Min”) to maximum (“Max”). Press
**Next** to accept all entries on your screen.

You now see response details – in this case only one response.

Press **Finish** to see the resulting design layout in run order.

### A Peculiarity on Pasting Data

You could now type in all data for factor levels and resulting responses, row-by-row. (Don’t worry: we won’t make you do this!) However, in most cases the data are already available in a spreadsheet. If so, simply click and drag the data across, or copy them to the clipboard and then choose Edit, Paste (or right-click and Paste as shown below) to place them in the design layout within Design-Expert. (Be sure, as shown below, to first select the top row of all your destination cells.)

If you simply click the upper left cell in the empty run sheet, the program only pastes one value.

## Analyze the Results

Normally you’d save your work at this stage, but because we already did this,
simply re-open our file: Click on the **Help, Tutorial Data** menu and select
**Employment**. Click **No** to pass up the opportunity to save what you did
previously.

Before we get started, be forewarned you will encounter many statistics related
to least squares regression and analysis of variance (ANOVA). If you are coming
into this without previous knowledge, pick up a copy of *RSM Simplified* and keep
it handy. For a good guided tour of statistics for RSM analysis, attend our
Stat-Ease workshop titled RSM for Process Optimization. Details about this
computer-intensive, hands-on class – including prerequisites – are at
www.statease.com.

Under the **Analysis** branch, click the **Employment** node. Design-Expert
displays a screen for transforming the response. However, as noted by the program,
the response range in this case is so small that there is little advantage to
applying any transformation.

Press **Fit Summary**. Design-Expert evaluates each degree of the model from the
mean on up. In this case, the best that can be done is linear. Anything higher
is aliased.

Move on by pressing **Model**.

It’s all set up the way Design-Expert suggested. Notice many two-factor interactions
can’t be estimated due to aliasing – symbolized by a yellow triangle with an
exclamation point. Hold on to your hats (because this upcoming data
is really a lot of hot air!) and press **ANOVA** (analysis of variance).

Notice although the overall model is significant, some terms are not.

Note

**Some statistical details on how Design-Expert does analysis of
variance**: You may have noticed this ANOVA is labeled “Sum of squares is
**Type III - Partial**.” This approach to ANOVA, done by default, causes total
sums-of-squares (SS) for the terms to come up short of the overall model when
analyzing data from a nonorthogonal array, such as historical data. If you want
SS terms to add up to the model SS, go to Edit, Preferences for Analysis and
change the default to Sequential (Type I) for these numeric factors. However,
we do not recommend this approach because it favors the first term put into the
model. For example, in this case, ANOVA by partial SS (Type III – the default
of Design-Expert) for the response (employment total) calculates the prob>F p-value for A as
0.8631 (F=0.031) as seen above, which is not significant. Recalculating ANOVA
by sequential sum of squares (Type I) changes the p to <0.0001 (F=1876), which
looks highly significant, but only because this term (main effect of factor A)
is fit first. This simply is not correct.
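To make the difference between the two sums-of-squares conventions concrete, here is a minimal sketch in Python with NumPy. It is not how Design-Expert computes its ANOVA, and the two-factor data set is synthetic (made up for illustration, not the Longley values); the helper names `sse`, `sequential_ss`, and `partial_ss` are our own. Sequential (Type I) SS credits each term with whatever it explains when it enters in order, while partial (Type III) SS credits each term only with what is lost when that term alone is dropped from the full model.

```python
import numpy as np

def sse(columns, y):
    """Residual sum of squares for an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def sequential_ss(X, y):
    """Type I: SS gained as each term enters the model in column order."""
    out, prev = [], sse([], y)
    for j in range(X.shape[1]):
        cur = sse([X[:, i] for i in range(j + 1)], y)
        out.append(prev - cur)
        prev = cur
    return out

def partial_ss(X, y):
    """Type III: SS lost when one term is dropped from the full model."""
    k = X.shape[1]
    full = sse([X[:, i] for i in range(k)], y)
    return [sse([X[:, i] for i in range(k) if i != j], y) - full
            for j in range(k)]

# Synthetic, deliberately collinear inputs (x2 is x1 plus a small wiggle).
x1 = np.arange(1.0, 7.0)
x2 = x1 + np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1])
y = x1 + np.array([0.2, -0.2, -0.2, 0.2, 0.2, -0.2])
X = np.column_stack([x1, x2])

model_ss = sse([], y) - sse([x1, x2], y)
type1 = sequential_ss(X, y)
type3 = partial_ss(X, y)
print("model SS :", round(model_ss, 4))
print("Type I   :", [round(v, 4) for v in type1])
print("Type III :", [round(v, 4) for v in type3])
```

In this made-up case the sequential SS values add up to the model SS, and nearly all of it lands on whichever factor is fit first. The partial SS values fall far short of the model SS because the two inputs carry almost the same information, which is the same pattern the note describes for factor A.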

Assuming Factor A (prices) is least significant of all as indicated by default
ANOVA (partial SS), let’s see what happens with it removed. However, before we do,
move to the **Fit Statistics** pane (shown below) to help us compare what happens
before and after reducing the model.

Also look at the **Coefficients** estimates.

Notice the huge VIF (variance inflation factor) values. A value of 1 is ideal (orthogonal), but a VIF below 10 is generally accepted. A VIF above 1000, such as that for factor B (GNP), indicates severe multicollinearity in the model coefficients (that’s bad!). In the follow-up tutorial (Part 2) based on this same Longley data, we delve more into this and other statistics generated by Design-Expert for purposes of design evaluation. For now, right-click any VIF result to access context-sensitive Help, or go to Help on the main menu and search on this statistic. You will find some details there.
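If you would like to see where a VIF comes from, the sketch below uses the standard textbook definition: regress each factor on all the others and take 1/(1 − R²) from that fit. This is an illustration in Python with NumPy under that assumption, not Design-Expert’s own code; the function name `vif` and both small data sets are hypothetical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows = runs)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        # Regress factor j on an intercept plus every other factor.
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        factors.append(ss_tot / ss_res)  # same as 1 / (1 - R_j^2)
    return np.array(factors)

# An orthogonal two-level factorial: every VIF should be 1.
orthogonal = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1]])

# Synthetic "happenstance" data: the second column nearly copies the first.
x1 = np.arange(1.0, 7.0)
x2 = x1 + np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1])
collinear = np.column_stack([x1, x2])

print("orthogonal VIFs:", vif(orthogonal))
print("collinear VIFs :", vif(collinear))
```

The designed experiment returns the ideal value of 1 for both factors, while the nearly duplicated columns push both VIFs into the hundreds.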

Press **Model** again. Double-click **A-Prices** to remove its model
designation and exclude the term.

You could now go back to ANOVA, look for the next least significant term,
exclude it, and so on. However, this backward-elimination process can be performed
automatically in Design-Expert. Here’s how. First, reset **Process Order** to
**Linear**.

Now click on the **Auto Select…** button. Then change the selection to
**Backward** and the Criterion to **p-value**.

Notice a new field called “Alpha” appears. By default the program removes
the least significant term, step-by-step, as long as its p-value exceeds the
risk level (symbolized by statisticians with the Greek letter alpha) of 0.1.
Let’s be a bit more conservative by changing **Alpha** to **0.05**.

Now press the **Start** button to see what happens.

The automatic selection is shown, step-by-step. Scroll up to see the whole thing if you like. For now, though, let’s move on to see what model is left and check out the more user-friendly “selection log” to see what was done. The **Start** button becomes an **Accept** button, so click it and then click **ANOVA** to see the resulting model.

We are left with the same model we landed on by hand, but this was much easier.
We also get a nice summary of how we got here. Click on the **Model Selection
Log** pane.

Not surprisingly, the program first removed A and then E – that’s it. All of the other terms on the ANOVA table come out significant. (Note: If you do not see the report of the model being “significant” change your View to Annotated ANOVA.)
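The backward pass can be sketched in a few lines of Python with NumPy. Treat this as an illustration of the general idea (refit, find the term with the largest partial p-value, drop it if that p-value exceeds alpha, repeat) and not as Design-Expert’s actual algorithm. Everything named here is hypothetical: the helpers `t_two_sided_p`, `term_p_values`, and `backward_eliminate`, and the small orthogonal data set, which we chose so the arithmetic is easy to check by hand. The p-value is computed by simple numerical integration to keep the sketch free of extra libraries.

```python
import math
import numpy as np

def t_two_sided_p(t, df, steps=2000):
    """Two-sided Student-t p-value by Simpson's rule (standard library only).

    Substituting x = sqrt(df) * tan(theta) turns P(|T| <= |t|) into a
    smooth integral of cos(theta)**(df - 1) over a bounded interval.
    """
    upper = math.atan(abs(t) / math.sqrt(df))
    scale = math.gamma((df + 1) / 2) / (math.sqrt(math.pi) * math.gamma(df / 2))
    h = upper / steps
    total = 1.0 + math.cos(upper) ** (df - 1)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
    inside = 2 * scale * total * h / 3  # P(|T| <= |t|)
    return max(0.0, 1.0 - inside)

def term_p_values(X, y, cols):
    """Partial p-value for each factor column listed in `cols`."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    df = n - A.shape[1]
    mse = float(resid @ resid) / df
    se = np.sqrt(mse * np.diag(np.linalg.inv(A.T @ A)))
    return {j: t_two_sided_p(coef[i + 1] / se[i + 1], df)
            for i, j in enumerate(cols)}

def backward_eliminate(X, y, alpha=0.10):
    """Drop the least significant term until all p-values are <= alpha."""
    cols, log = list(range(X.shape[1])), []
    while cols:
        p = term_p_values(X, y, cols)
        worst = max(cols, key=p.get)
        if p[worst] <= alpha:
            break
        cols.remove(worst)
        log.append((worst, p[worst]))
    return cols, log

# Synthetic two-level design: A and B matter, C barely does.
A_ = np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
B_ = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
C_ = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)
y = 10 + 5 * A_ + 3 * B_ + 0.1 * C_ + 0.5 * A_ * B_
X = np.column_stack([A_, B_, C_])

kept, log = backward_eliminate(X, y, alpha=0.05)
print("kept columns:", kept)
print("removal log :", log)
```

Here the log plays the role of the Model Selection Log: it records which column was removed at each step and the p-value that justified removing it.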

You may have noticed that in the full model, factor B had a much higher p-value
than what’s shown above. This instability is typical of models based on
historical data. Move over to the **Fit Statistics** and **Coefficients** panes.

Now let’s try a different regression approach – building the model from the
ground (mean) up, rather than tearing terms down from the top (all terms in chosen
polynomial). Press **Model**, then re-set **Process Order** to **Linear** and
click the **Auto Select…** button. This time choose **p-value** as your criterion
and leave **Forward** for the Selection method. To provide a fair comparison of
this forward approach with that done earlier going backward, change **Alpha** to
**0.05**.

Heed the text displayed by the program (When reducing your model…) because this
approach may not work as well for this highly collinear set of factors. Press
**Start** and then see what happens now in **ANOVA**.

Surprisingly, factor B now comes in first as the single most significant factor. Then comes factor C. That’s it! The next most significant factor evidently does not achieve the alpha-in significance threshold of p<0.05.
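For comparison with the backward approach, here is a minimal sketch of forward selection in Python with NumPy. It is an illustration only and makes one loud simplification: a fixed F-to-enter cutoff (4.0, an arbitrary illustrative value) stands in for the alpha-in p-value test that Design-Expert applies. The function names and the small synthetic data set are hypothetical.

```python
import numpy as np

def sse(X, y, cols):
    """Residual sum of squares for an intercept plus the listed columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_select(X, y, f_to_enter=4.0):
    """Add the most helpful term each step while its F exceeds the cutoff."""
    n, k = X.shape
    chosen, log = [], []
    current = sse(X, y, chosen)  # mean-only model to start
    while len(chosen) < k:
        candidates = [j for j in range(k) if j not in chosen]
        trial = {j: sse(X, y, chosen + [j]) for j in candidates}
        best = min(trial, key=trial.get)
        df_resid = n - (len(chosen) + 2)  # intercept + chosen + candidate
        f_stat = (current - trial[best]) / (trial[best] / df_resid)
        if f_stat < f_to_enter:
            break
        chosen.append(best)
        log.append((best, f_stat))
        current = trial[best]
    return chosen, log

# Synthetic two-level design: A and B matter, C barely does.
A_ = np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
B_ = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
C_ = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)
y = 10 + 5 * A_ + 3 * B_ + 0.1 * C_ + 0.5 * A_ * B_
X = np.column_stack([A_, B_, C_])

chosen, log = forward_select(X, y)
print("entry order:", chosen)
print("F at entry :", [round(f, 2) for _, f in log])
```

With orthogonal columns like these, forward and backward selection agree. With highly correlated inputs such as the Longley factors, the first term to enter can soak up credit that belongs to its neighbors, which is why the two methods can land on different models.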

Move to the **Fit Statistics** pane.

This simpler model scores very high on all measures of R-squared, but it falls a bit short of what was achieved in the model derived from the backward regression.
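When comparing models this way, it helps to know how the usual R-squared measures are built. The sketch below, in Python with NumPy, computes the three textbook versions: ordinary R², adjusted R² (penalized for the number of terms), and predicted R² (based on PRESS, the leave-one-out residuals). We assume these are the measures you are comparing on the Fit Statistics pane; the function `fit_statistics` and the synthetic data are our own illustration, not Design-Expert output.

```python
import numpy as np

def fit_statistics(X, y):
    """Return R-squared, adjusted R-squared, and predicted R-squared."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    p = A.shape[1]  # number of coefficients, intercept included
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    # Leverage of each run: diagonal of the hat matrix A (A'A)^-1 A'.
    leverage = np.einsum("ij,jk,ik->i", A, np.linalg.inv(A.T @ A), A)
    press = float(((resid / (1 - leverage)) ** 2).sum())
    return {
        "r2": 1 - sse / sst,
        "adj_r2": 1 - (sse / (n - p)) / (sst / (n - 1)),
        "pred_r2": 1 - press / sst,
    }

# Synthetic two-level design: A and B matter, C barely does.
A_ = np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
B_ = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
C_ = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)
y = 10 + 5 * A_ + 3 * B_ + 0.1 * C_ + 0.5 * A_ * B_

reduced = fit_statistics(np.column_stack([A_, B_]), y)
full = fit_statistics(np.column_stack([A_, B_, C_]), y)
print("reduced:", {k: round(v, 4) for k, v in reduced.items()})
print("full   :", {k: round(v, 4) for k, v in full.items()})
```

In this made-up example the extra term nudges ordinary R² upward, as any added term will, but both adjusted and predicted R² drop. That is why it pays to look at all three measures before deciding which model is better.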

Finally, go back to **Model**, re-set **Process Order** to **Linear** and go to
**Auto Select…** to try the last model **Selection** option offered by
Design-Expert: **Stepwise** (be sure to also choose **p-value** as your
criterion). Note that AIC and BIC are newer model criteria that we will use in future
tutorials.

As you might infer from seeing both Alpha in and Alpha out now displayed,
stepwise algorithms involve elements of forward selection with bits of backward
added in for good measure. For details, search program Help, but consider this –
terms that pass the alpha test in (via forward regression) may later (after
further terms are added) become disposable according to the alpha test out (via
backward selection). If this seems odd, look back at how factor B’s p-value changed
depending on which other factors were chosen with it for modeling. To see what
happens with this stepwise method, press **Start**, **Accept**, and then
**ANOVA** again. Results depend on what you do with Alpha in and Alpha out, both
of which default to 0.1000. With the defaults, the same model is selected by
this method as the backward selection chose.

As you see in the message displayed for both forward and stepwise (in essence an enhancement of forward) approaches, we favor the backward approach if you decide to make use of an automated selection method. Ideally, an analyst is also a subject-matter expert, or such a person is readily accessible. Then they could do model reduction via the manual method filtered not only by the statistics, but also by simple common sense from someone with profound system knowledge.

This concludes Part 1 of our Longley data-set exploration. In Part 2 we mine deeper into Design-Expert to see interesting residual analysis aspects within Diagnostics, and we also see what can be gleaned from its sophisticated tools within Design Evaluation.