Transformation of the response is an important component of any data analysis. Transformation is needed if the error (residuals) is a function of the magnitude of the response (predicted values). Stat-Ease provides extensive diagnostic capabilities to check if the statistical assumptions underlying the data analysis are met. The normal plot of the residuals tests their normality. The residuals versus predicted response values plot will indicate a problem if a pattern exists. Unless the ratio of the maximum response to the minimum response is large, transforming the response will not make much difference.

The Box-Cox plot on the Diagnostics button will provide a recommended transformation from the power family. The two non-power transformations, the logit for bounded data and the arcsine square root for proportions, must be applied based on the type of response: because the Box-Cox plot only considers the power family, it will often recommend a square-root transformation when proportion data are present, and the log transformation for bounded data.

Stat-Ease provides a broad range of possible transformations: most are from the power family, plus two additional transformations, the logit and the arcsine square root.

Most data transformations can be described by the power function. If the standard deviation associated with an observation is proportional to the mean raised to the power \(\alpha\), then transforming the observation by the \(1-\alpha\) (lambda) power gives a scale satisfying the equal variance requirement of the statistical model.

The appropriate choice of a response transformation relies on subject-matter knowledge and/or statistical considerations. The available transformations and examples for their use are:

**Square Root** – count, frequency data

**Natural log** – variance or growth data

**Base 10 log** – variance or growth data

**Inverse square root**

**Inverse** – rate/time, decay rate

**Power** – for more extreme transformation needs

The **power** transformation allows transformation to any power in the range –3
to +3, provided the data are positive. You may add a constant to the data to
avoid powers of negative numbers. If the standard deviation associated with an
observation is proportional to the mean raised to some power, then transforming
the observation by a power gives a scale satisfying the equal variance
requirement of the ANOVA. The Box-Cox plot is provided in the Diagnostics plots
to help you choose an appropriate power transformation.
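As a numerical illustration (a Python/NumPy sketch, not Stat-Ease code), the variance-stabilizing effect of a power transformation can be checked directly. The data below are simulated so that the standard deviation is proportional to the mean (\(\alpha = 1\)), for which the \(1-\alpha = 0\) power corresponds to the log transformation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate groups whose standard deviation is proportional to the mean
# (alpha = 1); the power family then suggests lambda = 1 - alpha = 0,
# i.e. a log transformation.
means = np.array([10.0, 50.0, 250.0])
groups = [rng.normal(m, 0.1 * m, size=2000) for m in means]

raw_sds = np.array([g.std() for g in groups])
log_sds = np.array([np.log(g).std() for g in groups])

# On the raw scale the spread grows with the mean (roughly 1, 5, 25) ...
print(raw_sds)
# ... while on the log scale it is roughly constant (about 0.1)
print(log_sds)
```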

**Logit**

The **logit** transformation is used when the response has unreachable lower
and upper physical limits. One example is the yield of a chemical reaction. The
physical bounds are 0% and 100%, but in practice the actual yields will not
quite reach 100% due to impurities, energy loss, etc. The logit transform
spreads out the values near the boundaries. When using this transformation, it
is very important to correctly set the lower and upper limits to the natural
limits of the response.

\[\log_{e}\left[\frac{Y - \text{lower limit of } Y}{\text{upper limit of } Y - Y}\right]\]
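A minimal Python sketch of this transform and its inverse (the function names are illustrative, not part of Stat-Ease):

```python
import math

def logit(y, lower, upper):
    """Logit transform for a response bounded by physical limits."""
    return math.log((y - lower) / (upper - y))

def inverse_logit(z, lower, upper):
    """Map a transformed value back to the original bounded scale."""
    return lower + (upper - lower) / (1.0 + math.exp(-z))

# A 50% yield, bounded by 0% and 100%, maps to 0 on the logit scale
z = logit(50.0, 0.0, 100.0)       # 0.0
y = inverse_logit(z, 0.0, 100.0)  # back to 50.0
```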

**Arcsine square root**

The **arcsine square root** should be used for proportion data. Proportion data
are fractions between 0 and 1, inclusive. The assumption is that a batch of size “n” is
generated by the settings of each run. Each individual member of the batch has a
binomial outcome, either passing or failing a specified criterion.

\[\arcsin\left(\sqrt{Y}\right)\]
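For illustration, the transform is one line of Python (the function name and bounds check are illustrative):

```python
import math

def arcsin_sqrt(y):
    """Arcsine square root transform for a proportion in [0, 1]."""
    if not 0.0 <= y <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    return math.asin(math.sqrt(y))

# The transform maps [0, 1] onto [0, pi/2]
arcsin_sqrt(0.0)   # 0.0
arcsin_sqrt(0.5)   # pi/4
arcsin_sqrt(1.0)   # pi/2
```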

References

D. Miller. Reducing transformation bias in curve fitting. *The American Statistician*, 38(2):124–126, 1984.

Logistic regression analysis estimates the odds or probability of an event by modeling the log odds of that event as a polynomial function of the input factors.

Note that odds and probability are related by:

\[\mathrm{odds} = \frac{\mathrm{probability}}{1-\mathrm{probability}} \leftrightarrow \mathrm{probability} = \frac{\mathrm{odds}}{1+\mathrm{odds}}\]
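This relationship is easy to verify in a couple of lines of Python (the function names are illustrative):

```python
def odds_from_probability(p):
    """Convert a probability in (0, 1) to odds."""
    return p / (1.0 - p)

def probability_from_odds(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)

# A probability of 0.75 corresponds to odds of 3 (i.e. 3:1)
odds_from_probability(0.75)   # 3.0
probability_from_odds(3.0)    # 0.75
```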

Because odds are usually given as a ratio, a:b, the software opts to report the probability and models the natural logarithm of the odds as a polynomial function of the input factors:

\[\ln(\mathrm{odds}) = \mathrm{logit}(p) = \ln\left[\frac{p(y=1)}{1 - p(y=1)}\right] = \beta_{0} +
\beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} = z\]

An *iteratively reweighted least squares* algorithm is used to estimate the coefficients for the polynomial model:

\[\hat{z} = \hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\hat{\beta}_{2}x_{2}+\cdots+ \hat{\beta}_{k}x_{k}\]

Finally, the probability is estimated by applying the inverse transformation:

\[\hat{p}=\frac{e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\hat{\beta}_{2}x_{2}+\cdots+ \hat{\beta}_{k}x_{k}}}{1+e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\hat{\beta}_{2}x_{2}+\cdots+ \hat{\beta}_{k}x_{k}}}=\frac{1}{1+e^{-\left(\hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\hat{\beta}_{2}x_{2}+\cdots+ \hat{\beta}_{k}x_{k}\right)}} = \frac{1}{1+e^{-\hat{z}}}\]

This estimate is always bounded between 0 and 1, consistent with a probability.
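The fitting procedure described above can be sketched in Python/NumPy. This is a bare-bones Newton-type IRLS implementation run on simulated data, not the software's actual algorithm:

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25):
    """Fit logit(p) = X @ beta by iteratively reweighted least squares.

    X includes a column of ones for the intercept; y holds 0/1 outcomes.
    Illustrative sketch only.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = X @ beta
        p = 1.0 / (1.0 + np.exp(-z))   # inverse logit
        W = p * (1.0 - p)              # binomial variance weights
        # Weighted least-squares (Newton-Raphson) update
        H = X.T @ (W[:, None] * X)
        g = X.T @ (y - p)
        beta = beta + np.linalg.solve(H, g)
    return beta

# Simulated 0/1 responses with known coefficients (0.5, 1.5)
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=500)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))
y = rng.binomial(1, p_true)
X = np.column_stack([np.ones_like(x), x])
beta_hat = fit_logistic_irls(X, y)
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))  # always in (0, 1)
```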

Poisson regression is used to model the relationship between input factors and a response that represents a count. The mean count, \(y\), is related to the input factors through:

\[\ln\left(y\right) = \beta_{0} +
\beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} = z\]

An *iteratively reweighted least squares* algorithm is used to estimate the coefficients for the polynomial model:

\[\hat{z} = \hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\hat{\beta}_{2}x_{2}+\cdots+ \hat{\beta}_{k}x_{k}\]

Finally, the mean count is estimated by applying the inverse transformation:

\[\hat{y} = e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\hat{\beta}_{2}x_{2}+\cdots+ \hat{\beta}_{k}x_{k}} = e^{\hat{z}}\]

Notice that this expression is always non-negative, consistent with count data. However, it represents a *mean* count, so it is not necessarily an integer.
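The same IRLS idea carries over to the log link. Below is an illustrative Python/NumPy sketch on simulated count data, again not the software's actual algorithm:

```python
import numpy as np

def fit_poisson_irls(X, y, n_iter=25):
    """Fit ln(mu) = X @ beta by iteratively reweighted least squares.

    X includes an intercept column; y holds non-negative counts.
    Illustrative sketch only.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)           # inverse of the log link
        H = X.T @ (mu[:, None] * X)     # Poisson variance equals the mean
        g = X.T @ (y - mu)
        beta = beta + np.linalg.solve(H, g)
    return beta

# Simulated counts with a known log-linear mean, exp(0.2 + 1.0 x)
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=500)
mu_true = np.exp(0.2 + 1.0 * x)
y = rng.poisson(mu_true)
X = np.column_stack([np.ones_like(x), x])
beta_hat = fit_poisson_irls(X, y)
mu_hat = np.exp(X @ beta_hat)  # always non-negative, need not be an integer
```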

References

Douglas C. Montgomery, Elizabeth A. Peck, and G. Geoffrey Vining. *Introduction to Linear Regression Analysis*. Wiley, 5th edition, 2012.