Steps in a typical linear regression analysis

Steve Simon

1999-09-21

Categories: Blog post Tags: Linear regression

Let no man ignorant of geometry enter - Sign over Plato’s Academy in Athens

Linear regression models provide a good way to examine how various factors influence a continuous outcome measure. There are three steps in a typical linear regression analysis.

  1. Fit a crude model
  2. Fit an adjusted model
  3. Analyze predicted values and residuals

These steps may not be appropriate for every linear regression analysis, but they do serve as a general guideline. In this presentation

This presentation can only give the briefest introduction to this area. When I have time

**Step 1: Fit a crude model**

There are two types of models

A crude model for comparing duration of breast feeding to feeding group would be a t-test. I prefer

Shown below is the table of tests from the general linear model procedure.

The general linear model uses an F test instead of the t test

The general linear model also has a table of estimates

The intercept represents the average duration of breast feeding for the NG tube group. We see that the average duration is 20 weeks for the NG tube group. The (FEED_TYP=1) term is an estimate of how much the average duration changes when we move from the NG tube group to the bottle group. We see that the bottle group has an average duration that is 7 weeks shorter.

Shown below is a table of means from the general linear model.

We see that the difference between the two means is roughly 7 weeks, which confirms the results shown previously.
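The agreement between the table of estimates and the table of means is not a coincidence, and a small sketch may help show why. The Python code below fits a straight line to a 0/1 indicator using the usual least squares formulas. The six durations are hypothetical values, invented only so that the group means land on the rounded 20 and 13 weeks quoted above; they are not the study data, and the variable names are illustrative.

```python
from statistics import mean

# Hypothetical durations in weeks, invented for illustration only.
ng_tube = [18.0, 20.0, 22.0]   # feed_typ = 0 (reference group)
bottle = [11.0, 13.0, 15.0]    # feed_typ = 1

# Stack the two groups and code feeding type as a 0/1 indicator.
x = [0.0] * len(ng_tube) + [1.0] * len(bottle)
y = ng_tube + bottle

# Ordinary least squares for a single predictor.
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

print("intercept:", intercept)          # mean of the reference group
print("feed_typ=1 estimate:", slope)    # bottle mean minus NG tube mean
print("difference in means:", mean(bottle) - mean(ng_tube))
```

With a single 0/1 predictor, the intercept is always the mean of the group coded 0 and the slope is always the difference between the two group means, which is why the two tables tell the same story.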

**Step 2: Fit an adjusted model**

The previous model was a crude model. We see a seven week difference between the two groups

Shown below is the table of tests for a general linear model that includes mother’s age in the model.

The p-value for feeding group is .009

Shown below is the table of estimates from the same general linear model.

This table shows that the effect of bottle feeding is to decrease duration of breast feeding by about six weeks

A previous descriptive analysis of this data revealed that the average age for mothers in the treatment group is 29 years and the average age for mothers in the control group is 25 years. When you see a discrepancy like this in an important covariate

This analysis shows that the four year gap only accounts for a small portion of the difference. Since each year of age changes the duration by a quarter week

Shown below is the table of means.

This table now adjusts for mother’s age. The mean for the bottle fed group is adjusted upward to what it would be if the average age of the mothers in this group were 27 rather than 25. The mean for the NG tube group is adjusted downward to what it would be if the average age were 27 instead of 29. Note that the adjusted mean duration is half a week higher than the crude mean duration in the bottle group and that the adjusted mean duration is half a week lower than the crude mean duration for the NG tube group. This confirms that the difference between the two feeding groups is roughly 6 weeks
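The arithmetic behind this adjustment can be sketched in a few lines of Python. The inputs are the rounded figures quoted in the text (crude means of about 20 and 13 weeks, average ages of 29 and 25 years, a slope of a quarter week per year, and a common age of 27), so the results are approximate and will not match the software output to the last decimal. The variable names are illustrative.

```python
# Rounded values quoted in the text; names are illustrative.
crude_mean = {"ng_tube": 20.0, "bottle": 13.0}   # weeks
mean_age = {"ng_tube": 29.0, "bottle": 25.0}     # years
age_slope = 0.25       # weeks of duration per year of mother's age
reference_age = 27.0   # common age at which the groups are compared

# Shift each crude mean to what it would be at the reference age.
adjusted_mean = {
    group: crude_mean[group] + age_slope * (reference_age - mean_age[group])
    for group in crude_mean
}

crude_diff = crude_mean["ng_tube"] - crude_mean["bottle"]
adjusted_diff = adjusted_mean["ng_tube"] - adjusted_mean["bottle"]

print(adjusted_mean)                 # {'ng_tube': 19.5, 'bottle': 13.5}
print(crude_diff, adjusted_diff)     # 7.0 6.0
```

The four year age gap times a quarter week per year accounts for one week of the crude seven week difference, leaving roughly six weeks.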

This is not the final model. We should examine the effect of delivery type and account for the fact that we have some data on twins. I hope, though

**Step 3: Analyze predicted values and residuals**

A regression model gives you an equation that you can use to compute predicted values and residuals. In the regression model with mother’s age and feeding type

age_stop = 13 + 0.25 * age - 6 * feed_typ,

where feed_typ=1 if control and feed_typ=0 if treatment.

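So, if you recruited a mother into the treatment group and she was 30 years old, the predicted value would be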

predicted age_stop = 13 + 0.25 * 30 - 6 * 0 = 20.5 weeks.

If you recruited a mother into the treatment group and she was 19 years old, the predicted value would be

predicted age_stop = 13 + 0.25 * 19 - 6 * 0 = 17.75 weeks.

If you recruited a mother into the control group and she was 37 years old, the predicted value would be

predicted age_stop = 13 + 0.25 * 37 - 6 * 1 = 16.25 weeks.
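These three calculations can be reproduced with a short Python function. This is only a sketch: the coefficients are the rounded ones from the equation above, the function name `predict_age_stop` is hypothetical, and the coding of `feed_typ` (0 for treatment, 1 for control) is the one used in the worked calculations.

```python
def predict_age_stop(age, feed_typ):
    """Predicted duration of breast feeding in weeks.

    age      -- mother's age in years
    feed_typ -- 0 for the treatment group, 1 for the control group
    """
    return 13 + 0.25 * age - 6 * feed_typ

# (mother's age, feed_typ) for the three scenarios in the text.
scenarios = [(30, 0), (19, 0), (37, 1)]
predictions = [predict_age_stop(age, feed_typ) for age, feed_typ in scenarios]

for (age, feed_typ), pred in zip(scenarios, predictions):
    print(f"age={age}, feed_typ={feed_typ}: {pred} weeks")
```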

Now it turns out that the first three rows of your data set correspond to the three scenarios described above. The actual values we observed were 30 weeks, 4 weeks, and 12 weeks.

The residual is the difference between what we observed in the data and what the regression model would have predicted. For the first mother in the sample

residual = 30 - 20.5 = 9.5.

When the residual is positive

residual = 4 - 17.75 = -13.75.

This residual is negative. For the third mother

residual = 12 - 16.25 = -4.25.
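As a check on the arithmetic, here is a minimal Python sketch that recomputes the three residuals as observed minus predicted. The observed values (30, 4, and 12 weeks) are the ones used in the residual calculations, the predictions come from the rounded equation age_stop = 13 + 0.25 * age - 6 * feed_typ, and the function and variable names are hypothetical.

```python
def predict_age_stop(age, feed_typ):
    # Rounded regression equation from the text; feed_typ is 0 for
    # treatment and 1 for control.
    return 13 + 0.25 * age - 6 * feed_typ

# First three mothers: (age, feed_typ, observed duration in weeks).
rows = [(30, 0, 30.0), (19, 0, 4.0), (37, 1, 12.0)]

residuals = []
for age, feed_typ, observed in rows:
    predicted = predict_age_stop(age, feed_typ)
    residuals.append(observed - predicted)   # observed minus predicted
    print(f"observed={observed}, predicted={predicted}, "
          f"residual={observed - predicted}")
```

A positive residual means the mother breast fed longer than the model predicted, and a negative residual means she stopped sooner. With a predicted value of 20.5 weeks for the first mother, her residual works out to 9.5 weeks.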

Most statistical models require certain assumptions to be made about your data. These assumptions can be examined using residuals. If your model is good

The simplest plot is a plot of predicted values versus residuals (shown below).

The relatively random scatter of data values provides us with confidence in the assumptions of the linear model. There is no obvious trend or pattern in this plot.

I also looked at the residuals versus the feeding groups and versus mother’s age. Both showed no systematic trend or pattern (graphs not shown).

The following plot examines normality of the residuals.

The curved line indicates a non-normal distribution. Further investigation would identify that this distribution is rectangular: it has a sharp lower and upper bound that differs from a bell shaped curve. The design of this study produces these limits because the age at which the mother stops breast feeding can’t be shorter than 0 weeks and it can’t be longer than the duration of the study (roughly 6 months). In practice

Summary

There are three steps in a typical linear regression analysis.

  1. Fit a crude model.
  2. Fit an adjusted model.
  3. Examine predicted values and residuals.

You can find an earlier version of this page on my original website.