Many statistical procedures assume that your data follow a normal distribution. The normal probability plot is a useful graphical tool for assessing this assumption. This plot is also called the qqplot (quantile-quantile plot).
You can use the housing data set to illustrate the use of the normal probability plot.
fn <- "http://www.pmean.com/00files/housing.txt"
al <- read.table(file=fn,header=TRUE)
head(al)
## Price SquareFeet AgeYears NumberFeatures Northeast CustomBuild
## 1 205000 2650 13 7 Yes Yes
## 2 208000 2600 * 4 Yes Yes
## 3 215000 2664 6 5 Yes Yes
## 4 215000 2921 3 6 Yes Yes
## 5 199900 2580 4 4 Yes Yes
## 6 190000 2580 4 4 Yes No
## CornerLot
## 1 No
## 2 No
## 3 No
## 4 No
## 5 No
## 6 No
tail(al)
## Price SquareFeet AgeYears NumberFeatures Northeast CustomBuild
## 112 87400 1236 3 4 No No
## 113 87200 1229 6 3 No No
## 114 87000 1273 4 4 No No
## 115 86900 1165 7 4 No No
## 116 76600 1200 7 4 No No
## 117 73900 970 4 4 No No
## CornerLot
## 112 No
## 113 No
## 114 No
## 115 No
## 116 Yes
## 117 Yes
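# AgeYears uses "*" for unknown ages, so it is read as character;
# as.numeric() converts those entries to NA and warns about it.
al$age <- as.numeric(al$AgeYears)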
## Warning: NAs introduced by coercion
summary(al$Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 54000 78000 96000 106274 120000 215000
summary(al$SquareFeet)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 837 1280 1549 1654 1894 3750
summary(al$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 5.75 13.00 14.97 19.25 53.00 49
summary(al$NumberFeatures)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 3.00 4.00 3.53 4.00 8.00
summary(al$Northeast)
## Length Class Mode
## 117 character character
summary(al$CustomBuild)
## Length Class Mode
## 117 character character
summary(al$CornerLot)
## Length Class Mode
## 117 character character
The qqplot compares the sorted data values to evenly spaced percentiles of the normal distribution. If the points fall close to a straight line, the normality assumption is reasonable. You should not overinterpret minor deviations from linearity. A large deviation from linearity is an indication that the normality assumption may be questionable.
You can read more about this in my 2009 blog entry about normal probability plots.
qqnorm(al$Price)
Figure 1. Normal probability plot for house prices.
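If you want to see what "evenly spaced percentiles" means in practice, here is a minimal sketch that builds the plot coordinates by hand and checks them against qqnorm. It uses simulated data (the vector x is an assumption for illustration, not the housing data) so that it runs without downloading anything.

```r
# Simulated values stand in for al$Price so the example runs offline.
set.seed(1)
x <- rnorm(50, mean = 100, sd = 15)

# Evenly spaced probabilities, one for each observation.
p <- ppoints(length(x))

# The matching percentiles of the standard normal distribution.
theoretical <- qnorm(p)

# Plot the sorted data against those percentiles.
plot(theoretical, sort(x),
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# qqnorm() produces the same coordinates (returned in the original data order).
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), sort(x))
```

Both all.equal calls should return TRUE, which shows that the hand-built plot and the qqnorm plot are the same picture.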
In a linear model, the critical assumption is NOT that the predictor (independent) variables are normally distributed. It is NOT that the outcome (dependent) variable is normally distributed. It is that the residuals are normally distributed.
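A small simulation can make this distinction concrete. The sketch below is an illustration with made-up data (the names group, y, and fit are hypothetical and have nothing to do with the housing data). The outcome is a mixture of two well-separated groups, so it is clearly not normal, but the residuals from a model that includes the group indicator are normal by construction.

```r
# Simulated data: two groups whose means differ by 20, with normal noise (sd = 2).
set.seed(2)
n <- 200
group <- rep(c(0, 1), each = n / 2)
y <- 10 + 20 * group + rnorm(n, sd = 2)

# A linear model that accounts for the group difference.
fit <- lm(y ~ group)

# Side by side: the outcome is bimodal, the residuals are not.
par(mfrow = c(1, 2))
qqnorm(y, main = "Outcome")
qqnorm(resid(fit), main = "Residuals")
par(mfrow = c(1, 1))
```

The plot of the outcome should show a sharp jump in the middle, while the plot of the residuals should be close to a straight line.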
al.model <- lm(Price~SquareFeet+CustomBuild,data=al)
summary(al.model)
##
## Call:
## lm(formula = Price ~ SquareFeet + CustomBuild, data = al)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103813 -9596 738 8784 67151
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11413.482 6549.924 1.743 0.08411 .
## SquareFeet 55.364 4.123 13.428 < 2e-16 ***
## CustomBuildYes 14285.993 5103.019 2.800 0.00601 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19860 on 114 degrees of freedom
## Multiple R-squared: 0.7321, Adjusted R-squared: 0.7274
## F-statistic: 155.8 on 2 and 114 DF, p-value: < 2.2e-16
qqnorm(resid(al.model))
Figure 2. Normal probability plot of residuals.
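One way to judge whether the deviations in a plot like Figure 2 are minor or large is to look at a few plots of data that you know are normal. The sketch below simulates four normal samples using the sample size (117) and the residual standard error (19860) from the model output above; the simulated values themselves are not real data. It also adds a reference line with qqline, which is not used elsewhere on this page.

```r
# n and sigma come from the model output; everything else is simulated.
set.seed(3)
n <- 117
sigma <- 19860

# Four normal probability plots of truly normal data.
par(mfrow = c(2, 2))
for (i in 1:4) {
  sim <- rnorm(n, mean = 0, sd = sigma)
  qqnorm(sim, main = paste("Simulated normal sample", i))
  qqline(sim)  # reference line through the first and third quartiles
}
par(mfrow = c(1, 1))
```

If the plot of your residuals strays from the line much more than these simulated plots do, that is a sign that the deviation is more than random wobble.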
You can find an earlier version of this page on my old website.