I’m talking a bit about missing values in R this afternoon for the Kansas City R Users Group. Here is what I’ll be talking about.
When you import data into R, the first thing you should do is
KNOW YOUR COUNT
This means that you need to know how many rows are in your data, and for all of the key variables, how many missing values you have. In a nested or hierarchical model, you need to also know the counts at each level of nesting or each level of the hierarchy.
Here is a data set, freely available on the web, that shows sleep patterns and other statistics on various mammals. It has a handful of missing values and can help illustrate how you might tackle a new dataset. You can find a description of this data set in html format.
sl <- read.table(file="http://www.statsci.org/data/general/sleep.txt",header=TRUE)
dim(sl)
## [1] 62 11
names(sl)
## [1] "Species" "BodyWt" "BrainWt" "NonDreaming" "Dreaming"
## [6] "TotalSleep" "LifeSpan" "Gestation" "Predation" "Exposure"
## [11] "Danger"
This data set has 62 rows and 11 variables. Since the data set doesn’t have too many variables, let’s print out the first few and last few rows.
head(sl)
## Species BodyWt BrainWt NonDreaming Dreaming TotalSleep
## 1 Africanelephant 6654.000 5712.0 NA NA 3.3
## 2 Africangiantpouchedrat 1.000 6.6 6.3 2.0 8.3
## 3 ArcticFox 3.385 44.5 NA NA 12.5
## 4 Arcticgroundsquirrel 0.920 5.7 NA NA 16.5
## 5 Asianelephant 2547.000 4603.0 2.1 1.8 3.9
## 6 Baboon 10.550 179.5 9.1 0.7 9.8
## LifeSpan Gestation Predation Exposure Danger
## 1 38.6 645 3 5 3
## 2 4.5 42 3 1 3
## 3 14.0 60 1 1 1
## 4 NA 25 5 2 3
## 5 69.0 624 3 5 4
## 6 27.0 180 4 4 4
tail(sl)
## Species BodyWt BrainWt NonDreaming Dreaming TotalSleep
## 57 Tenrec 0.900 2.6 11.0 2.3 13.3
## 58 Treehyrax 2.000 12.3 4.9 0.5 5.4
## 59 Treeshrew 0.104 2.5 13.2 2.6 15.8
## 60 Vervet 4.190 58.0 9.7 0.6 10.3
## 61 Wateropossum 3.500 3.9 12.8 6.6 19.4
## 62 Yellow-belliedmarmot 4.050 17.0 NA NA NA
## LifeSpan Gestation Predation Exposure Danger
## 57 4.5 60 2 1 2
## 58 7.5 200 3 1 3
## 59 2.3 46 3 2 2
## 60 24.0 210 4 3 4
## 61 3.0 14 2 1 1
## 62 13.0 38 3 1 1
The summary function is useful, as it will display percentiles and the mean for numeric variables, but also (very importantly) display the number of missing values.
For factors, it will display a table of the first seven most frequently occuring values
summary(sl)
## Species BodyWt BrainWt NonDreaming
## Length:62 Min. : 0.005 Min. : 0.14 Min. : 2.100
## Class :character 1st Qu.: 0.600 1st Qu.: 4.25 1st Qu.: 6.250
## Mode :character Median : 3.342 Median : 17.25 Median : 8.350
## Mean : 198.790 Mean : 283.13 Mean : 8.673
## 3rd Qu.: 48.202 3rd Qu.: 166.00 3rd Qu.:11.000
## Max. :6654.000 Max. :5712.00 Max. :17.900
## NA's :14
## Dreaming TotalSleep LifeSpan Gestation
## Min. :0.000 Min. : 2.60 Min. : 2.000 Min. : 12.00
## 1st Qu.:0.900 1st Qu.: 8.05 1st Qu.: 6.625 1st Qu.: 35.75
## Median :1.800 Median :10.45 Median : 15.100 Median : 79.00
## Mean :1.972 Mean :10.53 Mean : 19.878 Mean :142.35
## 3rd Qu.:2.550 3rd Qu.:13.20 3rd Qu.: 27.750 3rd Qu.:207.50
## Max. :6.600 Max. :19.90 Max. :100.000 Max. :645.00
## NA's :12 NA's :4 NA's :4 NA's :4
## Predation Exposure Danger
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :2.000 Median :2.000
## Mean :2.871 Mean :2.419 Mean :2.613
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000
##
The first variable, Species, has 62 unique values. You can tell this by the fact that the first seven most frequently occurring values all had frequency 1. There are several other ways to check for uniqueness.
table(table(sl$Species))
##
## 1
## 62
length(unique(sl$Species))
## [1] 62
## [1] 62
The table function does not show NAs, at least with the default option
table(sl$Gestation)
##
## 12 14 16 17 19 21 21.5 25 28 30 31 33 35 38 42
## 1 1 1 1 1 1 1 1 2 2 1 1 1 1 3
## 45 46 50 52 60 63 68 90 100 112 115 120 140 148 150
## 1 1 1 1 2 3 1 1 1 1 1 2 1 1 1
## 151 164 170 180 200 210 225 230 252 267 281 310 336 365 392
## 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
## 400 440 624 645
## 1 1 1 1
You would notice this if you added a total to the end of the table.
addmargins(table(sl$Gestation))
##
## 12 14 16 17 19 21 21.5 25 28 30 31 33 35 38 42
## 1 1 1 1 1 1 1 1 2 2 1 1 1 1 3
## 45 46 50 52 60 63 68 90 100 112 115 120 140 148 150
## 1 1 1 1 2 3 1 1 1 1 1 2 1 1 1
## 151 164 170 180 200 210 225 230 252 267 281 310 336 365 392
## 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
## 400 440 624 645 Sum
## 1 1 1 1 58
The total is only 58, which is less than the magic number of 62. This means that there are 4 missing values.
You can also force the table to show counts for missing values.
table(sl$Gestation,useNA="always")
##
## 12 14 16 17 19 21 21.5 25 28 30 31 33 35 38 42
## 1 1 1 1 1 1 1 1 2 2 1 1 1 1 3
## 45 46 50 52 60 63 68 90 100 112 115 120 140 148 150
## 1 1 1 1 2 3 1 1 1 1 1 2 1 1 1
## 151 164 170 180 200 210 225 230 252 267 281 310 336 365 392
## 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
## 400 440 624 645 <NA>
## 1 1 1 1 4
The default option for useNA is “never” and a third option is “ifany”. Notice how “ifany” differs from “always”
table(sl$Gestation,useNA="ifany")
##
## 12 14 16 17 19 21 21.5 25 28 30 31 33 35 38 42
## 1 1 1 1 1 1 1 1 2 2 1 1 1 1 3
## 45 46 50 52 60 63 68 90 100 112 115 120 140 148 150
## 1 1 1 1 2 3 1 1 1 1 1 2 1 1 1
## 151 164 170 180 200 210 225 230 252 267 281 310 336 365 392
## 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
## 400 440 624 645 <NA>
## 1 1 1 1 4
table(sl$Danger,useNA="ifany")
##
## 1 2 3 4 5
## 19 14 10 10 9
table(sl$Danger,useNA="always")
##
## 1 2 3 4 5 <NA>
## 19 14 10 10 9 0
Many other functions in R offer multiple ways to handle NAs.
mean(sl$Gestation,na.rm=FALSE)
## [1] NA
mean(sl$Gestation,na.rm=TRUE)
## [1] 142.3534
mod1 <- lm(TotalSleep~Gestation,data=sl,na.action=na.omit)
summary(mod1)
##
## Call:
## lm(formula = TotalSleep ~ Gestation, data = sl, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5934 -2.6042 0.2316 1.4075 7.6317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.305786 0.688282 19.33 < 2e-16 ***
## Gestation -0.020749 0.003535 -5.87 3.08e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.619 on 52 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.3986, Adjusted R-squared: 0.387
## F-statistic: 34.46 on 1 and 52 DF, p-value: 3.083e-07
The alternate option for na.action in the lm function is na.fail, but I cannot run this directly in knitr, because the resulting error message causes knitr to stop.
Logic values involving NAs are a bit tricky. You cannot get at NAs directly using the == comparison operator.
which(sl$Gestation==NA)
## integer(0)
## integer(0)
What’s happening here is that most logic comparisons invloving NA will evaluate not to TRUE or FALSE, but rather to NA.
If you want to select missing values, you must use the is.na function.
which(is.na(sl$Gestation))
## [1] 13 19 20 56
## [1] 13 19 20 56
The is.finite function represents “opposite” function, though there are a few subtle differences.
which(is.finite(sl$Gestation))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 14 15 16 17 18 21 22 23 24 25 26
## [24] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## [47] 50 51 52 53 54 55 57 58 59 60 61 62
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 14 15 16 17 18 21 22 23 24 25 26
## [24] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## [47] 50 51 52 53 54 55 57 58 59 60 61 62
Logic involving missing values can seem at times a bit illogical.
sl$Gestation==NA
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [24] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [47] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [24] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [47] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
sl$Gestation==645
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE NA FALSE FALSE FALSE FALSE FALSE NA NA FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] NA FALSE FALSE FALSE FALSE FALSE FALSE
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE NA FALSE FALSE FALSE FALSE FALSE NA NA FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] NA FALSE FALSE FALSE FALSE FALSE FALSE
tst.na <- NA
tst.na==tst.na
## [1] NA
## [1] NA
Why would NA==NA not evaluate to TRUE? You could argue that a missing value might be 3 or 7 or 923.427, so you would never know if it equaled something else, even another missing value.
This produces a three valued logic system.
tst.tvl <- c(TRUE,FALSE,NA)
not.table <- !tst.tvl
names(not.table) <- tst.tvl
not.table
## TRUE FALSE <NA>
## FALSE TRUE NA
## TRUE FALSE <NA>
## FALSE TRUE NA
and.table <- outer(tst.tvl,tst.tvl,"&")
dimnames(and.table) <- list(tst.tvl,tst.tvl)
and.table
## TRUE FALSE <NA>
## TRUE TRUE FALSE NA
## FALSE FALSE FALSE FALSE
## <NA> NA FALSE NA
## TRUE FALSE <NA>
## TRUE TRUE FALSE NA
## FALSE FALSE FALSE FALSE
## <NA> NA FALSE NA
or.table <- outer(tst.tvl,tst.tvl,"|")
dimnames(or.table) <- list(tst.tvl,tst.tvl)
or.table
## TRUE FALSE <NA>
## TRUE TRUE TRUE TRUE
## FALSE TRUE FALSE NA
## <NA> TRUE NA NA
Other programs, such as SAS, use a two valued logic system. What SAS does is to set missing values equal to negative infinity. This seems simpler because it allows you to keep a two-valued logic system. But it has its own problems.
Note how R handles the less than comparison operator with missing values.
sl$Gestation<50
## [1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [12] FALSE NA FALSE TRUE TRUE TRUE FALSE NA NA FALSE FALSE
## [23] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
## [34] FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## [45] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] NA FALSE FALSE TRUE FALSE TRUE TRUE
In SAS, this comparison would evaluate to TRUE for the four missing values.
The only way to select missing values is with the is.na function.
There are three more values in R that are closely related to missing.
1/0
## [1] Inf
-1/0
## [1] -Inf
sqrt(-1)
## Warning in sqrt(-1): NaNs produced
## [1] NaN
The difference between Inf and NaN is in how comparisons are made.
Inf>3
## [1] TRUE
NaN>3
## [1] NA
Inf==Inf
## [1] TRUE
NaN==NaN
## [1] NA
R allows you to make a few computations using Inf, but others are not allowed.
Inf+1
## [1] Inf
1/Inf
## [1] 0
0*Inf
## [1] NaN
Inf-Inf
## [1] NaN
It’s also worth noting that you can force R to use complex numbers for the square root of -1, although this is not the default.
sqrt(as.complex(-1))
## [1] 0+1i
The is.na and is.finite functions are not perfect opposites.
tst.cases <- c(3,NA,NaN,Inf,-Inf)
is.na.table <- is.na(tst.cases)
names(is.na.table) <- tst.cases
is.na.table
## 3 <NA> NaN Inf -Inf
## FALSE TRUE TRUE FALSE FALSE
is.finite.table <- is.finite(tst.cases)
names(is.finite.table) <- tst.cases
is.finite.table
## 3 <NA> NaN Inf -Inf
## TRUE FALSE FALSE FALSE FALSE
You can find an earlier version of this page on my blog.