Missing values in R talk

Steve Simon

2015-02-07

I’m talking a bit about missing values in R this afternoon for the Kansas City R Users Group. Here is what I’ll be talking about.

When you import data into R, the first thing you should do is

KNOW YOUR COUNT

This means that you need to know how many rows are in your data, and for all of the key variables, how many missing values you have. In a nested or hierarchical model, you need to also know the counts at each level of nesting or each level of the hierarchy.

Here is a data set, freely available on the web, that shows sleep patterns and other statistics on various mammals. It has a handful of missing values and can help illustrate how you might tackle a new dataset. You can find a description of this data set in html format.

sl <- read.table(file="http://www.statsci.org/data/general/sleep.txt",header=TRUE)
dim(sl)
## [1] 62 11
names(sl)
##  [1] "Species"     "BodyWt"      "BrainWt"     "NonDreaming" "Dreaming"   
##  [6] "TotalSleep"  "LifeSpan"    "Gestation"   "Predation"   "Exposure"   
## [11] "Danger"

This data set has 62 rows and 11 variables. Since the data set doesn’t have too many variables, let’s print out the first few and last few rows.

head(sl)
##                  Species   BodyWt BrainWt NonDreaming Dreaming TotalSleep
## 1        Africanelephant 6654.000  5712.0          NA       NA        3.3
## 2 Africangiantpouchedrat    1.000     6.6         6.3      2.0        8.3
## 3              ArcticFox    3.385    44.5          NA       NA       12.5
## 4   Arcticgroundsquirrel    0.920     5.7          NA       NA       16.5
## 5          Asianelephant 2547.000  4603.0         2.1      1.8        3.9
## 6                 Baboon   10.550   179.5         9.1      0.7        9.8
##   LifeSpan Gestation Predation Exposure Danger
## 1     38.6       645         3        5      3
## 2      4.5        42         3        1      3
## 3     14.0        60         1        1      1
## 4       NA        25         5        2      3
## 5     69.0       624         3        5      4
## 6     27.0       180         4        4      4
tail(sl)
##                 Species BodyWt BrainWt NonDreaming Dreaming TotalSleep
## 57               Tenrec  0.900     2.6        11.0      2.3       13.3
## 58            Treehyrax  2.000    12.3         4.9      0.5        5.4
## 59            Treeshrew  0.104     2.5        13.2      2.6       15.8
## 60               Vervet  4.190    58.0         9.7      0.6       10.3
## 61         Wateropossum  3.500     3.9        12.8      6.6       19.4
## 62 Yellow-belliedmarmot  4.050    17.0          NA       NA         NA
##    LifeSpan Gestation Predation Exposure Danger
## 57      4.5        60         2        1      2
## 58      7.5       200         3        1      3
## 59      2.3        46         3        2      2
## 60     24.0       210         4        3      4
## 61      3.0        14         2        1      1
## 62     13.0        38         3        1      1

The summary function is useful, as it will display percentiles and the mean for numeric variables, but also (very importantly) display the number of missing values.

For factors, it will display a table of the first seven most frequently occuring values

summary(sl)
##    Species              BodyWt            BrainWt         NonDreaming    
##  Length:62          Min.   :   0.005   Min.   :   0.14   Min.   : 2.100  
##  Class :character   1st Qu.:   0.600   1st Qu.:   4.25   1st Qu.: 6.250  
##  Mode  :character   Median :   3.342   Median :  17.25   Median : 8.350  
##                     Mean   : 198.790   Mean   : 283.13   Mean   : 8.673  
##                     3rd Qu.:  48.202   3rd Qu.: 166.00   3rd Qu.:11.000  
##                     Max.   :6654.000   Max.   :5712.00   Max.   :17.900  
##                                                          NA's   :14      
##     Dreaming       TotalSleep       LifeSpan         Gestation     
##  Min.   :0.000   Min.   : 2.60   Min.   :  2.000   Min.   : 12.00  
##  1st Qu.:0.900   1st Qu.: 8.05   1st Qu.:  6.625   1st Qu.: 35.75  
##  Median :1.800   Median :10.45   Median : 15.100   Median : 79.00  
##  Mean   :1.972   Mean   :10.53   Mean   : 19.878   Mean   :142.35  
##  3rd Qu.:2.550   3rd Qu.:13.20   3rd Qu.: 27.750   3rd Qu.:207.50  
##  Max.   :6.600   Max.   :19.90   Max.   :100.000   Max.   :645.00  
##  NA's   :12      NA's   :4       NA's   :4         NA's   :4       
##    Predation        Exposure         Danger     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :2.000   Median :2.000  
##  Mean   :2.871   Mean   :2.419   Mean   :2.613  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000  
## 

The first variable, Species, has 62 unique values. You can tell this by the fact that the first seven most frequently occurring values all had frequency 1. There are several other ways to check for uniqueness.

table(table(sl$Species))
## 
##  1 
## 62
length(unique(sl$Species))
## [1] 62
## [1] 62

The table function does not show NAs, at least with the default option

table(sl$Gestation)
## 
##   12   14   16   17   19   21 21.5   25   28   30   31   33   35   38   42 
##    1    1    1    1    1    1    1    1    2    2    1    1    1    1    3 
##   45   46   50   52   60   63   68   90  100  112  115  120  140  148  150 
##    1    1    1    1    2    3    1    1    1    1    1    2    1    1    1 
##  151  164  170  180  200  210  225  230  252  267  281  310  336  365  392 
##    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1 
##  400  440  624  645 
##    1    1    1    1

You would notice this if you added a total to the end of the table.

addmargins(table(sl$Gestation))
## 
##   12   14   16   17   19   21 21.5   25   28   30   31   33   35   38   42 
##    1    1    1    1    1    1    1    1    2    2    1    1    1    1    3 
##   45   46   50   52   60   63   68   90  100  112  115  120  140  148  150 
##    1    1    1    1    2    3    1    1    1    1    1    2    1    1    1 
##  151  164  170  180  200  210  225  230  252  267  281  310  336  365  392 
##    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1 
##  400  440  624  645  Sum 
##    1    1    1    1   58

The total is only 58, which is less than the magic number of 62. This means that there are 4 missing values.

You can also force the table to show counts for missing values.

table(sl$Gestation,useNA="always")
## 
##   12   14   16   17   19   21 21.5   25   28   30   31   33   35   38   42 
##    1    1    1    1    1    1    1    1    2    2    1    1    1    1    3 
##   45   46   50   52   60   63   68   90  100  112  115  120  140  148  150 
##    1    1    1    1    2    3    1    1    1    1    1    2    1    1    1 
##  151  164  170  180  200  210  225  230  252  267  281  310  336  365  392 
##    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1 
##  400  440  624  645 <NA> 
##    1    1    1    1    4

The default option for useNA is “never” and a third option is “ifany”. Notice how “ifany” differs from “always”

table(sl$Gestation,useNA="ifany")
## 
##   12   14   16   17   19   21 21.5   25   28   30   31   33   35   38   42 
##    1    1    1    1    1    1    1    1    2    2    1    1    1    1    3 
##   45   46   50   52   60   63   68   90  100  112  115  120  140  148  150 
##    1    1    1    1    2    3    1    1    1    1    1    2    1    1    1 
##  151  164  170  180  200  210  225  230  252  267  281  310  336  365  392 
##    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1 
##  400  440  624  645 <NA> 
##    1    1    1    1    4
table(sl$Danger,useNA="ifany")
## 
##  1  2  3  4  5 
## 19 14 10 10  9
table(sl$Danger,useNA="always")
## 
##    1    2    3    4    5 <NA> 
##   19   14   10   10    9    0

Many other functions in R offer multiple ways to handle NAs.

mean(sl$Gestation,na.rm=FALSE)
## [1] NA
mean(sl$Gestation,na.rm=TRUE)
## [1] 142.3534
mod1 <- lm(TotalSleep~Gestation,data=sl,na.action=na.omit)
summary(mod1)
## 
## Call:
## lm(formula = TotalSleep ~ Gestation, data = sl, na.action = na.omit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5934 -2.6042  0.2316  1.4075  7.6317 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.305786   0.688282   19.33  < 2e-16 ***
## Gestation   -0.020749   0.003535   -5.87 3.08e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.619 on 52 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.3986,	Adjusted R-squared:  0.387 
## F-statistic: 34.46 on 1 and 52 DF,  p-value: 3.083e-07

The alternate option for na.action in the lm function is na.fail, but I cannot run this directly in knitr, because the resulting error message causes knitr to stop.

Logic values involving NAs are a bit tricky. You cannot get at NAs directly using the == comparison operator.

which(sl$Gestation==NA)
## integer(0)
## integer(0)

What’s happening here is that most logic comparisons invloving NA will evaluate not to TRUE or FALSE, but rather to NA.

If you want to select missing values, you must use the is.na function.

which(is.na(sl$Gestation))
## [1] 13 19 20 56
## [1] 13 19 20 56

The is.finite function represents “opposite” function, though there are a few subtle differences.

which(is.finite(sl$Gestation))
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 14 15 16 17 18 21 22 23 24 25 26
## [24] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## [47] 50 51 52 53 54 55 57 58 59 60 61 62
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 14 15 16 17 18 21 22 23 24 25 26
## [24] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## [47] 50 51 52 53 54 55 57 58 59 60 61 62

Logic involving missing values can seem at times a bit illogical.

sl$Gestation==NA
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [24] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [47] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [24] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [47] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
sl$Gestation==645
##  [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE    NA FALSE FALSE FALSE FALSE FALSE    NA    NA FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56]    NA FALSE FALSE FALSE FALSE FALSE FALSE
##  [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE    NA FALSE FALSE FALSE FALSE FALSE    NA    NA FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56]    NA FALSE FALSE FALSE FALSE FALSE FALSE
tst.na <- NA
tst.na==tst.na
## [1] NA
## [1] NA

Why would NA==NA not evaluate to TRUE? You could argue that a missing value might be 3 or 7 or 923.427, so you would never know if it equaled something else, even another missing value.

This produces a three valued logic system.

tst.tvl <- c(TRUE,FALSE,NA)
not.table <- !tst.tvl
names(not.table) <- tst.tvl
not.table
##  TRUE FALSE  <NA> 
## FALSE  TRUE    NA
##  TRUE FALSE  <NA> 
## FALSE  TRUE    NA
and.table <- outer(tst.tvl,tst.tvl,"&")
dimnames(and.table) <- list(tst.tvl,tst.tvl)
and.table
##        TRUE FALSE  <NA>
## TRUE   TRUE FALSE    NA
## FALSE FALSE FALSE FALSE
## <NA>     NA FALSE    NA
##        TRUE FALSE  <NA>
## TRUE   TRUE FALSE    NA
## FALSE FALSE FALSE FALSE
## <NA>     NA FALSE    NA
or.table <- outer(tst.tvl,tst.tvl,"|")
dimnames(or.table) <- list(tst.tvl,tst.tvl)
or.table
##       TRUE FALSE <NA>
## TRUE  TRUE  TRUE TRUE
## FALSE TRUE FALSE   NA
## <NA>  TRUE    NA   NA

Other programs, such as SAS, use a two valued logic system. What SAS does is to set missing values equal to negative infinity. This seems simpler because it allows you to keep a two-valued logic system. But it has its own problems.

Note how R handles the less than comparison operator with missing values.

sl$Gestation<50
##  [1] FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [12] FALSE    NA FALSE  TRUE  TRUE  TRUE FALSE    NA    NA FALSE FALSE
## [23]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
## [34] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE
## [45] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56]    NA FALSE FALSE  TRUE FALSE  TRUE  TRUE

In SAS, this comparison would evaluate to TRUE for the four missing values.

The only way to select missing values is with the is.na function.

There are three more values in R that are closely related to missing.

1/0
## [1] Inf
-1/0
## [1] -Inf
sqrt(-1)
## Warning in sqrt(-1): NaNs produced
## [1] NaN

The difference between Inf and NaN is in how comparisons are made.

Inf>3
## [1] TRUE
NaN>3
## [1] NA
Inf==Inf
## [1] TRUE
NaN==NaN
## [1] NA

R allows you to make a few computations using Inf, but others are not allowed.

Inf+1
## [1] Inf
1/Inf
## [1] 0
0*Inf
## [1] NaN
Inf-Inf
## [1] NaN

It’s also worth noting that you can force R to use complex numbers for the square root of -1, although this is not the default.

sqrt(as.complex(-1))
## [1] 0+1i

The is.na and is.finite functions are not perfect opposites.

tst.cases <- c(3,NA,NaN,Inf,-Inf)
is.na.table <- is.na(tst.cases)
names(is.na.table) <- tst.cases
is.na.table
##     3  <NA>   NaN   Inf  -Inf 
## FALSE  TRUE  TRUE FALSE FALSE
is.finite.table <- is.finite(tst.cases)
names(is.finite.table) <- tst.cases
is.finite.table
##     3  <NA>   NaN   Inf  -Inf 
##  TRUE FALSE FALSE FALSE FALSE

You can find an earlier version of this page on my blog.