Different creatures in the R zoo

Steve Simon

2022-04-14

R has a variety of ways of storing information. It is a hodge-podge of different things. I’ll call these “things” creatures, though that is probably not the right word. Let me review some of the more important creatures in R.

Scalars

A single number, string, or date is called a scalar.

secret_of_the_universe <- 42
on_first <- "Who"
y2k <- as.Date("2000-01-01")

Vectors

A combination of two or more scalars of the same type is a vector. Strictly speaking, a scalar is just a vector of length 1, and that view works for the most part: you can use a scalar in places where R is expecting a vector. Even so, most people reserve the term “vector” for anything with a length of 2 or more.
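Here is a quick check of the "vector of length 1" idea, using the scalar defined earlier (repeated so the snippet runs on its own).

```r
# A "scalar" is stored as a vector of length 1.
secret_of_the_universe <- 42
length(secret_of_the_universe)     # 1
is.vector(secret_of_the_universe)  # TRUE

# Square brackets work on it just as they do on a longer vector.
secret_of_the_universe[1]          # 42
```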

You use the c function to combine two or more scalars into a vector.

buckle_my_shoe <- c(1, 2)
vowels <- c("A", "E", "I", "O", "U")

You can use the colon to create a vector that is a sequence of consecutive integers.

fantastic_four <- 1:4
fantastic_four
## [1] 1 2 3 4
not_so_fantastic_four <- 5:8
not_so_fantastic_four
## [1] 5 6 7 8

You can use the length function to calculate the length of a vector.

length(vowels)
## [1] 5

You get individual elements of a vector using square brackets.

vowels[1]
## [1] "A"

You can also use square brackets to extract chunks from a vector.

vowels[1:2]
## [1] "A" "E"
vowels[c(2, 4)]
## [1] "E" "O"

You can use c to make a longer vector out of two shorter vectors.

c(buckle_my_shoe, fantastic_four)
## [1] 1 2 1 2 3 4

Named vectors

You can assign names to some or all of the individual elements in a vector. This is not done too frequently, but it does have its uses.

primes <- c(
  "first"=2,
  "second"=3,
  "third"=5,
  "fourth"=7)
primes
##  first second  third fourth 
##      2      3      5      7

You can extract individual elements of a named vector using the names instead of their numeric positions.

primes["first"]
## first 
##     2
primes[c("second", "fourth")]
## second fourth 
##      3      7

You list the names of a named vector using the names function.

names(primes)
## [1] "first"  "second" "third"  "fourth"

Lists

You can’t mix and match within a vector. If you try to combine a number and a string, for example, R will convert the number to a string first.

mixed_vector <- c(1, "two")
mixed_vector
## [1] "1"   "two"
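You can confirm the conversion with the class function. The second example, which is my own addition, shows that the same thing happens when you mix a logical value with a number: R converts everything to the more general type.

```r
# The number 1 was converted to the string "1".
mixed_vector <- c(1, "two")
class(mixed_vector)        # "character"

# TRUE is converted to the number 1.
logical_and_number <- c(TRUE, 2)
class(logical_and_number)  # "numeric"
logical_and_number         # 1 2
```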

You can mix numbers and strings in a list, which is created using the list function.

mixed_list <- list(1, "two")
mixed_list
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "two"

Notice the odd collection of double and single brackets in the list. Why R uses single brackets in some places and double brackets in other locations is something that can confound even an experienced R programmer. The only important thing to remember as a beginner is that double brackets allow you to select individual items from a list.

mixed_list[[2]]
## [1] "two"
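If you are curious about the difference, here is a small sketch. Single brackets on a list give you back a (shorter) list, while double brackets give you the item itself. The variable names with_single and with_double are just labels chosen for this illustration.

```r
mixed_list <- list(1, "two")

with_single <- mixed_list[2]    # a list of length 1 that contains "two"
with_double <- mixed_list[[2]]  # the string "two" itself

class(with_single)   # "list"
length(with_single)  # 1
class(with_double)   # "character"
```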

You can create lists that are a mix of some of the creatures shown below. You can even make lists of lists.

groceries <- list("bread", "fruit")
clothes <- list("socks", "underwear", "shirts")
shopping <- list(groceries, clothes)
shopping
## [[1]]
## [[1]][[1]]
## [1] "bread"
## 
## [[1]][[2]]
## [1] "fruit"
## 
## 
## [[2]]
## [[2]][[1]]
## [1] "socks"
## 
## [[2]][[2]]
## [1] "underwear"
## 
## [[2]][[3]]
## [1] "shirts"

This can get pretty messy. You can recast all the individual components into a single vector using the unlist function.

unlist(shopping)
## [1] "bread"     "fruit"     "socks"     "underwear" "shirts"

Of course, you lose the ability to mix and match once you unlist.
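A minimal sketch of what that loss looks like, using the mixed list from earlier (repeated here so the snippet runs by itself): the number comes back as a string.

```r
mixed_list <- list(1, "two")

# unlist produces a vector, so every element must share one type.
flattened <- unlist(mixed_list)
flattened         # "1" "two"
class(flattened)  # "character"
```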

You can (and probably should) use names for each element of your list. This makes it easier to pick out pieces of a list. Either of the first two statements below will add names to the list.

names(shopping) <- c("grocery_list", "clothing_list")
shopping <- list(grocery_list=groceries, clothing_list=clothes)
shopping
## $grocery_list
## $grocery_list[[1]]
## [1] "bread"
## 
## $grocery_list[[2]]
## [1] "fruit"
## 
## 
## $clothing_list
## $clothing_list[[1]]
## [1] "socks"
## 
## $clothing_list[[2]]
## [1] "underwear"
## 
## $clothing_list[[3]]
## [1] "shirts"
shopping[["clothing_list"]]
## [[1]]
## [1] "socks"
## 
## [[2]]
## [1] "underwear"
## 
## [[3]]
## [1] "shirts"

If the list has names, you can use the dollar sign as a substitute for the double square brackets.

shopping$clothing_list
## [[1]]
## [1] "socks"
## 
## [[2]]
## [1] "underwear"
## 
## [[3]]
## [1] "shirts"

Matrices and arrays

A matrix is a two-dimensional arrangement of scalars of the same type.

diagonal_matrix <- matrix(c(1, 0, 0, 0, 1, 0, 0, 0, 1), nrow=3)
diagonal_matrix
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

You can get the number of rows and columns of a matrix using the dim function.

dim(diagonal_matrix)
## [1] 3 3

Use single square brackets to extract individual elements or chunks of a matrix.

diagonal_matrix[1, 3]
## [1] 0
diagonal_matrix[1:2, 1:2]
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

To extract an entire row or column from a matrix, leave the index blank.

diagonal_matrix[1, ]
## [1] 1 0 0
diagonal_matrix[ , 3]
## [1] 0 0 1

You can assign names to the rows and/or columns of a matrix using the dimnames function.

dimnames(diagonal_matrix) <- list(c("A", "B", "C"), c("D", "E", "F"))
diagonal_matrix
##   D E F
## A 1 0 0
## B 0 1 0
## C 0 0 1

Arrays are arrangements of scalars of the same type in three or more dimensions. They work pretty much the same way that matrices work, so I won’t show any examples here.

You can combine vectors into a matrix using the cbind or rbind functions. Notice the difference in how the two functions work.

cbind(fantastic_four, not_so_fantastic_four)
##      fantastic_four not_so_fantastic_four
## [1,]              1                     5
## [2,]              2                     6
## [3,]              3                     7
## [4,]              4                     8
rbind(fantastic_four, not_so_fantastic_four)
##                       [,1] [,2] [,3] [,4]
## fantastic_four           1    2    3    4
## not_so_fantastic_four    5    6    7    8
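One way to see the difference is to compare dimensions. This is a small sketch; by_column and by_row are names I made up for it.

```r
fantastic_four <- 1:4
not_so_fantastic_four <- 5:8

# cbind treats each vector as a column; rbind treats each as a row.
by_column <- cbind(fantastic_four, not_so_fantastic_four)
by_row <- rbind(fantastic_four, not_so_fantastic_four)

dim(by_column)  # 4 2
dim(by_row)     # 2 4

# Transposing one gives you the other.
all.equal(t(by_column), by_row)
```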

Data frames

A data frame is a list of vectors, where each vector can be a different type, but they all have to have the same length.

It is a compromise between the simple but rigid matrix, and the complex but flexible list. The data frame is a very popular approach for storing datasets.

The data shown here is real, and it comes from the Data and Story Library. I added some fictional names to illustrate how you can mix and match vectors of different types.

chain <- c(
  "Burger Queen",
  "McBurger",
  "Hardly's",
  "Out-n-In",
  "What? A Burger?",
  "Burger Bell",
  "Hedgehog")

fat <- c(
  19,
  31,
  34,
  35,
  39,
  39,
  43)

sodium <- c(
  920,
  1500,
  1310,
  860,
  1180,
  940,
  1260)

calories <- c(
  410,
  580,
  590,
  570,
  640,
  680,
  660)

fast_food <- data.frame(chain, fat, sodium, calories)
fast_food
##             chain fat sodium calories
## 1    Burger Queen  19    920      410
## 2        McBurger  31   1500      580
## 3        Hardly's  34   1310      590
## 4        Out-n-In  35    860      570
## 5 What? A Burger?  39   1180      640
## 6     Burger Bell  39    940      680
## 7        Hedgehog  43   1260      660

Notice how R created names for each column. It also allows you to create names for each row. Here’s what it would look like if you used chain as names for the rows.

fast_food_alt <- data.frame(fat, sodium, calories)
rownames(fast_food_alt) <- chain
fast_food_alt
##                 fat sodium calories
## Burger Queen     19    920      410
## McBurger         31   1500      580
## Hardly's         34   1310      590
## Out-n-In         35    860      570
## What? A Burger?  39   1180      640
## Burger Bell      39    940      680
## Hedgehog         43   1260      660

The current thinking among R professionals is that a variable like chain belongs as a direct part of the data frame rather than as a row name, but both approaches are still common.

You can select individual columns of a data frame using numbers or names.

fast_food[, 2]
## [1] 19 31 34 35 39 39 43
fast_food[, "fat"]
## [1] 19 31 34 35 39 39 43
fast_food$fat
## [1] 19 31 34 35 39 39 43

The last approach, using the dollar sign, is the most common way to work with a single column of a data frame.
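Because a data frame is a list of vectors, the list tools from earlier work on it too. Here is a minimal sketch using a tiny data frame that I made up just for this purpose (pets is not part of the fast food data).

```r
# Hypothetical data, invented for this illustration.
pets <- data.frame(
  name = c("Rex", "Tom"),
  legs = c(4, 4),
  stringsAsFactors = FALSE)

is.list(pets)        # TRUE
length(pets)         # 2, the number of columns
sapply(pets, class)  # the type of each column
pets[["legs"]]       # double brackets pull out one column, as with a list
```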

Tibbles

The programming team working on the tidyverse has developed an alternative to the data frame called a tibble. You use the tibble function to create a tibble. But first you have to load the tidyverse packages (or just the tidyr package by itself, which re-exports the tibble function from the tibble package).

library(tidyr)
fast_food_tibble <- tibble(chain, fat, sodium, calories)

It works mostly like a data frame but with a few important differences.

First, a tibble describes itself as a tibble when you print it, and it lets you know what type of data is in each column. You can see both in the printed output later in this section.

Second, when you print a tibble, it refrains from presenting all of the data for anything except the most modest-sized of datasets.

This makes no difference for something as small as the fast food data, but it does become important when you work with data with more than just a few rows or more than just a few columns.

There is a dataset in the survival package about colon cancer that has 1,858 observations (rows) and 16 variables (columns). Notice how only a portion of the data is shown.

library(survival)
tibble(colon)
## # A tibble: 1,858 x 16
##       id study rx        sex   age obstruct perfor adhere nodes status differ
##    <dbl> <dbl> <fct>   <dbl> <dbl>    <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
##  1     1     1 Lev+5FU     1    43        0      0      0     5      1      2
##  2     1     1 Lev+5FU     1    43        0      0      0     5      1      2
##  3     2     1 Lev+5FU     1    63        0      0      0     1      0      2
##  4     2     1 Lev+5FU     1    63        0      0      0     1      0      2
##  5     3     1 Obs         0    71        0      0      1     7      1      2
##  6     3     1 Obs         0    71        0      0      1     7      1      2
##  7     4     1 Lev+5FU     0    66        1      0      0     6      1      2
##  8     4     1 Lev+5FU     0    66        1      0      0     6      1      2
##  9     5     1 Obs         1    69        0      0      0    22      1      2
## 10     5     1 Obs         1    69        0      0      0    22      1      2
## # ... with 1,848 more rows, and 5 more variables: extent <dbl>, surg <dbl>,
## #   node4 <dbl>, time <dbl>, etype <dbl>

You can ask a tibble to display more or all of itself.

print(tibble(colon), n=25)
## # A tibble: 1,858 x 16
##       id study rx        sex   age obstruct perfor adhere nodes status differ
##    <dbl> <dbl> <fct>   <dbl> <dbl>    <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
##  1     1     1 Lev+5FU     1    43        0      0      0     5      1      2
##  2     1     1 Lev+5FU     1    43        0      0      0     5      1      2
##  3     2     1 Lev+5FU     1    63        0      0      0     1      0      2
##  4     2     1 Lev+5FU     1    63        0      0      0     1      0      2
##  5     3     1 Obs         0    71        0      0      1     7      1      2
##  6     3     1 Obs         0    71        0      0      1     7      1      2
##  7     4     1 Lev+5FU     0    66        1      0      0     6      1      2
##  8     4     1 Lev+5FU     0    66        1      0      0     6      1      2
##  9     5     1 Obs         1    69        0      0      0    22      1      2
## 10     5     1 Obs         1    69        0      0      0    22      1      2
## 11     6     1 Lev+5FU     0    57        0      0      0     9      1      2
## 12     6     1 Lev+5FU     0    57        0      0      0     9      1      2
## 13     7     1 Lev         1    77        0      0      0     5      1      2
## 14     7     1 Lev         1    77        0      0      0     5      1      2
## 15     8     1 Obs         1    54        0      0      0     1      0      2
## 16     8     1 Obs         1    54        0      0      0     1      0      2
## 17     9     1 Lev         1    46        0      0      1     2      0      2
## 18     9     1 Lev         1    46        0      0      1     2      0      2
## 19    10     1 Lev+5FU     0    68        0      0      0     1      0      2
## 20    10     1 Lev+5FU     0    68        0      0      0     1      0      2
## 21    11     1 Lev         0    47        0      0      1     1      0      2
## 22    11     1 Lev         0    47        0      0      1     1      0      2
## 23    12     1 Lev+5FU     1    52        0      0      0     2      0      3
## 24    12     1 Lev+5FU     1    52        0      0      0     2      0      3
## 25    13     1 Obs         1    64        0      0      0     1      1      2
## # ... with 1,833 more rows, and 5 more variables: extent <dbl>, surg <dbl>,
## #   node4 <dbl>, time <dbl>, etype <dbl>

Another important difference is in how tibbles handle subsetting. Look carefully at these examples.

fast_food[5, ]
##             chain fat sodium calories
## 5 What? A Burger?  39   1180      640
fast_food_tibble[5, ]
## # A tibble: 1 x 4
##   chain             fat sodium calories
##   <chr>           <dbl>  <dbl>    <dbl>
## 1 What? A Burger?    39   1180      640
fast_food[ , 2]
## [1] 19 31 34 35 39 39 43
fast_food_tibble[ , 2]
## # A tibble: 7 x 1
##     fat
##   <dbl>
## 1    19
## 2    31
## 3    34
## 4    35
## 5    39
## 6    39
## 7    43
fast_food[5, 2]
## [1] 39
fast_food_tibble[5, 2]
## # A tibble: 1 x 1
##     fat
##   <dbl>
## 1    39

When you extract a single row from a data frame, it just becomes a smaller data frame. Same for a tibble. When you extract a single row from a tibble, it just becomes a smaller tibble.

But notice how this changes when you extract a single column. With the data frame, the single column is displayed horizontally, because R has converted it to a vector. With the tibble, there is no conversion and a single column subset from a tibble is still a tibble.

When you extract a single value from a data frame, it displays it as a scalar. A single value from a tibble is still a tibble, albeit a very small tibble.

The point is that subsets of a data frame are sometimes implicitly converted to a vector or scalar and sometimes not. Programmers sometimes forget about the implicit conversion and treat any subset of a data frame as a smaller data frame. This is a common source of bugs in R programs. Tibbles force you to do the conversion explicitly, which may mean a bit of extra work, but the code, especially for complex operations, is more reliable.
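If you stay with plain data frames, you can switch off the implicit conversion yourself with the drop argument to the square brackets. This is a sketch in base R only, using the first three rows of the fat and calories columns; burgers, as_vector, and as_frame are names I chose for the illustration.

```r
burgers <- data.frame(
  fat = c(19, 31, 34),
  calories = c(410, 580, 590))

# Default behavior: a single column is converted to a vector.
as_vector <- burgers[ , 1]
is.data.frame(as_vector)  # FALSE

# With drop = FALSE, the result stays a data frame.
as_frame <- burgers[ , 1, drop = FALSE]
is.data.frame(as_frame)   # TRUE
dim(as_frame)             # 3 1
```

Going the other way, double brackets or the dollar sign will pull a column out as a vector.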

A practical illustration

The examples shown earlier are mostly just simple cute examples, but here is a practical problem that shows some of the various creatures in the R zoo.

Let’s use the data frame on fast food burgers. Pretend that you have discovered that the “Out-n-In” burger (in the fourth row) is actually made of tofu and should be discarded from the analysis.

real_fast_food <- fast_food[-4, ]
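The negative index means "everything except". It works the same way on a vector, as this small sketch with the vowels vector shows.

```r
vowels <- c("A", "E", "I", "O", "U")

# Drop the fourth element.
vowels[-4]        # "A" "E" "I" "U"

# Drop the first two elements.
vowels[-c(1, 2)]  # "I" "O" "U"
```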

You want to run a regression analysis and examine the various statistics from the output individually. You store the output in a new variable, fast_food_regression.

fast_food_regression <- lm(calories~fat+sodium, data=real_fast_food)

Since the output is complex and a mixture of different elements, R uses a list to store everything.

names(fast_food_regression)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

Let’s look at the first item in the list, coefficients.

fast_food_regression$coefficients
##  (Intercept)          fat       sodium 
## 2.126014e+02 1.111342e+01 8.636131e-04

This is a named vector with values for the intercept, the slope for the fat variable, and the slope for the sodium variable.

Now let’s look at the residuals and fitted values.

fast_food_regression$residuals
##          1          2          3          5          6          7 
## -14.550921  21.587129  -1.589048  -7.043884  33.163383 -31.566658
fast_food_regression$fitted.values
##        1        2        3        5        6        7 
## 424.5509 558.4129 591.5890 647.0439 646.8366 691.5667

These are both named vectors. The names are the row numbers from the original data frame, which is why there is no 4.

Let’s look at df.residual.

fast_food_regression$df.residual
## [1] 3

This is a scalar. There are 3 degrees of freedom for error.

Let’s look at qr. This will provide some rather complex details about some intermediate calculations that R used to produce this regression model.

fast_food_regression$qr
## $qr
##   (Intercept)         fat        sodium
## 1  -2.4494897 -83.6908995 -2902.6453452
## 2   0.4082483 -19.1006108  -128.5299211
## 3   0.4082483   0.2214651  -483.9732011
## 5   0.4082483   0.4832368    -0.3397222
## 6   0.4082483   0.4832368    -0.8356174
## 7   0.4082483   0.6926542    -0.3859595
## attr(,"assign")
## [1] 0 1 2
## 
## $qraux
## [1] 1.408248 1.064402 1.193307
## 
## $pivot
## [1] 1 2 3
## 
## $tol
## [1] 1e-07
## 
## $rank
## [1] 3
## 
## attr(,"class")
## [1] "qr"

This is a list inside the list. If you are not familiar with what the qr decomposition is or how it works, then this list has little value for you. For some advanced and specialized applications, knowledge of the qr decomposition is very important. Such information is very difficult to get from other statistical packages like SAS or SPSS.

The summary function produces another list from the list we just examined. This provides more information, such as numbers needed to reproduce some of the important tests and measures for this regression model.

fast_food_summary <- summary(fast_food_regression)
names(fast_food_summary)
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

The fourth item in this list is coefficients, and it is different from the coefficients item in the previous list.

fast_food_summary$coefficients
##                 Estimate  Std. Error    t value    Pr(>|t|)
## (Intercept) 2.126014e+02 82.70445317 2.57061605 0.082448966
## fat         1.111342e+01  1.66260089 6.68435897 0.006829065
## sodium      8.636131e-04  0.06341833 0.01361772 0.989989955

This is a matrix with names for each row and for each column. If you wanted to produce a nice table summarizing the results of the regression model, customized to your specifications, here is where you would start.

You might also be interested in some post-processing of the regression results. You can, for example, extract the p-values for the two slopes.

fast_food_summary$coefficients[c("fat", "sodium"), "Pr(>|t|)"]
##         fat      sodium 
## 0.006829065 0.989989955

You might need to do this if you wanted to make a Bonferroni correction to the p-values.
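Here is one way the correction might look, as a sketch rather than the only approach. The p-values are typed in from the output above so that the snippet runs by itself. The p.adjust function is part of base R, and the second calculation does the same thing by hand.

```r
# p-values copied from the regression output above.
p_values <- c(fat = 0.006829065, sodium = 0.989989955)

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
adjusted <- p.adjust(p_values, method = "bonferroni")
adjusted

# The same calculation by hand.
by_hand <- pmin(1, p_values * length(p_values))
by_hand
```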

Let’s look at r.squared and adj.r.squared.

fast_food_summary$r.squared
## [1] 0.9410402
fast_food_summary$adj.r.squared
## [1] 0.9017337

These are scalars. These are also key numbers for your custom regression summary.

Let’s look at fstatistic.

fast_food_summary$fstatistic
##    value    numdf    dendf 
## 23.94107  2.00000  3.00000

This is a named vector with the statistic itself and the numerator and denominator degrees of freedom. Yet another key piece.
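The list does not store a p-value for the overall F test, but you can compute one from these three numbers with the pf function. This is a sketch with the values typed in from the output above; f_stat and p_value are names I made up.

```r
# Values copied from the fstatistic output above.
f_stat <- c(value = 23.94107, numdf = 2, dendf = 3)

# Upper tail probability of the F distribution.
p_value <- pf(
  f_stat["value"],
  f_stat["numdf"],
  f_stat["dendf"],
  lower.tail = FALSE)
unname(p_value)
```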

There’s a lot more that you could explore in each of these lists. Some of it is only for advanced applications. I am including it here just to illustrate the wide range of creatures that R provides for its regression models.

Summary

There are many creatures in the R zoo: scalars, vectors, named vectors, lists, matrices, arrays, data frames, and tibbles are some of the more important ones. Vectors, matrices, and arrays require you to use the same type of data (number, string, or date) for each element. Lists and data frames allow you to mix and match different types of data. You got to see the inside of a couple of real lists used by R and view the different creatures found in each list.