Examining relationships in R

I’m giving a talk for the Kansas City R Users Group on how to get a preliminary impression of relationships between pairs of variables. Here is the R code and output that I will use.

Simple measures of association

There are several ways to describe bivariate relationships before you begin a formal data analysis. The methods fall into three broad groups: measures for two continuous variables, measures for two categorical variables, and measures for one categorical variable and one continuous variable.

suppressMessages(suppressWarnings(library(tidyverse)))
fn <- "https://raw.githubusercontent.com/pmean/introduction-to-SAS/master/data/housing.txt"
home <- read_tsv(fn, na=".")

## Rows: 117 Columns: 7
## -- Column specification ---------------------------------------------------
## Delimiter: "\t"
## dbl (7): price, sqft, age, features, northeast, custom_build, corner_lot
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

home

## # A tibble: 117 x 7
##     price  sqft   age features northeast custom_build corner_lot
##     <dbl> <dbl> <dbl>    <dbl>     <dbl>        <dbl>      <dbl>
##  1 205000  2650    13        7         1            1          0
##  2 208000  2600    NA        4         1            1          0
##  3 215000  2664     6        5         1            1          0
##  4 215000  2921     3        6         1            1          0
##  5 199900  2580     4        4         1            1          0
##  6 190000  2580     4        4         1            0          0
##  7 180000  2774     2        4         1            0          0
##  8 156000  1920     1        5         1            1          0
##  9 145000  2150    NA        4         1            0          0
## 10 144900  1710     1        3         1            1          0
## # ... with 107 more rows

The best graphical summary of two continuous variables is a scatterplot. You should add a smoothing curve or spline to the graph to emphasize the general trend and any departures from linearity.

plot(home$sqft,home$price)
lines(lowess(home$price~home$sqft))

plot of chunk relationships-in-r-02

sb <- is.finite(home$age)
plot(home$age[sb],home$price[sb])
lines(lowess(home$price[sb]~home$age[sb]))

plot of chunk relationships-in-r-03
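The two plots above repeat the same steps: keep the complete pairs, draw the points, and overlay a lowess curve. A minimal sketch of how that could be wrapped in one function follows. The name `plot_with_lowess` is hypothetical, and the built-in `airquality` data set stands in for the housing data only so that the example runs on its own.

```r
# Hypothetical helper: scatterplot plus lowess trend, using only complete pairs.
plot_with_lowess <- function(x, y, ...) {
  ok <- is.finite(x) & is.finite(y)   # drop NA, NaN, and Inf in either variable
  plot(x[ok], y[ok], ...)             # x on the horizontal axis, y on the vertical
  fit <- lowess(x[ok], y[ok])         # smooth y as a function of x
  lines(fit)
  invisible(fit)                      # return the fitted curve without printing it
}

# Stand-in data (an assumption): airquality has missing values in both columns.
fit <- plot_with_lowess(airquality$Solar.R, airquality$Ozone,
                        xlab = "Solar.R", ylab = "Ozone")
```

With the housing data, the equivalent call would be `plot_with_lowess(home$age, home$price)`.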

The best numeric summary of two continuous variables is a correlation coefficient.

cor(home[,c("price","sqft","age")],use="pairwise.complete.obs")

##            price        sqft         age
## price  1.0000000  0.84479510 -0.16867888
## sqft   0.8447951  1.00000000 -0.03965489
## age   -0.1686789 -0.03965489  1.00000000
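With `use="pairwise.complete.obs"`, each correlation is computed from the rows where both variables are present, so different cells of the matrix can rest on different sample sizes. The sketch below shows one way to count those pairs. It uses the built-in `airquality` data set as a stand-in, and the object names `vars`, `present`, and `n_pairs` are illustrative only.

```r
# Stand-in data (an assumption): airquality has missing values in two columns.
vars <- airquality[, c("Ozone", "Solar.R", "Wind")]

# 1 where a value is present, 0 where it is missing.
present <- (!is.na(as.matrix(vars))) + 0

# Entry [i, j] counts the rows where both variable i and variable j are present.
n_pairs <- crossprod(present)
n_pairs
```

For the housing data, the same idea applies to `home[,c("price","sqft","age")]`, where only `age` shows missing values in the rows printed above.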

Correlations should always be rounded to two or maybe even just one decimal place.

round(cor(home[,c("price","sqft","age")],use="pairwise.complete.obs"),1)

##       price sqft  age
## price   1.0  0.8 -0.2
## sqft    0.8  1.0  0.0
## age    -0.2  0.0  1.0

As a rough guide, a correlation larger than 0.7 or smaller than -0.7 indicates a strong linear relationship. A correlation between 0.3 and 0.7, or between -0.7 and -0.3, indicates a weak linear relationship. Anything between -0.3 and 0.3 represents little or no linear relationship.
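These cutoffs can be turned into labels with `cut()`. This is a minimal sketch: the function name `describe_r` is hypothetical, and putting values of exactly 0.3 or 0.7 in the lower category is an assumption, since the text does not say where the boundaries fall.

```r
# Hypothetical helper: label correlations using the cutoffs stated above.
# Assumption: values exactly at 0.3 or 0.7 fall into the lower category.
describe_r <- function(r) {
  labels <- c("little or none", "weak", "strong")
  as.character(cut(abs(r), breaks = c(0, 0.3, 0.7, 1),
                   labels = labels, include.lowest = TRUE))
}

# The three off-diagonal correlations from the matrix above.
describe_r(c(0.84, -0.17, -0.04))
```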

The best graphical summary of the relationship between a continuous variable and a categorical variable is a boxplot.

boxplot(home$price~home$features)

plot of chunk relationships-in-r-06

boxplot(home$price~home$northeast)

plot of chunk relationships-in-r-07

boxplot(home$price~home$custom_build)

plot of chunk relationships-in-r-08

boxplot(home$price~home$corner_lot)

plot of chunk relationships-in-r-09
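The four boxplots above differ only in the grouping variable, so they could also be drawn in a loop. The sketch below is one possible way to do that. The function name `boxplot_each` is hypothetical, and the built-in `mtcars` data set stands in for the housing data so that the example is self-contained.

```r
# Hypothetical helper: one boxplot of the outcome for each grouping variable.
boxplot_each <- function(data, outcome, groups) {
  res <- lapply(groups, function(g) {
    # reformulate() builds the formula outcome ~ g from character strings.
    boxplot(reformulate(g, response = outcome), data = data,
            xlab = g, ylab = outcome)
  })
  names(res) <- groups
  invisible(res)   # boxplot() returns its summary statistics invisibly
}

# Stand-in data (an assumption): mpg by number of cylinders and by transmission.
res <- boxplot_each(mtcars, "mpg", c("cyl", "am"))
```

With the housing data, the call would be `boxplot_each(home, "price", c("features", "northeast", "custom_build", "corner_lot"))`.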

If your categorical variable is binary, you can also use a scatterplot. The binary variable goes on the y axis and a trend line is critical.

plot(home$price,home$northeast)
lines(lowess(home$northeast~home$price))

plot of chunk relationships-in-r-10

plot(home$price,home$custom_build)
lines(lowess(home$custom_build~home$price))

plot of chunk relationships-in-r-11

plot(home$price,home$corner_lot)
lines(lowess(home$corner_lot~home$price))

plot of chunk relationships-in-r-12

You can also compute a correlation between a binary variable and a continuous variable. When the binary variable is coded 0/1, the ordinary Pearson correlation is equivalent to the point-biserial correlation.

round(cor(home[,c("northeast","custom_build","corner_lot")],home$price),1)

##              [,1]
## northeast     0.2
## custom_build  0.6
## corner_lot   -0.1
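The equivalence can be checked numerically. The sketch below computes the point-biserial correlation from its textbook formula, the difference in group means divided by the standard deviation (with divisor n) and scaled by the square root of the product of the group proportions, and compares it with `cor()`. The built-in `mtcars` data set is a stand-in for the housing data, and all object names are illustrative.

```r
# Stand-in data (an assumption): am is a 0/1 variable, mpg is continuous.
y <- mtcars$mpg
g <- mtcars$am

n  <- length(y)
n1 <- sum(g == 1)
n0 <- sum(g == 0)
m1 <- mean(y[g == 1])                  # mean of y in the group coded 1
m0 <- mean(y[g == 0])                  # mean of y in the group coded 0
s_n <- sqrt(mean((y - mean(y))^2))     # standard deviation with divisor n

r_pb      <- (m1 - m0) / s_n * sqrt(n1 * n0 / n^2)   # point-biserial formula
r_pearson <- cor(g, y)                               # ordinary Pearson correlation

all.equal(r_pb, r_pearson)
```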

Let’s save the display of a relationship involving two categorical variables until another day.


Steve Simon

2015-04-03
