# Examining relationships in R

## 2015-04-03

I’m giving a talk for the Kansas City R Users Group on how to get a preliminary impression of relationships between pairs of variables. Here is the R code and output that I will use.

### Simple measures of association

There are several ways to describe bivariate relationships before you begin a formal data analysis. The methods fall largely into three groups: measures of the relationship between two continuous variables, between two categorical variables, and between a categorical variable and a continuous variable.

suppressMessages(suppressWarnings(library(tidyverse)))
fn <- "https://raw.githubusercontent.com/pmean/introduction-to-SAS/master/data/housing.txt"
home <- read_tsv(fn, na=".")
## Rows: 117 Columns: 7
## -- Column specification --------------------------------------------------------
## Delimiter: "\t"
## dbl (7): price, sqft, age, features, northeast, custom_build, corner_lot
## i Use spec() to retrieve the full column specification for this data.
## i Specify the column types or set show_col_types = FALSE to quiet this message.
home
## # A tibble: 117 x 7
##     price  sqft   age features northeast custom_build corner_lot
##     <dbl> <dbl> <dbl>    <dbl>     <dbl>        <dbl>      <dbl>
##  1 205000  2650    13        7         1            1          0
##  2 208000  2600    NA        4         1            1          0
##  3 215000  2664     6        5         1            1          0
##  4 215000  2921     3        6         1            1          0
##  5 199900  2580     4        4         1            1          0
##  6 190000  2580     4        4         1            0          0
##  7 180000  2774     2        4         1            0          0
##  8 156000  1920     1        5         1            1          0
##  9 145000  2150    NA        4         1            0          0
## 10 144900  1710     1        3         1            1          0
## # ... with 107 more rows

The best graphical summary of two continuous variables is a scatterplot. You should add a smoothing curve or spline to the graph to emphasize the general trend and any departures from linearity.

plot(home$sqft,home$price)
lines(lowess(home$price~home$sqft))
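
To see why the smoothing curve matters, here is a minimal sketch on simulated data (not the housing file; the variable names `x`, `y`, `smooth`, and `linear` are made up for illustration). The trend is deliberately curved, so the lowess curve should bend while the straight-line fit hides the pattern.

```r
# Simulated data (an assumption for illustration, not the housing data):
# a U-shaped trend plus random noise.
set.seed(42)
x <- runif(100, min = 0, max = 10)
y <- 2 + 0.5 * (x - 5)^2 + rnorm(100, sd = 2)

# lowess() returns a list with components x (sorted) and y (smoothed values).
smooth <- lowess(x, y)

# A straight-line fit for comparison.
linear <- lm(y ~ x)

plot(x, y)
lines(smooth, col = "blue")           # follows the curvature
abline(linear, col = "red", lty = 2)  # misses the curvature
```

If the two lines roughly agree, a linear summary such as a correlation is probably reasonable. If they diverge, as they should here, the correlation will understate the relationship.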

sb <- is.finite(home$age)
plot(home$age[sb],home$price[sb])
lines(lowess(home$price[sb]~home$age[sb]))

The best numeric summary of two continuous variables is a correlation coefficient.

cor(home[,c("price","sqft","age")],use="pairwise.complete.obs")
##            price        sqft         age
## price  1.0000000  0.84479510 -0.16867888
## sqft   0.8447951  1.00000000 -0.03965489
## age   -0.1686789 -0.03965489  1.00000000

Correlations should always be rounded to two decimal places, or maybe even just one.

round(cor(home[,c("price","sqft","age")],use="pairwise.complete.obs"),1)
##       price sqft  age
## price   1.0  0.8 -0.2
## sqft    0.8  1.0  0.0
## age    -0.2  0.0  1.0

Anything larger than 0.7 or smaller than -0.7 is a strong linear relationship. Anything between 0.3 and 0.7 or between -0.7 and -0.3 is a weak linear relationship. Anything between -0.3 and 0.3 represents little or no linear relationship.

The best graphical summary of the relationship between a continuous variable and a categorical variable is a boxplot.

boxplot(home$price~home$features)
boxplot(home$price~home$northeast)
boxplot(home$price~home$custom_build)
boxplot(home$price~home$corner_lot)

If your categorical variable is binary, you can also use a scatterplot. The binary variable goes on the y axis, and a trend line is critical.

plot(home$price,home$northeast)
lines(lowess(home$northeast~home$price))
plot(home$price,home$custom_build)
lines(lowess(home$custom_build~home$price))
plot(home$price,home$corner_lot)
lines(lowess(home$corner_lot~home$price))

You can also compute a correlation between a binary variable and a continuous variable. It is equivalent to the point-biserial correlation.

round(cor(home[,c("northeast","custom_build","corner_lot")],home$price),1)
##              [,1]
## northeast     0.2
## custom_build  0.6
## corner_lot   -0.1
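
The equivalence with the point-biserial correlation can be checked numerically. The sketch below uses simulated data (the names `group` and `value` and the simulation settings are assumptions, not part of the housing data) and compares `cor()` against the textbook point-biserial formula, written here with the sample standard deviation that `sd()` returns.

```r
# Simulated data (an assumption for illustration): a 0/1 indicator and a
# continuous variable whose mean shifts with the indicator.
set.seed(1)
n <- 200
group <- rbinom(n, size = 1, prob = 0.4)
value <- 100 + 15 * group + rnorm(n, sd = 20)

# Ordinary Pearson correlation, treating the 0/1 codes as numbers.
r_pearson <- cor(group, value)

# Point-biserial formula: difference in group means, scaled by the overall
# standard deviation and the group sizes.
m1 <- mean(value[group == 1])
m0 <- mean(value[group == 0])
n1 <- sum(group == 1)
n0 <- sum(group == 0)
r_pb <- (m1 - m0) / sd(value) * sqrt(n1 * n0 / (n * (n - 1)))

print(c(pearson = r_pearson, point_biserial = r_pb))
```

The two numbers should agree up to floating-point error, so the sign of the correlation tells you which group has the larger mean.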

Let’s save the display of a relationship involving two categorical variables until another day.