I’m giving a talk for the Kansas City R Users Group on how to get a preliminary impression of relationships between pairs of variables. Here is the R code and output that I will use.
Simple measures of association
There are several ways to describe a bivariate relationship before you begin a formal data analysis. The methods fall into three groups: measures for two continuous variables, measures for two categorical variables, and measures for one categorical variable and one continuous variable.
suppressMessages(suppressWarnings(library(tidyverse)))
fn <- "https://raw.githubusercontent.com/pmean/introduction-to-SAS/master/data/housing.txt"
home <- read_tsv(fn, na=".")
## Rows: 117 Columns: 7
## -- Column specification --------------------------------------------------------
## Delimiter: "\t"
## dbl (7): price, sqft, age, features, northeast, custom_build, corner_lot
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
home
## # A tibble: 117 x 7
## price sqft age features northeast custom_build corner_lot
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 205000 2650 13 7 1 1 0
## 2 208000 2600 NA 4 1 1 0
## 3 215000 2664 6 5 1 1 0
## 4 215000 2921 3 6 1 1 0
## 5 199900 2580 4 4 1 1 0
## 6 190000 2580 4 4 1 0 0
## 7 180000 2774 2 4 1 0 0
## 8 156000 1920 1 5 1 1 0
## 9 145000 2150 NA 4 1 0 0
## 10 144900 1710 1 3 1 1 0
## # ... with 107 more rows
The best graphical summary of two continuous variables is a scatterplot. You should add a smoothing curve or spline to the graph to emphasize the general trend and any departures from linearity.
plot(home$sqft,home$price)
lines(lowess(home$price~home$sqft))
sb <- is.finite(home$age)
plot(home$age[sb],home$price[sb])
lines(lowess(home$price[sb]~home$age[sb]))
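How closely the smoothing curve follows the points depends on the span argument `f` of `lowess()`, which defaults to 2/3. The sketch below compares the default with a smaller span. It uses simulated stand-in data, not the housing data, so the variable names and numbers are illustrative assumptions only.

```r
# Simulated stand-in data (not the housing data): a curved trend plus noise.
set.seed(1)
x <- runif(100, 1000, 3000)
y <- 50000 + 0.02 * x^2 + rnorm(100, sd = 20000)

# lowess() returns a list with components x (sorted) and y (smoothed values).
smooth_default <- lowess(x, y)           # default span, f = 2/3
smooth_wiggly  <- lowess(x, y, f = 0.2)  # smaller span tracks the data more closely

plot(x, y)
lines(smooth_default, lwd = 2)  # smoother curve
lines(smooth_wiggly, lty = 2)   # more wiggly curve
```

A larger span gives a smoother curve that is easier to read, while a smaller span may reveal local bumps at the cost of chasing noise.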
The best numeric summary of two continuous variables is a correlation coefficient.
cor(home[,c("price","sqft","age")],use="pairwise.complete.obs")
## price sqft age
## price 1.0000000 0.84479510 -0.16867888
## sqft 0.8447951 1.00000000 -0.03965489
## age -0.1686789 -0.03965489 1.00000000
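The `use="pairwise.complete.obs"` argument matters here because `age` has missing values. A minimal sketch of the difference between pairwise and listwise deletion, using a tiny made-up data frame (not the housing data):

```r
# Tiny made-up data frame (not the housing data) with one missing value in c.
d <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 1, 4, 3, 5),
  c = c(5, NA, 3, 2, 1)
)

# Pairwise deletion: each correlation uses every row complete for that pair.
r_pairwise <- cor(d, use = "pairwise.complete.obs")

# Listwise deletion: every correlation uses only rows with no missing values.
r_listwise <- cor(d, use = "complete.obs")

# The a-b correlation uses 5 rows under pairwise deletion but 4 under listwise,
# so the two results differ even though neither a nor b has a missing value.
r_pairwise["a", "b"]
r_listwise["a", "b"]
```

Pairwise deletion keeps more data for each pair, but the entries of the matrix may then be based on different subsets of rows.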
Correlations should always be rounded to two or maybe even just one decimal place.
round(cor(home[,c("price","sqft","age")],use="pairwise.complete.obs"),1)
## price sqft age
## price 1.0 0.8 -0.2
## sqft 0.8 1.0 0.0
## age -0.2 0.0 1.0
Anything larger than 0.7 or smaller than -0.7 is a strong linear relationship. Anything between 0.3 and 0.7 or between -0.3 and -0.7 is a weak linear relationship. Anything between -0.3 and 0.3 represents little or no linear relationship.
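These cutoffs can be applied mechanically with `cut()`. The helper below is a hypothetical sketch (`describe_r` is an illustrative name, not part of any package). How it treats values of exactly 0.3 or 0.7 is an assumption, since the rule of thumb does not say which side they fall on.

```r
# Label correlations using the rule-of-thumb cutoffs described above.
# describe_r is a hypothetical helper name, not part of any package.
describe_r <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.3, 0.7, 1),
      labels = c("little or none", "weak", "strong"),
      include.lowest = TRUE)  # intervals are [0, 0.3], (0.3, 0.7], (0.7, 1]
}

# The three off-diagonal correlations from the housing output above.
describe_r(c(0.84, -0.17, -0.04))
```

The labels describe only the strength of a linear relationship; the sign of the correlation is dropped by `abs()`.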
The best graphical summary of the relationship between a continuous variable and a categorical variable is a boxplot.
boxplot(home$price~home$features)
boxplot(home$price~home$northeast)
boxplot(home$price~home$custom_build)
boxplot(home$price~home$corner_lot)
If your categorical variable is binary, you can also use a scatterplot. The binary variable goes on the y axis and a trend line is critical.
plot(home$price,home$northeast)
lines(lowess(home$northeast~home$price))
plot(home$price,home$custom_build)
lines(lowess(home$custom_build~home$price))
plot(home$price,home$corner_lot)
lines(lowess(home$corner_lot~home$price))
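The trend line is critical because the points can only sit at 0 or 1, so the raw scatter is hard to read. The smoothed curve roughly tracks the proportion of ones at each value of x. A minimal sketch with simulated data (not the housing data; the relationship between price and the binary variable is assumed for illustration), with a little vertical jitter added to reduce overplotting:

```r
# Simulated stand-in data (not the housing data).
set.seed(7)
price <- runif(200, 100000, 300000)

# Assumed for illustration: the chance of a 1 rises with price.
prob   <- plogis((price - 200000) / 40000)
custom <- rbinom(200, 1, prob)

# Jitter the 0/1 values slightly so overlapping points are visible.
plot(price, jitter(custom, amount = 0.05))

# Smooth the unjittered 0/1 values to approximate the local proportion of ones.
trend <- lowess(price, custom)
lines(trend)
```

Because `lowess()` fits local lines, the curve is only an approximation of a proportion and can stray slightly outside the 0 to 1 range near the edges.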
You can also compute a correlation between a binary variable and a continuous variable. It is equivalent to the point-biserial correlation.
round(cor(home[,c("northeast","custom_build","corner_lot")],home$price),1)
## [,1]
## northeast 0.2
## custom_build 0.6
## corner_lot -0.1
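The equivalence can be checked numerically. The sketch below compares `cor()` on a 0/1 variable with the point-biserial formula, which is the difference in group means divided by the population standard deviation, times the square root of p(1 - p). The data are simulated stand-ins, not the housing data.

```r
# Simulated stand-in data (not the housing data).
set.seed(42)
g <- rep(c(0, 1), times = c(30, 20))    # binary 0/1 variable
y <- 100 + 20 * g + rnorm(50, sd = 15)  # continuous variable

# Ordinary Pearson correlation with the 0/1 variable.
r_pearson <- cor(g, y)

# Point-biserial formula, using the population standard deviation
# (divisor n rather than n - 1).
p      <- mean(g)                       # proportion of ones
sd_pop <- sqrt(mean((y - mean(y))^2))
r_pb   <- (mean(y[g == 1]) - mean(y[g == 0])) / sd_pop * sqrt(p * (1 - p))

# The two values agree up to floating-point error.
all.equal(r_pearson, r_pb)
```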
Let’s save the display of a relationship involving two categorical variables until another day.