Merging in R

Steve Simon

2005-07-26

Merging is tricky in any program, and you have to be sure that you know what you are doing. I have found that keeping an index variable will allow me to backtrack and find information in the original files if needed. Here’s a simple example of how this would work.

# randomly select 20 letters without replacement.
x0 <- sample(letters,20)
x1 <- sample(letters,20)
x0
##  [1] "v" "i" "k" "w" "s" "n" "z" "t" "f" "x" "y" "o" "m" "g" "u" "b" "c"
## [18] "e" "a" "r"
x1
##  [1] "e" "h" "f" "m" "w" "o" "k" "q" "a" "p" "v" "d" "z" "x" "n" "j" "t"
## [18] "i" "u" "s"

The intersect functions lets us know which letters are in both x0 and x1.

intersect(x0,x1)
##  [1] "v" "i" "k" "w" "s" "n" "z" "t" "f" "x" "o" "m" "u" "e" "a"

The setdiff function lets us know which letters are in x0 but not x1 (reverse the arguments to get letters in x1 but not in x1).

setdiff(x0,x1)
## [1] "y" "g" "b" "c" "r"
setdiff(x1,x0)
## [1] "h" "q" "p" "d" "j"

By default, the merge function includes only those values in common with the two files. Note that the id1 and id2 values tells us that the first row of x2 comes from the 19th and 8th rows of x0 and x1, respectively. Notice also that the merged file is sorted by the intersection of the two files.

x2 <- merge(list(x=x0,id0=1:20),list(x=x1,id1=1:20))
x2
##    x id0 id1
## 1  a  19   9
## 2  e  18   1
## 3  f   9   3
## 4  i   2  18
## 5  k   3   7
## 6  m  13   4
## 7  n   6  15
## 8  o  12   6
## 9  s   5  20
## 10 t   8  17
## 11 u  15  19
## 12 v   1  11
## 13 w   4   5
## 14 x  10  14
## 15 z   7  13

To include all values, including mismatches, specify all=T. Notice now that the merged file is sorted by the values of x0, followed by the sorted values of x1 that are not found in x0. An NA for an index value tells you that the letter is not found in the original x0 or x1 vector.

x3 <- merge(list(x=x0,id0=1:20),list(x=x1,id1=1:20),all=T)
x3
##    x id0 id1
## 1  a  19   9
## 2  b  16  NA
## 3  c  17  NA
## 4  d  NA  12
## 5  e  18   1
## 6  f   9   3
## 7  g  14  NA
## 8  h  NA   2
## 9  i   2  18
## 10 j  NA  16
## 11 k   3   7
## 12 m  13   4
## 13 n   6  15
## 14 o  12   6
## 15 p  NA  10
## 16 q  NA   8
## 17 r  20  NA
## 18 s   5  20
## 19 t   8  17
## 20 u  15  19
## 21 v   1  11
## 22 w   4   5
## 23 x  10  14
## 24 y  11  NA
## 25 z   7  13

The %in% operator is also useful. The %in% operator compares each element of the first set against all the elements of the second set and returns TRUE if there is a match.

x0 %in% x1
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [12]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE

This function is not symmetric, of course,

x1 %in% x0
##  [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
## [12] FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE

Using the %in% function, you can re-create the intersect function,

x0[x0 %in% x1]
##  [1] "v" "i" "k" "w" "s" "n" "z" "t" "f" "x" "o" "m" "u" "e" "a"

or the setdiff function,

x0[!x0 %in% x1]
## [1] "y" "g" "b" "c" "r"

A less intuitive version of the %in% operator is the match function. Additional information on merging can be found at the UCLA Statistics website.

You can find an earlier version of this page on my original website.