Merging in R

Steve Simon

2005-07-26

Categories: Blog post Tags: R software

Merging is tricky in any program, and you have to be sure that you know what you are doing. I have found that keeping an index variable will allow me to backtrack and find information in the original files if needed. Here's a simple example of how this would work.

> # randomly select 20 letters without replacement. > x0 <- sample(letters,20) > x1 <- sample(letters,20) > x0 [1] "k" "e" "f" "d" "c" "h" "z" "l" "w" "a" "p" "q" "y" [14] "x" "v" "t" "m" "r" "b" "n" > x1 [1] "w" "f" "g" "j" "d" "e" "o" "b" "c" "u" "v" "q" "x" [14] "i" "n" "z" "m" "t" "k" "l" > # The intersect functions lets us know which letters > # are in both x0 and x1. > intersect(x0,x1) [1] "k" "e" "f" "d" "c" "z" "l" "w" "q" "x" "v" "t" "m" [14] "b" "n" > # The setdiff function lets us know which letters > # are in x0 but not x1 (reverse the arguments to > # get letters in x1 but not in x1). > setdiff(x0,x1) [1] "h" "a" "p" "y" "r" > setdiff(x1,x0) [1] "g" "j" "o" "u" "i" > # By default, the merge function includes only those > # values in common with the two files. Note that the > # id1 and id2 values tells us that the first row of > # x2 comes from the 19th and 8th rows of x0 and x1, > # respectively. Notice also that the merged file is > # sorted by the intersection of the two files. > x2 <- merge(list(x=x0,id0=1:20),list(x=x1,id1=1:20)) > x2    x id0 id1  1 b  19   8  2 c   5   9  3 d   4   5  4 e   2   6  5 f   3   2  6 k   1  19  7 l   8  20  8 m  17  17  9 n  20  15 10 q  12  12 11 t  16  18 12 v  15  11 13 w   9   1 14 x  14  13 15 z   7  16 > # To include all values, including mismatches, specify > # all=T. Notice now that the merged file is sorted by the > # values of x0, followed by the sorted values of x1 that > # are not found in x0. An NA for an index value tells you > # that the letter is not found in the original x0 or x1 > # vector. > x3 <- merge(list(x=x0,id0=1:20),list(x=x1,id1=1:20),all=T) > x3    x id0 id1  1 a  10  NA  2 b  19   8  3 c   5   9  4 d   4   5  5 e   2   6  6 f   3   2  7 h   6  NA  8 k   1  19  9 l   8  20 10 m  17  17 11 n  20  15 12 p  11  NA 13 q  12  12 14 r  18  NA 15 t  16  18 16 v  15  11 17 w   9   1 18 x  14  13 19 y  13  NA 20 z   7  16 21 g  NA   3 22 i  NA  14 23 j  NA   4 24 o  NA   7 25 u  NA  10

Merging files when there are duplicate values for the key leads to some further trickiness.

> # Use the argument replace=T to get a few duplicate letters. > y0 <- sample(letters,20,replace=T) > y1 <- sample(letters,20,replace=T) > y0  [1] "r" "p" "d" "r" "l" "s" "t" "y" "z" "t" "a" "v" "m" "l" [15] "u" "o" "n" "m" "x" "g" > y1  [1] "z" "g" "n" "r" "y" "n" "m" "v" "q" "b" "p" "t" "a" "c" [15] "z" "l" "d" "x" "b" "r" > # For the letters "l" and "m" which appear twice in y0, > # each one is paired with the matching row in y1. For > # the letter "n" which appears twice in y1, each one > # is paired with the matching row in y0. For the letter > # "r" which appears twice in both y0 and y1, each row > # is paired with each other row, creating four rows in > # the new file. > y2 <- merge(list(y=y0,id0=1:20),list(y=y1,id1=1:20)) > y2    y id0 id1  1 a  11  13  2 d   3  17  3 g  20   2  4 l   5  16  5 l  14  16  6 m  13   7  7 m  18   7  8 n  17   3  9 n  17   6 10 p   2  11 11 r   1   4 12 r   4   4 13 r   1  20 14 r   4  20 15 t   7  12 16 t  10  12 17 v  12   8 18 x  19  18 19 y   8   5 20 z   9   1 21 z   9  15

The %in% operator is also useful.

> # the %in% operator compares each element of the first > # set against all the elements of the second set and > # returns TRUE if there is a match. > x0 %in% x1  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  [9]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE [17]  TRUE FALSE  TRUE  TRUE > # This function is not symmetric, of course, > x1 %in% x0  [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  [9]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE [17]  TRUE  TRUE  TRUE  TRUE > # Using the %in% function, you can re-create the > # intersect function, > x0[x0 %in% x1] [1] "k" "e" "f" "d" "c" "z" "l" "w" "q" "x" "v" "t" "m" [14] "b" "n" > # or the setdiff function, > x0[!x0 %in% x1] [1] "h" "a" "p" "y" "r"

A less intuitive version of the %in% operator is the match function. Additional information on merging can be found at

You can find an earlier version of this page on my original website.