Earlier today I ran into a situation where I had to compare two data frames for some analysis I was doing. In particular I needed to identify the rows in data frame A which were not present in data frame B. I have used several different methods for this task in the past, but recently I have been using the anti_join function in the dplyr package.
I provided an example below using the sleep dataset.
library(dplyr)
set.seed(7)
sleep.A <- sample_n(sleep, 5 )
> sleep.A
extra group ID
20 3.4 2 10
8 0.8 1 8
3 -0.2 1 3
2 -1.6 1 2
4 -1.2 1 4
sleep.B <- sample_n(sleep, 5 )
> sleep.B
extra group ID
16 4.4 2 6
7 3.7 1 7
18 1.6 2 8
3 -0.2 1 3
8 0.8 1 8
I want to find all the rows of sleep.A not present in sleep.B based on the columns group and ID. The group and ID combinations present in rows 20, 2, and 4 of sleep.A are not present in sleep.B. Using anti_join we can confirm this as shown below.
> anti_join(sleep.A, sleep.B, by = c("group", "ID"))
extra group ID
1 -1.2 1 4
2 -1.6 1 2
3 3.4 2 10