Identifying Non-Matching Rows Between Data Frames Using Dplyr

Earlier today I ran into a situation where I had to compare two data frames for some analysis I was doing. In particular I needed to identify the rows in data frame A which were not present in data frame B. I have used several different methods for this task in the past, but recently I have been using the anti_join function in the dplyr package.

I provided an example below using the sleep dataset.

library(dplyr)

set.seed(7)
sleep.A <- sample_n(sleep, 5 )
> sleep.A
   extra group ID
20   3.4     2 10
8    0.8     1  8
3   -0.2     1  3
2   -1.6     1  2
4   -1.2     1  4
sleep.B <- sample_n(sleep, 5 )
> sleep.B
   extra group ID
16   4.4     2  6
7    3.7     1  7
18   1.6     2  8
3   -0.2     1  3
8    0.8     1  8

I want to find all the rows of sleep.A not present in sleep.B based on the columns group and ID. The group and ID combinations present in rows 20, 2, and 4 of sleep.A are not present in sleep.B. Using anti_join we can confirm this as shown below.

> anti_join(sleep.A, sleep.B, by = c("group", "ID"))
  extra group ID
1  -1.2     1  4
2  -1.6     1  2
3   3.4     2 10

Mining the Details

blog

Identifying Non-Matching Rows Between Data Frames Using Dplyr