Negative filtering with dplyr

January 31, 2024 by Aaron Schiff

When doing data analysis, sometimes you want to filter out (exclude) rows that meet some conditions. With dplyr, one way to do this is to use filter() with a logical not operator. For example:

df |> filter(!((name == "John") & (daughter == "Jenny")))

This will filter out all the rows where name is John and daughter is Jenny. This is fine as long as both logical expressions only return either TRUE or FALSE for all rows, but there is another possibility that is always lurking: NA. With missing values we might run into trouble because filter() only returns rows where the filter condition is TRUE, but !NA is still NA, which is not the same as TRUE.

Suppose that in our dataset, every row has name but daughter is NA for some rows (some people don’t have a daughter). We would probably want these rows to be included in the results of our filter, because they don’t have daughter == "Jenny". However, if daughter is NA, then daughter == "Jenny" also returns NA, and our filter may not work properly.

There is an extra complication from evaluating the logical and operation inside filter(). In R, TRUE & NA returns NA, but FALSE & NA returns FALSE. This is because R treats NA as ‘unknown’ in logical operations. TRUE and unknown is unknown since the result depends on the unknown thing. But FALSE and unknown is FALSE, because regardless of whether the unknown thing is TRUE or FALSE, the result will be FALSE.

So what happens when we apply our filter when some rows are missing daughter? It’s easier to see with an example. Let’s say our data is this:

name	daughter
John	Jenny
John	Mary
John
Peter	Jenny
Peter	Alice
Peter

When we run the filter as above, we get:

name	daughter
John	Mary
Peter	Jenny
Peter	Alice
Peter

The John / NA row has been omitted, because for this row, the test !((name == "John") & (daughter == "Jenny")) returns NA. However, the Peter / NA row has been included. This is because although daughter == "Jenny" returns NA for this row, name == "John" returns FALSE, and for the reason explained above, !(FALSE & NA) evaluates as TRUE.

The moral of the story is, be careful with negative filtering! One way to fix the above example is to explicitly check for possible NA values:

df |> filter(!((name == "John") & (daughter == "Jenny") & !is.na(daughter)))

This works because if daughter is missing, !is.na(daughter) will return FALSE, and as above this will cause the logical and expression to return FALSE rather than NA.